5 tips for cleaning dirty data

VP, Technology and Solution Delivery
Valtech

November 15, 2013

With the focus being placed on using Big Data to drive insights, clean data is increasingly important. Learn five tips for cleaning dirty data.

Dirty data is a factor we must account for if we hope to reap the benefits of data-driven decision making. In a previous post I described a scenario where data from Sitecore DMS, the CRM and the finance system could be used to investigate shopping cart abandonment. Simple errors in the data, like duplicate order entries, can greatly skew the results of such an investigation. Given the growing number of data sources and the newfound focus on using them to develop and drive business and marketing strategy, the need for clean data is becoming increasingly pressing. While there are things you can do continuously to keep your data usable, such as active daily maintenance, this post covers five things to do immediately before any data analysis.
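To see how much a single duplicate can matter, consider a minimal T-SQL sketch of the abandonment calculation. The Carts and Orders tables and their columns are assumptions for illustration, not the actual schema from that scenario:

    -- Hypothetical tables: Carts(CartId, ...) holds every cart created;
    -- Orders(OrderId, CartId, ...) holds completed purchases.
    -- If the same order was loaded twice, a raw count overstates completions:
    SELECT COUNT(*) AS CompletedOrders FROM Orders;

    -- Counting distinct carts removes the inflation caused by duplicates:
    SELECT COUNT(DISTINCT CartId) AS CompletedCarts FROM Orders;

    -- Abandonment rate computed against the deduplicated counts:
    SELECT 1.0 - (SELECT COUNT(DISTINCT CartId) FROM Orders) * 1.0
                 / (SELECT COUNT(DISTINCT CartId) FROM Carts) AS AbandonmentRate;

With duplicated order rows, the first query reports more completions than actually happened, so an abandonment rate based on the raw count would look artificially low.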

Origins of dirty data

Before getting to methods for cleaning the data, it's important to know some of the most common quality issues in data systems:

  • Missing fields in customer records, such as zip codes or addresses
  • Multiple representations of the same data (one system stores the state/province in abbreviated form, such as ON, while another stores the full name, Ontario)
  • Values outside reasonable ranges, e.g. dates and currency amounts
  • The same customer has multiple entries in the CRM
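Most of these issues can be detected with simple profiling queries before any cleanup begins. A minimal T-SQL sketch, assuming hypothetical Customers and Orders tables (all table and column names here are illustrative):

    -- Flag customer records with missing zip codes or addresses:
    SELECT CustomerId
    FROM Customers
    WHERE PostalCode IS NULL OR Address IS NULL;

    -- Spot multiple representations of the same value:
    SELECT Province, COUNT(*) AS Rows
    FROM Customers
    GROUP BY Province
    ORDER BY Province;  -- 'ON' and 'Ontario' show up as separate groups

    -- Catch dates outside a reasonable range:
    SELECT OrderId
    FROM Orders
    WHERE OrderDate < '2000-01-01' OR OrderDate > GETDATE();

    -- Find customers entered more than once (matching on email here):
    SELECT Email, COUNT(*) AS Entries
    FROM Customers
    GROUP BY Email
    HAVING COUNT(*) > 1;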

The path to clean data

If you are dealing with relatively small data sets and a small number of sources, it may be possible for a developer to use scripts or ETL tools to create a uniform view of the data. To complete the ETL process you will need to do the following (a sketch of several of these steps appears after the list):

  • Identify and remove duplicates
  • Convert numbers to a consistent representation
  • Convert dates and times to a consistent representation
  • Remove case sensitivity, or make it consistent throughout
  • Normalize spelling to a single dictionary (e.g. U.S. English vs. British English)
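Here is a minimal T-SQL sketch of several of these steps run against a hypothetical staging table; Staging.Customers and its columns are assumptions for illustration, and a real ETL job would be tailored to your schema:

    -- Identify and remove duplicates, keeping one row per email address:
    WITH Ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Email
                                  ORDER BY CustomerId) AS rn
        FROM Staging.Customers
    )
    DELETE FROM Ranked WHERE rn > 1;

    -- Convert numbers to a consistent representation
    -- (e.g. currency text such as '1,234.56' into a decimal column):
    UPDATE Staging.Customers
    SET LifetimeValue = CAST(REPLACE(LifetimeValueText, ',', '') AS decimal(18, 2));

    -- Convert dates to a consistent representation (ISO 8601 text to DATE):
    UPDATE Staging.Customers
    SET SignupDate = CONVERT(date, SignupDateText, 126);

    -- Remove case sensitivity by normalizing to a single case:
    UPDATE Staging.Customers
    SET Email = LOWER(Email);

    -- Collapse multiple representations to one form (full name to abbreviation):
    UPDATE Staging.Customers
    SET Province = 'ON'
    WHERE UPPER(Province) = 'ONTARIO';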

.NET scripts or SQL Server Integration Services (SSIS) are the most accessible approaches, and they work well in simple scenarios. However, if your data sources are varied, or your data changes so much that developers would have to constantly tweak the ETL process, then you might consider Microsoft's Data Quality Server and Data Quality Client, both of which are installed as part of SQL Server 2012 Data Quality Services (DQS) and play well with SSIS. As the DQS documentation describes it:

“DQS enables you to discover, build, and manage knowledge about your data. You can then use that knowledge to perform data cleansing, matching, and profiling. You can also leverage the cloud-based services of reference data providers in a DQS data-quality project.”

Looking to learn more about DQS? See the DQS blog.

As data analysis becomes an integral part of doing business, maintaining clean data sets will be crucial. Have you found yourself in a dirty data predicament? Get in touch and we can help.

Contact us