If you work with large, diverse data sets, you know that there will invariably be noise and erroneous data in the database you are using. This is one reason why I treat all data as guilty until proven innocent.
Since I work in the field of process improvement, it is necessary for me to rapidly identify bad data to eliminate this information before it becomes an integral part of a historical data mining study or potentially included in a process-improvement experiment. If there are stores or other locations that cannot or do not produce reliable data, we do not want to include them in our work unless the data production problems can be resolved. The problem is, it isn’t always easy to rapidly identify bad data when you have large files, a lot of stores or locations producing the data, or geographical differences that can produce variation in the data. In many cases, we don’t have an unlimited amount of time to do a data quality analysis as part of a measurement study, so we have got to find ways to quickly knock-out the obvious problematic data (and/or locations) and move forward with a subset of data that we can rely upon.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
To solve this dilemma, I developed a Tableau-based, data-cleaning methodology that is repeatable, is based on a sound statistical method, and is widely applicable to different types of data. In the method contained in this post, z-scores are used to identify and eliminate bad or erroneous data using an iterative approach. For an explanation of z-scores (standard scores), click here. To learn about calculating z-scores in Tableau, click here.
Rather than take the time to document (i.e., write) how I implemented this approach in Tableau, I made a video that shows all the steps needed to accomplish the task. The video is 12 minutes long but I think it is worth watching if you need a clever way to clean your data. If you want to watch the video, click the link below and use the HD button to get the best picture your can on your monitor.
Please understand that data removal shouldn’t be taken lightly as part of any actual study. The choices you make regarding how to eliminate bad data will depend upon a lot of factors specific to your situation. In some cases, you might choose a z-score of 3 for data removal, but in another case, a z-score of 2 or 4 or even 3.5 might be needed. This is a decision you have to make and will depend upon the intrinsic variation in the data that you are working with. Lastly, don’t be too quick to remove seemingly bad or erroneous data during a process improvement experiment because the whole purpose of the experiment is to trigger big changes! Understanding the biggest data outliers in those experiments might be the key to understanding how to improve the process.