This article describes how I used the combination of Tableau and Alteryx to resolve a large number of erroneous customer addresses in two steps. The errors in the database existed because the customers were allowed to enter the data on their own into a website. Whenever customers have access to databases, more bad things happen than good things.
Statement of the Problem
When people (customers, parents, etc) enter data into web pages, things generally can go bad. People do not perform quality assurance checks on their input. They want to finish the deal (i.e., data entry) as soon as possible.
I know this from years of running youth sports programs in which I used web sites for capturing player registration information. The common types of errors include mis-typed state names or abbreviations, incorrect addresses, incorrect zip codes, or even incomplete entries. Parents love to enter their kids birthdays on the day that they are doing the data entry. I can’t tell you how many players I had to register that were 0,1 or 2 days old! Some of these errors can be quickly resolved and some cannot be easily fixed.
In this article, I show how I was able to use Alteryx and Tableau to resolve most of the problems I had in a 900K customer database. I originally planned to do this work using custom geocoding in Tableau, but after trying that for a little bit of time, I realized Alteryx was the easier and more robust solution for me given the time I had to work on the problem.
Visualizing the Problem
When I originally received the database, I used Alteryx to geocode the addresses for every customer. This just means that Alteryx gave me the latitude and longitude for each customer’s address. Once I had the location of each customer, I used Tableau to visualize the customer distribution. Figure 1 shows the results of this work, which indicates a lot of inaccurate data. I know the data is inaccurate because I can see multiple colored dots for each state.
If you want to hear me discuss this problem and how I planned to solve it, watch the video shown below.
The Two-Step Procedure For Solving the Problem
I used a two-step procedure for solving the problem. Here is what I did.
- I resolved issues related to the wrong state being associated with a zip code.
- I resolved issues related to the addresses having the wrong zip code.
The next two sections will explain how that was done.
Step 1 – Making Sure The Correct State Was Associated With The Zip Code
Figure 2 shows the simple Alteryx workflow that used information contained in a master zip code file I have. You can click here if you want to download the master zip code file for your own work.
If you want to learn more about this process and how the magic happened, watch the video below.
After this solution was implemented, the customer mapping by state was much cleaner as shown in Figure 3. The problem was not completely resolved, however. There were still about 0.5% of the customers that have not been properly mapped.
Step 2 – Making Sure The Correct Zip Code Is Used For the Address
After step 1 was complete, I captured the 4600 or so customers that were not properly mapped. I called this collection of customers “the lost customers”. To resolve the locations of as many of these as I could, I developed another Alteryx workflow that used the CASS system to find the correct zip code for a given address. Figure 4 shows this very simple workflow.
Once I had the correct zip codes for these addresses, I just re-ran a slightly modified version of the original workflow as shown in Figure 5 to geocode these customers. I was able to properly map over 2600 of the 4600 lost customers, which reduced the lost customers to about 0.2% of the original population.
Figure 6 shows the map of the 2600 lost customers that were recovered after properly determining their zip code. If you want to know how to avoid one of the problems discovered in this data set (dropped leading zeros on zip codes), click here for this Alteryx-based article or click here for the Excel-based article.
If you want to learn more about this final step of the QA process, watch the video below.
The final solution to the problem involves adding two csv files together to get the complete database. If the remaining 2000 lost customers were determined to be very significant, than additional work would be required to track them down and get the correct address information from them.
I could have built one Alteryx workflow to solve this whole problem. If I had done that, the workflow would be efficient, robust and could be used for any databases that I might encounter in the future that happened to have these types of problems.
For teaching purposes in this article, however, I decided to try to keep this simple. I also wanted to quantify how many customers were recovered by each step of the process. My hope in writing this article is to demonstrate how you can methodically solve a problem in an easily-understood, step-wise fashion by using Alteryx and Tableau.