This is the second of a six part series of using Tableau to help with Alteryx applications, especially for new Alteryx users. In this post, I discuss the usage of the business database that is maintained by Dunn and Bradstreet and is accessible via Alteryx. This database allows Alteryx users to access information on businesses by SIC category. There may be other ways to access this business information other than by SIC codes, but this is just the way I learned how to do it.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
How Can Tableau Supercharge Your Access to Business Data?
When using Alteryx to access Dunn and Bradstreet business data, one thing you can do is find company information using the SIC code(s) for the type of businesses that interest you. A SIC code is the Standard Industrial Classification number listed in the Standard Industrial Classification Manual, which is published by the Office of Management and Budget. The SIC Manual is available on the Internet site http://www.osha.gov/oshstats. This website allows you to look-up an SIC code or perform a key word search as shown in Figure 1. In this example, I have searched on the word “convenience” and three types of businesses were returned with codes 3643, 5411, and 5541. Now that I have these codes, I can use them in Alteryx to extract business information for regions that I am interested in investigating with Alteryx. In my case, I’d like to use the last two SIC codes and not the first because it is not related to the type of business I am interested in.
Using SIC Codes in Alteryx
If I took the four digits codes of 5411 and 5541 and tried to find business data in Alteryx, I might be out of luck, that is unless I was very lucky. The reason for this is that the Alteryx SIC codes are different that the codes shown on the website above. To find out which codes I should use, I created a Tableau workbook that contains the Alteryx SIC codes and business descriptions. In this workbook, as shown in Figure 2, you can search on an SIC code or put a key work in the description field to find what you are looking for. In my case, I searched on the work “convenience” and found five entries. As shown in Figure 2, the four Alteryx SIC codes that contain 5411 can vary from six to eight characters but they all have the first four digits starting with 5411, which is one of the two codes (5411) shown in Figure 1. If I want to capture the second category of 5541, I need to do another search on SIC code 5541 and that is shown in Figure 3. This search has returned seven business categories of information for me to consider in my work. In general, the Alteryx SIC Codes vary from two to eight characters and there are a total of 22,314 of them in the third quarter of 2013.
Using Tableau to Understand Dunn and Bradstreet Field Names in Alteryx Output Files
Just like in did in Part 1 of this blog post series, I created a Tableau database for the 22,314 fieldnames that exist in the Dunn and Bradstreet database so that I could immediately identify the meaning of the alpha-numerically coded fields that Alteryx returned to me. This *.csv database is joined in Tableau to the *.csv output that you receive from Alteryx when you extract business level data. This joining process allows you to immediately identify the names of the fields that you have received from Alteryx. The alpha-numeric codes that are used to identify the fields are replaced with the field descriptions, just like I showed in Part 1 of this blog post series. If you were working with data from coal companies, for instance, you can see that there are a lot of fields that will get returned to you as shown in Figure 4. Unless you were exceptionally talented with a deep knowledge base in your brain, it would be hard for you to know that the field labeled BUSNAICS_ECN42121_PCT actually means % Coal Mining (2121)!
Using Tableau to Understand the D&B Data That Gets Returned to You
In this section, I have to be intentionally non-specific due to confidentiality requirements. Once I developed the Alteryx workflow that I needed to investigate the business climate in about 350+ target locations I was interested in, I received from Alteryx a *.csv file containing Dunn and Bradstreet data. Each line of this file identified the target location I was investigating as well as the other companies that performed work within the same SIC code(s) within the region I specified around my target locations. You can specify the region of interest in different ways in Alteryx, which is very powerful. You can specify a known radius around your target location (1 mile, 2 mile, etc) or you can specify a drive-time distance (5 minutes, 10 minutes, etc) around your location. Alteryx does an amazing job in calculating the drive time polygons (more on this later in this series) and it can add some high horsepower to your analysis.
By extracting the business data within a specified region around my target locations, this allowed me to understand the local competitive market in these zones. When I first received the *.csv file from Alteryx, I immediately sent the file to Tableau to visualize the spatial distribution of other companies around my target locations since latitude and longitude is included in the Dunn and Bradstreet output file. Once visualized in Tableau, however, the potential problems discussed below were quickly realized. By using Tableau in combination with Alteryx you can overcome these problems in a relatively short time frame to achieve some amazing results.
When you are in a tight-timeline project and you identify these problems, you can feel a sense of dread followed by a sense of elation once you solve the problems. This is the part of the project where you begin to understand both the power and potential pitfalls of using data like this. This example will explain how I had to study and modify the Dunn and Bradstreet output file before using it to draw conclusions in my project and how the use of Tableau was critically important to the success achieved.
Method Used for Removal of Duplicate Companies in the Dunn and Bradstreet File
The first issue I identified is that the Dunn and Bradstreet database has a lot of duplicate information for businesses. There are spelling variations in business names that create duplicate company entries and there are business name changes over time that can create duplicate information. Without visualizing the locations of competitors around your target locations, you might not immediately recognize these problems. Any database you receive from Alteryx with a Dunn and Bradstreet origin really needs to be cleaned and examined to remove these duplicates. You should not make the mistake of thinking that this database is of the highest quality of information with one record per competitor because it is clear to me now that this database has a lot of redundant and incomplete information included. In case you are wondering, these duplicate values occurred even with me specifying only one SIC code to extract the data.
Outlined below is a step by step approach that I had to develop and use to remove the duplicate entries in my file. I did this work outside of Alteryx, but an Alteryx guru might have a way to do this inside of the Alteryx. After each step of data removal, Tableau was used to visualize the remaining database to help identify if other duplicates still existed. Once this procedure was completed, I had a clean assessment of the local competitive market around each of my target locations.
- Start your process by sorting by your Alteryx output file by target location number and then by the competitor business name. This file contains the Dunn and Bradstreet information for the businesses that exist in your specific search region for the SIC codes you specified. I cannot show an example in this case.
- Use Excel to remove duplicates using only target location number and business name. In my example, there were 3225 original competitors found, in which 801 duplicates were immediately found and removed by Excel. This left 2424 potential competitors in the database.
- I noticed that two spaces were used in some business names causing additional duplicates. I removed the two spaces and replaced them with 1 space. There were 36 additional duplicates found, now leaving 2388 potential competitors, with 26% of the original database now being removed due to duplication.
- Sometimes your original target location is included in the competitor list, and sometimes it is not. This occurs because some businesses do not have accurate latitude and longitude information in the Dunn and Bradstreet database. This inaccuracy may occur due to inaccurate addresses or for some other reason, but I found many of my target locations showing up in the wrong location. Therefore, you have to sweep through the data set to remove your own target locations. In my case, there were 280 locations out of 350+ locations found and removed from the list of potential competitors. After removing these, I had 2108 potential competitors remaining, which is 2108/3225 = 65% of the original competitor list produced by Alteryx.
- Sort by target location and then business name. This puts in order the potential competitors relative to your target locations. For me, this final step identified an additional 80 duplicate competitors that had different business name spellings. The final final count of potential competitors was then 2028/3225, or 62.8% of original list. This particular study indicates that the Dunn and Bradstreet data can have in excess of 35% redundant information. Users should take the time to examine the Dunn and Bradstreet data carefully before using it to make critical project decisions.
What Else Did I Identify About the Dunn and Bradstreet Data?
I really don’t want to leave a negative impression with readers with respect to the Dunn and Bradstreet data that is available through Alteryx. D&B pride themselves on high quality business data as shown in Figure 5. With data on over 67 million businesses in the US, this database certainly can be an asset for projects that require predictive analytics.
The primary issues that I have discovered are that the database is not optimized for quick usage on a predictive analytics project. You have to study the information you receive as output from Alteryx to make informed decisions. You have to consider that a lot of the information is redundant and may be outdated. You also have to consider that a lot of businesses don’t report information like total annual sales. If companies don’t report their numbers or D&B uses estimated values, you will not be able to use that information in predictive models that you might build. I had a much higher expectation for the completeness of data items like annual sales figures than what are actually included in the D&B database. In my study, less than 20% of the businesses I investigated had any sales data (estimated or real) and only a few percent had real data. Additionally, because you cannot modify the source of the D&B data in Alteryx to removed the duplicates once you identify them, every time you go to query the D&B database and get an output file, you have to post-process the output file to remove the problems identified. You might have queried the database for competitors in a 1, 2, and 3 mile radius, for example, and you will have to clean all those output files. That repetitive workflow is a great application for Alteryx if you have the expertise to create the workflow needed to successfully post-process the files. However, the trickiness of removing the duplicates might lead you down a path of spending a lot of time in post-processing and not enough time in analysis.
Overall, access to this much information is very powerful and can be successfully used in predictive analytics projects like I recently completed. You will have to work for this success, however. On the downside, you have to really look at what is returned to you before using it in your analysis. You should visualize the output as much as possible to assure yourself that the data is properly cleaned. Tableau gives you the perfect platform for accomplishing that task.
Upcoming in Part 3 of This Series
I’m going to get into a little more detail about using Alteryx in combination with D&B data and Tableau to examine, quantify, and visualize business competition.
Part 3 has now been published, so click here to read it.
Previous Posts in This Series