In the past couple of blog posts, I’ve been writing about how I get data ready for analysis in Tableau. One special data condition that arises is the data type that I’ll call an alpha-numeric field. This data type can be tricky to work with, so you have to pay attention to the details if you want to get the right analysis completed within Tableau.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
What is an Alpha-Numeric Field?
The example I’ll use to demonstrate the slippery slope that exists with alpha-numeric fields comes from the medical field. There are diagnosis codes that are used to characterize illnesses, so when you check into a hospital, your condition is assigned a code.
One type of coding system (ICD-9) is used to generate a principal diagnostic code, which is shown as the field “PrDiag ” in the examples below. In the current medical environment in the U.S., this coding system (ICD-9) has approximately 15,000 (15K) individual codes to explain an illness being experienced by a patient.
These codes are alpha-numeric because they can start with a number, a leading zero, or a letter as shown in Figure 1. Because this field is neither a text field or a pure numeric field, you have to be careful on how you prepare the data for usage in Tableau.
If you do not prepare the data correctly, you will not complete your analysis accurately. This is especially important in this example, because if you are using this type of data, you are likely to be working with some really big data files which can make it very easy for you to overlook the details.
As an aside note, a major change is coming to this coding system in October, 2014. The number of codes to be used in the US will jump from about 15K to over 155K! Yikes! Click here if you want to read a funny/not so funny account of what this is going to mean to the US-based medical system. I’m glad I’m not a coding transcriber or a doctor that will have to find the correct code for a patient!
Methods Used to Properly Handle Alpha-Numeric Fields
This post only discusses how this type of data can be used as input for Tableau by linking to either an Excel file or a comma delimited files (csv file). I am covering these two file types because with Big Data being so prevalent, you might have to load your data into Tableau using a csv file rather than an Excel file if your data set is very large.
Case 1 – Alpha-numeric Fields in Excel
To begin, you might want to review a post I wrote a few days ago that discusses how flat-file data can be translated into a comma delimited file format that can then be loaded into Excel. At the end of that post, I showed a technique that allows you to load data into Excel so that you don’t lose the leading zeros in data fields that have them.
This technique is directly applicable to alpha-numeric fields and will guarantee that you are loading the alpha-numeric fields correctly (including leading zeros and/or letters) into Excel from an external data source. If you have entered your data directly in Excel, chances are good that the fields will be text-based fields as discussed below.
Now presuming you have loaded the data in Excel using the technique discussed above, the alpha-numeric fields will be automatically treated as text fields by Excel because they are loaded as text fields from the csv file.
Figure 2 shows that if you look at the data content of Cell B2, you will see a quote as the first character, which is a quote (“) in this case because the cell is right justified. If you connect this Excel file to Tableau and you are working with the field PRDIAG, Tableau will automatically and correctly identify this field as a text field (abc) and you will see this field show up on the dimension shelf. At this point, you are safe and your data is being interpreted perfectly by Tableau. There are no problems to worry about.
If you stick with the Excel connection, life is great, that is unless you frequently have to update your data extract because new data is being added to the database over time and your computer is starting to really bog down.
For the sake of argument, let’s add some context to this problem. Given an Excel file having about 420 thousand rows and 86 columns of data (36M data fields), Tableau takes about 7 minutes to refresh an extract on a computer having 128 GB of ram and 40 of the fastest processing cores currently available (this is NOT a typical desktop workstation – it is a monster big data cruncher).
On a standard desktop, this might easily take over an hour to accomplish the extract update. So instead of continuing to work with an Excel connection, you have decided to write the data out to a csv file because Tableau can refresh that extract in under 1 minute and you want to get the job done sooner rather than later because you’ve got a party to go to. So instead of waiting for the extract to refresh, you write the csv file out from Excel and connect that new csv file to Tableau to produce your report before the weekend arrives.
Will life still be great after you connect to the new csv file? Will you make it to the party on time before delivering your weekend stats to your boss?
The good new is that you will make it to the party! The bad news is that your data and stats will be wrong! You might have to find another job come Monday. Hope you had a great time at the party, but before I tell you why your approach failed, I have to introduce you to a fix for this problem.
As shown in Figure 3, I have created a field that mimics the PRDIAG field in Excel by writing a formula that adds quotes around each PRDIAG code. Why have I done that? The reason for this field has to do with how Excel writes the data to the CSV file, as I will explain in the next section.
Case 2 – Alpha-numeric fields in the CSV file
Unfortunately for you, life will not be so great after the weekend because you didn’t know that your PRDIAG data field would not be viewed by Tableau like it was in Excel. When Excel writes to a csv file, a text field will not be quote delimited unless there is a comma in the data fields being written.
This occurs because you cannot tell Excel what format you want each field to have in the csv file. Microsoft has made those decisions for you.
In the case of alpha-numeric fields that really need to be treated as quote delimited text fields, Excel will write them as an unquoted, numeric type field. In our PRDIAG example field, there are no commas, so this field will get written to the csv file just like any numeric field would, with no quotes to indicate that this is a text field. Now in essence, your alpha-numeric field has been inadvertently turned into a numeric field from the perspective of Tableau.
OK. So what is the big deal?
Figure 4 shows what happens when the csv file is loaded into Tableau. First, your PRDIAG field will be interpreted by Tableau as a number (wrong). Secondly, it will be interpreted as a measure (wrong, again).
Even if you change the data type to text and move the PRDIAG field to the dimensions shelf, you are still hosed because all the PRDIAG codes that started with a letter are now NULL and your codes that had leading zeros no longer have them, which you can see by examining Figure 4.
There is simply no way for Tableau to correctly interpret the values of PRDIAG without using the trick I introduced you to above in the PRDIAG (text) field. When you produced the report for your boss before going to the party, you crashed the PRDIAG field before leaving work. You probably never even noticed the destruction because Tableau still completed your tables.
Is There Anything Else That Needs to be Done before Using the CSV file in Tableau?
Yes, there is one easy clean-up that is needed before sending the much faster loading csv file to Tableau.
Figure 3 shows how the PRDIAG (text) field is created. When this field gets written to the csv, Excel will automatically quote-delimit this field!
By forcing the field to an absolute text state by adding the quotes, we have forced Excel to treat PRDIAG (text) as a text field. So what does Microsoft do with text fields? They quote delimit them in the csv file! Talk about the department of redundancy department!
Therefore, we have to remove some extra quotes in the csv file before sending it to Tableau. This is easy to do in your favorite editor. You search for “”” and replace it with “. Figure 5 shows what the triple quote format looks like in the csv file.
Update 2.5 Years Later
With time, comes wisdom. Everything I wrote in this article is still valid. However, to avoid a lot of this non-sense, I would now just use Alteryx to process this type of data. Alteryx will gobble up this data and save me all the headaches I experienced when I first did the work using the manual, brute-force method. Try Alteryx, you will love it!