I have always liked to benchmark computer programs. For that reason, I thought it would be fun to see how fast Tableau is in reading typical structured data sets that I have used in Process Improvement work. The question I want to specifically answer is:
How fast does Tableau read data from a character-separated file (*.csv) to produce a Tableau Data Extract (*.tde) file?
Experience has taught me that it doesn't take long before I am performing tabular, spatial, and visual analysis of data in Tableau once I receive data from a new client. Many clients have been astounded when I call them with questions about their data within an hour of receiving it. I might ask about their data sources or measurement systems, or about an obvious data quality issue that I have uncovered using Tableau.
Clients are typically surprised that I was able to find these problems in such a short time because they know I was completely unfamiliar with their data. One of my favorite quotes from a new client was, “Well, you caught us with our pants down on that one!” To understand how this is possible, I conducted the short experiment described in this post.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!
The Benchmark Experiment
To answer my question, I chose 14 different data sets that I have used over the past two or three years. These csv files originated from different projects and industries, and each has different characteristics.
The files vary in the number and types of dimensions and measures they contain, and the data fields within them can differ significantly. Some files are principally composed of numerical data, while others hold quite a bit of dimensional data. Some files have fixed-length fields, whereas others contain large, variable-length fields.
Since I needed files big enough to record the time required to read the csv files and complete the tasks necessary to create the Tableau data extract, I chose files having at least half a million rows of data (records). The largest file has over 411 million records. Only single (*.csv) files were used in the test – no data blending or data joining was included in this study.
The number of data fields (data columns) in the files ranges from a low of 7 to a high of 62. The number of characters needed to represent one record of data ranged from 41 up to 2,367. The total number of data points in each file is calculated as the number of records (rows of data) times the number of fields; these totals ranged from 5.6 million to nearly 3.3 billion.
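The arithmetic behind the data-points metric is simple; a quick sketch (the specific row and field counts below are illustrative guesses, chosen only to roughly reproduce the quoted totals – they are not the study's actual per-file values):

```python
def total_data_points(num_records: int, num_fields: int) -> int:
    """Total data points = rows of data times number of fields (columns)."""
    return num_records * num_fields

# Hypothetical smallest and largest files in the study:
small = total_data_points(800_000, 7)       # 5,600,000 (~5.6 million)
large = total_data_points(411_000_000, 8)   # 3,288,000,000 (~3.3 billion)
print(f"{small:,} to {large:,} data points")
```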
Due to time constraints, I didn't include any truly large files, but I plan to continue this benchmark study in the future on a big-data platform such as Amazon Redshift.
The Workstation Used For the Tests
All the tests were conducted within a three-hour window on the same computer, running Tableau version 8.0.5. A three-year-old Dell XPS workstation with 16 GB of RAM, 1 TB of hard drive space, and an i7-2600 CPU at 3.4 GHz was used to process all the files.
This CPU has a Passmark CPU Mark of 8,314; the fastest tested CPU had a mark of 16,164. The stopwatch on my Note II phone was used to record the time needed to create the Tableau Data Extract files, with an accuracy of a second or two.
After completing the data loads to produce the tde files, I typed the results into Excel and loaded them into Tableau to produce both tabular and graphical analyses. There are many ways to look at this data to discover interesting behaviors, so I created a parameter-based scatter-plot worksheet that lets me pick whichever variables I want to compare, making it easy to investigate relationships.
I wanted to discover how specific data file characteristics, such as the percentage of null fields, affect the resulting data extracts. Although this is certainly not an exhaustive evaluation of the data, some of my findings are shown below. If you feel like making your own discoveries, you can download the Tableau Public version of this workbook by clicking here.
Table 1 contains a listing of the 14 data files used for the test and gives the speed of the data read during the creation of the data extract. The speeds are recorded as Records (rows of data) per minute and Data Points per minute.
Each of these key measures is dependent on factors such as the types of data stored in the files (numerical versus character-based data), the length of each individual data field, and the number of null fields present. By using a variety of files with differing characteristics, I hoped to gain some insight into which file structures are fast and easy for Tableau to process into data extracts and which are more sluggish.
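The two speed measures in Table 1 follow directly from the stopwatch reading; a minimal sketch (all input values below are illustrative, not measurements from the study):

```python
def throughput(num_records: int, num_fields: int, elapsed_seconds: float):
    """Return (records per minute, data points per minute) for one extract."""
    minutes = elapsed_seconds / 60.0
    records_per_min = num_records / minutes
    data_points_per_min = (num_records * num_fields) / minutes
    return records_per_min, data_points_per_min

# e.g. a hypothetical 3-million-record, 10-field file extracted in 60 seconds:
rpm, dppm = throughput(3_000_000, 10, 60)
print(f"{rpm:,.0f} records/min, {dppm:,.0f} data points/min")
```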
Over the past seven years of using Tableau on a daily basis, I had mentally benchmarked the tde file creation process at about 3 million records per minute. In general, this isn't a bad estimate for this computer, but there are times when it is not achieved and other times when the speed is significantly higher, depending upon the data file structure and content.
The average number of data points read per minute is a fairly staggering 50 million, and it goes as high as nearly 90 million (Example 14 – the biggest file tested and also one of the tightest, because it contains all fixed-length fields). Figure 1 is a scatter plot showing that the time needed to create a tde is reasonably correlated with the number of data points in the csv file. The correlation is 68%, with the remaining variation arising from the csv file structure and data factors mentioned above.
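A correlation like the one behind Figure 1 is just a Pearson coefficient between per-file data-point totals and extract times; a minimal sketch (the sample values below are invented, not the study's data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

data_points = [5.6e6, 4.1e8, 1.2e9, 3.3e9]   # made-up per-file totals
minutes = [0.2, 9.0, 20.0, 38.0]             # made-up extract times
print(round(pearson(data_points, minutes), 2))
```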
Although not directly comparable, I recently tested loading the same type of data as example 14 onto the Amazon Redshift platform. The speed of the data read into the SQL tables was 20 million records per minute for about 30 million records, or nearly 7 times faster than the tde file creation process.
During this work, I also realized that there was a third key measure I could easily determine. As shown in Table 2, I calculated the file compression percentage for each of the data extracts. This compression percentage is the reduction in file size: one minus the size of the tde file divided by the size of the csv file. The csv files are character-based while the tde files are binary, so good compression is expected.
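As a sketch, this measure can be read straight off the file sizes on disk; here compression is expressed as the percentage of the csv size saved by the extract, and the paths are placeholders:

```python
import os

def compression_pct(csv_path: str, tde_path: str) -> float:
    """Percentage of the csv size saved by the extract (higher = tighter)."""
    csv_bytes = os.path.getsize(csv_path)
    tde_bytes = os.path.getsize(tde_path)
    return 100.0 * (1 - tde_bytes / csv_bytes)

# e.g. compression_pct("example10.csv", "example10.tde")
```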
As shown in Figure 2, the file compression percentage decreases when the data file structure isn't tightly packed. Examples 10 and 13 both contain a data field that can be very long, as shown in the fourth column of Table 2. These are variable-length fields that contain written notes about particular account members.
With a maximum of over 1,700 characters per record in Example 10, the tde file compression is the worst observed at about 53%, followed by Example 13 at 62%. Interestingly, eight of the fourteen data files have maximum column sizes of around 200 characters. Perhaps that is roughly what it takes, on average, to describe the parameters of a system we are trying to improve.
My intuition suggested that the (*.csv) files with the highest percentage of null fields would have the highest tde file compression. That concept does not appear to hold in general.
Figure 3 shows the tde file compression as a function of the percentage of null fields in each file. I am not sure why this result occurs because I don't know the details of the (*.tde) data structure; it is just one of those counter-intuitive findings that turn up sometimes.