Benchmark Results of the Tableau Hyper Data Engine


Introduction

I have always liked to benchmark computer programs, at least I thought I did! The work I show in this article took much longer than I expected it would, which effectively reduced the fun factor. Once I decide to do something, however, I have to finish the work.

The Original Work and Four and A Half Years Later

The first time I did this work was in November 2013. Now in 2018, I felt like revisiting the series of benchmark examples I established for Tableau to see how fast the new “hyper” data engine is compared to the pre-existing Tableau Data Engine.

The specific question I wanted to answer is:

How fast does Tableau read data from a comma-separated values file (*.csv) to produce a Tableau Data Extract (*.tde) file and a Hyper Data Extract (*.hyper) file?

I really want to know if the “hype” around the “hyper” engine is real. That is the first part of my mission. The second part of my mission is to examine the file compression that is achieved in the extracts. The third part of the mission is to see if the hyper engine leads to a better Tableau experience in terms of quicker responses when rendering large data sets. This question will be answered after this article is published and I have more time to develop test cases.




The Benchmark Experiment

To answer my question, I used 16 different real-world data sets. These csv files originated from different projects and industries and each has different characteristics. I was interested to see how the hyper engine responded to things like files that have a lot of null fields, wide columns, complicated text fields, or a lot of repetitive fields. Information about these examples will be provided below.

The Testing Methodology

Since the computer I used for the current testing is different from the one I used in Nov 2013, I had to re-run the benchmarks through Tableau 10.3.6 to get updated times for creating the Tableau data extracts. I also had to load each benchmark file into Tableau 10.5 to generate the hyper extracts.

This means there were 32 tests completed, but the actual number was closer to 50 due to sensitivity testing I completed. The Tableau performance recorder was used to accurately determine the time needed to create the extracts (*.tde and *.hyper). Figure 1 is an example dashboard from the performance recorder, showing the time required to create an extract.

 

Figure 1 – An example performance summary dashboard created by Tableau when the performance recording option is used.

 


During this work, I did multiple tests for various benchmarks. I loaded these examples on different days, at different times, and I always reported the shortest amount of time needed to create the extract. Based on this testing, there is about a 10% variation in the time needed to write an extract. This variation exists because of the load present on a computing system, the memory available, and hard drive availability. I tried as best as possible to use consistent conditions during all of the tests to avoid creating additional variation in the results.
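The minimum-of-repeated-runs approach above can be sketched in a few lines of Python. The timing values here are hypothetical, for illustration only; only the method (report the fastest run and note the run-to-run spread) comes from this article.

```python
# Hypothetical extract-creation times (in minutes) for one benchmark,
# collected on different days and at different times. Values are illustrative.
run_times = [12.4, 11.6, 12.9, 11.8]

best = min(run_times)                    # report the shortest time observed
spread = (max(run_times) - best) / best  # run-to-run variation from system load

print(f"best: {best:.1f} min, variation: {spread:.0%}")
```

With numbers like these, the spread works out to roughly the 10% variation noted above.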

Figure 2 shows the number of records and total data points present in each benchmark file. Each benchmark is numbered 1 through 16 and is color-coded in some of the upcoming graphics. You can click on any graphic in this article for a full-scale, more legible version.

 

Figure 2 – The benchmarks used in this work. The number of records varies from about 0.5M to over 411M records. The number of data points varies from 5.6M to 3.3B.

 


The Workstation Used For the Tests

All of the original tests were conducted in a three-hour window on the same computer using Tableau version 8.0.4. At the time, a 3-year-old Dell XPS workstation with 16 GB of RAM, 1 TB of hard drive space, and an i7-2600 CPU at 3.4 GHz was used to process all the files. That CPU had a Passmark CPU Mark of 8,314, with the fastest tested CPU having a mark of 16,164.

The new tests were conducted on a 3-year-old Dell M4800 mobile workstation with 32 GB of RAM, 1 TB of hard drive space, and an i7-4800MQ CPU at 2.7 GHz. This CPU has a Passmark CPU Mark of 8,498, with the fastest tested CPU now having a mark of 27,895. Tableau version 10.3.6 was used to create the new Tableau data extracts (*.tde) and version 10.5.0 was used to produce the hyper data extracts (*.hyper).

Benchmark Results In Tabular Form

After completing the testing, the data was entered into Excel and loaded into Tableau to produce both tabular and graphical analysis. You can download my original analysis from Tableau Public by clicking here. The newest results will be available soon by clicking here.

Figure 3 contains results for all of the tests. Three versions of Tableau were used for all the tests except benchmark 16, which was a new addition to this series. Benchmark number 10, which is a bit of a nasty file with some chaotic text fields of highly variable length, was originally interpreted by Tableau 8.0.4 to have more records than the newer versions. This file, along with number 13, is a bit different than the other files. These were included because they are real-world examples of files that are less than optimal for performing visual analytics.

 

Figure 3 – Benchmark result data for all of the tests.

 


 

Visual Benchmark Results For Speed of Creating Extracts

Figure 4 contains three box and whisker plots showing the distributions of the speeds achieved while creating the extracts. The key measure here is the number of records consumed per minute during the creation of the extracts. The results show median values of 2.7M, 4.2M, and 6.5M records per minute for Tableau 8.0.4, 10.3.6, and 10.5, respectively. Clearly, the hyper engine is the fastest at creating extracts.

If you simply look at the median values of each version of Tableau, it is clear that Tableau software continues to improve the speed of its data engine. In previous work (years 2013-2015), I have noted that my mental benchmark for reading data into Tableau was 3M records per minute. Now that has to be changed to about 7M records per minute, with about half of the test cases ranging from 7M to 13M records per minute.

 

Figure 4 – Benchmark results for all 16 tests in the three versions of Tableau. The box and whisker plot shows the dynamic range of records per minute processed by each version of Tableau. Each circle represents a benchmark and they are numbered 1 to 16, with 1 being on the bottom and 16 on the top of each pane.

 

One of the interesting findings shown in this chart is that Tableau version 10.3.6 had a peak speed of 15.7M records per minute compared to 13.3M for the hyper engine. Benchmarks 14 and 15 were both ingested faster in 10.3.6 than in version 10.5. These results were verified by testing these cases multiple times.


 

Figure 5 contains three box and whisker plots showing the distributions of the speeds achieved while creating the extracts. The key measure here is the number of data points consumed per minute during the creation of the extracts. The results show median values of 49.5M, 78.9M, and 103.7M data points per minute for Tableau 8.0.4, 10.3.6, and 10.5, respectively.

Clearly, the hyper engine is the fastest at reading the data and creating extracts. This result tracks the one shown in Figure 4 (each file's data point count is simply its record count times its column count), but I included it because I like to think in ballpark terms of 100 million data points ingested per minute. Those kinds of metrics are very impressive to me.

 

Figure 5 – Benchmark results for all 16 tests in the three versions of Tableau. The box and whisker plot shows the dynamic range of data points per minute processed by each version of Tableau. Each circle represents a benchmark and they are numbered 1 to 16, with 1 being on the bottom and 16 on the top of each pane.

 

One of the interesting findings shown in this chart is that Tableau version 10.5 had a peak speed of 239.3M data points per minute in benchmark 13. This file has a lot of null values and repeated data. It seems that hyper is really fast at reading these types of files. Ideally, we don’t want to use those types of files, but if you are stuck with them, you are in luck using the hyper engine.

Figure 6 shows the percent difference in records per minute consumed in 10.5 vs 10.3.6. For benchmark #1, the hyper engine is 155% (1.55X) faster than the tde engine. This is calculated as a percent difference in Tableau. For math lovers: new speed / original speed = 7.89M / 3.09M = 2.55, and the change is 2.55 − 1 = 1.55, which is an increase of 1.55X, or 155%. For benchmark #14, the tde engine is 12.5% faster than the hyper engine. In most cases, the hyper engine is faster than the tde engine, averaging 53% faster, with a peak of just over 1.5X faster.
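The same arithmetic, using the benchmark #1 speeds quoted above, can be checked in a couple of lines of Python (this is a sketch of the calculation, not of anything Tableau does internally):

```python
# Records-per-minute speeds for benchmark #1, taken from the article.
tde_speed = 3.09e6    # Tableau 10.3.6, *.tde extract
hyper_speed = 7.89e6  # Tableau 10.5, *.hyper extract

ratio = hyper_speed / tde_speed   # new speed / original speed
pct_increase = (ratio - 1) * 100  # percent difference

print(f"{ratio:.2f}x the speed, a {pct_increase:.0f}% increase")
```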

 

Figure 6 – Percent difference in records per minute consumed in Tableau 10.5 vs 10.3.6. Interestingly, 14 of the 16 tests showed that the hyper engine was faster than the original Tableau data engine. In tests 14 (411M) and 15 (27M), which are the largest files tested, the original Tableau data engine produced the extracts faster than the hyper engine. Each of these files has the same structure and data types and was used to create longer load times. I'm going to have to test some other really big files with different data structures to see if this pattern holds true.

 


 

Visual Benchmark Results For Extract File Compression

 

Not that it really matters that much, but I like to look at the characteristics of the extracts that are created. I wanted to know how the compression ratio has changed over time. I'm amazed how tightly the information can get packed into the data extracts. The median data compression has gone from about 90% to 87% to 84% for Tableau 8.0.4, 10.3.6, and 10.5, respectively.

 

Compression Whisker Results

Figure 7 – Over time, the Tableau data extract files are losing some of their compression. This is likely a trade-off: making the files more responsive by adding easily accessible information to the extract data structure comes at the cost of some compression. If that supposition is true, the hyper extracts should be the most responsive type of extract ever created. Additional testing will be needed to see if that is true.

 

 

An interesting finding here is that the hyper engine compressed benchmark 8 at nearly 97% (compared to 92% for the original Tableau data engine). This took an original csv file size of 768 MB and shrunk it to 24 MB ((768 − 24) / 768 = 97%). That is a very impressive result.

When I look at the characteristics of that file, there is a lot of repeated information. This suggests that one of the things the hyper data engine does best is identifying repeated data and compressing it very efficiently. That characteristic also shows up in the improvement in compressing test #10, from just over 50% for the original Tableau data engine to almost 70% for the hyper engine.
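The compression figures above reduce to a one-line calculation. This sketch uses the benchmark 8 file sizes quoted in this article:

```python
# File sizes for benchmark 8, from the article (in MB).
csv_size_mb = 768   # original csv file
hyper_size_mb = 24  # resulting *.hyper extract

# Compression = fraction of the original size eliminated by the extract.
compression = (csv_size_mb - hyper_size_mb) / csv_size_mb
print(f"compression: {compression:.1%}")
```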

Summary of Results

Things I have learned from this testing:

  1. The hyper engine is faster than the original Tableau data engine. There is real truth to the “hype”. Given the test cases I used, I saw an average improvement of 53%, with a peak improvement of 155% (Figure 6). The peak improvement meant that for an individual data file (BM1), Tableau 10.3.6 read 3M records per minute while 10.5 read it at 7.9M records per minute (1.55X faster). This corresponded to a change from 37M data points per minute to 95M data points per minute. There were three benchmarks (1, 2, and 13) that showed this type of improvement.
  2. For the largest data file(s) tested (BM 14, 15), the original Tableau data engine slightly outperformed the hyper data engine. The peak reading rates were 15.7M records per minute (10.3.6) versus 13.3M records per minute (10.5). This result was a bit surprising and will require future testing with other larger files that have different data structures. The problem is, as the files get larger, the testing takes longer to complete!
  3. The hyper data engine has the capability to compress repetitive data better than the original Tableau data engine. Overall, the extracts are becoming less compact over time, which probably will lead to better overall responsiveness.

Final Thoughts

If I ever have the brilliant idea of performing another study like this, would someone please slap me upside the head and remind me of how much work this is? I completely underestimated the effort this took! So much for starting off as a bookworm (Figure 8).


Figure 8 – Now I need to do some research to determine the month/year for that edition of Reader’s Digest. I’m guessing it is Nov/Dec 1963 based on the back photo and my age in the picture. One thing I can tell for sure is that it cost $0.35. At this time, I cannot remember the article I was reading before the picture was taken.

 

6 thoughts on “Benchmark Results of the Tableau Hyper Data Engine”

  1. Pingback: 3danim8's Blog - Tableau #HyPeR – Is High Performance Real, or is this just HyPeRbole?

  2. Hi Ken,

    I am reading an entry in your blog after a long time. Your thoroughness in setup of the test, interpretation of the data and more importantly the presentation are all admirable – thank you very much for shedding light on the hype behind hyper…

  3. Hi Siraj,

    It’s good to hear from you, and you are welcome! Thanks for the comment as I always appreciate your thoughtfulness.

    Here is something you will be interested in. There is a guy named Ken Flerlage (http://www.kenflerlage.com/) that has a huge dose of Tableau Big Magic going on. I’ve really been enjoying his creativity in Tableau. I sent him the link to the Big Magic article (https://3danim8.wordpress.com/2016/12/08/alteryx-tableau-creativity-and-passion/) and he is now listening to the book! I can’t wait to see what he thinks about it. I bet it hits him full-force like it did me and you.

    Take care and we’ll talk again soon,

    Ken

    • Hi Ken, I checked out Ken Flerlage – awesome stuff and mind-blowing math – my heart literally wants to stop all this ‘work’ I am doing and to dive into the world of math. I hope I get some time in the future to delve into such delicacies in Tableau.

      Incidentally, this week I was creating a tutorial for my students on box plots and your article was right there in my mind. So, I have included a link to this article from Youtube video here – https://youtu.be/xI5V6jTHDG0?t=21m19s

  4. Hey Ken, any plans for a follow-up article on improvements seen on the rendering side, rather than the extract refresh side? I’d love to see those results.

    • Hi Matt,

      Thanks for the idea. It would be interesting for me to do that type of testing. I certainly have plenty of big examples that could push Tableau hard enough to document the rendering improvements. I’ll think about this topic, and see if I can find the inspiration to do the work. Lately, I’ve been taking some time off from blogging.

      Thanks,

      Ken
