I guess you can say that I’ve been on a mission lately to understand and document the performance, capabilities, and limitations of Tableau Desktop software. Today I decided to answer the following question:
When is it necessary to switch from a professional workstation to a “big-data platform” to analyze large structured data files with Tableau?
Notice that this context is strictly limited to structured data files. I am applying this frame of reference because many projects and fields of study rely on structured data. Unstructured data can also be considered in this type of analysis (and likely will be a bigger factor in the future), but for now, I’m looking only at well-defined, structured data sets.
Working With Tableau on a Professional Workstation
In a recent post, I performed some benchmark work to see how fast Tableau could read structured data files. The reason for this work was to answer two questions: (1) How long does it take before I can work with large data files once I receive them from a client? (2) How big a file can I actually use with Tableau on a normal professional workstation? In performing that work, I determined that for a typical project that I am involved in, Tableau will read the data and create a (*.tde) extract at a rate of about 3 million rows of data per minute. If I receive a “typical” 15 million row database, I’ll be analyzing the information in Tableau on a professional workstation in just over 5 minutes. This is very fast and is one of Tableau’s greatest strengths.
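The estimate above is simple arithmetic, and it can be handy to keep as a small helper when a new file arrives. This is only a sketch: the function name is mine, and it assumes the read rate stays constant at the roughly 3 million rows per minute I measured, which will vary with column count, data types, and hardware.

```python
def minutes_to_extract(rows: int, rows_per_minute: float = 3_000_000) -> float:
    """Rough time (in minutes) to build a (*.tde) extract.

    Assumes a constant read rate; the default is the approximate
    rate from my workstation benchmarks, not a Tableau guarantee.
    """
    return rows / rows_per_minute


# The "typical" 15 million row database from the text.
print(minutes_to_extract(15_000_000))  # 5.0
```

In practice I allow a little extra for connecting to the file and saving the extract, which is why I say “just over” 5 minutes.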
Now regarding question number two, I have wanted to quantify or estimate the maximum file size that I can effectively use in Tableau. What I mean by the term “effectively use” is this: can I solve real-world problems in a reasonable time without applying special techniques or tricks? Since I am not a Tableau developer and don’t have access to the Tableau source code, what I’m about to say is somewhat speculative but is backed by a lot of practical experience (over 1,000 uses of Tableau across seven years). The technical details may not be fully comprehensive, but I think the conclusions are sound.
Since Tableau version 8.05 is currently a 32-bit program, it can address at most 4 gigabytes (GB) of random access memory (RAM). Some form of virtual memory access might also be used within Tableau Desktop, although I am not sure of this for the following reason. If I take a data set that borders on running out of memory on a 16 GB workstation and place it on a 32 GB workstation, I don’t see any benefit from having twice the RAM (or even more hard drive space). On both workstations the Tableau performance is essentially the same, and Tableau Desktop runs out of memory at nearly the same place in the workflow. Simply adding more memory to a computer (above 4 GB) does not change your ability to work with larger files in Tableau. It doesn’t matter whether you are running a 64-bit version of Windows or not: Tableau doesn’t address more than 4 GB of memory because it is a 32-bit program.
I have determined that I can effectively work with “typical work-related” files up to about 20 million records on a professional workstation. This typical file will have between 5 and 20 columns of data to hold information such as date, time, sample identifier, location information, subcategories, and measured variables. This is a lot of information that represents 100 to 400 million pieces of data! To work with larger files or to add multiple blended tables to an already big base file, I have to be very smart with filter settings to limit the amount of information that Tableau is trying to process. If you forget to properly set filters before rendering a view, you will pay the price with a long query or an “out of memory” failure. For these larger data processing situations, Tableau will suddenly send you a warning message that indicates a memory shortage and that it might unexpectedly close. When you start working near this memory limit, you have to think a couple of moves ahead of what you ultimately want to produce or you risk getting suddenly knocked out of the Tableau workspace. At that point, you are trying to manage a difficult situation unless you happen to have a map in your head of how queries are conducted by Tableau and the amount of memory being required to handle each query.
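The “100 to 400 million pieces of data” figure is just rows multiplied by columns. A minimal sketch of that calculation (the function name is hypothetical, chosen only for illustration):

```python
def data_points(rows: int, columns: int) -> int:
    """Total number of individual values in a rectangular table."""
    return rows * columns


# A 20 million record file with 5 to 20 columns, as described above.
low = data_points(20_000_000, 5)
high = data_points(20_000_000, 20)
print(f"{low:,} to {high:,}")  # 100,000,000 to 400,000,000
```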
Practically speaking, I have not found it to be an enjoyable experience working with files that contain enough data to consume the available memory. Even though Tableau will produce a (*.tde) file for much larger files (see example 14 from my benchmark post), it doesn’t mean that you are going to have a great experience working with that extract. Tableau has only so much RAM to work with in its current configuration, and when it gets consumed your options will be limited. The good news is that when the 64-bit version of Tableau is released, the theoretical limit of addressable memory will jump from 4 gigabytes to 16 exabytes, which is practically unlimited and should allow us to unleash some serious files into the Tableau workspace! When that capability is achieved, this blog post will need to be rewritten.
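The 4 gigabyte and 16 exabyte figures both follow from the pointer width: a program with n-bit addresses can refer to at most 2^n distinct bytes. Here is a quick check, using binary units (1 GiB = 2^30 bytes, 1 EiB = 2^60 bytes). The function name is mine, and these are theoretical ceilings; the operating system reserves part of the address space, so a real process gets less.

```python
GIB = 2**30  # bytes in one gibibyte
EIB = 2**60  # bytes in one exbibyte


def addressable_bytes(pointer_bits: int) -> int:
    """Theoretical number of bytes addressable with a given pointer width."""
    return 2**pointer_bits


print(addressable_bytes(32) // GIB)  # 4  (32-bit program)
print(addressable_bytes(64) // EIB)  # 16 (64-bit program)
```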
Jumping to the “Big Data” Platform
Last night I awoke in a sweat after seeing a bill from Amazon Web Services. In the dream, I had forgotten to deactivate a cluster I was using and had just received a $766,334 invoice for using the Redshift platform. I was pretty sure that was over my American Express credit limit, so I was really worried about how I was going to get the bill paid! Luckily, it was just a dream. In real life, however, there is a cost to using “big data” platforms, and I have been doing some testing to find out the costs, effectiveness, and limitations of using Tableau on certain platforms.
Last week, I did some basic benchmarking on Amazon Redshift. I found out that data was populated into SQL tables at the rate of 20 million rows per minute for the architecture that I was using. This metric is faster than the (*.tde) file creation process by a factor of roughly 7. When you make a live connection to Redshift, you don’t use Tableau extracts and all queries are handled directly in SQL. You still have the option of creating a Tableau extract, but last week I did not try that option. Today was the day I completed that test and Figure 1 shows some screenshots captured during this work.
Since I could not find any documentation on using (*.tde) files with Redshift, I knew that I had to test it myself. I wanted to find out the answer to two questions: (1) Where is the (*.tde) file written: on Amazon servers or on my workstation? (2) How fast is writing the (*.tde) file compared to my benchmark results from last week? I thought I already had a good idea of the answers to these questions, but I wanted to be sure.
Using a 27-million-line climate data file (about 1 GB in *.csv format), I reinitialized a 16-node Redshift cluster that I created last week. The cluster re-initialization process took about 20 minutes, and since the data was previously loaded onto Redshift, I was ready for the test. Once I connected to the data source, I simply chose to create a Tableau data extract rather than a live connection. I immediately knew the answers to my questions. First, there is no choice of writing the (*.tde) to the Amazon servers. The (*.tde) is a proprietary binary file that Tableau Desktop writes only to your local workstation. There is no mechanism in Tableau to write such a file to the Amazon servers, and Redshift itself stores data in SQL tables, not Windows-based binary files.
For question 2, the (*.tde) creation speed is going to depend on the speed of your internet connection, since every row that goes into the (*.tde) file has to travel across the network to your workstation. On my system, the process was agonizing. Prior to starting this test, I had just benchmarked this file on the local workstation and recorded a rate of 4.64 M records per minute, or 37.1 M data points per minute. On Redshift, the rates dropped by a factor of 6, down to 764 K records per minute and 6.1 M data points per minute. The time required to create the (*.tde) for this test went from under 6 minutes on the workstation to about 36 minutes on Redshift. Therefore, there is no benefit to using (*.tde) files on Redshift. You should just expect to make a live connection and accept the speed that Redshift provides in executing SQL commands. Of course, the speed at which Tableau responds during a live connection on Redshift also depends on the speed of your internet connection, because those query results have to be sent over the network back to your local workstation.
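For anyone who wants to check the comparison, the numbers above can be reproduced with a few lines of Python. This is a sketch using the rounded figures from this post (27 million rows is an approximation of the actual file length, so the times come out slightly below what I recorded), and the variable names are mine.

```python
ROWS = 27_000_000          # approximate length of the climate data file
LOCAL_RATE = 4_640_000     # records per minute, extract built from the local file
REDSHIFT_RATE = 764_000    # records per minute, extract built from Redshift

local_minutes = ROWS / LOCAL_RATE
redshift_minutes = ROWS / REDSHIFT_RATE
slowdown = LOCAL_RATE / REDSHIFT_RATE

# Data points per record implied by the local benchmark (37.1 M / 4.64 M).
values_per_record = 37_100_000 / LOCAL_RATE

print(f"local:    {local_minutes:.1f} min")       # local:    5.8 min
print(f"redshift: {redshift_minutes:.1f} min")    # redshift: 35.3 min
print(f"slowdown: {slowdown:.1f}x")               # slowdown: 6.1x
print(f"values per record: {values_per_record:.0f}")  # values per record: 8
```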
For this reason, if you are going to use Redshift, you should free up as much bandwidth as possible, use the fastest internet connection you can get, and be strategic in how you choose your queries before you execute them. If you take care of these details, Redshift will give you access to large structured data files in Tableau. Finally, if you are going to use Redshift regularly, commit to a 1-year or 3-year prepaid pricing plan, because you will cut your costs dramatically (up to a factor of 8) compared to the pay-as-you-go pricing plan that I have been using. This test was completed in under 1 hour and cost $13.60.
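To keep nightmare invoices at bay, it helps to estimate the bill before starting a cluster. The sketch below assumes pay-as-you-go billing is simply nodes × billed hours × an hourly rate per node, and that my session was billed as one full hour on 16 nodes. Under those assumptions the $13.60 charge implies about $0.85 per node-hour; that rate is inferred from my bill, not quoted from Amazon’s price list, so check current pricing before relying on it. The function name is hypothetical.

```python
def cluster_cost(nodes: int, billed_hours: int, rate_per_node_hour: float) -> float:
    """Pay-as-you-go cost estimate: every node is billed for every hour."""
    return nodes * billed_hours * rate_per_node_hour


# Rate implied by this test: $13.60 for 16 nodes, assumed billed as 1 hour.
implied_rate = 13.60 / (16 * 1)
print(round(implied_rate, 2))  # 0.85

# What a full 8-hour working day on the same cluster would cost at that rate.
print(round(cluster_cost(16, 8, implied_rate), 2))  # 108.8
```

The same function makes it easy to see how quickly a forgotten cluster adds up: the cost grows linearly with every hour it is left running.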