Notes on Working With CSV files Above 4 Gb in #Tableau, Part 2

Introduction

Tableau version 8.2 is an absolute GAME CHANGER for me. The big file and data size boundaries that I was flirting with are now gone. Say goodbye to the 4 Gb file issues that plagued us with the Jet driver. Tableau has done a great job once again in improving their software.

If you read Part 1 of this post, forget what I said about loading your big *.csv files into Microsoft SQL and doing the extra work to make that happen. You won’t have to write custom schema.ini files, and you won’t have to buy the SQL software. Simply use Tableau 8.2 with the new database driver that replaces the Jet driver. It is awesome, it is fast, and it gets the job done for big data. There are three findings I want to report related to using this driver, but they are relatively minor compared to the massive improvements in speed and capability that the new driver provides. I’m not going to write a long post with all kinds of benchmark information like I have done before, because I have too much work to do with version 8.2.




Benchmarking the New Driver

I specialize in process improvement work. Improvements of 10 to 15% in established processes are outstanding. Well, Tableau blew that out of the water with this new driver. Here is one example benchmark: the time it took to create a Tableau data extract for a single 2.4 Gb csv file containing 19,688,771 records, 27 fields per record, and 531,596,817 total data points:

Version 8.1 – 13:08 (13 minutes, 8 seconds) at 40.47 million data points read per minute

Version 8.2 – 6:49 (6 minutes, 49 seconds) at 77.98 million data points read per minute

In case you are wondering, that is a staggering 48% improvement (i.e., reduction) in the amount of time needed to load the data and get to work. Just for the fun of it, I hit Tableau with the big brother of that baby 20 million line file. This 62 million line file (1.67 billion pieces of data) was devoured in 20 minutes, at a rate of 83.7 million data points per minute. The amazing part is that this was the speed during a data join of the big file with another, smaller file! Now that is seriously fast. By the way, that 62 million line file is an 8.65 Gb csv file that would have been impossible to read in Tableau version 8.1. The csv file size barrier has been shattered. Finally, the data types were perfectly interpreted by Tableau 8.2, with no additional work needed on the csv file prior to loading.
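
For anyone who wants to reproduce the throughput arithmetic, here is a minimal Python sketch that recomputes the rates and the time reduction from the figures already quoted above (record count, field count, and the two extract times); nothing in it comes from Tableau itself.

```python
# Recompute the throughput figures quoted above from the reported numbers.
records = 19_688_771              # rows in the 2.4 Gb csv file
fields = 27                       # fields per record
data_points = records * fields    # 531,596,817 total data points

def rate_millions_per_minute(minutes, seconds):
    """Data points read per minute, in millions."""
    elapsed_minutes = minutes + seconds / 60
    return data_points / elapsed_minutes / 1e6

v81 = rate_millions_per_minute(13, 8)    # Tableau 8.1 extract time: 13:08
v82 = rate_millions_per_minute(6, 49)    # Tableau 8.2 extract time:  6:49
reduction = 1 - (6 * 60 + 49) / (13 * 60 + 8)   # drop in elapsed time

print(f"8.1: {v81:.2f} million data points per minute")   # ~40.47
print(f"8.2: {v82:.2f} million data points per minute")   # ~77.98
print(f"Time reduction: {reduction:.0%}")                  # ~48%
```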

Minor Issues with the New Driver

During this testing and real-world usage, I discovered a few things that seemed to have changed with the new driver. Here are the items that work differently than before:

Issue #1

In previous versions of Tableau, it was always possible to work dynamically with an open Excel file and Tableau. You could make changes in the Excel file, save it, and the results would be immediately available in Tableau once a live refresh or extract refresh was completed. With the new driver, I have run into multiple instances of Excel being unable to save the file that is currently being viewed in Tableau. You either have to save the Excel file under a new name or close your existing Tableau workbook. Either way fixes the problem, but it is an annoyance, and I’m ending up with temporary files written to my project drives because of it. I would classify this as a minor file synchronization issue that is likely to be resolved in future releases.
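
If you run into this and want to confirm that it really is a file lock (rather than something on the Excel side), one quick check is to try opening the workbook for writing from outside both programs. This is only an illustrative sketch of that check in Python; the path is a made-up placeholder, and it is not anything Tableau or Excel provides.

```python
# Quick check: is the workbook currently locked for writing by another process?
# On Windows, an exclusive lock held by another application typically shows up
# as a PermissionError when we try to open the file for read/write access.
from pathlib import Path

workbook = Path(r"C:\projects\mydata.xlsx")  # hypothetical path to the Excel file

try:
    with workbook.open("r+b"):
        print("No write lock detected - Excel should be able to save.")
except PermissionError:
    print("The file appears to be locked by another process (e.g., Tableau); "
          "save the Excel file under a new name or close the Tableau workbook.")
```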

Issue #2

This problem may be related to Issue #1 above, but here is a step-by-step description of it.

  1. You have a file in Excel that you are also working with in Tableau, and you make a change to the data structures in the Excel file (i.e., you add some new columns of data);
  2. You successfully save the Excel file (if issue #1 discussed above doesn’t happen);
  3. You refresh your data extract in your current Tableau workbook;
  4. Once the refresh is complete, you will notice that one of your existing data fields has been incorrectly identified as either text or numeric. In my examples, a numeric field (i.e., one normally treated as a measure) is treated as text after the extract refresh and sent to the Dimensions shelf. I suspect that where this issue occurs depends upon where you inserted the new data columns, but I haven’t pinpointed the problem due to time constraints.
  5. The solution is to save and close your existing Tableau workbook, reconnect to the Excel worksheet, regenerate the extract, and then re-open your original Tableau workbook. The corrupted data field will be correct in the new extract (a quick way to sanity-check the source field types before regenerating is sketched after this list). I suspect that this issue will be resolved in future releases.
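
If you want to check where the misclassification is coming from before rebuilding the extract, a quick look at how the source columns are typed can tell you whether stray text values are already in the Excel data or are being introduced by the refresh. This is a hedged sketch using pandas; the workbook name and column name are placeholders, not anything from my actual files.

```python
# Inspect how each column of the source workbook is typed before rebuilding
# the Tableau extract.  Requires pandas plus an Excel reader such as openpyxl.
import pandas as pd

df = pd.read_excel("project_data.xlsx")   # hypothetical workbook name

# Columns that pandas reads as 'object' (text) are the likely candidates to
# land on the Dimensions shelf after the extract refresh.
print(df.dtypes)

# For a field that should be numeric, coerce it and count the failures;
# a nonzero count points at stray text values in that source column.
suspect = "Sales"   # hypothetical column name
bad = pd.to_numeric(df[suspect], errors="coerce").isna() & df[suspect].notna()
print(f"{suspect}: {bad.sum()} values that could not be read as numbers")
```

If the column looks clean on the Excel side, that points back at the extract refresh itself, which matches the behavior described above.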

Issue #3

There are actually two issues discussed in this section. The first has to do with the order in which files are selected for a data join. I have never noticed this behavior in previous versions of Tableau (with hundreds of joins completed), so it might be related to the new database driver or to new coding approaches being used.

  1. If you do a data join, be sure that your biggest file is the first file loaded for the join. In a data join where I put the small file (400 lines) in before the big file (20M lines), the left join took 52:48 (minutes:seconds), which was clearly very slow. When I reversed the order and did an inner join, the time dropped to somewhere near 14 minutes, which was just a little longer than reading the 20M file by itself. That is fast.
  2. The second issue may or may not be related to the first. In this example, I wanted to do a right join after reversing the order, but Tableau would only allow me to do an inner or a left join, as shown below in Figure 1. For some reason, the right and full outer joins were not available as options. In this case, the inner join was suitable because my right-side file was limited to what I wanted to join and didn’t have extra locations included. I have other examples, however, where these join options would not be sufficient and would lead me astray. Maybe the Tableau data engine is super smart and recognized that an inner join would be sufficient for this case, but I’m suspicious of this behavior. I am left wondering why the right and full outer joins are not allowed for this example and for some others I have tried (see the short join sketch after Figure 1).
Figure 1 – Only Left and Inner Join Options are Allowed.
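
For readers who want to see why the inner join was good enough here, and what a right or full outer join would have added, the difference is easy to demonstrate outside Tableau. The pandas sketch below uses a handful of made-up rows that mimic the big-file/small-lookup-file situation described above; it illustrates standard join behavior, not Tableau internals.

```python
# Illustrate left, inner, and full outer joins with a big fact table and a
# small lookup table, mimicking the 20M-line / 400-line join described above.
# The tables and values here are made up for the example.
import pandas as pd

big = pd.DataFrame({"location": ["A", "A", "B", "C"], "value": [1, 2, 3, 4]})
# "D" stands in for an extra location that exists only in the lookup file.
lookup = pd.DataFrame({"location": ["A", "B", "D"],
                       "region": ["East", "West", "North"]})

left = big.merge(lookup, on="location", how="left")    # A, A, B, C; D dropped
inner = big.merge(lookup, on="location", how="inner")  # A, A, B only
outer = big.merge(lookup, on="location", how="outer")  # A, A, B, C and D

print(left, inner, outer, sep="\n\n")
```

If the lookup file has no extra locations of its own (as in the case described above), the inner join returns the same matches a right join would, which is why it was sufficient here; it is precisely when the right-side file has keys like "D", or when unmatched rows need to be preserved, that the unavailable right and full outer options start to matter.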
