Insights on Using #Tableau Coupled With Amazon #Redshift

Introduction

Yesterday and earlier today, I had a chance to experiment with Tableau coupled to Amazon Redshift.  What I am about to write isn’t totally comprehensive, but if I had this knowledge yesterday morning, I would have saved a few hours of work. Hopefully this information will save you some time, too, if you try to use Redshift for the first time.

Prior to yesterday, I really didn’t know how to make a connection from Tableau to Redshift, but I was able to successfully complete the mission without too much pain. I learned that there are a few crucial things that you need to know to get the work done efficiently to keep the usage costs down because once you create a computational cluster, you are on a running clock. The clock runs continuously even if you aren’t using the computational nodes to do anything (lesson #1)!  In my case, I used a total of four compute nodes for less than 24 hours and this cost was $68 ($0.85 per hour per node = $3.40 per hour for 4 nodes).  This is the rate for on-demand usage, but Redshift is much cheaper if bought in advance. There was no charge from Amazon for data storage since I am on free trial period for that service.  So if you are thinking about trying this service, watch the video included in this post and take some time to read my notes to save yourself some cash.  This should also help you avoid some of the difficulties that I had to unravel during this trial.  I spent at least half of the time during this trial doing internet searches to help me resolve problems encountered.  There is definitely a shortage of detailed training guides to help the first-time user of Redshift.  This isn’t surprising since this is a relatively new service. I also want to thank Kim Nielsen of Tableau Software for arranging a free 10-day trial of Tableau Desktop Professional Edition which allowed me to perform this work!


If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!

Enter your email address to subscribe to this blog and receive notifications of new posts by email.


Test Description

I used some climate data from the Global Historical Climatology Network (GHCN) to investigate the Tableau/Redshift connection.  I took one year of worldwide climate data from 1890 and blended it with the climate monitoring station data to produce a couple of dashboards that I show in the video.  The test I conducted was intentionally small but sets the stage for a much bigger test to come.  The video was recorded real-time so that you can get a feel for the performance of Redshift.  Figures 1 and 2 show the dashboards that were very quickly created for this exercise.  The intent of this trial was to test the Tableau/Redshift connection, rather than trying to produce a beautiful dashboard!

Image

Figure 1 – Dashboard used to show worldwide maximum temperatures during 1890.

Image

Figure 2 – US based dashboard showing Arkansas climate monitoring stations during 1890.

Video Recorded During the Testing-Period

Once I had the connection established between Tableau and Redshift, I recorded a video to attempt to show the performance of Redshift.  I really didn’t have a great plan for what I was going to do, but I just tried a few different things to see how long the queries took.  Some of the time delays surprised me a little, probably because I was connected live to the database.  By connecting live on Redshift there are no Tableau data extracts, so all queries are completed using behind the scenes SQL commands.  This is why you have to load your data into SQL tables on Redshift, which is lesson #2 for me.  I didn’t try creating a Tableau data extract during this example, so I don’t know how performance would vary.  I don’t know if the Tableau data extract would be created on the Amazon server or if it would be created on the local workstation.  If it is on the local workstation, I don’t know how memory limitations would come into play with big files. Since  I haven’t completed my entire analysis of Redshift speed at this time, so I’ll save my judgments and conclusions for later.  Figure 3 is the video of my Redshift usage.

My Notes on How to Work With Redshift

Every computer platform has its own set of rules, assumptions, and methodologies that you have to learn to be an effective user of the system.  Redshift is well documented but still requires some expertise to be able to complete the work in a reasonable time.  I will say that when you are a first-time user, there is a lot to learn in a short period of time (you are on the clock!).  There are quite a few steps that have to be completed before things actually work. What I have tried to do is record what I learned yesterday so that if I have to revisit this platform again, I’ll be starting with some knowledge of how to get things done.  I have tried to document the overall process in a step-wise fashion along with providing some needed details in the written notes on the things that I couldn’t find during the internet searches.  Click here to retrieve my working notes developed during this trial.  If you want to learn about using really BIG data and Amazon Redshift, read this awesome post.  On the Tableau side, you will need to install the Amazon Redshift driver before you will be able to connect to your data.  Finally, one of the nice things I saw with Redshift is that once you get ready to deactivate your cluster, the configuration setting are stored in a “snapshot” file.  You can reactivate a cluster when needed by selecting your previous snapshot file.

Update 10/31/13

Today I spun up a new 16-node cluster to do a live demo for some potential clients.  Things went well and the procedures for using Redshift were a lot easier the second time around. Writing this post obviously helped me remember all the steps, so in one hour I was able to upload a couple of gigs of data, create and populate the SQL tables, set-up Tableau on a laptop, install the drivers, connect to the data and create a nice working dashboard.  The number of records approached 30 million (one decade of climate data from 1890 – 1900) and the primary data table was populated in 1 minute and 30 seconds (about 20 million records per minute).  Tableau worked like a champ with only minimal query delays. All that I can say is that Tableau coupled to Redshift is really, really impressive.  I can’t wait to do further testing using Tableau data extracts to see how they perform.

3 thoughts on “Insights on Using #Tableau Coupled With Amazon #Redshift

    • You are welcome Jen. I have learned a lot from your blog so its good to know that you liked the post. Congrats on your Zen master status! It is very well deserved!

      Ken

  1. Pingback: Going Behind The Scenes Of 3danim8’s Blog | 3danim8's Blog

Leave a comment