Processing Global Travel Data – Part 3, The Hard Way Via Chas

My First Guest Blogger

This article represents the first guest blogger I have had in 3danim8’s blog. If you have watched Part 1 of this series on processing global travel data, you were introduced to Chas and his new Camaro. Without further delay, here is Chas in his first technical article.


Introduction

Hello World! I have never really written a blog post before, but this seemed like as good a way as any to get started. This blog post will be the third (and final) installment in a series focusing on global travel data. In part one, Ken explained the specifics of our problem and developed an Alteryx workflow to effectively carry out the solution.

However, as Ken mentioned in part 2, this workflow is not the only method that could be used to tackle this problem. We thought that it would be interesting to compare and contrast the Alteryx method we used on Saturday to the methodology I used in my initial attempts to explore and solve the problem.

Who Am I?

Before going further, I will reintroduce myself. My name is Chas Latch. A little over two months ago I graduated from Baylor University (sic ’em bears) with a degree in Computer Science.

I will also take the time to issue a small disclaimer: I am very much still a newcomer to all of this. The first time I even heard the word Tableau was when I started my new (and first) permanent employment at General Motors. In fact, I opened the program for the very first time only three weeks ago. In that very small time span I have not even scratched the surface of what this tool can do.

However, it is immediately apparent that this tool is something very special. I want to thank Ken as well for letting me write this here. I look forward to learning even more from this blog and am grateful for the opportunity to contribute!

The Task

So as Ken mentioned in part 2, this task came to Sylvia, Roy, and me in the form of a relatively high-level email. The task we were interested in was processing the itineraries of a number of travel records. In particular, we wanted to parse the itinerary string into its component parts.

So if a particular travel record was from Austin to New York, its corresponding itinerary string would be “AUS/JFK”. So we want a process to take this string and extract the two sub-strings “AUS” and “JFK”. This may not seem like such a big deal, just split on the character “/”. Duh. Unfortunately, there is some more complexity here. For example, if a record was a trip from Austin to Miami with a connecting flight in Dallas, the itinerary would be “AUS-DFW/MIA” and the parsed components would be “AUS”, “DFW”, and “MIA”. For more information about this, see the whiteboard session Ken did in part 1 of this post.
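To make the parsing goal concrete, here is a quick sketch in Python (purely for illustration; the actual work in this post was done in Access with SQL and VBA). Both “/” and “-” act as separators between legs:

```python
import re

def parse_itinerary(itinerary):
    """Split an itinerary string like 'AUS-DFW/MIA' into its component codes."""
    # Both '/' and '-' separate legs, so split on either character.
    return re.split(r"[/-]", itinerary)

parse_itinerary("AUS/JFK")      # ['AUS', 'JFK']
parse_itinerary("AUS-DFW/MIA")  # ['AUS', 'DFW', 'MIA']
```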

Whenever I see the word “parse”, I immediately think regular expressions (regex). Many readers may already be familiar with regular expressions and can probably skim the remainder of this paragraph. So just what is a regular expression? There are probably many ways to answer that, but the simplest way to think of them is this: a regular expression is simply a way to represent patterns within a string of characters. It does this by using a specific set of symbols and syntax. The particulars of regex syntax are beyond the scope of this post, but more information can be found here.

So while the particulars of regex are not important, what is important is what they allow us to do. By letting us efficiently recognize certain kinds of patterns in strings, we can begin to define some operations on those patterns. For example, we could create a regex_replace(pattern, target, substitute) function that takes the first occurrence of pattern in target and replaces it with substitute. We can also define a regex_extract(pattern, target) function that simply returns the first occurrence of pattern in target.
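In Python terms (again, just for illustration; the versions I actually used later were written in VBA), these two helpers might be sketched as:

```python
import re

def regex_replace(pattern, target, substitute):
    # Replace only the FIRST occurrence of pattern in target (count=1).
    return re.sub(pattern, substitute, target, count=1)

def regex_extract(pattern, target):
    # Return the first occurrence of pattern in target, or '' if no match.
    match = re.search(pattern, target)
    return match.group(0) if match else ""

regex_extract(r"[A-Z]{3}", "AUS-DFW/MIA")      # 'AUS'
regex_replace(r"[A-Z]{3}", "AUS-DFW/MIA", "")  # '-DFW/MIA'
```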

Ok, so what? Is it possible to craft a regex pattern that would extract the component parts of an itinerary string? I was confident that the answer was yes. However, I would first need to find a way to even run regex on the data. I quickly started looking into various ways to do this. I was excited to see that Tableau did have some support for regex in calculated fields. However, I was disheartened to learn that this was a new feature shipping with Tableau 9, and I would only have access to Tableau version 8. I would have to leave Tableau to do regular expression work.

My Solution

I feverishly started searching for alternative ways to run regex on my data, even as just a proof of concept for further refinement. Ultimately the quickest solution I could come up with was to dump the entire data set into an Access database. Unfortunately, I immediately ran into problems.

It seems that Access has a limitation on how many rows of data may be imported from an Excel spreadsheet. From what I was able to find, this limit is 65,536 rows, the old Excel worksheet limit (our data set had around 100,000 rows). So I begrudgingly partitioned my data into chunks of around 60,000 rows and imported the two sets.

Once the records were in a table within Access, I was able to use SQL coupled with some regex functions written in VBA (Access’s native scripting language) to do an initial parse of the itineraries. Figure 1 shows the initial imported table, and Figure 2 shows the table resulting from the initial parse. We were on the right track!


Figure 1 – Example itinerary data.


Figure 2 – Itinerary data parsed using regex expressions.

Clearly this was a step in the right direction. However, what is not clear is the complexity of the regex at the core of this query. Figure 3 shows the SQL query used to do this parse. Yuck. Yes, this query is UGLY.


Figure 3 – The regex queries needed to complete the itinerary breakdowns.

It may not be immediately clear what is going on with the regex expressions shown in Figure 3. Within the sub-query, the first column we select is simply the unprocessed itinerary string. The next column is the first IATA code found in the itinerary. To get the second IATA code, we simply remove the first one (by replacing it with an empty string) and repeat the method used to find the first. We can keep repeating this process of stripping off the first N codes to read the N+1th code for any N we choose.

For this data, I happened to choose a value of 10. For each itinerary, 10 columns will be created containing the corresponding leg of the trip itinerary. I was thrilled! Just like that, I had a nice pretty reference table.
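A Python sketch of this strip-and-repeat idea may make it clearer (illustrative only; the real work was done in SQL with VBA regex functions):

```python
import re

def itinerary_columns(itinerary, n=10):
    """Emulate the repeated extract-then-strip parse: return n leg columns,
    padded with empty strings when an itinerary has fewer than n codes."""
    columns = []
    remaining = itinerary
    for _ in range(n):
        match = re.search(r"[A-Z]{3}", remaining)
        if match:
            columns.append(match.group(0))
            # Strip the code we just read so the next search finds the next one.
            remaining = remaining[:match.start()] + remaining[match.end():]
        else:
            columns.append("")
    return columns

itinerary_columns("AUS-DFW/MIA", n=5)  # ['AUS', 'DFW', 'MIA', '', '']
```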

I powered up Tableau and connected to the new database. Now I simply joined the table containing the original travel data with this new table. The problem was, this didn’t work. Ultimately the visualization that I think everyone had in mind was a world map with different trips represented as paths. Unfortunately, a path map requires that the points of the path be on different rows.

For us, the path points would correspond to the component IATA codes making up the itinerary. Ok, so we not only need to parse these itineraries into their own columns, we then need to transpose each of those columns onto its own row so Tableau can create the path map we wanted. This may seem like a complicated notion (and it is), but as shown in Figure 4 the SQL required to do this doesn’t wind up being too bad (note that this figure only shows a portion of the query, but you get the idea).


Figure 4 – The SQL commands needed to transpose the travel legs.
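Conceptually, the transpose in Figure 4 boils down to something like this Python sketch (the record_id here is a hypothetical row identifier, not a column from our actual data):

```python
def transpose_legs(record_id, leg_columns):
    """Turn one wide row of leg columns into (record_id, leg_number, iata)
    rows, one row per leg, so each path point lands on its own row."""
    return [(record_id, i + 1, code) for i, code in enumerate(leg_columns)]

transpose_legs(1, ["AUS", "DFW", "MIA", "", ""])
# [(1, 1, 'AUS'), (1, 2, 'DFW'), (1, 3, 'MIA'), (1, 4, ''), (1, 5, '')]
```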

So what are the costs of an operation like this? Time costs seemed to be reasonable. On my laptop, this query completes consistently in about 2 minutes. What about space? Well, clearly some expansion of the data is expected. After all, we are parsing each record and expanding it into its component parts.

However, the above query is guaranteed to increase our data to ten times its original size. This seemed gratuitous to me. Sure, some really complicated itineraries might use all 10 rows allocated to them, but most itineraries are made up of only 2 or 3 parts. These shorter itineraries would be carrying around 7 or 8 unneeded empty rows (note the empty columns in Figure 2). Again, SQL comes to the rescue, as shown in Figure 5.


Figure 5 – SQL used to clean the empty records.
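In sketch form, that cleanup is just a filter on the transposed (record_id, leg_number, iata) rows, keeping only rows with a non-empty code:

```python
def drop_empty_legs(rows):
    # Keep only rows whose IATA value is non-empty, mirroring the
    # cleanup step shown in Figure 5.
    return [row for row in rows if row[2] != ""]

drop_empty_legs([(1, 1, "AUS"), (1, 2, "JFK"), (1, 3, "")])
# [(1, 1, 'AUS'), (1, 2, 'JFK')]
```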

So now we could actually go into Tableau and make the visualization we had in mind. A path map was absolutely possible at this point. However, there was one more little wrinkle. What if we want to calculate the distance of one of the legs of a flight? I thought for sure that this would not be an issue with Tableau. I was confident that even if Tableau was not able to calculate the distance between two points on the earth out of the box, then I would be able to create the necessary calculated field to do the math on my own.

This, however, is where the snag occurs. We were using the custom geocode method described in one of Ken’s earlier blog posts. By utilizing this method, we were able to have Tableau translate the IATA codes in our data into their corresponding latitude/longitude pairs, which is exactly what this technique is supposed to accomplish.

However, Tableau does not currently allow a generated measure to be used within a calculated field. Because we were using one of the convenience functions built into Tableau, our latitude and longitude values would be inaccessible to any kind of calculation we wanted to perform.

Without access to latitude and longitude, calculating the distance between two IATA codes would be impossible. So, what does this mean? It means that we need to create another reference table containing IATA codes and their corresponding coordinates, as in Figure 6.


Figure 6 – Airport IATA codes and latitude and longitude.

We can then specify a join to this table when we connect in Tableau. Note in Figure 7, that because we need the location of both the origin and destination IATA codes, we will need to join this table an additional time. Our data should now contain all the information we need to create a calculated field finding the distance between the two points of a leg.


Figure 7 – Tableau table joins.
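For the curious, the math behind such a distance calculated field is the standard haversine formula for great-circle distance. Here is a Python sketch (the airport coordinates below are approximate, and the calculated field in Tableau would encode the same trigonometry):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # 3958.8 is roughly the Earth's mean radius in miles.
    return 2 * 3958.8 * asin(sqrt(a))

# AUS to JFK, using approximate airport coordinates:
haversine_miles(30.194, -97.670, 40.640, -73.779)  # roughly 1520 miles
```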

An example visualization showing this is shown in Figure 8.


Figure 8 – Example flight segments.

Completed Solution

So there you have it. Using the above methodology, we can absolutely parse travel itineraries the way we wanted to. However, this is an elaborate solution that was placing a pretty heavy load on my Tableau application. Was Tableau performing what I asked? Absolutely.

However, multiple “Out of Memory” errors and a few other stability issues (Figure 9) were a pretty clear indicator that Tableau was having to really fight with some aspects of this approach to give us what we wanted.


Figure 9 – Tableau was in trouble!

It is extremely likely that the data joins being used in Tableau were the culprit. Nevertheless, I am sure that there are many aspects of this process that could be optimized, and the methodology used may even be considered a totally viable way to do this kind of parsing. Something to keep in mind is that these kinds of optimizations take time to implement, as did the initial development.

It took three days of pretty solid work to create this solution. One can imagine that it would take many more hours before this was really ready for “prime-time”. Further, can you imagine trying to maintain this process? It is complicated enough for me to keep this process straight in my head, and I am the one who wrote it. I can’t imagine being handed this process and having to work out what it is doing and why on my own.

Comparison to Alteryx

When Ken and I sat down on Saturday to try and recreate this process as an Alteryx workflow, I was amazed. We were able to recreate what had taken me three days of solid effort in less than 2 hours. That still shocks me. I can either spend days manually working through this process, or accomplish the exact same thing with a couple of hours of tinkering in Alteryx. Figure 10 shows the final workflow we developed to solve the same problem.


Figure 10 – The final Alteryx workflow we developed to solve the same problem.

And that’s not even the whole story. The Alteryx workflow was not only faster to develop, it was also far faster to run. If you watched the videos in part 1, you saw just how quickly Alteryx completed what took my query minutes to perform. On top of all of this, the icing on the cake is that this workflow is much easier to maintain than the approach detailed above (at least in my opinion). The whole point of this post was to show just how much work it took to accomplish this task versus how easy Alteryx made this work.

Wrap Up

All in all, the gains seen by using Alteryx in this situation were unreal. By using this tool, we eliminated the need to write a single one of the SQL statements used by this method. What’s more, the regular expressions needed by Alteryx are also GREATLY simplified from what we had to use here. I believe the only real regex pattern we used was “[A-Z]{3}”, which essentially just matches the first three consecutive upper-case letters it finds (IATA codes happen to be formatted this way). Clearly this is an incredible simplification from the regex used in Figure 3.
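To see just how little that one pattern has to do, here it is at work in Python form (illustrative only; Alteryx applies it through its own RegEx tool):

```python
import re

# Three consecutive upper-case letters: exactly the shape of an IATA code.
PATTERN = r"[A-Z]{3}"

re.search(PATTERN, "AUS-DFW/MIA").group(0)  # 'AUS' (first match only)
re.findall(PATTERN, "AUS-DFW/MIA")          # ['AUS', 'DFW', 'MIA']
```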

Summing up, in less than 2 hours spent in Alteryx we were able to accomplish what had taken me three days to do manually. Perhaps the most impressive measure of relative performance is the amount of time it takes to actually execute the query. With my solution, it takes around 2 minutes just to parse and format the data. Compare this to the 10-12 seconds that it takes Alteryx to do the entire job, and you see that Alteryx completes the task in under 10% of the time. I know that I was completely shocked by this.

If you want to get close to the same effect we observed, I encourage you to skim over this post again very quickly and then go re-watch the part 1 videos. Tell me, which process would you rather use?

For me, there is no question. By using Alteryx, we were able to focus on what we wanted to do to our data rather than how we would actually do it. Under the hood, the methods that Alteryx is actually using may not be all that dissimilar from what we did in this post. This is an entirely speculative statement, as I have no idea how the black magic inside Alteryx actually works. However, this is exactly the point that I am trying to drive home.

One of the biggest strengths you get from using a tool like this is that you don’t need to be concerned with the how. We simply tell the tool what kind of operation we would like done, and Alteryx takes it from there. This frees us from the many hours it would otherwise take to design, implement, debug, and maintain the “how” parts of the solution, letting us instead focus on interpreting the results of our analysis and determining how we wish to proceed next.

At the end of the day, Alteryx just boosts your productivity more than any other tool I have been exposed to thus far. Not that Tableau can’t be used on its own to accomplish this. It absolutely can do this without Alteryx (as proved by the methodology above). Yet if we start to offload the non-visualization tasks from Tableau to a more suitable back end, we really start to see what a crazy powerful combination Tableau and Alteryx can be together.

If you managed to stay with my post long enough to reach this point, thanks for reading. I hope that this was at the very least interesting to read. I look forward to continuing along this journey of Tableau goodness, and hope that this post is the first of many more to come as I slowly discover the awesomeness that is data analytics.
