This is part 4 of a blog post series on combining Alteryx and Tableau to perform a real-world example of an ETL-based analytics project. If you want to read the previous parts of this series, click the blue links for Part 1, Part 2, or Part 3.
It was recently reported that 2014 was the hottest year ever recorded. You can find out for yourself whether this is true in the area where you live by working with this data!
Step 5 – Completing the Alteryx Workflow
The next step of this project is learning how to build the entire data processing workflow within Alteryx. This is a multi-step procedure that begins with reading data and ends with writing Tableau *.tde files (as well as *.csv files for your viewing pleasure).
The Climate Data Processing Workflow
To show how the climate data gets processed, I have documented each step of the complete workflow with at least one figure. As shown in Figure 1, the entire workflow involves about 25 discrete operations that are assembled to form a complete workflow. What follows is a left-to-right ordering of the steps needed to complete this workflow, with a brief description of each operation used in Figure 1.
The way I have produced the figures that document each step of the workflow should help users learn (1) which Alteryx function is used in each step, (2) how each function is configured, and (3) the order in which the functions are assembled.
The Workflow Uses A Batch Macro
In this example, the workflow is written as a batch macro in Alteryx. This means that a list of monitoring locations is assembled and passed to the macro, so that the workflow is completed for every desired monitoring location.
The reason that I employed a batch macro is simple but important to understand. I want this workflow to be flexible in a couple of different ways. First, I want to be able to pick and choose which monitoring locations I want to process. I might choose to process the data from a few monitoring sites, a few hundred sites, or several thousand sites. Second, I did not want to be required to process the data for all 92,500 monitoring stations because I’ll never look at the results from all of those sites.
By using a batch macro, I was able to assemble a list of all monitoring stations of interest that gets sent to the macro. Only these stations get processed, which gives me the flexibility I desired. I could have removed the macro portion of the workflow and used a wildcard approach (*.dly) to process all the data, but that approach would have taken too long and produced too much extraneous information.
Therefore, the batch macro approach makes sense for this project. By processing the inventory data, I assembled a list of the top 200 monitoring stations (those having the most data) and used this list to create some files to work with. The top-200 processing operation took 100 minutes, generated 1,990 files (about 10 files per station: 5 measures, with 1 *.csv and 1 *.tde file per measure), and produced about 3 GB of output. These files can then be processed in Tableau, as discussed later in this series (Part 5).
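Outside of Alteryx, the batch-macro idea can be sketched as an ordinary loop over a chosen file list rather than a wildcard. The station IDs and paths below are hypothetical, just to illustrate the difference:

```python
import glob

def process_station(path):
    # Stand-in for the full Alteryx workflow applied to one station file.
    print(f"processing {path}")

# Hypothetical station IDs; in the real project this list comes from
# processing the inventory of the top 200 stations.
stations_of_interest = ["USW00094728", "USC00042319"]

# Batch-macro style: build the file list from the chosen stations only.
selected_files = [f"data/{sid}.dly" for sid in stations_of_interest]

# The wildcard alternative (*.dly) would instead sweep in every station:
# all_files = glob.glob("data/*.dly")

for path in selected_files:
    process_station(path)
```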
Explaining the Workflow
Step 1 – The macro and initial data join
The first two objects shown in black in Figure 2 are the batch macro components that receive the list of monitoring stations to process. To understand how to build a batch macro in Alteryx, you need to read my article by clicking here. That article contains references to key pieces of information on building batch macros.
The batch macro sends the list of monitoring stations to the workflow one by one. The join operation shown in Figure 2 joins the monitoring station data (lat, long, etc.) with the actual climate data by matching the station ID, as shown in the upper left side of the figure. That join operation assembles all the data needed for a stand-alone analysis of each monitoring station in the database.
By processing the monitoring stations one by one, you avoid having to handle everything in one file, which can stress Tableau to the breaking point when the record count goes well beyond a billion. You also get a much easier method of updating the data on a routine basis, if needed, without the long *.tde rendering times I experienced last year with one massive data set.
Step 2 – Convert the year and month to strings
Figure 3 shows a simple but common method used to manipulate data in Alteryx. The select tool is used to convert the year and month (integers) to strings. The reason for this becomes apparent later in the workflow.
Step 3 – Pad the month_str variable to two digits
Figure 4 shows how to use the padleft function to make all months have two digits in the string. This step might not technically be needed, but I used it to be sure that the upcoming creation of a date would work reliably.
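For readers who think in code, the convert-then-pad sequence of Steps 2 and 3 is equivalent to this small Python sketch:

```python
year, month = 2014, 3

# Step 2: convert the integers to strings (the select tool's type change).
year_str = str(year)
month_str = str(month)

# Step 3: pad the month to two digits, as the padleft function does.
month_str = month_str.zfill(2)  # "3" -> "03"
```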
Step 4 – Create the day number based on the field name
As mentioned previously, the original designers of this data set did not want to waste space in their files by storing the day that the climate reading was made. Instead, they used an implicit approach whereby the field name (value1, value2, etc) represented the day of the month. Therefore, the “day” number has to be stripped off the field name and padded to two digits to be added to the date string. This operation is shown in Figure 5, in which I combine right and padleft functions to produce the day number.
Step 5 – Create the measurement date for each day in the file, using a multi-field formula
The measurement date is created using some fancy string operations, as shown in Figure 6. Notice that the checkbox is selected to change the output to a date data type. All 31 fields have to be selected for this multi-field formula because up to 31 individual dates must be created for each record in the file.
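Steps 4 and 5 can be sketched together in Python. The field name here mirrors the implicit value1, value2, ... convention described above:

```python
from datetime import date

year_str, month_str = "2014", "07"
field_name = "value9"  # the field name implicitly encodes day 9

# Step 4: strip the day number off the field name (the right + padleft combo).
day_str = field_name.removeprefix("value").zfill(2)  # "09"

# Step 5: assemble the date string and convert it to a real date data type.
measurement_date = date.fromisoformat(f"{year_str}-{month_str}-{day_str}")
```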
Step 6 – Use the select tool to remove some unnecessary fields
It is always a good idea to periodically remove fields that are just taking up space in your workflow. In this step, I do just that as shown in Figure 7. Removing excess fields is a good step to take to optimize your workflow.
Step 7 – Perform a simultaneous transpose operation on five variables
The following operation is interesting, powerful, and applicable to a wide variety of data sets. There are a total of five values that are read on a row-by-row basis. These fields include the date (implicit), measurement value, mflag, qflag, and sflag. These variables have to be transformed from a row-based structure (they exist in each record) to a column-based structure, and this transformation has to happen by the time the workflow is complete. Figure 8 shows how this looks in the workflow.
Figures 9 and 10 show how the 31 dates and 31 value fields are selected for their transpose operations.
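In pandas terms, each of those transpose operations is a melt from 31 day columns into 31 rows. Here is a minimal sketch using just 3 of the 31 value fields, with a made-up station ID:

```python
import pandas as pd

# One GHCN-style row: a station/element/month with one column per day
# (only 3 of the 31 day columns are shown for brevity).
wide = pd.DataFrame({
    "station": ["USW00094728"],
    "element": ["TMAX"],
    "value1": [211], "value2": [-9999], "value3": [224],
})

# Transpose the day columns into one record per day.
long = wide.melt(
    id_vars=["station", "element"],
    value_vars=["value1", "value2", "value3"],
    var_name="day_field",
    value_name="value",
)
```

In the Alteryx workflow the same transpose is done once per variable (date, value, mflag, qflag, sflag), and the five long tables are then rejoined by record position in the next step.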
Step 8 – Rejoin the transposed variables using record position
Since all five variables have 31 fields per row, these variables can be rejoined in one operation, as shown in Figure 11. If there were an uneven number of fields across the variables, special care would be needed in rejoining the data because misalignments in the records would occur.
Step 9 – Select the final output fields
This is another simple operation using the select tool, as shown in Figure 12. It really isn’t doing anything but was inserted during the early development of the workflow and could now be removed.
Step 10 – Remove dates that do not have values
A simple filter is used to remove dates that do not have values, as shown in Figure 13. A check is made to see if the date field is empty.
Step 11 – Remove measurement dates that have no data records
Another simple filter is used to remove data records that have no values, as shown in Figure 14. A check is made to see if the value field equals -9999, the sentinel value used in this data set to flag missing measurements.
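The two filters in Steps 10 and 11 amount to a simple predicate. The records below are invented for illustration; -9999 as the missing-data code follows the description above:

```python
records = [
    {"date": "2014-07-01", "value": 211},
    {"date": "", "value": 155},              # Step 10: empty date, dropped
    {"date": "2014-07-03", "value": -9999},  # Step 11: no measurement, dropped
]

# Keep only records with a date and a real measurement value.
kept = [r for r in records if r["date"] != "" and r["value"] != -9999]
```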
Step 12 – Perform data unit conversions
An if-then-else block is used to convert units for each of the five variables we are processing, as shown in Figure 15. The conversions vary depending upon the variables being processed. This is a very efficient way of handling multiple conversions.
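As a sketch of that if-then-else block: GHCN-Daily stores temperatures in tenths of °C, precipitation in tenths of mm, and snow/snow depth in mm. The target units below (°F and inches) are an assumption for illustration and may differ from the workflow's actual choices:

```python
def convert(element, value):
    """Convert a raw GHCN-Daily value to display units (assumed deg F / inches)."""
    if element in ("TMAX", "TMIN"):
        return (value / 10.0) * 9.0 / 5.0 + 32.0  # tenths of deg C -> deg F
    elif element == "PRCP":
        return (value / 10.0) / 25.4              # tenths of mm -> inches
    elif element in ("SNOW", "SNWD"):
        return value / 25.4                       # mm -> inches
    else:
        return value                              # leave other elements alone
```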
Step 13 – Assign a century bucket
A series of century buckets is created for the data since we have readings going back to the 1700s, as shown in Figure 16. I would strongly encourage you to read my article about why you would want to use buckets by clicking here.
Step 14 – Assign a decade bucket
A series of decade buckets is created for the data since we have readings that span so many decades, as shown in Figure 17. The rationale is the same as for the century buckets in the previous step. The actual calculation of the decade is a fairly clever formula and is about as efficient as it can be.
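The workflow's exact formula isn't reproduced here, but integer division gives one common, efficient way to compute both kinds of bucket:

```python
def decade_bucket(year):
    # Integer division truncates: 1887 -> 188 -> 1880.
    return (year // 10) * 10

def century_bucket(year):
    # Same idea with 100: 1765 -> 1700.
    return (year // 100) * 100
```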
Step 15 – Choose only 5 data types
A filter is used to choose only the five data types mentioned previously (precip, snow, snowd, tmax, tmin), as shown in Figure 18. This filter greatly speeds processing since all other climate measurements are ignored.
Step 16 – Strip off the file extension
A text to columns operation is used to parse the file name into two pieces, as shown in Figure 19. The reason for this operation is so that I can use the file name as a basis to build the five output files that will be created. The extension of *.dly is not needed for the output file names, so it gets stripped off.
Step 17 – Strip off the file path
A simple text replace operation is used to strip the file path off of the file name, as shown in Figure 20.
Step 18 – Add the measurement type to the file name
A simple text append operation is used to add the name of the climate element (type of measurement) to the file name, as shown in Figure 21.
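Steps 16-18 together amount to basic path and string manipulation. The path below is made up, and the underscore separator is an assumption for illustration:

```python
import os

full_path = "C:/climate/data/USW00094728.dly"

# Steps 16-17: strip the directory path and the .dly extension.
base = os.path.splitext(os.path.basename(full_path))[0]  # "USW00094728"

# Step 18: append the measurement type to build the output file name.
element = "TMAX"
output_filename = f"{base}_{element}"
```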
Step 19 – Reorder the data in preparation for output to the files
A selection tool is used to reorder and pick the final output, as shown in Figure 22.
Step 20 – Send the output to two different files
The output is sent to two types of files (*.csv and *.tde), as shown in Figures 23 and 24. You will note that the file names are created automatically by Alteryx, as shown in the lower left corner of Figures 23 and 24. This is a very powerful feature of Alteryx. The output file name is created using the field we assembled called “output_filename”, and the extensions of *.csv or *.tde are added automatically by Alteryx. The *.tde file is a Tableau data extract, and these files can be loaded directly into Tableau.
Figure 24 – The output file name is chosen for the *.tde file by using the field called “Output_Filename”.
Step 21 – Viewing the output
Figure 25 shows a partial list of the files created by processing the top 200 monitoring stations, and Figure 26 shows the actual content of one of the *.csv files.
Upcoming in Part 5
The project continues by using Alteryx to statistically process the temperature records from the top 200 monitoring stations, as well as taking an in-depth look at the precipitation and maximum temperature trends in Texas, USA. Tableau is used to visualize these results.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!