This is part 2 of a blog post series on combining Alteryx and Tableau to perform a real-world example of an ETL-based analytics project. If you want to read part 1, click here. In this article, you will obtain and extract the data used in this project. This example uses free climate data available from here.
The Climate Data Set
The data set I am using contains a few billion records and is continuing to increase in size. This data represents climate readings from all across the world, from the mid-1700’s to the present day. There are over 92,500 monitoring stations from which the data has been recorded. There are monitoring stations that have over 175 years worth of data and others that have only relatively recent data. To learn more about this data and the program that creates and maintains it, click here.
Step 1: Downloading the Data
In this series, I want to be thorough, but I have a lot to cover. I will make certain assumptions, however, that may hamper some people but be no trouble to others.
I will assume that everyone knows how to download data. Therefore, go to the website shown in Figure 1 and download the (*.gz and *.txt files) that you see there. Notice that the biggest *.gz file is over 2.4 Gb and that will take you some time to download, depending upon your connection speed.
The *.txt files are also critically important to performing this work because they contain a lot of information with respect to the content of the files. At this time, it would be a very good idea for you to stop reading this blog and study the content of the *.txt files. That is your homework. The reason for this is that we will be working with the data stored in the *.txt files and I need you to be familiar with its content.
Step 2 – Download the Tartool and Extract the Data
Download the tartool if you do not already have a way to unpack the *.gz files. The right side of Figure 2 shows the file list and the left size of Figure 2 shows the commands used to unpack any *.gz files you choose. In the example show, I unpacked all three files. These files contain the daily monitoring data for three types of monitoring networks including:
- “ghcnd-all.tar.gz” if you want all of GHCN-Daily – 92.5K files, 22.2 Gb, (all monitoring stations);
- “ghcnd-gsn.tar.gz” if you only want the GCOS Surface Network (GSN) – 817 files, 660 Mb (smallest subset);
- “ghcnd-hcn.tar.gz” if you only want the U.S. Historical Climatology Network (U.S. HCN) – 1218 files, 2.3 Gb.
Once you complete this step, you have have created the *.dly files that contain the monitoring data that will be used in this analysis. If you are short on disk space, just choose to unpack #2 to work with 800+ files. If you want the full complement of monitoring stations from around the world, unpack #1 and create over 92,500 files. Go ahead, I dare you!
These are ASCII, fixed-format (i.e., flat files) that look like the sample shown Figure 3. Each line in this file actually represents a month of data. Extracting the data from files like these is where the usage of Alteryx begins. The first step in an ETL project, is of course the “E” part, which means extraction.
Now that we have the raw flat files, we will let these files lie dormant for a while. We will turn our attention to creating the database of monitoring station data that will be needed to perform the temporal and spatial analysis of this data in Tableau.
Step 3 – Assembling the Monitoring Station Data
There are two principal components of this database. The first component is the monitoring station data. The second component is the actual climate data collected at the monitoring stations. What you did in Step 2 was to unpack the actual climate data. Now we need to assemble the monitoring station data so that we know where the data originates.
Part A – Extracting the Monitoring Station Data
As discussed previously, the content of the *.txt files is very important to this project. The flat-file structures used to store this data is described in the file “readme.txt” as shown in Figure 4.
As described in Figure 4, the file called “ghcnd-stations.txt” contains the station ID, latitude, longitude and other information for each of the 92.5K monitoring stations currently in the database. Figure 5 shows a portion of the monitoring station data as it is stored in the flat file. You notice that there are no delimiters between fields in a flat-file structure.
In my approach to creating this case study, I made an assumption. I assumed that I would only have to process the monitoring station database once. Considering that the world-wide climate monitoring network is continually evolving, this assumption is faulty. If I did this work again a year from now, I would have to process the monitoring station database again.
For this reason, writing an Alteryx workflow to do the things I am about to do would have been the right thing to do. However, I decided to take a shortcut and quickly perform the work in Excel and leave the workflow development to you as a homework assignment. I’ll describe the assignment at the end of this article.
My assumption is that since the monitoring station data only has to be processed one time and it only nine columns of data, it isn’t necessary to use Alteryx to extract the information from this file. Excel can be used to perform the data extraction and to assembled the monitoring station data. This spreadsheet will become very handy in this project.
Part B – Loading the Monitoring Station Data Into Excel
Figure 6 shows the fixed-width field setting used in Excel to parse the contents of the file “ghcnd-stations.txt”. These column positions were previously shown in Figure 4.
When you decide to do your homework by writing an Alteryx workflow to read this file, you will need to develop a flat-file configuration file. You can read my article by clicking here to understand how that is done.
Part C – Loading the Country and State Data Into Excel
Figure 7 shows the contents of “ghcnd-countries.txt”. This file contains two-digit codes that are used to represent each country.
These codes are embedded as the first two characters of the monitoring station name. You start by parsing the first two characters from the monitoring station name. Since we want to know the country name for all of our monitoring stations, we need to use this file to the populate a “country” field in the monitoring station data set you just created in Part B in Excel. By performing a vlookup using the two digit country code, you will be able to populate your “country” field.
You can do the same vlookup operations with the two digit state codes that are stored in the “state” field. These codes can be used to vlookup the values of the “state name” that is stored in the file “ghcnd-states.txt”.
In your Alteryx homework, for this portion of the workflow, you will have to learn how to do a few things. These steps include:
- Creating the flat-file configurations and then reading the three text files needed (monitoring stations, countries and states);
- Learning how to parse the two characters for the country name;
- Completing data joins of these fields.
- Writing the completed database to an Excel file or a csv file.
Part D – Assemble the Complete Monitoring Station File
By performing the operations in Parts A, B and C, the complete monitoring station database will be finalized as shown in Figure 8. Click here if you want to download my copy of this 52 Mb file to compare to your work.
As we move on through the development of this workflow, this assembled database will be joined with the climate data within Alteryx to produce the data used for visualization in Tableau. (See Footnote Below).
Upcoming in Part 3
The project continues by developing the workflow to join the monitoring station data to the actual climate data. This is where the real power of Alteryx becomes clear.
When I first performed the work, I completed another major task. I used Alteryx to create an inventory of the climate data that is available to use. I did this work because I wanted to quantify the spatial and temporal data distributions. I wanted to know what type of data is available where and when, including insights in terms of countries and states. However, this part of the workflow is optional, so I’ll save its description for later. The results of this work is stored in the 52 Mb monitoring station Excel file I shared earlier in this article.
If you like this article and would like to see more of what I write, please subscribe to my blog by taking 5 seconds to enter your email address below. It is free and it motivates me to continue writing, so thanks!