For the past couple of months, I’ve been working on some projects that create and use a lot of data. Over the course of a year, several billion records will be created and I get hammered with over 330 million new records per month.
When I was first exposed to this data, I decided that I could roll over and cry uncle, or I could develop some techniques to manage and make sense of this onslaught of information. This is the type of situation that I conceptually described in this article.
I’m going to share some insights on methods and techniques that I use in Alteryx and Tableau to efficiently perform quantitative and visual analytics on this data. If someone were to keep throwing hundreds of millions of records at you per month (Figure 1), could you handle it? If not, you might want to read this article.
Over the Next Few Weeks/Months
With each upcoming week, I plan on adding an additional technique(s) to this article that I’ve developed to efficiently process the information in Alteryx followed by visualization in Tableau. Some of the techniques are very simple, while others are more complex.
It is my hope that by the time I finish this collection of techniques, others will be able to use some of them to solve their daily challenges when working with big data. For some additional guidance on solving tough problems with Alteryx, you might consider reading this article. You can also explore the multitude of Tableau techniques I’ve previously published by reading this article or scanning the blog history for all Alteryx and Tableau techniques I’ve written about.
What is most important, however, is the realization that I have now established a working methodology for processing large volumes of data quite efficiently. In my case, I receive hundreds of millions of records per month from hundreds of mechanized data sources. The data keeps being created and keeps flying at me. I have to be able to receive it, clean it, and process it for us to make great use of it.
The techniques I developed allow for that data to be stored, cleaned, processed, and analyzed at multiple levels of detail. The details I am referring to include time, spatial attributes and are data-source related. In addition to simply receiving and processing the data, I’ll be adding data QA techniques to this article that will surely assist you when faced with challenges like these.
Technique Number 1 – Using the *.yxdb File Format As a Standard To Store Results
I now have a standard that once I ingest data into Alteryx and the data types are determined, all the output files I create are stored in the *.yxdb format. This file format not only holds the data in a very compact format, it also stores the data types of every field. If you want more information on what a *.yxdb format represents, click this link.
Although you might not realize it, you can waste a lot of time determining the data types for big data sources. You should only do it once and by storing the data in *.yxdb format, you will only have to do it once!
Another reason why I have chosen to do this is that I tend to build some very long and complicated workflows. When I build these, I typically build them in stages. After stage 1 is complete, I write out a *.yxdb file that holds those results.
When I begin stage 2, all I have to do is read in the stage 1 results file and I’m ready to go. I don’t have to perform a linkage from stage 2 to stage 1 at this point since I do not want to take the time to keep running the stage 1 process over and over. This is a huge time-savings technique when it comes to developing workflows. As the workflow matures, I might have several *.yxdb files that store the intermediate results. Ultimately, these will disappear as described below.
The lesson I learned is to get each part of the overall workflow completed sequentially, with one output file created for each stage of the larger workflow. When all stages are done, the intermediate *.yxdb files will no longer be needed because you will just make tool connections between each of the staged workflows to create your grand masterpiece. This is simple to do by removing the file output tool at the end of each stage and connecting the data stream to the beginning of the next stage.
Here are a few other reasons why I use the *.yxdb file format as a standard to store results. Since the *.yxdb format is compact, it makes it easy for me to move around between different computers, or even to send to other people that might want to use the data. I can also easily convert the *.yxdb content into any other data format supported by Alteryx, as shown in the video below where I convert the contents of a single yxdb file into another type of file type (*.xlsx).
Although it is tempting to write other forms of output such as *.tde files or *.csv files while the workflow is running, you will end up eating up huge amounts of hard drive space and consuming large amounts of time writing the files. If you have not ever profiled your workflows, take a minute to do that and see how much time is spent in Input/Output operations in Alteryx. Alteryx is so computationally efficient that the time required for many big data workflows is dominated by input/output operations.
You can save a lot of time by selectively converting the files you need for visualization when you need to. In the next technique in this article, I show how this is done.
Technique Number 2 – A Simple Alteryx Workflow For Converting Multiple Files Into Different Formats
In this simple example, I show how an Alteryx workflow can be used to convert all the files in a subdirectory from one format to another. I have used this method to convert hundreds or thousands of files for Big Data situations such as climate monitoring station data from thousands of sites around the world.
This is a simple workflow and simple technique, but it uses a very powerful feature of Alteryx: the ability to automatically name files based on a field name in the workflow. In this case, I’m using the filename that is sent into the workflow by the directory tool. The text to column tool will perform a quick slight of hand to strip the file extension off of the filename, which gets recombined with the new file extension to create a new file type.
One advantage of this method is that the files you are converting do not have to have the same schema or file structure since the files are processed one at a time. Conversely, you could attempt this operation with a file input tool with a *.yxdb wildcard for the filenames, but if the schemas vary, you will experience some warnings from Alteryx.
Technique Number 3 – A Double Batch Macro That Summarizes Large Data File Content and then Processes The Categories It Finds
In this example, a double batch macro approach is used to identify the components of a big data file and then the content gets broken down into its component pieces and handled in a special computational framework. The technique of breaking down a big data file into component pieces is one way to efficiently process big data files in Alteryx.
Technique Number 4 – Using the Listrunner Macro to Sequentially Run a Series of Long-Lasting Workflows
Sometimes workflow can take a long time to complete when millions of records are involved. When you have to process months of data, with each workflow taking potentially hours to complete, it is a great idea to set-up a process to sequentially run the workflows.
In this example, I show how the list runner macro (the Crew macro pack) is used to sequentially process some computationally demanding workflows. Adam Riley’s work is outstanding and has saved me countless hundreds of hours. Thank you, Adam, for sharing your brilliant work.
Technique Number 5 – Storing and Reusing Data Field Type Settings in Alteryx
Even though I said earlier that you should only determine data types once, there are situations where the data coming to you is in CSV format and the content might be changing over time. In this example, I show how you can have Alteryx determine the optimal field sizes and types for the data you are consuming so that you can reuse the file again in the future without having to use the auto field. Of course, my preferred method is to write the data to a *.yxdb file as described above, but sometimes the source keeps coming at you in csv format.
In the following video, I show the techniques needed to determine, save and reuse field types and sizes. For the best results, you should always read the entire file to be sure all fields are appropriately set, but for the sake of brevity in the video, I show reading only a subset of the data.
Technique Number 6 – Using a Batch File to Perform Aggregations
One key element to comprehending a lot of data, is to wisely perform aggregations. When you have data that is recorded as a date/time variable, at the sub-second level, being able to aggregate at the daily, hourly and sometimes minute level is important. By aggregating the at different levels of both time and categorical detail, you are able to decipher behaviors in the data that cannot easily be seen in the raw data when billions of records are present.
In this example, I show how I use a batch file to process monthly files (200-350M records each) to perform hourly aggregations at two levels of categorical detail. One level has specific breakdowns by multiple categories and the other is just a counting by hour of the day. The output of these files collapse 2B records down to 1.5M and 7.3K records, respectively. By using the pre-aggregated data, the visualizations produced in Tableau are instantaneous and allow for insights to be rapidly developed.
Technique Number 7 – Incremental Data Capture
Successfully capturing big data for local usage requires some planning and the use of repeatable techniques for avoiding problems with the captured data. In the following video, I show several techniques I use to incrementally download data to assemble a series of monthly files for local usage. Each monthly file will eventually hold hundreds of millions of records, so using this approach makes the capture and usage of the data simple.
You might wonder why I would do this in the first place. The reason is that this particular data is automatically generated on machines and has not been passed through any QA filters. To clean and process these large segments of data, having local monthly files available is most efficient because a large amount of processing is going to be completed. Once the data is cleaned, parsed, processed and ready for analysis, the final files can be uploaded to a networked database for usage by the analysts.
Finally, I use the listrunner technique (Technique #4 above) to run the data downloads during the nighttime hours, when the available bandwidth is maximized. The timing of these downloads is key to avoid having interruptions in capturing the data.
I have a whole series of additional techniques I have been refining so that Alteryx can process a lot of data and Tableau can visualize it. I will continue to add to this article when time allows so that these techniques become a part of this collection.