This is part 2 of a multi-part series that is exploring how to perform high-volume trend modeling in three software packages: Tableau, Alteryx, and Power BI. To read about the motivation for this work and to get the explanations of how this work was done in Tableau, click here for Part 1 of this series.
With me being a student and practitioner of Alteryx for a few years now, I was surprised when I realized that I had not completed any trend modeling in the ways I am going to show in this article. I have done trend modeling over many years in Excel and with custom programs, so this recognition surprised me.
Of course, I knew that I could use the linear regression tool in Alteryx to perform the types of analysis I wanted to do. However, when I went to do it, I was also surprised by a few things, and I will be documenting those findings in this article.
Trend Modeling In Alteryx
I received the first surprise of this work when I tried to use the linear regression tool to perform the linear modeling of daily maximum temperatures over time for all the Alaska monitoring stations. The surprise was that Alteryx did not allow me to use the date field as a predictor variable. I could choose the max temp as a target variable, but I could not use the date of the temperature measurement as a predictor variable.
Well, this is a common use case that I thought should easily be handled by Alteryx. A quick call to Ned Harding was all that I needed, and he told me to convert the dates to numbers and try that. I thought to myself that he had a good idea, so I began thinking about how to make the conversions.
Initially, I thought about how Excel handles dates as numbers and how easy it is to flip between them. I thought for a few moments about how many conversions I was going to have to do to accomplish linear modeling of data that has over a hundred million dates. Just as quickly, an alternate solution popped into my head and I quickly forgot about converting the dates at all. Can you think of a way to handle this situation?
Well, since all I really needed was a day counter from a beginning point in time, I used the datetimediff function to my advantage as shown in Figure 1. This produced a new variable that represents the number of days elapsed since a starting date that is specified as a user constant as shown in Figure 2. This new variable that I called “Number of Days” could be used as the predictor variable in the linear regression tool. If my regression model were built using data that had been aggregated in terms of months or years, this approach would still work fine. Additionally, I wouldn’t have the problems I had in Tableau when I did R-based linear modeling using monthly aggregated data.
After configuring a workflow to produce linear regression results for one monitoring station, I was hit with a couple of additional surprises. The linear regression tool spawns off R code, which instantiates an instance of the lm package properly configured with its arguments. The tool is set-up to run one model at a time. When I ran one model for one monitoring station, the workflow took on on the order of 10 seconds to complete the linear model.
Finding a More Efficient Solution Technique
I made a call to my buddy Joe Mako, and I showed him the situation I was faced with, including how I solved the problem in Tableau with the simultaneous computation of potentially thousands of models. We bounced around a few ideas of how to efficiently do this same thing in Alteryx.
We realized that a lot of coding was probably going to be needed for me to configure R to run a whole slew of models simultaneously, or I could write an Alteryx macro to drive through the thousands of models I needed to create. In either case, too much time was going to be needed to complete the job due to the time needed to complete the R lm package. Additional research indicated that there is a new R package called dynlm, that is configured to perform dynamic linear models and time series regression. We realized there must be an easier way to get the job done. Additionally, extracting and using the linear model terms returned from R isn’t so straight forward, and I didn’t want to spend extra time having to do this.
As typical, Joe thought outside the box and said, “I think I’ve done this before, or something like it”. Since I only needed to calculate the slope that represented the temperature trend at each monitoring station, we consulted Wiki and found the formula for directly calculating the trend in a linear regression model. The beta hat formula shown in Figure 3 is what Joe recalled using in the past. This formula simply finds the best fit line to a series of x,y data points and is known as the least square method.
As I told Joe thanks for the help, he said, “No, I’d like to stay in the screen sharing session and see this through to the end.” In a few minutes, we built the workflow shown in Figure 4 to directly calculate the slopes for the Alaska monitoring stations. The workflow completed in under 3 seconds for the month of April, with data spanning between 1960 and 2017. With that result, I knew we had solved the problem in an efficient manner.
You can download this Alteryx packaged workflow (including the Alaska data set) to try this for yourself. I have had to rename the *.yxzp file to a *.xlsx file get WordPress to upload it into the content management system. You will have to rename the file yourself before importing it into Alteryx. All the daily data for other states are available for you to download in this article.
The results of this work were the same as those achieved in Tableau. The computed slopes were identical. The Alteryx total temperature changes are more accurate than those completed in Tableau because I used the actual number of days between the first and last measurements that defined the slope, for every monitoring station. Figure 5 shows the top 5 monitoring stations that exhibited the highest amount of heating in April and these agree with the results from Tableau with a 0.1 degree F difference in the total Temp change field due to using a different number of total days for each station.
One of the primary benefits of doing this work in Alteryx is that the workflow can be configured any way desired to handle the partial data sets and then the output can be constructed to make great interactive dashboards in Tableau, as shown in Figure 6. Once again, it is this incredible combination of Alteryx and Tableau that so easily brings the story to life.
The flexibility of Alteryx makes it an indispensable tool for doing work like this. Next up in Part 3: Power BI. It should be fun to see what transpires. I have a feeling I’ll be learning some DAX programming approaches. After that, the grand daddy showdown is going to occur as part 7 of Tableau vs Power BI series will be completed.