Text Analytics Improved With Tableau Clustering

Introduction

Tableau 10 has a lot of great features. One of the best features for me will be clustering.

In this article, which is part 2 of my blogging experimental analysis, I combine a little Alteryx magic with Tableau 10 clustering to answer a very specific question related to the importance of key words used in article titles. Before I tell you what the question is, let me give you a little background on how the question entered my experiment.

Background

It was probably in either 2012 or 2013 that Ben Jones of Tableau wrote up some of the lessons he had learned from his excellent blog (Data Remixed). Ben is a guy that I really appreciate and have learned a lot from.

Ben is a couple of years ahead of me in blogging experience, so I tend to trust what he says because his writing is thoughtful, intelligent, and concise. Ben is a great student of data visualization and data communication, and he is a very good writer.

What I remembered from his work back then was that he said something like this:

Title your articles carefully and use the term “How To” to increase your readership.

Based on this advice, I decided to see if what Ben said was true. I wanted to know if using the terms “how”, “how to”, and “how to use” in article titles actually leads to better readership.

It has taken me about 3 years and over 200 articles to conclusively answer that question. Of these articles, 111 were about Tableau, 22 were about Tableau and Alteryx, 17 were about Alteryx, and about 50 were about other topics.

Processing The Data (Alteryx Power, Speed and Magic)

I collected over 3 years of daily blog readership data (1,153 files) from my WordPress analytics page. If you want to know how I did this (which isn’t a whole lot of fun), you can watch the last video in this article (which is Part 1 of the 3danim8 blogging experimental analysis).

Figure 1 gives you a glimpse of the primary data file that contains the time series history for every article I have ever written. This file contains over 46,700 records. This file will be used in the near future to perform a lot of additional analysis on the blogging experiment itself.


Figure 1 – Data file containing the time series history of every article I have ever written on 3danim8’s blog.


 

This file was assembled by Part 1 of an Alteryx workflow that is shown in Figure 2. This workflow reads the 1,153 files, performs some data clean-up, produces the data columns, and produces this file in under 10 seconds.


Figure 2 – Part 1 of an Alteryx workflow that was developed to process the daily blog readership files.


 

In the second part of the workflow, I use regular expressions to parse the article titles, which enables the text analytics that are the topic of this article. I create every 1,2,3-word combination of consecutive words that appears in the article titles.

I parse the article titles into three groups: single words, two-word combinations, and three-word combinations. The second half of the workflow is shown in Figure 3. The total time required for Alteryx to do all this work (Part 1 and 2) is about 17 seconds. The speed of Alteryx still blows my mind every time I use it to do something like this.


Figure 3 – Part 2 of the Alteryx workflow that creates all of the 1,2,3-word combinations.


 

Once the workflow was complete, I had a second Excel file that I used to perform the text analytics in Tableau 10. A portion of this file is shown in Figure 4.


Figure 4 – The resulting summary data file for all 1,2,3-word combinations. This was created by the Part 2 workflow.


 

The workflow shown in Figure 3 eliminates a large number of common words for the 1-word case so that no text analytics are performed on them. I also eliminate any 1,2,3-word combinations that occur fewer than 3 times in the collection of article titles. To find every possible 2,3-word combination, I duplicate each title and successively remove the first word (Figure 5).


Figure 5 – This shows the successive removal of the first word of a title. This ensures that all possible 2,3-word combinations are discovered throughout all titles.


 

The successive removal of the first word is accomplished by a multi-row formula coupled with a regular expression replacement as shown in Figure 6.


Figure 6 – A multi-row formulation coupled with a regular expression replacement does all the hard work.
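For readers who do not use Alteryx, the successive removal of the leading word can be sketched in a few lines of Python. This is only an illustration of the idea, not the Alteryx workflow itself; the function name `successive_suffixes` and the pattern `^\S+\s*` are assumptions chosen for this example.

```python
import re


def successive_suffixes(title: str) -> list[str]:
    """Return the title, then each version with one more leading word removed."""
    suffixes = []
    current = title.strip()
    while current:
        suffixes.append(current)
        # Drop the first run of non-space characters and the whitespace after it.
        current = re.sub(r"^\S+\s*", "", current)
    return suffixes


for row in successive_suffixes("How To Use Tableau Clustering"):
    print(row)
```

Each printed row plays the role of one NewTitle record: the full title first, then the same title with one, two, three, and four leading words removed.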


 

For the 2-word combinations, regular expression identification and automatic pivoting (split to rows) were used to generate the word combinations as shown in Figure 7. The regular expression used to match the 2-word combinations was [[:alpha:]]+ [[:alpha:]]+.


Figure 7 – Regular expressions are used to identify the two-word combinations present in each NewTitle field. The successive replication of each title and subsequent removal of each leading word will lead to duplication every second operation in long titles. To combat this, a unique tool is used to remove the duplicates in the next step of the workflow.


 

For the 3-word combinations, regular expression identification and automatic pivoting (split to rows) were also used to generate the word combinations present in the NewTitle field as shown in Figure 8. The regular expression used in this case was [[:alpha:]]+ [[:alpha:]]+ [[:alpha:]]+.


Figure 8 – Regular expressions are used to identify the 3-word combinations present in each NewTitle field. The successive replication of each title and subsequent removal of each leading word will lead to duplication every third operation in long titles. To combat this, a unique tool is used to remove the duplicates in the next step of the workflow.
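The tokenizing and de-duplication steps can be sketched in Python as well. This is a rough equivalent under a few assumptions: Python's `re` module does not support the POSIX class `[[:alpha:]]`, so `[A-Za-z]` stands in for it, and the function name `word_combinations` is made up for this example. It also shows why the successive removal of the first word is needed: a regular expression search returns non-overlapping matches, so a single pass over a title misses the combinations that start on every other word.

```python
import re


def word_combinations(title: str, n: int) -> list[str]:
    """Collect every run of n consecutive alphabetic words in a title."""
    # [A-Za-z] is a stand-in for [[:alpha:]], which Python's re does not support.
    pattern = re.compile(" ".join(["[A-Za-z]+"] * n))
    found = []
    current = title.strip()
    while current:
        # findall returns non-overlapping matches only, so each shortened
        # copy of the title contributes combinations the previous copy missed.
        found.extend(pattern.findall(current))
        current = re.sub(r"^\S+\s*", "", current)
    # Keep the first occurrence of each combination, similar in spirit
    # to the duplicate-removal step in the workflow.
    return list(dict.fromkeys(found))


title = "how to use tableau clustering"
print(word_combinations(title, 2))
print(word_combinations(title, 3))
```

For this title, the first pass finds "how to" and "use tableau", the second pass adds "to use" and "tableau clustering", and the third pass only repeats combinations that were already found, which is the duplication the captions describe.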


Tableau Clustering Results

I wanted to answer the question of whether or not using the phrases “how”, “how to”, and “how to use” would lead to improved readership over the long term for the articles I wrote. This meant that I had to process the article titles using the methods described above for the 1,2,3-word phrases they contain.

The performance metric I used for this example is easy to understand: the percentage of days since publication on which an article was read at least once. If I wrote a new article and it was read by at least 1 person on the day that I wrote it, the percentage of days that the article was read would be 100% (1/1). If nobody read the article on the second day, the percentage would drop to 50% (1/2). On the third day, the percentage would be either 33.3% (1/3 = not read) or 66.7% (2/3 = was read). Of course, in a perfect world I’d like all my articles to start at 100% and never drop below that, but that is not really possible over the long term.
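The arithmetic behind this metric can be written out as a short Python sketch. The function name `percent_days_read` and the idea of passing a list of daily view counts are assumptions made for this illustration; the real data lives in the Excel file produced by the workflow.

```python
def percent_days_read(daily_views: list[int]) -> list[float]:
    """Running percentage of days with at least one view, from publication day on."""
    running = []
    days_read = 0
    for day_number, views in enumerate(daily_views, start=1):
        if views > 0:
            days_read += 1
        running.append(100.0 * days_read / day_number)
    return running


# Read on day 1, not read on day 2, read again on day 3.
print([round(p, 1) for p in percent_days_read([1, 0, 1])])  # [100.0, 50.0, 66.7]
```

The last value in the returned list is the number that would be reported for the article on its most recent day.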

Using this metric, the best performing article groupings have the highest averages. Remember that I have grouped the articles by the 1,2,3-word phrases that occur in the titles of the 200 or so articles. If an article grouping containing a word or 2,3-word phrase achieved a percentage of 70%, this means that those articles have been read on an average of 7 out of every 10 days that have elapsed since they were published. These article groupings all have average ages between 500 and 1200 days.
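To make the grouping concrete, here is a minimal Python sketch of how a phrase-level average could be computed. It assumes a simple mean of the per-article percentages and reuses the 3-occurrence minimum mentioned earlier; the titles and percentages are made up for illustration and are not taken from the blog data, and the function name `phrase_averages` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def phrase_averages(articles, phrases, min_count=3):
    """Map each phrase to (article count, mean percent-days-read).

    Phrases that appear in fewer than min_count titles are dropped.
    """
    groups = defaultdict(list)
    for title, percent_read in articles:
        # Pad with spaces so that only whole words match.
        padded = f" {title.lower()} "
        for phrase in phrases:
            if f" {phrase} " in padded:
                groups[phrase].append(percent_read)
    return {
        phrase: (len(values), mean(values))
        for phrase, values in groups.items()
        if len(values) >= min_count
    }


# Hypothetical titles and percentages, for illustration only.
articles = [
    ("How To Build A Histogram", 70.0),
    ("How To Read CSV Files", 60.0),
    ("How To Use Clustering", 65.0),
    ("Big Data Notes", 40.0),
]
print(phrase_averages(articles, ["how to", "big data"]))
```

In this toy example, "how to" appears in three titles and is kept, while "big data" appears only once and is filtered out.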

Results for the word “How”

As shown in Figure 9, the word “how” exists in the highest performing cluster of 1-word phrases used in my articles. As shown previously in Figure 4, the word “how” is shown on line 21, with 33 uses in the articles and a 57% readership over time. Check one, Ben Jones might just be right.


Figure 9 – 1-Word phrase results. I wanted to see if the word “how” was in the highest performing cluster.


Results for the 2-word phrase “How To”

As shown in Figure 10, the word phrase “how to” exists in the highest performing cluster of 2-word phrases used in my articles.  As shown previously in Figure 4, the word phrase “how to” is shown on line 10, with 21 uses in the articles and a 64% readership over time. Check two, Ben Jones is looking like he might just be right. I also see that writing about csv files and big data has its advantages in the context of data science.


Figure 10 – 2-Word phrase results. I wanted to see if the phrase “how to” was in the highest performing cluster.


Results for the 3-word phrase “How To Use”

As shown in Figure 11, the word phrase “how to use” exists in the highest performing cluster of 3-word phrases used in my articles.  As shown previously in Figure 4, the word phrase “how to use” is shown on line 18, with 4 uses in the articles and a 58.9% readership over time. Check three, Ben Jones is right.

 


Figure 11 – 3-Word phrase results. I wanted to see if the phrase “how to use” was in the highest performing cluster.


 

Finally, not only is Ben right with respect to readership continuity, he is also right in terms of average article views. The 2-word phrase “how to” came in fourth place. The ordering for the 2-word terms was “tableau buckets”, “I wish”, “csv files”, “how to”, and “big data”. All of these were in the highest performing cluster. For the 3-word phrases, “how to use” was in first place, firmly established in its own cluster.

The ANOVA p-values for all the results shown above indicate the results are statistically significant, which means the clusters have different means when calculated with the single measure used.  For example, the 2-word cluster results are shown in Figure 12.


Figure 12 – The 2-word cluster (shown in Figure 10)  ANOVA model results.


 

Going a Little Deeper (Statistically Speaking)

I wondered how the results would change if I added other factors to the analysis. For the fun of it, I decided to look into the performance of the 2-word groupings by adding another measure to the clustering. The second measure added was the cumulative views of the articles. I was interested in finding out if Ben’s theory could stand up to a little more rigor.

As shown in Figure 13, the 2-word, 2-factor clusters indicate that “how to” now sits alone in the highest performing cluster of the 2-word article groupings.


Figure 13 – “How to” is the sole, highest performing term based on this 2-factor cluster analysis.
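Tableau's clustering feature is based on k-means, so a small Python sketch can show what clustering on two measures involves. Everything specific here is an assumption for illustration: the numbers are made up, the measures are min-max scaled so that the large view counts do not swamp the percentages, and the starting centers are picked by hand rather than by Tableau's own initialization.

```python
import math


def min_max_scale(values):
    """Map values onto the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest center, then update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical values for five 2-word phrases (not the blog's real data).
percent_read = [64.0, 60.0, 35.0, 30.0, 33.0]
total_views = [9000, 2000, 1500, 1000, 1200]
points = list(zip(min_max_scale(percent_read), min_max_scale(total_views)))

labels, centers = k_means(points, [points[0], points[1], points[3]])
print(labels)
```

With these made-up numbers, the first phrase ends up alone in its cluster because it is high on both measures, which is the kind of separation Figure 13 shows for “how to”.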


 

The ANOVA model results also indicated that this result was statistically significant, as shown in Figure 14. In these 2-factor results, the null hypothesis for each measure (average readership and total article views) states that the means of the 4 groupings are equal. Because the p-values are 0.018 and 0.019, both of which are less than the significance level of 0.05, we can reject the null hypothesis and conclude that the groups have different means.


Figure 14 – The 2-factor ANOVA results for the 2-word groupings.
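For anyone curious about what sits behind an ANOVA table like this one, the F-statistic for a single measure can be computed by hand. The sketch below is a generic one-way ANOVA in plain Python, not a reproduction of Tableau's output; the function name `one_way_anova_f` and the sample values are assumptions for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on lists of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical percent-days-read values for three clusters.
clusters = [[60, 64, 62], [45, 50, 47], [30, 33, 36]]
f_stat, df_between, df_within = one_way_anova_f(clusters)
print(round(f_stat, 2), df_between, df_within)
```

A large F means the cluster means are far apart relative to the spread inside each cluster. The p-value then comes from the F distribution with those two degrees of freedom; if SciPy is installed, `scipy.stats.f_oneway` returns both the statistic and the p-value directly.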


 

Learn How To Perform Clustering

To learn how I used Tableau 10 clustering to create these figures, please watch the video shown below. For superb coverage of several new Tableau 10 features, including how to use multiple variables in your clusters, watch this video by the Vizpainter, Joshua Milligan.



 

Final Thoughts

After completing this work, I stopped to wonder how Ben could have been correct a few years ago when he wrote those words. The answer, I believe, is at least two-fold.

First, Ben is an astute guy. He reads a lot, and he experimented with his blog and carefully reviewed his blog stats. I suspect that in doing so he noticed a correlation between the usage of the “how to” term and good readership response rates.

Second, he also knew that a lot of his blog traffic arrived via internet searches. When people are looking to learn about something, it is natural for them to write “how to …”. We might want to know “how to ride a bike”, or “how to create a histogram”, or “how to create a dashboard”.

We use searches to learn how to do all types of things, so by using the term “how to” in a title, you have just improved your chances of matching your article title to the search engine database.

So the answer to the question is now known. “How to” and “How to use” are powerful phrases in blog article titles. It is that simple.

So thanks, Ben, for your words of wisdom so long ago. I now plan to use this to my advantage when possible.

Other Articles Describing Results From The Blogging Experiment

Over 160 articles were written over 2.5 years as part of a scientific blogging experiment that was designed to quantify many aspects of writing a technical blog. This article is part 2 of the quantitative analysis of this experiment. There is a lot to learn from these articles, if you are so inclined.

  1. Click here to review the official ending of the experiment (The Conclusion and overview of why I did the experiment)
  2. Click here to review the Epilogue to the experiment.
  3. Click here to review part 1 of the blogging experimental analysis (Geographical expansion of the blog)
  4. This article is part 2 (Quantifying the importance of the terms “How, How To, How To Use” in blog titles)
  5. Click here to review part 3 of the blogging experimental analysis (Fast-burning articles vs slow-burners, which are more important to a technical blog?)
