It does not happen often, but whenever I get a flat file (i.e., fixed-width ASCII file) to work with, I have to work some magic on the file to get it ready for Tableau. This post describes how some of that magic happens.
A fixed-width, or flat file, simply has a stream of characters from column 1 out to the end of each data row, with no commas or other delimiters to separate the data fields.
For big applications with many data fields per record, there typically is a guidance document accompanying the file that tells you how each data field is structured (field name, data type, field width, definitions, etc.). For small applications, you might be able to use a tool like Excel to set up the field delimiters manually. However, before you can begin using the data in a big application like this, you have to have a way to automatically delimit and restructure the data. A manual approach to parsing the data will likely be inaccurate and too time consuming.
I recently received a data source that contains over 500 fields per record, with each record being exactly 4,800 characters long, so each line ends precisely at character 4,800. Each data field is of variable width (from 1 up to 100 characters long) and of variable type.
The data types ranged from pure text, to dates, to floats, to integers, to mixed alpha-numerics. Mixed alpha-numerics are problematic in several ways, and I discuss how to handle these fields in another blog post (click here).
Some fields have leading zeros that are important and others don’t have leading zeros. The dates do not have the “/” delimiter because they are stored as MMDDYYYY, so the “/” delimiter has to be added to each of the fields that contain date data.
Converting a fixed-width file of this type into a workable data file requires an attention to detail that normally isn’t required for most jobs, but it offers a number of interesting challenges. Due to the proprietary nature of this data, I will not be showing any unaltered source data but will only be explaining the techniques that allow the file to be converted successfully.
To get an idea of some of the concepts just discussed, Figure 1 shows the structure of the first ten data fields (from the list of over 500) of the file I was using (field descriptions intentionally empty).
First Method – Write VB.NET Code to Do the Job
The particular file I was using had about 421,000 lines (records), which means it contains over 210 million pieces of data. Luckily, I didn’t need to use all 500 data fields (initially I wanted to keep about 265 of them), so I began the conversion process by writing a VB.NET program. Within a couple of hours, I had working code that parsed the file and wrote the 265 fields I wanted to keep directly to a comma-delimited csv file. This code allowed me to specify which records to keep and which records to ignore when writing to the output file. A printout of the VB.NET code I developed to do this can be downloaded by clicking this link: conversion_code.
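In essence, the program sliced each 4,800-character record at known offsets and wrote only the wanted fields out as csv. Here is a minimal sketch of that approach in Python; the field names, offsets, and widths below are invented for illustration only (the real layout document defines all 503 fields):

```python
import csv

# Hypothetical field layout: (name, start, width). The real guidance
# document would supply one entry per field, 503 in all.
FIELDS = [
    ("record_type", 0, 2),
    ("patient_name", 2, 30),
    ("pcn", 32, 10),
    ("mrn", 42, 10),
    ("admit_date", 52, 8),   # stored as MMDDYYYY, no delimiters
]

KEEP = {"pcn", "mrn", "admit_date"}  # subset of fields to write out


def parse_record(line):
    """Slice one fixed-width record into a dict of stripped field values."""
    return {name: line[start:start + width].strip()
            for name, start, width in FIELDS}


def convert(infile, outfile):
    """Read the flat file and write only the KEEP fields as csv."""
    names = [n for n, _, _ in FIELDS if n in KEEP]
    with open(infile) as fin, open(outfile, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(names)
        for line in fin:
            rec = parse_record(line)
            writer.writerow(rec[n] for n in names)
```

The slicing itself is cheap; as discussed below, the real bottleneck in my VB.NET version was elsewhere, but the overall structure was the same.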
The problem with this approach was that the conversion process was very slow, with only 100 records processed per minute. I suspect that the sluggishness of this code had to do with the record length being 4,800 characters wide with 503 fields, so for smaller files this approach might work. However, at a rate of 100 records per minute, it would take nearly 3 days to process the file I had to convert! I couldn’t wait that long, so I turned to my trusted companion, the programmer’s editor Vedit (www.vedit.com). By the way, there are commercially available programs that might be able to do some of this work, but I wasn’t interested in buying another program for this one application.
Second Method – Using Vedit to do the Job
Vedit is an amazing tool that still is light years ahead of its time. I have been using this editor since 1987 or so and not a day goes by without me using Vedit many times.
One of the macros supplied with Vedit is called “Flat to CSV.vdm”, an unbelievably wonderful 2,000-line macro written by some talented programmers a few years ago. This macro allowed me to convert the flat input file into a csv file at a rate of over 5,000 records per minute for the 265 fields, or 50 times faster than the VB.NET code.
The directions for its usage are simple and well documented. In a separate file called “fields.txt”, you specify the field widths and whether you want each field written to the output file. You can even specify whether to strip the leading and trailing spaces that are common in fixed-width files, which is a big space saver in the output files. For most operations, running the macro once is sufficient, but in my case I wanted to do two things, which forced me to run the macro twice in a strategic order of operations.
First, I wanted to introduce the date delimiter “/” (you can specify any delimiter in the macro instructions) into the date fields so that Tableau would recognize the dates in the form mm/dd/yyyy. In step 1 of the conversion process, I ran the flat-to-csv converter with the settings needed to introduce the date delimiter into the file. This operation kept the file a flat, fixed-length file, but with the date delimiters added where needed. This is important to understand: because the output was still a flat, fixed-width file, now with the date fields properly formatted, I could run the macro a second time.
By this time, I had reduced the number of fields I was interested in keeping from 265 down to 75. These instructions looked like this, specifying the exact columns where a date “/” delimiter string needed to be added (there were six date fields among the 75 I was keeping):
Delim=/ // Step 1: Introduce the “/” in all the date fields.
Note that the initial 1 was needed to get the process started and the ending 4869 was needed to complete each record. The first “/” was added at column 112, second at 114, which is the first date field I was keeping. Notice that the numbers come in pairs, separated by two columns, which converts an example date like this: 01232011 to this: 01/23/2011.
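The same insertion step can be sketched outside Vedit. A minimal Python version, working from the highest column down so earlier insertions never shift the offsets still to be processed (the start columns passed in are the date-field positions from the layout document, given in 0-based, pre-insertion numbering):

```python
def add_date_delims(record, date_starts):
    """Turn each MMDDYYYY date field in a record into MM/DD/YYYY.

    date_starts lists the 0-based start column of each date field in the
    original (pre-insertion) record; processing right to left keeps the
    remaining offsets valid as the record grows.
    """
    for start in sorted(date_starts, reverse=True):
        mm = record[start:start + 2]
        dd = record[start + 2:start + 4]
        yyyy = record[start + 4:start + 8]
        record = record[:start] + mm + "/" + dd + "/" + yyyy + record[start + 8:]
    return record
```

For example, `add_date_delims("AB01232011CD", [2])` yields `"AB01/23/2011CD"`. Each date field makes the record two characters longer, which is exactly why the field start positions have to be recomputed before the second pass.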
In step 2, now that I had the dates delimited, I wanted to write the fields to the comma-delimited csv file. The Vedit macro was now driven with these settings (there are actually 503 commands, with only the first few shown here):
Trim=3 // Trim leading and trailing spaces
Note that the “x” before the column starting location tells the macro not to write that field to the output file (i.e., skip the field). Also note that I had to recalculate the new column starting positions that existed after the date delimiter strings (“/”) were added, so that required an Excel spreadsheet to keep track of the starting positions for all 503 fields. This is not a big deal but does force you to do one re-computation of the field starting positions. The astute reader will recognize that the eighth field now starts at position 120, rather than 118 as shown in Figure 1, because two date delimiter characters “/” were added for field number seven.
After executing the macro the second time, I had a comma delimited file that looked something like this:
After these two operations, it seemed that I would have a file that I could send to Excel to do some work before sending the data to Tableau. There was one problem left to solve, however.
If the file shown in Figure 2 is sent to Excel, the fields that have leading zeros, such as the third (PCN) and fourth (MRN) fields, will lose the leading zeros. Unless you are working with well-known data structures for which you can specify special formats, such as social security numbers, Excel does not handle leading zeros very well.
A direct import of this file will create PCN numbers of 924, 3000009, and 1030200216 for the first three records, which is not what these fields need to be. The fields need to be interpreted by Excel exactly as they are shown in Figure 2.
To overcome this problem, Vedit was used to convert fields that have leading zeros into a special format in the csv file so that Excel recognizes the leading zeros correctly. Each field that has leading zeros (in this case the third and fourth fields shown in Figure 2) have to be in the format as shown in Figure 3. The leading “=” triggers Excel to consider the content as a formula entry during the csv import process, thereby correctly including the leading zeros. If you just use quotes around the numbers without the “=”, Excel will drop the leading zeros. This little trick will save you a lot of grief in future interactions with Excel if you are using fields that have leading zeros!
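The wrapping itself is trivial to apply when generating a csv programmatically. A sketch with invented field values; note that the line is assembled with a plain string join, because a standard csv writer would quote or escape the embedded `"` characters and defeat the trick:

```python
def excel_text(value):
    """Wrap a field as ="value" so Excel keeps its leading zeros on csv
    import; quotes alone (e.g. "0924") still get coerced to the number 924."""
    return '="' + str(value) + '"'


def csv_line(fields):
    # Plain join on purpose: csv.writer would escape the embedded quotes.
    return ",".join(fields)


# Hypothetical record: the PCN and MRN fields carry significant leading zeros.
line = csv_line(["A", "TX", excel_text("0000000924"), excel_text("0012345678")])
```

Here `line` comes out as `A,TX,="0000000924",="0012345678"`, which Excel imports with the zeros intact.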
The previous paragraph has an important insight that can hamper you in more ways than one. For this reason, I’ve added a footnote at the end of this article, to explain another example of why you might need to use the trick I showed to properly import data into Excel.
Third Method – Using Alteryx to Process the Flat File
This section was added many months after the original publication date. For information on how to use Alteryx to process this type of flat file, click here to read another blog post. Alteryx makes the other methods discussed above completely obsolete when it comes to efficiently processing flat files.
Footnote – How to Properly Import Numerical Data Into Excel As a String
About 2.5 years (8/11/16) after originally writing this, I had a problem-solving session with a co-worker. The problem she was having is shown in Figure 4. The team she was working with wanted to concatenate two fields together to create one new field. There were a couple of problems to overcome with this data.
First, the data had to be read into Tableau from a *.csv file. At import time, I told her to manually change the P.O. Number and Line Number fields into strings, since Tableau defaults to numerical fields for these two. This type conversion ensures that all the data is properly imported without any loss-of-significance errors due to large P.O. Numbers.
Second, the concat field had to be specially prepared so that when the data got exported to a csv file (Figure 5), the proper format of ="all concat data" would be added.
This extra formatting ensures that the concat field will be properly interpreted by Excel as a string field. Without the ="" characters wrapping the concat field, Excel will treat that column of data as numbers, and numerical truncation will occur because of the size of the “numbers”.
Here is the formula used to wrap any fields you want to have treated as a string in Excel:
'="' + [PO Number] + [Line Number] + '"'
Figure 6 shows what the exported csv file looks like once it gets exported from Tableau (after the Export All button is hit). All that extra formatting is troublesome but is necessary for proper importation of this data into Excel.
Finally, Figure 7 shows what the data looks like in Excel once the csv file is imported. The Concat field is properly treated as a string in Excel, and no truncation or loss-of-significance errors occur.