Ken Wood

Nothing Too Fancy, But Surprisingly Useful

Blog Post created by Ken Wood Employee on Mar 20, 2015

Using "filename" to Analyze Customer Interview Reports


In my continuing blog series of “what I’ve been doing for the past year with the Pentaho software while the M&A activity was still a secret”, here is another example developed because I could, but as it turns out, this will be a great tool for HDS' internal use.


But first, little background. As a member of the Global Office of Technology and Planning, we have a yearly activity where a select group of customers across a broad range of industries are recorded during an interview. These interviews focus on subjects pertaining to long term strategic planning for InterviewWordCloud3.pngHitachi, HDS and our customers. Validating these strategic concepts with our customer’s long term planning activities is our way of confirming that we are on the right track or need to rethink our positions. Over the years, the GOTP has amassed a sizable repository of these customer interview reports, which contain thought leadership and strategies. Obviously, the content of the resulting customer reports are highly confidential and are used throughout the company to check on the validity of ideas, industry direction, opinions of trends and help guide us in future planning.


Access to the content of these documents are controlled since the customer and the futuristic subjects are strategic, but access to the “metadata” of these documents are available, can be just as important, and can yield useful information as well. On many occasions, account managers or country managers want to know what regions, countries, customers and subjects were covered for these reports. They want to know if any of their customers were interviewed, why hasn't their country been included in the interviews, and other “logistical” types of information without actually downloading and reading the documents.


Thanks to Michael Hay, the GOTP has adopted a “schema” for the report names (filenames). These reports are stored and maintained in our internal Microsoft Sharepoint folder. The filename schema has the following hyphen [-] separated format,


  • Date – YYYYMMDD format
  • Geographical Region – AMER, EMEA, APAC, BRIC
  • Country – US, UK, IT, FR, etc.
  • State
  • City
  • Industry Vertical
  • Subject of interview (the sensitive part of this blog)
  • Customer Name (the other sensitive part in this blog)
  • Type of report
  • File type – .docx (but could also include .doc, .pdf, .iba and others in the future)
  • Filesize – POSIX filesize for the “ls –l” command


The last field, filesize, actually comes as a results of mounting the Sharepoint folder and doing the ls -l command in my PDI embedded script. I can extract this piece of data as well and have plans to use it for an enhanced feature to this application at a later time.


These are long filenames and technically “filename” is a POSIX metadata field that we’ve taken to the next level with a very descriptive naming format, but they are easy to read and specifically determine which reports they are. Using Pentaho Data Integrator and Business Analytics tools, I was able to look in the repository and parse these filenames, do a little processing and analysis on them and create this live dashboard.




By browsing to this URL (the Penatho BA server dashboard), people interested in this information can get a quick dashboard view of our activities, like whether or not their customer’s have been visited, what countries or locations we have met with customers, what industries we have focused on, what the percentages of regions are visited, and so on. The live dashboard also includes the subject of discussion and a list of the actual customer’s name. The list of the customer names also doubles as a hyperlink to the actual report in our Sharepoint folder so the report can be pulled up completely.


This was done using Pentaho’s Data Integration and Business Analytics visualization tools. By now you should get a sense that I like using the “Geomap” chart type in Pentaho’s Business Analyzer. Below you can see the process flow I use from getting the POSIX metadata from a mounted Sharepoint folder to the final dashboard visual.




Here I’ll show the job and transformations that I created to parse and process the raw file into useable data. First, the job that runs everything including the super simple script embedded in the step <GetSharePointMetadata> to create the raw data, then runs the two transformations in sequence. The job is what is scheduled to run either once or on a periodic schedule.




Now, for the first transformation entry, <ParseFileNames>, executed in this job.




This quick transformation takes the raw file created by the script and parses the fields from the first step <Text file input>. Since this is the output of the ls -l command, the raw data is both a fixed field as well as a hyphen separated set of fields once the FileSize and filename <GoodData> is extracted. Below is a preview of the parsed FileSize and the GoodData data fields I want. The stuff I didn’t want is all of the permission, owner, mode, etc. data that the ls –l command outputs.




The next step in this transformation is the <SplitField> step. This step takes the hyphen separated filename and splits it up into the individual data pieces that I need.




The last transformation step in this short process is I need to drop the filename extension which is either a .doc or .docx. Finally, I load all of the parsed data into a database.


In the next transformation <InsertSharePointURL>, I cheat a little. Since this is really a Sharepoint folder mounted to a file system, the full file name is based on the mountpoint of the file system I’m using. However, want I really need is the Sharepoint URL location of the document. The second transformation shown below, depicts the steps used taking the filename and turning it into a Sharepoint URL.




The first step is the second temporary file of a simple ls command, I don't need the other fields from the ls –l command output. Actually, I use a ls | nl to create line numbers to match the record ID in my database. I then concatenate the fixed strings of “https://sharepoint/folder/of/customer/interview/reports/” to the document filename <MakeURLString>. Then I convert all the space characters in the document filename section of the new string into “%20”s so that it matches the Sharepoint naming nomenclature with <SwapSpaces>. Finally, I do an update with the <Update> step to the database to add the URL of the document to the record.


The final phase of this simple application is to create the various charts and visuals using the Pentaho Business Analyzer tool that queries the updated database and combined to create the dashboard as you see above.


There are a number of enhancements we are developing on this concept in the background. Matthew O'Keefe, Michael Hay and I are going to create a full blown mobile app so HDS employees can pull up the current  dashboard and find the reports they are interested in and read them on their favorite mobile device. We are also looking at embedding deep analytics into the app to process the content of these reports to create “self describing” visuals like the word cloud at the beginning of this blog, or sentiment analysis breakdowns of the various sections within a report or among categories of reports, and other great ideas.


As we discuss and debate the list of possible enhancements going forward, it becomes obvious that we are creating the application we want and not worrying about the implementation details. This is because of the Pentaho tool set. Pentaho makes it easier to turn your ideas and data into information that can be quickly consumed to make inform decisions. This is one of the reasons we are acquiring Pentaho!


So stay tuned and watch for future write-ups as we share our progress.