

Blog Post created by Ken Wood on Feb 12, 2015

The power of Pentaho+Hortonworks+Hitachi


For the past year, I have been talking in conceptual terms about a method for controlling the underlying components of an analytics process flow. I call this method “Analytics Defined Data & Infrastructure Intelligently Orchestrated" - ADDIIO (pronounced au-di-o; if you really get into it, roll the “d"). The concept came to me through my behind-the-scenes involvement and work with Pentaho over the past 16 months. Yes, I am leveraging the buzz around Something-Defined-Something. With our new relationship with Hortonworks and our recently announced intent to acquire Pentaho, I thought now was an excellent time to introduce this concept publicly.


The concept of ADDIIO is to instantiate the software services required to complete an analytics process in line with the analytics process workflow itself.


This could include high-level steps like,

  • Accessing, collecting, getting and/or placing data (including tier promotions, such as bringing data in from the cloud or from HDD to SSD),
  • Getting data into the desired form for processing (a pre-processing step),
  • Testing the data set for processing requirements,
  • Adjusting the infrastructure to meet the processing demands,
  • Visualizing and/or taking action on the results.


This recipe suggests that, in the design flow of your analytics process, you can build in enough control to meet the demands of unknown or varying requests.


So how does this concept work? As an example, consider a hyper-converged, scale-out cloud platform with a minimum (4-node) Hortonworks Hadoop cluster (Hortonworks and Hitachi Data Systems Partner to Deliver Apache Hadoop to the Enterprise) occupying a small portion of the resources. An analytics process workflow is developed to correlate web logs and determine the site’s traffic flow once a user visits the site. This process will be used to vary the advertising rates for ads based on where users frequent the most. The web farm dynamically adjusts itself depending on demand, so the size of the generated logs is unknown.


Consider this flow,

  1. Pentaho Data Integrator (PDI) collects the log files from the web farm to determine a ranking of the pages with the most frequent visits,
  2. In the PDI workflow, a test determines that the size of the data set being processed this time is 80 TB,
  3. A rule is defined that a Map/Reduce job requires 1 node per 10 TB of data,
  4. The workflow branches off the main flow to add 4 additional Hortonworks nodes before the Map/Reduce job starts, bringing the total node count to 8 nodes to properly process the data and meet an SLA,
  5. The main workflow proceeds with the Map/Reduce job as part of the main analytics flow,
  6. Once the Map/Reduce job is complete, the workflow removes the 4 recently added Hadoop nodes, returning the cluster to the minimum 4 nodes,
  7. The results are visualized using Pentaho’s Business Analytics dashboard and used to adjust the ad rates for the identified pages.
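The elasticity in steps 2 through 6 can be sketched in a few lines. This is a minimal sketch, not PDI code: the `nodes_required` function encodes the 1-node-per-10-TB rule from step 3, and the `ElasticCluster` class is a hypothetical stand-in for whatever cloud-orchestration API (OpenStack, VMware, or other) the platform actually exposes.

```python
import math

def nodes_required(dataset_tb, tb_per_node=10, min_nodes=4):
    """Sizing rule from step 3: one Hadoop node per 10 TB of data,
    never dropping below the cluster's minimum footprint."""
    return max(min_nodes, math.ceil(dataset_tb / tb_per_node))

class ElasticCluster:
    """Toy stand-in for a cloud-orchestration API. A real implementation
    would call out to the underlying platform to add/remove nodes."""
    def __init__(self, min_nodes=4):
        self.min_nodes = min_nodes
        self.nodes = min_nodes

    def scale_to(self, target):
        delta = target - self.nodes   # positive = nodes added, negative = removed
        self.nodes = target
        return delta

# Steps 2-6 of the flow: test the data set, scale out, run the job, scale back.
cluster = ElasticCluster(min_nodes=4)
dataset_tb = 80                                        # step 2: measured in the PDI workflow
added = cluster.scale_to(nodes_required(dataset_tb))   # step 4: adds 4 nodes, total 8
# ... run the Map/Reduce job on the enlarged cluster (step 5) ...
cluster.scale_to(cluster.min_nodes)                    # step 6: back to the 4-node minimum
```

In a real deployment, the branch in step 4 would invoke the cloud platform's provisioning API from within the PDI workflow rather than a local class like this.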


This is a typical workflow except for steps 2, 3, 4 & 6. Testing the data set to determine its size and the processing horsepower required is one thing, but being able to increase and decrease the size of a Hadoop cluster from within the workflow is a different approach: again, Analytics Defined Data & Infrastructure Intelligently Orchestrated. This is possible if the underlying platform is based on OpenStack, VMware, or another cloud computing platform. This approach can also instantiate a variety of special-purpose virtual machines that have been pre-configured for specific tasks like data collectors & movers, renderers, data preprocessors, database servers, web servers, and others.


The virtual machine becomes the new deployable software component that can now be orchestrated from within the analytics workflow. The combination of Hortonworks, Pentaho and Hitachi is positioned perfectly to make this concept materialize. Say it with me, “au-di-o”!

Also see Michael Hay's blog at Another peanut spilled...? where he talks about this week's announcements with Pentaho and Hortonworks.


I keep a personal list of different types of use cases, examples and ideas for applying this concept in projects and other activities. If this piques your interest, please share some of your thoughts and where you would use this approach to solve your own challenges. There may even be some opportunities to co-innovate with Hitachi as new programs come up to speed. More on this at a later time.