Ken Wood

“It’s Both! Simultaneously!”

Blog post created by Ken Wood on Feb 10, 2016

HDS’ “Hyper Scale-out Platform”

A Platform for Applications Requiring a Hybrid of HCI and BDA

 

 

I’m not sure the recent announcement of HDS’ Hyper Scale-out Platform (HSP) is being portrayed as the analytics and cloud infrastructure hybrid platform it is. As I talk to people about HSP, both internally and externally, the consensus falls squarely into two camps – HSP is a Hyper-Converged Cloud Infrastructure OR HSP is a Big Data Analytics platform – to which I respond, “It’s both! Simultaneously!” (Actually, it’s more than this, but that’s fodder for another blog.) Maybe it’s double converged!

 

Granted, HSP can be deployed as a Hyper-Converged Infrastructure (HCI) cloud platform for on-premise cloud computing and application consolidation through compute-network-storage convergence. To me, (yawn!) this is a little boring, but there are some interesting features built into HSP that give HCI use cases an interesting and beneficial kick in enterprise deployments. HSP provides a distributed, shared file system that your virtual machines can be stored on and run from. This shared and distributed file system, the eScaleFS, is highly protected with triple redundancy and can be accessed by all nodes and KVM containers. What does this mean? It means your running virtual machines can be moved and load balanced across the cluster, either voluntarily or involuntarily, without moving or copying the stored virtual machine and its data. The data stays in place; only the container that the virtual machine(s) run in is switched.
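HSP’s own management plane handles this placement and balancing for you, but purely to illustrate the underlying principle, here is a minimal sketch of a live KVM migration over shared storage using libvirt’s Java bindings. The node names, the VM name and the use of libvirt directly are my own assumptions for illustration, not HSP’s management interface; the point is simply that with a shared file system there is no storage-copy step.

```java
import org.libvirt.Connect;
import org.libvirt.Domain;
import org.libvirt.LibvirtException;

// Hypothetical illustration: live-migrate a KVM guest between two nodes that
// both see the guest's disk image on the same shared file system, so only the
// running memory state moves -- the stored virtual machine is never copied.
public class SharedStorageMigration {
    public static void main(String[] args) throws LibvirtException {
        Connect source = new Connect("qemu+ssh://node-01/system"); // hostnames are made up
        Connect target = new Connect("qemu+ssh://node-02/system");

        Domain vm = source.domainLookupByName("analytics-vm");     // VM name is made up

        // Flag 1 == VIR_MIGRATE_LIVE: keep the guest running during the move.
        // No storage-copy flag is needed because the disk lives on shared storage.
        vm.migrate(target, 1, null, null, 0);

        source.close();
        target.close();
    }
}
```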

 

The eScaleFS built into HSP also has some High Performance Computing applications thanks to its shared file system and parallel data access. However, this is not a primary focus for HSP at this time (and could be the subject of another blog), unless you consider MapReduce and other Hadoop-based applications a modern form of High Performance Computing, like I do. More specifically, Hadoop is a modern form of High Throughput Computing.

 

For me, the exciting aspect of HSP is the integration with Hortonworks Data Platform (HDP) and Pentaho to turn HSP into a Big Data Analytics (BDA) platform. Using Pentaho Data Integration (PDI) and its flow orchestration, along with the Pentaho Visual MapReduce (PVMR) plugin to graphically create MapReduce jobs and include them in complex analytic data flows, is a very underrated and underappreciated capability; granted, it may not be that well understood. BUT! Here’s the kicker. While Hadoop plus PDI plus PVMR plus graphically created analytic data flows can run on any servers, cloud or containers hosting these technologies (heck, you can do the whole thing on a single server, physical or virtual), there is a special superpower you get when running this combination on HSP with eScaleFS.

 

Hadoop on HSP doesn’t run the Hadoop Distributed File System (HDFS). Instead, HSP comes with an HDFS plugin that interprets and translates HDFS semantics into eScaleFS semantics for all of the Hadoop ecosystem components. This makes eScaleFS HDFS-compatible AND POSIX-compliant, simultaneously. An HDFS plugin is not a new or unique capability in the Hadoop world, but it is unique in the HCI and BDA world, or stated another way, in the hyper-converged, hyper scale-out, Big Data world.

 

What this means for you and your applications is that you can run applications, possibly legacy applications, that rely on POSIX-based file systems and datasets, and have them all share a common file system, the same file system that Hadoop uses.
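To make that concrete, here is a minimal sketch of what “the same data, two interfaces” looks like in practice. It assumes eScaleFS is exposed to guests at a POSIX mount point (I’m using the made-up path /mnt/escalefs) and registered as the cluster’s default Hadoop file system; the file name is made up too.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: the same file read once through a POSIX mount point
// and once through the Hadoop FileSystem API. The mount path and file name
// are assumptions for illustration only.
public class DualAccessDemo {
    public static void main(String[] args) throws Exception {
        // 1) Legacy / POSIX view: plain Java file I/O against a local mount.
        List<String> posixLines =
                Files.readAllLines(Paths.get("/mnt/escalefs/tweets/archive/bundle-0001.json"));
        System.out.println("POSIX view, first line: " + posixLines.get(0));

        // 2) Hadoop view: the HDFS-compatible plugin serves the same bytes
        //    to anything that speaks the Hadoop FileSystem API.
        Configuration conf = new Configuration();   // picks up the cluster's fs.defaultFS, if configured
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/tweets/archive/bundle-0001.json"))) {
            System.out.println("Hadoop view, first byte: " + in.read());
        }
    }
}
```

Both reads hit the same bytes on eScaleFS; nothing is copied between a “POSIX world” and a “Hadoop world.”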

 

Now, let’s think about that for a moment. A major complaint of some Hadoop users in mixed analytic application environments is the alphabet soup of protocols used to move and copy data between individual applications and Hadoop cluster(s) over their network. These complicated data flows could include data collectors, Extract-Transform-Load (ETL) steps, predictive analytic engines, Hadoop analytics, visualization-reporting-dashboard engines and other machine-to-machine communications. In most of these steps, the results of one step need to be transferred to the next processing step, and many of these steps are individual servers or full computing clusters. This means data is moved (copied then deleted) or copied from one application server to another, consuming network bandwidth. One measurable metric of complex, multi-staged analytic data flows like this is the amount of wall clock time spent just on the data transfers between discrete steps, which can significantly elongate the total processing time.

 

If these applications were running in virtual machines on HSP with their data stored in eScaleFS, then Hadoop could access that same data directly, and these analytic-data-flow models would be transformed into analytic-flow models only. Each analytic step in a complex workflow, including MapReduce and other Hadoop processing components, simply points to the previous step’s processed results.

 

To illustrate this concept, I give you the “TwitterVerse” version 2 (in progress). The TwitterVerse version 1 was a multi-stage application that collects tweets from Twitter, processes this data and stores the summarized results in a vanilla SQL database for now-time dashboarding of tweet locations, as well as the mobile device breakdown and tweeted language breakdown, all orchestrated and visualized via Pentaho.


[Figure: TwitterVerse version 1.0 data flow (TwitterVerse1-0.png)]

 

In this case, we’re getting tweets that have location turned on and that came from a mobile device. The database is just a temporary place to hold interim summary results. In this first version of the TwitterVerse, the raw Twitter data from the collection stage is eventually pushed as a raw tweet bundle to HCP. This is a data move to HCP (no big deal for these 100+ MB bundles), but if these tweets were to be processed again by MapReduce later, they would have to be retrieved and transferred from HCP into HDFS, a data set that by now measures in terabytes.
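For the curious, here is a hedged sketch of the kind of filter the collection stage applies, keeping only geotagged tweets sent from a mobile client. It uses the public Twitter JSON fields (coordinates, source); the mobile-detection rule and the sample tweet are simplifications of my own, not the TwitterVerse’s actual code.

```java
import org.json.JSONObject;

// Hypothetical sketch of the collection-stage filter: keep a tweet only if
// geolocation is present and the client ("source") looks like a mobile app.
public class TweetFilter {

    static boolean isGeoTaggedMobileTweet(String rawJson) {
        JSONObject tweet = new JSONObject(rawJson);

        // "coordinates" is null unless the user turned location on.
        boolean hasLocation = tweet.has("coordinates") && !tweet.isNull("coordinates");

        // "source" holds the client name; this substring test is a simplification.
        String source = tweet.optString("source", "");
        boolean fromMobile = source.contains("iPhone")
                || source.contains("Android")
                || source.contains("Mobile");

        return hasLocation && fromMobile;
    }

    public static void main(String[] args) {
        String sample = "{\"coordinates\":{\"type\":\"Point\",\"coordinates\":[-122.4,37.8]},"
                + "\"source\":\"<a href=\\\"http://twitter.com/download/iphone\\\">Twitter for iPhone</a>\","
                + "\"lang\":\"en\",\"text\":\"hello\"}";
        System.out.println(isGeoTaggedMobileTweet(sample)); // prints: true
    }
}
```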

 

The TwitterVerse version 2.0, or “the TwitterVerse on HSP”, extends this flow by making a few changes. The first stage, the collection stage, still collects, processes and dashboards the tweet locations, device types and languages, but instead of pushing the raw data to HCP when this stage is completed, the tweet bundle is simply renamed into an archiving folder in the eScaleFS - no data transfer! A MapReduce job is then kicked off to do key-term frequency analysis on all, or some time window, of the raw tweets. The results are then stored in another shared folder on eScaleFS, where a final step uses them to visualize the key terms, in this case as a wordcloud.


[Figure: TwitterVerse version 2.0 data flow on HSP (TwitterVerse2-0-2.png)]

Except for the containerized data in the temporary database, reusing a single copy of the data for this analytic-flow is just a matter of each step pointing to the correct folder holding the raw data. Yes, results are written back, but those results are then used in place as well.
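In the TwitterVerse, the key-term frequency step is built graphically with PVMR, but for readers who like to see it as plain Hadoop code, here is a minimal word-count-style sketch. The folder names are made up; the important part is that the job’s input path is the shared archive folder the collection stage renamed its bundles into, and its output path is another shared folder the wordcloud step reads in place.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count-style sketch of the key-term frequency step.
// Paths are assumptions; no data is copied between stages, only pointed at.
public class KeyTermFrequency {

    public static class TermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes one tweet's text per line; real bundles would need JSON parsing.
            StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken());
                context.write(term, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "key-term frequency");
        job.setJarByClass(KeyTermFrequency.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input: the shared archive folder the collection stage renamed its bundles into.
        FileInputFormat.addInputPath(job, new Path("/tweets/archive"));
        // Output: a shared folder the wordcloud step reads in place.
        FileOutputFormat.setOutputPath(job, new Path("/tweets/key-terms"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```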

 

You can probably start to envision your own application’s analytic-data-flows, the amount of network traffic (and transfer latency) they generate getting data from one step to the next, and how, with a couple of changes, you could reuse data in place on this type of platform, basically transforming your analytic-data-flows into analytic-flows. I’d be interested in hearing some of your thoughts, and possibly about applications you have that could take advantage of this kind of shared data capability in a unique hyper-converged platform.

 

 

