[ Part Three of a Five-Part Series ] - The 'Orchestrated' Data Lake
Very simply put, this is an application of Chaos Theory and the Butterfly Effect. The problem is that we look at so much data but do not see the patterns in it. What seems to be a chaotic set of data points is actually a pattern that evolves along multiple dimensions. A small trigger on a possibly unseen facet of your data landscape could well be the cause of a multi-billion-dollar business down the line, if leveraged correctly.
To truly leverage all your useful data, your analysts need to be able to work with data beyond the structured databases, correct? We have an answer.
Tap into your entire range of enterprise data across data warehouses, data marts, logs and semi/un-structured data, archives, knowledge-bases and even documents, images and videos with us.
Treating an enterprise data warehouse as the sole source of truth for making business decisions is not a good option and here is why:
- An EDW is typically limited to transactional data (such as Billing-Booking-Backlog, Accruals, Accounts Receivable, Returns and Exchanges, Partner Transactions, etc.). This can provide a current snapshot of your business, but only a limited one that does not factor in the impact of external factors
- There is a lot of data that you already have beyond the EDW in the form of competitive intelligence, log files and social information, M2M and sensor feeds, archived data, and a lot more
- Beyond all this, you need to consider seasonal-trend information, currency and international market fluctuations, and socio-economic scenarios to be able to make truly data-driven decisions
I’ll use the following two examples to illustrate how our Orchestrated Data Lake achieves this.
#1 – Drawing a trend across your sales data over the last seven years
Typically, seven years' worth of data would not be in one place for you to query. Let us say it is distributed across:
- A year of very hot data in an in-memory data source (SAP HANA) – actually about 13 months for SOX compliance purposes
- The next three years of data warehouse data offloaded into a big data repository (like Hadoop)
- And everything beyond that archived and safely put away
This is where Pentaho comes in as the single orchestrator across the data lake. With a one-time exercise of a few mappings and transformations that tell Pentaho where to look for the right data, we are able to tap into SAP HANA with an ANSI SQL query, into Hadoop with a MapReduce job and into your archives with an Elasticsearch query over the metadata for each archived data block. Pentaho takes a single query, automatically breaks it up into the right format for each data source, monitors the execution of the individual queries and returns the data as a single object, even as JSON, if that is what your dashboard needs.
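To make the idea of splitting one query across storage tiers concrete, here is a minimal sketch in plain Python. It is not Pentaho's API and not how Pentaho is implemented. The `Tier` class, the `plan` and `run` functions, the date boundaries and the sample rows are all hypothetical, and the in-memory fetchers merely stand in for the SQL, MapReduce and Elasticsearch calls a real setup would make.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tier:
    name: str    # label only; stands in for SAP HANA, Hadoop or the archive
    start: date  # first day held by this tier (inclusive)
    end: date    # last day held by this tier (inclusive)


# Hypothetical tier boundaries, chosen only for illustration.
TIERS = [
    Tier("hot", date(2023, 1, 1), date(2023, 12, 31)),
    Tier("warm", date(2020, 1, 1), date(2022, 12, 31)),
    Tier("archive", date(2017, 1, 1), date(2019, 12, 31)),
]

# Made-up sample rows; a real fetcher would query the actual store.
DATA = {
    "hot": [{"day": "2023-06-01", "amount": 120}],
    "warm": [{"day": "2021-03-15", "amount": 90}],
    "archive": [{"day": "2018-11-20", "amount": 75}],
}


def make_fetcher(name):
    """Return a function that reads one tier for a given date range."""
    def fetch(lo, hi):
        # ISO date strings compare correctly as plain strings.
        return [r for r in DATA[name] if lo.isoformat() <= r["day"] <= hi.isoformat()]
    return fetch


FETCHERS = {tier.name: make_fetcher(tier.name) for tier in TIERS}


def plan(tiers, query_start, query_end):
    """Split one date range into per-tier sub-ranges, skipping tiers with no overlap."""
    steps = []
    for tier in tiers:
        lo = max(tier.start, query_start)
        hi = min(tier.end, query_end)
        if lo <= hi:
            steps.append((tier.name, lo, hi))
    return steps


def run(tiers, fetchers, query_start, query_end):
    """Execute each sub-query and return the merged rows as one JSON string."""
    rows = []
    for name, lo, hi in plan(tiers, query_start, query_end):
        rows.extend(fetchers[name](lo, hi))
    rows.sort(key=lambda r: r["day"])
    return json.dumps(rows)


print(run(TIERS, FETCHERS, date(2017, 1, 1), date(2023, 12, 31)))
```

The point of the sketch is the separation of concerns: the caller supplies one date range, the planner decides which tiers are involved, and the caller receives a single merged object regardless of where each row came from.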
#2 – Unifying different data types to draw business insight
A second and more business-oriented use case would be to combine:
- Structured sales info – what was sold, when, where and how much
- Web logs – how have customers navigated your website, what did they click on, where did they go from there, what did they actually buy, what are the conversion and attach rates
- Social data – what are your customers saying about your products on, say, Facebook - are they recommending the products to their family and friends; can you use that info to launch a new family and friends packaged campaign
Let us say the sales data is in Oracle, the web logs are on a Hadoop cluster and the Facebook data is ‘scooped’ into MongoDB. Again, exactly the same concept applies: Pentaho is able to automatically tap into the data where it exists and return it to you as a single data object.
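As a rough illustration of what a "single data object" could look like for this use case, the sketch below merges three small in-memory lists by product. It assumes every source can be keyed on a shared product identifier, which is an assumption of this example and not something stated above. The field names, the sample rows and the `unify` function are hypothetical, and the lists only stand in for Oracle, Hadoop and MongoDB.

```python
from collections import defaultdict

# Made-up sample rows standing in for the three sources.
SALES = [  # structured sales info (Oracle in the example above)
    {"product": "router", "units": 3},
    {"product": "router", "units": 2},
    {"product": "modem", "units": 1},
]
WEB_LOGS = (  # click-stream events (Hadoop in the example above)
    [{"product": "router", "event": "view"}] * 4
    + [{"product": "router", "event": "purchase"}] * 2
    + [{"product": "modem", "event": "view"}]
)
SOCIAL = [  # social posts (MongoDB in the example above)
    {"product": "router", "recommends": True},
    {"product": "router", "recommends": True},
    {"product": "router", "recommends": False},
]


def unify(sales, web_logs, social):
    """Combine the three feeds into one record per product."""
    out = defaultdict(
        lambda: {"units": 0, "views": 0, "purchases": 0, "recommendations": 0}
    )
    for row in sales:
        out[row["product"]]["units"] += row["units"]
    for row in web_logs:
        if row["event"] == "view":
            out[row["product"]]["views"] += 1
        elif row["event"] == "purchase":
            out[row["product"]]["purchases"] += 1
    for row in social:
        if row["recommends"]:
            out[row["product"]]["recommendations"] += 1
    for record in out.values():
        # Conversion rate here is simply purchases divided by views.
        views = record["views"]
        record["conversion_rate"] = record["purchases"] / views if views else None
    return dict(out)


for product, record in unify(SALES, WEB_LOGS, SOCIAL).items():
    print(product, record)
```

A product with many recommendations and a healthy conversion rate would be a natural candidate for the family-and-friends campaign mentioned above.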
This is a simplified description of the entire job. In most cases, we might need staging layers and more repositories, but the important thing is being able to tap into all of your enterprise data via a single pane of glass. What is also important is that this is not just a union of the datasets across sources. It is also a filtering scheme based on the query conditions. What it means is that you get just the megabytes of data you need and not a gigabyte of data that is difficult to handle.
It is also important to remember that a structured database is very different from an unstructured one. SAP HANA and Hadoop are different things altogether and serve different purposes. Their speeds and volumes cannot be compared with each other. Having said that, we can put a layer of SAP HANA Vora on top of Hadoop to speed up the data flow. It all depends on the business need and how this overall scheme of things is put to use.
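The difference between a plain union and filtering at each source can be shown with a small, hypothetical Python sketch. The source names, the rows and both functions are invented for illustration, and the "rows moved" counter is only a stand-in for the data volume that would travel from each repository to the orchestrator.

```python
# Made-up rows; each key stands in for one repository in the data lake.
SOURCES = {
    "hot": [{"region": "EMEA", "amount": 10}, {"region": "APAC", "amount": 20}],
    "warm": [{"region": "EMEA", "amount": 30}, {"region": "AMER", "amount": 40}],
    "archive": [{"region": "APAC", "amount": 50}],
}


def is_emea(row):
    """The query condition used in this example."""
    return row["region"] == "EMEA"


def union_then_filter(sources, predicate):
    """Pull every row from every source, then filter centrally."""
    moved = [row for rows in sources.values() for row in rows]
    return [row for row in moved if predicate(row)], len(moved)


def filter_at_source(sources, predicate):
    """Apply the condition inside each source so only matching rows move."""
    moved = [row for rows in sources.values() for row in rows if predicate(row)]
    return moved, len(moved)


rows_a, moved_a = union_then_filter(SOURCES, is_emea)
rows_b, moved_b = filter_at_source(SOURCES, is_emea)
print("same result:", rows_a == rows_b)
print("rows moved, union first:", moved_a)
print("rows moved, filter at source:", moved_b)
```

Both approaches return the same answer; what changes is how much data has to be moved to get it.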
In the next part of this series, we will discuss how all this can be brought together with IoT for both performance optimization and predictive maintenance.