[ Part Two of a Five-Part Series ] - Data Temperatures and Partitioning
All you need to know about how data can be sliced up in pieces and yet kept reachable, accelerating your data-driven decisions.
I was talking to an acquaintance in the industry a few days ago and was told that their company had a data offloading solution. “So how do you access the offloaded data when it is needed?” I asked.
The answer was: “Well, it’s offloaded, you know … so no one should need it.”
“You mean archived and as good as purged, not offloaded”, I said.
At Hitachi, we look at problems, both current and prospective, and figure out sustainable, resilient and scalable solutions.
Some time back, we set out to solve three basic problems with data accessibility:
- Keeping all data accessible is expensive, so it was offloaded into repositories and archives. Now, how to find what is needed?
- Business sense of data can be made only if you tie together structured with unstructured data. How to accomplish that?
- The data scientists have created some nice R and Python scripts for mathematical analytics. How to apply those to all the data?
In doing this, we found a way to manage all of our customers’ data, including the ability for our customers to make data-driven decisions on demand across the end-to-end landscape. Before we talk about that, let us sync up on data temperatures, which partition data by how frequently it is accessed:
- Hot data – Frequently accessed data, typically pertaining to the last three months of business
- Warm data – Business information that needs to be retained in a structured repository, typically for the following year
- Cold data – This is of two types, and the best place to keep it is in a big data system:
  - Business data that is offloaded after, say, more than a year
  - Logs, device data, real-time streams, social/media inputs
- Frozen data – All the data that can be archived but may be required at a later point in time
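To make the four temperatures concrete, here is a minimal Python sketch that assigns a temperature from a record's age. The function name `classify_temperature` and the `machine_generated` flag are hypothetical, and only the three-month hot window comes from the list above; the warm and cold cutoffs are assumptions chosen for illustration, not Hitachi settings.

```python
# Illustrative thresholds in days. Only the first is stated in the text;
# the others are assumptions made for this sketch.
HOT_MAX_DAYS = 90                    # "the last three months of business"
WARM_MAX_DAYS = HOT_MAX_DAYS + 365   # assumed: hot window plus roughly one more year
COLD_MAX_DAYS = 7 * 365              # assumed: the text gives no cold/frozen cutoff


def classify_temperature(age_days: int, machine_generated: bool = False) -> str:
    """Return 'hot', 'warm', 'cold' or 'frozen' for a record of the given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days > COLD_MAX_DAYS:
        return "frozen"
    if machine_generated:
        # Logs, device data, streams and social/media inputs start out cold.
        return "cold"
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


if __name__ == "__main__":
    for age in (30, 200, 600, 4000):
        print(age, classify_temperature(age))
```

In practice the cutoffs would be driven by observed access frequency and retention policy rather than age alone; age is used here only because it is the one signal the list above spells out.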
#1 – How data is partitioned
Consider a conservative data footprint of 10 Petabytes. This is a standard data size for most corporations around the globe and the same philosophy applies irrespective of growth in data size. In general, a 10 Petabyte footprint would be distributed across the following:
- About 100 Terabytes of structured data, covering the direct business information (sales, billings, bookings, backlog, accounts receivable, accruals, etc.) that goes into an SAP HANA instance, Oracle or even MySQL for that matter:
  - Say, up to 6TB of hot data (for large global corporations, this could be more)
  - The remainder as warm data, covering requirements of a little more than a year’s worth for things like SOX compliance
- The next few Petabytes pertaining to cold data, offloaded or retained in a big data system, whether unstructured (a flavor of Hadoop) or semi-structured (such as MongoDB or Cassandra)
- Finally, we have all the remaining data (which is the largest chunk) in the form of frozen data on an archival system
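The arithmetic behind this breakdown can be sketched in a few lines of Python. The 10 PB total, 100 TB of structured data and 6 TB of hot data come from the list above; the 3 PB cold tier is an assumed stand-in for "the next few Petabytes", the function name `partition_footprint` is hypothetical, and decimal units (1 PB = 1000 TB) are assumed.

```python
TB_PER_PB = 1000  # decimal units, assumed for simplicity


def partition_footprint(total_pb=10, structured_tb=100, hot_tb=6, cold_pb=3):
    """Split a data footprint into temperature tiers, all sizes in TB.

    cold_pb=3 is an assumption standing in for "the next few Petabytes".
    """
    total_tb = total_pb * TB_PER_PB
    warm_tb = structured_tb - hot_tb          # structured data that is not hot
    cold_tb = cold_pb * TB_PER_PB
    frozen_tb = total_tb - structured_tb - cold_tb  # everything that is left
    tiers = {"hot": hot_tb, "warm": warm_tb, "cold": cold_tb, "frozen": frozen_tb}
    if min(tiers.values()) < 0:
        raise ValueError("tier sizes are inconsistent with the total footprint")
    return tiers


if __name__ == "__main__":
    tiers = partition_footprint()
    total_tb = sum(tiers.values())
    for name, size_tb in tiers.items():
        print(f"{name:>6}: {size_tb:>6} TB ({size_tb / total_tb:.2%})")
```

Under these assumptions, hot and warm data together make up about 1% of the footprint, which is why placing the cold and frozen tiers sensibly matters so much for cost.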
Not only do we have all of these units, we also have packaged solutions, purpose-built for each unit, that leverage Pentaho Data Integration.
#2 – Super-convergence of platforms
Once we had required units for hot/warm, cold and frozen data, the next thought that struck us was convergence of all of these into a unified platform. The step after that was to give customers control of this entire data landscape on a single pane of glass. With that goal, we created the all-new Hitachi Unified Compute Platform (UCP), combining:
- Industry Vertical Solutions; built on
- Business Intelligence and Enterprise ERP software; running on
- Structured data appliances – SAP HANA, Oracle; in conjunction with
- Hadoop and NoSQL environments; alongside
- Rich and queryable archival systems like Hitachi Content Intelligence
All of this is built on quality Hitachi hardware products, which are known in the industry for dependability.
We created this as an a-la-carte platform menu enabling customers to plug and play components on demand. While we have built the ecosystem bottom-up from hardware all the way to industry solutions, we ask customers to look at it top down. In essence, customers can choose the application layer they desire and all other options required to make this happen are available to plug and play.
All of these capabilities are built upon a single, virtualized platform for all data types to ensure seamless access, protection and management of all information assets.
This integrated strategy focuses on three layers of technology that align with the evolution of the data landscape.
- Hitachi is known for exceptional infrastructure technology that enables virtualization, mobility, integrated management and infrastructure on demand
- However, with the growth of unstructured data and content, Hitachi has added rich capabilities to search, discover and integrate content independent of the applications that create it
- And when you are able to extract information and insight from that data, you can achieve real business value leveraging analytics, integration, greater intelligence and Big Data solutions
The beauty of the platform is that it avoids the need for rip and replace. It also keeps data replication to a minimum. Our philosophy is:
- Allow customers to pick up only what they need
- We will enable them with data access into all their data sources, irrespective of whether it was Hitachi-built or not
- And we will do that without data movement, by accessing the data where it resides
We call this concept single query data access, made possible by our orchestrated data lake – discussed in the next blog post in this series.