Nick DeRoo

A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform (Part 2)

Blog post created by Nick DeRoo on Aug 10, 2018

In part 1 of this series, we introduced the challenges our customers currently face with storing data long term in Hadoop. In this blog, we’ll discuss how new Hadoop functionality brings object storage closer to the Hadoop ecosystem and how future Hadoop functionality will continue to simplify big data management.

Review: The big data problem

As we discussed in part 1 of this series, storing petabytes of data in the Hadoop Distributed File System (HDFS) is costly and inefficient, because expanding HDFS storage requires you to expand compute and storage capacity together. We then reviewed how customers can reduce the cost of storing data in HDFS by offloading data to an object storage system, like Hitachi Content Platform (HCP). However, that only solves part of the problem. Although HDFS offloading solutions exist, they require either applications to move their own data or storage administrators to update the application’s database after data has been moved. There must be a better way to offload data.

What about application owners who don’t want to modify their applications to move their data? Storage administrators are responsible for maintaining the backend storage, so why can’t they solve the growing storage problem in a way that is seamless to the applications? The Hadoop community and Hitachi Vantara recognize this problem and are working towards a seamless Hadoop offload solution to address it.

Decoupling storage and compute in Hadoop

Apache Hadoop is addressing the issue of uneven storage and compute growth by adding functionality that decouples storage capacity from compute capacity. One example is Heterogeneous Storage, which lets Hadoop manage data placement directly through storage types and storage policies. In Hadoop 2.3, the data node storage model changed from a single storage per data node to a collection of storages, in which each ‘store’ corresponds to a physical storage medium. This brings the concept of storage types (Disk and SSD) to Hadoop. Storage policies then allow data to be stored on different storage types based on a policy, so data can be moved between storage types or volumes by setting the storage policy on a file or directory.
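As a rough illustration of setting a storage policy on a directory, the sketch below builds the `hdfs storagepolicies` command lines in Python. The helper names and the `/data/logs/2018` path are hypothetical, and the commands are only constructed and printed here, not run; running them would need a Hadoop client from a release that ships the `storagepolicies` subcommand.

```python
def set_policy_cmd(path, policy):
    """Build (but do not run) the command that assigns a storage policy to a path."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", path, "-policy", policy]


def get_policy_cmd(path):
    """Build (but do not run) the command that reports the policy set on a path."""
    return ["hdfs", "storagepolicies", "-getStoragePolicy", "-path", path]


if __name__ == "__main__":
    # Hypothetical directory; ALL_SSD is one of the built-in HDFS policy names.
    for cmd in (set_policy_cmd("/data/logs/2018", "ALL_SSD"),
                get_policy_cmd("/data/logs/2018")):
        print(" ".join(cmd))
```

Returning the argument list rather than a shell string keeps the sketch easy to hand to `subprocess.run` later without quoting problems.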

Another important Hadoop feature for decoupling storage and compute is the addition of an archival storage type. Nodes with higher-density, less expensive storage can be used for archival storage. A new data migration tool called ‘Mover’ was added for archiving data. It periodically scans the files in HDFS to check whether the block placement satisfies the storage policy. Although storage policies allow Mover to identify and move blocks that are supposed to be in a different storage tier, the functionality to transition files from one storage policy to another does not exist. Hadoop is missing a policy engine that looks at file attributes, access patterns, and other higher-level metadata, and then chooses the storage policy for the data based on what it finds.
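The check that Mover performs can be pictured with a small Python model. This is a simplified sketch, not Mover's actual implementation: it ignores fallback storage types and assumes the usual placement for the built-in policies (for example, Warm keeps one replica on DISK and the rest on ARCHIVE). The function names are hypothetical.

```python
from collections import Counter

# Assumed placement per built-in policy: (first replica, every remaining replica).
POLICY_PLACEMENT = {
    "HOT": ("DISK", "DISK"),
    "WARM": ("DISK", "ARCHIVE"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
    "ONE_SSD": ("SSD", "DISK"),
    "ALL_SSD": ("SSD", "SSD"),
}


def expected_types(policy, replication):
    """Storage types a block's replicas should occupy under the given policy."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    first, rest = POLICY_PLACEMENT[policy]
    return [first] + [rest] * (replication - 1)


def misplaced_replicas(policy, actual_types):
    """Storage types currently holding replicas that the policy does not call for."""
    expected = Counter(expected_types(policy, len(actual_types)))
    surplus = Counter(actual_types) - expected  # multiset difference
    return sorted(surplus.elements())


if __name__ == "__main__":
    # A block tagged COLD whose replicas mostly still sit on DISK.
    print(misplaced_replicas("COLD", ["DISK", "DISK", "ARCHIVE"]))
```

An empty result means the block already satisfies its policy. On a real cluster the migration itself is triggered with `hdfs mover -p <path>`.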

To transition data from the Hot storage policy to Cold, the storage administrator either needs to manually tag files and directories or build and maintain complicated tools and logic. Even with an automated storage policy, there is still significant room for cost savings by tiering Cold data outside of the Hadoop filesystem to an object storage system. 
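To make the missing piece concrete, here is a minimal sketch of the kind of logic an administrator would have to build and maintain: a rule that picks a storage policy from a file's last access time. Everything in it is an assumption for illustration, including the 30-day and 180-day thresholds and the function name; it is not an existing Hadoop feature.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; a real deployment would tune these per data set.
WARM_AFTER = timedelta(days=30)
COLD_AFTER = timedelta(days=180)


def choose_policy(last_access, now):
    """Pick a storage policy name from how long ago the file was last read."""
    age = now - last_access
    if age >= COLD_AFTER:
        return "COLD"
    if age >= WARM_AFTER:
        return "WARM"
    return "HOT"


if __name__ == "__main__":
    now = datetime(2018, 8, 10)
    for last_access in (datetime(2018, 8, 1), datetime(2018, 6, 1), datetime(2018, 1, 1)):
        print(last_access.date(), choose_policy(last_access, now))
```

A real tool would also need to collect access times for every file, apply the chosen policy, and then run Mover, which is where most of the complexity comes from.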

Bringing object storage closer to Hadoop

In Apache Hadoop 3.1, external storage can be mounted as a PROVIDED storage type. (See HDFS-9806 for more details.) This brings object storage closer to the Hadoop ecosystem but limits customers by only allowing them to create a read-only image of a remote namespace. PROVIDED storage allows data stored outside HDFS to be mapped to and accessed from HDFS. Clients accessing data in PROVIDED storage can cache replicas in local media, enforce HDFS security and quotas, and address more data than the cluster could persist in the storage attached to its data nodes. Although more data can be addressed, data still cannot be seamlessly tiered from HDFS to the PROVIDED storage tier.
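The read-only mapping can be sketched conceptually in Python. This toy model is an assumption for illustration only: the class, the mount point, and the `s3a://example-bucket/archive` prefix are hypothetical, and real PROVIDED storage maps HDFS blocks to regions of remote files rather than rewriting path strings.

```python
class ProvidedMount:
    """Toy read-only view that maps HDFS paths under a mount point to remote objects."""

    def __init__(self, mount_point, remote_prefix):
        self.mount_point = mount_point.rstrip("/")
        self.remote_prefix = remote_prefix.rstrip("/")

    def resolve(self, hdfs_path):
        """Translate an HDFS path inside the mount to its remote location."""
        inside = (hdfs_path == self.mount_point
                  or hdfs_path.startswith(self.mount_point + "/"))
        if not inside:
            raise KeyError(f"{hdfs_path} is not under {self.mount_point}")
        return self.remote_prefix + hdfs_path[len(self.mount_point):]

    def write(self, hdfs_path, data):
        """Writes are rejected, mirroring the read-only limitation described above."""
        raise PermissionError("PROVIDED mount is read-only in this model")


if __name__ == "__main__":
    mount = ProvidedMount("/remote", "s3a://example-bucket/archive")
    print(mount.resolve("/remote/2018/events.parquet"))
```

The `write` method always failing is the point of the model: data can be read through the HDFS namespace, but nothing can be tiered out to the remote store through it.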

Currently, Apache Hadoop is working to extend the tiering functionality to external storage mounted as the PROVIDED storage type. HDFS-12090 is an open item to handle writes from HDFS to a PROVIDED storage target. This enhancement is referenced in a presentation from the DataWorks Summit, where the demo uses the ‘hdfs syncservice’ or ‘hdfs providedstorge’ subcommands to assign a storage policy to a data set and then tier this data to external storage. Unfortunately, the functionality described in HDFS-12090 and shown in the DataWorks Summit demo is still being designed and is not yet scheduled for an Apache release. This leaves us with an open question: How do we seamlessly offload data from HDFS to object storage with the existing HDFS functionality?


MapR 6.1 functionality

MapR is another Hadoop implementation, but unlike Apache Hadoop and Cloudera, MapR is fully proprietary and follows its own development path. MapR has recently announced that as part of its 6.1 release it will support seamless offloading of data from MapR-FS to "cost-optimized" storage (in other words, an S3 bucket). They describe this functionality as: "Policy-Driven automatic data placement across performance-optimized, capacity-optimized and cost-optimized tiers, on-premises or in cloud, with Object Tiering."

The MapR 6.1 announcement also describes the ability to have one global namespace that can transparently store hot, warm, and cold data, eliminating the need to create segregated namespaces. As data transitions from hot to cold, it can be moved from MapR-FS to cost-optimized storage, and applications can continue to access it at the same path. According to the announcement, object tiering in MapR 6.1 is deployed using simple policies. In a given policy, administrators can identify the data to be tiered, the criteria for tiering, and the choice of a public or private cloud target. Although the described functionality sounds like an end-all solution to the data offload problem, it only helps customers who have their Hadoop environments in MapR today or plan to transition to a MapR Hadoop environment in the future.

What’s Next?

The Content Solutions Engineering team recognizes that this is an opportunity to simplify data management and reduce costs for our customers. We are currently evaluating the feasibility of a few different solutions that will provide this seamless Hadoop offload functionality for Apache Hadoop and Cloudera distributions. Keep an eye on the Hitachi community site for more information as we continue to define the solutions around this use case. Please feel free to reach out to the Content Solutions Engineering team if you have feedback or would like to share some customer use cases.