
MapR Scale-out Storage with Hitachi Content Platform

Blog post created by Nick DeRoo on Oct 18, 2018

Introduction

On October 3rd, 2018, MapR announced the general availability of MapR version 6.1. This release offers much-improved storage management capabilities, including the ability to tier cool, cold, or frozen data to the Hitachi Content Platform (HCP). As we’ve discussed in previous blog posts, our customers are asking for solutions to seamlessly offload their cold and frozen HDFS and MapR-FS data. We are confident that MapR 6.1 delivers on these requirements, as the release was beta tested on HCP by an important customer of both MapR and Hitachi.

HCP provides tremendous value as a virtually limitless scale-out cold storage pool for the MapR environment. Data that is infrequently accessed is seamlessly tiered from your mission-critical MapR storage infrastructure into economical HCP S3 buckets. Once tiered, your data is protected by HCP's best-in-class durability and availability. Best of all, it remains accessible to all of your applications at the original MapR-FS URI; there is no need to update data paths, because tiering is completely transparent at the application layer. There is additional latency when recalling data from HCP, but MapR caches recently recalled blocks, so the performance hit applies only to the first access. Subsequent access to the recalled blocks is as fast as any other locally stored data. By combining Hitachi’s best-in-class object storage with MapR Data Tiering, the enterprise can now right-size its MapR clusters and stop buying high-end compute nodes to accommodate a ceaseless accumulation of cold data.


To learn more about these capabilities, please refer to our blog post, Critical Capabilities for Hadoop Offload to HCP.

How it Works

MapR has a concept of volumes, which are logical units used to organize data and manage performance. A volume allows you to apply policies to a set of files, directories, and sub-volumes. The new data tiering functionality provides more control over the data by tiering cold data (at the block level) to more economical remote storage targets such as HCP. Data tiering is controlled with rules that can be customized based on the user, group, file size, and last modified time. One or more of these rules comprise a storage policy. Volumes can be assigned a storage policy and a remote target backed by HCP. A schedule is then configured to determine how often the rules of the storage policy are applied to the volume. This workflow allows aging data to be automatically and seamlessly tiered to HCP. When data is tiered, its blocks are moved to the lower-cost storage and a file stub is kept on primary storage, which enables applications to seamlessly access data that has been tiered. The Content Solutions Engineering team validated this functionality with MapR 6.1 and HCP, implementing the workflow described above.
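
To make the rule model concrete, here is a minimal sketch in Python of how a storage policy rule built from user, group, file size, and last-modified-time criteria might decide whether a file is eligible for offload. This is purely illustrative and is not MapR's API; the StoragePolicyRule and FileInfo names, fields, and thresholds are our own assumptions.

    # Conceptual sketch only: NOT MapR's implementation, just an illustration of
    # how a storage policy rule (user, group, file size, last modified time)
    # decides whether a file is eligible for offload to a remote tier such as HCP.
    import time
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class StoragePolicyRule:
        """One rule in a storage policy; any criterion left as None/0 is ignored."""
        user: Optional[str] = None
        group: Optional[str] = None
        min_size_bytes: int = 0
        min_age_days: int = 0


    @dataclass
    class FileInfo:
        """Minimal file metadata for the illustration."""
        path: str
        owner: str
        group: str
        size_bytes: int
        mtime: float  # seconds since epoch


    def eligible_for_offload(f: FileInfo, rule: StoragePolicyRule,
                             now: Optional[float] = None) -> bool:
        """Return True if the file matches every criterion the rule specifies."""
        now = now or time.time()
        age_days = (now - f.mtime) / 86400
        if rule.user and f.owner != rule.user:
            return False
        if rule.group and f.group != rule.group:
            return False
        if f.size_bytes < rule.min_size_bytes:
            return False
        return age_days >= rule.min_age_days


    # Example: offload anything owned by the 'etl' user that has not been
    # modified in the last 90 days.
    rule = StoragePolicyRule(user="etl", min_age_days=90)
    old_file = FileInfo("/data/etl/2017/q4.parquet", "etl", "analytics",
                        4 * 1024**3, time.time() - 200 * 86400)
    print(eligible_for_offload(old_file, rule))  # True

In MapR 6.1 the equivalent evaluation happens inside the cluster on the schedule assigned to the volume; the point of the sketch is simply that a rule is a set of optional criteria that must all match before a file's blocks become candidates for offload.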


Configuring HCP as a remote target and implementing seamless data tiering is a fairly simple and well-documented procedure. First, the HCP administrator configures an HCP namespace with the S3 protocol enabled and a data access user who must also be the namespace owner. From the MapR Control System, a remote target is then configured to point to the HCP namespace. Next, a MapR volume can be created, data tiering can be enabled, and the newly created HCP target can be selected. A storage policy then needs to be selected or created to determine when data is eligible to tier. Once a storage policy is in place, offloading can be initiated manually or driven by a schedule. By assigning a schedule to the volume, you control how often eligible data is offloaded to HCP.
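
Before pointing the MapR Control System at the namespace, it can be helpful to verify that the HCP namespace is reachable over S3 with the data access account. The sketch below uses Python with boto3; the endpoint, tenant, namespace, and credentials are placeholders for your environment. HCP's S3-compatible API typically derives the access key from the Base64-encoded username and the secret key from the MD5 hash of the password, but confirm those details against the documentation for your HCP release.

    # Smoke test of the HCP namespace over S3 before configuring it as a MapR
    # remote target. All names and credentials below are placeholders.
    import base64
    import hashlib

    import boto3
    from botocore.config import Config

    HCP_TENANT_ENDPOINT = "https://tenant1.hcp.example.com"  # placeholder tenant URL
    HCP_NAMESPACE = "mapr-cold"                               # namespace = S3 bucket
    HCP_USER = "mapr-tiering"                                 # data access user / namespace owner
    HCP_PASSWORD = "change-me"

    # Assumed HCP credential derivation: access key = Base64(username),
    # secret key = MD5(password). Verify for your HCP release.
    access_key = base64.b64encode(HCP_USER.encode()).decode()
    secret_key = hashlib.md5(HCP_PASSWORD.encode()).hexdigest()

    s3 = boto3.client(
        "s3",
        endpoint_url=HCP_TENANT_ENDPOINT,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        config=Config(s3={"addressing_style": "path"}),  # keep the bucket out of the hostname
    )

    # Confirm the namespace is visible, then round-trip a small test object.
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])
    s3.put_object(Bucket=HCP_NAMESPACE, Key="tiering-smoke-test.txt",
                  Body=b"hello from MapR tiering prep")
    print(s3.get_object(Bucket=HCP_NAMESPACE, Key="tiering-smoke-test.txt")["Body"].read())

If this round trip succeeds, the same endpoint and credentials can be used when defining the remote target in the MapR Control System.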


Considerations

  • Data tiering is only supported for file data; offloading table and stream data is not supported at this time.
  • Tiering can only be enabled on newly created volumes, not on existing volumes. Data on existing volumes must be copied into new, tiering-enabled volumes to take advantage of data tiering.
  • Tiering for an individual volume is managed by a single MapR MAST Gateway, which runs on a single MapR node. To achieve high offload throughput, build a volume hierarchy according to MapR’s best practices so that your data is balanced across several volumes, maximizing cluster performance and data availability.

Conclusion

With the MapR 6.1 release, MapR is taking the lead in acknowledging and addressing the problem of cold data bloat in big data clusters. MapR customers now have an option for addressing this problem other than adding more nodes. Not only is HCP a great choice for your MapR remote storage target, it has already been tested in this role and is currently in production behind a multi-petabyte MapR cluster at a very large customer. The Content Solutions Engineering team will be working with the MapR alliances team to provide official certification of this solution. There is a great opportunity here for customers and Hitachi Vantara. Let's start talking about the pain caused by big data, and how Hitachi Vantara can help. As always, your feedback is appreciated in the comment section.
