
Expanding HDFS storage to object storage with S3A

 

In this blog post I’ll explore how Hadoop applications can quickly and easily address storage on the Hitachi Content Platform (HCP) through the Hadoop S3A interface. See my previous blog posts in this series.

Getting started with object storage access from Hadoop

Hadoop has long supported the ability to interact with object storage systems via the S3A interface or its predecessors, S3N and S3. There is growing interest among Hadoop users and administrators in leveraging object storage to (a) recover HDFS capacity by offloading files from HDFS to S3A, or (b) serve as primary storage for raw data and other intermediate data sets where HDFS performance is not critical. Storing data in Hitachi Content Platform (HCP) provides distinct advantages over a filesystem because object storage scales to billions of objects without performance degradation and provides a lower total cost of ownership. Storing HDFS data on HCP via S3A also allows applications to leverage HCP’s best-in-class compliance capabilities and to share data with other applications through HCP’s RESTful APIs. With the growth of less frequently accessed or “cold” data in HDFS, the desire to offload data from Hadoop to object storage continues to grow.

 

The S3A interface allows Hadoop applications to read and write data directly to object storage with HDFS-style syntax by simply addressing the bucket like so: s3a://bucket/path/to/data. Applications can use an s3a://bucket/path URI to address data stored in an S3 bucket directly, just as they would normally reference an HDFS path.
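For example, once S3A is configured (see below), the standard Hadoop shell and DistCp accept s3a:// URIs directly. This is just a minimal sketch; the bucket, path, and file names are placeholders:

# List a bucket path on the object store
hadoop fs -ls s3a://mybucket/landing/

# Write a local file into the bucket and read it back
hadoop fs -put events.log s3a://mybucket/landing/events.log
hadoop fs -cat s3a://mybucket/landing/events.log

# Offload an existing HDFS directory to the bucket with DistCp
hadoop distcp hdfs:///warehouse/archive/2017 s3a://mybucket/archive/2017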

 

This functionality gives native Hadoop applications the ability to expand the storage they can interface with beyond local HDFS to HCP’s S3 API. To address HCP as Hadoop storage, I simply need to configure the access key, secret key, and S3A endpoint as properties in the Hadoop core-site.xml. This can be done either through Ambari/Cloudera Manager, or by editing $HADOOP_HOME/conf/core-site.xml on each DataNode.

 

<property>
  <name>fs.s3a.access.key</name>
  <value>HcpS3AccessKeyId</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>MyTenant.MyHCP.MyDomain.com</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>HcpS3SecretKey</value>
</property>
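As a side note, and as a minimal sketch rather than part of the configuration above: if you prefer not to keep the HCP keys in plain text in core-site.xml, Hadoop’s credential provider framework can hold them in a keystore. The jceks path below is an example, and depending on how DNS resolves your HCP namespaces you may also need to enable path-style access.

# Store the S3A keys in a Hadoop credential store (the command prompts for each value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/admin/hcp-s3a.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/admin/hcp-s3a.jceks

Then reference the keystore, and optionally enable path-style addressing, in core-site.xml:

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/user/admin/hcp-s3a.jceks</value>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>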




Below is a demo video depicting the configuration of HCP as S3A storage for Hadoop, along with a few examples of interacting with HCP from Hadoop via S3A: https://youtu.be/Z-d62OedZP8


Looking Ahead

The Hitachi Content Solutions Engineering (CSE) team is currently building a solution that will seamlessly offload data from HDFS to HCP with complete transparency and zero change to applications. As I demonstrated in this post, existing Hadoop functionality supports expanding Hadoop storage onto HCP by having end users and applications use the S3A interface. Unfortunately, adopting S3A requires significant application change, and most application owners would prefer not to modify application configurations and workflows to accomplish this. Customers have expressed this concern, and the CSE team is developing a solution that combines existing Apache HDFS storage management capabilities with new technology that will allow HDFS to tier directly to HCP.

Introduction

On October 3rd, 2018, MapR announced the general availability of MapR version 6.1. This release offers much-improved storage management capabilities that now include the ability to tier cool, cold, or frozen data to the Hitachi Content Platform (HCP). As we’ve discussed in previous blog posts, our customers are asking for solutions to seamlessly offload their cold and frozen HDFS and MapR-FS data. We are confident that MapR 6.1 delivers on these requirements, as the release was beta tested on HCP by an important customer of both MapR and Hitachi. HCP provides tremendous value as a virtually limitless scale-out cold storage pool for the MapR environment.

Data that is infrequently accessed will be seamlessly tiered from your mission-critical MapR storage infrastructure into economical HCP S3 buckets. Once your data is tiered to HCP, it is protected by HCP’s best-in-class durability and availability. And best of all, your data will continue to be accessible to all your applications at the original MapR-FS URI. There is no need to update data paths, as this is completely transparent at the application layer. Of course, there is additional latency when recalling data from HCP, but MapR caches recently recalled blocks, so the performance hit applies only to the first access; subsequent access to the recalled blocks is as fast as any other locally stored data. By leveraging Hitachi’s best-in-class object storage with MapR Data Tiering, the enterprise can now right-size its MapR clusters and stop buying high-end compute nodes to accommodate a ceaseless accumulation of cold data.

 

To learn more about these capabilities, please refer to our blog post Critical Capabilities for Hadoop Offload to HCP.

How it Works

MapR has a concept of Volumes, which are logical units used to organize data and manage performance. A Volume allows you to apply policies to a set of files, directories, and sub-volumes. The new data tiering functionality provides more control over the data by tiering cold data (at the block level) to more economical remote storage targets like HCP. Data tiering is controlled with rules that can be customized based on user, group, file size, and last modified time. One or more of these rules comprise a storage policy. Volumes can be assigned a storage policy and a remote target backed by HCP, and a schedule is then configured to determine how often the rules of the storage policy are applied to the volume. This workflow allows aging data to be automatically and seamlessly tiered to HCP. When data is tiered, its blocks are moved to the lower-cost storage and a file stub is kept on primary storage, which enables applications to seamlessly access data that has been tiered. The Content Solutions Engineering team validated this functionality with MapR 6.1 and HCP, implementing the workflow described above.

 

Configuring HCP as a remote target and implementing seamless data tiering is a fairly simple and well-documented procedure. First, the HCP administrator configures an HCP namespace with the S3 protocol enabled and a data access user who is also the namespace owner. From the MapR Control System, a remote target is then configured pointing to the HCP namespace. Next, a MapR volume can be created with data tiering enabled and the newly created HCP target selected. A storage policy is then selected or created to determine when data is eligible to tier. Once a storage policy is in place, offloading can either be initiated manually or driven by a schedule; by assigning a schedule to the volume, you control how often eligible data is offloaded to HCP.
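For readers who prefer the command line, the same workflow can also be scripted with maprcli. The sketch below is how I recall the MapR 6.1 commands rather than a tested recipe; the tier, rule, volume, and credential-file names are placeholders, and the exact command and flag names should be verified against the MapR 6.1 documentation.

# Register HCP as a cold (remote) tier; the URL and credentials file are placeholders
maprcli tier create -name hcp-cold -type cold -url https://ns1.tenant.hcp.example.com -credential /root/hcp-credentials.txt

# Create a new tiering-enabled volume bound to the HCP tier and a storage policy
# ("older-than-90d" stands in for a rule/storage policy created in the MapR Control System)
maprcli volume create -name coldvol -path /coldvol -tieringenable true -tieringrule older-than-90d -tiername hcp-cold

# Kick off an offload manually instead of waiting for the schedule
maprcli volume offload -name coldvol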

 

Considerations

  • Data tiering is only supported for file data; offloading table and stream data is not supported at this time.
  • Data tiering is only supported on newly created volumes and cannot be enabled on existing volumes. Data on existing volumes must be copied to new tiering-enabled volumes to take advantage of data tiering.
  • Tiering for an individual volume is managed by a single MapR MAST Gateway, which runs on a single MapR node. To achieve high offload throughput, it is important to build a volume hierarchy according to MapR’s best practices, balancing your data among several volumes to maximize cluster performance and data availability.

Conclusion

With the MapR 6.1 release, MapR is taking the lead in acknowledging and addressing the problem of cold data bloat in big data clusters. MapR customers now have an option for addressing this problem other than adding more nodes. Not only is HCP a great choice as your MapR remote storage target, it has already been tested in this role and is currently in production at a very large customer behind a multi-petabyte MapR cluster. The Content Solutions Engineering team will be working with the MapR alliances team to provide official certification of this solution. There is a great opportunity here for customers and Hitachi Vantara. Let's start talking about the pain caused by big data, and how Hitachi Vantara can help. As always, your feedback is appreciated in the comments section.