Expanding HDFS storage to Object Store with S3A
In this blog post I’ll explore how Hadoop applications can quickly and easily address storage in the Hitachi Object Storage through the Hadoop S3A interface. See my previous blog posts from this series.
Getting started with object storage access from Hadoop
Hadoop has long supported interacting with object storage systems via the S3A interface or its predecessors, S3N and S3 (reference). There is growing interest among Hadoop users and administrators in leveraging object storage to: (a) recover HDFS capacity by offloading files from HDFS to S3A, or (b) serve as primary storage for raw data and other intermediate data sets where HDFS performance is not critical. Storing data in Hitachi Content Platform (HCP) provides distinct advantages over a filesystem because object storage scales to billions of objects without performance degradation and provides a lower total cost of ownership. Storing HDFS data on HCP via S3A also lets applications leverage HCP’s best-in-class compliance capabilities and share data with other applications directly through HCP’s RESTful APIs. As the volume of less frequently accessed data, or “cold data”, in HDFS increases, the desire to offload data from Hadoop to object storage continues to grow.
The S3A interface allows Hadoop applications to read and write data directly to object storage using HDFS-style syntax, simply by addressing the bucket like so: s3a://bucket/path/to/data. Applications can use the s3a://bucket/path URI to address data stored in an S3 bucket just as they would normally reference an HDFS path.
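To make the addressing scheme concrete, here is a minimal Python sketch of how an s3a:// URI breaks down into a bucket and an object path, and how the same `hadoop fs -ls` command line is built for either an HDFS path or an S3A path. The helper names (`split_s3a_uri`, `hadoop_fs_ls`) are hypothetical and exist only for illustration; the sketch only constructs the command and does not run it.

```python
from urllib.parse import urlsplit


def split_s3a_uri(uri):
    """Split an s3a:// URI into (bucket, object path). Illustrative helper."""
    parts = urlsplit(uri)
    if parts.scheme != "s3a":
        raise ValueError(f"not an s3a URI: {uri}")
    # The URI authority is the bucket; the rest is the path inside it.
    return parts.netloc, parts.path.lstrip("/")


def hadoop_fs_ls(uri):
    """Build the argument list for `hadoop fs -ls <uri>`.

    The command is identical for hdfs:// and s3a:// paths; only the URI changes.
    """
    return ["hadoop", "fs", "-ls", uri]


if __name__ == "__main__":
    bucket, path = split_s3a_uri("s3a://bucket/path/to/data")
    print("bucket:", bucket)
    print("path:", path)
    print(" ".join(hadoop_fs_ls("s3a://bucket/path/to/data")))
```

On a node with the Hadoop client installed and S3A configured, the resulting argument list could be passed to `subprocess.run` to list the bucket contents.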
This functionality lets native Hadoop applications extend the storage they can work with beyond local HDFS to HCP, through HCP’s S3 API. To address HCP as Hadoop storage, I simply need to configure the access key, secret key and S3A endpoint as properties in the Hadoop core-site.xml. This can be done either through Ambari/Cloudera Manager or by editing $HADOOP_HOME/hadoop/conf/core-site.xml on each DataNode.
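As a rough sketch of what those three properties look like, the Python snippet below renders a core-site.xml fragment using the standard S3A property names (fs.s3a.access.key, fs.s3a.secret.key, fs.s3a.endpoint). The key values and the endpoint hostname are placeholders I made up for illustration, not real HCP values, and a real cluster may need additional S3A settings beyond these three.

```python
import xml.etree.ElementTree as ET


def s3a_properties(access_key, secret_key, endpoint):
    """Return the three S3A properties named in the text as a dict."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": endpoint,
    }


def to_core_site_xml(props):
    """Render properties in the <configuration>/<property> layout of core-site.xml."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder credentials and a hypothetical endpoint; substitute your own.
    props = s3a_properties(
        "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", "tenant.hcp.example.com"
    )
    print(to_core_site_xml(props))
```

The printed property elements would be merged into the existing configuration element of core-site.xml rather than replacing the file.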
The demo video below shows the configuration of HCP in HDFS, along with a few examples of interacting with HCP from Hadoop via S3A. The video can be found here: https://youtu.be/Z-d62OedZP8
The Hitachi Content Solutions Engineering (CSE) team is currently building a solution that will seamlessly offload data from HDFS to HCP with complete transparency and zero change to applications. As I demonstrated in this post, existing Hadoop functionality supports expanding Hadoop storage onto HCP by having end users and applications use the S3A interface. Unfortunately, adopting S3A requires significant application change, and most application owners would prefer not to change application configurations and workflows to accomplish this. Customers have voiced this concern, and the CSE team is developing a solution that combines existing Apache HDFS storage management capabilities with new technology that will allow HDFS to tier directly to HCP.