The Hitachi Vantara Content Solutions Engineering Team has certified the Hitachi Content Platform (HCP) as an Alluxio understore. Certification testing involved running several big data benchmarking tools against a Hadoop cluster that used Alluxio to virtualize the Hadoop Distributed File System (HDFS) and HCP storage. After reviewing the results, Alluxio engineering approved the certification of HCP as an Alluxio understore.
Big data platforms are known for delivering high-performance analytics at massive scale. They achieve this by co-locating data and compute on commodity hardware nodes where storage and compute resources are balanced. When additional compute or storage resources are required, nodes are added to the cluster. Over time these clusters can grow to hundreds or thousands of nodes and accumulate great quantities of older, less active cold data. When this occurs, enterprises are forced to scale Hadoop clusters well beyond their computational requirements in order to meet the growing storage requirements.
Big data application developers also face the challenge of unlocking the value of unstructured data stored in HCP. The HCP metadata query engine and Hitachi Content Intelligence give developers powerful tools for data discovery. However, they do not have a tightly integrated, performance-optimized method to access and analyze that data. They need a solution that directly exposes their HCP data to their big data applications while minimizing the cost of repetitive data retrieval.
The Hitachi Vantara Content Solutions Engineering Team has partnered with Alluxio to certify a solution that addresses these challenges. Together, HCP and Alluxio empower enterprise customers to extract big data value from their object data and to recover precious HDFS capacity occupied by cold data.
HCP and Alluxio
Hitachi Content Platform (HCP) is Hitachi Vantara's market-leading object storage platform. Available as an appliance or as software only, HCP scales to store many billions of objects at multi-petabyte scale. HCP delivers durable data protection with much greater storage efficiency than can be achieved with the Hadoop standard three-replica configuration. HCP protects data both by replication and by geographically distributed erasure coding, where fewer than two full copies of the data must be stored to deliver the same durability as three replicas.
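The capacity argument can be made concrete with a little arithmetic. The following sketch compares the raw capacity consumed per byte of user data under full replication and under a k+m erasure code. The function names are ours, and the 4+2 geometry is a hypothetical example chosen only for illustration; this brief does not state which erasure coding geometry HCP uses.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data with full replication."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas)


def erasure_coding_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data with a k+m erasure code."""
    if data_fragments < 1 or parity_fragments < 0:
        raise ValueError("need at least one data fragment and no negative parity")
    return (data_fragments + parity_fragments) / data_fragments


if __name__ == "__main__":
    # Hadoop's standard configuration keeps three full copies.
    print(replication_overhead(3))        # 3.0
    # Hypothetical 4+2 geometry (an assumption, not an HCP setting).
    print(erasure_coding_overhead(4, 2))  # 1.5
```

Under these assumed numbers, the erasure-coded layout stores 1.5 bytes per byte of user data, which is below the "two full copies" threshold mentioned above and half of what three replicas consume.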
Alluxio is a data access layer that lies between compute and heterogeneous storage resources. Alluxio unifies data access at memory speed and bridges big data frameworks with multiple storage platforms. Applications simply connect to Alluxio to access data stored in any underlying storage system, including HCP. Using the Alluxio global namespace, existing data analytics applications such as Apache Hadoop MapReduce, Apache Spark, and Presto can continue using the same industry-standard interfaces to access data federated between HCP and native Hadoop storage.
Use Cases Validated
Two primary use cases have been validated with HCP and Alluxio in a big data ecosystem. The first use case is to simplify access to data stored in HCP so that Hadoop applications can perform analytics on it. An HCP bucket virtualized with Alluxio can be accessed by big data applications using Alluxio's Hadoop Compatible File System interface, which mimics the HDFS interface. By simply changing the URI scheme from "hdfs://" to "alluxio://", big data applications are able to access and analyze data in an HCP bucket using the familiar HDFS interface.
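As an illustration of the scheme change, here is a minimal sketch of a helper that rewrites an "hdfs://" URI into an "alluxio://" URI. The function name and the host names are hypothetical, and the sketch assumes the data is visible at the same path in the Alluxio namespace, which depends on how the under storage is mounted. Port 19998 is Alluxio's default master RPC port.

```python
from urllib.parse import urlsplit, urlunsplit


def to_alluxio_uri(hdfs_uri: str, alluxio_authority: str) -> str:
    """Swap the scheme and authority of an hdfs:// URI, keeping the path.

    Assumes the same path exists in the Alluxio namespace (an assumption
    that depends on the mount layout).
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri}")
    return urlunsplit(
        ("alluxio", alluxio_authority, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical host names, used only for illustration.
    print(to_alluxio_uri("hdfs://namenode:8020/data/logs", "alluxio-master:19998"))
```

An application that reads its input location from configuration would then only need the rewritten URI, provided the Alluxio client library is on its classpath.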
Figure: Alluxio provides several client interfaces and virtualizes a variety of storage types.
The second use case is to simplify the movement of data between HDFS and HCP in order to enable the offload of cold data. Virtualizing both HDFS and an HCP bucket with Alluxio provides a unified namespace for Hadoop applications to read and write data to and from both HCP and HDFS. Applications can then move data from HDFS to the HCP bucket as easily as moving data from one directory to another.
The downside of moving data from HDFS to HCP is that analytics performed on cloud data are slower than analytics performed on data stored locally in HDFS. Alluxio addresses this issue by providing HDD, SSD, and RAMDISK cache on the Hadoop nodes. Cloud data can be promoted to the Alluxio cache for analysis, enabling memory-speed analytics with object-store savings. We verified the performance benefits of promoting cloud data to the Alluxio cache by analyzing the same HCP data set multiple times. After the data in HCP was promoted to the Alluxio cache during the initial analysis, analyzing that data from the Alluxio cache performed comparably to analyzing a local HDFS data set.
Software Configuration and Test Methodology
To test HCP and Hadoop together, we installed a Hadoop cluster on four D51B-2U nodes running CentOS 7 Linux and configured for 10G networking. Hadoop version HDP-22.214.171.124 was provisioned and managed using Apache Ambari software. All four nodes had the Hadoop software required by the benchmark test suites, including but not limited to HDFS, Spark2, and MapReduce2. In addition, all four Hadoop nodes were running Alluxio Enterprise 1.7.1 software.
All S3 bucket tests were performed against HCP 8.1 software running on a 4-node G10 cluster with a 10G network and VSP G600 storage volumes. HCP network traffic was routed through a Pulse Secure Virtual Traffic Manager load balancer.
The certification testing was performed using various big data performance testing tools, including HiBench, DFSIO, and TPC-DS. Each test was run three times. The first run was a baseline that used the S3A protocol to go directly to HCP. Two consecutive runs then used Alluxio with HCP as the understore. Validation of the test results involved verifying that the performance of recalling data from HCP with Alluxio was comparable to the S3A baseline, and that subsequent analyses of previously recalled data showed the performance benefits of being locally cached by Alluxio.
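The two validation criteria can be expressed as a small check over the three elapsed times of a test. This is only a sketch of the logic described above: the function name and the 25% tolerance used to interpret "comparable" are our assumptions, and the numbers in the usage lines are illustrative placeholders, not measurements from this certification.

```python
from typing import Tuple


def validate_run(
    s3a_seconds: float,
    alluxio_cold_seconds: float,
    alluxio_warm_seconds: float,
    tolerance: float = 0.25,  # assumed meaning of "comparable"; not from the source
) -> Tuple[bool, bool]:
    """Return (recall_ok, cache_ok) for one benchmark.

    recall_ok: the first Alluxio run is within `tolerance` of the S3A baseline.
    cache_ok:  the second Alluxio run is faster than the first Alluxio run.
    """
    for value in (s3a_seconds, alluxio_cold_seconds, alluxio_warm_seconds):
        if value <= 0:
            raise ValueError("elapsed times must be positive")
    recall_ok = alluxio_cold_seconds <= s3a_seconds * (1 + tolerance)
    cache_ok = alluxio_warm_seconds < alluxio_cold_seconds
    return recall_ok, cache_ok


if __name__ == "__main__":
    # Placeholder timings in seconds, chosen only to exercise the check.
    print(validate_run(100.0, 110.0, 40.0))
```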
HCP-Specific Configuration Settings in Alluxio
Alluxio exposes an UnderFileSystem interface that enables HCP to be configured as the underlying storage for the Alluxio filesystem. When HCP is configured as the understore in Alluxio, HCP acts as the primary storage backend for all applications that interact with Alluxio. This configuration can completely replace HDFS or coexist with HDFS in a big data ecosystem. In this configuration, the root directory of the Alluxio filesystem is mapped to the root directory of the HCP namespace. This creates a one-to-one mapping between files and directories in the Alluxio filesystem and those in an HCP namespace. To configure HCP as an understore, the following properties were set in the $alluxioHome/conf/alluxio-site.properties configuration file:
Some of these settings may not be necessary but represent the configuration used in our testing. For example, list.objects.v1=true was originally set for HCP 8.0 compatibility, but was likely not necessary for HCP 8.1 testing. accessKeyId and secretKey are the base64-encoded username and md5-encoded password of the HCP namespace data access user.
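The credential encoding can be sketched as follows. The helper derives the two values from a plain username and password and renders them as property lines. Several details here are assumptions rather than the tested configuration: the md5 value is taken to be the hexadecimal digest, the function names and sample values are hypothetical, alluxio.underfs.address is the root under storage property we assume for Alluxio 1.x, and alluxio.underfs.s3a.list.objects.v1 is our guess at the full key behind the list.objects.v1 setting mentioned above.

```python
import base64
import hashlib
from typing import Tuple


def hcp_s3_credentials(username: str, password: str) -> Tuple[str, str]:
    """Return (access key, secret key) for an HCP data access user.

    Access key: base64 of the username.
    Secret key: md5 of the password, assumed to be the hex digest.
    """
    access_key = base64.b64encode(username.encode("utf-8")).decode("ascii")
    secret_key = hashlib.md5(password.encode("utf-8")).hexdigest()
    return access_key, secret_key


def render_site_properties(
    username: str, password: str, endpoint: str, bucket_uri: str
) -> str:
    """Render property lines for alluxio-site.properties (a sketch only)."""
    access_key, secret_key = hcp_s3_credentials(username, password)
    props = {
        "alluxio.underfs.address": bucket_uri,          # assumed root UFS key
        "alluxio.underfs.s3.endpoint": endpoint,
        "alluxio.underfs.s3a.list.objects.v1": "true",  # assumed full key name
        "aws.accessKeyId": access_key,
        "aws.secretKey": secret_key,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    # Placeholder credentials and endpoint, for illustration only.
    print(
        render_site_properties(
            "user",
            "password",
            "tenant.HCPSystem.SubDomain.Domain.com",
            "s3a://namespace/directory/",
        )
    )
```

Treat the output as a starting point to compare against Alluxio's own documentation for the version in use, not as a verified configuration.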
Another method for configuring HCP with Alluxio is to mount HCP to a specific directory in the Alluxio filesystem. The primary use case for this is non-seamless HDFS offload. Alluxio is configured with HDFS as the under filesystem, and HCP is mounted to a subdirectory within the root filesystem of Alluxio. The end result is to present both HDFS and HCP as a single filesystem. This is accomplished by following Alluxio’s documentation to configure Alluxio with HDFS, and then using the alluxio fs mount command to mount an HCP namespace as shown in the following command:
./bin/alluxio fs mount --option aws.accessKeyId=<base64_Username> --option aws.secretKey=<md5_Password> --option alluxio.underfs.s3.endpoint=tenant.HCPSystem.SubDomain.Domain.com /mnt/HCP s3a://namespace/directory/
Properties not explicitly set in the mount command are inherited from the $alluxioHome/conf/alluxio-site.properties configuration file, as described above.
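The inheritance rule can be pictured as a simple override: site-wide properties form the defaults, and any --option given on the mount command replaces the matching key for that mount point. The sketch below is a conceptual model only, not Alluxio's implementation, and the function name and sample values are hypothetical.

```python
from typing import Dict


def effective_mount_properties(
    site_properties: Dict[str, str], mount_options: Dict[str, str]
) -> Dict[str, str]:
    """Merge mount-level options over site-level defaults.

    Keys present in mount_options win; all other keys are inherited.
    """
    merged = dict(site_properties)
    merged.update(mount_options)
    return merged


if __name__ == "__main__":
    # Placeholder values for illustration only.
    site = {
        "alluxio.underfs.s3.endpoint": "default.example.com",
        "aws.accessKeyId": "site-key",
    }
    mount = {"alluxio.underfs.s3.endpoint": "tenant.HCPSystem.SubDomain.Domain.com"}
    print(effective_mount_properties(site, mount))
```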
HCP software versions prior to HCP 8.1 have not been certified to work with Alluxio. There are known functional differences between 8.1 and prior versions; for example, the multi-object delete (bulk delete) API is not implemented in earlier versions of HCP. Alluxio has a configuration option to disable invocation of this API, but this was not tested as part of this certification. On each HCP namespace to be configured as an under filesystem address in Alluxio, the S3 compatible API must be enabled along with the ‘Optimize for Cloud’ feature. The ‘Optimize for Cloud’ feature must be enabled for HCP to support Multipart Upload. Depending on scale and workload, the following configuration settings may need to be tuned:
For More Information
For more information about Alluxio, please refer to these links which describe Alluxio architecture and data flow, or reach out directly to the Alluxio team at firstname.lastname@example.org. If you have questions about the solution described in this brief or have an opportunity where you think this solution may be a fit, please reach out to our ISV team at ISVAlliances@hitachivantara.com.