Anyone who is familiar with the limitations of the Hadoop clustered storage architecture knows there is a huge opportunity for purpose-built mass storage like Hitachi Content Platform as a data offload target in this environment. If you are not familiar with those limitations, see Nick DeRoo's post A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform (Part1). There are a variety of possible solutions to enable data offload, and the viability of each ultimately rests on whether it satisfies customer requirements. The Content Solutions Engineering (CSE) team has had the opportunity to engage with several customers (all large financial institutions) and customer account teams to understand their requirements. What we have heard so far are variations of the following 3 main requirements:
- They want to offload cold data from Hadoop cluster storage to external storage
  - To free capacity for warm/hot data
  - To avoid expanding Hadoop to accommodate PBs of cold data
  - To save money
  - To reduce complexity
- They do not want to change their applications
  - They do not want to move the data
  - They do not want to tag the data to move
- They want uninterrupted access to their cold data
  - Cool/cold data may still be accessed 1-2 times per month/year
  - Data paths/URIs must be unaltered
  - Cold data access may be slower than hot, but must still be fast
  - When cold data is accessed it may be accessed again shortly
Most of the customers we have heard from are looking for all 3 of these requirements, a combination we are referring to as "seamless offload". Seamless offload automatically tiers cold data to external storage freeing internal capacity for new data. It provides uninterrupted access to tiered data and is completely hidden from the application layer.
In this post we will cover capabilities that enable these requirements. We will evaluate the capabilities of the three main Hadoop distributions: Hortonworks (HDP), Cloudera (CDH), and MapR. We will also look at the capabilities of Alluxio, a 3rd party solution previously evaluated by the CSE team and discussed in this blog post: Certification of HCP with Alluxio.
This section describes several capabilities in Hadoop platforms and software that are relevant to the offload use case. The terminology used in this section is not necessarily standard, as each vendor may use different terms to describe its version of these features.
Seamless Offload
Seamless offload refers to the ability of the Hadoop platform and software to tier cold data from cluster storage to external storage without affecting the application layer in any way (other than slower retrieval of cold data). This is achieved by combining several of the capabilities listed below.
S3A
S3A is the S3 protocol connector that ships with recent versions of Apache Hadoop; it supersedes the earlier, now deprecated, S3N connector. S3A allows applications to directly access data in an S3 bucket, with full read and write capability. S3A does not support cache on read or rehydration, so every read must be serviced directly from the S3 bucket.
To use S3A, applications must address the bucket directly with a URI like s3a://bucket/folder/object.foo. Applications can move data between HDFS and S3A, and use tools like DistCp to bulk copy data. That said, the application is entirely responsible for managing data movement and keeping track of where the data is. Also, the S3A protocol is different from the HDFS protocol, so interfacing with S3A will require separate API logic.
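To make the "application is responsible" point concrete, here is a minimal sketch of what an application or operator has to assemble by hand when using S3A directly. The namenode address, bucket name, and paths are hypothetical; `hadoop distcp <source> <target>` is the standard DistCp invocation, but the command is only built here, not executed.

```python
from urllib.parse import urlparse


def split_s3a_uri(uri):
    """Split an s3a:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3a":
        raise ValueError(f"not an s3a URI: {uri}")
    # The authority part of an s3a URI is the bucket; the path is the object key.
    return parsed.netloc, parsed.path.lstrip("/")


def distcp_command(source, target):
    """Build the argument list for a DistCp bulk copy from source to target."""
    return ["hadoop", "distcp", source, target]


# Hypothetical locations, for illustration only.
src = "hdfs://namenode:8020/data/2017/events"
dst = "s3a://cold-bucket/data/2017/events"

print(split_s3a_uri(dst + "/part-00000.parquet"))
print(" ".join(distcp_command(src, dst)))
```

Note that after such a copy the data lives under a different URI, and it is up to the application to remember that.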
Unified Namespace
Unified namespace refers to the ability to read both HDFS and S3 data in the same namespace using the same protocol. Required for seamless offload.
Outside of a seamless offload solution, the primary value of unified namespace is simplification of application coding. It provides the ability to read data from an S3 bucket through a previously mounted hdfs:// file system without having to use different API logic.
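One way to picture a unified namespace is as a mount table that maps a path prefix to a backing store. The sketch below is purely conceptual and does not represent any vendor's implementation; the mount points, namenode address, and bucket name are hypothetical.

```python
def resolve(path, mounts):
    """Return the backing-store URI for a namespace path (longest-prefix match)."""
    best = None
    for prefix in mounts:
        base = prefix.rstrip("/")
        if path == prefix or path.startswith(base + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(path)
    # Swap the matched prefix for the backing-store root, keep the rest of the path.
    return mounts[best].rstrip("/") + path[len(best.rstrip("/")):]


# Hypothetical mount table: everything lives in HDFS except /archive.
mounts = {
    "/": "hdfs://namenode:8020",
    "/archive": "s3a://cold-bucket/archive",
}

print(resolve("/data/hot/a.csv", mounts))
print(resolve("/archive/2016/b.csv", mounts))
```

The application only ever sees paths like /archive/2016/b.csv; which store serves them is decided by the mount table.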
Unified Namespace Write
Same as unified namespace but allows writing to the S3 bucket. Required for seamless offload.
Read Caching
Read caching, or rehydrating, refers to persisting data that has recently been recalled from S3 in a cache local to the Hadoop cluster, whether in RAM, on SSD, or on HDD.
This provides much faster subsequent data access, because data that has been accessed recently is generally more likely to be accessed again in the near future. While not a key capability for seamless offload, read caching is highly desirable for enhancing the performance of the offload solution.
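The idea can be sketched as a small read-through cache with least-recently-used eviction. This is an illustration of the general technique only; the class name, the `fetch` callable, and the capacity are assumptions, and real products cache at a different granularity and with different eviction rules.

```python
from collections import OrderedDict


class ReadCache:
    """Read-through LRU cache in front of a slower object store."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # callable that reads an object from the remote store
        self.capacity = capacity      # maximum number of cached objects
        self.entries = OrderedDict()  # insertion order doubles as recency order
        self.remote_reads = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        data = self.fetch(key)                # cache miss: go to the remote store
        self.remote_reads += 1
        self.entries[key] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return data


# Hypothetical remote store, stood in for by a dict.
remote = {"a": b"alpha", "b": b"beta"}
cache = ReadCache(remote.__getitem__, capacity=2)
cache.read("a")
cache.read("a")   # served locally, no second remote read
print(cache.remote_reads)
```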
File Tiering Service
The file tiering service moves data between the storage tiers defined in the Hadoop configuration. Required for seamless offload.
Automatic Tiering Policy
Auto tiering policy is a rules-based policy that automatically identifies data to be moved based on a defined set of rules. Required for seamless offload.
It is important to differentiate between automatic and manual tiering policies. Apache Hadoop has a feature called "storage policies", which are used to flag the data to be tiered. Hadoop storage policies are not automatic or rules-based; they must be applied manually (by the application) to individual files or directories. Conversely, automatic tiering policies are applied by the platform, not by the application, based on the policy's rules. For example, I might set up a cold data policy for data that was last modified more than 270 days ago and is larger than 10MB. Any data that matches the criteria will automatically be tiered without requiring additional action from the application.
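The example rule above (older than 270 days and larger than 10MB) could be expressed roughly as follows. This is a sketch of the rule evaluation only, not of any platform's policy engine; the `FileInfo` type and the file paths are hypothetical, and 10MB is assumed to mean 10 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    modified: datetime


def is_cold(info, now, min_age_days=270, min_size_bytes=10 * 1024 * 1024):
    """Return True when the file matches the example cold-data rule."""
    old_enough = now - info.modified > timedelta(days=min_age_days)
    large_enough = info.size_bytes > min_size_bytes
    return old_enough and large_enough


def select_for_tiering(files, now):
    """Return the paths the platform would tier, with no action from the application."""
    return [f.path for f in files if is_cold(f, now)]


# Hypothetical file listing.
now = datetime(2019, 1, 1)
files = [
    FileInfo("/data/old_big", 50 * 1024 * 1024, datetime(2018, 1, 1)),
    FileInfo("/data/old_small", 1024, datetime(2018, 1, 1)),
    FileInfo("/data/new_big", 50 * 1024 * 1024, datetime(2018, 12, 1)),
]
print(select_for_tiering(files, now))
```

Only the file that is both old and large is selected; with manual storage policies, the application would have to flag that path itself.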
URI Preservation
URI preservation refers to the capability of the platform or software to move data from cluster storage to external storage while allowing applications to continue to access the data using the original URI path. Required for seamless offload.
This is similar to file stubbing technology used in HDI and other cloud gateway solutions. While the data has been moved, the application's view of the data is unaltered, and the application's ability to access the data using the original path is uninterrupted.
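A toy model may help here: the namespace keeps the path, and only the record of where the bytes live changes. The class and store names below are hypothetical and the sketch ignores everything a real stubbing implementation has to handle (metadata, partial reads, failures).

```python
class Namespace:
    """Maps stable application paths to wherever the bytes currently live."""

    def __init__(self):
        self.location = {}   # path -> "cluster" or "external"
        self.cluster = {}    # path -> bytes held on cluster storage
        self.external = {}   # path -> bytes held on external storage

    def write(self, path, data):
        self.cluster[path] = data
        self.location[path] = "cluster"

    def tier(self, path):
        # Move the bytes, but leave the path in the namespace untouched.
        self.external[path] = self.cluster.pop(path)
        self.location[path] = "external"

    def read(self, path):
        store = self.cluster if self.location[path] == "cluster" else self.external
        return store[path]


ns = Namespace()
ns.write("/data/2016/report.csv", b"q1,q2\n")
before = ns.read("/data/2016/report.csv")
ns.tier("/data/2016/report.csv")
after = ns.read("/data/2016/report.csv")   # same path, data now served externally
print(before == after, ns.location["/data/2016/report.csv"])
```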
Block Level Tiering
Block level tiering allows policies to be applied at the storage block level, as opposed to policies that are applied at the whole file or whole directory level. This allows parts of very large files to be tiered without requiring the whole file to be tiered. Required for seamless offload of table or stream data. This capability is not required for seamless offload of file data.
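As a rough illustration, consider a large file whose blocks are evaluated individually. The block size, file size, and per-block access ages below are made up, and tracking "days since last access" per block is an assumption of this sketch rather than a description of how any platform does it.

```python
def block_ranges(file_size, block_size):
    """Return (start, end) byte ranges for each block of a file."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]


def cold_block_indexes(days_since_access, threshold_days):
    """Return the indexes of blocks not accessed within the threshold."""
    return [i for i, days in enumerate(days_since_access) if days > threshold_days]


# Hypothetical 500 MiB table file stored in 128 MiB blocks.
MIB = 1024 * 1024
ranges = block_ranges(500 * MIB, 128 * MIB)
ages = [400, 350, 12, 1]   # days since each block was last read (made up)
cold = cold_block_indexes(ages, threshold_days=270)
freed = sum(ranges[i][1] - ranges[i][0] for i in cold)
print(cold, freed // MIB)
```

Here only the first two blocks would be tiered, while the recently read blocks of the same file stay on cluster storage.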
Tiering of Table and Stream Data
It is my current understanding that block level tiering is the key to this capability, so for the purposes of this post I will keep them together.
It is possible to offload tables (i.e. the underlying files behind the tables) in their entirety to an S3 bucket. This is not tiering, though; it is a manual migration. It is unclear at this point how tables stored in an S3 bucket would perform, but I suspect performance would be poor.
Now that we have introduced the capabilities required for offload of Hadoop data to an S3 bucket, let's look at the big 3 Hadoop distributions and Alluxio to see which capabilities each possesses.
|Capability||HDP 3.1||CDH 6.0||MapR 6.1||Alluxio 1.8|
|Unified Namespace (1)|
|Unified Namespace Write (1)|
|File Tiering Service (1)|
|Automatic Tiering Policy (1)|
|URI Preservation (1)|
|Block Level Tiering|
|Tiering of Table and Stream Data||N||N||Y||N|
1 - required for seamless offload
2 - HDFS-9806 is read only
As you can see from the matrix above, only MapR 6.1 has all of the capabilities to enable seamless offload of data from cluster storage to an S3 bucket. Apache Hadoop is adding capabilities and may catch up at some point (see HDFS-12090 and HDFS-7343), but today you cannot do seamless offload with HDP or CDH without bringing in other technology. Alluxio can be a valuable addition in HDP and CDH environments, particularly when cache on read is required to enable analytics of data stored in S3. However, because Alluxio lacks tiering capabilities and URI preservation, it cannot be seen as an enabler for a seamless offload solution.
Big data platforms like Hadoop have accumulated massive amounts of data, and are continuing to grow at a rapid rate. The owners of these platforms are clamoring for options to scale capacity more cheaply by tiering cold data to a less expensive tier. While not all big data platforms have the built-in capabilities necessary to support the customers' requirements, the CSE team is busy exploring options for each of these platforms to help our customers offload their data to HCP.
The MapR 6.1 release went GA on September 29th 2018 and the CSE team is currently testing these capabilities. We will be posting updates to the blog, so check back soon. Until then, if you have any comments or feedback, or see capabilities we missed or mischaracterized, please let us know in the comment section.