Geoffrey Marsh

What Temperature Is Your Data?

Blog Post created by Geoffrey Marsh on Apr 10, 2018

One question I keep getting is whether an object store replaces Hadoop, or vice versa. I get it often enough that I thought I would write about it.

Both platforms are relatively new, so I understand the confusion. But let’s take a step back and look at what each was designed for, because that explains why this is not a replacement conversation: the two platforms are incredibly complementary.

Object stores were designed to store data as objects in an economical way.

Hadoop is built around a distributed file system (HDFS), so it can hold relational and unstructured data alike, and it was designed as a replacement for expensive enterprise data warehouses (EDWs).

Now I know that some of my colleagues would argue I didn’t do either justice, but I really just wanted to briefly highlight the differences. What I really want to concentrate on is the fact that not all data is created equal, and it should not be treated as if it were.

So, let’s talk about a concept I am a big fan of, one I believe should be the basis for all data architecture decisions: cold, warm, and hot data. Let me explain.

Most organizations have data they use at different times, and this is where the concept of cold, warm, and hot comes in. So, what are they, and how do different storage and compute options let you get the most out of each?

Cold Data: This is data that you wouldn’t use regularly, but maybe you are in a regulated industry or you have legislation that requires you to build out reports at certain intervals; financial services comes to mind. Just about every organization needs to keep this data, but you don’t need to access or work with it often (some organizations I know have data they use annually or even bi-annually), so a platform that is cost effective while still allowing you to move this information to an analytical engine would be your best option. This is a great use case for an object store or content platform.

Warm Data: This is the data you will likely use a little more often, but for common reporting or analytics. This tier is still not the best option for weeding through droves and droves of data to find that nugget of gold, but it will suffice for some basic analytical querying. This is where HDFS fits in. You would also want to be able to move data between the HDFS engine and the object store depending on the use case, so a tool that does that easily while indexing and managing the metadata will be key.

Hot Data: This is your true data science workbench. This is the down and dirty digging through a haystack for that needle. Maybe it is streaming data sets, maybe it is an AI engine, or maybe you are using Spark or something similar to create decision models for your organization. You likely will not keep much data in this engine, but rather move it in from either of the two previous tiers as needed.
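To make the three temperatures a bit more concrete, here is a minimal Python sketch of how you might tag data sets by how often they are read and decide where each one should live. Everything in it is an assumption for illustration: the `reads_per_year` metric, the threshold values, the platform labels, and the names `Dataset`, `classify`, `place`, and `plan_moves` are all hypothetical, not part of any product or of Hadoop itself.

```python
from dataclasses import dataclass
from enum import Enum


class Temperature(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# Hypothetical mapping that follows the tiers described above.
PLATFORM = {
    Temperature.COLD: "object store",
    Temperature.WARM: "HDFS",
    Temperature.HOT: "data science workbench",
}


@dataclass
class Dataset:
    name: str
    reads_per_year: int  # assumed metric: how often the data is accessed


def classify(ds, warm_min=12, hot_min=365):
    """Assign a temperature from access frequency.

    The thresholds are illustrative assumptions, not recommendations.
    """
    if ds.reads_per_year >= hot_min:
        return Temperature.HOT
    if ds.reads_per_year >= warm_min:
        return Temperature.WARM
    return Temperature.COLD


def place(ds):
    """Return the platform label for the data set's temperature."""
    return PLATFORM[classify(ds)]


def plan_moves(placements):
    """Given (Dataset, current_platform) pairs, list the moves needed.

    Each move is a (name, from_platform, to_platform) tuple.
    """
    moves = []
    for ds, current in placements:
        target = place(ds)
        if target != current:
            moves.append((ds.name, current, target))
    return moves


if __name__ == "__main__":
    inventory = [
        (Dataset("regulatory_filings", 2), "HDFS"),
        (Dataset("monthly_sales_report", 12), "HDFS"),
        (Dataset("clickstream", 5000), "HDFS"),
    ]
    for move in plan_moves(inventory):
        print(move)
```

In practice the signal you classify on could be anything the use case cares about (query latency needs, regulatory retention, data volume); access frequency is just the simplest stand-in for the idea that the data and the use case decide the platform.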

These three tiers are core to what I call “The Modern Data Architecture”. Gone are the days of the monolithic EDW. The data and the use case define the storage, the compute, and ultimately the output, so if you are using a linear architecture, you are either unable to satisfy all the use cases you need or you are working harder than you should to make up for it.