I was sitting in an airport lounge reading a fantastic blog by Shawn Rosemarin about our “Data Stairway to Value,” and I thought it might be good to elaborate on one of the steps that everyone is trying to do but is not necessarily doing well: Activate.
Let’s look at the stairway and what ‘Activate’ means.
To sum it up, Activate is the data integration, the analytics, the machine learning, and the data science your organization either is or isn’t doing.
It really makes me think of the start of my career, the better part of twenty years ago, when I was a T-SQL programmer working in Teradata, and I hear my parents’ words in my head: things were much simpler then. My job was to move data from our billing system or our CRM to the enterprise data warehouse and then to run reports. The data sources, the integration, and ultimately the reports were pretty simple.
So, what’s changed? Let’s take a look at it in a couple of figures: figure one shows the traditional way of doing things, and figure two shows the modern data architecture.
Right away you notice the complexity. In the legacy model you moved data from one relational source to a relational destination; sure, there may have been some transformations, but nothing a union, join, or function couldn’t easily do for you. In the modern architecture everything is different: the data types, the sources, the destinations, the analytical process. The world is different.
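To make the legacy case concrete, here is a minimal sketch of the kind of relational movement the old model handled comfortably. The table and column names are hypothetical, and Python’s built-in sqlite3 module stands in for a relational engine purely for illustration:

```python
import sqlite3

# In-memory database standing in for a legacy relational source/target.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical billing and CRM tables, the kinds of sources mentioned above.
cur.execute("CREATE TABLE billing (customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE crm (customer_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO billing VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])
cur.executemany("INSERT INTO crm VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

# The entire "transformation": a join plus an aggregate function,
# loading straight into a reporting table.
cur.execute("""
    CREATE TABLE report AS
    SELECT c.name, SUM(b.amount) AS total_billed
    FROM billing b
    JOIN crm c ON c.customer_id = b.customer_id
    GROUP BY c.name
""")
for row in cur.execute("SELECT * FROM report ORDER BY name"):
    print(row)
```

That was the whole job: relational in, relational out, with a join and a function in between.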
So why do we think the same tools will still be able to solve the problems, even though the problems are vastly different?
The tools that have been around since the legacy data architecture were built for just that: legacy work. New tools have been developed to solve these modern challenges.
I believe it is important to call out a couple of items that are absolutely critical to modern data integration and engineering, and that are included in tools like Pentaho but not in legacy applications.
- Hand coding: Back in my SQL days I would write hundreds of lines of code to bring in data from multiple sources, dump it in a temporary table, do some calculations, and then dump it to a reporting table or mart. This took forever, and there were so many points where a single typo would break the entire chain and send me back through all of it to find my error. Being able to drag and drop with little to no hand coding, with just about every kind of native connector available, would have saved me an absolute ton of time and let me get on to the many jobs piling up on my desk.
- Inspect the data in stream: As I said in the first point, hundreds of lines of code take time to run, so I would have to wait (sometimes for hours) for the job to finish and then inspect the data to see if it was accurate. God forbid there was a new or different value; it could break anywhere in the flow or give me a useless table. Being able to inspect the data at any step along the data flow and fix errors there, instead of waiting and ultimately blowing up a table, is incredibly powerful.
- Built for the modern age: Probably the most important item. Legacy ‘ETL’ tools were not engineered for sources that include unstructured data, data at the edge, or real-time streams, or for targets such as Hadoop, and they had to be rebuilt to support any of it. For a tool to be effective it needs to be built with the correct purpose in mind, not re-engineered decades later.
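To see why step-level inspection matters, here is a rough sketch of the hand-coded pattern the first two points describe. Everything here is invented for illustration: the record shapes, the field names, and a hypothetical preview() helper playing the role that a visual tool’s step-level data inspection would play.

```python
# A hand-coded, multi-step flow: each step is a small function, and
# preview() lets us eyeball records mid-stream instead of waiting for
# the whole job to finish and blow up a target table.

def extract():
    # Hypothetical source rows, including one "new or different" value
    # (a missing amount) of the kind that used to break whole jobs.
    return [
        {"customer": "Acme", "amount": "100.0"},
        {"customer": "Globex", "amount": None},
        {"customer": "Acme", "amount": "50.0"},
    ]

def clean(rows):
    # Fix bad values at the step where they appear, rather than
    # discovering them hours later in a broken reporting table.
    for row in rows:
        yield {**row, "amount": float(row["amount"] or 0.0)}

def preview(rows, n=2):
    # Inspect the first n records flowing out of a step.
    rows = list(rows)
    for row in rows[:n]:
        print("preview:", row)
    return rows

def aggregate(rows):
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

report = aggregate(preview(clean(extract())))
print(report)
```

In a visual tool that inspection point exists at every step for free; hand coded, you have to build and maintain it yourself, for every flow.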
Pentaho solves all of these problems. No longer do you have to hand code hundreds of lines, because it is drag and drop; no longer do you have to wait for jobs to end, because you can visualize the data at any step; and because Pentaho was built for the modern world, it works seamlessly with technologies like Hadoop and Spark. Use a tool that was designed for the problems you are facing now, not the problems you faced ten years ago.
The bottom line is that the world has changed. You are no longer working in a linear, relational world, and expecting the tools that solved the legacy problems to also solve the modern ones just won’t cut it anymore.