This post was written by Chris Deptula and originally published on Tuesday, August 8, 2017
The data engineering industry has been through heavy disruption over the last five years. The increasing adoption of Hadoop, NoSQL databases, columnar analytic databases and Cloud platforms have all changed data integration strategy. One change as a result of these new platforms has been a move from a more traditional ETL (Extract, Transform, Load) design to an ELT (Extract, Load, Transform) design.
What is ELT?
ELT and ETL are both similar in that they are high level strategies to move data from one system to another. The difference is where data transformation processing occurs. In ETL the transformation is done in an external application such as Pentaho, Talend, or Informatica with the results loaded into the target. ELT does the transformation in the target system. The data is first loaded into the target system exactly as it looks in the source and then scripts are run inside the target platform to perform the necessary transformations.
ELT is not a new concept. It has been around for years and is a common approach. However, in many situations, it introduces difficult challenges that make ETL a more attractive option. Some of these challenges include:
- ELT increases load on the target system during data loads. ETL isolates the processing power required to transform the data outside the target database. ELT puts all of this load on the target database, perhaps causing performance conflicts with active analytic queries.
- ELT processes usually require scripting. For databases this means writing SQL queries. For Hadoop this means writing Spark, Pig, Hive, or MapReduce. These scripts can be very complex and require significant expertise to maintain. Think 100+ line SQL queries.
- Although the Transform is done by scripts inside the target platform, a solution is still required for Extract, Load, and overall job orchestration. This generally requires custom scripts or the use of traditional ETL tools such as Pentaho, Talend, or Informatica. (And if you’re using a traditional ETL tool for Extract and Load, the temptation to leverage its Transformation capabilities grows!)
Performance. Data engineers are always under pressure to load the data into the target as quickly as possible. The faster the data can be loaded, the sooner the data can be acted upon -- improving the speed of business.
Performance differences between ELT and ETL when using classic relational databases such as Oracle, SQL Server, MySQL, and Postgres have been negligible. However, the introduction of Hadoop and analytic databases has tipped the scale, making ELT orders of magnitude faster than ETL when using these platforms.
Hadoop solved the system performance problem by massively scaling out processing. The performance benefits of MapReduce and Spark moving the processing to the data and the volume of data processed in Hadoop make ELT the obvious choice. Analytic databases, on the other hand, have significant performance degradation when you use ETL processing. These databases are very slow when processing a multitude of record-level updates and inserts. In some cases, a single record insert can take seconds. On the other hand, analytic databases are very good at set-based operations requiring users to bulk load data into the database, and then execute SQL scripts to manipulate (i.e. Transform) the required records in batch.
ELT is faster for these types of new systems, but as discussed, it still requires manual writing of scripts, limiting the audience that is able to maintain it. Pentaho recognizes this problem and sees a couple of new products on the market to help solve them.
ELT Technology Options
For the Cloud-based Analytic databases Amazon Redshift, Snowflake, and Google BigQuery, Pentaho has experimented with Matillion, a Cloud-based visual ELT environment. There is no software license for Matillion, you simply pay a small upcharge on an EC2 instance and pay for Matillion by the hour. Matillion uses the same concepts as traditional ETL tools with orchestration jobs and transformation steps that read data from the database and transform the data in a visual development environment. The difference between the traditional ETL and Matillion is Matillion compiles the transformations into SQL that are run in the database rather than externally. Users familiar with traditional ETL tools will find the Matillion development interface familiar.
For ELT on Hadoop, at Pentaho, we find our adaptive execution layer to be a powerful solution. This layer allows users to run ordinary Pentaho transformations as Spark jobs on a Cloudera Hadoop cluster. Existing Pentaho users can take advantage of parallel processing on Hadoop using Spark with no additional training. The written transformations are the exact same as Pentaho developers are used to writing maintaining the same level of maintainability as traditional ETL transformations. We expect the adaptive execution layer to continue to evolve into the preferred way of developing ETL. Write one, run anywhere.
The Future of ETL and ELT
It’s clear that new data management solutions are upsetting the traditional ways of transporting and organizing data. With every release, traditional ETL providers are adding capabilities to support evolving data management environments. And new vendors are also rapidly entering the market with purpose built tools to solve these new challenges. As these new data management solutions mature, the purpose built tools will continue to alter the traditional ETL and ELT landscape. With all of these changes, expect to see best practices for data management evolve along with them. Pentaho can help you stay up to date with the latest data management solutions and designs.