
This post was written by Chris Deptula and originally published on Wednesday, February 11, 2015

 

This is the first post in a three-part series on working with small files in Hadoop. Hadoop does not work well with lots of small files; it wants fewer, larger files. This is probably a statement you have heard before. But why does Hadoop have a problem with large numbers of small files? And what exactly does “small” mean? In the first part of this series I will answer these questions. The subsequent parts will discuss options for solving or working around the small file problem.

 

What is a small file?

A small file can be defined as any file that is significantly smaller than the Hadoop block size. The Hadoop block size is usually set to 64, 128, or 256 MB, trending toward increasingly larger block sizes. Throughout the rest of this blog, examples will use a 128MB block size. My rule of thumb is that if a file is not at least 75% of the block size, it is a small file. However, the small file problem does not just affect small files. If a large number of files in your Hadoop cluster are marginally larger than an increment of your block size, you will encounter the same challenges as small files.

 

For example, if your block size is 128MB but all of the files you load into Hadoop are 136MB, you will have a significant number of small 8MB blocks. The good news is that solving the small block problem is as simple as choosing an appropriate (larger) block size. Solving the small file problem is significantly more complex. Notice I never mentioned the number of rows. Although the number of rows can impact MapReduce performance, it is much less important than file size when determining how to write files to HDFS.

 

Why do small files occur?

The small file problem is an issue we frequently see on Hadoop projects.  There are a variety of reasons why companies may have small files in Hadoop, including:

  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • MapReduce jobs are configured with more reducers than necessary, each outputting its own file.  Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, the remaining reducers will process very little data and produce small output files.

 

Why does Hadoop have a small file problem?

There are two primary reasons Hadoop has a small file problem: NameNode memory management and MapReduce performance.

The NameNode memory problem

Every directory, file, and block in Hadoop is represented as an object in memory on the NameNode. As a rule of thumb, each object requires 150 bytes of memory. If you have 20 million files, each requiring one block, your NameNode needs 6GB of memory (20 million file objects plus 20 million block objects, at roughly 150 bytes each). This is obviously quite doable, but as you scale up you eventually reach a practical limit on how many files (blocks) your NameNode can handle. A billion files will require 300GB of memory -- and that assumes every file is in the same folder! Let’s consider the impact of a 300GB NameNode memory requirement...

  • When a NameNode restarts, it must read the metadata of every file from a cache on local disk.  This means reading 300GB of data from disk -- likely causing quite a delay in startup time.
  • In normal operation, the NameNode must constantly track and check where every block of data is stored in the cluster.   This is done by listening for data nodes to report on all of their blocks of data.   The more blocks a data node must report, the more network bandwidth it will consume.   Even with high-speed interconnects between the nodes, simple block reporting at this scale could become disruptive.

 

The optimization is clear.   If you can reduce the number of small files on your cluster, you can reduce the NameNode memory footprint, startup time and network impact.

 

The MapReduce performance problem

Having a large number of small files will degrade the performance of MapReduce processing whether it be Hive, Pig, Cascading, Pentaho MapReduce, or Java MapReduce.    The first reason is that a large number of small files means a large amount of random disk IO.  Disk IO is often one of the biggest limiting factors in MapReduce performance.  One large sequential read will always outperform reading the same amount of data via several random reads.  If you can store your data in fewer, larger blocks, the performance impact of disk IO is mitigated.

 

The second reason for performance degradation is a bit more complicated, requiring an understanding of how MapReduce processes files and schedules resources. I will use MapReduce version 1 terminology in this explanation, as it is easier to explain than YARN, but the same concepts apply to YARN. When a MapReduce job launches, it schedules one map task per block of data being processed. Each file stored in Hadoop occupies at least one block. If you have 10,000 files each containing 10 MB of data, a MapReduce job will schedule 10,000 map tasks. Usually Hadoop is configured so that each map task runs in its own JVM. Continuing our example, you will have the overhead of spinning up and tearing down 10,000 JVMs!

 

Your Hadoop cluster only has so many resources.  In MapReduce v1, to avoid overloading your nodes, you specify the maximum number of concurrent mappers a node can process.  Often the maximum number of concurrent mappers is in the 5 to 20 range.  Therefore, to run 10,000 mappers concurrently you would have to have 500 to 2000 nodes.  Most Hadoop clusters are much smaller than this, causing the JobTracker to queue map tasks as they wait for open slots.   If you have a 20 node cluster with a total of 100 slots, your queue will become quite large and your process will take a long time.  And don’t forget, your job is likely not the only job competing for cluster resources.

 

If instead of 10,000 10MB files you had 800 128MB files, you would only need 800 map tasks. This requires an order of magnitude less JVM maintenance time and results in better disk IO. Even though an individual map task processing 128 MB will take longer than one processing 10 MB, the total processing time will almost always be orders of magnitude faster when processing the 800 larger files.

 

What can you do if you have small files?

Now that we have discussed what constitutes a small file and why Hadoop prefers larger files, how do you avoid the small file problem?  In my next post, I will discuss solutions to the NameNode memory problem as well as some initial options for solving the MapReduce performance problem.  In my third and final blog in this series, I will discuss additional solutions for the performance problem and how to choose the best solution for your situation.

This post was written by Chris Deptula and originally published on Wednesday, November 12, 2014

 

When I proposed this blog to my Pentaho Consulting colleagues, they stared blankly at me and asked: “Seriously? A blog about optimizing file formats in Hadoop?” To which I replied that choosing the optimal file format in Hadoop is one of the most essential drivers of functionality and performance for big data processing and query. It is an interesting and important topic that is often overlooked by Hadoop architects and engineers.

 

In my experience, there are stark performance differences resulting from (im)proper format choices. Pick the wrong format for your Impala data and you could see your query times increase by 5x or more. Choose a format that does not nicely compress and you may consume multiple GB of disk instead of 100’s of MB. Choose a format that does not support flexible schema evolution and you may have to pay with massive re-processing just to add a new field. Unfortunately, there is no single file format that optimizes for all of these concerns. You need to understand the trade-offs and then make educated decisions.

 

Before we dive in, perhaps you are like my incredulous colleagues, asking: "Isn’t Hadoop supposed to be able to process unstructured data? Why do file formats matter so much?" It is true that Hadoop can store and process unstructured data like video, text, etc. This is a significant advantage for Hadoop. However, there are many, perhaps more, uses of Hadoop with structured or “flexibly structured” data. In fact, we often encounter clients using the term “unstructured” when referring to flexibly structured data. Data with flexible structure can have fields added, changed or removed over time, and can even vary amongst concurrently ingested records. Many of the file format choices focus on managing structured and flexibly structured data. Our commentary will revolve around this assumption.

 

This blog will start by identifying the key questions that drive a file format choice. We’ll then contrast the current file format options and close with a discussion on how to leverage a portfolio of formats optimized for your use cases.

When evaluating file formats, what must be considered?

 

  1. What processing and query tools will you be using?

This is the most obvious yet most overlooked question. If you want to use Avro, does the data processing language you’ve chosen include Avro readers and writers? Likewise, suppose you’ve picked the Cloudera distribution because you like Impala. You should probably know that Impala currently does not support ORC format. Conversely, if you’ve chosen Hortonworks and Hive-Stinger, you probably want to avoid Parquet. Yes, it is expected that most of the tools will end up supporting most of the popular formats, but double-check before you make any final decisions.

 

  2. Will your data structure change over time?

Do you want to add and delete fields from a file and still be able to read old files with the same code?   Certainly, if you have a large enough cluster you can rewrite all of your historical data to add a field, but this is often not ideal. Being able to add a field and still read historical data may be preferred. If so, know which file formats enable flexible and evolving schema.

 

  3. How important is file format “splittability”?

Since Hadoop stores and processes data in blocks you must be able to begin reading data at any point within a file in order to take fullest advantage of Hadoop’s distributed processing. For example, CSV files are splittable since you can start reading at any line in the file and the data will still make sense; however, an XML file is not splittable since XML has an opening tag at the beginning and a closing tag at the end. You cannot start processing at any point in the middle of those tags.

 

 

  4. Does block compression matter?

Since Hadoop stores large files by splitting them into blocks, it’s best if the blocks can be independently compressed. Snappy and LZO are commonly used compression technologies that enable efficient block storage and processing. If a file format does not support block compression, then compressing the file renders it non-splittable. When processed, the decompressor must begin reading the file at its beginning in order to obtain any block within the file. For a large file with many blocks, this could generate a substantial performance penalty.
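As a rough illustration of block compression in practice, here is a minimal HiveQL sketch (the table names are hypothetical, and the exact property names can vary by Hadoop/Hive version) that rewrites a plain text table as a Snappy block-compressed SequenceFile table:

SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- rewrite the text table as a block-compressed SequenceFile table
CREATE TABLE events_seq
STORED AS SEQUENCEFILE
AS
SELECT * FROM events_text;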

 

  5. How big are your files?

If your files are smaller than the size of an HDFS block, then splittability and block compression don’t matter. You may be able to store the data uncompressed or with a simple file compression algorithm. Of course, small files are the exception in Hadoop and processing too many small files can cause performance issues. Hadoop wants large, splittable files so that its massively distributed engine can leverage data locality and parallel processing.

 

Large files in Hadoop consume a lot of disk -- especially when considering standard 3x replication. So, there is an economic incentive to compress data, i.e., to store more data per byte of disk. There is also a performance incentive, as disk IO is expensive. If you can reduce the disk footprint through compression, you can relieve IO bottlenecks. As an example, I converted an uncompressed, 1.8GB CSV file into the following formats, achieving much smaller disk footprints.

 

Uncompressed CSV: 1.8 GB

Avro: 1.5 GB

Avro w/ Snappy Compression: 750 MB

Parquet w/ Snappy Compression: 300 MB

 

I then ran Impala and Hive queries against each of the file formats. As the files became smaller, the query performance improved. The queries against Parquet were a couple orders of magnitude faster than uncompressed CSV.
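For reference, a conversion like the one above can be scripted entirely in HiveQL. The following is only a sketch under assumed names (an existing events_csv table rewritten as Snappy-compressed Parquet); exact syntax and compression settings depend on your Hive version:

SET parquet.compression=SNAPPY;

-- rewrite the CSV-backed table as a Parquet table
CREATE TABLE events_parquet
STORED AS PARQUET
AS
SELECT * FROM events_csv;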

 

  6. Are you more concerned about processing or query performance?

There are three types of performance to consider:  

  • Write performance -- how fast can the data be written.
  • Partial read performance -- how fast can you read individual columns within a file.
  • Full read performance -- how fast can you read every data element in a file.

 

Columnar, compressed file formats like Parquet or ORC may optimize partial and full read performance, but they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write but, due to the lack of compression and column-orientation, are slow for reads. You may end up with multiple copies of your data, each formatted for a different performance profile.

 

How do the most common Hadoop file formats stack up?

 

1. Formats to Avoid

Any format that is not splittable should generally be avoided. The use of XML File and JSON File formats is a common mistake. Each of these formats contains a single document per file, with an opening tag at the beginning and a closing tag at the end. Note that files containing JSON records are fine and will be discussed below.

 

2. Text/CSV Files

CSV files are still quite common and often used for exchanging data between Hadoop and external systems. They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database. However, CSV files do not support block compression, thus compressing a CSV file in Hadoop often comes at a significant read performance cost.

 

When working with Text/CSV files in Hadoop, never include header or footer lines. Each line of the file should contain a record. This, of course, means that there is no metadata stored with the CSV file. You must know how the file was written in order to make use of it. Also, since the file structure is dependent on field order, new fields can only be appended at the end of records while existing fields can never be deleted. As such, CSV files have limited support for schema evolution.
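Because the file itself carries no metadata, the schema typically lives in whatever reads it. For example, a Hive external table definition over a CSV directory might look like the following sketch (the column names and path are hypothetical):

CREATE EXTERNAL TABLE orders_csv (
  order_id     BIGINT,
  product_code STRING,
  order_amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/orders/csv';

If the field order in the files ever changes, this definition (and every other reader) must change with it.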

 

3. JSON Records

JSON records are different from JSON Files in that each line is its own JSON datum -- making the files splittable. Unlike CSV files, JSON stores metadata with the data, fully enabling schema evolution. However, like CSV files, JSON files do not support block compression. Additionally, JSON support was a relative latecomer to the Hadoop toolset, and many of the native serdes contain significant bugs. Fortunately, third-party serdes are frequently available and often solve these challenges. You may have to do a little experimentation and research for your use cases.

 

4. Avro Files

Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro files store metadata with the data but also allow specification of an independent schema for reading the file. This makes Avro the epitome of schema evolution support since you can rename, add, delete and change the data types of fields by defining new independent schema. Additionally, Avro files are splittable, support block compression and enjoy broad, relatively mature, tool support within the Hadoop ecosystem.

 

5. Sequence Files

Sequence files store data in a binary format with a similar structure to CSV. Like CSV, sequence files do not store metadata with the data so the only schema evolution option is appending new fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.

 

6. RC Files

RC Files or Record Columnar Files were the first columnar file format adopted in Hadoop. Like columnar databases, the RC file enjoys significant compression and query performance benefits. However, the current serdes for RC files in Hive and other tools do not support schema evolution. In order to add a column to your data you must rewrite every pre-existing RC file. Also, although RC files are good for query, writing an RC file requires more memory and computation than non-columnar file formats. They are generally slower to write.

 

7. ORC Files

ORC Files, or Optimized RC Files, were invented to optimize performance in Hive and are primarily backed by Hortonworks. ORC files enjoy the same benefits and limitations as RC files, just done better for Hadoop. This means ORC files compress better than RC files, enabling faster queries. However, they still don’t support schema evolution. Some benchmarks indicate that ORC files compress to be the smallest of all file formats in Hadoop. It is worthwhile to note that, at the time of this writing, Cloudera Impala does not support ORC files.

 

8. Parquet Files

Parquet Files are yet another columnar file format that originated from Hadoop creator Doug Cutting’s Trevni project. Like RC and ORC, Parquet enjoys compression and query performance benefits, and is generally slower to write than non-columnar file formats. However, unlike RC and ORC files Parquet serdes support limited schema evolution. In Parquet, new columns can be added at the end of the structure. At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such as Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala. Native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.

 

One note on Parquet file support with Hive... It is very important that Parquet column names are lowercase. If your Parquet file contains mixed-case column names, Hive will not be able to read those columns; queries will return null values for them without logging any errors. Unlike Hive, Impala handles mixed-case column names. A truly perplexing problem when you encounter it!
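To make the symptom concrete, here is a hedged HiveQL sketch (the names and path are hypothetical). If the underlying Parquet files were written with mixed-case columns such as CustomerId and CustomerName, the lowercase Hive definition below will not match them, and queries will silently return nulls for those columns:

CREATE EXTERNAL TABLE customer_parquet (
  customerid   BIGINT,
  customername STRING
)
STORED AS PARQUET
LOCATION '/data/customer/parquet';

The practical remedy is upstream: have whatever writes the Parquet files emit all-lowercase column names, or rewrite the files with lowercase names before exposing them to Hive.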

How to choose a file format?

As discussed, each file format is optimized by purpose. Your choice of format is driven by your use case and environment. Here are the key factors to consider:

  • Hadoop Distribution -- Cloudera and Hortonworks support/favor different formats
  • Schema Evolution -- Will the structure of your data evolve?  In what way?
  • Processing Requirements -- Will you be crunching the data and with what tools?
  • Read/Query Requirements -- Will you be using SQL on Hadoop?  Which engine?
  • Extract Requirements -- Will you be extracting the data from Hadoop for import into an external database engine or other platform?
  • Storage Requirements -- Is data volume a significant factor?  Will you get significantly more bang for your storage buck through compression?

 

So, with all the options and considerations, are there any obvious choices? If you are storing intermediate data between MapReduce jobs, then Sequence files are preferred. If query performance against the data is most important, ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala) are optimal -- but these files will take longer to write. (We’ve also seen order-of-magnitude query performance improvements when using Parquet with Spark SQL.) Avro is great if your schema is going to change over time, but query performance will be slower than ORC or Parquet. CSV files are excellent if you are going to extract data from Hadoop to bulk load into a database.

 

Because you will likely have multiple use cases for your Hadoop data, you will use multiple file formats. It is quite common to store the same (or very similar) data in multiple file formats to support varied processing, query and extract requirements. Embrace diversity!

This post was written by Dave Reinke and originally published on Wednesday, June 22, 2016

 

In a previous blog, we discussed the importance of tuning data lookups within Pentaho Data Integration (PDI) transformations.  We suggested that there were identifiable design patterns which can be exploited to improve the performance and stability of our code.  In this blog, we will investigate the most frequently used lookup pattern:  Key-based, Single Record Lookup.

 

This pattern’s classic use case involves an inbound stream of fact records queried from a source containing dimensional attributes in the form of business keys.  In order to load this data into the target fact table, the dimensional business keys must be converted into a set of technical keys referencing target dimension tables. The conversion of dimensional business keys to technical keys is done through a series of lookups.

 

The Database Lookup and Stream Lookup steps are used to implement this pattern.   Database Join could be used, but would be much less efficient since it does not support caching.   Additionally, we are not considering PDI’s purpose built Dimension Lookup/Update and Combination Lookup/Update steps as this pattern assumes no versioning of dimension data, i.e. no slowly-changing dimensions and that we have already loaded our dimension tables prior to the fact table load.  (I can already hear some of my colleagues’ arguments.   Yes, I know that Dimension Lookup/Update can be used for Type 1 dimensions and that its use enables the code to more easily migrate to Type 2, versioned, dimensions.  It is a candidate for this pattern, but requires specific database design changes for its use.   We’ll discuss Dimension Lookup/Update in a future blog.  I promise.)  So with that caveat, let’s examine the use of Database Lookup.

 

Using Database Lookup for Key-based, Single Record Lookups

Database Lookup is the prime choice for this pattern because it is easy to set up and provides rich functionality and tuning options.  However, there are two vital prerequisites for its use.  First, the lookup source must be a single database table.  Second, the lookup table must contain a unique key column, often a business key, and a technical key column, almost always the primary key. The following diagram shows an example of this step’s configuration.  It depicts a lookup of the technical product_key from the dim_product table in my_datamart database using a unique product_code business key field.

 

 

As configured, the step will generate and issue the following SQL statement to the my_datamart database:

 

select product_key from dim_product where product_code = ?

 

The ? is replaced by the value in the inbound stream’s product_code field.  

 

Since the step is configured to enable caching of lookup data, the SQL statement will be run only once for each distinct product_code value found within the inbound stream.  Additionally, we provide a default value of 0 should the lookup fail to find a match.  No-Match behavior and caching are both configurable.  Let’s discuss...

Handling No-Match Lookup Failure in the Database Lookup Step

In our example, a no-match would occur if an inbound stream record’s product_code did not find a matching product_code in the dim_product database table.  The step is configured to “not fail”, but rather to place a default value of 0 in the outbound stream’s product_key field.  The underlying assumption is that there exists a “null” or “unknown” dim_product database record with a 0 product_key for null or invalid product_codes.  The following table describes the two ways a “no-match” can be handled by this step.

 

Action: Provide a Default Value
Step Configuration: Depicted in the example.
Processing Implications: A subsequent step in the transformation can be used to filter and/or operate on the defaulted value, perhaps writing an error record and sending an email to an administrator.

Action: Filter the inbound record
Step Configuration: Check the "Do not pass the row if the lookup fails" box.
Processing Implications: No error is raised and the row is not passed to the outbound stream. This prevents you from taking corrective action on the errant data but ensures that only valid matches are processed.

 

In practice, we rarely use the filter option as most data programs attempt to remediate and/or alert if invalid data appears.

 

Optimizing the Lookup Cache for the Database Lookup Step

Not implementing or optimizing the lookup cache is a frequent mistake made by new PDI developers. Fortunately, it’s simple to explain and set up!  The following table details the configuration options:

 

Cache Option: Cache All
Step Configuration: Check “Enable Cache”, then check “Load all data from table”.
Processing Implications: A single SQL select statement is executed, retrieving all key and lookup column data from the database lookup table. The entire lookup table dataset must fit in PDI’s memory. No inbound stream records will be processed until this entire dataset is retrieved and cached.

Cache Option: Cache Some
Step Configuration: Check “Enable Cache”, uncheck “Load all data from table”, then specify how big (in number of lookup records) the cache will be allowed to get via the “Cache size in rows” field.
Processing Implications: One SQL select statement will be executed per distinct inbound record key that is not already found in the cache. PDI memory must be able to hold the number of lookup rows specified.

Cache Option: No Cache
Step Configuration: Uncheck “Enable Cache”.
Processing Implications: Every inbound record will cause a SQL select statement to be executed against the lookup table -- even if keys are repeated!

 

So, when to use which option?   Here are some insights into our best practice decision-making...

 

Cache All -- when to use:

  • The lookup table is not very big

Cache Some -- when to use:

  • The lookup table is very big, and the inbound stream is not small and contains repeated lookup keys

  • The lookup table is big and the inbound stream is sorted on lookup keys (i.e. removing previously cached records from the active cache has no downstream penalty)

No Cache -- when to use:

  • The lookup table is very big and the inbound stream is very small

  • The lookup table is very big and the inbound stream contains very few (or no) repeated lookup keys

 

The most common configuration is Cache All. In general, if Cache All works efficiently, then use it. Most lookup tables are small (< 100,000 records) and thus can be retrieved quickly and will easily fit in memory. If full lookup dataset retrieval is inordinately slow or memory is limited, then consider Cache Some or No Cache depending on the size and lookup key distribution in your inbound stream.

 

Using Stream Lookup for Key-based, Single Record Lookups

Stream Lookup differs from Database Lookup in the following manner:

  • Its lookup dataset is sourced from a second inbound stream of records and is not tied directly to a single database table.

  • It always caches all lookup data -- there is no option for partial or no caching. 

 

Thus, Stream Lookup should be used in the following circumstances:

  • The lookup data is not sourced from the entirety of a single lookup table

    • You may want to use a subset of rows in a lookup table. In this case, you’d use a Table Input step to gather the subset and then stream that subset as the lookup data into the Stream Lookup step, or

    • the lookup data is sourced by joining two or more tables or streams.

  • You have multiple lookups against the same “not big” lookup table

    • You have role-playing dimension keys that relate a single fact record to the same dimension table multiple times. In this case, you’d use a Table Input step to query all of the lookup data and then use “copy” hops to stream that data to multiple Stream Lookup steps -- one per role.

  • You have a source of data that does not reside in a database table

    • E.g. a flat file, or a spreadsheet

 

Since the lookup data stream for Stream Lookup can be generated by any valid sequence of PDI steps, the alternatives and examples for its use are somewhat boundless.

 

Also note that Stream Lookup does not provide a filter record option for “no-match” conditions.   If you want to rid your stream of “no-match” records you have to use a Filter step after the Stream Lookup.

 

As you can see, there is tremendous opportunity to tune (or mistune!) lookups when doing a simple key-based, single record lookup.   Next in our series, I’ll examine a similar pattern but with a subtle twist:  Looking up the Most Recent Record.

This post was written by Dave Reinke and originally published on Wednesday, July 6, 2016

 

As we continue our series of Pentaho Data Integration (PDI) Lookup Patterns, we next discuss best practice options for looking up the “most recent record”. Common use cases for this pattern include finding the most recent order for a customer, the last interaction of a web site visitor, and the last claim record for a policy. As you’d expect there are multiple ways in which this can be done within PDI.

 

We’ll explore two of the most frequently used options, but first let’s define an example use case.

Example Use Case

Suppose we have a stream of customers each identified with a customer_key and there is an orders table which contains the following columns:

 

Column: order_key
Description: Primary key of the order. The column we want to look up.

Column: customer_key
Description: Foreign key to customers. The column we need to match with our input stream records.

Column: order_timestamp
Description: The timestamp of the order. We’ll use this to determine the most recent record for a given customer_key.

Column: Other columns
Description: There may be many other columns which we could include in our lookup. We will ignore these in order to de-clutter our example.

 

If you had customer_key 100 and wanted to get the most recent order_key, you would issue a query like the following:

 

select order_key
from orders
where customer_key = 100
order by order_timestamp desc
limit 1

 

The database would return exactly one order_key (limit 1) for the given customer_key (100) chosen as the first of a list sorted by the order_timestamp descending. In PDI, we can use the Database Lookup and Database Join steps to accomplish the same result.

 

Approach 1: Database Lookup

As previously discussed, the Database Lookup step is most often used for key-based, single record lookups; however, it is also apt for the Most Recent Record lookup pattern. The following picture demonstrates how this step would be configured for our example.

 

The step looks up the order_key of the most recent record in the order_db.orders table for each input stream record’s customer_key. The Order by field tells PDI to sort the matching orders records by order_timestamp in descending order. By design, the step will only return one row -- the first row returned after the sort.   So, we have effectively configured the step to execute a query similar to the one listed above.

 

As configured, our example will execute a single lookup query for each input stream record. If customer_keys are repeated in the input stream, this will result in many redundant query executions. To eliminate these redundant queries, caching can be configured for Database Lookup. Database Lookup enables caching because it restricts the lookup query to a single database table. The performance boost can be quite significant. In our example, suppose there are 1,000,000 input stream records containing 55,000 distinct customer_keys. Without caching, the step will execute 1,000,000 lookup queries. If we check the Enable Cache? box, the step will only fire 55,000 lookup queries -- one per distinct customer_key. This results in an order of magnitude or better improvement in performance.

 

Approach 2: Database Join

The Database Join step can also be configured to implement the Most Recent Record lookup pattern. The following picture shows its configuration for our orders example.

 

 

When using the Database Join step, you must supply your own lookup SQL. In our simple example, this is a straightforward parameterized select statement that returns order_keys for a given customer_key in descending timestamp order -- most recent record first. We enter 1 in the Number of rows to return field to ensure that only the first record for each input stream record is passed to the output stream. We check Outer join? to allow a failed lookup record (i.e. no matching orders for a customer_key) to pass with a null order_key in its output stream record.
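For reference, the lookup SQL entered into the step might look like the sketch below. The ? placeholder is bound to the input stream's customer_key field, and the step's Number of rows to return setting (1) plays the role of a limit clause:

select order_key
from orders
where customer_key = ?
order by order_timestamp desc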

 

Performance-wise, the Database Join step will take about as long to process as a non-cached Database Lookup step. There will be one query per input stream record. Unlike Database Lookup, there is no ability to cache data to improve performance. However, because Database Join allows you to define your own SQL Select statement, you have the ability to define arbitrarily complex lookup queries.

Which Approach to Use?

So, how do you decide which step to choose? The answer is rather straightforward. Use the Database Lookup step if you have a single lookup table with a straightforward lookup condition. Leverage caching if your input datastream contains duplicate lookup keys. Only use Database Join if your lookup query is complex. Remarkably, in the field, we see developers defaulting to Database Join without realizing that Database Lookup can be leveraged to improve performance. Don’t make that mistake!

 

One final comment: the savvy reader will note that this same pattern applies to “least recent record”, or historically first record, lookups as well. In fact, any arbitrary record sort that you might apply can be handled by this design pattern.

Next in our series, I’ll examine the Master-Detail explosion pattern, which brings the Database Join step to the fore.

This post was written by Chris Deptula and originally published on Wednesday, January 28, 2015

 

With an immutable file system and no update command, how do you perform updates in Hadoop? This problem occurs in just about every Hadoop project. Yes, the most common use case of Hadoop data ingestion is to append new sets of event-based and/or sub-transactional data. However, for analytics, you often have to join these events with reference data (hereafter referred to as “dimension data”) -- data that can, and usually does, change over time.

 

Imagine you are collecting website activities through an intelligent event tracking system.  The events are generated as JSON datums.  You’ll likely have some automated collection mechanism that uses streaming or micro-batching ingestion into HDFS.  Simple enough, just append the new data.   Now your business users want to analyze these events by various web user attributes -- attributes that you maintain in your registration systems.   You often update registration attributes via third party data append services and from direct updates by the registrants.  You need a way to maintain a current (perhaps historically versioned) copy of the registrant data in HDFS.  You need updates in Hadoop.


Performing data updates in Hadoop is hard.  On a recent project and after a long day of Pig scripting, one of my colleagues said, “My head hurts.  I have to think about things I never had to think about when using databases.”   His experience maintaining dimension data using mature ETL tools and RDBMS technology was great, but now that experience had to translate to the data programming capabilities in Hadoop.   Comparatively speaking, data programming in Hadoop is, how do we put this nicely, still maturing.

 

If you want to implement “upsert” (i.e. merge) logic or (gasp) Type 2 dimension logic, you have to build it yourself.  HDFS files are immutable, meaning you cannot edit a file after it has been written. Hive has only very recently added an update command, and Pig does not (yet) provide one, much less a merge statement. Some day they will, but for now you must develop your own algorithms.  Fortunately, there are some patterns which can be applied.

 

In this blog, we’ll discuss four common approaches:  Hybrid Update, HBase Dimension Store, Merge and Compact Update, and “Good Enough” Update.  There is a fifth, less common approach involving the creation of a custom access layer that resolves updates at read time.  This is often reserved for extremely high volume data where the batch processing cost to transform all of the data does not offset the cost of transforming the data on demand.  We won’t cover this last approach.

 

Hybrid Update

The Hybrid Update approach uses mature ETL and SQL programming to maintain dimensional data within an RDBMS then periodically copies the data, likely using Sqoop, into HDFS replacing the previous version.  All of the hard work to handle special update rules, slowly changing dimensions (i.e. versioning), surrogate key assignment, etc. can be maintained using highly productive ETL programming languages.   The Hadoop “update” becomes a file replacement.   Given that most companies have already invested in a data warehouse to maintain dimensional data, this approach is easiest to implement and often the first adopted.


However, there are downsides.  First, the approach is not scalable.  Suppose you have a billion record dimension table.  With this approach you’d extract a billion records from your RDBMS each time you need to update the dimension data in Hadoop.  There would clearly be performance and capacity implications!  Even if you currently have manageably sized dimensions, as data volume grows, your load performance will degrade.


Second, this approach requires orchestration between processes running in different environments.   You have to make sure that your RDBMS ETL completes before your extract process starts.   The latency requirements for your Hadoop-based analysis are wholly dependent upon the timeliness and success of your RDBMS load processing.  You could use an ETL technology like Pentaho Data Integration to orchestrate, but your load windows may shrink and exception handling could be complex.

 

In the end, if all of your dimensions are relatively small with slow growth, then this approach may be a good fit. Otherwise, we’d only recommend it as a transitional step before embarking on one of the following Hadoop-centric algorithms.

 

HBase Dimension Store

HBase is a NoSQL database that runs on Hadoop.  Since it provides native support for updates, it is often considered for dimension data maintenance.  In fact, many ETL tools contain pre-built steps to stream inserts, update and deletes into HBase.  You also can ingest “real-time” streams of high velocity data into HBase, minimizing update latency.  And, HBase tables can be queried via SQL on Hadoop options like Hive and Impala with properly configured serdes.   All this makes HBase seductively attractive, but, be warned, there are significant performance challenges when using HBase tables for analytical queries.
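As an illustration of the serde configuration, a Hive table can be mapped onto an existing HBase table with the HBase storage handler. This is only a sketch; the table, column family and column names are hypothetical:

CREATE EXTERNAL TABLE dim_customer_hbase (
  customer_key  STRING,
  customer_name STRING,
  email         STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,attr:customer_name,attr:email")
TBLPROPERTIES ("hbase.table.name" = "dim_customer");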


Querying HBase data works best when a single row or a range of rows is selected.   If the query must scan the entire table (as is often the case for analytical queries), HBase is quite inefficient.   In our labs, we set up a small experiment using Impala to contrast performance between a file-based dimension table and an HBase version.  We created several aggregating “group by” queries that joined our dimension tables to a large, file-based fact table.  In all cases, the file-based dimension table outperformed the HBase version.


If you are 1) already using HBase, 2) comfortable maintaining and balancing its resource needs with the other services running on your Hadoop cluster and 3) able to ensure that your HBase queries always return a single row or a range of rows, then an HBase dimension store is a good solution to the update problem.  Otherwise, we’d recommend you consider one of the next two alternatives.

 

Merge and Compact Update

Unlike the Hybrid and HBase approaches, this update strategy leverages an algorithm rather than a specific technology. The algorithm can be implemented with Java MapReduce, Pig, Hive, Spark, Cascading, Pentaho Visual MapReduce or any other Hadoop data programming language.  It consists of the following steps:

  1. Maintain a Current Copy of the Master Data:  A complete copy of your dimension data must reside in HDFS.  For clarity’s sake we’ll call this the “master” data.  This requires a one-time, initial load of the data.
  2. Load the Delta Data: Load the newly updated data into HDFS.  We’ll call this your “delta” data.
  3. Merge the data:  Join the master and delta data together on the business key field(s).
  4. Compact the data: After the merge you will have one or more records for each business key.  In the compaction, you apply logic that emits a single surviving record (or set of records) per business key.  In the simplest case, you will emit one record having the maximum update timestamp.   However, if you need to maintain record versions (e.g. type 2 dimensions), then your logic will emit a set of versioned records with non-overlapping effectivity dates.  There may be additional complexity in your compaction process if you have to apply special business logic.  For example, you may never want to update some fields, or only update a field when some condition occurs.
  5. Write the Data to a Temporary Output: Since most Hadoop jobs (e.g. MapReduce) cannot overwrite an existing directory, you must write the compaction results to a temporary output.  The temporary output is your new dimension table containing a merge of new, updated and unchanged records.
  6. Overwrite the Original Master Data: Move the temporary output to the location of the original master data, overwriting the previous version.
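To make the algorithm concrete, here is a minimal HiveQL sketch of the merge (step 3), compaction (step 4) and temporary output (step 5) for the simplest case: keep the record with the maximum update timestamp per business key. Table and column names are hypothetical, the merge is expressed as a union of master and delta followed by a window function, and it assumes a Hive version with windowing support:

INSERT OVERWRITE TABLE dim_customer_tmp
SELECT business_key, attr_1, attr_2, update_ts
FROM (
  SELECT merged.*,
         ROW_NUMBER() OVER (PARTITION BY business_key
                            ORDER BY update_ts DESC) AS rn
  FROM (
    SELECT * FROM dim_customer_master   -- current master data
    UNION ALL
    SELECT * FROM dim_customer_delta    -- newly loaded delta data
  ) merged
) ranked
WHERE rn = 1;

The INSERT OVERWRITE above writes the compaction result to the temporary output described in step 5.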


This Hortonworks blog illustrates a simple example of this approach using Hive.  It works when there are no complex business rules for the compaction.   If you have business rules that cannot be expressed using SQL clauses, you may be limited in Hive.


The primary downside of this approach is that you must read and process the entire dimension table every time you perform an update -- even if only one record is changed.  If your dimension table is 2TB on disk, then you are reading and writing 2TB to perform the update.  Fortunately, there is a way around this issue...

 

“Good Enough” Update

If you have extremely large dimensions that cannot be processed within your batch window, then you either have to buy a bigger cluster or come up with another approach.  This is where the “good enough” update is applied.

 

At its heart is an intelligent partitioning strategy whereby you organize your dimension data so that you isolate records which are most likely to be updated into a small subset of the partitions.  This is often done using a record creation or last update date and assuming that recently inserted records are more likely to be updated than older records.  (For type 2 dimensions, you would also isolate the “current” version of each record since older versions are considered immutable.)  Within HDFS, you’d store each partition in a sub-folder of a parent dimension folder.   A Hive table can be defined to recognize these partitions, which might also improve query performance.
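A minimal sketch of such a partition layout in HiveQL, using a hypothetical created-month partition column:

CREATE TABLE dim_customer (
  customer_key  BIGINT,
  business_key  STRING,
  customer_name STRING,
  update_ts     TIMESTAMP
)
PARTITIONED BY (created_month STRING);

-- a "good enough" run would then only rewrite the most recent partitions, e.g.:
-- INSERT OVERWRITE TABLE dim_customer PARTITION (created_month = '2015-01') SELECT ... ;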

 

With this partitioning strategy in place, you simply perform the Merge and Compact Update on the subset of data considered most likely to change.   You avoid processing the entire dataset, at the risk of inserting redundant dimension records.  Redundant dimension records are created when an updated delta record cannot find its previous version in the partition subset.

 

Of course, the potential to have more than one dimension record for the same dimension key means that query results which include that key might be wrong.   This may persist and potentially compound unless you augment this approach with a periodic merge and compact update on the complete dataset.   We have seen this strategy applied where good enough updates are executed throughout the week and then a full update is run over the weekend.  We have also seen organizations run data quality queries looking for duplicate keys and only running merge and compact jobs if an issue is detected.   In each case, there is a period of potential data inaccuracy which eventually is resolved.

 

The Good Enough Update approach should only be used when you cannot feasibly execute the full Merge and Compact Update and when there is a tolerance for small data errors.    Given the vagaries of event/sub-transactional data collection, most Hadoop-based analytical use cases already have such a tolerance.

 

Picking an Update Strategy

As you’d expect, one size does not fit all.   You will likely utilize more than one Hadoop update approach.   So, how do you decide which approach is right for which circumstance?    There are four major (and a whole host of minor) areas for consideration.   First, what is the size of your dimension?  Are you dealing with hundreds, thousands, or millions of records?   Second, what is your latency requirement?  Do you have a batch window, or do you need to see updates occur as changes “stream” in?   Third, how much data volatility exists?   Is this dimension fairly stable, with new/changed records trickling in, or do you see massive “delta” files with nearly every record being touched?   Fourth, how complex are your update rules?  Is it as simple as overriding the previous record with a delta record, or are you implementing type 2 dimension logic with additional update rule dependencies?   The following table considers these factors for each algorithm.

 



In general, Hybrid is useful for small dimensions that have complex update rules.  HBase is best when latency is a major factor and update rules are simple.   Merge and Compact is the most versatile and often used but has scalability and latency limits.  Good Enough helps solve the Merge and Compact issues but at the cost of potentially introducing data quality issues.


The Future

Hortonworks has added batch insert, update, and delete capabilities to Hive as part of the Stinger.next initiative.  These capabilities are designed for large-scale inserts, updates, and deletes, as opposed to single record transactions.  Too many transactions over a period of time can lead to a drop in performance.  Additionally, ACID transactions in Hive currently work only with the ORC file format and are a very new, immature feature.  As this capability matures, we expect it to become an attractive and widely used option.
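As a hedged sketch of what this new Hive ACID support looks like (the names are hypothetical), the table must be bucketed, stored as ORC and flagged as transactional before UPDATE and DELETE statements are allowed:

CREATE TABLE dim_customer_acid (
  customer_key  BIGINT,
  customer_name STRING,
  email         STRING
)
CLUSTERED BY (customer_key) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE dim_customer_acid
SET email = 'new.address@example.com'
WHERE customer_key = 100;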

In 2006, a group of business intelligence professionals became smitten with the technology of a small startup with a big dream. That company, Pentaho, first assembled over a decade ago, developed a business analytics platform that is today unparalleled in the market.

 

That team started a company that became Inquidia Consulting.  From inception, our goal has been to help organizations become data-driven. A key focus of our consulting business revolved around Pentaho’s commercial open source technologies.

 

As we've joined forces with Pentaho, we're now aligned with an organization that values our technology expertise and will provide opportunities to access broader analytics capabilities and vision. Collaboration with Pentaho will turbocharge efforts to deliver sophisticated data and analytics, especially around IoT and machine learning. Inquidia customers and consultants will benefit from our expanded capabilities and expertise.

 

Inquidia's team will remain true to our principles that managing and analyzing core data assets is our clients’ surest path to creating value. And by joining Pentaho’s team, we know that the whole will be greater than the sum of the parts. With Pentaho, the Inquidia team is poised to deliver on the data vision, helping our staff and customers reach their potential. We’re honored to be part of Pentaho and excited for the future.

 

We're bringing over some of our favorite (and useful) blog posts from the past several years. As we migrate these over, feel free to peruse them, and don't hesitate to reach out for more information!

 

The Future of ETL and ELT

Pentaho Data Integration Best Practices: Lookup Optimization

Harvesting Raw Data into Custom Data API’s with Pentaho Data Integration


Swallowing a mouthful of JSON via the Elephant

Metadata Injection with Pentaho Data Integration

Map Analytics with Javascript and Pentaho

Pentaho Custom Dashboard Development with Javascript

This post was written by Kevin Haas and originally published on Tuesday, July 14, 2015

 

When working with our clients, we find a growing number who regard their customer or transaction data as not just an internally leveraged asset, but one that can enrich their relationships with customers and supporting partners. They need to systematically share data and analytics outside their firewall.

 

One of the ways we help them achieve this goal is with Custom Data API's. In this article, we'll show you how to build Data API's using visual programming within the Pentaho Data Integration Interface. Along the way, we'll expose capabilities of the platform that are not always seen by the average user.

 

Inspecting The Field: Inside Pentaho Data Integration

 

For those who use Pentaho's Data Integration (PDI aka Kettle) platform regularly, you're probably aware of the amazing things you can do. For the uninitiated, PDI is a very powerful Extract-Transform-Load (ETL) technology that lets you access data from virtually any source, perform complex transformations, and push/load that data to virtually any destination. It's commonly used for integrating and shaping data into new forms optimized for analytical purposes (data warehouses, data marts, etc.).

 

However, Pentaho experts know that Kettle is more than an ETL platform. It is powered by an amazing, open, and flexible data management platform that can not only read and write data from relational databases, Hadoop, NoSQL databases, and Web API's, but it can also serve data to other applications.

 

Enter one of the most powerful (but maybe underutilized) components of the Pentaho Data Integration Platform: the Data Integration Server. It can power your API's.

 

Sowing The Seeds: The Data Integration Server

 

If you're an Enterprise Edition Pentaho customer, you're probably aware of the capabilities of the Data Integration Server. It's one of the foundational reasons why customers choose the Enterprise Edition of PDI. The Data Integration Server is a server application that allows you to schedule and run jobs and transformations created with PDI's "Spoon" visual development tool.

 

Community Edition users have a similar (but not as sophisticated) version of the Data Integration Server with the included Carte server. If you're a community edition user, it's easy enough to start up your own local Carte server on a port of your choosing.  You could start Carte on port 8081 like this: 

 

    carte.sh 127.0.0.1 8081 

or

    carte.bat 127.0.0.1 8081 

 

There are many configuration options for the Data Integration Server and Carte server that allow it to operate in all kinds of forms. See here for the current list of options.

 

Bringing It To Life: Calling The Transformation on the Server with ExecuteTrans 

 

The first step to create a Data API is to develop a Pentaho Data Integration transformation that gathers data from virtually any source:  RDBMS, Hadoop, NoSQL databases, Web APIs, etc.  Since transformations can be parameterized, you can apply filtering, sorting or any other customizable logic to your code. Again, this can all be developed using PDI's graphical development tool:  Spoon.

 

Once you have installed and launched your Data Integration or Carte server, you can execute your transformations with a simple HTTP call using the ExecuteTrans method. For example, if you have Carte running on port 8081 and want to execute a transformation that is stored in /srv/pentaho/sample.ktr and accepts a parameter named "parm", you can simply call: 

    http://127.0.0.1:8081/kettle/executeTrans/?trans=/srv/pentaho/sample.ktr&parm=parmvalue 

When you launch this URL within a browser, you won't see a lot of information in return. In fact, you'll get nothing back! But, if you look at the PDI logs on the server, you'll see that the sample.ktr transformation did run.  This is all well and good -- you were able to launch a transformation on a server via an HTTP request. But now you want it to return some data! 

 

Growing a Transformation into an API: The Output To Servlet Option 

 

To provide an output, the transformation must produce one or more rows of data through a configured output step.  To send rows of data out as a response from the ExecuteTrans method, your transformation must pass this data to a Text File Output (or JSON Output) Step, configuring the step to "Pass output to servlet."

 

pho-data-web-service-1.png       

 

      By checking the "Pass output to servlet" box, the transform will take the stream of data entering this step and push it out as your response. Note that the exact data contents of the stream are output as specified by the step's formatting options (e.g. fields, headers, separators).  You'll want to configure these depending on your needs. 

 

Bringing it to Market: A Taste of Farmers Market Data In A Custom Data API 

 

Let's look at an example. With harvest season upon us, more and more farmers markets are filling up with their bounty. But where can you find local farmers markets? Thankfully, the USDA provides an API that lets you search for nearby markets.

 

      http://catalog.data.gov/dataset/farmers-markets-search 

 

This is a powerful set of data, but it's not always easy to use the API (or any API, as provided). In this case, to get a market, you first provide a zip code.  The API returns a list of markets with their associated IDs, which you then must use to query the API again in order to get detailed information on each market. Also, we'd like to handle the result data in JSON format, but some elements of the API return comma-separated values. Finally, we need to format the returned address a bit better, attaching latitude and longitude for mapping. 

 

To begin, we'll create a transform that accepts a parameter called "zip", and does the following: 

1. Generate Rows: Initialize with a row for Pentaho to process input parameters.
2. HTTP Client: Call the USDA API to get the list of markets with the input parameters.
3. JSON Input: Convert the JSON that comes back into a list of markets.
4. JSON Input: Extract the Market IDs from that list of markets.
5. Calculator: Format a URL string to re-query the API to get detailed market info for each market.
6. HTTP Client: Call the USDA API again for each of the markets to get the details.
7. JSON Input: Extract the detailed address info on the market.
8. HTTP Client: Call the Google Geocoding API to get the latitude and longitude.
9. JSON Input: Extract the latitude and longitude from Google's response.
10. Javascript: Clean up and format the Address and Market info from the source APIs into a nice JSON format.
11. Group By: Group all of the detailed Markets from the API into a single row.
12. Javascript: Format the final JSON.
13. Text File Output: Output the single data element as the servlet response.

  
 

A lot of steps, for sure, but their configuration is relatively simple. This is what the transformation looks like in the Spoon IDE: 

 

pho-data-web-service-2.png       

 

Once we have developed and tested our transformation using Spoon, we are ready to use our Carte server and the ExecuteTrans method.   We open a browser and enter the following URL: 

 

      http://127.0.0.1:8081/kettle/executeTrans/?trans=/mnt/pentaho/pdi-ee/repository/api/farmers-market.ktr&zip=60634 

 

Note, we can even front end the Carte or DI server with Apache and do a little URL rewriting to make it a friendlier URL to call.  We did this. So instead of the above executeTrans URL, we can call this: 

 

    http://127.0.0.1:8081/api/farmers-market?zip=60634 

 

When you execute the API, it does all the work for you. It calls all the various child APIs, does all the cleanup and formatting, and returns delicious looking JSON, with all of the added items from the Transformation as a result. For example: 

 

    {
       market:[
          {
             ami_id:"1010508",
             marketname:"Eli's Cheesecake and Wright College Farmer's Market",
             location:"6701 W. Forest Preserve Drive, Chicago, Illinois, 60634",
             address1:"6701 West Forest Preserve Drive",
             address2:"",
             city:"Chicago",
             state:"IL",
             postal_code:"60634",
             latitude:"41.9588993",
             longitude:"-87.7944428",
             schedule:[
                "06/12/2015 to 09/25/2015 Thu: 7:00 AM-1:00 PM"
             ],
             products:[
                "Baked goods",
                "Crafts and/or woodworking items",
                "Cut flowers",
                "Fresh fruit and vegetables",
                "Honey",
                "Canned or preserved fruits, vegetables, jams, jellies, preserves, dried fruit, etc.",
                "Maple syrup and/or maple products",
                "Poultry",
                "Soap and/or body care products"
             ]
          }
       ]
    }

 

Taking It Home: Building Your Own Data APIs 

 

This is a fun illustration of how to pull data from various sources and craft it together into your very own API. All of this was done with out-of-the-box Pentaho Data Integration capabilities, no extra parts required.

 

Clearly, you're not limited in what you can do with this capability. You can easily get data from virtually any source and wrap a custom data API around it, giving you new ways to leverage your existing infrastructure and share data. Pentaho has a lot of capabilities inside.  We know that once you take PDI home from the market and look inside the box, you'll be inspired. 

 

If you're interested in seeing how this works, including a look at the transformation that powers the example, just let us know and we'll be happy to share it with you.

This post was written by David Reinke and originally published on Thursday, June 16, 2016

 

This blog is the first in a series featuring some of the practices we use with Pentaho Data Integration.

 

Pentaho Consulting is often asked to performance tune and/or stabilize ETL processes created by developers who are inexperienced with the product. One of the first areas we investigate is the use of data lookups. Looking up values from a database or file for each record in a stream of data is one of the most frequently executed activities in any data program, and if improperly tuned, ETL jobs can suffer.  Examples of lookups abound: looking up a technical product dimension key for a product name, obtaining the most recent order record for a given customer, or gathering all pageview history for a given website visitor.

 

A frequent issue we find is either the misuse or inefficient configuration of these data lookup steps. It's amazing that problems with lookups would be such an issue -- data lookups seem, at first, like a very straightforward exercise. Yet, their optimization arguably provides the biggest “bang for the buck” in performance tuning and code stabilization.

 

We've found the root causes of these issues to be a lack of understanding of the various data lookup options provided by PDI and minimal recognition of regularly occurring lookup patterns.   In this blog, I'll compare the most commonly used lookup steps and their core functionality.  In future blogs, I'll review design patterns applying these characteristics to what we implement as part of our consulting projects.

 

Most Common Pentaho Data Integration Lookup Steps

There are many steps categorized as “Lookup” within the PDI environment. Let's focus on the three most frequently used steps: Database Lookup, Stream Lookup and Database Join. All of these steps assume an “inbound stream” of records that are then compared record by record with some “lookup source” to produce 0 or more “outbound stream” records with one or more “looked up” fields appended.

 

The following summarizes the functionality and behavior of these three steps.

 

Database Lookup

  • Lookup Source: Single database table
  • # of Records Looked Up: 1
  • Lookup Operators: Typical SQL operators
  • Caching of Lookup Data: 0 to all rows, with many configuration options
  • No-Match Lookup Behavior: Can filter the inbound record, or insert a null or default value in the lookup fields
  • Multiple-Match Lookup Behavior: Can generate a failure, or match on the first record returned from the generated lookup query (the lookup query can have an ORDER BY clause applied)

Stream Lookup

  • Lookup Source: PDI stream
  • # of Records Looked Up: 1
  • Lookup Operators: Equality only
  • Caching of Lookup Data: Always all rows
  • No-Match Lookup Behavior: Can insert a null or default value in the lookup fields
  • Multiple-Match Lookup Behavior: Looks up the last matching record in the lookup stream

Database Join

  • Lookup Source: Database query
  • # of Records Looked Up: 0 or more
  • Lookup Operators: Any valid SQL
  • Caching of Lookup Data: None
  • No-Match Lookup Behavior: Can return null lookup fields or act as a filter on the inbound stream
  • Multiple-Match Lookup Behavior: Returns all matched records, potentially causing the inbound stream record to be replicated for each matched lookup record

 

As you can see, there is quite a bit of overlapping functionality. You can often use any one of the three to accomplish the same lookup task. However, there are both striking and nuanced differences which provide significant opportunity for advanced functionality and impactful performance tuning. This is where the use of patterns can aid the insightful developer -- applying rules of thumb and best practice algorithms.
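
To make the contrast concrete, here's a rough sketch of the SQL each of the database-backed steps effectively issues for every inbound record. The table and column names are hypothetical, and Stream Lookup is absent because it issues no SQL at all -- its lookup source is an in-memory PDI stream.

    -- Database Lookup: a single-table, equality-style lookup per inbound record
    -- (results may be cached, depending on the step's cache settings).
    SELECT product_key
    FROM dim_product
    WHERE product_name = ?;          -- ? bound from a field in the inbound stream

    -- Database Join: any valid SQL, parameterized by inbound fields,
    -- returning 0..n rows per inbound record.
    SELECT order_id, order_date, order_total
    FROM orders
    WHERE customer_id = ?            -- ? bound from a field in the inbound stream
    ORDER BY order_date DESC;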

 

In future blogs, I’ll bring to bear common patterns that we have uncovered through our years of PDI implementation work. I’ll start by examining the most frequently occurring pattern: Key-Based, Single Record Lookup. Visit our website in the next week or so as we publish the rest of this series.

This post was written by Bryan Senseman and originally published on Tuesday, November 29, 2016

 

In today’s diverse data environments, it seems like JSON data is everywhere now.  Often working with JSON datums means using one of the popular NoSQL databases like MongoDB or Cassandra. These are easy enough to install, but most organizations use SQL databases for analytics. This begs the question:  Is it possible to use SQL-based tools to do analytics on a JSON dataset? Well, the short answer is YES!...with a smattering of technical gymnastics and a recent version of PostgreSQL.

 

The PostgreSQL community has been steadily adding JSON functionality over the last couple of major releases. While a simple web search provides plenty of how-to instructions for storing JSON data in PostgreSQL, few articles actually exist that tell you how to use it. This blog will help you understand how to not only capture, but to analyze JSON data within PostgreSQL.

 

Finding Some JSON Data

The first thing I had to do was find a firehose of JSON that would work. While I was forming the idea for this blog, I was 100% focused on a riveting World Series between the Cleveland Indians and Chicago Cubs. It dawned on me that I could use a Twitter search (stealing liberally from a previous Pentaho Consulting Labs exercise) of World Series tweets to gather my JSON data.

 

I designed a basic Postgres table to store the tweets. It contains 2 columns: an id and the tweet. Note that the tweet is a JSON datum.
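
The DDL was roughly the following (a minimal sketch; the table and column names here are illustrative stand-ins):

    -- Minimal sketch of the tweet store; names are assumptions.
    CREATE TABLE tweets (
        id    BIGINT PRIMARY KEY,   -- tweet id
        tweet JSONB NOT NULL        -- the full tweet payload as JSONB
    );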

 

 

I used Pentaho Data Integration (PDI) to consume the Twitter "data firehose" and persist it into my new Postgres tweets table. Since my company, Pentaho Consulting, is one of the premier Pentaho consultancies on the planet, and you are reading this on our website, I'll assume the majority of you know at least a little about Pentaho. (If not, please explore our myriad blog posts to learn more about this wonderful platform.)

 

Trick #1:  Writing JSON Data to Postgres using Views and Rules

Pentaho has a multitude of output capabilities, some specifically for JSON; Mongo and Cassandra just to name two. But, if you have ever tried to write to a JSON(B) datatype via PDI, you know it doesn’t work…or does it?

Trying to write directly to this table using the traditional Table Output step doesn’t work as PDI does not understand the JSONB data type.

 

 

Remember those technical gymnastics I mentioned earlier? It’s time to get limber now.

 

There are a couple of approaches to solve this. I chose to create a view which does an explicit type cast to text, which PDI can read with ease.

 

 

But, how then can I write data through a view? Well, the trick is to use the PostgreSQL rule system!  You can present a view to PDI that has plain vanilla text fields. When PDI inserts to this view, Postgres handles the conversion and writes directly to the underlying table.
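
Here is a minimal sketch of the trick, assuming the illustrative tweets table above: the view exposes the JSONB column as plain text for PDI, and the rule rewrites inserts against the view into inserts against the base table, casting back to JSONB.

    -- View: expose the JSONB column as plain text so PDI can read and write it.
    CREATE VIEW v_tweets AS
    SELECT id, tweet::text AS tweet
    FROM tweets;

    -- Rule: when PDI inserts into the view, write to the base table instead,
    -- casting the text back to JSONB.
    CREATE RULE v_tweets_insert AS
    ON INSERT TO v_tweets
    DO INSTEAD
    INSERT INTO tweets (id, tweet)
    VALUES (NEW.id, NEW.tweet::jsonb);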

 

So, now we can use a simple Table Output step in PDI, configured per the diagram below, to insert the JSON.

 

Trick #2:  Using PDI with a Little Java Magic to Obtain Historical Tweets

 

I would be remiss if I didn't point out how, in the past, we've used a PDI extension point, the User Defined Java Class step, to consume tweets via the Twitter API. Since I was working on my little experiment after the World Series had already begun, I needed to use the Search API to look backward in time. So, a little Java magic and we have a "step" that calls the Twitter search API, outputting rows to its result stream.

 

 

Since this API is documented to be non-exhaustive, we took a brute-force approach: we called it repeatedly using the Transform Executor step and used error handling at the Table Output step to ignore primary key violations, since the search returns duplicates across invocations.

This resulted in 150,000+ tweets across games 6 & 7. Admittedly, I turned it off after the Cubs victory as I am from Cleveland and have been an avid Indians fan since birth. So, I didn’t get nearly the volume of tweets I expected, and the general vitriol and “spam” made the dataset NSFW. But, the concept is sound and I got enough data to do my experiment.

Trick #3:  Using Functions and Views to Unpack JSON for Ubiquitous SQL access

 

So, we now have a PostgreSQL repository of tweets that we’ve consumed via the Twitter API using Pentaho Data Integration. How do we interrogate/mine that information for meaning?  I did not want to write an analytics tool from scratch and most of the packaged OLAP tools on the market are SQL based.  I had to do some more technical gymnastics, but I found a way to make the open source Mondrian OLAP engine work.

 

My first challenge was figuring out how to expose elements within the JSON field so that they were queryable via the SQL an OLAP tool generates. The answer: PostgreSQL functions. There are multitudes of functions available for use on JSON(B) fields. After some trial and error, I found a combination of operators and functions that produced a dataset I could work with. This is by no means the best combination, but it does work for my quick POC analyses.

 

Having already used Pentaho to populate the data, my first choice was to see if I could create a Mondrian schema, using custom SQL code to expose JSON elements as columns, for use in dimensions and measures. I could and it was relatively easy.

 

Using the Expression capabilities within Mondrian schemas, I created SQL snippets that extract various elements from the JSON-encoded tweet and present them as separate items. I created the schema using Pentaho's Schema Workbench tool. The [Tweet Country] field is created by retrieving the place object from within the tweet and then retrieving its country element as text.
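
The expression behind [Tweet Country] boils down to a snippet like this (a sketch assuming the illustrative tweets table and the standard Twitter payload; -> returns a JSONB object, ->> returns text):

    -- [Tweet Country]: pull the place object, then its country element as text.
    tweet -> 'place' ->> 'country'

    -- Used in a standalone query, the same expression looks like:
    SELECT tweet -> 'place' ->> 'country' AS tweet_country,
           count(*)                       AS tweet_count
    FROM tweets
    GROUP BY 1;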

 

 

When presented via Pentaho Analyzer, this allows for simple slicing and dicing of JSON data!

 

 

Here is the log, just to prove we are analyzing the information encapsulated within the tweet JSON.

 

 

Extracting single elements from JSON proved to be a simple task. However, JSON allows for much more complex data structures, including arrays -- specifically the Hashtags and User Mentions. Using the same technique as above proved problematic. I was able to create a dimension that exploded the array contents into individual values.

 

 

But, it only worked under ideal conditions, and could not be used within a SQL WHERE clause -- basically invalidating its use within Mondrian, as eventually a user will drill on a value and generate SQL that includes a WHERE component.

This led me down the path of creating a view in PostgreSQL that would bundle all the necessary JSON functions together and present a “ROLAPy” SQL representation to Mondrian. Although not as “sexy” as using the Mondrian schema to parse out the JSON elements, the view approach did circumvent all the issues encountered with the JSON arrays.
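
Below is a simplified sketch of such a view, again assuming the illustrative tweets table and Twitter's entities/hashtags structure (the real view exposed more fields than this). jsonb_array_elements unpacks each hashtag into its own row, and the LATERAL left join keeps tweets that have no hashtags.

    CREATE OR REPLACE VIEW v_tweet_facts AS
    SELECT t.id,
           t.tweet -> 'place' ->> 'country'     AS tweet_country,
           t.tweet -> 'user'  ->> 'screen_name' AS screen_name,
           h.elem ->> 'text'                    AS hashtag
    FROM tweets t
    LEFT JOIN LATERAL
        jsonb_array_elements(t.tweet -> 'entities' -> 'hashtags') AS h(elem)
        ON TRUE;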

 

 

With minor modifications to our existing cube, we now reference the view and can act upon each field as if it were a table column.

 

 

The upside to doing the conversion of the JSON within the database (instead of the Mondrian schema) is that it can be used by any SQL-based analytics tool. I attached Qlik Sense to the view and was able to analyze JSON data with Qlik!

 

 

And it doesn't stop there!  Tableau 10.1 recently announced support for JSON files, so you can assume JSON within a database is on the horizon. Not surprisingly, we can use the same techniques, either the view or Custom SQL in the connector, to make data available within Tableau. I really like how simple it is to do geographic analysis, and the following image definitely shows that baseball is an American game with avid followers in Latin America and Asia.

 

But wait...there is a small problem with our approach

 

Those of you familiar with RDBMS optimizers will realize there is a flaw in the view approach. The view is fully resolved for every request, even when the Hashtags or User Mentions are not needed for the analytic query.

But, there are tweaks we can make to our view definition and physical structures to improve the situation. We’ll investigate these and the potential performance benefit of using MonetDB in a future post.

This post was written by Matt Casper and originally published on Monday, January 23, 2017

 

In November of 2010, Pentaho Data Integration (PDI) v4.1 introduced the concept of Metadata Injection.  The promise of creating generalized ETL transformations whose behavior can be changed at run-time became a reality.  However, this initial breakthrough was limited to just 4 steps.   Today, with the recent release of PDI 7.0, there are 74 supported steps, making Metadata Injection more powerful (and applicable) than ever before!

 

With Metadata Injection, developer agility and productivity accelerates.  Instead of creating and maintaining dozens of transformations built with a common pattern, developers define a single transformation template and change its runtime behavior by gathering and injecting metadata from property files or database tables.  At Pentaho Consulting, we’ve seen code lines shrink by as much as 50:1 through the use of Metadata Injection.

 

When is Metadata Injection Useful?

There is often a lot of confusion about the proper and best use of new technologies.  Metadata Injection is no different. Over the past couple years, we have observed many situations which benefited from Metadata Injection and others where its use added unnecessary complexity.  Invariably, our assessments generated questions as to why and when Metadata Injection should be used. So, we decided to update our best practices to codify our thinking.

 

We took stock of our development experiences and identified the following traits to indicate Metadata Injection’s use:

  1. There is a repeatable processing pattern,
  2. Whose variations are finite and declarative,
  3. And is required for a large number of transformations,
  4. That often have dynamic source(s) and/or target(s).

 

As we continue to innovate, we’d expect this list to evolve, but we think it is a great starting point.   Let’s elaborate on these traits through the use of two examples.

 

Imagine you are tasked with creating an export process.  Each export will be sourced using a SQL SELECT statement. The queried records will be placed into a delimited file with a specific name and directory.  It's clear that the process has a pattern. (#1)  It simply issues a select statement and writes the rows out to an export file.  The variations between each export are well-defined.  They consist of a distinct SELECT statement, a list of fields with formatting and other metadata, a record delimiter, field delimiter, field enclosure, and filename/directory location.  None of the variations involve optional or new processing steps; rather, they include declarative values used to configure existing steps. (#2)  If you only need to do this for one or two exports, you might as well create one distinct transformation per export, but you have requirements for 20 exports and you know your users will be demanding more over time.  In fact, you expect some of the exports to be retired while others are added.  In short, you have volume. (#3)  Finally, since each export has a distinct source query and target file, you have dynamism. (#4)

 

Another common scenario involves loading a set of staging tables with delta data from source tables.  Staging tables are often loaded without much transformation, with each load process following the same pattern: a query is issued to get changed data from the source table, with the results inserted into the staging table.  The process is patterned and repetitive. (#1)  Variation is limited to metadata about the source and target tables. (#2)  Most staging processes deal with dozens if not scores of tables, so there is usually volume. (#3)  The source and target tables vary. (#4)

 

Admittedly, these are straightforward transformations that are easy to describe in a blog.  Rest assured that there are more complex patterns to be found.   For example, we’ve implemented big data ingestion processes that consume varying input feeds into standardized, yet customizable JSON or Avro files as well as complex change data capture patterns that require determination of logical deletes and target record versioning.   Regardless of how simple or complex, all of our experiences (to date) share the four traits outlined above.

 

A Metadata Injection Example

Enough of the generalized concepts, let’s look at some code!   We’ll expand on the staging load example outlined above.    Here is our repetitive, processing pattern for each staging table.

  1. A SQL SELECT is used to extract source data,
  2. Which is written to a delimited text file,
  3. That is bulk loaded into a target staging table.

 

Below is a screenshot of the PDI job for the overall process:  j_run_staging_jobs.

 

[Screenshot: the j_run_staging_jobs job (image04.png)]

It starts by truncating all of the target staging tables via the t_truncate_tables transformation. (label 1; details not shown) This transformation obtains a list of staging table names, uses a Concat Fields step to construct a “TRUNCATE TABLE table_name;” SQL statement, then passes this statement to an Execute SQL Statements step.  Although this transform does not technically use Metadata Injection, it is utilizing metadata and PDI’s ability to dynamically construct and execute SQL statements.
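
For illustration only, the statements this transform builds and runs amount to something like the following (hypothetical staging table names; in the job they come from metadata rather than being hard-coded):

    -- What t_truncate_tables effectively executes (hypothetical table names).
    TRUNCATE TABLE staging.customer;
    TRUNCATE TABLE staging.orders;
    TRUNCATE TABLE staging.product;

    -- And the kind of metadata-driven SQL that could generate those statements:
    SELECT 'TRUNCATE TABLE staging.' + TABLE_NAME + ';' AS truncate_stmt
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'staging'       -- hypothetical staging schema
      AND TABLE_TYPE = 'BASE TABLE';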

 

After truncating, we use Metadata Injection within a processing loop to load each staging table (label 2).  The t_get_table_names transform generates metadata for each staging table to load.  The j_staging_table subjob executes for each staging table, using Metadata Injection to extract source data, generate a file, and bulk load it into the target staging table.

 

A couple of assumptions for this example:

  • The source and target databases are both SQL Server. (Metadata Injection is not restricted to sourcing and targeting SQL Server tables; this same pattern could be adapted to source from and target just about any JDBC-compliant database that has a bulk load step.)
  • The source and target table structures are identical. (We did this for simplicity's sake; we can still use this same pattern if there is a simple mapping between fields or extra audit fields to be added.)

 

The following diagram depicts the transform (t_get_table_names) which acquires initial metadata for each staging table.

 

[Screenshot: the t_get_table_names transformation (image05.png, image00.png)]

 

The transformation is summarized as numbered:

  1. Query information_schema.columns to produce the metadata required for each table.
  2. Remove excess fields from the stream.
  3. Use a Group By step to generate a concatenated, comma delimited list of column names for each table.
  4. Copy each row to the result buffer for processing by the subsequent j_load_staging_table job.  (We’ll be using Metadata Injection within this job!)

 

The output rows for the “crtr: tables_columns” step (#4) will look something like:

 

table_name | field_list
table_1    | column_1, column_2, column_3, column_4, column_5
table_2    | column_1, column_2, column_3, column_4
table_3    | column_1, column_2, column_3
table_...  | column_...

 

You can see the metadata forming as we will use these fields to generate a SQL extract statement like SELECT ${FIELD_LIST} FROM ${TABLE_NAME}.
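
If it helps to think of it in SQL terms, the metadata produced by t_get_table_names is roughly what a query like this would return. This is a sketch only: it assumes SQL Server 2017+ for STRING_AGG and a hypothetical 'staging' schema, and the actual transform does the concatenation with a PDI Group By step rather than in SQL.

    SELECT TABLE_NAME AS table_name,
           STRING_AGG(COLUMN_NAME, ', ')
               WITHIN GROUP (ORDER BY ORDINAL_POSITION) AS field_list
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'staging'       -- hypothetical staging schema
    GROUP BY TABLE_NAME;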

 

Next, let’s examine the j_load_staging_table job which executes once per row returned from the t_get_table_names result set.  (i.e. one execution per table.)   The following diagram depicts and numbers the steps within this job.

 

[Screenshot: the j_load_staging_table job (image00.png)]

 

  1. Write an entry to the PDI file log denoting the parameters being used, i.e. the table and its column list.
  2. Use Metadata Injection to extract data from our source and generate a delimited file for bulk loading.
  3. Double-check that the data file was produced.
  4. Invoke a SQL Server bulk load process to populate the target table.
  5. Delete the generated data file, freeing up space for the next iteration.

 

The t_staging_metainject transformation (pictured below) obtains metadata for the table being processed and injects it into the t_staging_table_load transformation template. 

 

[Screenshot: the t_staging_metainject transformation (image03.png)]

 

The metadata to be injected is:

  • Source SELECT statement
  • Target (staging) table name
  • Target (staging) column metadata

 

The source SELECT statement can be constructed from the table_name and field_list input parameters.  In our example, the target table is named the same as the source.  The target column metadata is obtained via the "ti: table_col_data_types" Table Input step, which uses the table_name input parameter to pull metadata from the staging database catalog.
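
A sketch of what that catalog query might look like (hypothetical; the real step, and its mapping of database types to the PDI types shown below, may differ):

    -- Sketch only: column metadata for one staging table, to be mapped to
    -- PDI types (Integer/String/Number/Date) before injection.
    SELECT 'staging'                  AS target_table_schema,   -- hypothetical
           c.TABLE_NAME               AS target_table_name,
           c.COLUMN_NAME              AS column_name,
           c.DATA_TYPE                AS db_data_type,
           c.NUMERIC_PRECISION        AS length,
           c.NUMERIC_SCALE            AS precision,
           c.ORDINAL_POSITION         AS ordinal_position
    FROM INFORMATION_SCHEMA.COLUMNS c
    WHERE c.TABLE_SCHEMA = 'staging'
      AND c.TABLE_NAME   = ?          -- table_name parameter for this iteration
    ORDER BY c.ORDINAL_POSITION;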

 

Once obtained, the metadata is passed into the “emi: t_staging_table_load” Metadata Injection step.   The metadata will look something like:

 

target_table_schema | target_table_name | column_name | data_type | format     | length | precision | ordinal_position | sql_stmnt
staging             | customer          | customer_id | Integer   | 0          | <null> | <null>    | 1                | SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging             | customer          | city        | String    | <null>     | <null> | <null>    | 2                | SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging             | customer          | state       | String    | <null>     | <null> | <null>    | 3                | SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging             | customer          | number_fld1 | Number    | 0          | 18     | 2         | 4                | SELECT customer_id, city, state, number_field1 FROM dbo.customer
staging             | customer          | create_date | Date      | yyyy-mm-dd | <null> | <null>    | 5                | SELECT customer_id, city, state, number_field1 FROM dbo.customer

 

Before explaining how the Metadata Injection step is configured, let’s review the t_staging_table_load template transformation.   This is called a template because its steps are not completely configured, i.e. the steps will have their metadata injected at runtime.    The template transformation is pictured below.

 

[Screenshot: the t_staging_table_load template transformation (image01.png)]

 

This template is straightforward.  We will inject the source SELECT into the Table Input step (table_name_select) and the column metadata into the Text File Output step (bulk_load_file).   Note that we obtain the target record create_date using a Get System Info step.  There is no metadata injected into this step, but we do include metadata about the create_date column in the Text File Output step.

 

In the Inject Metadata tab of the “emi: t_staging_table_load” step, we’ll configure the Table Input step to use the SQL SELECT statement (sql_stmnt).  (Note that while sql_stmnt is repeated for every record in the input metadata, it is used only once and will be executed only once.)   We’ll also configure the Text File Output step with metadata using the table_name as the filename and the column metadata for the fields. The column metadata includes column name, format, length, precision, and data type.

 

Voila!  We’re done.  All we need to do next is test and debug.   Alas, debugging metadata injection jobs with their dynamic runtime content is not easy.    Fortunately, PDI provides a helpful feature in the Metadata Injection step’s Option tab:

 

Setting this option writes out the runtime transformation (ktr) file that results after the metadata has been injected.   If your tests don't produce the desired results, you can open the generated ktr(s) in Spoon and run them individually to find the bug -- a huge timesaver!

 

Stepping back, imagine that you have 100 staging tables to load.   Before metadata injection, you’d build the process for one table and then pay an army of programmers to copy and paste the remaining 99.   (Not to mention the maintenance costs!)   With Metadata Injection, you are loading 100 tables using just 2 jobs and 4 transformations. BOOM!

 

When Not to Use Metadata Injection

There’s always a flipside.   Let’s next consider when Metadata Injection’s use would not be appropriate.    We have found four indicators that deem a set of processes incompatible with Metadata Injection:

  1. Business rules vary between process instances
  2. Substantial and complex transformation of the data
  3. Slowly changing dimension table loads
  4. Fact table loads

 

We suspect that there may be other indicators, but as with our traits for adoption, we need to start with what we have found and adjust as we learn more.

 

The first indicator implies that we cannot define a template transformation that uses the same set of processing steps for each process instance.   We need the variation to be declarative, meaning we can inject a finite set of attributes into well-defined, templated steps.  If the processing requires different steps or step sequences, it will be hard, if not impossible, to implement using Metadata Injection.

 

Determining whether a transformation pattern is too complex for Metadata Injection is often driven by the availability of injectable steps.   One can imagine a repeatable pattern with declarative variations that is complex.   That said, there are steps in PDI that may not be compatible with Metadata Injection.  Before embarking on a Metadata Injection based design, confirm that all of the steps in your process template are supported.   The complete list of steps that work with Metadata Injection can be found on the Pentaho help site here.

 

As of the current PDI release, the Dimension Lookup/Update step (used for maintaining slowly changing dimensions) does not support Metadata Injection.  But even if it did, it's highly unlikely that you would find repeatable patterns between dimensions.   For example, a transformation for loading dim_customer will likely differ vastly from one loading dim_product.

 

Considering fact tables, if a transformation template can accommodate the loading of more than one fact table, then there is a good argument that the fact table design is flawed.  Fact tables typically have varying dimensional grain and metric columns that are computed distinctly for that grain.  These features cause uniqueness of processing that is not suitable for Metadata Injection.

 

The Future of Metadata Injection

As PDI continues to grow, we can be certain that more steps will support Metadata Injection and the opportunity to apply its magic will grow.  But, heed the warning given to Peter Parker as he became Spiderman: “with great power comes great responsibility”.  Knowing when to apply this capability is just as important as knowing how to use it.

 

With our experience working with PDI across many different industries and domains, you can be assured that Pentaho Consulting knows when and how to apply the dynamism and might of Metadata Injection.   If you'd like more info on Metadata Injection and the rest of our data engineering services, drop us a note.  We'd be happy to share our expertise to ensure your success.

This post was written by Greg Graham and originally published on Friday, May 19, 2017

 

We've been seeing an increase in customers looking for cost-effective ways to include mapping solutions in their analytical applications. To do this within the Pentaho Business Analysis ecosystem, we can develop custom dashboards with the latest Javascript charting libraries, including LeafletJS, which is used by Tableau.  These custom dashboards leverage Pentaho's robust data services using Community Data Access (CDA) to provide fully interactive and integrated mapping and charting solutions right in your web browser or device.

 

The following demonstration walks through the use of LeafletJS and jQuery DataTables to create interactive displays of financial and test score data from school districts in Illinois.  You can also see the demo on Inquidia's site here.

[Screenshot: School Districts in Illinois 2014 demo]

 

The data were gathered from the Illinois State Board of Education and the US Census Bureau and joined using Pentaho Data Integration.  GeoJSON files containing map boundaries of Illinois elementary, secondary, and unified school districts were obtained from the US Census Bureau TIGER/Line system.  Selections in the data table are highlighted in the map, and vice versa.

 

 

 

Additionally, we've produced a user-driven charting laboratory in another view that lets the user make scatterplots of the data selected in the data table, on axes of their choosing.
[Screenshot: charting laboratory view of the School Districts in Illinois 2014 demo]

 

This demo presents just a taste of the integrated mapping, analytics and visualization techniques that we've developed through the years.  Integrating them with a mature web BI platform such as Pentaho enables us to deliver cost-effective, geographical analytical applications to our clients that provide a highly tailored user experience on a robust and scalable data platform.  Drop us a line today to see how we can help you meet your mapping analytics challenges!

 

See the demo here!

 

This post was written by Greg Graham and originally published on Thursday, April 13, 2017

Many of our customers are looking for a sound, cost-effective platform to provide rich, embedded visualizations in their web applications. To do this, there are a number of options that we commonly turn to. The Pentaho Business Analysis Server platform comes with many excellent tools for custom dashboard development (Analysis Reports/CDE/CDF). Pentaho allows developers to create highly customized dashboards using a relatively easy to use framework. But sometimes, Pentaho’s “out of the box” visualizations aren’t exactly what you’re looking for...


In these cases, where more visual flexibility is required, we develop custom dashboards with the latest Javascript charting libraries.  These dashboards leverage Pentaho’s robust data services using Community Data Access (CDA) to provide mobile friendly interactive charting solutions right to your web browser or device.


The following demonstration walks through the use of ChartJS and DimpleJS to create interactive displays of Pentaho's "Steel Wheels" sample data. The user can navigate and drill down by clicking on chart elements.  We utilize Bootstrap for mobile-friendly resizing and the dynamic appearance of ergonomic controls for smaller screen sizes.

 

 

Additionally, we've produced map-based displays with colorful, interactive choropleth maps of Steel Wheels data, like these examples using Plotly.js and Highcharts.

 


These demos present just a few of the many free and "paid-for" Javascript charting, mapping and visualization libraries that we've worked with through the years.  Integrating them with a mature web BI platform such as Pentaho enables us to deliver analytical applications to our clients that provide a highly tailored user experience with a robust and scalable data platform.  Drop us a line today to see how we can help you meet your web and mobile analytics challenges!

This post was written by Steve Miller and originally published on Wednesday, August 2, 2017

 

A few weeks ago, I participated in a conference call with a customer to discuss their new initiative in predictive analytics. We’ve already been helping them build a web-based analytic app for about 9 months. The partnership has been quite successful, and with analytics, once the stakeholders start to see their data and statistics, they often say “aha, this is great, now I’d like to see….” Or in this case, “now that I’ve seen some of the drilldowns, I’d like to produce go-forward predictive models…” The discussion was wide-ranging, spirited, and productive.

 

The analytics in the app today are primarily exploratory, with slice and dice/drilldowns from important business dimensions such as product, geography, and vendor into financial targets like billings, revenue, and profit. The aim is to identify those dimension categories that are business leaders and laggards. In analytics parlance, the drilldown dimensions are independent variables or features, while the financial targets are the outputs or dependent variables.

 

Going forward, the mandate is to be able to predict the performance with regression or machine learning models.

I had my statistical hat on early in the call, translating “dimensions” into categorical features, and “billings/profit” into numeric regression dependent variables. In concept, the work would be pretty straightforward, producing models that looked something like: profit = catvar1 + catvar2 + …..

 

After the call I flashed back to grad school, where we spent a lot of time working with precisely the same regression model challenge of continuous target variables against categorical or qualitative features. The tricky part was transforming the categorical variables so they made sense in the regression setting. What we ended up doing was coding “dummy variables” of 0’s and 1’s to represent qualitative variable categories. We then regressed the output variables of interest against those dummy variables. Back then, it was quite tedious to set up; today, thankfully, such “model matrices” are generated automatically by statistical software regression functions.

 

I also remembered a class assignment we were given to “prove” that the predictions derived from a regression of a numeric dependent on a qualitative feature were in fact equivalent to the means of the dependent summarized across levels of the categorical. In other words, in this simple case, the results of the predictive analytics would be identical to the exploratory.
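
For the single-factor case, the argument runs roughly as follows (my sketch, in standard dummy-coding notation, not the notation from the original assignment). With one categorical feature having levels 1 through k and level 1 as the reference:

    y_i = \beta_0 + \sum_{j=2}^{k} \beta_j\, d_{ij} + \varepsilon_i,
    \qquad
    d_{ij} = \begin{cases} 1 & \text{if observation } i \text{ is in level } j \\ 0 & \text{otherwise} \end{cases}

Because the dummies partition the observations, least squares separates by level:

    \hat{\beta}_0 = \bar{y}_1, \qquad
    \hat{\beta}_j = \bar{y}_j - \bar{y}_1 \;\; (j = 2, \dots, k),
    \quad\Longrightarrow\quad
    \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_j = \bar{y}_j

so every fitted value is simply the mean of its group -- exactly what the exploratory summary produces.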

 

So I just had to dig up a data set to test the conjecture. Lo and behold, I resurrected a blog from a few months back that used the Ames Iowa Housing Data, a data set that perfectly captured the need: a continuous dependent variable (house sale price), with scores of mostly categorical feature variables. So I loaded the data into R and ran a few models to "test" the exploratory = prediction conjecture.

 

What follows are the analyses. I’ll first explore the relationship between selected categorical predictors and house sale price visually. I’ll then compare house sale price means by those features with the unique predictions derived from regression models to determine if memory serves.

 

First, set a few options, load some packages, and read the data into an R data.table named housing. Compute several new variables, changing character vars to factors and dates as appropriate. Also, create binned variables for grlivearea (ground living area size).

 

 

Define a frequencies function and examine bathrcde, quantgrlivearea, and the 2-way combination. Note that while the distributions of bathrcde and quantgrlivearea are individually reasonable, the 2-way “crosstab” is quite sparse, since living area size and the number of bathrooms are correlated. This is a common problem with survey data, reflecting collinearity among features.

First look at the density violin plot of log10saleprice by bathrcde, ordering by median log10saleprice left to right. Not surprisingly, median sale prices increase with the number of bathrooms. From the x-axis, note that there are 955 1-bathroom homes with a median sale price of $127,500, compared to 48 homes with 4 or more bathrooms and a median sale price of $294,859. bathrcde seems to be highly related to log10saleprice.

Next look at log10saleprice by quantgrlivarea. Not surprisingly again, median sale price increases pretty dramatically with the percentile-binned living area size.

Finally, consider the two-way, faceted violin plot of log10saleprice by bathrcde and quantgrlivarea, which affirms both the positive relationships between # bathrooms, floor space and sale price, and cross-feature sparsity.

Now contrast the log10saleprice means broken down by bathrcde in the top table with the regression predictions of log10saleprice on bathrcde in table 2, cell below. With the regression table, only the distinct predictions are cited. Confirming the math, the means by group are identical to the regression predictions.

Ditto for log10saleprice by quantgrlivarea.

And finally, the exploratory means summarized by both bathrcde and quantgrlivarea are identical to the predictions derived from the two-way regression of log10saleprice on bathrcde and quantgrlivarea plus the interaction term.

The takeaways from this simple exercise? First, I haven’t as yet completely lost my memory. Also, sometimes it may be prudent not to obsess on the distinction between exploratory and predictive analytics, which in certain situations produce similar results. Rather, a focus on the design and execution of asking/answering business analytic questions with pertinent visual and predictive tools might be optimal.

This post was written by Chris Deptula and originally published on Tuesday, August 8, 2017

 

The data engineering industry has been through heavy disruption over the last five years. The increasing adoption of Hadoop, NoSQL databases, columnar analytic databases, and Cloud platforms has changed data integration strategy. One change as a result of these new platforms has been a move from a traditional ETL (Extract, Transform, Load) design to an ELT (Extract, Load, Transform) design.

 

What is ELT?

ELT and ETL are similar in that both are high-level strategies for moving data from one system to another. The difference is where the data transformation processing occurs. In ETL, the transformation is done in an external application such as Pentaho, Talend, or Informatica, with the results loaded into the target. ELT does the transformation in the target system: the data is first loaded into the target exactly as it looks in the source, and then scripts are run inside the target platform to perform the necessary transformations.

 

ELT is not a new concept. It has been around for years and is a common approach. However, in many situations, it introduces difficult challenges that make ETL a more attractive option. Some of these challenges include:

  • ELT increases load on the target system during data loads. ETL isolates the processing power required to transform the data outside the target database. ELT puts all of this load on the target database, perhaps causing performance conflicts with active analytic queries.
  • ELT processes usually require scripting. For databases this means writing SQL queries. For Hadoop this means writing Spark, Pig, Hive, or MapReduce. These scripts can be very complex and require significant expertise to maintain. Think 100+ line SQL queries.
  • Although the Transform is done by scripts inside the target platform, a solution is still required for Extract, Load, and overall job orchestration. This generally requires custom scripts or the use of traditional ETL tools such as Pentaho, Talend, or Informatica. (And if you’re using a traditional ETL tool for Extract and Load, the temptation to leverage its Transformation capabilities grows!)

 

Why ELT?

Performance. Data engineers are always under pressure to load the data into the target as quickly as possible. The faster the data can be loaded, the sooner the data can be acted upon --  improving the speed of business.

 

Performance differences between ELT and ETL when using classic relational databases such as Oracle, SQL Server, MySQL, and Postgres have been negligible. However, the introduction of Hadoop and analytic databases has tipped the scale, making ELT orders of magnitude faster than ETL when using these platforms.

 

Hadoop solved the system performance problem by massively scaling out processing. The performance benefits of MapReduce and Spark moving the processing to the data, and the volume of data processed in Hadoop, make ELT the obvious choice. Analytic databases, on the other hand, suffer significant performance degradation when you use ETL processing: they are very slow at handling a multitude of record-level updates and inserts, and in some cases a single record insert can take seconds. However, analytic databases are very good at set-based operations, where users bulk load data into the database and then execute SQL scripts to manipulate (i.e., transform) the required records in batch.
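
As a concrete, hypothetical sketch of that set-based style: rather than streaming row-level inserts and updates from an ETL tool, you bulk load a raw staging table and then run a single SQL statement inside the analytic database to transform and merge the whole batch (table and column names below are illustrative only).

    -- Hypothetical ELT transform step, run inside the target analytic database
    -- after stg_orders has been bulk loaded.
    INSERT INTO fact_orders (customer_key, order_date, order_amount)
    SELECT c.customer_key,
           s.order_date,
           s.order_amount
    FROM stg_orders s
    JOIN dim_customer c
      ON c.customer_id = s.customer_id;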

 

ELT is faster for these new types of systems, but as discussed, it still requires manually written scripts, limiting the audience that is able to maintain it. Pentaho recognizes this problem and sees a couple of new products on the market that help solve it.

 

ELT Technology Options

For the Cloud-based analytic databases Amazon Redshift, Snowflake, and Google BigQuery, Pentaho has experimented with Matillion, a Cloud-based visual ELT environment. There is no software license for Matillion; you simply pay a small upcharge on an EC2 instance and pay for Matillion by the hour. Matillion uses the same concepts as traditional ETL tools, with orchestration jobs and transformation steps that read data from the database and transform it in a visual development environment. The difference between traditional ETL tools and Matillion is that Matillion compiles the transformations into SQL that runs in the database rather than externally. Users familiar with traditional ETL tools will find the Matillion development interface familiar.

 

For ELT on Hadoop, at Pentaho we find our adaptive execution layer to be a powerful solution. This layer allows users to run ordinary Pentaho transformations as Spark jobs on a Cloudera Hadoop cluster. Existing Pentaho users can take advantage of parallel processing on Hadoop using Spark with no additional training. The transformations are written exactly the way Pentaho developers are used to writing them, maintaining the same level of maintainability as traditional ETL transformations. We expect the adaptive execution layer to continue to evolve into the preferred way of developing ETL. Write once, run anywhere.

 

The Future of ETL and ELT

It’s clear that new data management solutions are upsetting the traditional ways of transporting and organizing data. With every release, traditional ETL providers are adding capabilities to support evolving data management environments. And new vendors are also rapidly entering the market with purpose built tools to solve these new challenges. As these new data management solutions mature, the purpose built tools will continue to alter the traditional ETL and ELT landscape. With all of these changes, expect to see best practices for data management evolve along with them.  Pentaho can help you stay up to date with the latest data management solutions and designs.