This post was written by David Reinke and originally published on Thursday, June 16, 2016
This blog is the first in a series featuring some of the practices we use with Pentaho Data Integration.
Pentaho Consulting is often asked to performance tune and/or stabilize ETL processes created by developers who are inexperienced with the product. One of the first areas we investigate is the use of data lookups. Looking up values from a database or file for each record in a stream of data is one of the most frequently executed activities in any data program. If improperly tuned, ETL jobs can suffer. Examples of lookups abound including looking up a technical product dimension key for a product name, obtaining the most recent order record for a given customer, or gathering all pageview history for a given website visitor.
A frequent issue we find is either the misuse or inefficient configuration of these data lookup steps. It's amazing that problems with lookups would be such an issue -- data lookups seem, at first, like a very straightforward exercise. Yet, their optimization arguably provides the biggest “bang for the buck” in performance tuning and code stabilization.
We've found the root causes of these issues to be a lack of understanding of the various data lookup options provided by PDI and minimal recognition of regularly occurring lookup patterns. In this blog, I'll compare the most commonly used lookup steps and their core functionality. In future blogs, I'll review design patterns applying these characteristics to what we implement as part of our consulting projects.
Most Common Pentaho Data Integration Lookup Steps
There are many steps categorized as “Lookup” within the PDI environment. Let's focus on the three most frequently used steps: Database Lookup, Stream Lookup and Database Join. All of these steps assume an “inbound stream” of records that are then compared record by record with some “lookup source” to produce 0 or more “outbound stream” records with one or more “looked up” fields appended.
The following table summarizes the functionality and behavior of these steps.
Single Database Table
# of Records Looked Up
0 or more
Typical SQL Operators
Any valid SQL
Caching of Lookup Data
0 to All Rows, Many Config Options
Always All Rows
No-Match Lookup Behavior
Can filter inbound record or insert a null or default value in the lookup fields.
Can insert a null or default value in the lookup fields.
Can return null lookup fields or act as a filter on inbound stream
Multiple-Match Lookup Behavior
Can generate failure or match on first record returned from generated lookup query. (Lookup query can have an order by clause applied)
Looks up the last matching record in lookup stream
Returns all matched records potentially causing the inbound stream record to be replicated for each matched lookup record
As you can see, there is quite a bit of overlapping functionality. You can often use any one of the three to accomplish the same lookup task. However, there are both striking and nuanced differences which provide significant opportunity for advanced functionality and impactful performance tuning. This is where the use of patterns can aid the insightful developer -- applying rules of thumb and best practice algorithms.
In future blogs, I’ll bring to bear common patterns that we have uncovered through our years of PDI implementation work. I’ll start by examining the most frequently occurring pattern: Key-Based, Single Record Lookup. Visit our website in the next week or so as we publish the rest of this series.