This post was written by Dave Reinke and originally published on Wednesday, June 22, 2016
In a previous blog, we discussed the importance of tuning data lookups within Pentaho Data Integration (PDI) transformations. We suggested that there were identifiable design patterns which can be exploited to improve the performance and stability of our code. In this blog, we will investigate the most frequently used lookup pattern: Key-based, Single Record Lookup.
This pattern’s classic use case involves an inbound stream of fact records queried from a source containing dimensional attributes in the form of business keys. In order to load this data into the target fact table, the dimensional business keys must be converted into a set of technical keys referencing target dimension tables. The conversion of dimensional business to technical keys is done through a series of lookups.
The Database Lookup and Stream Lookup steps are used to implement this pattern. Database Join could be used, but would be much less efficient since it does not support caching. Additionally, we are not considering PDI’s purpose built Dimension Lookup/Update and Combination Lookup/Update steps as this pattern assumes no versioning of dimension data, i.e. no slowly-changing dimensions and that we have already loaded our dimension tables prior to the fact table load. (I can already hear some of my colleagues’ arguments. Yes, I know that Dimension Lookup/Update can be used for Type 1 dimensions and that its use enables the code to more easily migrate to Type 2, versioned, dimensions. It is a candidate for this pattern, but requires specific database design changes for its use. We’ll discuss Dimension Lookup/Update in a future blog. I promise.) So with that caveat, let’s examine the use of Database Lookup.
Using Database Lookup for Key-based, Single Record Lookups
Database Lookup is the prime choice for this pattern because it is easy to set up and provides rich functionality and tuning options. However, there are two vital prerequisites for its use. First, the lookup source must be a single database table. Second, the lookup table must contain a unique key column, often a business key, and a technical key column, almost always the primary key. The following diagram shows an example of this step’s configuration. It depicts a lookup of the technical product_key from the dim_product table in my_datamart database using a unique product_code business key field.
As configured, the step will generate and issue the following SQL statement to the my_datamart database:
select product_key from dim_product where product_code = ?
The ? is replaced by the value in the inbound stream’s product_code field.
Since the step is configured to enable caching of lookup data, the SQL statement will be run only once for each distinct product_code value found within the inbound stream. Additionally, we provide a default value of 0 should the lookup fail to find a match. No-Match behavior and caching are both configurable. Let’s discuss...
Handling No-Match Lookup Failure in the Database Lookup Step
In our example, a no-match would occur if an inbound stream record’s product_code did not find a matching product_code in the dim_product database table. The step is configured to “not fail”, rather to place a default value of 0 in the outbound stream’s product_key field. The underlying assumption is that there exists a “null” or “unknown” dim_product database record with a 0 product_key for null or invalid product_code’s. The following table describes the two ways a “no-match” can be handled by this step.
Provide a Default Value
Depicted in the example.
A subsequent step in the transformation can be used to filter and/or operate on the defaulted value, perhaps writing an error record and sending an email to an administrator.
Filter the inbound record
Check the "Do not pass the row if the lookup fails" box
No error is raised and the row is not passed to the outbound stream. This prevents you from taking corrective action on the errant data but ensures that only valid matches are processed.
In practice, we rarely use the filter option as most data programs attempt to remediate and/or alert if invalid data appears.
Optimizing the Lookup Cache for the Database Lookup Step
Not implementing or optimizing the lookup cache is a frequent mistake made by new PDI developers. Fortunately, it’s simple to explain and set up! The following table details the configuration options:
Check “Enable Cache”, then check “Load all data from table”
A single SQL select statement is executed, retrieving all key and lookup column data from the database Lookup Table. The entire Lookup Table dataset must fit in PDI’s memory. No inbound stream records will be processed until this entire dataset is retrieved and cached.
Check “Enable Cache”, uncheck “Load all data from table”, then specify how big (in # of lookup records) the cache will be allowed to get via the “Cache Size in rows” field.
One SQL select statement will be executed per distinct inbound record key that is not already found in the cache. PDI memory must be able to hold the number of lookup rows specified.
Uncheck “Enable Cache”
Every inbound record will cause a SQL select statement to be executed against the Lookup Table -- even if keys are repeated!
So, when to use which option? Here are some insights into our best practice decision-making...
When to Use
The most common configuration is Cache All. In general, if Cache All works efficiently, then use it. Most lookup tables are small (< 100,000 records) and thus can be retrieved quickly and will easily fit in memory. If full lookup dataset retrieval is inordinately slow or memory is limited, then consider Cache Some or No Cache depending on the size and lookup key distribution in your inbound stream.
Using Stream Lookup for Key-based, Single Record Lookups
Stream Lookup differs from Database Lookup in the following manner:
It’s lookup dataset is sourced from a second inbound stream of records and not tied directly to a single database table.
It always caches all lookup data -- there is no option for partial or no caching.
Thus, Stream Lookup should be used in the following circumstances:
The lookup data is not sourced from the entirety of a single lookup table
You may want to use a subset of rows in a lookup table. In this case, you’d use a Table Input step to gather the subset and then stream that subset as the lookup data into the Stream Lookup step, or
the lookup data is sourced by joining two or more tables or streams.
You have multiple lookups against the same “not big” lookup table
You have role-playing dimension keys that relate a single fact record to the same dimension table multiple times. In this case, you’d use a Table Input step to query all of the lookup data and then use “copy” hops to stream that data to multiple Stream Lookup steps -- one per role.
You have a source of data that does not reside in a database table
E.g. a flat file, or a spreadsheet
Since the lookup data stream for Stream Lookup can be generated by any valid sequence of PDI steps, the alternatives and examples for its use are somewhat boundless.
Also note that Stream Lookup does not provide a filter record option for “no-match” conditions. If you want to rid your stream of “no-match” records you have to use a Filter step after the Stream Lookup.
As you can see, there is tremendous opportunity to tune (or mistune!) lookups when doing a simple key-based, single record lookup. Next in our series, I’ll examine a similar pattern but with a subtle twist: Looking up the Most Recent Record.