This post was written by Dave Reinke and originally published on Wednesday, July 6, 2016
As we continue our series of Pentaho Data Integration (PDI) Lookup Patterns, we next discuss best practice options for looking up the “most recent record”. Common use cases for this pattern include finding the most recent order for a customer, the last interaction of a web site visitor, and the last claim record for a policy. As you’d expect there are multiple ways in which this can be done within PDI.
We’ll explore two of the most frequently used options, but first let’s define an example use case.
Example Use Case
Suppose we have a stream of customers each identified with a customer_key and there is an orders table which contains the following columns:
Primary key of the order. The column we want to lookup.
Foreign key to customers. The column we need to match with our input stream records..
The timestamp of the order. We’ll use this to determine the most recent record for a given customer_key.
There may be many other columns which we could include in our lookup. We will ignore these in order to de-clutter our example.
If you had customer_key, 100, and wanted to get the most recent order_key, you would issue a query like the following:
where customer_key = 100
order by order_timestamp desc
The database would return exactly one order_key (limit 1) for the given customer_key (100) chosen as the first of a list sorted by the order_timestamp descending. In PDI, we can use the Database Lookup and Database Join steps to accomplish the same result.
Approach 1: Database Lookup
As previously discussed, the Database Lookup step is most often used for key-based, single record lookups, however it is also apt for the Most Recent Record lookup pattern. The following picture demonstrates how this step would be configured for our example.
The step looks up the order_key of the most recent record in the order_db.orders table for each input stream record’s customer_key. The Order by field tells PDI to sort the matching orders records by order_timestamp in descending order. By design, the step will only return one row -- the first row returned after the sort. So, we have effectively configured the step to execute a query similar to the one listed above.
As configured, our example will execute a single lookup query for each input stream record. If customer_key’s are repeated in the input stream, this will result in many redundant query executions. To eliminate these redundant queries, caching can be configured for Database Lookup. Database Lookup enables caching because it restricts the lookup query to a single database table. The performance boost can be quite significant. In our example, suppose there are 1,000,000 input stream records containing 55,000 distinct customer_key’s. Without caching, the step will execute 1,000,000 lookup queries. If we check the Enable Cache? box, the step will only fire 55,000 lookup queries -- one per distinct customer_key. This results in an order of magnitude or more in improved performance.
Approach 2: Database Join
The Database Join step can also be configured to implement the Most Recent Record lookup pattern. The following picture shows it’s configuration for our orders example.
When using the Database Join step, you must supply your own lookup SQL. In our simple example, this is a straightforward parameterized select statement that returns a order_keys for a given customer_key in descending timestamp order -- most recent record first. We enter 1 in the Number of rows to return field to ensure that only the first record for each input stream record is passed to the output stream. We check Outer join? to allow a failed lookup record (i.e. no matching orders for a customer_key) to pass with a null order_key in it’s output stream record.
Performance-wise, the Database Join step will take about as long to process as a non-cached Database Lookup step. There will be one query per input stream record. Unlike Database Lookup, there is no ability to cache data to improve performance. However, because Database Join allows you to define your own SQL Select statement, you have the ability to define arbitrarily complex lookup queries.
Which Approach to Use?
So, how do you decide which step to choose? The answer is rather straightforward. Use Database Lookup step if you have a single lookup table which a straightforward lookup condition. Leverage caching if your input datastream contains duplicate lookup keys. Only use Database Join if your lookup query is complex. Remarkably, in the field, we see developers using to Database Join without realizing that Database Lookup can be leveraged to improve performance. Don’t make that mistake!
One final comment, the savvy reader will note that this same pattern applies to “least recent record” or historically first record lookups as well. In fact, any arbitrary record sort that you might apply can be handled by this design pattern.
Next in our series, I’ll examine a the Master-Detail explosion pattern causing the Database Join step to come to the fore.