Is there a way for a Pentaho job to run only on new records and not rerun the old records?

Anna Nguyen posted 05-03-2023 15:39

I have a query that parses JSON data to populate race and ethnicity columns. I added the query to a Pentaho transformation, and the Pentaho job takes a very long time (about 20 minutes) to run. We get new ID records every day, so I'm wondering if there's a way for Pentaho to insert only the new records and retain the old ones, so the job doesn't RERUN all the old records?

Here's my query for my Table Input step.
Here's my Pentaho transformation.
And here's my Table Output step in that transformation.

So basically, I wrote my query in the Table Input step to get the data from the production table, and then populate the table in the data warehouse in my Table Output step. The issue is that I have two tables in different databases, and I need Pentaho to insert only the new records into the data warehouse table and retain the old records without RERUNNING all of them, so the Pentaho job is optimized and runs faster.

John Craig

Hello Anna,

I don't know how much control (if any) you have over the source data (BISource_one--I infer there's also a BISource_two?), but if your source data were tagged with a creation/modification date, you could keep a row in your BI_Target table that records the last processed creation date.
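For example (a minimal sketch, assuming a created_at column on the source table and a one-row watermark table in BI_Target; the table and column names here are placeholders, and the exact types/syntax will vary by database):

    -- Hypothetical one-row watermark table in BI_Target
    CREATE TABLE etl_watermark (
        last_processed_at DATETIME NOT NULL
    );

    -- Table input: pull only rows newer than the watermark
    SELECT *
    FROM   source_table
    WHERE  created_at > (SELECT last_processed_at FROM etl_watermark);

    -- After a successful load, advance the watermark
    UPDATE etl_watermark
    SET    last_processed_at = (SELECT MAX(created_at) FROM source_table);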

It's a bit unclear what you're doing in the Get System Info step (perhaps tagging with the current time?).

Assuming that adding a timestamp to the source data is not a possibility and you really only need new entries, then I would suggest you put whatever key value is unique into a staging table in your BI_Target DB (without doing any processing and only using, for instance, the applicant_guid--or whatever is unique). Then have an SQL step that does a JOINed DELETE on the staging table so that all rows that already exist in the target are deleted from the staging table.

None of the following will work if you do not have a unique key column (or columns) in common between source and target.

The most efficient way to deal with the problem in that case is to use the Merge rows (diff) step, fed by a Table input step that reads the key column(s) from the target, compared to the values from another Table input step that reads the key column(s) from the source. That step is followed by a Switch / case step that sends everything that isn't new to a Dummy step. The rows that are new get all their data retrieved and processed as you do now, but your current Table input step gets replaced by an Execute SQL script step that takes the unique key(s) as parameters (so the script reads one row at a time) and passes the rows through to whatever other processing you're doing. That way you only end up with new entries in app_data_tribe. You might look at the examples, or find some via a web search, that use Merge rows (diff), just to see the pattern.

An alternate method is to use another staging table to decide what full data to pull. The steps below include more detail about how to pull the full data for a new record (which you'd use whether you go with Merge rows (diff) or the extra staging table).

So you'd create your staging table in BI_Target (I'm assuming applicant_guid is unique and that's all you'd need).
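A minimal sketch of such a staging table (the table name and column type are just placeholders; match the type to your source column):

    CREATE TABLE staging_applicant_keys (   -- hypothetical name
        applicant_guid CHAR(36) NOT NULL PRIMARY KEY
    );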

Then create a JOB (not a transformation) that would call some transformations:

Transformation 1
Step 1: Table input from BISource_one to retrieve the applicant_guid (I'm assuming that's a unique key; see the keys-only query sketch after Transformation 2)
Step 2: Table output step to Staging table (you could set it to truncate the staging table each time it hits this step)

Transformation 2 (if I understood correctly)
Step 1: Table input from BISource_two to retrieve the applicant_guid (I'm assuming that's a unique key; if not, use whatever is appropriate)
Step 2: Table output step to Staging table (don't check the box to truncate the table)
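For both of these key-extraction transformations the Table input query stays deliberately trivial, something along these lines (the table name is a placeholder, since the thread doesn't show the source schemas):

    -- Transformation 1 / 2, Step 1: pull keys only, no JSON parsing yet
    SELECT applicant_guid
    FROM   applicants_source;   -- placeholder for the relevant BISource_one (or BISource_two) table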

Transformation 3
Step 1: SQL Script to delete from the Staging Table joined to the app_data_tribe on Staging.applicant_guid = app_data_tribe.applicant_guid
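The exact delete syntax depends on the database; a sketch in SQL Server / MySQL style, using the hypothetical staging table name from above:

    -- Remove keys that already exist in the target, leaving only the new ones
    DELETE s
    FROM   staging_applicant_keys s
           INNER JOIN app_data_tribe t
                   ON t.applicant_guid = s.applicant_guid;
    -- (PostgreSQL form: DELETE FROM staging_applicant_keys s USING app_data_tribe t
    --  WHERE t.applicant_guid = s.applicant_guid;)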

Transformation 4
Step 1: Table input on Staging table to retrieve applicant_guid
Step 2: Execute SQL script step
Use a modified version of the query from your current Table input step, with a WHERE clause like this: WHERE applicant_guid = ? (sketched after this transformation's steps)
Specify applicant_guid (in the data stream from step 1) as the Parameters: Field name to be used as argument
Note that your SELECT query can omit the applicant_guid because you already have that in the data stream.
Check the box Execute for each row?
Step 3: Get System Info (just as you're using it now)
Step 4: Table output (as you're using now) -- which I note must be some kind of staging table because it's being truncated each time the step starts
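The per-row query in Step 2 ends up shaped roughly like this; the SELECT list and JSON parsing are whatever your existing query already does, and the names below are only placeholders:

    -- Runs once per applicant_guid coming from the staging table
    SELECT race_json, ethnicity_json   -- placeholders for your real JSON-parsing expressions
    FROM   production_applicants       -- placeholder for your real production table
    WHERE  applicant_guid = ?;         -- bound to the applicant_guid field from Step 1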

You'll need an index on the app_data_tribe table (or whatever table has the key you're comparing against). The end result is that although you'll still have to read all the key values and save them somewhere, you'll only do full processing on the entries that are new.
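If applicant_guid isn't already the primary key there, something like this (the index name is just a placeholder):

    CREATE INDEX ix_app_data_tribe_applicant_guid
        ON app_data_tribe (applicant_guid);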

Hope this is some help,

John

Stephen Donovan

The simple answer is, "But of course". There are a few options. As John stated, "None of the following will work if you do not have a unique key column (or columns) in common between source and target."

The Insert/Update step (versus Table Output) is the easiest. However, that step does a single-row lookup for each incoming row to determine whether the row is an Insert, an Update, or no action.
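Conceptually, for every incoming row the step does something like the following against the target, which is why it slows down on large volumes (this is just an illustration with assumed column names, not the step's actual internals):

    -- 1. Look up the incoming key in the target
    SELECT race, ethnicity              -- assumed compare/update fields
    FROM   app_data_tribe
    WHERE  applicant_guid = ?;

    -- 2a. No match: insert the row
    INSERT INTO app_data_tribe (applicant_guid, race, ethnicity)
    VALUES (?, ?, ?);

    -- 2b. Match found but fields differ: update it
    UPDATE app_data_tribe
    SET    race = ?, ethnicity = ?
    WHERE  applicant_guid = ?;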

The most performant (though more complex) option is to do a Merge Rows (diff) between your source and target: https://help.hitachivantara.com/Documentation/Pentaho/9.0/Products/Using_Merge_rows_(diff)_on_the_Pentaho_engine

A Table Input from Source and a Table Input from Target, sorted to match (via an ORDER BY clause on each), will result in a new column, 'flagfield' (see the query sketch after the list below).

New - row in source but not in target
Changed - row in both, but the data has changed
Deleted - row in target but no longer in source
Identical - row in both and the data matches
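The two Table input queries just need matching key columns and matching sort order; a sketch using applicant_guid from Anna's setup (the source table name is a placeholder):

    -- "Reference rows" Table input: the existing data (target / data warehouse)
    SELECT applicant_guid
    FROM   app_data_tribe
    ORDER BY applicant_guid;

    -- "Compare rows" Table input: the incoming data (production source)
    SELECT applicant_guid
    FROM   production_applicants   -- placeholder for the BISource table
    ORDER BY applicant_guid;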

You can then use a Switch / Case step (or a Filter rows step) to redirect rows to the proper action. In your case, only act on 'new' rows and send those to your Table Output.

As I stated, it's more complex, but it's purpose-built for batch insert and/or update and very fast.