Pentaho


 Execute an Update SQL Query on a Kettle Transformation

  • Pentaho
  • Kettle
  • Pentaho Data Integration PDI
Data Conversion posted 01-02-2018 17:05

Hi good people,

I'm new to Spoon and I've been using Holowczak's tutorials to get a feel for how PDI works. I have a scenario where I would like to run an update script after a table input and table output transformation. Can anyone assist? I have tried these four, but I can't seem to figure out how to make them work.

[screenshot: Scripting steps]

Here's the scenario...

Table Input: MySQL Database Table1 (Select * from Table1)

Table Output: Oracle Database (Create Table 1)

(This runs to completion, but then I have to execute the update script manually. I'm looking for a way to automate this.)

Intended Update Script:

update ods.table1 set column1='New Value1' where column1='Old Value1';

update ods.table1 set column1='New Value2' where column1='Old Value2';

Thank you in advance.


#PentahoDataIntegrationPDI
#Kettle
#Pentaho
Data Conversion

You want to use a Job.

Job: Start -> Transformation -> Execute SQL

Where the Transformation looks like:

Table Input -> Table Output

This will allow the Table Output to completely finish before the Execute SQL occurs.

This will also prevent table access deadlocks (Table Output will lock the table for Insert, and the SQL statement will want access to the table too!)
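
A minimal sketch of what the Execute SQL job entry could contain, using the statements from the original post (separated by semicolons so they run as one script):

-- runs only after the load transformation has finished
update ods.table1 set column1='New Value1' where column1='Old Value1';
update ods.table1 set column1='New Value2' where column1='Old Value2';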

Data Conversion

I used the Execute SQL Script step. I just added the two update queries separated by a semicolon (;).

I created two transformations: one for the table input and table output, and another for the Execute SQL Script step. I then created a Kettle job and placed the query transformation after the table output transformation.

Data Conversion

Hi Kenince,

It sounds like you're on the right track. Figuring out the best way to distribute tasks across transformations can be tricky, but avoiding doing too many different things at once in the same transformation is generally a good approach. Smaller, single-purpose transformations linked together with jobs are better.

I wanted to add that there are other tutorials out there at Getting Started with PDI that include both transformations and jobs.

Good luck!

Kevin Haas

Agreed with Greg that a good approach would be to break this into multiple transforms and then call them in a job.

Another idea is to use the "Blocking Step" or "Block this step until steps finish."

Blocking step - Pentaho Data Integration - Pentaho Wiki

Block this step until steps finish - Pentaho Data Integration - Pentaho Wiki

Kevin

John Craig

Kenice,

We use both approaches: separating tasks into two transformations, and the Block this step until steps finish method (depending on how things fit together most logically). That is, sometimes the job running a transformation is already complicated enough without adding another transformation just to do the update, and it makes sense to put a "wait for other things to finish" step (Block this step until steps finish) before continuing with subsequent steps.

One thing missing from the replies above is a really critical point to keep in mind:

All steps in a single transformation proceed in parallel.

The only practical exception is that some steps need input data before they can do anything, in which case they wait for data. What this means in your example is that the step reading the data (ERP Input) starts by running the query against its connection, and as soon as there is a batch of rows, those rows are passed to the output step (ODS Update), which operates on them as they come in. You can control the size of the batch passed between steps in the Transformation settings, and an update step has its own batch size (which typically controls how often a COMMIT occurs, or the batch size if batch updates are used). So if you watch the SQL operations arriving at the ODS DB from your PDI step, you would expect to see a whole series of UPDATE statements happening as soon as there is any data to operate on.

A step that does not have a hop leading to it will start as soon as the transformation starts (that is, before any data is available for the ODS Update step, in your example). If you do put a hop to it, the question becomes: does it need any input data to run? If not, it will start right along with the first step in the hop chain.

 

So, if you need steps to occur in a particular order, there are two things that will not do what you have in mind:

  1. Simply putting the steps into the transformation with hops indicating the order. That doesn't work because the steps start running in parallel (and just wait for data to work on, if they need input data).
  2. Putting the step in without a hop. That means the step starts as soon as the others do, with no waiting for data to come through from a prior step.

The options listed above will work:

  1. Put the step you want to run after the others into its own transformation, and then create a job that contains the two transformations (the hops in a job do indicate execution order: the first transformation finishes before the next one starts).
  2. Add a Block this step until steps finish step before the step you want to delay in your transformation, and configure it to wait for the appropriate step to finish (in this case ODS Output).

There are also other ways to accomplish things that may be simpler and/or more efficient, depending on your needs. For example, suppose you need to set a value in the ODS DB row based on a foreign key to another table in the ODS DB (that is, you cannot read the value in question from the ERP DB itself, but you can read its foreign key). In that case, rather than doing the update as a separate step after the ODS Update step, you might add the lookup step before the ODS Output step, so that you have three steps joined by hops in your transformation:

 

  1. ERP Input
  2. Lookup, which uses the ODS DB connection to look up one or more values in another ODS table using a key from the ERP Input query
  3. ODS Update

 

We use this pattern frequently in our PDI transformations, and it's a very useful one to have in your personal toolkit. It is particularly useful when the ODS Update touches only a tiny subset of the rows in the target table, but setting the extra value at the end, with an UPDATE statement that joins in one or more tables, would require examining every row in the target table. Also, if the lookup table is of modest size, or the input data has many rows with the same foreign key, you can configure the Lookup step to cache the values so it doesn't hit the DB for each entry.
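
For illustration, here is a rough sketch of the kind of end-of-load UPDATE that the lookup pattern avoids (the table and column names are made up for the example):

-- hypothetical names: sets the extra value by joining back to another ODS table,
-- which forces the database to examine every row in the target table
update ods.table1 t
   set t.extra_value = (select l.extra_value
                          from ods.lookup_table l
                         where l.lookup_key = t.foreign_key);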

 

Another pattern we use occasionally is to have a prior step feed one or more values into an Execute SQL Script step, which then substitutes those values into the SQL: the script has ? placeholders, and the step's configuration lists the fields from the prior step's output to feed into those placeholders as Parameters.
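
A minimal sketch of that placeholder style, using the table from the original post (the two fields supplied as Parameters are assumptions for the example):

-- the ? placeholders are filled, in order, from the fields named in the step's Parameters list
update ods.table1 set column1 = ? where column1 = ?;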

 

Perhaps one or the other of these patterns would be more useful in your particular case. If not, as you get deeper into PDI work, you're likely to run into cases that are handled more effectively by one of them.

 

John

Data Conversion

DON'T use a Block step for this.

The flow will behave in odd ways, including the SQL step running before the insert.

The correct way to do this is with a job, and likely even just using an SQL script at the job level.