Pentaho


Transformation stays at idle in PDI 9.4 with SQL Server 2016

Marc Stroo posted 09-28-2023 02:49

At my client I am testing PDI 9.4 CE, and it often happens that transformations stay at idle. It even happens that when I have the SQL editor of my table in block open and want to close it, it freezes and I have to do a forced close through Windows. This hampers my demonstration to the client of PDI as a new ETL tool. Any idea how I can fix this? The database used as both source and target is SQL Server 2016.

Carlos Lopez

What version of the MSSQL JDBC Driver are you using? Are you using MS Integrated Windows Security on the connection details?

Marc Stroo

Thank you for the reply and the additional questions. I have the driver mssql-jdbc-12.4.0.jre8. I don't use integrated Windows security, but a normal user and password. Integrated security was giving issues, and since I will run it scheduled later, I think a normal user works better.
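
For reference, a SQL-authentication URL for this driver generation looks roughly like this (dbhost and StagingDb are placeholders; from mssql-jdbc 10.x onward encrypt defaults to true, so a test server without a trusted certificate may also need trustServerCertificate=true):

jdbc:sqlserver://dbhost:1433;databaseName=StagingDb;encrypt=true;trustServerCertificate=true

The username and password go in the normal connection fields.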

Carlos Lopez

@Marc Stroo would you be able to share a sample transformation when you see this behavior? Are you using Table Input to Table Output or is it more complex than this? 

Marc Stroo

All I do is Table Input to Table Output. The SELECT statement is exactly in the format needed for the INSERT of the output block. So it's the most basic version of extract and load possible.
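
In other words, with a hypothetical target table holding columns (id, name, amount), the whole input query would just be:

SELECT id, name, amount
FROM dbo.SourceTable;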

Stephen Donovan

@Marc Stroo as a simple test to understand where it is blocking, can you disable the hop to the output and see if it runs through to the end, and at what throughput? You can also replace the output with a Text file output step and use that text file as an input for an output performance test. In most cases the issue is with the output: firewalls, triggers, indexing, page updates, etc.

Potentially, it's locking on the logging tables, if you have those all turned on and configured, or configured but not properly set up/indexed. Those are not usually on by default, but you may have inherited a kettle.properties file from a prior installation.
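
If you did inherit one, the telltale entries are the standard Kettle logging variables, something like this (the connection and table names here are just examples):

KETTLE_TRANS_LOG_DB=LoggingDB
KETTLE_TRANS_LOG_SCHEMA=dbo
KETTLE_TRANS_LOG_TABLE=trans_log
KETTLE_STEP_LOG_DB=LoggingDB
KETTLE_STEP_LOG_TABLE=step_log

If those log tables live in the same SQL Server and are not indexed, every run contends on them.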

One other rare issue is putting the transformation into single threaded mode and using the same connection for input and output. That will lock up the row buffer, as it may not be able to commit and context-switch that single connection.

As @Carlos Lopez mentioned, sharing a simple example would help us find configuration issues.

Marc Stroo

Thank you for the feedback. I am hesitant to put up an example here, as the client uses home-built custom software, so I could spread company secrets by posting a query on their data. I use a clean PDI 9.4 install, and all I do is just table in, table out, no alterations, on the new work laptop provided by my client. So everything is as default as it can be.

If this still is insufficient, I will put one transformation up here, provided that I can change the SQL in the table in block. I will first try turning on that one setting you mention.

Marc Stroo

Another thing I notice is that it seems to occur when I am rerunning a transformation that has previously produced some output. One thing I do in the input query is a LEFT JOIN on the output table. May that also cause an issue? And might that be solved by changing the setting you mention, where the connection is shared?
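
The pattern, with table names changed, is roughly this sketch; the table that the Table Output step writes to also appears in the input SELECT:

SELECT s.id, s.name, s.amount
FROM dbo.SourceTable s
LEFT JOIN dbo.TargetTable t
    ON t.id = s.id
WHERE t.id IS NULL;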

Marc Stroo

I just noticed that turning on that setting of sharing the connection actually causes writing to the table to stop happening altogether. So that surely solves nothing, sadly.

Petr Prochazka

You can create a thread dump of the transformation in the frozen state, for example with the JDK's jstack tool.

IMHO the problem is a table lock on the output table, so the Table output step cannot insert rows into the table.

You can try adding a Blocking Step before the Table output step. If that works, the problem is definitely a table lock in the DB.

Marc Stroo

I tried adding a Blocking Step, changing the output step to a Dummy step, and even using only the table input step with the table output step removed. In all cases it stays idle. It only helps when I add a TOP n to the query, like TOP 20 or TOP 2000. Then the query runs and finishes. If it has to do with locking, it seems the lock isn't dropped once I have run the query one time.

The thing to note is that in the table in step there is a reference to the table that is written to in the table out block. Maybe that creates a locking issue from the table out that affects the table in? 

Stephen Donovan

Good thinking. SQL Server 2016 (and SQL Server in general) is notorious for page-level locking on indexes. It is not clean about maintaining a read-isolated view of the tables, though I think you can tell it to do so.

However, you said you tested without targeting the output table and you were still blocked. This seems to point to the input. You also said you can limit rows and it works. So it seems your input is likely using a ton of resources, or depends on waiting for commits to other data (maybe the output table).

Try adding "WITH (NOLOCK)" to prevent your query from locking tables and/or pages:

SELECT column1, column2 FROM your_table WITH (NOLOCK) WHERE your_condition;

You can also try a read isolation level, but that might be harder to do within the session of Pentaho (unless you force a single session):
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN TRANSACTION;
-- Your SELECT statement here
COMMIT;

Marc Stroo

I put a WITH (NOLOCK) after each of the 3 tables I use in my query, and still it stays idle when I run the query without limiting the number of rows. When I do a TOP n in the first query part (I start with a WITH section), it runs. With a TOP 500 it runs more or less instantly. With a TOP 5000 there is a very short idle. With a TOP 50000 it is idle for 2 seconds or so before it starts running. Now I have tested it with a TOP 500000 and it stays idle for a long time. It seems it is doing some scan over the database that takes a long time. Normally, though, I see the status change to running as soon as the query is fired at the database from the Table Input block.
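
Concretely, the variant that runs looks roughly like this (same placeholder names as in my earlier sketch; only the TOP n changes between tests):

WITH src AS (
    SELECT TOP 500000 id, name, amount
    FROM dbo.SourceTable WITH (NOLOCK)
)
SELECT src.id, src.name, src.amount
FROM src
LEFT JOIN dbo.TargetTable t WITH (NOLOCK)
    ON t.id = src.id
WHERE t.id IS NULL;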

I will check the other suggestion and see what it does.

Marc Stroo

The transaction isolation level works!

Marc Stroo

I wonder if you can arrange for this to be set at the connection level? Then you would have to do it only once, instead of in every SELECT query that you have.

Petr Prochazka

You can define SQL statements to execute right after connecting on the Advanced tab of the connection definition.
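
For example, to apply the isolation level from earlier in the thread once per connection, enter this in the field for SQL statements to execute right after connecting:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;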