Pentaho

 How to handle huge (500-600 mn rows) records in PDI

Samarth Shah posted 10-20-2022 00:45
Hello there,

I have a task to take 500 to 600 mn records from a source table and compare those records with a target table (with almost the same number of rows). If there is any difference in a record, the target table should be synced according to the source table. Both tables are located in an Oracle database. I have implemented the transformation (shown in the image), which takes the source and target table data, sorts them, and then uses Merge rows (diff) to get the differences.

Img: Basic implementation for syncing the target database

The only problem is performance. Currently, sorting runs at about 100,000 rows per 5 minutes. At that rate, sorting and finding the differences would take hours.

Is there something I can do to improve the performance?

PS: Please don't judge me, I am new to the PDI application :)

Thank you for the support,
Samarth Shah.
Petr Prochazka
Hi Samarth,
the Sort rows step is not efficient here, because the step has to read all rows into memory (or spill them to disk) before it can send data to the next step.
Is it possible to use server-side sorting, i.e. an ORDER BY clause in the input SQL queries?
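
For example, both Table input steps could push the sort down to the database, so the Sort rows steps can be removed and Merge rows (diff) receives already sorted streams. A minimal sketch, assuming the tables are compared on a key column called id (table and column names are placeholders):

-- Source Table input: let Oracle sort on the merge key
SELECT *
FROM source_table
ORDER BY id

-- Target Table input: same key, same sort order
SELECT *
FROM target_table
ORDER BY id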
Samarth Shah
Hello @Petr Prochazka,

Actually, I had also tried an ORDER BY clause in the Table input SQL, but there was no difference in the reading speed of the rows.

I have one idea; I don't know if it can be implemented in PDI or not.

What if I take the sorted data (using ORDER BY) in batches (e.g. 100,000 rows per batch), compare each batch with the target and, after syncing, fetch the next batch? Is this approach possible in Pentaho?

Thanks.
Petr Prochazka
If there is no difference in reading speed, then the problem is in the Oracle server or the network, IMHO.

To see where the bottleneck is, look at the Step Metrics tab; these columns are the interesting ones:
  • Speed (r/s)
  • input/output
While the transformation is running, the information is refreshed every second. The input/output column shows the current row buffers on the step's edges, so if one step has a full input buffer and an (almost) empty output buffer, that step is the bottleneck of the data stream. If the Table input steps have empty output buffers, those steps are the bottleneck.

A transformation only fetches data and pushes the data stream through all the steps. Of course, you can fetch a smaller batch in the Table input step, process that data, and call the transformation again for the next batch.
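
As a rough sketch of that batching idea (assuming a numeric key column called id, Oracle 12c or later for FETCH FIRST, and parameters passed in from the calling job; all names are placeholders), each run of the transformation could fetch one batch in the Table input step:

-- Fetch the next batch of rows after the last processed id
SELECT *
FROM source_table
WHERE id > ${LAST_PROCESSED_ID}
ORDER BY id
FETCH FIRST ${BATCH_SIZE} ROWS ONLY

The job would then store the highest id of the batch into ${LAST_PROCESSED_ID} and call the transformation again until no rows are returned.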
Alberto Garcia
Hello @Samarth Shah, one way I did it was with custom pagination.

At first I tried this typical SQL pagination (in PostgreSQL), building a loop inside the parent Job (against PDI's warning):

SELECT *
FROM table_name
ORDER BY id
LIMIT ${PAGE_SIZE}
OFFSET ${PAGE_OFFSET}

Then in the loop you can set the ${PAGE_OFFSET} to the maximum id of your current query, and continue until you don't have any more records.

The problem with that method is that the OFFSET operation takes too long to process, so my next implementation was:

SELECT *
FROM table_name
WHERE id > ${PAGE_OFFSET}
ORDER BY id
LIMIT ${PAGE_SIZE}

And repeat the process until you reach the ${LAST_ID} of your query (again, you have to build a loop inside a Job, against PDI's warning).


The problem with this method is that PDI can run out of memory if you have to process many pages (in my case, it crashed at 250 pages).

Finally, I came up with another pagination approach that requires generating a sequence beforehand, which will be used as your ${PAGE_OFFSET}. Knowing all the offsets you need in advance lets you use PDI's own capabilities and do it without building a loop.
First, you query the first id you want to process, and then you generate a sequence by adding the ${PAGE_SIZE} you want to process for each query, until you reach the ${LAST_ID} of your query.
After that, just execute your transformation using the "Execute every input row" option, passing the sequence with the "Copy rows to result" step and the Parameters options. This way, you will have your ${PAGE_OFFSET} in the transformation for each query:

Parent Job:

Transformation to generate the sequence (in this case, I'm using PostgreSQL):
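
Something along these lines, for instance, can produce the whole offset sequence in a single Table input step (just a sketch; table and column names are placeholders, and min(id) - 1 is used as the first offset because the page query filters on id > ${PAGE_OFFSET}):

-- One row per page offset, stepping by the page size up to the last id
SELECT generate_series(
         (SELECT min(id) - 1 FROM table_name),
         (SELECT max(id) FROM table_name),
         ${PAGE_SIZE}
       ) AS page_offset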


Your "process table page" transformation:

NOTE: Now we know all of our ${PAGE_OFFSET} values, so we don't need to check the ${LAST_ID} variable. Also, we change the query to retrieve the ids between our current ${PAGE_OFFSET} and the next one (${PAGE_OFFSET} + ${PAGE_SIZE}), because if you use LIMIT ${PAGE_SIZE}, you might skip some ids (remember that we generated the sequence just by adding ${PAGE_SIZE} and not by querying the real ids).

SELECT *
FROM ${TABLE_NAME}
WHERE id > ${PAGE_OFFSET}
  AND id <= ${PAGE_OFFSET} + ${PAGE_SIZE}