Hello
@Samarth Shah, one way I did it was with custom pagination.
At first I tried this typical SQL pagination (in PostgreSQL), building a loop inside the parent Job (against PDI warning):
SELECT *
FROM table_name
OFFSET ${PAGE_OFFSET}
ORDER BY id
LIMIT ${PAGE_SIZE}
Then in the loop you can set the ${PAGE_OFFSET}
to the maximum id
of your current query, and continue until you don't have any more records.
The problem with that method, is that the "offset" operation takes too long to process, so my next implementation was:
SELECT *
FROM table_name
WHERE id > ${PAGE_OFFSET}
ORDER BY id
LIMIT ${PAGE_SIZE}
And repeat the process until you reach the ${LAST_ID}
of your query: (Again, you have to build a loop inside a Job, against PDI warning)

The problem with this method is that PDI can run out of memory if you have to process many pages (in my case, it crashed at 250 pages).
Finally, I came up with another pagination that requires a previous sequence generation, that will be used as your ${PAGE_OFFSET}
. Knowing beforehand all the indexes you need, will help you use PDI capabilities and do it without building a loop.
First, you query the first index you want to process, and then you generate a sequence adding the ${PAGE_SIZE}
you want to process for each query, until you reach the ${LAST_ID}
of your query.
After that, just execute your transformation using the "Execute every input row" option and passing the sequence with the "Copy Rows to Result" Step and the Parameters options. This way, you will have your ${PAGE_OFFSET}
in the transformation for each query:
Parent Job:

Transformation to generate the sequence (in this case, I'm using PostgreSQL):

Your "process table page" transformation:
NOTE: Now we know all of our ${PAGE_OFFSET}
, so we don't need to check the ${LAST_ID}
variable. Also, we change the query to retrieve the ids between our current ${PAGE_OFFSET}
and the next one (${PAGE_OFFSET} + ${PAGE_SIZE}
) because if you use LIMIT ${PAGE_OFFSET}
, you might skip some ids (remember that we generated the sequence just by adding ${PAGE_SIZE}
and not querying the real ids.
select *
from ${TABLE_NAME}
where id > ${PAGE_OFFSET}
and id <= ${PAGE_OFFSET} + ${PAGE_SIZE}
