Pentaho

  • 1.  Huge Performance Issues in PDI 9.3

    Posted 05-30-2022 01:02
    Hi,
    I have some huge performance issues in PDI 9.3 that I did not have in 8.3. I found that the same issues are present even in 9.2.
    The flow starts with a Text file input (about 2 million rows). Then I sort it. This works fine in all versions.
    But the Stream lookup steps are causing the performance issue.
    The first three work fine, but the last two never seem to finish.
    One of them does a stream lookup of 1.6 million rows against the other flow.
    A screenshot of the flow is attached.
    I tried it with
    To give you an idea of the performance difference:
    PDI 8.3 about 5 minutes (JRE 1.8.0_322 Eclipse)
    PDI 9.2 about 2.5 h (JRE 1.8.0_322 Eclipse)
    PDI 9.3 about 3 h (JRE 11.0.15 Eclipse)
    Settings: Nr of rows in rowset: 500000
    Feedback size: 100000
    Manage thread priorities: yes
    I also experimented with these settings, but it did not make much difference; it just got slower.
    Is this a known issue, or can anyone help me with that? Sorting and using a join is also an option, but when I have to sort the flow after every join it takes even longer.
    Thanks for help.

    ------------------------------
    Martin Heller
    Systems Engineer
    Wiener Netze
    ------------------------------


  • 2.  RE: Huge Performance Issues in PDI 9.3

    Posted 05-30-2022 01:43
    Hi Martin

    With the sorts, have you tried starting multiple copies of the sort step (so each copy has a far smaller set to organize) and then using a 'sort-merge' step with the same sort criteria to merge them back together?
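
    As a rough illustration of the idea outside PDI, the Python sketch below splits the rows into chunks, sorts each chunk on its own, and then merges the sorted chunks on the same key. The function name, the `copies` parameter, and the round-robin split are assumptions made for this example, and the chunks are sorted one after another here rather than in parallel.

```python
import heapq

def sort_in_chunks(rows, key, copies=4):
    """Sort rows by sorting several smaller chunks and merging them.

    Hypothetical helper for illustration; not PDI code.
    """
    # Spread the rows round-robin over `copies` chunks
    # (an assumption about how rows reach the sort copies).
    chunks = [rows[i::copies] for i in range(copies)]
    # Each chunk is sorted independently on the same key.
    sorted_chunks = [sorted(chunk, key=key) for chunk in chunks]
    # heapq.merge does a k-way merge of inputs that are already sorted.
    return list(heapq.merge(*sorted_chunks, key=key))

rows = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
result = sort_in_chunks(rows, key=lambda r: r[0], copies=2)
```

    The merge only gives a correctly ordered result if every chunk was sorted with the same criteria that the merge uses.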

    ------------------------------
    Andrew Cave
    Systems Engineer
    BizCubed Pty Ltd
    Australia
    ------------------------------



  • 3.  RE: Huge Performance Issues in PDI 9.3

    Posted 05-30-2022 02:25
    Hi, many thanks for the hint. No, I did not try that, and I will try it to improve the sort issues that I also have ;) But the main issue is still the stream lookup; this really hurts. I also tried running several copies there, but it did not help much.

    ------------------------------
    Martin Heller
    Systems Engineer
    Wiener Netze
    ------------------------------



  • 4.  RE: Huge Performance Issues in PDI 9.3

    Posted 05-30-2022 19:56
    I think the issue is because the text file extract is being searched line by line (as it won't have any indexing)

    a) You could try dumping the big files into an SQLite database file (it's file-based), putting an index on the lookup column and using that.
    b) You could use the merge-sort technique to sort the main stream and the lookup stream, and then use a merge-join step to connect them up correctly.
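
    A minimal sketch of option a), using Python's built-in sqlite3 module instead of PDI steps. The table name, column names, and sample rows are made up for illustration, and an in-memory database stands in for the database file.

```python
import sqlite3

def build_lookup(conn, rows):
    """Load (key, value) pairs into a table and index the key column."""
    conn.execute(
        "CREATE TABLE lookup_table (lookup_key TEXT, lookup_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO lookup_table (lookup_key, lookup_value) VALUES (?, ?)",
        rows,
    )
    # The index lets SQLite find a key without scanning every row.
    conn.execute("CREATE INDEX idx_lookup_key ON lookup_table (lookup_key)")
    conn.commit()

def lookup(conn, key):
    """Return the value stored for key, or None if the key is absent."""
    row = conn.execute(
        "SELECT lookup_value FROM lookup_table WHERE lookup_key = ?",
        (key,),
    ).fetchone()
    return row[0] if row else None

# ":memory:" keeps the example self-contained; pass a file path
# to sqlite3.connect() for an on-disk database file.
conn = sqlite3.connect(":memory:")
build_lookup(conn, [("A1", "alpha"), ("B2", "beta")])
found = lookup(conn, "B2")
missing = lookup(conn, "Z9")
```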

    Why it is slower is very hard to tell, but it might be that the last lookup steps are waiting for the input rows to come through from the sorts, and that something has changed there between versions.
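
    For option b), the principle of joining two streams that are already sorted on the join key can be sketched in Python as follows. This is an illustration only, not the PDI implementation: the function name and sample data are hypothetical, and it assumes the keys in the lookup stream are unique.

```python
def merge_join(left, right, key):
    """Inner-join two lists of dicts, both sorted ascending on `key`.

    Assumes each key appears at most once in `right` (the lookup stream).
    """
    joined = []
    j = 0
    for left_row in left:
        # Advance the lookup stream until it catches up with the main stream.
        while j < len(right) and right[j][key] < left_row[key]:
            j += 1
        # Emit a combined row only when the keys match.
        if j < len(right) and right[j][key] == left_row[key]:
            joined.append({**right[j], **left_row})
    return joined

main_stream = [
    {"id": 1, "x": "a"},
    {"id": 2, "x": "b"},
    {"id": 2, "x": "c"},
    {"id": 4, "x": "d"},
]
lookup_stream = [
    {"id": 2, "name": "two"},
    {"id": 3, "name": "three"},
    {"id": 4, "name": "four"},
]
joined = merge_join(main_stream, lookup_stream, "id")
```

    Each stream is read only once from start to end, which is why both inputs must be sorted on the same key beforehand.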

    ------------------------------
    Andrew Cave
    Systems Engineer
    BizCubed Pty Ltd
    Australia
    ------------------------------



  • 5.  RE: Huge Performance Issues in PDI 9.3

    Posted 05-31-2022 03:24
    Hi,
    OK, I just tuned the workflow based on your input.
    I also used the CSV file input instead of the Text file input; that also had a big impact. I put the replace string step after the two sorts.
    Now it's 1 minute faster than in 8.3. Many thanks for the help.
    SQLite would also be an option, but I don't have the infrastructure for that at this time.
    Best regards
    Martin

    ------------------------------
    Martin Heller
    Systems Engineer
    Wiener Netze
    ------------------------------



  • 6.  RE: Huge Performance Issues in PDI 9.3

    Posted 07-30-2022 16:10
    Martin
    Are you able to share a sample file with dummy data, or at least the header row, so I can generate some random data and attempt to reproduce this behavior? It seems very erratic, and while you were able to tune it to run better by changing the transformation to use the CSV input step, I believe there is still a performance issue that needs to be investigated.
    If you don't mind sharing, as noted, a sample file with the header row and some dummy data, plus the transformation, I would like to take a debugging look at it.

    ------------------------------
    Carlos Lopez
    Application Architecture Engineering - Expert
    Hitachi Vantara
    ------------------------------