View Only
  • 1.  Looking for design assistance

    Posted 01-11-2023 12:03
    I have a transformation where there are 8 streams, each with 5-10 values (integers in this instance).  The individual streams sometimes will have duplicated values from another stream.  My goal is to generate all unique combinations from the 8 streams.

    My current solution is using merge join 7 times, using some filtering between each merge join.
    • An integer value should not appear twice in the same row.
    • The same set of integers should not appear in two or more rows, even if in a different order. ([1, 2, 3] and [1, 3, 2] should not both exist in the output)
      • It does not matter which set is removed
      • I'm currently solving this by creating an ID for each row based on mathematical formula and removing duplicate IDs
    I hope this makes sense what I'm trying to accomplish.  My current solution works, but I'm looking for a more efficient way to do it.  By the last merge join, I'm looking at nearly 30 million rows, which takes a few minutes to process.

    Cory Glowe
    Systems Engineer

  • 2.  RE: Looking for design assistance

    Posted 01-12-2023 06:09

    You didn't mention where your streams are coming from but if they're all coming from a DB then how about a UNION query view, either on the DB or using the table input step?  I would test whether just a UNION on a DB view would work better than a UNION ALL on a table input step.  The former would possibly save network traffic from the DB and the DB server / engine is doing the work with potentially better compute power than wherever you have Pentaho running.  The UNION ALL in a table input step would perhaps save compute resource by not trying to drop duplicates at this stage if you can't create views on the DB.  Either way, you wouldn't need multiple sorts prior to merging like this.

    I would then do the same as you in terms of creating a sorted values concatenated ID field, then sort the resulting dataset using this newly created ID field and then I'd use the unique rows step to discard the duplicate ID rows.

    Presumably your logic for creating the sorted values for the new concatenated ID field, is placing the integer values into an array and then sorting the array?

    Kind regards,

    Justin Phebey

  • 3.  RE: Looking for design assistance

    Posted 01-12-2023 06:27
    Hi Cory,
    you can use cartesian join rows without key condition. This creates combination of all input stream, ex. in pic.

    Petr Prochazka
    Systems Engineer
    P.V.A. systems s.r.o.

  • 4.  RE: Looking for design assistance

    Posted 01-12-2023 10:56
    Hi Cory,
      one way would be to connect all the inputs to a Join Rows (cartesian product) step and code conditions to not have the same value across the inputs.
    To achieve that, each input must have a different stream name. For example, for 4 input streams, if the values are stored in the stream variable data1, data2, data3, data4, then the condition in the Join Rows is that
    data1 <> data2 AND data1 <> data3 and data1 <> data4 AND data2 <> data3 AND data2 <> data4 AND data3 <> data4. By doing so, the output of the Join Rows is a row where the values of data1, data2, data3 and data4 are all different.

    Then you need to spot the duplicates due to different combinatoric across data1, data2, data3 and data4. You can use a Modified Javascript value step. Inside this step your javascript code would simply fill an array with the values of data1, data2, data3, and data4. Then sort the array and output it as a new stream variable (let's call this signature). By doing so a row [1,2,3,4] and [3,4,2,1] would have the same signature 1,2,3,4. You add then a Sort component, by sorting on signature and ticking the checkbox "Only pass unique rows" to get only one combination for each signature.

    To be tested with your specific data ingestion mechanism / rowset size / environment.

    Antonio Petrella
    Data Service Manager