Pentaho

View Only

Back to discussions

Expand all | Collapse all

Looking for design assistance

This thread has been viewed 24 times

1. Looking for design assistance

0 Like
Cory Glowe
Posted 01-11-2023 12:03

Reply Reply Privately
I have a transformation where there are 8 streams, each with 5-10 values (integers in this instance). The individual streams sometimes will have duplicated values from another stream. My goal is to generate all unique combinations from the 8 streams.

My current solution is using merge join 7 times, using some filtering between each merge join.
Conditions:

An integer value should not appear twice in the same row.

The same set of integers should not appear in two or more rows, even if in a different order. ([1, 2, 3] and [1, 3, 2] should not both exist in the output)

It does not matter which set is removed

I'm currently solving this by creating an ID for each row based on mathematical formula and removing duplicate IDs

I hope this makes sense what I'm trying to accomplish. My current solution works, but I'm looking for a more efficient way to do it. By the last merge join, I'm looking at nearly 30 million rows, which takes a few minutes to process.

------------------------------
Cory Glowe
Systems Engineer
Tangoe
------------------------------
2. RE: Looking for design assistance

0 Like
Justin Phebey
Posted 01-12-2023 06:09

Reply Reply Privately
Hi,

You didn't mention where your streams are coming from but if they're all coming from a DB then how about a UNION query view, either on the DB or using the table input step? I would test whether just a UNION on a DB view would work better than a UNION ALL on a table input step. The former would possibly save network traffic from the DB and the DB server / engine is doing the work with potentially better compute power than wherever you have Pentaho running. The UNION ALL in a table input step would perhaps save compute resource by not trying to drop duplicates at this stage if you can't create views on the DB. Either way, you wouldn't need multiple sorts prior to merging like this.

I would then do the same as you in terms of creating a sorted values concatenated ID field, then sort the resulting dataset using this newly created ID field and then I'd use the unique rows step to discard the duplicate ID rows.

Presumably your logic for creating the sorted values for the new concatenated ID field, is placing the integer values into an array and then sorting the array?

Kind regards,
Justin

------------------------------
Justin Phebey
Freelance
------------------------------

Original Message
3. RE: Looking for design assistance

0 Like
Petr Prochazka
Posted 01-12-2023 06:27

Reply Reply Privately
Hi Cory,
you can use cartesian join rows without key condition. This creates combination of all input stream, ex. in pic.

------------------------------
Petr Prochazka
Systems Engineer
P.V.A. systems s.r.o.
------------------------------

Original Message
4. RE: Looking for design assistance

0 Like
Antonio Petrella
Posted 01-12-2023 10:56

Reply Reply Privately
Hi Cory,
one way would be to connect all the inputs to a Join Rows (cartesian product) step and code conditions to not have the same value across the inputs.
To achieve that, each input must have a different stream name. For example, for 4 input streams, if the values are stored in the stream variable data1, data2, data3, data4, then the condition in the Join Rows is that
data1 <> data2 AND data1 <> data3 and data1 <> data4 AND data2 <> data3 AND data2 <> data4 AND data3 <> data4. By doing so, the output of the Join Rows is a row where the values of data1, data2, data3 and data4 are all different.

Then you need to spot the duplicates due to different combinatoric across data1, data2, data3 and data4. You can use a Modified Javascript value step. Inside this step your javascript code would simply fill an array with the values of data1, data2, data3, and data4. Then sort the array and output it as a new stream variable (let's call this signature). By doing so a row [1,2,3,4] and [3,4,2,1] would have the same signature 1,2,3,4. You add then a Sort component, by sorting on signature and ticking the checkbox "Only pass unique rows" to get only one combination for each signature.

To be tested with your specific data ingestion mechanism / rowset size / environment.

------------------------------
Antonio Petrella
Data Service Manager
UNOG
------------------------------

Original Message

Pentaho

Looking for design assistance

Cory Glowe01-11-2023 12:03

Justin Phebey01-12-2023 06:09

Petr Prochazka01-12-2023 06:27

Antonio Petrella01-12-2023 10:56

1. Looking for design assistance

2. RE: Looking for design assistance

3. RE: Looking for design assistance

4. RE: Looking for design assistance

Related Content

PDI: Microsoft Excel Writer via Stream XLSX Data - always generates 1048576 rows

ETL Metdata Inejction for "Stream Lookup" step

How to optimize 'Join Rows (Cartesian Product)' step in Spoon?

Metadata injection with update step. Passing two columns to the stream

RE: How to optimize 'Join Rows (Cartesian Product)' step in Spoon?