Hi Cory,
One way would be to connect all the inputs to a Join Rows (cartesian product) step and write a condition so that the same value never appears twice across the inputs.
To achieve that, each input stream must carry its value in a differently named field. For example, with 4 input streams whose values are stored in the fields data1, data2, data3 and data4, the condition in the Join Rows step is:
data1 <> data2 AND data1 <> data3 AND data1 <> data4 AND data2 <> data3 AND data2 <> data4 AND data3 <> data4. With this condition, every row output by the Join Rows step has all-different values in data1, data2, data3 and data4.
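A plain-JavaScript sketch of what the Join Rows step does here (the four sample streams below are invented for illustration; in PDI the step evaluates the condition itself, this just shows the pairwise-distinct filter):

```javascript
// Hypothetical sample streams; in PDI each comes from its own input step
var s1 = [1, 2], s2 = [2, 3], s3 = [3, 4], s4 = [4, 5];
var rows = [];
s1.forEach(function (d1) {
  s2.forEach(function (d2) {
    s3.forEach(function (d3) {
      s4.forEach(function (d4) {
        // Keep only rows where all four values differ (the Join Rows condition)
        if (d1 !== d2 && d1 !== d3 && d1 !== d4 &&
            d2 !== d3 && d2 !== d4 && d3 !== d4) {
          rows.push([d1, d2, d3, d4]);
        }
      });
    });
  });
});
// Of the 16 raw combinations, only the 5 with no repeated value survive
```

Note that rows such as [1, 2, 3, 4] and permutations of the same values can still both appear, which is what the signature step below handles.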
Then you need to eliminate the duplicates that arise from different orderings of data1, data2, data3 and data4. You can use a Modified JavaScript Value step. Inside this step, your JavaScript code simply fills an array with the values of data1, data2, data3 and data4, sorts the array, and outputs it as a new field (let's call it signature). That way, a row [1,2,3,4] and a row [3,4,2,1] both get the signature 1,2,3,4. Then add a Sort rows step, sorting on signature with the "Only pass unique rows" checkbox ticked, to keep only one combination per signature.
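The script inside the Modified JavaScript Value step could look something like this (the sample values stand in for the row fields; in PDI you would add signature as an output field in the step's Fields grid):

```javascript
// In PDI these four values arrive as fields of the current row;
// they are hard-coded here only to make the sketch self-contained
var data1 = 3, data2 = 4, data3 = 2, data4 = 1;

var arr = [data1, data2, data3, data4];
// Numeric comparator: the default sort is lexicographic, so 10 would sort before 2
arr.sort(function (a, b) { return a - b; });
var signature = arr.join(",");
// Any ordering of the same four values yields the same signature, e.g. "1,2,3,4"
```

Declaring signature in the step's Fields grid makes it available to the downstream Sort rows step.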
To be tested with your specific data ingestion mechanism / rowset size / environment.
------------------------------
Antonio Petrella
Data Service Manager
UNOG
------------------------------
Original Message:
Sent: 01-11-2023 12:03
From: Cory Glowe
Subject: Looking for design assistance
I have a transformation with 8 streams, each carrying 5-10 values (integers in this instance). Individual streams will sometimes contain values that also appear in another stream. My goal is to generate all unique combinations from the 8 streams.
My current solution is using merge join 7 times, using some filtering between each merge join.
Conditions:
- An integer value should not appear twice in the same row.
- The same set of integers should not appear in two or more rows, even if in a different order. ([1, 2, 3] and [1, 3, 2] should not both exist in the output)
- It does not matter which set is removed
- I'm currently solving this by creating an ID for each row based on a mathematical formula and removing duplicate IDs
I hope what I'm trying to accomplish makes sense. My current solution works, but I'm looking for a more efficient way to do it. By the last merge join, I'm looking at nearly 30 million rows, which takes a few minutes to process.
------------------------------
Cory Glowe
Systems Engineer
Tangoe
------------------------------