
How to optimize 'Join Rows (Cartesian Product)' step in Spoon?

Question asked by Himanshu Dixit on Apr 10, 2018
Latest reply on Apr 20, 2018 by Himanshu Dixit

Hi Folks,

I am new to Kettle. I have a question regarding the 'Join Rows (Cartesian Product)' step.

I am using two BigQuery tables as input and cross joining them on three conditions based on date fields. The join conditions include operators such as '>=' and '<'. The first BigQuery table has around 5.5k records and the other has about 700k. Since it's a cross join, I am expecting the output to be somewhere around 3.8 billion records. Currently, this join happens on the BigQuery side; I read everything from that query and write it to a file, which takes close to 3, sometimes 3.5, hours. I want to optimize this. I am thinking about using two BigQuery inputs in Kettle and the 'Join Rows (Cartesian Product)' step to join them.
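
To make the numbers concrete, here is a minimal Python sketch of what the join amounts to. It is only an illustration, not the actual query or transformation: the field names (`start`, `end`, `event_date`) and the toy rows are hypothetical, and it uses two date conditions rather than the three in my real join.

```python
from datetime import date
from itertools import product

# Upper bound on the output size if no pair were filtered out
# (row counts taken from the description above).
small_rows = 5_500
large_rows = 700_000
max_pairs = small_rows * large_rows

# Toy stand-ins for the two tables; all field names are hypothetical.
periods = [
    {"period_id": 1, "start": date(2018, 1, 1), "end": date(2018, 2, 1)},
    {"period_id": 2, "start": date(2018, 2, 1), "end": date(2018, 3, 1)},
]
events = [
    {"event_id": 10, "event_date": date(2018, 1, 15)},
    {"event_id": 11, "event_date": date(2018, 2, 1)},
    {"event_id": 12, "event_date": date(2018, 3, 5)},
]

def cross_join_with_range(left, right):
    """Pair every left row with every right row, then keep the pairs
    that satisfy the '>=' and '<' date conditions."""
    return [
        (l["period_id"], r["event_id"])
        for l, r in product(left, right)
        if r["event_date"] >= l["start"] and r["event_date"] < l["end"]
    ]

matches = cross_join_with_range(periods, events)
print(max_pairs)
print(matches)
```

The point of the sketch is that every one of the roughly 3.85 billion pairs has to be evaluated against the conditions, whether that happens in BigQuery or in Kettle.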

 

Question: What is the best way to optimize the 'Join Rows (Cartesian Product)' step? I tried to implement the above logic in Kettle using the Join Rows step, but it also takes hours to finish. How can I achieve the same result in less time?
