
 Data Validation before moving on to the next step

  • Pentaho
  • Kettle
  • Pentaho Data Integration (PDI)
Henrik Perreault posted 07-07-2019 20:19


I discovered Pentaho about two weeks ago and I've transferred most of my Python routines into Pentaho Data Integration transformations for a project I've been working on: CSV data from a distributor, organised into an internal structure and imported into an eCommerce solution.

My internal structure is pretty basic, one entry per item required.

  • title
  • description
  • shortname
  • short description
  • price
  • stock level
  • weight
  • height
  • length
  • width
  • category
  • ...

Every product ready for CSV export MUST have one entry for each item in the list above, plus an undetermined number of attributes, also one per row.

I'm pretty confident that my transformations are effective, but I'd be reassured if there were a method to validate the presence of all the items listed above: both their presence for a given SKU and a minimum validation of their content.
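The completeness check described here, in the one-row-per-item structure, could be sketched in plain Python (the poster's original tooling). All names and the row layout are illustrative assumptions, not the actual project's:

```python
# Hypothetical sketch: verify every SKU has one row per required item.
# The (sku, item, value) row shape is an assumption for illustration.

REQUIRED_ITEMS = {
    "title", "description", "shortname", "short description",
    "price", "stock level", "weight", "height", "length", "width", "category",
}

def find_incomplete_skus(rows):
    """rows: iterable of (sku, item, value) tuples, one row per item.
    Returns {sku: set_of_missing_items} for every SKU lacking a required item."""
    seen = {}
    for sku, item, _value in rows:
        seen.setdefault(sku, set()).add(item)
    return {sku: REQUIRED_ITEMS - items
            for sku, items in seen.items()
            if REQUIRED_ITEMS - items}

rows = [
    ("SKU-1", "title", "Widget"),
    ("SKU-1", "price", "9.99"),
]
missing = find_incomplete_skus(rows)
```

Running this over the full 90k-product set would list exactly which SKUs would produce holes in the exported CSV.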

I've been looking through the documentation for a method to make sure all required entries exist and/or are valid, so the CSV generator would not output inconsistent entries.

With 90k products (rows in the final output CSV), it's easy to miss holes or potentially broken data.

Any advice? An example or some sort of road map to accomplish this would be appreciated.

Thank you 


Ana Gonzalez

I don't know if I have understood exactly what you are trying to do, but one thing you can do is use the Filter rows step: set the condition that each of the mandatory columns is not null, then route rows that meet the conditions (TRUE) to one flow of data and rows that don't (FALSE) to another.
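Outside of PDI, the Filter-rows split Ana describes amounts to routing each row into a TRUE or FALSE stream based on a null check on the mandatory columns. A minimal Python sketch, with column names assumed for illustration:

```python
# Sketch of the Filter-rows idea: split rows into a TRUE stream
# (all mandatory columns filled) and a FALSE stream (anything missing).
MANDATORY = ["title", "price", "category"]  # assumed subset for illustration

def split_rows(rows):
    true_stream, false_stream = [], []
    for row in rows:
        if all(row.get(col) not in (None, "") for col in MANDATORY):
            true_stream.append(row)
        else:
            false_stream.append(row)
    return true_stream, false_stream

rows = [
    {"title": "Widget", "price": "9.99", "category": "Tools"},
    {"title": "Gadget", "price": None, "category": "Tools"},
]
ok, bad = split_rows(rows)
```

In Spoon this corresponds to drawing two hops out of the Filter rows step, one per outcome.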



Ana Gonzalez

The FALSE stream is not necessary: if you don't want to do anything with the rows that don't meet your requirements, you can just skip it.

Johan Hammink

And of course you can also use the "Data Validator" step.
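For readers outside PDI, the kind of per-field rules the Data Validator step applies (null checks, format checks) can be sketched in plain Python. The rules below are illustrative assumptions, not the step's actual configuration:

```python
import re

# Sketch of Data-Validator-style per-field rules: each field gets a
# presence check plus a format (regex) check. Patterns are assumptions.
RULES = {
    "price": re.compile(r"^\d+(\.\d{1,2})?$"),   # e.g. 9.99
    "stock level": re.compile(r"^\d+$"),          # non-negative integer
}

def validate_row(row):
    """Return a list of human-readable errors for one product row (dict)."""
    errors = []
    for field, pattern in RULES.items():
        value = row.get(field)
        if value is None or value == "":
            errors.append(f"{field}: missing")
        elif not pattern.match(str(value)):
            errors.append(f"{field}: bad format {value!r}")
    return errors
```

The Data Validator step reports similar errors per rule and can route failing rows to an error-handling hop.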

Henrik Perreault

Thank you Ana, 

24 hours after posting my message and experimenting in Pentaho, I realized I was focusing my validation on the original structure (MySQL, one row per meta field I need), when after all my transformations to produce the CSV, the values are all lined up in columns ready for output (easy to validate with the transformation you suggested).
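The pivot described here, from one row per meta field to one column-per-field row per SKU, could be sketched like this in Python; the row shape and names are illustrative assumptions:

```python
# Sketch: pivot the MySQL-style one-row-per-meta layout into one dict
# per SKU, which is then trivial to validate column by column.
def pivot(rows):
    """rows: iterable of (sku, item, value). Returns {sku: {item: value}}."""
    products = {}
    for sku, item, value in rows:
        products.setdefault(sku, {})[item] = value
    return products

rows = [
    ("SKU-1", "title", "Widget"),
    ("SKU-1", "price", "9.99"),
    ("SKU-2", "title", "Gadget"),
]
products = pivot(rows)
```

In PDI terms this is roughly what a row-denormalising transformation produces before the CSV output step.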

I was initially thinking of validating my values before running my CSV generator output, and the format they were in was the main reason for my request.

Henrik Perreault

I have yet to try this step; I wasn't able to see a way to use it with the original structure (MySQL, one row per meta field). Thanks for your input.