Pentaho


 How to skip illegal characters (CTRL-CHAR) in XML

  • Pentaho
  • Kettle
  • Pentaho Data Integration PDI
Ana C posted 10-10-2018 10:44

Hello Community,

I am dealing with large XML files that contain illegal characters.

The "XML Input Stream (StAX)" and "Get data from XML" steps work fine when no illegal characters appear, but I haven't found a way to skip such characters in either step.

An example is attached.

Does anyone have a solution? Any suggestions are welcome!

Thanks,

Ana

*PDI version 7.1

 


Brandon Jackson

This is bad, but you have two options.

1. Clean the files before they get to PDI

Remove non-printable ASCII characters from a file with this Unix command | alvinalexander.com

2. Use a step like "Load file content in memory" and set the "File content" field to type "Binary", then follow up with a JavaScript or User Defined Java Class (UDJC) step and strip the characters out in code.

The real bummer here is that the character is essentially a run of non-printable bytes, probably "0001" (that is, U+0001), so the XML step, and frankly any other step that tries to represent it as a Java String, is going to have a problem with it. That's why the step bombs out. So load the file as a stream of bytes in PDI and handle it in code, something along the lines of the link below:

Binary to Ascii Conversion

During that processing you can kick out the bytes you consider invalid, and then you are good to go.
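
To make the "kick out those bytes" idea concrete, here is a minimal sketch in Python rather than PDI JavaScript or UDJC. The function names `strip_control_bytes` and `clean_file` are hypothetical, and the choice of which bytes count as invalid (C0 control bytes other than tab, LF and CR) is an assumption for illustration, not something taken from the attached file.

```python
# Bytes to drop: C0 control bytes except tab (0x09), LF (0x0A) and CR (0x0D).
# This set is an assumption; adjust it to whatever your files actually contain.
CONTROL_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def strip_control_bytes(data: bytes) -> bytes:
    """Return data with every byte listed in CONTROL_BYTES removed."""
    return data.translate(None, CONTROL_BYTES)


def clean_file(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> None:
    """Copy src_path to dst_path in chunks, dropping control bytes on the way."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(strip_control_bytes(chunk))
```

Calling `clean_file("in.xml", "out.xml")` (placeholder paths) before the XML step would be one way to use it. A byte-level filter like this is only safe for ASCII-compatible encodings such as UTF-8 or Latin-1; in a multi-byte encoding the same byte values can be part of a legitimate character.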

I noticed that your XML files declare UTF-16 and that special characters from a Latin-script language may also be involved. So you would have to define all the byte combinations that stump your file and are not printable from a UTF-16 perspective.
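
One way to avoid enumerating byte combinations by hand is to decode the file as UTF-16 first and then filter characters instead of bytes. The sketch below is in Python and assumes the input is otherwise well-formed UTF-16 with a byte order mark. It keeps only characters allowed by the XML 1.0 `Char` production (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The names `strip_invalid_xml_chars` and `clean_xml_file` are hypothetical.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production.
INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that an XML 1.0 parser would reject."""
    return INVALID_XML_CHARS.sub("", text)


def clean_xml_file(src_path: str, dst_path: str, encoding: str = "utf-16") -> None:
    """Decode src_path, drop invalid characters, and re-encode to dst_path."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(strip_invalid_xml_chars(line))
```

Because the filter works on decoded characters, a letter such as U+0101 (whose UTF-16 bytes are 0x01 0x01) survives, while a bare U+0001 is removed.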

Hope that helps.

Ana C

Option 1 works perfectly.

Many thanks
