Pentaho


 How to read an HTML file as input?

  • Pentaho
  • Kettle
  • Pentaho Data Integration PDI
Leandro Alves posted 07-10-2019 21:16

Hi guys, 

I'm trying to read an HTML file (not over HTTP, just a single HTML file in my desktop folder that contains raw data), but I'm failing miserably.

How can I do this? Can PDI help me with this case?

Note: I'm not using the server, just Spoon, and my version is 6.1.

tks!


David da Guia Carvalho

Hi,

There is no step that parses HTML itself, so you have to do it yourself, and you have some choices!

Either way, you will first have to "prepare" the HTML or convert it to tabular data.

As far as I can see, you want to get the HTML table into a data stream. In that case, a very simple way to do it would be to manually copy the "table" element, replace the tags with a delimiter, and save the result to a "csv" file.

An HTML table is composed of something like:

<table><th>.....

<tr><td>VALUE</td><td>VALUE1</td></tr>

You could replace:

  • "<tr>" and "<td>" with blank
  • "</td>" with the separator ";"
  • "</tr>" with a line feed "\n" (or just blank, depending on your file)

On Linux there is a very easy way with "sed" (GNU sed): just copy the table to a new file and it could go like this:

sed -i 's/<tr>//gI' my.html

sed -i 's/<td>//gI' my.html

sed -i 's/<\/td>/;/gI' my.html

sed -i 's/<\/tr>/\n/gI' my.html
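
On a machine without sed (a Windows desktop, for instance), the same replacements can be sketched in Python. This is only an illustration of the replacement rules listed above, not a PDI feature; the function name `table_rows_to_csv` and the sample markup are made up for the example, and it assumes plain tags with no attributes.

```python
import re

def table_rows_to_csv(html: str, sep: str = ";") -> str:
    """Apply the tag replacements described above to simple <tr>/<td> markup."""
    flags = re.IGNORECASE
    text = re.sub(r"<tr>|<td>", "", html, flags=flags)  # opening tags -> blank
    text = re.sub(r"</td>", sep, text, flags=flags)     # end of cell -> separator
    text = re.sub(r"</tr>", "\n", text, flags=flags)    # end of row -> line feed

    lines = []
    for line in text.splitlines():
        line = line.strip()
        # Drop the single separator left behind by the last </td> of the row.
        if line.endswith(sep):
            line = line[: -len(sep)]
        if line:  # skip blank lines
            lines.append(line)
    return "\n".join(lines)

# Hypothetical sample rows, mixing lower- and upper-case tags.
sample = "<tr><td>VALUE</td><td>VALUE1</td></tr>\n<TR><TD>A</TD><TD>B</TD></TR>"
print(table_rows_to_csv(sample))
```

The result can be saved with a .csv extension and read in PDI with a text or CSV input step using ";" as the delimiter. Tables with attributes on the tags or nested markup would need a real HTML parser instead of plain replacements.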

You can also parse the HTML "manually" in PDI using the "Replace in string" step.

Johan Hammink

There is an "HTML to XML" plugin in the Marketplace. That step is inspired by a blog post by Roland Bouman:

http://rpbouman.blogspot.com/2011/05/using-tidy-to-clean-webpages-with.html
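
To illustrate why converting to XML helps: once the page is well-formed XHTML, the table rows can be picked out with XPath-style paths instead of string replacements. Below is a minimal sketch of that idea in Python, outside PDI, using only the standard library. The XHTML string is an invented stand-in for tidied output, and it assumes no XML namespace on the elements (real tidied XHTML may carry one, in which case the tag names need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for tidied, well-formed XHTML.
xhtml = """<html><body><table>
<tr><th>name</th><th>qty</th></tr>
<tr><td>VALUE</td><td>VALUE1</td></tr>
</table></body></html>"""

root = ET.fromstring(xhtml)

rows = []
for tr in root.iterfind(".//tr"):
    # Header (<th>) and data (<td>) cells are direct children of each row.
    cells = [(cell.text or "").strip() for cell in tr if cell.tag in ("th", "td")]
    rows.append(cells)

print(rows)
```

Each inner list is one table row, ready to be written out as a delimited line or mapped to fields.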

Data Conversion
Attachment: rawdata.PNG (90 KB)