

How to read a html file as input?

Leandro Alves posted 07-10-2019 21:16

Hi guys,

I'm trying to read an HTML file (not over HTTP, just a single HTML file with raw data in a folder on my desktop), but I'm failing miserably.

How can I do this? Can PDI help me with this case?

Note: I'm not using the server, just the Spoon module, and my version is 6.1.

Thanks!

#Kettle
#Pentaho
#PentahoDataIntegrationPDI


David da Guia Carvalho posted 07-11-2019 12:42

Hi,

There is no step that parses HTML itself, so you have to do it yourself, and you have some choices!

Either way, you will first have to "prepare" the HTML, or convert it to tabular data.

As far as I can see, you want to get the HTML table into a data stream. In that case, a very simple way to do it would be to manually copy the "table" object, replace the tags with a delimiter, and save it to a "csv" file.

An HTML table is composed of something like:

<table><th>.....

<tr><td>VALUE</td><td>VALUE1</td></tr>

You could replace:

"<TR>" and "<td>" with blank
"</td>" with the separator ";"
"</TR>" with a line feed "\n" (or just blank, depending on your file)

On Linux there is a very easy way with "sed": just copy the table to a new file and it could go like this (the "I" flag for case-insensitive matching needs GNU sed):

sed -i 's/<tr>//gI' my.html

sed -i 's/<td>//gI' my.html

sed -i 's/<\/td>/;/gI' my.html

sed -i 's/<\/tr>/\n/gI' my.html
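If sed is not available (for example on a Windows desktop), roughly the same conversion can be sketched in Python. Note that this swaps the plain tag replacement for the standard library's html.parser, so cells with attributes (such as <td class="x">) and <th> header cells are handled too. The names TableToRows and html_table_to_csv are made up for this illustration, and the sketch assumes a single, non-nested table with properly closed cells:

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; assumes one non-nested table.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TR> and <tr> are treated alike.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html, delimiter=";"):
    """Return the table rows found in `html` as delimiter-separated text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Small demonstration with the kind of markup shown above.
sample = "<table><tr><th>A</th><th>B</th></tr><TR><td>VALUE</td><td>VALUE1</td></TR></table>"
print(html_table_to_csv(sample))
```

Writing the returned text to a file gives a ";"-separated file that a "CSV file input" or "Text file input" step in Spoon should be able to read.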

You can also parse the HTML "manually" in PDI using the "Replace in string" step.

Johan Hammink posted 07-12-2019 09:26

There is a plugin in the Marketplace, HTML to XML. That step is inspired by a blog post by Roland Bouman:

http://rpbouman.blogspot.com/2011/05/using-tidy-to-clean-webpages-with.html

Data Conversion posted 08-14-2019 20:18

Attachment: rawdata.PNG (90 KB)

Data Conversion posted 08-14-2019 20:18

Attachment: Capturar.PNG (3 KB)
