Pentaho

 Some questions about the difference between open source and Enterprise edition

Michel Philippenko posted 05-11-2020 12:21

Hello,

I would like to ask you some questions (you can simply answer inline in the text of this message):

 

1. I saw a video about the overall differences between Pentaho Enterprise and Kettle: could you provide a precise list of the functionality that differs?

2. Does Kettle have connectivity with SIEBEL (ERP)?

3. Does Kettle have connectivity with PeopleSoft?

4. Does Kettle have connectivity with Webmethods?

5. Is it possible to import Word files (.DOC, .DOCX)?

6. Is it possible to import PDF files?

7. Does data lineage belong to "Advanced Analysis", which is in the Enterprise version?

https://help.pentaho.com/Documentation/9.0/Products/Data_lineage

8. Is it possible, in Kettle, to trigger a transformation on data received from a socket?

9. What about a multi-user repository in Kettle?

10. What about versioning of metadata in Kettle?

11. What about configuration management (SCM) in Kettle?

 

Thank you for your answers, with best regards,

 

Mikhael Philippenko, senior consultant


Ana Gonzalez

I can't answer all your questions, but I'll help you with the ones I can.

  1. I use the Community Edition (CE) and I'm not a Hitachi Vantara (HV) employee, so I can't provide a precise list of differences. A few plugins are only available in the Enterprise Edition (EE), but the main functionality is available in CE. For the plugins that are not available (such as the built-in Python executor or R executor), it is possible to make things work with a workaround or, if you are a savvy Java developer, to build your own plugin. There might be differences when working with some big data technologies; you may have to search for them to find out whether they work with CE. There are also plugins on GitHub for CE (and EE) that work with Beam and that are not available in the basic installation or in Kettle's built-in marketplace.
  2. I don't work with Siebel, but I don't remember people asking about it in the old forum or here, so I don't think there's anything specific built in. If you can connect to the Siebel database or go through a web service, you'll be able to connect with the generic database and HTTP/REST (GET, POST, etc.) steps.
  3. Same as the previous question. PeopleSoft works with an Oracle database, and I can confirm that you can connect to Oracle with CE through a JDBC connection and extract/load data; that's how I connect to my own Oracle databases. The JDBC driver is not shipped with Kettle, but you can download it from the Oracle website and add it to the installation: it's as simple as adding a .jar file to the plugins directory in the Kettle installation.
  4. Kettle provides generic HTTP/REST (GET, POST) steps, so I think it does.
  5. There's nothing built in to do that. With .docx you can use the generic XML reading step to get the information you need, but if the document is complicated, it's going to be a nightmare.
  6. Again, the same as the previous question. I talked in the past with someone who was doing that; I think they built custom functionality by adding the Apache Tika libraries to Kettle, but I'm not a Java developer, so I can't do that myself. After a quick search on Apache Tika, I think it also has Java libraries to read .doc and .docx documents, so the same answer might apply to question 5. If I had to do it myself, since I know a little R, I would probably write some R scripts to extract the information from the PDF or DOC into a more structured format such as CSV or Excel, call a shell script from Kettle to run R, and then continue with PDI. That's only because it would be easier for me than doing it in Java and incorporating the proper libraries. My point is that when Kettle has nothing specific available, it has enough steps to make the workarounds you are familiar with possible.
  7. Yes, data lineage is EE only. Since Kettle stores its transformations and jobs as XML files, you could build your own data lineage with Kettle by reading the XML of your own code. There's a project I love on GitHub that auto-documents your code. It's no longer updated, so the images belong to an old version (5.1 or so), and it's not exactly data lineage, but as of today it still works with version 8.2 (the one I'm using right now; most of the steps I use get documented) and generates nice HTML documentation that you can use as a starting point: https://github.com/rpbouman/kettle-cookbook. This fork generates more modern images and logos: https://github.com/danielams/kettle-cookbook but I think it didn't work out of the box for me; I had to correct some minor hard-coded paths to make it work.
  8. I don't understand what you mean. Maybe something like processing data from a stream? In that case, EE might have steps that are not available in CE, or the useful specific steps may be available in GitHub repositories that you have to add to the installation, maybe something like this: https://diethardsteiner.github.io/pdi/streaming/2016/10/30/PDI-Streaming.html (a very nice blog, by the way, with a lot of resources for Kettle. When you are starting out you might also find useful information in the old blog, even though it refers to very old versions of Kettle, so some things might have changed: https://diethardsteiner.blogspot.com/)
  9. PDI has a built-in repository that, in my personal opinion, is not that useful. I prefer working with git repositories, which most developers are familiar with. Kettle transformations and jobs are XML files, and git handles them very well. There's a caveat, though: you might see a lot of differences that are not real changes, caused by just moving steps around the canvas, and also differences due to changes in empty tags (such as finding them sometimes as
<tag/>

and other times as

<tag />

). I'm the only one working on my projects, so I don't see many differences, and the ones I do see sometimes come from working with different Kettle versions or on different virtual machines. However, I have heard from people working on bigger projects that those differences do appear and are a pain to handle when merging. Again, there's a GitHub repository for a plugin built to help with git repositories in PDI, https://github.com/HiromuHota/pdi-git-plugin. I haven't used it much, but take a look at it.
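
One possible way to take the empty-tag noise out of a comparison is to canonicalize both versions of a file before diffing them. The following is only a minimal sketch in Python (not a Kettle or git feature), assuming Python 3.8+ and using the standard library's `canonicalize` function; the helper name `normalize_kettle_xml` and the sample XML are made up for illustration. It does nothing about steps that were merely moved on the canvas.

```python
from xml.etree.ElementTree import canonicalize


def normalize_kettle_xml(xml_text: str) -> str:
    """Return the C14N 2.0 canonical form of an XML document.

    Canonical XML always writes empty elements as <tag></tag>, so
    <tag/> and <tag /> become identical after normalization.
    """
    return canonicalize(xml_data=xml_text)


# Hypothetical fragments that differ only in how the empty tag is written.
version_a = "<step><name>Read</name><tag/></step>"
version_b = "<step><name>Read</name><tag /></step>"

same_after_normalizing = normalize_kettle_xml(version_a) == normalize_kettle_xml(version_b)
```

Running both sides of a comparison through a helper like this before diffing should hide the purely cosmetic differences, while real changes to step settings still show up.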

The same author has another project on GitHub, WebSpoon, which provides a web interface for building transformations and jobs and which some people are using successfully: https://github.com/HiromuHota/pentaho-kettle.

10 and 11. Another external plugin, for working with environments, that has become a must-have for me in any Kettle installation: https://github.com/mattcasters/kettle-environment, combined with the needful things (https://github.com/mattcasters/kettle-needful-things), which adds the maitre script to schedule jobs and transformations using environments. There's at least one entry in Diethard Steiner's blog explaining how to work with it. Take a look at Matt Casters's GitHub repositories; there are some very useful Kettle plugins. The one for unit testing and debugging help is handy too: I haven't used the unit testing part, but the ability to skip or remove steps when debugging a transformation comes in handy sometimes: https://github.com/mattcasters/pentaho-pdi-dataset. Diethard Steiner's blog also has a couple of entries on how to use this plugin, as Matt's documentation is sometimes scarce.
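
To illustrate the idea from point 7 of reading the transformation XML yourself, here is a minimal sketch in Python. It assumes the usual .ktr layout, where each `<step>` under the root has a `<name>` and a `<type>` and each `<hop>` under `<order>` has `<from>`, `<to>` and `<enabled>`; the sample document and the function name `read_steps_and_hops` are made up, so check the element names against your own files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed transformation file used only for illustration.
SAMPLE_KTR = """<transformation>
  <order>
    <hop><from>Read CSV</from><to>Filter rows</to><enabled>Y</enabled></hop>
    <hop><from>Filter rows</from><to>Write table</to><enabled>Y</enabled></hop>
  </order>
  <step><name>Read CSV</name><type>CsvInput</type></step>
  <step><name>Filter rows</name><type>FilterRows</type></step>
  <step><name>Write table</name><type>TableOutput</type></step>
</transformation>"""


def read_steps_and_hops(ktr_xml: str):
    """Return ({step name: step type}, [(from step, to step), ...])."""
    root = ET.fromstring(ktr_xml)
    # Only direct children of the root are treated as steps.
    steps = {s.findtext("name"): s.findtext("type") for s in root.findall("step")}
    # Keep enabled hops only; a missing <enabled> element is treated as enabled.
    hops = [
        (h.findtext("from"), h.findtext("to"))
        for h in root.findall("order/hop")
        if h.findtext("enabled", default="Y") == "Y"
    ]
    return steps, hops
```

The list of hops gives the step-to-step flow inside one transformation. Real lineage would also need the table and field names stored inside each step, which this sketch does not read.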

Andrew Cave

EE has

* quartz scheduler for PDI jobs

* dashboard designer for analyzer reports

* some extra steps, including Splunk input/output and Weka scoring and forecasting steps

 

2. Does Kettle have connectivity with SIEBEL (ERP)? No specific steps, but it can use any JDBC connector.

3. Does Kettle have connectivity with PeopleSoft? As above.

4. Does Kettle have connectivity with Webmethods? As above.

5. Is it possible to import Word files (.DOC, .DOCX)? Only as a binary. You could parse it using a User Defined Java Class step and org.apache.poi, which is already included in the Pentaho distribution.

 

6. Is it possible to import PDF files? Not directly. Either use a specialised Java process, or use one of the command-line readers and call it using the 'Run SSH commands' step.
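
For the .docx case in question 5, the file is a ZIP archive whose main text lives in `word/document.xml`, so simple documents can be read without any Word library. The following is a minimal sketch in Python rather than the POI/Java route mentioned above; the function name `docx_paragraphs` is made up, and it does not handle legacy .doc files or complex layouts (text boxes, nested tables, and so on).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(run.text or "" for run in paragraph.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The resulting list could be written to a CSV file and picked up by a regular Kettle text input step.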

Michel Philippenko

Thank you so much, Ana, for such detailed answers and for the time you spent writing all these explanations! It's very much appreciated!

 

Point no. 8: my question was whether Kettle can listen on a TCP connection and launch a transformation when data arrives.

 

Have a nice day !

Michel Philippenko

Thank you very much Andrew for your quick answer. Very much appreciated ! Have a nice day !

Andrew Cave

You can do a POST like this:

 

curl --request POST "http://localhost:8080/pentaho/kettle/executeTrans/" \
 --header "Content-Type: application/x-www-form-urlencoded" \
 --header "Authorization: Basic ${auth}" \
 --data-urlencode "userid=admin" \
 --data-urlencode "pass=password" \
 --data-urlencode "rep=singleDiServerInstance" \
 --data-urlencode "trans=/public/" \
 --data-urlencode "level=Basic"

 

where ${auth} is "username:password" encoded in Base64, and the data is passed as JSON in the body.
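
As a small illustration of how the ${auth} value can be produced, here is a sketch in Python using only the standard library. The function name `basic_auth_value` is made up, and the credentials are the example ones from the command above, not real ones.

```python
import base64


def basic_auth_value(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Example credentials only; replace with your own.
header_value = basic_auth_value("admin", "password")
```

The part after "Basic " is what goes into ${auth}.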

 

But listening is limited to the new Streaming steps
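
If the streaming steps are not an option, one workaround would be a small listener running outside Kettle that accepts the TCP connection and then starts the transformation, for example through an HTTP call like the one above. The following is only a sketch of that idea in Python with the standard `socketserver` module, not a PDI feature; `make_listener` and `trigger_transformation` are made-up names, and the trigger is a placeholder that you would replace with your own call.

```python
import socketserver


def make_listener(host, port, on_data):
    """Create a TCP server that passes each connection's payload to on_data."""

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # Read everything the client sends until it closes the connection.
            payload = self.rfile.read()
            if payload:
                on_data(payload)

    return socketserver.ThreadingTCPServer((host, port), Handler)


def trigger_transformation(payload: bytes) -> None:
    """Placeholder: start the transformation here, e.g. with an HTTP POST."""
    print(f"received {len(payload)} bytes, would start the transformation now")
```

To run it, create the server with `make_listener("0.0.0.0", 9000, trigger_transformation)` (the port is an arbitrary example) and call its `serve_forever()` method. Each connection is handled in its own thread, so a slow transformation start does not block other senders.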