 PDI Production Options

Data Conversion posted 03-19-2018 11:53

Hi Everyone, I'm new to PDI and I'm a little confused about deployment. I've got transformations and jobs running locally that do what I want, but I need them to run on a schedule on a remote server.

I've read around this topic and have just confused myself: do I need a repository? Does it have to be backed by a database? Is this the same thing as Carte? Should I just get an Ubuntu VM with remote desktop and do it that way?

Apologies if I'm being slow, but I can't seem to find a single source of information on the simplest way to go from running jobs on my desktop to deploying them to production.

My use case is fairly simple: basic ETL from OLTP databases to an AWS Redshift data warehouse. It's only me who will be setting up and running jobs, and none of them are particularly intensive.

If anyone can point me in the direction of the simplest way to get up and running, I'd be very grateful. Doubly so if there's a way to do it using AWS EC2.

Many thanks.


#PentahoDataIntegrationPDI
#Pentaho
#Kettle
Brandon Jackson

A repository is just a centralized place to store your .kjb and .ktr files. If you have a mechanism to make those files available on the same disk as Pentaho Data Integration, then really all you need to do is use cron to run ./kitchen.sh (for jobs) or ./pan.sh (for transformations) at a specific time.
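For example (the project path here is made up, just to show the shape of the call), a manual run from the command line looks roughly like this:

cd /opt/pentaho/pdi/latest/data-integration
./kitchen.sh -file=/opt/pentaho/ETL/my_project/content/my_job.kjb -level=Basic

pan.sh takes the same -file= argument if you only need to run a single transformation, and -level controls how verbose the log output is.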

I would suggest a common layout for your ETL to make everything more deterministic.

project_named_directory/content      <- all .kjb and .ktr files go here

project_named_directory/input        <- all manner of flat-file input goes here

project_named_directory/output       <- if your ETL emits files, place them here

project_named_directory/environment  <- any properties files or standard connection settings go here to keep your PDI install clean (a rough example follows below). Just read in the properties and let your JDBC connections use those variables; that saves you the hassle of mucking up your PDI /simple-jndi files, /home/pentaho/.kettle/kettle.properties, or /home/pentaho/.kettle/shared.xml.
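As a rough example (every name and value below is made up, not anything from a real setup):

# project_named_directory/environment/redshift.properties  (hypothetical example)
REDSHIFT_HOST=my-cluster.example.us-east-1.redshift.amazonaws.com
REDSHIFT_PORT=5439
REDSHIFT_DB=warehouse
REDSHIFT_USER=etl_user
REDSHIFT_PASSWORD=change_me

Load those into variables (for instance via kettle.properties or a Set Variables step at the start of the job), and the database connection dialog can then reference ${REDSHIFT_HOST}, ${REDSHIFT_PORT}, and so on instead of hard-coded values.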

A cron example that runs a job at 1 AM every day:

# minute (0-59)
#       hour (0-23)
#               day of the month (1-31)
#                       month of the year (1-12)
#                               day of the week (0-6, 0 = Sunday)
#                                       command
### Budgeted Census
0       1       *       *       *       cd /opt/pentaho/pdi/latest/data-integration; ./kitchen.sh -file=/opt/pentaho/ETL/Build\ Budgeted\ Census\ Data/content/build_budgeted_census_data.kjb;
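If you also want each run captured in a log file (the log path below is just an example), append a redirect to the same entry:

0       1       *       *       *       cd /opt/pentaho/pdi/latest/data-integration; ./kitchen.sh -file=/opt/pentaho/ETL/Build\ Budgeted\ Census\ Data/content/build_budgeted_census_data.kjb >> /var/log/pentaho/budgeted_census.log 2>&1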

Data Conversion

Thanks guys, much appreciated.

I'll have a go with the cron method above.