Pentaho

Pentaho Data Integration on Kubernetes

By Andres Perez Confortti posted 09-09-2021 09:13

Scope of this document

The main purpose of this document is to guide you (our customers) through all the steps required to deploy Pentaho Data Integration successfully on Kubernetes. This document aims to:

Introduce the Pentaho options for container-based deployment
Provide a definitive guide from zero to the deployment stage
Share best practices for approaching your Kubernetes deployment

Introduction

Nowadays, the majority of enterprises deploy their applications in the cloud to benefit from the different *aaS offerings (IaaS, SaaS, and PaaS), which provide flexibility and seamless workload scalability.
Depending on the chosen cloud services provider, you have access to a specific set of services, but many of them, like Kubernetes, are similar across the available providers.

What is Kubernetes?

As defined by its creators (Google Cloud: https://cloud.google.com/learn/what-is-kubernetes), Kubernetes “is an open-source system to deploy, scale and manage containerized applications anywhere”.

Why Kubernetes in my organization?

The reasons for adopting Kubernetes vary from one enterprise to another, but common benefits of a Kubernetes implementation are automation, abstraction, and monitoring.
More detailed information is available at https://kubernetes.io/docs/concepts/overview/

Why Pentaho Data Integration (PDI) on Kubernetes?

As part of a customer’s cloud journey, many of the services that ETL workflows use and connect to (such as storage buckets, relational and analytical databases, messaging systems, etc.) live in the cloud and are managed by it.
This means Pentaho is required to:

Sit close to these services for security and performance reasons
Scale up and down based on requirements
In short, adjust to the deployment pattern used as the standard in the enterprise

Hitachi Vantara doesn’t provide an out-of-the-box container image, mainly because we believe that each customer and each deployment has specific needs. Our approach is to provide this document as the definitive guide for your very specific PDI on Kubernetes journey.

Preparation

Pre-requisites

To work through all the steps of this guide, you need the following software and tools in place:

Docker Engine. It is needed to work with the container image (build, test, etc.) and is available for Linux, Mac, and Windows. Refer to the Docker Engine documentation to install it
Pentaho Data Integration client package. It is required as part of the container image build process. You can download it from our Support Portal
A Kubernetes cluster. For demonstration purposes we use GKE (Google Kubernetes Engine) as the Kubernetes engine, but you can use any flavor or cloud vendor. You need all the software required by your choice. For GKE, please follow its QuickStart documentation to get a Kubernetes cluster up and running, together with the tools required to interact with it

Where do we start? General guidelines

Kubernetes runs containers through a container runtime. As per the Kubernetes definition, “A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well-defined assumptions about their runtime environment”.
This container image is usually built with Docker (Kubernetes is not restricted to it, but Docker is the common choice in enterprises). Let’s take this topic as the starting point of our journey. These are the steps to build your container image:

Figure 1: Preparing your docker container image

Download the PDI Dockerfile template from the Hitachi Vantara GitHub repository (containers -> pentaho-data-integration -> pdi-client-XX, where XX corresponds to a specific Pentaho version). Each version contains a README file with the main instructions to work with the template and build it
Adapt the existing template to your requirements. Our template is based on the OpenJDK JRE 8 container image; you might need to change it depending on your organization’s policies or supported OS (proprietary licenses)
The PDI Dockerfile template uses two “FROM” statements in the image definition: the first one is used to install the software and the second one to create the final image. This ensures that the final image starts from a clean base image and only takes the installed packages from the installation stage
Build the Dockerfile to obtain a usable image. Our README file contains examples and arguments you can use to create your Docker image. It is MANDATORY to download the PDI client artifact beforehand and place it in the predownloaded folder in order to build the image
Once the image is built, make it available in the container registry used by your Kubernetes cluster. In GKE, this is GCR (Google Container Registry). Please follow the GCR documentation to make your image available to GKE
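
The two-“FROM” approach described in the steps above corresponds to a Docker multi-stage build. As a minimal sketch (not the actual Hitachi Vantara template), the script below writes a hypothetical Dockerfile that illustrates the idea. The file name Dockerfile.sketch, the archive name pdi-client.zip, and the /opt/pentaho path are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Write an illustrative two-stage Dockerfile. This is NOT the official template:
# the archive name and the /opt/pentaho path are hypothetical.
cat > Dockerfile.sketch <<'EOF'
# Stage 1: installation stage, unpacks the pre-downloaded PDI client.
FROM openjdk:8-jre AS install
COPY predownloaded/pdi-client.zip /tmp/pdi-client.zip
RUN apt-get update \
 && apt-get install -y --no-install-recommends unzip \
 && unzip -q /tmp/pdi-client.zip -d /opt/pentaho

# Stage 2: final image, starts from a clean base and copies only the result.
FROM openjdk:8-jre
COPY --from=install /opt/pentaho /opt/pentaho
WORKDIR /opt/pentaho/data-integration
EOF

echo "Wrote Dockerfile.sketch with $(grep -c '^FROM ' Dockerfile.sketch) FROM statements"
```

Anything installed only to unpack the archive (here, unzip) stays in the first stage and never reaches the final image, which is the point of the second “FROM”.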

Building your Pentaho Data Integration Image

To have a fully working PDI container image, it is important to have the right version, components, configurations, and third-party elements in place.
The recommended approach is to build containers as layers, starting from a vanilla software configuration and adding configuration and customization layers based on requirements. See the diagram below for a better understanding.

Container images based on layers

Figure 2: Pentaho containers, layers principle

“Vanilla Software” layer. This container image is based on our PDI container image template. This should be your starting point
“Customization” layer. A container image layer that lets you incorporate specific configuration based on your deployment requirements, e.g. JDBC drivers, Hadoop configurations, cloud components, etc. In our case, we use a customized layer called pdi-client-gcp on top of the standard pdi-client-xx container image
“Project” layer. Depending on your needs, you can also have additional layers for non-common configuration that certain projects require and that does not need to be part of the standard Pentaho container image, e.g. properties files, certificates, custom plugins, etc.

Visit our Pentaho Container GitHub repository to see examples of customized layers

Step by step. Build PDI docker image for Kubernetes (GKE) deployment

Based on the general advice and recommendations provided in the previous sections, let’s get to work. In this section we follow a step-by-step procedure to create the Pentaho Data Integration image to be used in the Kubernetes deployment.
This section guides you through two main stages:

Create a basic Pentaho Data Integration client container image
Add a layer on top of the basic PDI image and adapt it to interact with Google Cloud components, so that it can run in GKE

Building standard PDI container image

Download the Hitachi Vantara PDI container definition template for the required version from the Hitachi Vantara repository and use it as the starting point.
Place your Pentaho Data Integration client artifact (.zip file) in the predownloaded folder, together with any other required elements (such as a specific service pack artifact)
Adapt the Dockerfile, mainly to use your preferred OS image as the base of your PDI image. As explained in the previous section, the template image uses openjdk:8-jre as the starting point

Note: Depending on the selected base image, you may have to adapt other sections of the Dockerfile

Using docker commands, build your PDI container image. Here are some examples:

Building GA version:
docker build -t pentaho/pdi-client:9.1.0.0 .
Build applying Service Pack:
docker build -f ./Dockerfile --build-arg SERVICE_PACK_VERSION=9.1.0.8 --build-arg SERVICE_PACK_DIST=629 -t pentaho/pdi-client:9.1.0.8 .

You can test the newly generated container image locally:

docker run -it "pentaho/pdi-client:<VERSION-NUMBER>" bash

You can find more details and additional information in the README of the template repository.
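
Since the PDI client artifact must be in the predownloaded folder before the image can be built, a small pre-flight check can save a failed build. The following is only a sketch: the function name has_pdi_artifact is hypothetical, and it simply looks for any .zip file in the folder rather than validating a specific version.

```shell
#!/usr/bin/env bash
# Pre-flight check before "docker build". The folder name comes from this guide;
# the function name is a hypothetical helper.
has_pdi_artifact() {
  local dir="${1:-predownloaded}"
  local f
  # Succeeds as soon as one regular .zip file is found directly in the folder.
  for f in "$dir"/*.zip; do
    [ -f "$f" ] && return 0
  done
  return 1
}

if has_pdi_artifact predownloaded; then
  echo "PDI artifact found; ready to run docker build"
else
  echo "No .zip found in ./predownloaded; download the PDI client first" >&2
fi
```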

Building GCP-oriented PDI container image

In line with the purpose of this guide, the demonstration deployment is based on Google Kubernetes Engine. We adapt our Pentaho Data Integration image to be ready for the GCP ecosystem by adding a second container layer (on top of the standard pdi-client image created in the previous section).
If you use a different cloud or on-prem vendor, you can use the following approach as a reference or download the specific vendor template from the repository

Download the GCP-oriented Pentaho Data Integration client container definition template from the Hitachi Vantara repository
Understanding this customization layer template:
1. Dockerfile

Its main purpose is to add a certain level of configuration and customization to the PDI client base image so that it can interact with Google Cloud services. It does the following:

Receives as an argument the TAG of the PDI client base image version to use as the base for this image
Installs the Google Cloud SDK dependency to interact with Google Cloud components such as GCS (Google Cloud Storage) buckets
Incorporates the local “resources” folder into the image content
Changes the default entry point of the image to a customized one

2. entrypoint/docker-entrypoint.sh

Its main purpose is to change the default image entry point and adapt it to be GCP ready. For template purposes, it does the following:

Specifies the ETL project path (location of the .ktr and .kjb files) in the file system
Specifies the KETTLE_HOME location (same location as the ETL project path)
Copies everything from the resources folder of the template into the container image. This is used to customize or replace content of the default data-integration installation directory, such as plugins, drivers, etc.
Using the Google Cloud SDK, interacts with a GCS (Google Cloud Storage) bucket to download the ETL artifacts (.ktr and .kjb files). You can use other approaches, such as connecting to a VCS (version control system) or using mounted volumes
Runs and exposes the Carte service using the carte-config.xml present in the resources folder
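
To make the sequence of entry point steps listed above more concrete, here is a minimal sketch of what such a script could look like. It is not the template’s actual docker-entrypoint.sh: every path, every variable name (PDI_HOME, RESOURCES_DIR, PROJECT_HOME, GCS_ETL_PATH), and the “run” argument convention are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a GCP-oriented entry point. All paths and variable names are
# hypothetical; the real template may differ.
PDI_HOME="${PDI_HOME:-/opt/pentaho/data-integration}"   # assumed install dir
RESOURCES_DIR="${RESOURCES_DIR:-/opt/resources}"        # assumed copy of "resources"
PROJECT_HOME="${PROJECT_HOME:-/opt/project}"            # where .ktr/.kjb files land
GCS_ETL_PATH="${GCS_ETL_PATH:-}"                        # e.g. gs://my-bucket/etl
export KETTLE_HOME="${KETTLE_HOME:-$PROJECT_HOME}"      # same location as the project

copy_resources() {
  # Overlay plugins, drivers, etc. onto the data-integration directory.
  if [ -d "$RESOURCES_DIR" ]; then
    mkdir -p "$PDI_HOME"
    cp -R "$RESOURCES_DIR"/. "$PDI_HOME"/
  fi
}

fetch_etl() {
  # Download ETL artifacts from the bucket with the Google Cloud SDK.
  mkdir -p "$PROJECT_HOME"
  if [ -n "$GCS_ETL_PATH" ]; then
    gsutil -m cp -r "$GCS_ETL_PATH/*" "$PROJECT_HOME"/
  fi
}

start_carte() {
  # Replace the shell with Carte so it receives container signals directly.
  exec "$PDI_HOME/carte.sh" "$PDI_HOME/carte-config.xml"
}

# Run the full sequence only when invoked as "docker-entrypoint.sh run".
if [ "${1:-}" = "run" ]; then
  copy_resources && fetch_etl && start_carte
fi
```

Keeping each step in its own function makes it easier to swap the GCS download for a VCS checkout or a mounted volume, as the guide suggests.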

Adapt both the Dockerfile and entrypoint/docker-entrypoint.sh to your requirements
Using the docker command, build your newly created image. Example of a build command:

docker build -f ./Dockerfile --build-arg "PENTAHO_CLIENT_BASE_TAG=9.1.0.8" -t pentaho/pdi-client-gcp:9.1.0.8 .

Make your image available. Publish to container registry

Once your image(s) are built, it is time to make them available in the corresponding container registry. As this guide is based on GKE, the next steps focus on making images available in GCR (Google Container Registry). You can adapt them to your own container registry:

Tag the image with a registry name

docker tag pentaho/pdi-client-gcp:9.1.0.8 gcr.io/<YOUR_PROJECT>/pentaho/pdi-client-gcp:9.1.0.8

Push the image to container registry

docker push gcr.io/<YOUR_PROJECT>/pentaho/pdi-client-gcp:9.1.0.8

Please see your container registry documentation for more details. For GCR, refer to the Google Container Registry documentation.
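
The tag and push steps can be combined in a small script so that the registry path is composed in one place. This is a sketch under assumptions: the default project ID my-gcp-project and the function name publish_image are hypothetical, and the gcloud auth configure-docker step assumes the Google Cloud SDK is installed and you are already logged in.

```shell
#!/usr/bin/env bash
# Compose the GCR image reference from its parts.
PROJECT_ID="${PROJECT_ID:-my-gcp-project}"   # hypothetical project ID
IMAGE="pentaho/pdi-client-gcp"
VERSION="9.1.0.8"
TARGET="gcr.io/${PROJECT_ID}/${IMAGE}:${VERSION}"

publish_image() {
  # Let docker authenticate against Google registries through gcloud.
  gcloud auth configure-docker --quiet
  docker tag "${IMAGE}:${VERSION}" "${TARGET}"
  docker push "${TARGET}"
}

echo "Target image reference: ${TARGET}"
```

Calling publish_image performs the actual tag and push; until then the script only prints the reference it would use.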

Kubernetes Deployment

To deploy and manage your containerized Pentaho Data Integration client in Kubernetes, you use different Kubernetes controller objects depending on the nature of your workflow. These controller objects represent either a stateless Carte service or batch job executions with Kitchen and Pan.

Define workflow nature

Both options are covered by the Hitachi Vantara Pentaho on Kubernetes deployment templates. Choose the deployment pattern for your workflows based on your needs

Stateless Carte service

A scalable Kubernetes deployment provides a set of Carte servers behind an exposed service to execute processes on demand. It is commonly used to run several PDI processes in parallel, using the REST API as the communication channel with the exposed service

Figure 3: High level architecture. Stateless carte service

Batch Jobs (Kitchen, Pan)

Batch jobs represent finite, independent, and often parallel tasks that run to completion and then release their resources back to the cluster. They are commonly used for long-running processes (usually very heavy) with a well-defined end-to-end scope.

Figure 4: High Level architecture. Batch job workflow

Kubernetes Objects required

As a simple but effective approach, there is a set of specific Kubernetes objects, environment configuration, and vendor-specific resources that we need to put in place for a basic setup.

Service. Exposes the application on the Kubernetes cluster network and defines how each service is reached. A LoadBalancer service is recommended for a Carte server deployment
Secrets. Used for sensitive information. Pentaho licenses and the Google credentials API key are good examples
ConfigMap. A set of properties/configuration files attached to the Kubernetes deployment. They are specific to the deployment and not part of the default container image. Elements such as the Carte configuration XML or kettle.properties are good candidates to provide this way in the Pentaho deployment
Controller -> Deployment/Job. The application workflow to be deployed. As explained in the previous section, you can use either the stateless Carte service or a batch job as the deployment option for the PDI client

All these elements can be found in the examples of the Hitachi Vantara Kubernetes templates repository. We will cover each element step by step in the following sections
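
To show how the Service, Secret, and controller objects listed above fit together, here is a minimal sketch of a manifest for the stateless Carte option. It is not the pdi-gke.yaml template: the object name pdi-carte, port 8080, the mount path, the JVM options, and the resource limits are hypothetical, while the pentaho-license Secret name and the image reference follow the commands used elsewhere in this guide.

```shell
#!/usr/bin/env bash
# Write an illustrative Service + Deployment manifest (hypothetical values).
cat > pdi-carte-sketch.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: pdi-carte
spec:
  type: LoadBalancer
  selector:
    app: pdi-carte            # must match the Pod label below
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdi-carte
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pdi-carte
  template:
    metadata:
      labels:
        app: pdi-carte
    spec:
      containers:
        - name: carte
          image: gcr.io/<YOUR_PROJECT>/pentaho/pdi-client-gcp:9.1.0.8
          ports:
            - containerPort: 8080
          env:
            - name: PENTAHO_DI_JAVA_OPTIONS
              value: "-Xms1g -Xmx2g"
          resources:
            limits:
              cpu: "1"
              memory: 3Gi
          volumeMounts:
            - name: license
              mountPath: /opt/pentaho/license   # hypothetical mount path
              readOnly: true
      volumes:
        - name: license
          secret:
            secretName: pentaho-license
EOF

echo "Wrote pdi-carte-sketch.yaml"
```

The same label (app: pdi-carte) appears in the Service selector, the Deployment selector, and the Pod template; that shared label is what ties the exposed service to the Carte Pods.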

Step by step - Deploying Pentaho Data Integration on Kubernetes

The following sections walk through detailed step-by-step examples for both approaches described in the previous section.

Deploying Carte server on Kubernetes

As a starting point, you need to have a Kubernetes cluster up and running. Please see your vendor/flavor documentation for the required information.
Next, these are the steps to get Pentaho Data Integration Carte container(s) up and running in your Kubernetes cluster. For guidance purposes, the stateless Carte service is the deployment method used:

Decide and define how to expose your Pentaho application. For the purposes of this guide, a LoadBalancer definition is combined with a NodePort service for service exposure
Download the manifest definition (.yaml file) from our Kubernetes manifest templates repository (pdi-gke.yaml) and use it as the starting point.

NOTE: This is a template and should not be deployed “as is” to your environment. Follow the next steps to understand how to adapt it properly to your environment specifications

In the first section of the corresponding YAML file, you will find the “Service” definition. Adapt it to your requirements and to the service exposure you defined earlier
Adapt the rest of the pdi-gke.yaml manifest to your needs, taking the following important considerations into account:
1. The application name tag (label) should match the selector specified in your service exposure, to tie both together
2. “spec:replicas”. Adapt it to the number of Pods you want to deploy. The template has “1” as the default value

NOTE: Multiple replicas require a special Service configuration to use “Sticky Sessions”.

3. Definition of “Secrets” and “ConfigMaps”. These elements can be created either inside the current YAML configuration file or outside it, as covered in this guide. Make sure names and mount paths are correct for your requirements
4. Container image. Specify the location of your Pentaho Data Integration Docker image in your container registry
5. Pod size. Define it by specifying CPU and memory limits, as exemplified in the template (limits section). Make sure it matches your cluster sizing specifications
6. containerPort. It should match your service exposure specification
7. Environment variables. They are important to define project paths, JVM settings, and variables required in the container
Create the required Secret elements using the names from the application deployment YAML file, like this (you can also define them as part of your application YAML file or in a separate manifest):

kubectl create secret generic pentaho-license --from-file <PATH_TO_YOUR_.installedLicenses.xml_FILE_LOCAL_LOCATION>
kubectl create secret generic key --from-file=key.json=<PATH_TO_YOUR_KEY>

Create the required ConfigMap elements (if used), like this (you can also define them as part of your application YAML file or in a separate manifest):

kubectl create configmap configMapName --from-file=<EITHER_FILE_OR_FOLDER_PATH>

Once you have all the required pieces in place, you can deploy the PDI application to your cluster

kubectl apply -f <PATH_TO_YOUR_MANIFEST>/pdi-gke.yaml

To verify that the Carte server service is up and running, you can open the Carte status page at http://<YOUR-IP>:<SERVICE_PORT>/kettle/status and interact with it using the Carte REST API
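
The verification step can also be scripted. The sketch below assumes a Service named pdi-carte (a hypothetical name) and falls back to cluster/cluster as credentials, which is an assumption about an unmodified Carte setup; adjust both to your deployment. The helper function names are illustrative as well.

```shell
#!/usr/bin/env bash
# Credentials are assumptions; override them through the environment.
CARTE_USER="${CARTE_USER:-cluster}"
CARTE_PASSWORD="${CARTE_PASSWORD:-cluster}"

carte_status_url() {
  # Build the status page URL from host ($1) and port ($2).
  printf 'http://%s:%s/kettle/status\n' "$1" "$2"
}

service_ip() {
  # Read the external IP assigned to a LoadBalancer Service ($1).
  kubectl get service "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

check_carte() {
  # Fails (non-zero exit) when the status page cannot be fetched.
  curl -fsS -u "${CARTE_USER}:${CARTE_PASSWORD}" "$(carte_status_url "$1" "$2")"
}
```

For example, check_carte "$(service_ip pdi-carte)" 8080 fetches the status page once the load balancer has been assigned an external IP.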

Deploying batch jobs (Kitchen/Pan) on Kubernetes

As a starting point, you need to have a Kubernetes cluster up and running. Please see your vendor/flavor documentation for the required information.
Next, these are the steps to get Pentaho Data Integration batch job container(s) up and running in your Kubernetes cluster. For guidance purposes, a batch Job is the deployment method used:

Download the manifest definition (.yaml file) from our Kubernetes manifest templates repository (pdi-gke-job.yaml) and use it as the starting point.

NOTE: This is a template and should not be deployed “as is” to your environment. Follow the next steps to understand how to adapt it properly to your environment specifications

The main difference from the stateless service explained in the previous section is that this deployment definition is based on a Job. You can also consider a CronJob if you need Kubernetes to handle the job execution schedule
Adapt the pdi-gke-job.yaml manifest definition to your needs, taking the following important considerations into account:
1. Definition of “Secrets” and “ConfigMaps”. These elements can be created either inside the current YAML configuration file or outside it, as covered in this guide. Make sure names and mount paths are correct for your requirements
2. Container image. Specify the location of your Pentaho Data Integration Docker image in your container registry
3. Pod size. Define it by specifying CPU and memory limits, as exemplified in the template (limits section). Make sure it matches your cluster sizing specifications and leaves room for other activities running in parallel
4. Environment variables. They are important to define project paths, JVM settings, and variables required in the container
Create the required Secret elements using the names from the application deployment YAML file, like this (you can also define them as part of your application YAML file or in a separate manifest):

kubectl create secret generic pentaho-license --from-file <PATH_TO_YOUR_.installedLicenses.xml_FILE_LOCAL_LOCATION>
kubectl create secret generic key --from-file=key.json=<PATH_TO_YOUR_KEY>

Create the required ConfigMap elements (if used), like this (you can also define them as part of your application YAML file or in a separate manifest):

kubectl create configmap configMapName --from-file=<EITHER_FILE_OR_FOLDER_PATH>

Once you have all the required pieces in place, you can deploy the PDI application to your cluster

kubectl apply -f <PATH_TO_YOUR_MANIFEST>/pdi-gke-job.yaml

You can follow the Job execution either with “kubectl” commands or through your Kubernetes vendor dashboard
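
As an illustration of the Job-based pattern and of following its execution with kubectl, here is a minimal sketch. It is not the pdi-gke-job.yaml template: the Job name pdi-kitchen-job, the install path, the .kjb path, and the resource limits are hypothetical. Overriding command bypasses the customized entry point, so this sketch assumes the .kjb file is already available in the container (for example, through a mounted volume).

```shell
#!/usr/bin/env bash
# Write an illustrative Job manifest (hypothetical names and paths).
cat > pdi-job-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: pdi-kitchen-job
spec:
  backoffLimit: 0              # do not retry a failed ETL run automatically
  template:
    spec:
      restartPolicy: Never     # Job Pods must use Never or OnFailure
      containers:
        - name: kitchen
          image: gcr.io/<YOUR_PROJECT>/pentaho/pdi-client-gcp:9.1.0.8
          # Overrides the image entry point to run Kitchen directly.
          command: ["/opt/pentaho/data-integration/kitchen.sh"]
          args: ["-file=/opt/project/main.kjb", "-level=Basic"]
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
EOF

# Hypothetical helper to follow the Job after "kubectl apply".
follow_job() {
  kubectl get job pdi-kitchen-job
  kubectl logs -f job/pdi-kitchen-job
  kubectl wait --for=condition=complete job/pdi-kitchen-job --timeout=600s
}
```

After applying the manifest, calling follow_job shows the Job status, streams the Kitchen log, and then waits for the Job to report completion.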

Using Tray to handle processes distribution

It is very common to run into process distribution and management problems when deploying Carte servers.
What happens when simply scaling your Kubernetes cluster is not enough to distribute the execution of several jobs in parallel?
This is where the Tray service plays an important role in our solution architecture

Tray main features

Monitors and controls worker Carte servers
Provides a work/job queue
Provides load balancing of job executions based on CPU, memory, and available execution slots
Offers in-memory and database persistence
Manages finished jobs on worker Carte servers
Is compatible with the Carte REST APIs / PDI remote server execution approach
Requires only a simple plugin installation on the Carte side

You can find Tray Container and Kubernetes deployment Manifest in Hitachi Vantara Container repository. Please contact your Customer Care technical representative at Hitachi Vantara to get more information about how to get Tray.

Comments

Tanmoy Panja

05-19-2022 14:17

Very Informative !!

Chayan Sarkar

05-04-2022 11:46

Thanks for sharing

Dipta Kundu

04-26-2022 13:43

Very informative

Alexander Schurman

01-28-2022 07:33

Now that GCP Also released GCP Cloud Run

https://cloud.google.com/run
We could use the containers described here for this service.
