Pentaho Server on Kubernetes

By Andres Perez Confortti posted 09-27-2021 10:15

Scope of this document

The main purpose of this document is to guide you (our customers) through all the steps required to successfully deploy Pentaho Server on Kubernetes. It aims to:

Introduce the Pentaho options for container-based deployment
Provide a definitive guide from zero to the deployment stage
Share best practices for approaching your Kubernetes deployment

Introduction

Nowadays, the majority of enterprises deploy their applications in the cloud to benefit from the different *aaS offerings (IaaS, SaaS, and PaaS), which provide flexibility and seamless workload scalability.
Depending on the chosen cloud services provider, you have access to a certain set of specific services, but many of them, like Kubernetes, are similar across providers.

What is Kubernetes?

As defined by its creators (Google Cloud: https://cloud.google.com/learn/what-is-kubernetes), Kubernetes “is an open-source system to deploy, scale and manage containerized applications anywhere”.

Why Kubernetes in my organization?

The reasons for adopting Kubernetes vary from one enterprise to another, but the common benefits of a Kubernetes implementation are automation, abstraction, and monitoring.
More detailed information: https://kubernetes.io/docs/concepts/overview/

Why Pentaho Server Business Analytics (PBA) on Kubernetes?

As part of a customer's cloud journey, many of the services that business intelligence solutions use and connect to (such as storage buckets, relational and analytical databases, messaging systems, etc.) are hosted and managed in the cloud.
This means Pentaho is required to:

Sit close to these services for security and performance reasons
Scale up and down based on requirements
In short, adjust to the deployment pattern used as standard in the enterprise

Hitachi Vantara doesn’t provide an out-of-the-box container image, mainly because we believe that each customer and each deployment has specific needs. Our approach is to provide this document as the definitive guide for your specific PBA on Kubernetes journey.

Preparation

Pre-requisites

To work through all the steps of this guide, you need the following software and tools in place:

Docker Engine. Needed to work with the container image (build, test, etc.). It is available for Linux, Mac, and Windows. Refer to the Docker Engine documentation to install it.
Pentaho Server package. Required as part of the container image build process. You can download it from our Support Portal.
Kubernetes cluster. For demonstration purposes we use GKE (Google Kubernetes Engine) as the Kubernetes engine, but you can use any flavor or cloud vendor, as long as you have the software your choice requires. For GKE, follow its QuickStart documentation to get a Kubernetes cluster up and running, together with the tools required to interact with it.

Where do we start? General guidelines

Kubernetes runs applications as containers. According to the Kubernetes documentation, “A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well-defined assumptions about their runtime environment”.
This container image is usually built with Docker (other tools exist, but Docker is the most widely adopted in the enterprise). Let’s focus on this topic as the starting point of our journey. These are the steps to build your container image:

Figure 1: Preparing your docker container image

A developer downloads the Pentaho Server Dockerfile template from the Hitachi Vantara GitHub repository (containers -> pentaho-server -> pentaho-server-XX, where XX corresponds to the specific Pentaho version). Each version contains a README file with the main instructions to work with and build it.
Adapt the existing template to your requirements. Our template is based on the OpenJDK JRE 8 container image; you might need to change it based on your organization's policies or supported OS (proprietary licenses).
The Pentaho Server Dockerfile template uses two “FROM” statements in the image definition: the first is used to install the software and the second to create the final image. This ensures that a clean base image is used and that only the installed packages are taken over from the installation layer.
The Dockerfile needs to be built before it can be used. Our README file contains examples and arguments you can use to create your Docker image. It is MANDATORY to download the Pentaho Server artifact beforehand and place it in the predownloaded folder in order to build the image.
Once the image is built, it must be available in the container registry used by your Kubernetes cluster. In GKE, you need to make it available in GCR (Google Container Registry). Follow the GCR documentation to make your image available to GKE.
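
The two-“FROM” pattern described above can be sketched roughly as follows. This is only an illustration and not the content of the Hitachi Vantara template: the stage name install, the /opt/pentaho path, the PENTAHO_VERSION argument and the artifact file name are all assumptions. The shell snippet simply writes the sketch to a file named Dockerfile.sketch so it can be inspected.

```shell
#!/bin/sh
# Write an illustrative two-stage Dockerfile (NOT the official template).
# Stage name, paths, ARG name and artifact file name are assumptions.
cat > Dockerfile.sketch <<'EOF'
# Stage 1: installation stage, unpacks the pre-downloaded artifact
FROM openjdk:8-jre AS install
ARG PENTAHO_VERSION=9.1.0.0
RUN apt-get update && apt-get install -y --no-install-recommends unzip
COPY predownloaded/ /tmp/predownloaded/
RUN mkdir -p /opt/pentaho \
 && unzip -q "/tmp/predownloaded/pentaho-server-ee-${PENTAHO_VERSION}.zip" -d /opt/pentaho

# Stage 2: final image, starts from a clean base and copies only the installed files
FROM openjdk:8-jre
COPY --from=install /opt/pentaho /opt/pentaho
WORKDIR /opt/pentaho/pentaho-server
EXPOSE 8080
CMD [ "sh", "./tomcat/bin/catalina.sh", "run" ]
EOF
```

Only the second stage ends up in the final image, so build-time tools such as unzip and the original .zip file are left behind in the first stage.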

Building your Pentaho Server Image

To have a fully working Pentaho Server container image, it is important to have the right version, components, configurations, and 3rd-party elements in place.
The recommended approach is to build containers as layers, starting from a vanilla software configuration and adding configuration and customization layers based on requirements. See the diagram below for a better understanding.

Container images based on layers

Figure 2: Pentaho containers, layers principle

“Vanilla Software” layer. This container image is based on our Pentaho Server container image template. This should be your starting point.
“Customization” layer. Container image layer that lets you incorporate specific configuration based on your deployment requirements, e.g. JDBC drivers, Hadoop configurations, cloud components, etc. In our case, we use a customized layer called pentaho-server-gcp on top of the standard pentaho-server-xx container image.
“Project” layer. Depending on your needs, you can also have additional layers for uncommon configuration that certain projects require but that should not be part of the standard Pentaho container image, e.g. properties files, certificates, custom plugins, etc.

Visit our Pentaho Container GitHub repository to see examples of customized layers.

Step by step. Build Pentaho Server docker image for Kubernetes (GKE) deployment

Based on the general advice and recommendations in the previous sections, let’s get to work. In this section we follow a step-by-step procedure to create the Pentaho Server image to be used in the Kubernetes deployment.
This section guides you through two main stages:

Create a basic Pentaho Server container image
Add a layer on top of the basic Pentaho Server image and adapt it to interact with Google Cloud components, so that it can run in GKE

Building the standard Pentaho Server container image

Download the Hitachi Vantara Pentaho Server container definition template for the required version from the Hitachi Vantara repository and use it as your starting point.
Place your Pentaho Server artifact (.zip file) in the predownloaded folder, together with any other required elements (such as a specific service pack artifact). At this stage, you should also add the .zip artifacts for BA plugins such as Analyzer, Interactive Report or Dashboard Designer.
By default, the command used to start Pentaho Server is the one defined at the end of the template Dockerfile:

CMD [ "sh", "./tomcat/bin/catalina.sh", "run" ]
For reference, port 8080 is exposed.

Adapt the Dockerfile, mainly to use your preferred OS image as the base of your Pentaho Server image. As explained in the previous section, the template image uses openjdk:8-jre as its starting point.

Note: Depending on the selected base image, you may have to adapt other sections of the Dockerfile.

Using docker commands, build your Pentaho Server container image. Here are some examples:

Building the GA version:
docker build -t pentaho/pentaho-server:9.1.0.0 .
Building with a Service Pack applied:
docker build -f ./Dockerfile --build-arg SERVICE_PACK_VERSION=9.1.0.8 --build-arg SERVICE_PACK_DIST=629 -t pentaho/pentaho-server:9.1.0.8 .

You can test the newly generated container image locally:

docker run -it "pentaho/pentaho-server:<VERSION-NUMBER>" bash

See the README in the template repository for more details and extra information.
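
Because the build depends on the pre-downloaded artifact, a small guard in front of the build command can catch a missing .zip file early. The following is a minimal sketch, not part of the template: the function names has_artifact and build_image are made up for illustration, and the build command is the GA example shown above.

```shell
#!/bin/sh
# Return 0 if the given folder contains at least one .zip artifact, 1 otherwise.
# Function names and the default folder are illustrative.
has_artifact() {
  dir="${1:-predownloaded}"
  for f in "$dir"/*.zip; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# Build only when the artifact is present (requires Docker when called).
build_image() {
  if has_artifact predownloaded; then
    docker build -t pentaho/pentaho-server:9.1.0.0 .
  else
    echo "No .zip artifact found in ./predownloaded; download it first" >&2
    return 1
  fi
}
```

Calling build_image from the folder that contains the Dockerfile either starts the build or stops with a clear message.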

Building GCP-oriented Pentaho Server container image

For the purposes of this guide, the demonstration deployment is based on Google Kubernetes Engine. We adapt our Pentaho Server image to be ready for the GCP ecosystem in a second container layer (on top of the standard pentaho-server image created in the previous section).
If you use a different cloud or on-prem vendor, you can use the following approach as a reference or download the vendor-specific template from the repository.

Download the GCP-oriented Pentaho Server container definition template from the Hitachi Vantara repository.
Understand this layer customization template:
1. Dockerfile

Its main purpose is to add a level of configuration and customization to the Pentaho Server base image so that it can interact with Google Cloud services. It does the following:

Receives as an argument the TAG of the Pentaho Server base image version to use as the base for this image
Installs the Google Cloud SDK dependency to interact with Google Cloud components such as GCS (Google Cloud Storage) buckets
Adds the local “resources” folder to the image content
Changes the default entrypoint of the image to a customized one

2. entrypoint/docker-entrypoint.sh

Its main purpose is to replace the default image entrypoint and make it GCP ready. In the template, it does the following:

Retrieves the Pentaho Server configuration (such as the internal database connection configuration) from a GCS bucket using the Google Cloud SDK
Copies everything from the config folder into the Pentaho Server installation path. This is used to customize or replace content of the default pentaho-server installation directory, such as configuration, plugins, drivers, etc.
Executes the default CMD defined in the base image

Adapt both the Dockerfile and entrypoint/docker-entrypoint.sh to your requirements.
Using the docker command, build your newly created image. Example of a build command:

docker build -f ./Dockerfile --build-arg "PENTAHO_SERVER_BASE_TAG=9.1.0.8" -t pentaho/pentaho-server-gcp:9.1.0.8 .
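
The entrypoint behaviour described above (retrieve the configuration, copy the config folder over the installation path, run the default CMD) could look roughly like the sketch below. It is not the template's actual script: the CONFIG_BUCKET variable, the config folder location and all function names are assumptions, and fetch_config needs gsutil from the Google Cloud SDK when it is called.

```shell
#!/bin/sh
# Illustrative entrypoint helpers (NOT the template's actual script).
# CONFIG_BUCKET, the "config" folder and the function names are assumptions.

# Download configuration from a GCS bucket with gsutil (Google Cloud SDK).
fetch_config() {
  bucket="$1"; target="$2"
  mkdir -p "$target"
  gsutil -m cp -r "gs://${bucket}/*" "$target/"
}

# Copy everything under the config folder over the Pentaho Server installation path.
overlay_config() {
  src="$1"; dest="$2"
  [ -d "$src" ] || return 0   # nothing to overlay
  cp -R "$src"/. "$dest"/
}

# Entrypoint flow: fetch (only if a bucket is configured), overlay, then run the CMD.
run_entrypoint() {
  install_dir="$1"; shift
  if [ -n "${CONFIG_BUCKET:-}" ]; then
    fetch_config "$CONFIG_BUCKET" config
  fi
  overlay_config config "$install_dir"
  exec "$@"
}
```

In a real entrypoint script, the last line would call something like run_entrypoint <PENTAHO_INSTALL_PATH> "$@", so that the CMD of the base image is passed through and replaces the shell process.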

Make your image available. Publish to container registry

Once your image(s) are built, it is time to make them available in the corresponding container registry. As this guide is based on GKE, the next steps focus on making images available in GCR (Google Container Registry). You can adapt them to your own container registry:

Tag the image with a registry name

docker tag pentaho/pentaho-server-gcp:9.1.0.8 gcr.io/<YOUR_PROJECT>/pentaho/pentaho-server-gcp:9.1.0.8

Push the image to the container registry

docker push gcr.io/<YOUR_PROJECT>/pentaho/pentaho-server-gcp:9.1.0.8

Please refer to your container registry documentation for more details; for GKE, see the GCR documentation.
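
The tag and push steps above can be wrapped in a small helper so that the registry reference is composed in one place. This is a minimal sketch: the function names gcr_ref and publish_image are illustrative, and publish_image needs Docker plus registry credentials when it is called.

```shell
#!/bin/sh
# Compose the registry reference used for tagging and pushing.
# Function names are illustrative.
gcr_ref() {
  project="$1"; image="$2"; tag="$3"
  printf 'gcr.io/%s/%s:%s\n' "$project" "$image" "$tag"
}

# Tag the local image and push it (requires Docker and registry credentials).
publish_image() {
  ref="$(gcr_ref "$1" "$2" "$3")"
  docker tag "$2:$3" "$ref" && docker push "$ref"
}
```

For example, publish_image <YOUR_PROJECT> pentaho/pentaho-server-gcp 9.1.0.8 runs the same two commands as above. Docker usually needs to be authorized against the registry first, for instance with gcloud auth configure-docker.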

Kubernetes Deployment

To deploy and manage your containerized Pentaho Server in Kubernetes, you use different Kubernetes controller objects depending on the nature of your workload. These controller objects represent the deployment.

Pentaho service

A scalable Kubernetes deployment provides a set of Pentaho servers, fronted by an exposed service, that execute processes on demand.

Figure 3: High level architecture. Pentaho Service

Kubernetes Objects required

A basic setup requires a small set of Kubernetes objects, environment configuration, and vendor-specific resources:

Service. Exposes the application on the Kubernetes cluster network and defines how it can be reached. A LoadBalancer type of Service is recommended for a Pentaho Server deployment. In this example we use a single-instance deployment; if you need to scale to two or more instances, see the “Scaling Pentaho Server deployment – Considerations” section of this document.
Secrets. Used for sensitive information. Pentaho licenses and the Google credentials API key are good examples.
ConfigMap. Set of properties/configuration files made available to the deployment in the Kubernetes cluster. These are specific to the deployment and not part of the default container image. Elements such as the Carte configuration XML or kettle.properties are good candidates.
Controller -> Deployment/Job. The application workload to be deployed.

All these elements can be found in the examples of the Hitachi Vantara Kubernetes templates repository. We cover each element step by step in the following sections.

Step by step - Deploying Pentaho Server on Kubernetes

As a starting point, you need a Kubernetes cluster up and running. Please refer to your vendor/flavor documentation for the required information.
Next, these are the steps to get the Pentaho Server container(s) up and running in your Kubernetes cluster. For guidance purposes, a single-instance deployment is used:

Decide and define how to expose your Pentaho application. For the purposes of this guide, a LoadBalancer definition is combined with a NodePort service for service exposure.
Download the manifest definition (.yaml file) from our Kubernetes manifest templates repository (pentaho-server-gke.yaml) and use it as your starting point.

NOTE: This is a template and should not be deployed “as is” to your environment. Follow the next steps to understand how to adapt it to your environment specifications.

In the first section of the YAML file, you will find the “Service” definition. Adapt it to your requirements and to the service exposure you decided on earlier.
Adapt the rest of the pentaho-server-gke.yaml manifest to your needs, taking the following important considerations into account:
1. The application name TAG should match the selector specified in your Service definition to tie both together
2. “spec:replicas”: adapt it to the number of Pods you want to deploy. The template has “1” as the default value

NOTE: Multiple replicas require a special Service configuration to use “Sticky Sessions”.

3. Definition of “Secrets” and “ConfigMaps”. These elements can be created either inside the current YAML configuration file or outside it, as covered later in this guide. Make sure names and mount paths are correct for your requirements.
4. Container images.
  1. You need to specify the location of your Pentaho Server Docker image in your container registry
  2. Pentaho DB image. In this example we use a PostgreSQL deployment in the same Pentaho Pod for demonstration purposes. You should consider using a persistent DB for production deployments. More examples are available in our manifest repository

5. Define the size of the Pod by specifying CPU and memory limits, as exemplified in the template (limits section). Make sure they match your cluster sizing specifications
6. containerPort. It should match your service exposure specification
7. Environment variables. They are important to define project paths, JVM settings, and other variables required in the container
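
To make the considerations above more concrete, the following sketch writes a stripped-down Deployment manifest to a file. It is not the pentaho-server-gke.yaml template: object names, labels, the mount path, the CATALINA_OPTS values and the resource sizes are all placeholders to adapt, and the PostgreSQL container used by the template is left out for brevity.

```shell
#!/bin/sh
# Write an illustrative Deployment skeleton (NOT the pentaho-server-gke.yaml template).
# All names, labels, paths and sizes below are assumptions to adapt.
cat > pentaho-deployment.sketch.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pentaho-server
spec:
  replicas: 1                      # number of Pods to deploy
  selector:
    matchLabels:
      app: pentaho-server          # must match the selector of the Service
  template:
    metadata:
      labels:
        app: pentaho-server
    spec:
      containers:
        - name: pentaho-server
          image: gcr.io/<YOUR_PROJECT>/pentaho/pentaho-server-gcp:9.1.0.8
          ports:
            - containerPort: 8080  # must match the Service target port
          env:
            - name: CATALINA_OPTS  # example variable for JVM settings
              value: "-Xms2g -Xmx4g"
          resources:
            limits:
              cpu: "2"
              memory: 6Gi
          volumeMounts:
            - name: license
              mountPath: /opt/pentaho/license   # hypothetical mount path
              readOnly: true
      volumes:
        - name: license
          secret:
            secretName: pentaho-license
EOF
```

The secret name pentaho-license corresponds to the Secret created with kubectl in the next step; everything else should be aligned with the real template before use.
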
Create the required Secrets based on the names used in the application deployment YAML file, like this (you can also define them as part of your application YAML file or in a separate manifest):

kubectl create secret generic pentaho-license --from-file <PATH_TO_YOUR_.installedLicenses.xml_FILE_LOCAL_LOCATION>
kubectl create secret generic key --from-file=key.json=<PATH_TO_YOUR_KEY>

Create the required ConfigMaps (if used), like this (you can also define them as part of your application YAML file or in a separate manifest). Note that ConfigMap names must be lowercase, for example init-db-scripts:

kubectl create configmap <CONFIGMAP_NAME> --from-file=<EITHER_FILE_OR_FOLDER_PATH>
Note: It is very important to have in place the set of .sql scripts that initialize the Pentaho internal databases (configMap init-db-scripts).

Once you have all the required pieces in place, you can deploy the Pentaho Server application to your cluster:

kubectl apply -f <PATH_TO_YOUR_MANIFEST>/pentaho-server-gke.yaml

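To verify that the Pentaho Server is up and running, open the Pentaho User Console at http://<YOUR-IP>:<SERVICE_PORT>/pentaho (the default web application context). The Carte status page embedded in the server is served under the same context, at /pentaho/kettle/status, and you can interact with it using the Carte REST API.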

Scaling Pentaho Server deployment – Considerations

One of the main benefits of deploying Pentaho on Kubernetes is the ability to (auto) scale up and down based on conditions or requirements of your service. If you plan to deploy Pentaho Server in cluster mode, these are the main considerations:

Pentaho BA is architected to use sticky sessions. This means that each request for a particular session is routed to the same node instance of the cluster for as long as the user session is alive. With this in mind, you need to configure your load balancer to match this session model.

Please refer to the instructions of your load balancer solution for details on how to implement this.

Unique cluster node ID for Jackrabbit. If you are running two or more servers in HA mode, each node instance needs a unique identifier. For detailed instructions on how to configure the Jackrabbit journal, visit our help page.
Maintenance of the Jackrabbit journal for servers that no longer exist (scale down). If a node is permanently removed from the cluster, its entry in the LOCAL_REVISIONS table should be removed manually. Otherwise, the clean-up thread will not be effective.
For detailed configuration instructions regarding Pentaho BA in cluster mode, visit our help site.
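
One possible way to approximate the sticky-session requirement at the Kubernetes Service level is client-IP session affinity. The sketch below writes such a Service manifest to a file. It is an assumption-laden illustration rather than the recommended configuration: the object name, label and ports are placeholders, and whether the client IP is preserved depends on your load balancer. Cookie-based stickiness at an Ingress or external load balancer is an alternative.

```shell
#!/bin/sh
# Write an illustrative Service with client-IP session affinity.
# Names, labels and ports are assumptions to adapt.
cat > pentaho-service.sketch.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: pentaho-server
spec:
  type: LoadBalancer
  selector:
    app: pentaho-server            # must match the Pod labels of the Deployment
  sessionAffinity: ClientIP        # route a given client IP to the same Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # affinity lifetime in seconds (3 hours)
  ports:
    - port: 8080
      targetPort: 8080
EOF
```

With sessionAffinity set to ClientIP, requests coming from the same client address keep reaching the same Pod until the timeout expires, which is the closest built-in Service option to a sticky session.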

Comments

Tanmoy Panja

05-19-2022 14:16

Very Informative !!

Chayan Sarkar

05-02-2022 02:03

Informative

Dipta Kundu

04-26-2022 13:43

Thanks for sharing

Cristy Banks

12-21-2021 01:19

Thanks for sharing the Information! Btw is this a Official Website ?

Archive User

10-06-2021 10:40

Yes, pentaho-server-9.1 can be used for any 9.x version. Main consideration is to use corresponding installation packages based on desired version.

There aren't major considerations between 9.1 and 9.2 in terms of configuration and installation procedure.

Chris Schafer

10-05-2021 23:50

Has there been any testing on 9.2 on this docker builds and deployments? If we are targeting 9.2 is it safe to work from this github repo?