Pentaho Server on Kubernetes

By Archive User posted 09-27-2021 10:15


Scope of this document

The main purpose of this document is to guide you (our customers) through all the steps required to successfully deploy Pentaho Server on Kubernetes. It aims to:
  • Introduce the Pentaho options for container-based deployment
  • Provide a definitive guide from zero to the deployment stage
  • Share best practices for approaching your Kubernetes deployment


Nowadays, the majority of enterprises deploy their applications in the cloud to benefit from the different *aaS offerings (IaaS, SaaS, and PaaS), thanks to their flexibility and seamless workload scalability.
Depending on the chosen cloud services provider, you have access to a certain set of specific services, but many of them, like Kubernetes, are similar across providers.

What is Kubernetes?

As defined by its creators (Google Cloud), Kubernetes “is an open-source system to deploy, scale and manage containerized applications anywhere”.

Why Kubernetes in my organization?

The reasons for adopting Kubernetes vary from one enterprise to another, but these are common benefits of a Kubernetes implementation: automation, abstraction, and monitoring.
More detailed information is available in the Kubernetes documentation.

Why Pentaho Server Business Analytics (PBA) on Kubernetes?

As part of the customer's cloud journey, many of the services used and connected by business intelligence solutions are hosted and managed in the cloud (such as storage buckets, relational and analytical databases, messaging systems, etc.).
This means Pentaho is required to:
  • Sit close to these services for security and performance reasons
  • Scale up and down based on requirements
  • In short, adjust to the deployment pattern used as the standard in the enterprise
Hitachi Vantara doesn’t provide an out-of-the-box container image, mainly because we believe that each customer and each deployment has specific needs. Our approach is to provide this document as the definitive guide for your very specific PBA on Kubernetes journey.




To work through all the steps of this guide, you need to have the following software/tools in place:
  • Docker Engine. Needed to work with the container image (build, test, etc.). It is available for Linux, Mac and Windows. Refer to the Docker Engine documentation to install it
  • Pentaho Server package. It is required by the container image build process. You can download it from our Support Portal
  • A Kubernetes cluster. For demonstration purposes we use GKE (Google Kubernetes Engine) as the Kubernetes engine, but you can use any flavor/cloud vendor, as long as you have all the software required by your choice. For GKE, please follow its QuickStart documentation to get a Kubernetes cluster up and running, along with the tools required to interact with it


Where do we start? General guidelines

Kubernetes runs containers created from container images. As per the Kubernetes definition, “A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well-defined assumptions about their runtime environment”.
This container image is usually built with Docker (other tools exist, but Docker is the most widely adopted in enterprises). Let’s take this topic as the starting point of our journey. These are the steps to create your container image:

Figure 1: Preparing your docker container image
  • A developer downloads the Pentaho Server Dockerfile template from the Hitachi Vantara GitHub repository (containers -> pentaho-server -> pentaho-server-XX, where XX is the specific Pentaho version). Each version contains a README file with the main instructions to work with and build it
  • Adapt the existing template to your requirements. Our template is based on the OpenJDK JRE 8 container; you might need to change it based on your organization's policies or supported OS (proprietary licenses).
  • The Pentaho Server Dockerfile template uses two “FROM” statements in the image definition: the first one is used to install the software and the second one to create the final image. This ensures that a clean base image is used and that only the installed packages are taken from the installation stage.
  • The Dockerfile needs to be built before it can be used. Our README file contains examples and arguments you can use to create your docker image. It is MANDATORY to pre-download the Pentaho Server artifact and place it in the predownloaded folder in order to build the image
  • Once the image is built, it needs to be available in the container registry used by your Kubernetes cluster. In GKE, you need to make it available in GCR (Google Container Registry). Please follow the GCR documentation to make your image available to GKE
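
Because the build depends on a pre-downloaded artifact, it can help to check for it before invoking docker. The following is only a minimal sketch in Python, not part of the Hitachi Vantara template: the function name is hypothetical, and it assumes the folder is called predownloaded (as stated above) and sits directly under the build context.

```python
from pathlib import Path
from typing import List


def find_predownloaded_artifacts(build_context: str,
                                 folder: str = "predownloaded") -> List[str]:
    """Return the names of the .zip artifacts found in the pre-download folder.

    Raises FileNotFoundError when the folder is missing or holds no .zip file,
    because the image build cannot proceed without the Pentaho Server package.
    """
    artifact_dir = Path(build_context) / folder
    if not artifact_dir.is_dir():
        raise FileNotFoundError(f"missing folder: {artifact_dir}")
    # Only .zip files are considered; sub-folders and other files are ignored.
    artifacts = sorted(p.name for p in artifact_dir.glob("*.zip") if p.is_file())
    if not artifacts:
        raise FileNotFoundError(f"no .zip artifact found in {artifact_dir}")
    return artifacts
```

Calling find_predownloaded_artifacts(".") from the template folder either lists the artifacts that will be used or fails early with a clear message, instead of failing midway through the image build.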

Building your Pentaho Server Image

To have a fully working Pentaho Server container image, it is important to have the right version, components, configurations, and 3rd party elements in place.
The recommended approach is to build container images as layers, starting from a vanilla software configuration and adding configuration and customization layers as requirements dictate. See the diagram below for a better understanding.

Container images based on layers

Figure 2: Pentaho containers, layer principle
  • “Vanilla Software” layer. This container image is based on our Pentaho Server container image template. This should be your starting point
  • “Customization” layer. A container image layer that lets you incorporate specific configuration based on your deployment requirements, e.g. JDBC drivers, Hadoop configurations, cloud components, etc. In our case, we use a customized layer called pentaho-server-gcp on top of the standard pentaho-server-xx container image
  • “Project” layer. Depending on your needs, you can also have additional layers for non-common configuration that is required by certain projects but does not belong in the standard Pentaho container image, e.g. properties files, certificates, custom plugins, etc.

Visit our Pentaho Container GitHub Repository to see examples of customized layers

Step by step. Build Pentaho Server docker image for Kubernetes (GKE) deployment

Based on the general advice and recommendations from the previous sections, let’s get to work. In this section we follow a step-by-step procedure to create the Pentaho Server image to be used in the Kubernetes deployment.
This section guides you through two main stages:
  • Create a basic Pentaho Server container image
  • Add a layer on top of the basic Pentaho Server image and adapt it to interact with Google Cloud components, so that it can run in GKE

Building standard Pentaho Server container image

  1. Download the Hitachi Vantara Pentaho Server container definition template for the required version from the Hitachi Vantara repository and use it as the starting point.
  2. In the predownloaded folder, place your Pentaho Server artifact (.zip file) together with any other required elements (such as a specific service pack artifact). At this stage, you should also add the .zip artifacts of BA plugins such as Analyzer, Interactive Reports or Dashboard Designer
  3. By default, the command used to start Pentaho Server is the one defined at the end of the template Dockerfile
CMD [ "sh", "./tomcat/bin/", "run" ]
and, for reference, port 8080 is exposed
  4. Adapt the Dockerfile, mainly to use your preferred OS image as the base of your Pentaho Server image. As explained in the previous section, the template image uses openjdk:8-jre as the starting point
Note: Depending on the selected base image, you may have to adapt other sections of the Dockerfile
  5. Using docker commands, build your Pentaho Server container image. Here are some examples:
Building the GA version:
docker build -t pentaho/pentaho-server: .
Build applying Service Pack:
docker build -f ./Dockerfile --build-arg SERVICE_PACK_VERSION= --build-arg SERVICE_PACK_DIST=629 -t pentaho/pentaho-server: .
  6. You can test the newly generated container image locally.
docker run -it "pentaho/pentaho-server:<VERSION-NUMBER>" bash

More details and extra information are available in the README of the template repository.
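
When the build is scripted (for example in a CI job), the two build variants above can be assembled by one small helper. This is a hedged sketch rather than anything shipped with the template: the function name is hypothetical, the version values in the usage line are made-up placeholders, and only the image name and the SERVICE_PACK_VERSION / SERVICE_PACK_DIST build arguments come from the commands shown above.

```python
import shlex
from typing import List, Optional


def docker_build_command(version_tag: str,
                         service_pack_version: Optional[str] = None,
                         service_pack_dist: Optional[str] = None,
                         dockerfile: str = "./Dockerfile") -> List[str]:
    """Assemble the docker build argument list for a Pentaho Server image."""
    cmd = ["docker", "build", "-f", dockerfile]
    # The service pack arguments are optional: a GA build passes neither.
    if service_pack_version:
        cmd += ["--build-arg", f"SERVICE_PACK_VERSION={service_pack_version}"]
    if service_pack_dist:
        cmd += ["--build-arg", f"SERVICE_PACK_DIST={service_pack_dist}"]
    # The trailing "." is the build context (the template folder).
    cmd += ["-t", f"pentaho/pentaho-server:{version_tag}", "."]
    return cmd


if __name__ == "__main__":
    # Hypothetical version numbers, shown only to illustrate the command shape.
    print(shlex.join(docker_build_command("9.1", "9.1.0.1", "629")))
```

The returned list can be passed directly to subprocess.run(cmd, check=True); building the command as a list avoids quoting problems with unusual tag values.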

Building GCP-oriented Pentaho Server container image

As this guide's demonstration deployment is based on Google Kubernetes Engine, we adapt our Pentaho Server image to the GCP ecosystem in a second container layer (on top of the standard pentaho-server image created in the previous section).
If you use a different cloud or on-premises vendor, you can use the following approach as a reference or download the specific vendor template from the repository.
  1. Download the GCP-oriented Pentaho Server container definition template from the Hitachi Vantara repository
  2. Understand this layer customization template:
    1. Dockerfile
Its main purpose is to add configuration and customization to the Pentaho Server base image so that it can interact with Google Cloud services. It does the following:
  • Receives as an argument the TAG of the Pentaho Server base image version to build on
  • Installs the Google Cloud SDK, needed to interact with Google Cloud components such as GCS (Google Cloud Storage) buckets
  • Adds the local “resources” folder to the image content
  • Replaces the default entrypoint of the image with a customized one
    2. entrypoint/
Its main purpose is to replace the default image entrypoint and make it GCP ready. In the template, it does the following:
  • Retrieves the Pentaho Server configuration (such as internal database connection settings) from a GCS bucket using the Google Cloud SDK
  • Copies everything from the config folder into the Pentaho Server installation path. This is used to customize or replace content of the default pentaho-server installation directory, such as configuration, plugins, drivers, etc.
  • Executes the default CMD defined in the base image
  3. Adapt both the Dockerfile and entrypoint/ to your requirements
  4. Using the docker command, build your newly created image. Example of a build command:
docker build -f ./Dockerfile --build-arg "PENTAHO_SERVER_BASE_TAG=" -t pentaho/ pentaho/pentaho-server-gcp: .
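
To make the "copy the config folder over the installation path" step of the entrypoint more concrete, here is a minimal sketch of that overlay logic. It is written in Python purely for illustration (the template's own entrypoint is a script in the repository and may work differently); the function name and both paths are hypothetical, and the download from the GCS bucket is deliberately left out.

```python
import shutil
from pathlib import Path


def overlay_config(config_dir: str, install_dir: str) -> int:
    """Copy every file under config_dir into install_dir, keeping relative paths.

    Existing files are overwritten, missing sub-folders are created, and files
    that exist only in install_dir are left untouched.
    Returns the number of files copied.
    """
    src_root = Path(config_dir)
    dst_root = Path(install_dir)
    copied = 0
    for src in src_root.rglob("*"):
        if src.is_file():
            dst = dst_root / src.relative_to(src_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied += 1
    return copied
```

With this layout, a file placed at config/tomcat/lib/driver.jar ends up at the same relative location inside the installation directory, which is why the folder structure of the bucket content needs to mirror the structure of the Pentaho Server installation.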

Make your image available. Publish to container registry

Once you have built your image(s), it is time to make them available in the corresponding container registry. As this guide is based on GKE, the next steps focus on making images available in GCR (Google Container Registry). You can adapt them to your own container registry:
  • Tag the image with a registry name
docker tag pentaho/pentaho-server-gcp:<YOUR_PROJECT>/pentaho/pentaho-server-gcp:
  • Push the image to container registry  
docker push<YOUR_PROJECT>/pentaho/pentaho-server-gcp:

Please refer to your container registry's documentation for more details. For GCR, see the GCR documentation.
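
Container Registry image references follow the pattern HOSTNAME/PROJECT-ID/IMAGE:TAG, where the hostname is gcr.io or one of its regional variants. The sketch below composes such a reference and the matching tag and push commands. The function names are hypothetical, and the project ID and tag used in the usage line are placeholders, not values from this guide.

```python
from typing import List


def registry_image(project: str, tag: str,
                   repository: str = "pentaho/pentaho-server-gcp",
                   hostname: str = "gcr.io") -> str:
    """Build a Container Registry reference: HOSTNAME/PROJECT/REPOSITORY:TAG."""
    if not project or not tag:
        raise ValueError("project and tag are both required")
    return f"{hostname}/{project}/{repository}:{tag}"


def tag_and_push_commands(project: str, tag: str) -> List[List[str]]:
    """Return the docker tag and docker push commands for the GCP image."""
    local = f"pentaho/pentaho-server-gcp:{tag}"
    remote = registry_image(project, tag)
    return [["docker", "tag", local, remote], ["docker", "push", remote]]


if __name__ == "__main__":
    # "my-project" and "9.1" are placeholders for your project ID and image tag.
    for command in tag_and_push_commands("my-project", "9.1"):
        print(" ".join(command))
```

The same remote reference is the value you later put in the image field of your Kubernetes manifest, so keeping it in one helper avoids mismatches between what was pushed and what is deployed.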

Kubernetes Deployment

To deploy and manage your containerized Pentaho Server in Kubernetes, you use different Kubernetes controller objects depending on the nature of your workload. These controller objects represent the deployment.

Pentaho service

This creates a scalable Kubernetes deployment: a set of Pentaho servers behind an exposed service that executes processes on demand.

Figure 3: High level architecture. Pentaho Service

Kubernetes Objects required

In a simple but efficient approach, a basic setup needs a set of specific Kubernetes objects, environment configuration and vendor-specific resources:
  • Service. Exposes the application running on the cluster's internal network and defines how to reach it. A LoadBalancer service is recommended for a Pentaho Server deployment. In this example we use a single-instance deployment; if you need to scale to two or more instances, see the “Scaling Pentaho Server deployment – Considerations” section of this document
  • Secrets. Used for sensitive information. Pentaho licenses and the Google credentials API key are good examples.
  • ConfigMap. A set of properties/configuration files attached to the Kubernetes cluster. They are specific to the deployment and not part of the default container image. Elements such as the Carte configuration XML are good candidates to make available in the Pentaho deployment
  • Controller -> Deployment/Job. The application workload to be deployed.
All these elements can be found in the examples of the Hitachi Vantara Kubernetes templates repository. We cover each element step by step in the following sections
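
As an illustration of the Service object described above, the following sketch generates a minimal LoadBalancer Service as JSON, which kubectl accepts just like YAML. It is not the content of the template manifest: the function name, the "app" label key, the label value and the service name are assumptions, and port 8080 is taken from the port exposed by the template Dockerfile.

```python
import json


def pentaho_service(app_label: str = "pentaho-server", port: int = 8080) -> dict:
    """Build a minimal LoadBalancer Service manifest as a Python dict."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{app_label}-svc"},
        "spec": {
            "type": "LoadBalancer",
            # The selector must match the labels set on the Pentaho Pods.
            "selector": {"app": app_label},
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pentaho_service(), indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, or piped directly with kubectl apply -f - for quick experiments.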


Step by step - Deploying Pentaho Server on Kubernetes

As a starting point, you need a Kubernetes cluster up and running. Please refer to your vendor/flavor documentation for the required information.
Next, follow these steps to get your Pentaho Server container(s) up and running in your Kubernetes cluster. For guidance purposes, the Pentaho service described above is the deployment method used:
  1. Decide and define how to expose your Pentaho application. In this guide, a LoadBalancer definition is combined with a NodePort service for service exposure
  2. Download the manifest definition (.yaml file) from our Kubernetes manifest templates repository (pentaho-server-gke.yaml) and use it as a starting point.
NOTE: This is a template and should not be deployed “as is” to your environment. Follow the next steps to understand how to adapt it properly to your environment specifications
  3. In the first section of the YAML file, you will find the “Service” definition. Adapt it based on your requirements and on the service exposure you defined earlier
  4. Adapt the rest of the pentaho-server-gke.yaml manifest to your needs, taking the following into consideration:
    1. The application name TAG should match the selector specified in your service exposure, to tie both together
    2. “spec:replicas”: adapt it to the number of Pods you want to deploy. The template has “1” as the default value
NOTE: Multiple replicas require a special Service configuration to use “Sticky Sessions”.
    3. Definition of “Secrets” and “ConfigMaps”. These elements can be created either inside the current YAML configuration file or outside it, as covered later in this guide. Make sure names and mount paths are correct for your requirements.
    4. Container images.
      1. Specify the location of your Pentaho Server docker image based on your container registry location
      2. Pentaho DB image. In this example we use a postgresql deployment in the same Pentaho Pod for demonstration purposes. You should consider using a persistent DB for production deployments. More examples are available in our manifest repository
    5. Define the size of the Pod by specifying CPU and memory limits, as exemplified in the template (limits section). Make sure it matches your cluster sizing specifications
    6. containerPort. It should match your service exposure specification
    7. Environment variables. They are important to define project paths, JVM settings and other variables required in the container
  5. Create the required Secrets based on the names used in the application deployment YAML file, like this (you can also define them as part of your application YAML file or in a separate manifest):
kubectl create secret generic pentaho-license --from-file <PATH_TO_YOUR_.installedLicenses.xml_FILE_LOCAL_LOCATION>
kubectl create secret generic key --from-file=key.json=<PATH_TO_YOUR_KEY>
  6. Create the required ConfigMaps (if used), like this (you can also define them as part of your application YAML file or in a separate manifest):
kubectl create configmap configMapName --from-file=<EITHER_FILE_OR_FOLDER_PATH>
Note: It is very important to have the set of .sql scripts that initialize the Pentaho internal databases in place (configMap init-db-scripts)
  7. Once you have all the required pieces in place, you can deploy the Pentaho Server application to your cluster
kubectl apply -f <PATH_TO_YOUR_MANIFEST>/pentaho-server-gke.yaml
  8. To verify that the service is up and running, open the Carte status page http://<YOUR-IP>:<SERVICE_PORT>/kettle/status and interact with it using the Carte REST API
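
The manifest adaptations listed in the steps above (labels matching the Service selector, replicas, image location, resource limits, containerPort, environment variables and the license Secret) can be pictured with the following sketch, which generates a Deployment as JSON. It is a simplified stand-in for pentaho-server-gke.yaml, not a copy of it: the function name, label key, mount path, limit values and the environment variable in the usage line are all assumptions, and the PostgreSQL container and the init-db-scripts ConfigMap are omitted for brevity. Only the pentaho-license Secret name and port 8080 come from this guide.

```python
import json
from typing import Dict, Optional


def pentaho_deployment(image: str,
                       app_label: str = "pentaho-server",
                       replicas: int = 1,
                       container_port: int = 8080,
                       cpu_limit: str = "2",
                       memory_limit: str = "8Gi",
                       env: Optional[Dict[str, str]] = None,
                       license_secret: str = "pentaho-license",
                       license_mount_path: str = "/licenses") -> dict:
    """Build a minimal Deployment manifest for a Pentaho Server Pod."""
    labels = {"app": app_label}  # must match the Service selector
    container = {
        "name": app_label,
        "image": image,
        "ports": [{"containerPort": container_port}],
        "resources": {"limits": {"cpu": cpu_limit, "memory": memory_limit}},
        "env": [{"name": k, "value": v} for k, v in sorted((env or {}).items())],
        # Hypothetical mount path: use the one your image expects.
        "volumeMounts": [{"name": "license",
                          "mountPath": license_mount_path,
                          "readOnly": True}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app_label, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [container],
                    "volumes": [{"name": "license",
                                 "secret": {"secretName": license_secret}}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image reference and JVM setting, for illustration only.
    manifest = pentaho_deployment(
        "gcr.io/my-project/pentaho/pentaho-server-gcp:9.1",
        env={"CATALINA_OPTS": "-Xms2g -Xmx4g"},
    )
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from parameters is one way to keep the label, port and Secret names consistent across objects; editing the YAML template by hand, as described in the steps, works equally well.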


Scaling Pentaho Server deployment – Considerations

One of the main benefits of deploying Pentaho on Kubernetes is the ability to (auto) scale up and down based on certain conditions or requirements of your service. If you plan to deploy Pentaho Server in cluster mode, these are the main considerations:
  • Pentaho BA is architected to use sticky sessions. This means that every request for a particular session is routed to the same node of the cluster while the user session is alive. With this in mind, you need to configure your load balancer to match this session model.
Please refer to the instructions of your load balancer solution for details on how to implement this
  • Unique cluster name for Jackrabbit. If you are running two or more servers in HA mode, each node needs a unique identifier. For detailed instructions on how to configure the Jackrabbit journal, visit our help page
  • Maintenance of the Jackrabbit journal for servers that no longer exist (scale down). If a node is removed permanently from the cluster, its entry in the LOCAL_REVISIONS table should be removed manually. Otherwise, the clean-up thread will not be effective
  • For detailed configuration instructions regarding Pentaho BA in cluster mode, visit our help site
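
One possible way to obtain sticky sessions at the Kubernetes Service level is client-IP session affinity, which is a standard field of the Service spec. The sketch below adds it to an existing Service manifest held as a Python dict. It is only one option and an assumption on our side: this guide does not state which mechanism the template uses, and cookie-based affinity at an Ingress or external load balancer may fit your environment better. The example Service, its names and the function name are hypothetical.

```python
import copy

# A minimal example Service, with hypothetical names, to apply the patch to.
example_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "pentaho-server-svc"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "pentaho-server"},
        "ports": [{"port": 8080, "targetPort": 8080}],
    },
}


def with_sticky_sessions(service: dict, timeout_seconds: int = 10800) -> dict:
    """Return a copy of a Service manifest with client-IP affinity enabled.

    Requests from the same client IP are routed to the same Pod until the
    affinity times out (10800 seconds is the Kubernetes default).
    """
    patched = copy.deepcopy(service)  # leave the caller's manifest untouched
    spec = patched.setdefault("spec", {})
    spec["sessionAffinity"] = "ClientIP"
    spec["sessionAffinityConfig"] = {
        "clientIP": {"timeoutSeconds": timeout_seconds}
    }
    return patched


sticky_service = with_sticky_sessions(example_service)
```

Note that client-IP affinity treats all users behind the same proxy or NAT address as one client, so load may be distributed unevenly in that situation.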

10-06-2021 10:40

Yes, pentaho-server-9.1 can be used for any 9.x version. Main consideration is to use corresponding installation packages based on desired version.

There aren't major considerations between 9.1 and 9.2 in terms of configuration and installation procedure.

10-05-2021 23:50

Has there been any testing on 9.2 on this docker builds and deployments? If we are targeting 9.2 is it safe to work from this github repo?