Evan Cropper

Pentaho Data Integration (PDI), Python and Deep Learning

Blog Post created by Evan Cropper Employee on Dec 6, 2018

Anand Rao, Principal Product Marketing Manager, Pentaho

 

 

1 Deep Learning – What is the Hype?

 

According to Zion Market Research, the deep learning (DL) market will increase from $2.3 billion in 2017 to over $23.6 billion by 2024. With annual CAGR of almost 40%, DL has become one of the hottest areas for Data Scientists to create models[1]. Before we jump into how Pentaho can help operationalize your organization’s DL models within product environments, let’s take a step back and review why DL can be so disruptive. Below are some of the characteristics of DL, it:

DL-10.jpg

DL-11.jpg

 

 

 

  • Uses Artificial Neural Networks that have multiple hidden layers that can perform powerful image recognition, computer visioning/object detection, video stream processing, natural language processing and more. Improvements in DL offerings and in processing power, such as the GPU, cloud, have accelerated the DL boom in last few years.
  • Attempts to mimic the activity in the human brain via layers of neurons, DL learns to recognize patterns in digital representations of sounds, video streams, images, and other data.
  • Reduces need to perform feature engineering prior to running the model through use of multiple hidden layers, performing feature extraction on the fly when the model runs.
  • Improves on performance and accuracy over traditional Machine Learning algorithms due to updated frameworks, availability of very large data sets, (i.e. Big Data), and major improvements in processing power, i.e. GPUs, etc.
  • Provides development frameworks, environments, and offerings, i.e. Tensorflow, Keras, Caffe, PyTorch, etc that make DL more accessible to data scientists.

 

2 Why should you use PDI to develop and operationalize Deep Learning models in Python?

 

Today, Data Scientists and Data Engineers have collaborated on hundreds of data science projects built in PDI. With Pentaho, they’ve been able to migrate complex data science models to production environments at a fraction of the costs as compared to traditional data preparation tools. We are excited to announce that Pentaho can now bring this ease of use to DL frameworks, furthering Hitachi Vantara’s goal to enable organizations to innovate with all their data. With PDI’s new Python executor step, Pentaho can:

  • Integrate with popular DL frameworks in a transformation step, expanding upon Pentaho’s existing robust data science capabilities.
  • Easily implement DL Python script files received from Data Scientists within the new PDI Python Executor Step
  • Run DL models on any CPU/GPU hardware, enabling organizations to use GPU acceleration to enhance performance of their DL models.
  • Incorporate data from previous PDI steps, via data pipeline flow, as Python Pandas Data Frame of Numpy Array within the Python Executor step for DL processing
  • Integrate with Hitachi Content Platform (HDFS, Local, S3, Google Storage, etc.) allowing for the movement and positioning of unstructured data files to a locale, (i.e. Data Lake, etc.) and reducing DL storage and processing costs.

 

Benefits:

  • PDI supports most widely used DL frameworks, i.e. Tensorflow, Keras, PyTorch and others that have a Python API, allowing Data Scientists to work within their favorite libraries.
  • PDI enables Data Engineers and Data Scientists to collaborate while implementing DL
  • PDI allows for efficient allocation of skills and resources of the Data Scientist (i.e. build, evaluate and run DL models) and the Data Engineer (Create Data pipelines in PDI for DL processing) personas.

 

3 How does PDI operationalize Deep Learning?

 

Components referenced are:

  • Pentaho 8.2, PDI Python Executor Step, Hitachi Content Platform (HCP) VFS
  • Python.org 2.7.x or Python 3.5.x
  • Tensorflow 1.10
  • Keras 2.2.0

Review Pentaho 8.2 Python Executor Step in Pentaho On-line Help for list of dependencies. Python Executor - Pentaho Documentation

Basic process:

 

1. HCP VFS file location within a PDI step. Copy and stage unstructured data files for use by DL framework processing within PDI Python Executor Step.

DL-9.jpg

 

Additional info: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Integration_Perspective/Virtual_File_System

 

https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Integration_Perspective/Virtual_File_System2. 2. Utilize a new Transformation that will implement workflows for processing DL frameworks and associated data sets, etc. Inject Hyperparameters (values to be used for tuning and execution of models) to evaluate the best performing model. Below is an example that implements four DL framework workflows, three using Tensorflow and one using Keras, with the Python Executor step.

 

 

DL-12.jpg

 

DL-13.jpg

 

 

 

3. Focusing on the Tensorflow DNN Classifier workflow (which implements injection of hyperparameters), utilize a PDI Data Grid Step, ie named Injected Hyperparameters, with values used by corresponding Python Script Executor steps.

 

 

4. Within the Python Script Executor step use Pandas DF and implement the Injected Hyperparameters and values as variables in the Input Tab

 

DL-4.jpg

5. Execute the DL related Python script (either via Embedding or a URL to a file) and reference a DL framework and Injected Hyperparameters from inputs. Also, you can set the Python Virtual Environment to a path other than what is the default Python install.

DL-5.jpg

 

6. Verify that you have Tensorflow installed, configured and is correctly importing into a Python shell.

DL-6.jpg

 

 

7. Going back to the Python Executor Step, click on the Output Tab and then click on the Get Fields button. PDI will do a pre-check on the script file as to verify it for errors, outputs, etc.

DL-7.jpg

 

8. This completes the configurations for the running of the transformation.

 

4 Does Hitachi Vantara also offer GPU offerings to accelerate Deep Learning execution?

 

DL frameworks can benefit substantially from executing with a GPU rather than a CPU because most DL frameworks have some type of GPU accelerators. Hitachi Vantara has engineered and delivered the DS225 Advanced Server with NVIDIA Tesla V100 GPUs in 2018. This is Hitachi Vantara’s first GPU server designed specifically for DL implementation.

 

DL-8.jpg

More details about the GPU offering are available here: https://www.hitachivantara.com/en-us/pdfd/datasheet/advanced-server-ds225-datasheet.pdf

 

5 Why Organization use PDI and Python with Deep Learning:

 

  • Intuitive drag and drop tools: PDI makes the implementation and execution of DL frameworks easier with its' graphical development environment for DL related pipelines and workflows.
  • Improved collaboration: Data Engineers and Data Scientist can work on a shared workflow and utilize their skills and time efficiently.
  • Better allocation of valuable resources: The Data Engineer can use PDI to create the workflows, move and stage unstructured data files from/to HCP, and configure injected hyperparameters in preparation for the Python script received from the Data Scientist.
  • Best-in-Class GPU Processing: Hitachi Vantara offers the DS225 Advanced Server with NVIDIA Tesla V100 GPUs that allow DL frameworks to benefit from GPU acceleration.

 

 

 

1. Global Deep Learning Market Will Reach USD 23.6 Billion By 2024 End: Zion Market Research

Outcomes