Pentaho Data Integration (PDI) and Data Science Notebook Integration
Carl Speshock, Senior Technical Product Manager
1. Using PDI, Data Science Notebooks, and Python Together
Data Scientists are great at developing analytical models to achieve specific business results. However, deploying a model in a production environment requires a different skill set than data exploration and model development. The result is wasted human capital: the data scientist spends a significant amount of time engineering the data, and that work is often rewritten by your data engineers for production deployments.
So why not allow the data scientist and the data engineer to focus on what they do best? With Pentaho PDI’s drag-and-drop GUI environment, data engineers can prepare and orchestrate data to flow seamlessly into a Data Scientist’s notebook, such as Jupyter (which is the focus of this blog). Data Scientists can then explore analytical models on cleansed data using the Python programming language and related machine learning and deep learning frameworks. The result? Models ready for production environments at a fraction of the cost and delivered in a fraction of the time.
2. Why would you want to use PDI and the Jupyter Notebook together to develop models in Python?
Pentaho allows Data Scientists to spend their time on data science models instead of data prep tasks and makes it easier to share Python scripts between data scientists and data engineers. By choosing Pentaho to operationalize data science, organizations can:
- Use a graphical drag-and-drop development environment that makes data engineering easier, with a toolbox of connectors to many data sources (configured rather than coded), tools that blend data from multiple sources, and transformation steps that cleanse and normalize the data.
- Migrate data to production environments with minimal changes.
- Scale seamlessly to address growing production data volumes, and
- Share production-quality data sets and Python scripts between Data Engineers and Data Scientists, as shown below in a collaborative workflow between the two personas:
3. How do you develop Python models using PDI and Jupyter Notebooks?
The steps below were tested with the following dependencies and components:
- Pentaho 8.1 or 8.2
- Python.org 2.7.x or Python 3.5.x
- Jupyter Notebook 5.6.0
- Python JDBC package dependencies, i.e. JayDeBeApi and jpype
The Pentaho PDI Data Services online help covers configuration, installation, client JARs, and more: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Services
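Before opening a notebook, it can help to confirm that the Python-side dependencies are importable. The following is a minimal sketch rather than an official check: the import names (jaydebeapi, jpype, notebook) are assumptions about how the packages listed above are exposed to Python, and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Assumed import names for the packages listed above:
# JayDeBeApi -> "jaydebeapi", jpype -> "jpype", Jupyter Notebook -> "notebook".
REQUIRED_MODULES = ["jaydebeapi", "jpype", "notebook"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("All Python-side dependencies were found.")
```

This only checks that the modules can be located. It does not verify versions, and it does not check for the Java runtime or the Data Service client JARs that the JDBC connection also needs.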
1. In PDI, create a new Transformation connected to the Pentaho Server repository. Implement all of your data connections, blending, filtering, cleansing, etc., as shown in the example below.
2. Use PDI's Data Service feature to export rows from the PDI transformation (these rows will later be consumed in a Jupyter Notebook). Create a new Data Service by right-clicking the last step in the transformation. Test the Data Service within the UI, then select Save Transformation As to save the Data Service to the Pentaho Server.
3. Before the Data Scientist can work in the Jupyter Notebook, use a Data Grid step to review the Data Grid fields and data values. These input variables flow into the Python Executor step, and the Data Engineer can easily change them for new PDI Data Services.
4. Below, the Python Executor – Create Jupyter Notebook – Python API step contains the Python script, input and output references, and more. From here, the Data Engineer can create the Jupyter Notebook for the Data Scientist to consume.
5. The Python Executor – Create Jupyter Notebook – Python API step automatically populates the Jupyter Notebook (shown below) with the cleansed and orchestrated data from the transformation. The notebook connects the Data Scientist directly to the PDI Data Service created earlier by the Data Engineer.
6. The Data Scientist retrieves the Jupyter Notebook file created by the Data Engineer and PDI from wherever it was shared (e.g. Enterprise Data Catalog, a file share, etc.), then confirms the output of the pandas DataFrame named df in the last cell.
7. From here, the Data Scientist can begin building, evaluating, processing, and saving machine learning and deep learning models using the pandas DataFrame named df. An example is shown below using a machine learning Decision Tree Classifier.
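To make step 7 concrete, here is a minimal sketch of training a Decision Tree Classifier on a DataFrame named df with scikit-learn. It is not the notebook from the original example: the DataFrame contents, the column names (feature_a, feature_b, label), and the helper train_decision_tree are all hypothetical stand-ins for whatever the Data Service actually returns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the DataFrame the generated notebook exposes as `df`.
# Column names and values are hypothetical, not taken from a real Data Service.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "feature_b": [0.5, 0.1, 0.7, 0.2, 0.9, 0.3, 0.8, 0.4],
    "label": [0, 0, 0, 0, 1, 1, 1, 1],
})


def train_decision_tree(frame, target, test_size=0.25, seed=0):
    """Fit a decision tree on `frame` and return (model, held-out accuracy)."""
    features = frame.drop(columns=[target])
    labels = frame[target]
    # Stratify so both classes appear in the training and test splits.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


model, accuracy = train_decision_tree(df, "label")
print("Held-out accuracy:", accuracy)
```

With real data, the Data Scientist would replace the stand-in df with the one populated from the PDI Data Service and pick the target column that matches the business question. Eight rows are far too few for a meaningful accuracy figure; the toy data is only there to keep the sketch runnable.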
4. How can Data Engineers and Data Scientists using Python collaborate better with PDI and Jupyter Notebooks?
- PDI’s graphical development environment makes data engineering easier.
- Data Engineers can easily migrate PDI applications to production environments with minimal changes.
- Data Engineers can scale PDI applications to meet production data volumes.
- Data Engineers can quickly respond to Data Scientists’ data set requests with PDI Data Services.
- Data Scientists can easily access Jupyter Notebook templates connected to a PDI Data Service.
- Data Scientists can quickly pull data on demand from the Data Service and get to work on what they do best!
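Pulling data on demand from a Data Service in Python might look roughly like the sketch below, using the JayDeBeApi package mentioned in the dependencies. Treat the specifics as assumptions to verify against the Data Services help page: the driver class name and the jdbc:pdi URL pattern are written from memory of the Pentaho documentation, and the helper names (data_service_url, connect_data_service, fetch_rows) are hypothetical. Because JayDeBeApi connections follow the Python DB-API, the fetch helper is demonstrated here against an in-memory SQLite table as a stand-in, so the sketch runs without a Pentaho Server.

```python
import sqlite3

# Assumption: thin JDBC driver class for PDI Data Services, as recalled from
# the Pentaho documentation. Verify it against your installed client JARs.
PDI_DRIVER_CLASS = "org.pentaho.di.trans.dataservice.jdbc.ThinDriver"


def data_service_url(host, port, webapp="pentaho"):
    """Build a JDBC URL in the assumed jdbc:pdi form."""
    return "jdbc:pdi://{}:{}/kettle?webappname={}".format(host, port, webapp)


def connect_data_service(host, port, user, password, jars):
    """Open a DB-API connection to a Data Service (needs a JVM and client JARs)."""
    import jaydebeapi  # imported lazily so the rest of the sketch runs without it

    return jaydebeapi.connect(
        PDI_DRIVER_CLASS, data_service_url(host, port), [user, password], jars
    )


def fetch_rows(conn, query):
    """Run a query on any DB-API connection and return (column names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        cursor.close()


# Stand-in demo: sqlite3 is also a DB-API driver, so fetch_rows can be tried
# locally. The table and its values are hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (region TEXT, amount REAL)")
demo.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10.0), ("west", 12.5)])
columns, rows = fetch_rows(demo, "SELECT region, amount FROM sales ORDER BY region")
demo.close()
print(columns, rows)
```

Against a real server, the Data Scientist would call connect_data_service with the server host, port, credentials, and the paths to the client JARs, then query the Data Service by name as if it were a table. The returned columns and rows can be passed to pd.DataFrame(rows, columns=columns) to produce the df used for modeling.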