Kevin Haas

Harvesting Raw Data into Custom Data APIs with Pentaho Data Integration

Blog Post created by Kevin Haas on Dec 15, 2017

This post originally published by Kevin Haas on Tuesday, July 14, 2015


When working with our clients, we find a growing number who regard their customer or transaction data as not just an internally leveraged asset, but one that can enrich their relationships with customers and supporting partners. They need to systematically share data and analytics outside their firewall.


One of the ways we help them achieve this goal is with custom Data APIs. In this article, we'll show you how to build Data APIs using visual programming within the Pentaho Data Integration interface. Along the way, we'll expose capabilities of the platform that are not always seen by the average user.


Inspecting The Field: Inside Pentaho Data Integration

For those who use Pentaho's Data Integration (PDI aka Kettle) platform regularly, you're probably aware of the amazing things you can do. For the uninitiated, PDI is a very powerful Extract-Transform-Load (ETL) technology that lets you access data from virtually any source, perform complex transformations, and push/load that data to virtually any destination. It's commonly used for integrating and shaping data into new forms optimized for analytical purposes (data warehouses, data marts, etc.).


However, Pentaho experts know that Kettle is more than an ETL platform. It is powered by an amazing, open, and flexible data management engine that can not only read and write data from relational databases, Hadoop, NoSQL databases, and web APIs, but can also serve data to other applications.


Enter one of the most powerful (but perhaps underutilized) components of the Pentaho Data Integration platform: the Data Integration Server. It can power your APIs.


Sowing The Seeds: The Data Integration Server

If you're an Enterprise Edition Pentaho customer, you're probably aware of the capabilities of the Data Integration Server. It's one of the foundational reasons why customers choose the Enterprise Edition of PDI. The Data Integration Server is a server application that allows you to schedule and run jobs and transformations created with PDI's "Spoon" visual development tool.


Community Edition users have a similar (but less sophisticated) alternative to the Data Integration Server in the included Carte server. If you're a Community Edition user, it's easy enough to start your own local Carte server on a port of your choosing. For example, you could start Carte on port 8081 like this:


carte.bat 8081


There are many configuration options for the Data Integration Server and Carte server that allow them to operate in all kinds of configurations. See the Carte documentation on the Pentaho site for the current list of options.


Bringing It To Life: Calling The Transformation on the Server with ExecuteTrans

The first step to create a Data API is to develop a Pentaho Data Integration transformation that gathers data from virtually any source: RDBMS, Hadoop, NoSQL databases, web APIs, etc. Since transformations can be parameterized, you can apply filtering, sorting, or any other custom logic. Again, this can all be developed using PDI's graphical development tool, Spoon.


Once you have installed and launched your Data Integration or Carte server, you can execute your transformations with a simple HTTP call to the ExecuteTrans method, passing the transformation path and any parameters in the query string. For example, suppose Carte is running on port 8081 and you want to execute a transformation stored in /srv/pentaho/sample.ktr that accepts a parameter named "parm".
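Assuming an out-of-the-box Carte install (the /kettle/executeTrans endpoint path and the stock cluster/cluster credentials are Carte conventions, not details from this example), the call can be sketched in Python like this:

```python
# Sketch: invoking Carte's executeTrans service over HTTP.
# Host, port, transformation path, and "parm" follow the example in
# the text; the default cluster/cluster basic-auth credentials are an
# assumption about a stock Carte install.
import base64
import urllib.request
from urllib.parse import urlencode

params = {"trans": "/srv/pentaho/sample.ktr", "parm": "value"}
url = "http://localhost:8081/kettle/executeTrans/?" + urlencode(params)

req = urllib.request.Request(url)
token = base64.b64encode(b"cluster:cluster").decode("ascii")
req.add_header("Authorization", "Basic " + token)
# body = urllib.request.urlopen(req).read()  # empty unless the
# transformation passes rows to the servlet
```

Any HTTP client works the same way; the transformation path simply rides along as the `trans` query parameter.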


When you launch this URL within a browser, you won't see a lot of information in return. In fact, you'll get nothing back! But if you look at the PDI logs on the server, you'll see that the sample.ktr transformation did run. This is all well and good -- you were able to launch a transformation on a server via an HTTP request. But now you want it to return some data!


Growing a Transformation into an API: The Output To Servlet Option

To provide an output, the transformation must produce one or more rows of data through a configured output step. To send rows of data out as the response from the ExecuteTrans method, your transformation must pass the data to a Text File Output (or JSON Output) step, with the step configured to "Pass output to servlet."



By checking the "Pass output to servlet" box, the transformation takes the stream of data entering this step and pushes it out as your HTTP response. Note that the exact contents of the response are determined by the step's formatting options (e.g. fields, headers, separators), so you'll want to configure these to match your needs.
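Because the response body is exactly what the step would otherwise have written to a file, the shape of the payload depends entirely on which output step you pick. A small illustration of the two shapes (the rows and field names here are made up):

```python
# Illustration: the servlet response is simply the output step's
# formatted stream. The rows and field names below are hypothetical.
import csv
import io
import json

rows = [{"id": 1, "name": "sample"}, {"id": 2, "name": "demo"}]

# Roughly what a Text File Output step (comma separator, header row)
# would emit as the response body:
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)
csv_body = buf.getvalue()

# Roughly what a JSON Output step (using "data" as the block name)
# would emit instead:
json_body = json.dumps({"data": rows})
```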


Bringing it to Market: A Taste of Farmers Market Data In A Custom Data API

Let's look at an example. With harvest season upon us, more and more farmers markets are filling up with their bounty. But where can you find local farmers markets? Thankfully, the USDA provides an API that lets you search for nearby markets.


This is a powerful set of data, but it's not always easy to use the API as provided. In this case, to find a market you first provide a zip code. The API returns a list of markets with their associated IDs, which you must then use to query the API again to get detailed information on each market. Also, we'd like to work with the results as JSON, but some elements of the API return comma-separated values. Finally, we need to format the returned address a bit better, attaching latitude and longitude for mapping.


To begin, we'll create a transform that accepts a parameter called "zip", and does the following:


1. Generate Rows: Initialize with a single row so Pentaho can process the input parameters.

2. HTTP Client: Call the USDA API with the input parameters to get the list of markets.

3. JSON Input: Parse the JSON that comes back into a list of markets.

4. JSON Input: Extract the market IDs from that list of markets.

5. Calculator: Build a URL string to re-query the API for detailed info on each market.

6. HTTP Client: Call the USDA API again for each of the markets to get the details.

7. JSON Input: Extract the detailed address info for each market.

8. HTTP Client: Call the Google Geocoding API to get the latitude and longitude.

9. JSON Input: Extract the latitude and longitude from Google's response.

10. JavaScript: Clean up and format the address and market info from the source APIs into nice JSON.

11. Group By: Group all of the detailed markets from the API into a single row.

12. JavaScript: Format the final JSON.

13. Text File Output: Pass the single row to the servlet as the HTTP response.
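The thirteen steps above can be sketched as a single Python function. This is only a sketch of the data flow: the endpoint URLs and JSON field names are invented stand-ins, and in the real solution the work is done by PDI steps, not code.

```python
# Sketch of the data flow the transformation implements. The URLs and
# field names are hypothetical, not the real USDA/Google contracts;
# fetch() is injected so the flow is easy to test.
import json

def build_market_feed(zip_code, fetch):
    """fetch(url) -> parsed JSON for that URL."""
    # Steps 1-4: list markets for the zip code and pull out their IDs.
    listing = fetch(f"https://markets.example.gov/zipSearch?zip={zip_code}")
    markets = []
    for market in listing["results"]:
        # Steps 5-7: build the detail URL and fetch address info.
        detail = fetch(f"https://markets.example.gov/mktDetail?id={market['id']}")
        # Steps 8-9: geocode the address for mapping.
        geo = fetch("https://geocode.example.com/json?address="
                    + detail["address"].replace(" ", "+"))
        location = geo["results"][0]["geometry"]["location"]
        # Step 10: clean up and merge the pieces for this market.
        markets.append({
            "marketname": market["marketname"],
            "location": detail["address"],
            "latitude": location["lat"],
            "longitude": location["lng"],
        })
    # Steps 11-13: collapse everything into one JSON document.
    return json.dumps({"markets": markets})
```

Injecting the fetch function mirrors how the transformation keeps each HTTP Client step independent of the parsing steps around it.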



A lot of steps, for sure, but their configuration is relatively simple. This is what the transformation looks like in the Spoon IDE:


Once we have developed and tested our transformation using Spoon, we are ready to use our Carte server and the ExecuteTrans method: we open a browser and enter the transformation's executeTrans URL, passing our zip code as the "zip" parameter.


Note, we can even front the Carte or DI server with Apache and do a little URL rewriting to make a friendlier URL to call. We did exactly that for this example, mapping a clean path onto the executeTrans URL.
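A mod_rewrite rule along these lines can do the mapping (the paths and transformation name here are hypothetical, and the [P] flag assumes mod_proxy is loaded so Apache proxies to Carte):

```apache
# Hypothetical rewrite: /api/markets/60634 -> Carte's executeTrans URL.
RewriteEngine On
RewriteRule "^/api/markets/([0-9]{5})$" "http://localhost:8081/kettle/executeTrans/?trans=/srv/pentaho/markets.ktr&zip=$1" [P,QSA]
```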


When you execute the API, it does all the work for you: it calls the various child APIs, does all the cleanup and formatting, and returns delicious-looking JSON built by the transformation. For example:


{
  "marketname": "Eli's Cheesecake and Wright College Farmer's Market",
  "location": "6701 W. Forest Preserve Drive, Chicago, Illinois, 60634",
  "address1": "6701 West Forest Preserve Drive",
  "schedule": [
    "06/12/2015 to 09/25/2015 Thu: 7:00 AM-1:00 PM"
  ],
  "products": [
    "Baked goods",
    "Crafts and/or woodworking items",
    "Cut flowers",
    "Fresh fruit and vegetables",
    "Canned or preserved fruits, vegetables, jams, jellies, preserves, dried fruit, etc.",
    "Maple syrup and/or maple products",
    "Soap and/or body care products"
  ]
}


Taking It Home: Building Your Own Data APIs

This is a fun illustration of how to pull data from various sources and craft it into your very own API. All of this was done with out-of-the-box Pentaho Data Integration capabilities, no extra parts required.


Clearly, you're not limited in what you can do with this capability. You can easily get data from virtually any source and wrap a custom data API around it, giving you new ways to leverage your existing infrastructure and share data. Pentaho has a lot of capabilities inside. We know that once you take PDI home from the market and look inside the box, you'll be inspired.


If you're interested in seeing how the transformation works, including a look at the transformation that powers the example, just let us know.