Jesse Zuckerman

Plotting ESRI Shapefiles on a Map in Pentaho

Blog post created by Jesse Zuckerman on Dec 18, 2017

This article was co-authored with Benjamin Webb

 

Foundational to any map, whether it be a globe, a GPS or an online map, is the ability to understand data about specific locations. Plotting geospatial data is powerful because it lets one distinguish, aggregate and display information in a very familiar manner. With Pentaho, one can use shapefiles to plot areas on a map within a dashboard and then explore data geospatially. In this example, we use C*Tools and Pentaho Data Integration (PDI) to examine the geographic spread of crime in the city of Chicago.

 

Getting and Preparing Shapefiles

 

There are many popular formats for geographic shape data. The most popular is the shapefile format developed and regulated by ESRI. Many geographic analytics packages produce this format, so it is relatively easy to find shapefiles for common geographic boundaries, including country, state/province, county, postal code, political boundaries and more. For this analysis, we’ll use data provided by the Chicago Data Portal.

 

First, to get the shapefiles, we download the Community Area Boundaries data file in ESRI format. To use it in Pentaho, we prepare the shapefile by converting it from ESRI format to GeoJSON with ogr2ogr, a command-line tool provided by GDAL. More information on this suite of tools can be found on their website and on GitHub. To execute the tool, we can use the following PDI transformation, which calls ogr2ogr.exe with parameters that specify GeoJSON as the destination file type along with the source and destination files.
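For reference, the call that the transformation wraps boils down to a command of the form ogr2ogr -f GeoJSON <destination> <source>. The sketch below only assembles that argument list in JavaScript. The file names are hypothetical placeholders, and the helper buildOgr2ogrArgs is our own illustration, not part of GDAL or PDI.

```javascript
// Build the argument list for: ogr2ogr -f GeoJSON <destination> <source>
// Note that ogr2ogr expects the destination file before the source file.
function buildOgr2ogrArgs(sourceShapefile, destinationGeoJson) {
    return ['-f', 'GeoJSON', destinationGeoJson, sourceShapefile];
}

// Hypothetical file names, for illustration only
var ogrArgs = buildOgr2ogrArgs('community-areas.shp', 'community-areas.geojson');
console.log('ogr2ogr ' + ogrArgs.join(' '));
```

The resulting command line can then be pasted into whichever PDI step is used to run a shell command.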

From this process, important information is collected on the 77 community areas in Chicago. As seen below in a small sample of the generated GeoJSON file, the entry for the community of Douglas includes a series of longitude and latitude pairs representing the points that form a polygon.

 

{
"type": "FeatureCollection",
"name": "geo_export_80095076-0a6b-4028-a365-64ec9f0350d7",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "perimeter": 0.0, "community": "DOUGLAS", "shape_len": 31027.0545098, "shape_area": 46004621.158100002, "area": 0.0, "comarea": 0.0, "area_numbe": "35", "area_num_1": "35", "comarea_id": 0.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.609140876178913, 41.84469250265397 ], [ -87.609148747578061, 41.844661598424025 ], [ -87.609161120412566, 41.844589611939533 ]…

 

Next, we can get crime data from the same site. This very large file contains detailed information on every crime occurring in Chicago since 2001, including the location, type of crime, date, community ID, etc. To reduce the size of this file, which has over 6 million rows, we first run a quick PDI transformation that groups the crimes by year and community.
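The aggregation itself is done in PDI, but the logic is simple enough to sketch in JavaScript. This is only an illustration of the grouping step: the field names year and communityArea are assumptions for the sketch, not the actual column names in the crime file, and the sample rows are made up.

```javascript
// Count crimes per (year, community area) pair.
// The field names `year` and `communityArea` are assumptions for this sketch.
function countCrimesByYearAndCommunity(crimes) {
    var counts = {};
    crimes.forEach(function (crime) {
        var key = crime.year + '|' + crime.communityArea;
        counts[key] = (counts[key] || 0) + 1;
    });
    // Turn the lookup back into one row per (year, community area)
    return Object.keys(counts).map(function (key) {
        var parts = key.split('|');
        return { year: Number(parts[0]), communityArea: parts[1], crimes: counts[key] };
    });
}

// Made-up rows, for illustration only
var sampleCrimes = [
    { year: 2015, communityArea: '35' },
    { year: 2015, communityArea: '35' },
    { year: 2016, communityArea: '25' }
];
console.log(countCrimesByYearAndCommunity(sampleCrimes));
```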

 

We need to join the crime dataset to the GeoJSON dataset we created, as only the latter contains the names of the community areas. Both datasets share a common geographic code, allowing us to blend the data by community area ID.
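As a rough sketch of that join, the snippet below builds a lookup from the GeoJSON properties shown earlier (area_num_1 and community) and attaches the name to each aggregated row. The row fields communityArea, crimes and area_name are assumptions for the sketch, and the crime count is made up.

```javascript
// Attach the community name from the GeoJSON to each aggregated crime row,
// matching on the shared community area ID.
function addCommunityNames(crimeRows, geoJson) {
    var nameById = {};
    geoJson.features.forEach(function (feature) {
        nameById[feature.properties.area_num_1] = feature.properties.community;
    });
    return crimeRows.map(function (row) {
        return {
            year: row.year,
            communityArea: row.communityArea,
            area_name: nameById[row.communityArea] || null, // null when no match
            crimes: row.crimes
        };
    });
}

// Minimal GeoJSON-like object using the property names from the sample above
var geoJsonSample = {
    features: [{ properties: { area_num_1: '35', community: 'DOUGLAS' } }]
};
var joinedRows = addCommunityNames(
    [{ year: 2015, communityArea: '35', crimes: 10 }], // made-up count
    geoJsonSample
);
console.log(joinedRows[0].area_name);
```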

 

For this C*Tools dashboard, we use another PDI transformation as the data source. As will be seen later, when the user selects a year on the dashboard, this transformation is triggered to get the total number of crimes for every Chicago community area in that year. We also get the maximum number of crimes across all community areas, to be used later in the color gradient.
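In JavaScript terms, the transformation does something like the following. This is a sketch under assumptions: the input field names are hypothetical, and placing the maximum in a third output column is our guess at a convenient layout, not a confirmed detail of the actual transformation.

```javascript
// Filter aggregated rows to one year and append the yearly maximum to each row.
// Output columns: [community area ID, crimes, max crimes in that year].
function crimesForYear(rows, year) {
    var selected = rows.filter(function (row) {
        return row.year === year;
    });
    var maxCrimes = selected.reduce(function (max, row) {
        return Math.max(max, row.crimes);
    }, 0);
    return selected.map(function (row) {
        return [row.communityArea, row.crimes, maxCrimes];
    });
}

// Made-up counts, for illustration only
var aggregated = [
    { year: 2015, communityArea: '25', crimes: 30 },
    { year: 2015, communityArea: '35', crimes: 10 },
    { year: 2016, communityArea: '25', crimes: 20 }
];
console.log(crimesForYear(aggregated, 2015));
```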

 

With this transformation, we now have a set of data that includes key metrics along with their corresponding geographic areas represented as polygons.

 

Create the Dashboard with NewMapComponent

 

Now to create the actual map in Pentaho, we will use the C*Tools Community Dashboard Editor (CDE). This will be done using the NewMapComponent under the Components View.

 

This powerful component utilizes a mapping engine that renders the world for free (either the OpenLayers or the Google engine). Then, GeoJSON or KML is used to plot polygons on top of the map. In our case, the GeoJSON file outlining the communities throughout Chicago will be mapped.

 

Much of the functionality will be used through JavaScript snippets in the Component Lifecycle steps, such as Pre-Execution and Post-Execution.

 

As a brief example here, we can load our GeoJSON file with the following JavaScript in our Pre-Execution stage:

 

// locate our GeoJSON file under resources. Note, the BA server does not
// recognize the .geojson extension, so we rename to .js
var getResource = this.dashboard.getWebAppPath() + '/plugin/pentaho-cdf-dd/api/resources';
var mapDef = '${solution:resources/geojson/community-areas-current-geojson.js}';

// here we pass in our GeoJSON file, and also specify the GeoJSON property
// to use as a polygon ID
this.shapeResolver = 'geoJSON';
this.setAddInOptions('ShapeResolver', 'geoJSON', {
     url: getResource + mapDef,
     idPropertyName: 'area_num_1'
});

 

This will plot the polygons, with one caveat. The NewMapComponent also takes a data source as input (see the properties tab). This data source must contain data that matches the IDs specified above in the GeoJSON file, and only those polygons for which data points with matching IDs exist will be rendered.

 

We can specify which columns from our data source to use for the ID and the fill value with another snippet in the Pre-Execution phase, like so:

this.visualRoles = {
        id: 0,                            // communityarea
        fill: 1                           // crimes
};

Note, here we defined the column for id as the first column (index 0), and use the 2nd column (index 1) as the fill value (more below).
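Because only polygons with a matching ID in the data source are rendered, a quick check for unmatched IDs can help explain missing areas on the map. The helper below is a hypothetical debugging aid, not part of the NewMapComponent API. It assumes the data rows are arrays with the ID in the first column, as configured above.

```javascript
// List the GeoJSON feature IDs that have no matching row in the data source.
// IDs are compared as strings, since the GeoJSON stores them as "35" etc.
function findUnmatchedIds(geoJson, rows, idPropertyName) {
    var dataIds = {};
    rows.forEach(function (row) {
        dataIds[String(row[0])] = true; // column 0 holds the community area ID
    });
    return geoJson.features
        .map(function (feature) {
            return String(feature.properties[idPropertyName]);
        })
        .filter(function (id) {
            return !dataIds[id];
        });
}

// Made-up example: area 35 has data, area 36 does not
var geoJsonCheck = {
    features: [
        { properties: { area_num_1: '35' } },
        { properties: { area_num_1: '36' } }
    ]
};
console.log(findUnmatchedIds(geoJsonCheck, [[35, 10]], 'area_num_1'));
```

Any ID returned here corresponds to a polygon that would not be drawn.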

 

To load our pre-aggregated data and render it on our map, we use the Kettle transformation described above, which takes a year parameter and then reads and filters the Aggregated-Crime-Counts.csv file.

 

This transformation ensures that the 1st column is the Community Area ID and the 2nd column is the # of crimes, to match our JavaScript above.

 

Finally, more JavaScript can be added to provide additional dashboard features. For our heat map example, we want to vary the fill color based on the number of crimes.

 

We've already linked the data with the code snippet above. The NewMapComponent has some defaults, but to ensure it works smoothly we can implement the fill function ourselves as follows, again in the Pre-Execution step:

// define the polygon's fill function manually based on the crimes/fill
// value incoming from the datasource
this.attributeMapping.fill = function(context, seriesRoot, mapping, row) {
    var value = row[mapping.fill];
    var maxValue = row[mapping.max];
    if (_.isNumber(value)) {
        return this.mapColor(value,
            0,                     // min crimes
            maxValue,              // max crimes from dataset in yr
            this.getColorMap()     // a default color map of green->red
        );
    }
};

 

The function above checks that the incoming fill/crimes value is numeric, then uses the default color map (green to red) to place each value on a scale from 0 to the maximum number of crimes in a year (in 2015 this number was 17,336, from the western community of Austin, seen below). All values in between fall somewhere on the gradient.
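To make the idea of a gradient concrete, here is a minimal standalone sketch of a linear green-to-red mapping. It is not the implementation of the component's mapColor function, only an illustration of the same concept; the function name valueToColor and the rgb() output format are our own choices.

```javascript
// Linearly interpolate between green (low) and red (high) for a value
// in the range [min, max]. Values outside the range are clamped.
function valueToColor(value, min, max) {
    if (max <= min) {
        return 'rgb(0, 255, 0)'; // degenerate range: treat everything as "low"
    }
    var t = Math.min(1, Math.max(0, (value - min) / (max - min)));
    var red = Math.round(255 * t);
    var green = Math.round(255 * (1 - t));
    return 'rgb(' + red + ', ' + green + ', 0)';
}

console.log(valueToColor(0, 0, 17336));     // rgb(0, 255, 0)
console.log(valueToColor(17336, 0, 17336)); // rgb(255, 0, 0)
```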

 

 

Another very useful feature that can be implemented within the NewMapComponent is a tooltip that displays the community name and the total number of crimes when the mouse hovers over a community area. This is implemented in the Post-Execution step, again using JavaScript.

 

function tooltipOnHover() {
    /* TOOLTIP ON MOUSE HOVER 2 */
    var me = this;

    // Reposition and refresh the popup whenever the mouse moves over a feature
    this.mapEngine.map.events.register('mousemove', this.mapEngine.map, function (e) {
        if (!_.isEmpty(me.currentFeatureOver)) {
            var modelItem = me.mapEngine.model.findWhere({
                id: me.currentFeatureOver.id
            });
            $('#popupObj')
                .css('top', e.pageY - 50)
                .css('left', e.pageX + 5)
                // html contained in popup
                .html(
                    modelItem.attributes.data.area_name +
                    '<br>Total Crimes: ' + modelItem.attributes.data.fill
                );
        }
    });
    this.mapEngine.map.events.register('movestart', this.mapEngine.map, function (e) {
        $('#popupObj').fadeIn(500);
    });
    this.mapEngine.map.events.register('moveend', this.mapEngine.map, function (e) {
        $('#popupObj').fadeOut(500);
    });
}

 

 

 

Analyzing geospatial data effectively, especially with tools like the NewMapComponent, can be very powerful. From this basic example, we can better understand how crime is distributed across a very large city and how that distribution has changed over time. Using polygons allows us to group the data in a way that yields valuable insight.

 

This approach is heavily indebted to Kleyson Rios's NMC-samples repository, which has similar examples and can also be zipped & uploaded for exploring the NewMapComponent.

 

The code for this example can be found on GitHub.
