Clifford Grimm

Geospatial Data Usage in HCI

Blog Post created by Clifford Grimm Employee on Apr 11, 2018

Problem Overview

 

As businesses and governments become more dispersed across the planet, the location of activity becomes almost as important as the data itself.  Increasingly, these entities find that location provides value in finding patterns for predictive analytics, or simply in understanding the current and past location of assets.  A position on the globe is expressed as a geospatial location.

 

Geospatial location can be expressed in many different forms.  Over the years, organizations have created location specifications that focus on a specific region, and others that can specify any position around the globe on land, sea, or air.  This article will not describe all the possible ways to specify location, but will instead focus on the most common mechanism used by readily available modern mapping software.  Google Maps and Google Earth are likely the most popular examples.  These mapping solutions use the geodetic system called WGS 84 (World Geodetic System 1984), which references locations at sea level.

 

Many articles on the internet describe these and other location specifications in great detail.  The following URLs are a starting point:

 

https://en.wikipedia.org/wiki/Geodetic_datum

https://en.wikipedia.org/wiki/World_Geodetic_System

http://www.earthpoint.us/Convert.aspx

 

This article will focus on the WGS84 system expressing latitude and longitude.  The WGS84 system has multiple ways of specifying latitude and longitude, but this article will focus on the decimal number specification. For example, the position for the Empire State Building, New York City, New York, USA is:

 

Latitude: 40.748393

Longitude: -73.985413
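Coordinates are often published in degrees/minutes/seconds rather than decimal form. As a minimal sketch of the relationship between the two notations, the Python function below converts one to the other. The function name and the DMS values used for the Empire State Building are illustrative assumptions, not values taken from HCI or from this article's data set.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    return -value if hemisphere in ("S", "W") else value


# Approximate DMS position of the Empire State Building (assumed for illustration).
lat = dms_to_decimal(40, 44, 54.2, "N")
lon = dms_to_decimal(73, 59, 7.5, "W")
print(f"Latitude: {lat:.6f}")
print(f"Longitude: {lon:.6f}")
```

The sign convention is the important part: a west longitude such as the one above becomes a negative decimal value.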

 

Once location data is available, the key step is to use this information for relevant content discovery.  This article uses pictures taken with a camera that records the coordinates of the location where each picture was taken.  The HCI platform can enable this kind of discovery with appropriate setup.

 

This article will discuss the following topics to perform geospatial search.

  • Geospatial Data Preparation
  • Solr Index Preparation
  • Workflow Construction
  • Performing Geospatial Search

 

Geospatial Data Preparation

 

The first part of this effort is the construction of the data discovery and extraction phase that prepares data for indexing.  The images are preloaded into an HCP namespace to be used by HCI.  An HCI data connection of type HCP MQE, named “Image Data Lake”, is used to access the images.  There is nothing special about the data connection, so it is not detailed in this article.

For processing the images in preparation for indexing, an HCI pipeline was constructed with 3 main parts:

  1. Data Extraction of geospatial information from images
  2. Data Enrichment and Preparation of geospatial information
  3. Date/Time preparations

 

Data Extraction

 

The data extraction portion simply makes sure that the document is a JPEG file, and then performs the generic Text/Metadata Extraction to get any geospatial coordinates.  This stage places the coordinates in the geo_lat and geo_long document fields.

 

Screen Shot 2018-04-11 at 3.06.46 PM.png

 

Data Enrichment

 

The high-level goal of this part is to prepare geospatial information for indexing.  If the previous part generated geo_lat and geo_long fields, processing will proceed for enrichment.

 

The first 3 stages use the Geocoding stage to generate human-readable city, state, and country information.  The results are document fields named loc_city, loc_state, loc_country, and loc_display (a combination of the other 3 fields).

 

The last two stages set up a GPS_Coordinates field in the form <lat>,<long>.  Solr requires this format for the location data type that will exist in the index once it is created later in this article.  The Tagging stage sets GPS_Coordinates to the string “LAT,LON”, which serves as a template for the next stage.  The Replace stage then replaces LAT with the contents of field geo_lat and LON with the contents of field geo_long, producing a document field like:

 

GPS_Coordinates: 40.748393,-73.985413
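The template-and-replace approach described above can be sketched outside of HCI in a few lines of Python. This is only an illustration of the logic, not the implementation of the HCI Tagging or Replace stages; the function name and the use of a plain dictionary to stand in for an HCI document are assumptions.

```python
def build_gps_coordinates(doc: dict) -> dict:
    """Mimic the Tagging + Replace stages: fill a "LAT,LON" template from geo_lat/geo_long."""
    if "geo_lat" not in doc or "geo_long" not in doc:
        return doc  # no coordinates were extracted, so there is nothing to enrich

    enriched = dict(doc)
    # Tagging stage equivalent: create the field holding the template string.
    enriched["GPS_Coordinates"] = "LAT,LON"
    # Replace stage equivalent: substitute the placeholders with the field values.
    enriched["GPS_Coordinates"] = (
        enriched["GPS_Coordinates"]
        .replace("LAT", str(doc["geo_lat"]))
        .replace("LON", str(doc["geo_long"]))
    )
    return enriched


sample = {"geo_lat": "40.748393", "geo_long": "-73.985413"}
print(build_gps_coordinates(sample)["GPS_Coordinates"])
```

Documents without extracted coordinates pass through unchanged, which mirrors the conditional processing described for this part of the pipeline.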

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.07.39 PM.png

 

Date/Time Preparations

 

In general, dates and times can be specified in a nearly infinite number of ways.  Although HCI has a built-in Date Conversion stage, a little processing is still required to prepare the values so that HCI can index the dates.

 

For instance, the GPS date and time information is returned by the Text/Metadata Extraction stage as two different fields.  The goal is to construct a GPS_DateTime field from these two fields in a form that the Date Conversion stage can understand without additional definitions in that stage.  The sample fields generated by Text/Metadata Extraction are:

 

GPS_Date_Stamp: 2018:01:27

GPS_Time_Stamp: 17:11:25.000 UTC

 

The goal is to put it into the following form that is understood by the Date Conversion stage:

 

2018-01-27T17:11:25.000+0000

 

What precisely each stage does is again beyond the scope of this article.
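Although the individual stages are not detailed here, the overall transformation can be sketched in Python. The function name is hypothetical, and the sketch assumes the time stamp always carries fractional seconds and a trailing "UTC" as in the sample above; real camera metadata may vary.

```python
from datetime import datetime, timezone


def build_gps_datetime(date_stamp: str, time_stamp: str) -> str:
    """Combine GPS_Date_Stamp and GPS_Time_Stamp into a single ISO-style string."""
    # "17:11:25.000 UTC" -> "17:11:25.000"
    time_part = time_stamp.replace("UTC", "").strip()
    parsed = datetime.strptime(f"{date_stamp} {time_part}", "%Y:%m:%d %H:%M:%S.%f")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Emit millisecond precision followed by a numeric UTC offset.
    millis = f"{parsed.microsecond // 1000:03d}"
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.") + millis + parsed.strftime("%z")


print(build_gps_datetime("2018:01:27", "17:11:25.000 UTC"))
```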

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.08.24 PM.png

 

Solr Index Preparation

 

Solr contains built-in support for indexing geospatial coordinates based on the WGS84 coding system.  For indexing, the field must be of a specific type and be formatted for that type.  The formatting was already performed in the pipeline, but essentially the field must be of the form:

 

<lat>,<long>

 

To prepare HCI for indexing and performing geospatial search, it is required to:

  1. Patch the HCI installation,
  2. Construct an appropriate index schema, and
  3. Define a Query Result configuration

 

HCI Installation Patch

 

When an HCI index is constructed, a base configuration included in the HCI installation is used to help simplify the index creation process.  One part of the base configuration is the managed-schema.  This is essentially a configuration mechanism that maps internal Solr data types to simpler names, along with default data type attributes.

 

For the purposes of geospatial data, there is a data type called location.  However, there is a problem with the definition in HCI as it uses deprecated and inadequate data types in the definition.   The current implementation of HCI (1.2) allows for changing the managed-schema of an index once it is created; however, the problem with this approach is that if the index is exported and imported, any changes to the managed-schema on that index will be lost.

 

For most deployments, it may not be necessary to export and then re-import an index; however, during development of workflows it is common practice to clear out the index completely from time to time.  Thus the recommendation is to patch the HCI installation until HCI is updated with a more appropriate definition for this data type.

 

The patch procedure is to manually edit 3 files on each node of the HCI installation.  Assuming the HCI installation is rooted at /opt/hci and the HCI version is 1.2.1.139, the managed-schema files are rooted at:

 

/opt/hci/1.2.1.139/data/com.hds.ensemble.plugins.service.adminApp/solr-configs

 

In this folder, the following are the relative paths to the 3 files that need to be updated:

 

basic/managed-schema

default/managed-schema

schemaless/managed-schema

 

 

The changes to basic/managed-schema and default/managed-schema are identical: change the definition of the location field type and add a locations field type for multi-valued fields.  The following are the old and new line(s).

 

OLD LINE:

 

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

NEW LINES:

 

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>
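Because the same substitution must be made in more than one file on every node, it may be convenient to script it. The following Python sketch applies the replacement shown above to the text of a managed-schema file while preserving the original indentation. The function name is hypothetical, and the sketch assumes the old definition appears on a single line exactly as shown; back up each file and review the result before restarting HCI.

```python
import re

OLD_LOCATION = '<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>'
NEW_LOCATION_LINES = [
    '<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>',
    '<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>',
]

# Match the old definition on its own line, capturing any leading indentation.
_PATTERN = re.compile(r"^([ \t]*)" + re.escape(OLD_LOCATION) + r"[ \t]*$", re.MULTILINE)


def patch_location_type(schema_text: str) -> str:
    """Return schema_text with the old location fieldType replaced by the new definitions."""

    def _replace(match: re.Match) -> str:
        indent = match.group(1)
        return "\n".join(indent + line for line in NEW_LOCATION_LINES)

    patched, count = _PATTERN.subn(_replace, schema_text)
    if count == 0:
        raise ValueError("old location fieldType not found; the file may already be patched")
    return patched


example = "<schema>\n    " + OLD_LOCATION + "\n</schema>"
print(patch_location_type(example))
```

Raising an error when the old line is missing makes it harder to assume, incorrectly, that a node has been patched.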

 

The changes to schemaless/managed-schema are a bit more complex and require 3 changes.

 

Change 1:  Delete the following lines.

 

<!-- Type used to index the lat and lon components for the "location" FieldType -->

<dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false" />

 

 

Change 2: Change the following lines.

    OLD:

 

<dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

 

    NEW:

 

<dynamicField name="*_p"  type="location" docValues="true" stored="true"/>

<dynamicField name="*_ps"  type="location" docValues="true" multiValued="true" stored="true"/>

 

 

Change 3: Change the following lines.

   OLD:

 

<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

 

   NEW:

 

<!-- A specialized field for geospatial search. -->

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>

 

Once all the changes have been made to the files, then the HCI software needs to be restarted. On a CentOS system, the command executed on all nodes would be:

 

systemctl restart HCI

 

Wait for HCI to restart completely, until all services are running.  This can be monitored in the Admin GUI under Monitoring -> Dashboard -> Services.

 

NOTES:

  • If nodes are added to the HCI installation, those new nodes must also be patched.  Otherwise, unexpected index definitions may occur if the AdminApp service runs on those nodes.

 

  • If this configuration is desired pre-installation of HCI, the procedure can be performed on the installation distribution and then the distribution repackaged.  Then all instances of HCI installations using the patched distribution media will contain the appropriate changes and will survive node additions.

 

Solr Index Schema Definition

 

Once the managed-schema definition has been updated in the installation, the next step is to create a Solr index schema to accept the collected geospatial data.  To keep things simple, the HCI index is created with the Basic type.  The Basic initial schema creates a base set of HCI fields for indexing.

 

Screen Shot 2018-04-11 at 3.34.36 PM.png

 

Along with the basic fields, the fields that will contain the location information generated by the Geocoding built-in stage must be created as simple strings:

 

Screen Shot 2018-04-11 at 3.09.44 PM.png

 

To hold the geospatial information, the following index fields need to be created:

 

Screen Shot 2018-04-11 at 3.10.27 PM.png

 

The GPS_DateTime field is a simple date index field.

The GPS_Coordinates field is a location field type and will contain the definition configured in the managed-schema definition.

 

To verify the new definition of the location field type is activated on HCI, within the schema view of the index just created (1) click on Advanced link, (2) click on managed-schema configuration file, and (3) observe the definition of the location field type as shown below.

 

Screen Shot 2018-04-11 at 3.33.21 PM.png

 

 

HCI Query Result Definition

 

The next step is to modify the image index Query Settings to specify how content should be returned by queries.  At a minimum, it is necessary to add the GPS_* and loc_* index fields.  Because only a few fields were added, the simplest approach is to add all fields that exist in the index schema definition to the query settings.  This is accomplished in the index Query Settings page under Fields, by selecting the Action “Add All Fields”.  See the picture below for guidance.

 

Screen Shot 2018-04-11 at 3.32.33 PM.png

 

Optionally, these fields can also be added to the Query Settings Results; this is left as an exercise for the reader.

Workflow Construction


At this point, there is a Data Connection named “Image Data Lake” pointing at an HCP namespace, a pipeline named “Geospatial Image Indexing” that processes the images, and an index named “ImageIdx” that can receive the fields.  The last step is to construct a workflow that can be run to generate the index.  Create the workflow by running the wizard and adding the data connection, pipeline, and output index.  The result when viewing the workflow should look like the following:

 

Screen Shot 2018-04-11 at 3.05.11 PM.png

 

Run the workflow to generate the index.

Performing Geospatial Search

 

Now comes the fun part: performing geospatial queries.  As previously mentioned, Solr has a powerful set of capabilities around geospatial points.  The simplest is the range search, where two points are provided and all content within the box they form is returned.  More advanced capabilities use Solr functions to find all points within a distance from a given point, to search a bounding box sized by the distance from a point, and to boost and sort results based on the distance from a point.

 

For a fuller description see the following URL:

 

https://lucene.apache.org/solr/guide/6_6/spatial-search.html

 

WORD OF CAUTION:  There were problems with using some forms of the examples at this link specifically around the usage of Solr query filters.  Either it was user error, HCI confusion, or errors (or deprecated specifications) in the examples.  Regardless, the following examples are the forms that worked with HCI.

 

The simplest form of query finds all images that reside within a rectangular box constructed from two points.  This is also called a range search on geospatial data.  The range search consists of the lower-left corner of the box and the upper-right corner of the box.  An example criterion that can be used in the advanced query in the Search Console is:

 

+GPS_Coordinates:[42.377,-71.526 TO 42.378,-71.524]

 

This finds all images within the rectangle with a lower-left point of 42.377,-71.526 and an upper-right point of 42.378,-71.524.  Below is the example run in HCI Search Console.

 

Screen Shot 2018-04-11 at 3.14.34 PM.png
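When the corner points come from another program rather than being typed by hand, the range criterion can be assembled programmatically. The Python helper below is a minimal sketch; the function name is hypothetical, and it only checks that the latitudes are ordered (it makes no attempt to handle boxes that cross the antimeridian).

```python
def range_query(field: str, lower_left: tuple, upper_right: tuple) -> str:
    """Build a Solr range criterion for a box given as (lat, lon) corner points."""
    (lat1, lon1), (lat2, lon2) = lower_left, upper_right
    if lat1 > lat2:
        raise ValueError("lower-left latitude must not exceed upper-right latitude")
    # Solr expects "<lat>,<long> TO <lat>,<long>" inside square brackets.
    return f"+{field}:[{lat1},{lon1} TO {lat2},{lon2}]"


print(range_query("GPS_Coordinates", (42.377, -71.526), (42.378, -71.524)))
```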

 

The next example uses a built-in Solr geospatial query function.  To specify additional query parameters via the HCI Search Console, this functionality must first be enabled.  This is accomplished in the main settings of the index Query Settings, as shown below.

 

Screen Shot 2018-04-11 at 3.31.41 PM.png

 

One such function is to filter content within a circle from a specified point and distance from that point. This function is called geofilt. The below image illustrates what this represents.

 

GeoCircle.png

 

Below is an example in the HCI Search Console using the geofilt filter for finding all pictures that are less than 1 kilometer from the point 42.39,-71.6.

 

Screen Shot 2018-04-11 at 3.25.01 PM.png
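To make the behavior of geofilt more concrete, the Python sketch below tests whether a point lies within a given distance of a center using the haversine great-circle formula, and also formats a filter in the local-parameter style documented in the Solr reference guide. This is an illustration only: Solr's internal distance calculation may differ in detail, the function names are hypothetical, and whether the HCI Advanced Parameters field expects exactly this string (including the fq= prefix) is an assumption that should be checked against the screenshot above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_circle(point: tuple, center: tuple, distance_km: float) -> bool:
    """True if point is no farther than distance_km from center (both are (lat, lon))."""
    return haversine_km(*point, *center) <= distance_km


def geofilt_param(field: str, center: tuple, distance_km: float) -> str:
    """Format a geofilt filter query; d is expressed in kilometers."""
    lat, lon = center
    return f"fq={{!geofilt sfield={field} pt={lat},{lon} d={distance_km}}}"


center = (42.39, -71.6)
print(geofilt_param("GPS_Coordinates", center, 1))
print(within_circle((42.395, -71.6), center, 1))
print(within_circle((42.41, -71.6), center, 1))
```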

 

The last example uses the built-in Solr geospatial filter bbox.  This filter constructs a box centered on a given point, where the midpoint of each edge of the box lies at the specified distance from that point.

 

bbox.png

To perform this type of query, change the geofilt filter to bbox in the Advanced Parameters field as shown below.

 

Screen Shot 2018-04-11 at 3.26.51 PM.png
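The box that bbox describes can be approximated by hand, which is useful for sanity-checking results or for feeding the corners into a range search. The Python sketch below uses a simple spherical approximation; the function name is hypothetical, the result is not guaranteed to match Solr's own calculation exactly, and the approach breaks down near the poles and the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def bounding_box(center: tuple, distance_km: float) -> tuple:
    """Return ((lat, lon) lower-left, (lat, lon) upper-right) of a box around center."""
    lat, lon = center
    # Degrees of latitude covered by the distance are the same everywhere.
    dlat = math.degrees(distance_km / EARTH_RADIUS_KM)
    # Degrees of longitude grow as the circles of latitude shrink toward the poles.
    dlon = math.degrees(distance_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon), (lat + dlat, lon + dlon)


lower_left, upper_right = bounding_box((42.39, -71.6), 1)
print(lower_left, upper_right)
```

Because the box fully contains the circle of the same distance, a bbox search can return points in the corners that a geofilt search with the same parameters would exclude.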

Hope you enjoyed this article and gained valuable knowledge on how to utilize HCI to perform geospatial search for content.
