
Hitachi Content Intelligence


Use this blog post to import the HCI workflow component configuration .bundle file for the integration of Hitachi Data Instance Director (HDID), Hitachi Content Platform (HCP), and Hitachi Content Intelligence (HCI).

Importing the HCI bundle creates the HCI workflow, data connection, pipeline, index, and content class needed for the HDID, HCP, and HCI product integration to provide self-service search using HCI.

 

Please find the "hci_HDID_HCP_HCI_Export.bundle" file attached.

 

Follow the steps below to import the HCI bundle, which includes the HCI workflow, data connection, pipeline, index, and content class.

1.  Log in to the Hitachi Content Intelligence Admin app at the web URL https://IP:8000 with the admin user name and password.

2.  Click on Workflows. The Workflow Designer page opens.

3.  Click Import/Export, then click Import.

4.  Click Upload to upload the bundle into HCI.

5.  Download the attached ‘hci_HDID_HCP_HCI_Export.bundle’ file to a local location, then browse to and select that HCI bundle.

6.  Select all the components to import.

7.  Click the Complete Import button. This imports the HCI workflow component configuration .bundle file into HCI.

 

After a successful HCI bundle import, the HCI workflow, data connection, pipeline, index, and content class are created.

 

Workflow:

  • Workflow 'Workflow_HDID_HCP_HCI' was created.
  • This workflow is used to perform self-service search of Hitachi Content Platform data using Hitachi Content Intelligence with Hitachi Content Search.

Data Connection:

  • Data connection 'DataConnection_HDID_HCP_HCI' was created with 'HCP MQE' connection type.
  • This data connection is used in the 'Workflow_HDID_HCP_HCI' workflow.

  • Users need to modify the HCP data connection details (HCP system name, HCP tenant name, HCP namespace name, and HCP tenant user name and password) to match their environment. The figure shows the HCP connection details from the environment where the bundle was exported.

Processing Pipeline:

  • Pipeline 'ProcessingPipeline_HDID_HCP_HCP' was created.
  • This pipeline is used in the 'Workflow_HDID_HCP_HCI' workflow.
  • This pipeline detects document types, expands archives, and performs basic content and metadata extraction. It is suitable for basic enterprise search use cases.

  • In the 'ProcessingPipeline_HDID_HCP_HCP' pipeline, the content class 'ContentClass_HDID_HCP_HCI' (imported along with the other components) was added inside the Content Class Extraction stage.

 

Index Collections:

  • Index 'IndexCollections_HDID_HCP_HCP' was created.
  • This index collection is used in the 'Workflow_HDID_HCP_HCI' workflow.

Content Classes:

  • Content Class 'ContentClass_HDID_HCP_HCP' was created.
  • This content class is used in the 'ProcessingPipeline_HDID_HCP_HCP' pipeline.

  • The following content properties were added to the 'ContentClass_HDID_HCP_HCP' content class; they are needed to search the HCP data that was copied from the source data using HDID.


We have just released the Content Intelligence version 1.3.1 maintenance release.

 

This release contains a number of bug fixes. The release notes and installation instructions are available along with the product downloads from the Downloads page.

Hi everybody,

 

I've been cooking a couple of tools to try to analyze the capacity consumed by live and backup object versions in HCP using Content Intelligence.

 

What I have managed to obtain so far works like this:

The first tool is a stage plugin that calculates the total size of all versions of the object:

And the second tool is a Python script that uses the data obtained above to generate a report with the size of active and backup versions aggregated by a field of your choice, so you can obtain the capacity consumption of, for example, each namespace, as seen here:
The script output is a CSV report stamped with that day's date, so you can automate its execution to produce reports each week, for example.

 

You can find the source code, plugin and python script here:

 

https://hcpanywhere.hitachivantara.com/u/1ORe3kpkXtfG5F7E/Latest?l

 

This is all a very early concept for a PoC, so I would appreciate any tips, advice, suggestions, or corrections you have.

 

There are some things that I'm still not sure about, specifically:

 

  • Can I access the authentication token in the HCP connector from inside the stage? If I could do that I wouldn't have to configure the auth token in the stage configuration.
  • Is the SOLR stats functionality expected to be added to the HCI Search API in the future (or was it already added and I didn't realize)? It's what I'm using provisionally for the PoC in the python script at the moment, but I would like to rewrite it to use the HCI Search API, if possible.

 

Thank you in advance!

 

EDIT 26/07/18 - Updated plugin and source code, modifying the authorization settings as suggested by Yury. Thanks again for the tip!

 

 

Jon Chinitz

Plugins

Posted by Jon Chinitz Employee Jun 14, 2018

I am seeing more folks contributing plugins, whether they are pipeline stages or connectors. I have created a dedicated card on the Overview page that will list them all. To have your plugin automatically added, tag the upload with the string "hci stage" (or "plugin", "stage", "connector").

 

Thanks for all your contributions and keep them coming!

 

Jonathan

Hitachi Content Intelligence delivers a flexible and robust solution framework to provide comprehensive discovery and quick exploration of critical business data and storage operations.

 

Make smarter decisions with better data and deliver the best information to the right people at the right time.

  • Connect to all of your data for real-time access regardless of its location or format - including on-premises, off-premises, or in the cloud
  • Combine multiple data sources into a single, centralized, and unified search experience
  • Data in context is everything – put data into meaningful form that can be easily consumed
  • Deliver relevant and insightful business information to the right users - wherever they are, whenever they need it

 

Designed for performance and scalable to meet your needs.

  • Flexible deployment options enable physical, virtual, or hosted instances
  • Dynamically scale performance up to 10,000+ nodes
  • Adopt new data formats, and create custom data connections and processing stages for business integrations and custom applications with a fully-featured software development kit

 

Connect Understand Act.png

 

What’s new in Hitachi Content Intelligence v1.3

 

  • Hitachi Content Monitor
  • Simplified navigation of Hitachi Content Intelligence consoles
  • External storage support for Docker Service Containers
  • Increased flexibility with new Workflow Jobs
  • Enhanced data processing actions
  • New and improved data connectors
  • Overall improvements to performance and functionality

 

Hitachi Content Monitor provides enhanced storage monitoring for Hitachi Content Platform.

  • Centrally monitor HCP G Series and HCP VM storage performance at scale, in near real-time, and for specific time periods
  • Analyze trends to improve capacity planning of resources - such as storage, compute, and networking
  • Customize monitoring of performance metrics that are relevant to business needs
  • Create detailed analytics and graphical visualizations that are easy to understand

 

HCP Storage and Objects - for blog.png

 

Hitachi Content Platform (HCP) is a massively scalable, multi-tiered, multi-tenant, hybrid cloud solution that spans small, mid-sized, and enterprise organizations.  While HCP already provides monitoring capabilities, Hitachi Content Monitor (Content Monitor) is a tightly-integrated, cost-effective add-on that delivers enhanced monitoring and performance visualizations of HCP G Series and HCP VM storage nodes.

 

Content Monitor’s tight integration with HCP enables comprehensive insights into HCP performance to enable proactive capacity planning and more timely troubleshooting.  Customizable and pre-built dashboards provide a convenient view of critical HCP events and performance violations.  Receive e-mail and syslog notifications when defined thresholds are exceeded.  Aggregate and visualize multiple HCP performance metrics into a single view, and correlate events with each other to enable deeper insights into HCP behavior.

 

Content Monitor is quick to install, easy to configure, and simple to use.

 

HCP Application Load.png

With Content Monitor, a feature of the Hitachi Content Intelligence (Content Intelligence) product, you can monitor multiple HCP clusters in near real-time from a single management console for information on capacity, I/O, utilization, throughput, latency, and more.

 

 

Simplified navigation

  • Easily and seamlessly navigate, and automatically authenticate, between Content Intelligence apps (Admin, Search, Monitor) with enhanced toolbar actions 
  • No more need for numerous web browser tabs

 

External storage support for Docker Service Containers

  • Use external storage with Content Intelligence for more robust data storage features and improved sharing of remote volumes across multiple containers

 

Increased flexibility with new Workflow Jobs

  • Each Content Intelligence workflow job can now be individually monitored and configured to run on all Content Intelligence instances, a specific subset of instances, or to float across instances to dynamically run wherever resources are available

 

Enhanced data processing actions

  • Conditionally index processed documents to existing Content Intelligence, Elasticsearch, or Apache Solr indexes
  • New Aggregation calculations for 'Standard Deviation' and 'Variance' of values in fields of data

 

New and improved Content Intelligence data connectors

  • New connector for performance monitoring of HCP systems
  • New connector for processing HCP syslog events on Apache Kafka queues, and improvements to existing Kafka queue connectors

 

For more information, join the Hitachi Content Intelligence Community.

 

Also, check out the following resources:

 

Thanks for reading!

 


Michael Pacheco

Senior Solutions Marketing Manager, Hitachi Vantara

 

Follow me on Twitter:  @TechMikePacheco

Problem Overview

 

As technology and businesses/government continue to become more dispersed across the planet, the location of activity becomes almost as important as the data itself.   Increasingly, these entities find that location of activity provides value in finding patterns for predictive analytics or simply understanding the current and past location of assets.  Positioning on the globe is expressed as a geospatial location.

 

Geospatial location can be expressed in many different forms.  Over the years, organizations have created location specifications that focus on a specific region, and others that specify anywhere around the globe on land, sea, and air.  This article will not describe all the possible ways to specify location, but instead focuses on the most common mechanism used by readily available modern mapping software.  For instance, Google Maps and Google Earth are likely the most popular.  These mapping solutions use the geodetic system that references locations at sea level, called WGS 84 (World Geodetic System).

 

There are many articles on the internet that describe these and other location specifications in great detail.  A good starting point is the following URLs:

 

https://en.wikipedia.org/wiki/Geodetic_datum

https://en.wikipedia.org/wiki/World_Geodetic_System

http://www.earthpoint.us/Convert.aspx

 

This article will focus on the WGS84 system expressing latitude and longitude.  The WGS84 system has multiple ways of specifying latitude and longitude, but this article will focus on the decimal number specification. For example, the position for the Empire State Building, New York City, New York, USA is:

 

Latitude: 40.748393

Longitude: -73.985413

 

Once location information about data is available, the key is to be able to use it for relevant content discovery.  This article uses pictures taken with a camera that records the coordinates of the location where each picture was taken. With the appropriate setup, the HCI platform can enable this kind of discovery.

 

This article will discuss the following topics to perform geospatial search.

  • Geospatial Data Preparation
  • Solr Index Preparation
  • Workflow Construction
  • Performing Geospatial Search

 

Geospatial Data Preparation

 

The first part of this effort is the construction of the data discovery and extraction phase that prepares data for indexing.  The images are preloaded into an HCP namespace to be used by HCI.  An HCI data connection of type HCP MQE, named “Image Data Lake”, is used to access the images.  There is nothing special about the data connection, so it will not be detailed in this article.

For processing the images in preparation for indexing, an HCI pipeline was constructed with 3 main parts:

  1. Data Extraction of geospatial information from images
  2. Data Enrichment and Preparation of geospatial information.
  3. Date/Time preparations

 

Data Extraction

 

The data extraction portion simply makes sure that the document is a JPEG file, and then performs the generic Text/Metadata Extraction to get any geospatial coordinates.  This stage places the coordinates in the geo_lat and geo_long document fields.

 

Screen Shot 2018-04-11 at 3.06.46 PM.png

 

Data Enrichment

 

The high-level goal of this part is to prepare geospatial information for indexing.  If the previous part generated geo_lat and geo_long fields, processing will proceed for enrichment.

 

The first three stages use the Geocoding stage to generate human-readable city, state, and country information.  The results are document fields named loc_city, loc_state, loc_country, and loc_display (a combination of the other three fields).

 

The last two stages set up a GPS_Coordinates field in the form <lat>,<long>. This format is required by Solr for the location data type that will exist in the index once we create it later in this article. The Tagging stage sets GPS_Coordinates to the string “LAT,LON”, which is used as a template for the next stage.  The Replace stage then replaces LAT with the contents of the geo_lat field and LON with the contents of the geo_long field, producing a document field like:

 

GPS_Coordinates: 40.748393,-73.985413

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.07.39 PM.png

 

Date/Time Preparations

 

In general, dates and times can be specified in a nearly infinite number of ways. Although HCI has a built-in Date Conversion stage, a little bit of processing is still required so that HCI can index the dates.

 

For instance, the GPS date and time information is returned by the Text/Metadata Extraction stage as two different fields. The goal is to reconstruct a GPS_DateTime field from these two fields in a form that the Date Conversion stage can understand without additional definitions in that stage.  The sample fields generated by Text/Metadata Extraction are:

 

GPS_Date_Stamp: 2018:01:27

GPS_Time_Stamp: 17:11:25.000 UTC

 

The goal is to put it into the following form that is understood by the Date Conversion stage:

 

2018-01-27T17:11:25.000+0000

 

What precisely each stage does is again beyond the scope of this article.
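That said, for orientation only, here is a minimal Java sketch of the kind of string reshaping those stages perform conceptually. The field names and values are taken from the examples above; this is an illustration, not how the HCI stages are actually implemented.

public class GpsDateTimeExample {
    public static void main(String[] args) {
        String gpsDateStamp = "2018:01:27";           // GPS_Date_Stamp from Text/Metadata Extraction
        String gpsTimeStamp = "17:11:25.000 UTC";     // GPS_Time_Stamp from Text/Metadata Extraction

        String date = gpsDateStamp.replace(':', '-');           // 2018-01-27
        String time = gpsTimeStamp.replace(" UTC", "+0000");    // 17:11:25.000+0000
        String gpsDateTime = date + "T" + time;                 // 2018-01-27T17:11:25.000+0000

        System.out.println("GPS_DateTime: " + gpsDateTime);     // form the Date Conversion stage accepts
    }
}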

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.08.24 PM.png

 

Solr Index Preparation

 

Solr contains built-in support for indexing geospatial coordinates based on the WGS84 coordinate system. For indexing, the field must be of a specific type and be formatted for that type. The formatting was already performed previously; essentially, the field must be of the form:

 

<lat>,<long>

 

To prepare HCI for indexing and performing geospatial search, it is required to:

  1. Patch the HCI installation,
  2. Construct an appropriate index schema, and
  3. Define a Query Result configuration

 

HCI Installation Patch

 

When an HCI index is constructed, a base configuration included in the HCI installation is used to help simplify the index creation process. One part of the base configuration is the managed-schema concept. This is essentially a configuration mechanism that maps internal Solr data types to simpler names along with default data type attributes.

 

For geospatial data, there is a data type called location.  However, there is a problem with its definition in HCI: it uses deprecated and inadequate data types.  The current implementation of HCI (1.2) allows changing the managed-schema of an index once it is created; however, the problem with this approach is that if the index is exported and imported, any changes to the managed-schema on that index will be lost.

 

For most deployments, it may not be necessary to export and then re-import an index; however, during workflow development it is typical practice to periodically clear out the index completely. The recommendation is therefore to patch the HCI installation until HCI is updated with a more appropriate definition for this data type.

 

The patch procedure is to manually edit 3 files on each node of the HCI installation.  Assuming the HCI installation is rooted at /opt/hci and the HCI version is 1.2.1.139, the managed-schema files are rooted at:

 

/opt/hci/1.2.1.139/data/com.hds.ensemble.plugins.service.adminApp/solr-configs

 

In this folder, the following are the relative paths to the 3 files that need to be updated:

 

basic/managed-schema

default/managed-schema

schemaless/managed-schema

 

 

The changes to basic/managed-schema and default/managed-schema are identical: change the definition of the location field type and add a locations data type for multi-valued fields.  The old and new line(s) follow.

 

OLD LINE:

 

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

NEW LINES:

 

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>

 

The changes to schemaless/managed-schema are a bit more complex and require 3 changes.

 

Change 1:  Delete the following lines.

 

<!-- Type used to index the lat and lon components for the "location" FieldType -->

<dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false" />

 

 

Change 2: Change the following lines.

    OLD:

 

<dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

 

    NEW:

 

<dynamicField name="*_p"  type="location" docValues="true" stored="true"/>

<dynamicField name="*_ps"  type="location" docValues="true" multiValued=”true” stored="true"/>

 

 

Change 3: Change the following lines.

   OLD:

 

<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

 

   NEW:

 

<!-- A specialized field for geospatial search. -->

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>

 

Once all the changes have been made to the files, the HCI software needs to be restarted. On a CentOS system, the command to execute on all nodes is:

 

systemctl restart HCI

 

Wait for HCI to completely restart and for all services to be running.  This can be monitored in the Admin GUI under Monitoring -> Dashboard -> Services.

 

NOTES:

  • If nodes are added to the HCI installation, those new nodes must also be patched; otherwise, unexpected index definitions may occur if the AdminApp service runs on those nodes.

 

  • If this configuration is desired pre-installation of HCI, the procedure can be performed on the installation distribution and then the distribution repackaged.  Then all instances of HCI installations using the patched distribution media will contain the appropriate changes and will survive node additions.

 

Solr Index Schema Definition

 

Once the new managed-schema definition has been updated in the installation, the next step is to create a Solr index schema to accept the geospatial data collected.  To keep things simple, the HCI index created is of the Basic type.  The Basic initial schema creates a base set of HCI fields for indexing.

 

Screen Shot 2018-04-11 at 3.34.36 PM.png

 

Along with the basic fields, the fields that will contain the location information generated by the Geocoding built-in stage must be created as simple strings:

 

Screen Shot 2018-04-11 at 3.09.44 PM.png

 

To hold the geospatial information, the following index fields need to be created:

 

Screen Shot 2018-04-11 at 3.10.27 PM.png

 

The GPS_DateTime field is a simple date index field.

The GPS_Coordinates field is a location field type and will contain the definition configured in the managed-schema definition.

 

To verify that the new definition of the location field type is active on HCI, within the schema view of the index just created, (1) click the Advanced link, (2) click the managed-schema configuration file, and (3) observe the definition of the location field type as shown below.

 

Screen Shot 2018-04-11 at 3.33.21 PM.png

 

 

HCI Query Result Definition

 

The next step is to modify the image index Query Settings to specify how content should be returned by queries. At a minimum, it is necessary to add the GPS_* and loc_* index fields.  The simplest approach is to add all fields that exist in the index schema definition to the query settings, as only a few were added.  This is done on the index Query Settings page under Fields by selecting the “Add All Fields” action.  See the picture below for guidance.

 

Screen Shot 2018-04-11 at 3.32.33 PM.png

 

Optionally, these fields can also be added to the Query Settings Results; this is left as an exercise for the reader.

Workflow Construction


At this point, there is a Data Connection named “Image Data Lake” pointing at an HCP namespace, a pipeline named “Geospatial Image Indexing” that processes the images, and an index named “ImageIdx” that can receive the fields.  The last step is to construct a workflow that can be run to generate the index. Create the workflow by executing the wizard and adding the data connection, pipeline, and output index.  The result when viewing the workflow should look like the following:

 

Screen Shot 2018-04-11 at 3.05.11 PM.png

 

Run the workflow to generate the index.

Performing Geospatial Search

 

Now comes the fun part: performing geospatial queries. As previously mentioned, Solr has a powerful set of capabilities around geospatial points.  The simplest is the range search, where two points are provided and all content within the box they form is returned.  There are also more advanced search capabilities that use Solr functions to find all points within a given distance of a point, a bounding box sized by the distance from a point, and boosting and sorting based on the distance from a point.

 

For a fuller description see the following URL:

 

https://lucene.apache.org/solr/guide/6_6/spatial-search.html

 

WORD OF CAUTION:  There were problems with using some forms of the examples at this link, specifically around the usage of Solr query filters.  It may have been user error, HCI confusion, or errors (or deprecated specifications) in the examples.  Regardless, the following examples are the forms that worked with HCI.

 

The very simplest form of query is to find all images that reside within a rectangular box constructed from two points. This is also called a range search on geospatial data. The range search consists of the lower-left point of the box and the upper-right corner of the box.  An example criterion that can be used in the advanced query in the Search Console is:

 

+GPS_Coordinates:[42.377,-71.526 TO 42.378,-71.524]

 

This finds all images within the rectangle with a lower-left point of 42.377,-71.526 and an upper-right point of 42.378,-71.524.  Below is the example run in the HCI Search Console.

 

Screen Shot 2018-04-11 at 3.14.34 PM.png

 

The next example uses a built-in Solr geospatial query. To specify additional query parameters via the HCI Search Console, it is necessary to enable this functionality. This is done in the index query results main settings, as shown below.

 

Screen Shot 2018-04-11 at 3.31.41 PM.png

 

One such function filters content within a circle defined by a specified point and a distance from that point. This function is called geofilt. The image below illustrates what this represents.

 

GeoCircle.png

 

Below is an example in the HCI Search Console using the geofilt filter for finding all pictures that are less than 1 kilometer from the point 42.39,-71.6.

 

Screen Shot 2018-04-11 at 3.25.01 PM.png

 

The last example uses the built-in Solr geospatial filter bbox.  This filter constructs a box centered on a given point such that the middle of each edge of the box is at the specified distance from that point.

 

bbox.png

To perform this type of query, change the geofilt filter to bbox in the Advanced Parameters field as shown below.

 

Screen Shot 2018-04-11 at 3.26.51 PM.png

Hope you enjoyed this article and gained valuable knowledge on how to utilize HCI to perform geospatial search for content.

Here's a quick start guide to help you integrate with the Content Intelligence REST APIs!

 

All of the features available in the Admin and Search UI applications are also available via both the CLI and REST API.

 

Content Intelligence provides 2 distinct REST APIs:

  • Admin - Used to configure and manage a system
  • Search - Used to query across search engine indexes

 

 

REST UI

 

Both the Admin and Search applications in Content Intelligence provide a Swagger REST API UI tool for their respective APIs.

 

The admin REST API UI can be found in any deployed Content Intelligence system at the following URL, on the default admin port:

https://<host-ip>:8000/doc/api/

 

The search REST API UI can be found here, on the default search port:

https://<host-ip>:8888/doc/api/

 

searchRESTUI.jpg

 

This REST UI tool documents and demonstrates the entire REST API, allowing you to exercise real requests/responses, list and manage system state, display curl examples, etc. This tool has proved invaluable in accelerating product integrations with Content Intelligence, while demonstrating its capabilities.

 

Expand an API to see its request/response model objects:

RESTUI1.PNG

 

 

Click "Try it out!" to run a request against the live service to see it's behavior (and get a curl example):

RESTUI2.PNG

 

 

Authentication

 

How does Content Intelligence handle authentication?

 

We use the OAuth framework to generate access tokens. The process works as follows:

 

1. Request an access token

 

Once you have a user account, you need to request an authentication token from the system. To do this, you send an HTTP POST request to the /auth/oauth endpoint on the application you're using.

 

Here's an example using the cURL command-line tool:

curl -ik -X POST https://<system-hostname>:8000/auth/oauth/ \

-d grant_type=password \

-d username=<your-username> \

-d password=<your-password> \

-d scope=* \

-d client_secret=hci-client \

-d client_id=hci-client \

-d realm=<security-realm-name-for-an-identity-provider-added-to-HCI>

 

In response to this request, you receive a JSON response body containing an access_token field. The value for this field is your token. For example:

{ "access_token" : "eyJr287bjle..." }

 

2.  Submit your access token with each REST API call

 

You need to specify your access token as part of all REST API requests that you make. You do this by submitting an Authorization header along with your request. Here's an example that uses cURL to list the instances in the system:

curl -X GET --header "Accept:application/json" https://<system-hostname>:<admin-app-port>/api/admin/instances --header "Authorization: Bearer <your-access-token-here>"

 

Notes:

 

• This same mechanism works with local admin users and remote directory servers (e.g. "identity providers"). To get a list of security realms available in the system, send an HTTP GET request to the /setup endpoint. This can be used to let users select the type of authentication credentials they will be providing. For example, to do this with cURL to get the list of realms:

curl -X GET --header 'Accept: application/json' 'https://<hostname>:<admin-app-port>/api/admin/setup'

• To get an access token for the local admin user account, you can omit the realm option for the request, or specify a realm value of "Local".

• If a token expires (resulting in a 401 Unauthorized error), you may need to generate a new one the same way as before. This expiration duration is configurable in the Admin App.
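
For readers integrating from Java, here is a minimal sketch (Java 11+ java.net.http) of the same two steps as the curl examples above: request a token, then call the Admin API with it. The hostname, credentials, and realm are placeholders, and the sketch assumes the HCI TLS certificate is trusted by the JVM; for a self-signed certificate you would need to import it into the trust store or supply a custom SSLContext.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HciRestQuickStart {
    public static void main(String[] args) throws Exception {
        String host = "hci.example.com";   // placeholder hostname
        HttpClient client = HttpClient.newHttpClient();

        // 1. Request an OAuth access token (same form fields as the curl example above).
        String form = "grant_type=password&username=admin&password=changeme"
                + "&scope=*&client_secret=hci-client&client_id=hci-client&realm=Local";
        HttpRequest tokenRequest = HttpRequest.newBuilder()
                .uri(URI.create("https://" + host + ":8000/auth/oauth/"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        String body = client.send(tokenRequest, HttpResponse.BodyHandlers.ofString()).body();
        // Crude extraction of the access_token value; use a JSON library in real code.
        String token = body.replaceAll("(?s).*\"access_token\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2. Submit the token with each REST API call, e.g. list the instances in the system.
        HttpRequest listInstances = HttpRequest.newBuilder()
                .uri(URI.create("https://" + host + ":8000/api/admin/instances"))
                .header("Accept", "application/json")
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        System.out.println(client.send(listInstances, HttpResponse.BodyHandlers.ofString()).body());
    }
}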

 

 

Workflow Admin

 

List all instances in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/instances'

 

List all workflows in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/workflows'

 

Run a specific workflow:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/workflows/1f7a6156-4a64-4ac0-b2e8-d73f691dea73/task'

 

Simple Search Queries

 

Querying Content Intelligence search indexes generally involves:

  • List the indexes in the system.
    • Index name is a required input for querying indexes (federated or otherwise).
  • Submit a query request, and obtain a response.
    • There are a number of queries users can perform, the most basic of which is a simple query string. When there are many results, the API also supports paging and limits on responses.

 

List all indexes in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/indexes'

 

Submit a simple query request:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{

  "indexName": "Enron",

  "queryString": "*:*",

  "offset": 0,

  "itemsToReturn": 1

}' 'https://cluster110f-vm3:8888/api/search/query'

 

Query results returned:

{

  "indexName": "Enron",

  "results": [

    {

      "metadata": {

        "HCI_snippet": [

          "Rhonda,\n\nYou need to check with Genia as I have never handled the physical power agreement matters.\n\nSusan \n\n -----Original Message-----\nFrom: \tDenton, Rhonda L.  \nSent:\tTuesday, January 15, 2002 2:17 PM\nTo:\tBailey, Susan\nCc:\tHansen, Leslie\nSubject:\tSouthern Company Netting\n\nHere's Southern.  I never received a copy of the Virginia Electric Master Netting.  We do have netting within the EEI.\n << File: 96096123.pdf >> \n\n"

        ],

        "Content_Type": [

          "message/rfc822"

        ],

        "HCI_dataSourceUuid": [

          "f1c05be1-5947-41e1-a9f1-03a98f0fa036"

        ],

        "HCI_id": [

          "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

        ],

        "HCI_doc_version": [

          "2015-07-02T09:06:02-0400"

        ],

        "HCI_displayName": [

          "RE: Southern Company Netting"

        ],

        "HCI_URI": [

          "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

        ],

        "HCI_dataSourceName": [

          "HCP Enron"

        ]

      },

      "relevance": 1

      "id": "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25.",

      "title": "RE: Southern Company Netting",

      "link": "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

    }

  ],

  "facets": [],

  "hitCount": 478

}

 

The latest release, HCI 1.2, introduces new connector plugins that connect to external databases such as MySQL and PostgreSQL. With HCI 1.2, we have the ability to create custom JDBC connectors for different databases using a base template, and the whole development process is simplified by the HCI plugin SDK.

 

Before developing the JDBC connector, make sure the following dependencies are available:

 

1) The latest HCI Plugin-sdk.jar

2) jdbc-1.2.0.jar -- this jar is located in the HCI installation directory.

    Example : /opt/hci/plugins/jdbc-1.2.0.jar

3) Database Specific driver jar.

    Example: ojdbc7.jar, sqljdbc.jar

4) Java JDK 1.8 or higher

5) A build tool such as Gradle or Maven to package the plugin jar with the required dependencies.

 

 

Create a new project and extend the BaseJdbcConnectorPlugin class provided in jdbc-1.2.0.jar.

Add all the unimplemented methods and the methods to override in your custom connector plugin.

The main methods to override are:

1) getJDBCDriver

2) getBatchSizeLimitPredicate

The "getJDBCDriver" method should provide the database specific driver. For example , to connect to an Oracle database use the "oracle.jdbc.driver.OracleDriver" driver.

 

The "getBatchSizeLimitPredicate" method should return the predicate for providing the default batch size to fetch when we test the data connection.

 

Example:

"OFFSET 0 ROWS FETCH NEXT " + batchSize + " ROWS ONLY";

In the above example, "batchSize" is determined by the Query Batch Size property value entered by the user during data connection creation. This is required to prevent the plugin from fetching all the rows in the table, and is used as a safeguard against retrieving millions of rows from a database table, especially during a test.
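
Putting these pieces together, the skeleton of a custom connector might look like the following sketch. It assumes the plugin SDK and jdbc-1.2.0.jar are on the classpath; the package name in the import and the exact method signatures are assumptions based on the description above, so check the SDK javadoc and the bundled examples for the real ones.

import com.hds.ensemble.plugins.jdbc.BaseJdbcConnectorPlugin;   // hypothetical package; check the SDK

public class SqlServerConnectorPlugin extends BaseJdbcConnectorPlugin {

    // Database-specific JDBC driver class that the base plugin loads.
    @Override
    protected String getJDBCDriver() {
        return "com.microsoft.sqlserver.jdbc.SQLServerDriver";
    }

    // Predicate appended to the test query so only one batch of rows is fetched.
    @Override
    protected String getBatchSizeLimitPredicate(int batchSize) {
        return "OFFSET 0 ROWS FETCH NEXT " + batchSize + " ROWS ONLY";
    }

    // The remaining unimplemented methods of the base class must also be added;
    // see the examples shipped with the HCI plugin SDK.
}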

 

 

Structure of the plugin:

 

 

The whole process of connecting to a database and executing SQL statements is abstracted by the BaseJdbcConnectorPlugin, which simplifies development so you don't have to worry about managing database connections.

 

Use a build tool such as Gradle or Maven to include/exclude dependencies while packaging the plugin. Make sure the plugin.json manifest file is present in the META-INF directory of the plugin.

 

Sample Plugin.json

 

I was asked today about the role of connector plugins in Content Intelligence, so I thought I'd pass along the details.

 

Content Intelligence connector plugins can (today) operate in one of two modes:

  • List-based (CRAWL_LIST)
  • Change-based  (GET_CHANGES)

 

CRAWL_LIST mode

 

In this mode, users specify a “starting” Document by configuring the data source. The plugin's role is to perform a container listing from starting points that are requested by the HCI Crawler. HCI provides all of the bookkeeping, and decides which starting points to ask for.

 

Example:  If a content structure is as follows:

/folder1

/folder1/file1

/folder1/subfolder1

/folder2

/folder2/file2

/folder2/subfolder1

/folder2/subfolder1/file3

/file4

 

First, a user configures a starting point of “/” in the data source configuration.

 

When the workflow is executed, HCI will call the “root()” method on the plugin, which should return a Document for the “/” starting point. This is typically a "container" Document, which works like a directory.

 

HCI will call “list()” with that starting document of “/”, which should return all Documents under "/":

  • /folder1
  • /folder2
  • /file4

HCI will then call “list()” with a starting document of “/folder1”, which should return the following Documents:

  • /folder1/file1
  • /folder1/subfolder1

 

The process continues until all objects are crawled. HCI keeps track of the documents that have already been visited, and will not crawl the same object again unless directed later.

 

In continuous mode, the entire process automatically starts again from the root container. In this case, HCI will send only the Documents that changed since the previous pass to the processing pipelines.
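
The following self-contained Java sketch simulates the call order just described. It uses plain collections rather than the real SDK types, purely to illustrate how the crawler walks the containers returned by root() and list():

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class CrawlListSimulation {
    // Containers and their direct children, mirroring the example structure above.
    static final Map<String, List<String>> TREE = Map.of(
            "/", List.of("/folder1", "/folder2", "/file4"),
            "/folder1", List.of("/folder1/file1", "/folder1/subfolder1"),
            "/folder1/subfolder1", List.of(),
            "/folder2", List.of("/folder2/file2", "/folder2/subfolder1"),
            "/folder2/subfolder1", List.of("/folder2/subfolder1/file3"));

    static boolean isContainer(String path) {
        return TREE.containsKey(path);   // stands in for DocumentBuilder.setIsContainer
    }

    public static void main(String[] args) {
        Deque<String> toList = new ArrayDeque<>();
        toList.add("/");                                  // the Document returned by root()
        while (!toList.isEmpty()) {
            String container = toList.poll();
            System.out.println("list(" + container + ")");
            for (String child : TREE.get(container)) {
                System.out.println("  -> " + child);
                if (isContainer(child)) {
                    toList.add(child);                    // the crawler will list it later
                }
            }
        }
    }
}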

 

 

GET_CHANGES mode

 

In this change-based mode, the connector plugin can collect and return Documents in any order or frequency it would like.

 

HCI calls the “getChanges()” method on plugins in this mode to return Documents. The plugin can return a plugin-defined "token" with each response. The token is opaque and is only interpreted by the plugin. HCI stores this token, and will provide the token returned in the last getChanges() call to the next call. Plugins decide what to do (if anything) with the provided token. For example, if the getChanges() call executes a time-based query to return Documents, the token can include a timestamp of the last discovered Document. On the next getChanges() call, HCI will provide this token to the plugin, which can use it to build the next query.

 

It’s completely up to the plugin to determine what to return in getChanges(), such as a batch of Documents or a single Document. This method can return no changes until the connector discovers a new Document to return.
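
As a purely illustrative, self-contained Java sketch (again, not the real SDK types), the token hand-off could look like this: each getChanges() call interprets the token it defined on the previous call, here a "changed since" timestamp, and returns a new token for HCI to store.

import java.util.ArrayList;
import java.util.List;

public class GetChangesSketch {

    static class Doc {
        final String uri;
        final long modified;
        Doc(String uri, long modified) { this.uri = uri; this.modified = modified; }
    }

    // Stand-in for the repository being watched by the connector.
    static final List<Doc> REPOSITORY = List.of(
            new Doc("/a.txt", 100), new Doc("/b.txt", 250), new Doc("/c.txt", 400));

    // Returns the new token; "returns" the changed documents by printing them here.
    static String getChanges(String token) {
        long since = (token == null) ? 0L : Long.parseLong(token);   // interpret our own token
        long newest = since;
        List<String> changed = new ArrayList<>();
        for (Doc d : REPOSITORY) {
            if (d.modified > since) {
                changed.add(d.uri);
                newest = Math.max(newest, d.modified);
            }
        }
        System.out.println("changed since " + since + ": " + changed);
        return String.valueOf(newest);   // opaque to HCI; handed back on the next call
    }

    public static void main(String[] args) {
        String token = getChanges(null);   // first pass: no token stored yet
        getChanges(token);                 // next pass: HCI supplies the stored token
    }
}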

 

 

The Role of Connector Plugins

 

For details, I've included Alan Bryant's excellent overview of the role of connector plugins here:

 

"A quick overview:

 

There are currently two modes that ConnectorPlugins can use, either list-based or change-based. Since you are working with a filesystem, you probably want list-based. You should implement getMode() to return ConnectorMode.CRAWL_LIST. You can also then implement getChanges() to throw some exception since you won't be needing it.

 

The starting point is the getDefaultConfig() method. This should define any configuration that the user should specify. In this case you should have them specify the starting path that the connector should provide access to.

 

Once the user has specified the config, build() will be called. You should construct a new instance of your plugin here with the provided config and the callback. See the examples.

 

startSession() will then be called. You should put any state associated with your plugin on this session... anything that's expensive to create or clean up. There is no guarantee that your plugin instance will be kept around between calls. The session will be cached where possible.

 

To actually crawl the datasource, we start with the root() method. This should return a Document representing the root of your datasource. Generally this should be a container (see DocumentBuilder.setIsContainer).

 

After that, the Crawler will call list() for a particular container. List should return an Iterator of Documents. Each Document generally represents something that has content to be processed (see DocumentBuilder.setHasContent) or that is a container of other Documents, like root(). These containers generally correspond to real-world objects, like Directories, but can really be any grouping you want.

 

If you are returning large numbers of Documents, look at StreamingDocumentIterator so you don't cause OutOfMemory issues by holding all the Documents in memory at once.

 

As Documents are discovered, they will be processed in the workflow. During this time, stages may ask for streams from the Document. This is implemented by calling openNamedStream(). openNamedStream should use metadata that was added in list() to be able to open the streams. So, list() just adds stream metadata (see DocumentBuilder.setStreamMetadata) and it's used later when we call openNamedStream.

 

Other things you should do:

  • get() should operate like openNamedStream being passed the StandardFields.CONTENT.
  • getMetadata returns an up-to-date version of a Document based on just a URI. It is very important that this returns the same types of metadata, in the same format, as list(). getMetadata() is used for pipeline and workflow tests. If the data is different, then the test will be useless.
  • test() should be implemented to ensure that the config is correct... basically, make sure the configured directory exists. In plugins that do network access, this can trigger SSL certificate processes."

 

 

Thanks,

-Ben

During some data assessments on CIFS filesystems, I have observed some strange behavior in how HCI deals with system metadata timestamps. System metadata timestamps are the usual timestamps we all know from CIFS and NFS when looking in Windows Explorer or doing an ls -la. This is NOT the same as the additional timestamps that can be stored with the object itself, created and maintained by the creating application, such as PDF writers, MS Office apps, and DICOM applications.

Here is some more information, in addition to what has already been discussed in other threads, to hopefully give a better understanding from a single point of view.

 

I hope this gives more insight, because it is very important to understand this behavior when it comes to file age profiling.

 

 

 

Also important to know BEFORE starting such an assessment is the following:

There has been a lot of talk recently about how to use the HCP connector, specifically how/when to use the actions Output File, Write Annotation and Write File. Following is an example of how I used these actions in a series of simple workflows.

 

I started out with a namespace (Cars) that I knew had files in it with custom metadata (CM) already attached. I actually used this namespace years ago to do some HCP MQE demos with. What I didn’t realize was that it also had some other files that didn’t have CM.

 

Step 1: Make a copy of the Cars namespace to do my work on. I accomplished this with a simple workflow that I keep around (Figure 1).

 

Namespace_Copy_Workflow.png

Figure 1

The output section of the workflow uses the Output File action because I want to copy the entire object along with its CM (Figure 2).

 

outputFile_Example.png

Figure 2

 

Step 2: Get rid of the files that didn’t have CM. I wanted to preserve the files so the workflow copies them to another namespace (‘Split’) and then deletes them from the source. I accomplished this with a simple pipeline. The pipeline has an Output File action for the copy and a Delete action. Both are tucked inside an IF block that checks for the presence of the HCP_customMetadata field. The field is created by the HCP Connector and is set to true when the file has one or more metadata annotations (Figure 3).

 

Copy_if_no_CM_and_Delete_from_Source.pngFigure 3

 

The workflow for the copy is in Figure 4. Note that it has no Output section since the actions are performed in the processing pipeline.

 

Copy_and_Delete_Workflow.png

Figure 4

 

Step 3: To demonstrate adding a new annotation, I first needed to create the annotation. The easiest way to do this is with a simple pipeline that has two stages. The first creates a couple of fields using a Tagging stage (Figure 5).

 

writeAnnotation-2-tagging-stage.pngFigure 5

 

The second creates an XML stream using the fields in an XML Formatter stage. The stream will be used as the annotation. The CM is written to the object in the Output section of the workflow using the Write Annotation action (Figure 6). Note the fields and streams in the Write Annotation configuration.

Slide1.JPGFigure 6

 

The workflow that pulls all this together is in Figure 7.

Add_CM_Workflow.pngFigure 7

Do you want more insight into the state of your workflows? Do the workflow metrics not update as frequently as you would like? Are you interested in finding out the speed of your data connector? Confused about when to use which pipeline execution mode? Would HCI deployed in a virtual environment give the same performance as a physical deployment?

 

The attached white paper addresses these questions and highlights the improvements and optimizations introduced in version 1.2 to improve the overall performance of Hitachi Content Intelligence.

 

The content of the white paper:

  • Highlights the workflow performance improvements and optimizations introduced in version 1.2 and compares them to previous releases.
  • Summarizes the improved performance results of list-based data connectors.
  • Provides the methodology used to determine pure crawler performance for different data connections.
  • Demonstrates that the reporting of document failures to the metrics service has been drastically improved.
  • Recommends when to use the Preprocessing execution mode over Workflow-Agent.
  • Compares the performance of a physical HCI deployment with that of a similarly configured HCI deployed in a virtual environment.

 

 

Questions/Feedback? Please use the comments section below.

 

-Nitesh

Before updating to 1.2.1, please view the following question/answer addressing a known issue if you have updated from a previous version to 1.2.0 and more than one week has elapsed:

 

Updating from 1.2.0: Failed to initialize UpdateManager

 

Thanks,

-Jared

Jon Chinitz

Making an HCI OVF Bigger

Posted by Jon Chinitz Employee Oct 21, 2017

Some of you have asked about increasing the size of the OVF that ships with Hitachi Content Intelligence. The default disk volume today is 50GB. The following quick sheet of instructions will show you how to increase the disk volume.

 

Step 1: shutdown the node

Step 2: using the vSphere console (or any other method you feel comfortable with) navigate to the node's settings and change the size of "Hard Disk 1" (I chose to increase it from 50GB to 100GB):

 

Change_VMDK_Size.png

Step 3: save the edits and restart the node.

Step 4: ssh into the node and display the mounted filesystems. The filesystem we are after is the root filesystem on /dev/sda3:

 

df.png

Step 5: run the fdisk command specifying the disk device /dev/sda:

 

fdisk.png

Step 6: while in fdisk you are going to (d)elete partition 3 and create a (n)ew partition 3 with the default starting sector and size offered up by fdisk. The starting sector is the same one /dev/sda3 had before; only the size changes, to the number of sectors left to the end of the disk. The last thing to do is to (w)rite the new partition table back to the disk (you can safely ignore the error).

Step 7: reboot the node.

Step 8: ssh back into the node and run xfs_growfs. Be sure to specify the partition /dev/sda3:

 

xfs_growfs.png

As you can see, the root filesystem has been resized to occupy the new space.

Hi,

 

This stage is an attempt to extend the HDI indexing capabilities of Content Intelligence.

It transforms the ACL entries found in the HCP custom metadata left by HDI. Whenever possible, the plugin attempts to create a separate metadata field for each ACL entry and transform the permissions and user/group IDs to readable formats, as seen in the following example:

 

ResultOverview.png

As an optional step, the plugin can automatically map user/group SIDs to their respective Active Directory names, by providing the parameters shown in the following example:

 

ConfigOverview.png

Alternatively, you can search for a specific user/group by first obtaining its SID in Active Directory and then using that SID for the query.

 

The plugin cannot transform HDI RIDs in its current version.