Ben Isherwood

HCI: How to analyze HDI data in HCP

Blog Post created by Ben Isherwood on Apr 14, 2017

Many people who talk to us about Hitachi Content Intelligence (HCI) are curious about why we focus so intently on building out tools for content processing and data analysis. Isn't this a search tool?

It turns out that great search experiences are directly driven by the quality of the processing performed on the content. This makes processing, normalizing, and categorizing content the most important aspect of any (great) search technology. It also turns out that most search tools are surprisingly deficient in these areas! This has been a great opportunity for the HCI team to work on filling these gaps.

 

A while ago, I was asked whether HCI could be used to process and search Hitachi Data Ingestor (HDI) data that was stored (obfuscated) in Hitachi Content Platform (HCP) namespaces. Here's what I did to find out the answer...

 

Step 1: Add an HCP data source

I started out testing HCI against a namespace in HCP that was backing an HDI file system.

The goal was to see how searching the core would work with HDI and to determine how difficult this task would be.

 

First, we connected HCI to the HCP namespace containing the HDI data:

Blog1_1.jpg

Configuring a data connection is all that's needed to begin processing data in HCI.

 

Step 2: Auto-generate a Content Class

HDI file paths in the HCP namespace are obfuscated, so it's impossible to search the contents of these namespaces directly.

However, HDI stores both the full file content and HDI custom metadata in HCP, so HCI can use that metadata to recover searchable details about each file.

Using example custom metadata from one of the files in the namespace, HCI was used to auto-generate an "HDI custom metadata" content class. This content class could be used to pull the metadata from the XML file into the pipeline engine for further processing.


HDI auto-generated content class:

Blog1_2.jpg

 

Step 3: Create and test an HDI processing pipeline

This step required some effort... and resulted in the addition of some new built-in plugins.

After cloning the default pipeline, I added a content class extraction stage to the pipeline for reading the "default" custom metadata annotation: “HCP_customMetadata_default”. This enabled me to pull the XML into the system, extract all of the fields, and present them to the processing pipeline.
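To make the extraction step more concrete, here is a minimal Python sketch of flattening a custom metadata XML document into named fields. It is not HCI's content class implementation: the XML layout, the element names, and the `HDI_` prefix convention are all assumptions made up for illustration, and the real HDI annotation may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom metadata document; the real HDI annotation layout may differ.
SAMPLE_XML = """
<metadata>
  <file_path>%2Fshare%2Freports%2FQ1%20summary.docx</file_path>
  <mtime>1492128000</mtime>
</metadata>
"""


def extract_fields(xml_text, prefix="HDI_"):
    """Flatten each leaf element of the XML into a prefixed field name -> text value."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for element in root.iter():
        # Keep only leaf elements (no children) that actually carry text.
        if len(element) == 0 and element.text and element.text.strip():
            fields[prefix + element.tag] = element.text.strip()
    return fields


fields = extract_fields(SAMPLE_XML)
```

With the sample above, `fields` holds one entry per leaf element, which is roughly the shape of data the later pipeline stages work on.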

After browsing for a file on the data source and running a pipeline "test" operation against it, I quickly found that the file paths in the "default" annotation were URL-encoded, which made searching these fields difficult. I built a URL Encoder/Decoder stage to decode them, uploaded the plugin, and started using it in the pipeline immediately. Now these fields were clearly visible!

URL decode any encoded HCI metadata fields:

Blog1_3.jpg
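For readers who want to see what URL decoding does to a path field, here is a small Python sketch using the standard library. It is only an illustration of the transformation, not the code of the URL Encoder/Decoder stage; the sample path value and the `decode_fields` helper are hypothetical.

```python
from urllib.parse import unquote


def decode_fields(document, field_names):
    """Return a copy of the document with the named string fields URL-decoded."""
    decoded = dict(document)
    for name in field_names:
        value = decoded.get(name)
        if isinstance(value, str):
            decoded[name] = unquote(value)
    return decoded


# Hypothetical encoded path, as it might appear in the custom metadata.
doc = {"HDI_file_path": "%2Fshare%2Freports%2FQ1%20summary.docx"}
decoded_doc = decode_fields(doc, ["HDI_file_path"])
```

After decoding, the field reads as an ordinary path with slashes and spaces, which is what makes it usable as a search target and a result title.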


I also noticed many UNIX timestamps among the metadata values. The Date Conversion stage didn't (yet) support normalizing UNIX timestamps into standard date fields, so I added that support to the stage, which resolved the issue.

 

Normalizing HCI metadata date fields:

Blog1_4.jpg
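As a rough illustration of the normalization involved, the Python sketch below converts a UNIX timestamp into an ISO 8601 UTC string. It assumes the timestamps are expressed in seconds (not milliseconds) and that UTC is the desired target; neither detail is confirmed above, and this is not the stage's actual implementation.

```python
from datetime import datetime, timezone


def normalize_unix_timestamp(value):
    """Convert a UNIX timestamp in seconds (int or numeric string) to ISO 8601 UTC."""
    seconds = int(value)  # assumption: whole seconds since 1970-01-01 UTC
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


normalized = normalize_unix_timestamp("1492128000")
```

Here `normalized` is "2017-04-14T00:00:00Z", a form that date range queries and sorting can work with directly.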

 

In some documents, the "HDI_file_path" metadata field wasn’t available, so I also configured the pipeline to use the "HCI_DisplayName" in place of the HDI_file_path metadata for those specific documents.
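The fallback logic is simple enough to sketch in a few lines of Python. The field names come from the pipeline described above, while the `fill_missing_file_path` helper and the sample documents are hypothetical; in HCI this was done through pipeline configuration rather than code.

```python
def fill_missing_file_path(document):
    """Return a copy of the document, using HCI_DisplayName when HDI_file_path is absent."""
    filled = dict(document)
    if not filled.get("HDI_file_path") and filled.get("HCI_DisplayName"):
        filled["HDI_file_path"] = filled["HCI_DisplayName"]
    return filled


# Hypothetical documents: one with HDI metadata, one without.
with_path = fill_missing_file_path(
    {"HDI_file_path": "/share/a.txt", "HCI_DisplayName": "0001.dat"}
)
without_path = fill_missing_file_path({"HCI_DisplayName": "notes.txt"})
```

Documents that already carry a path are left unchanged, so the fallback only affects the specific documents that lack the HDI metadata.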

 

Step 4: Click to build an optimized index

After running the workflow, HCI automatically discovered all sorts of metadata from the files in the namespace.

Using the "Workflow > Task > Discoveries" UI, I created and configured an index from the field recommendations with a single click!


Auto-generated index schema:

Blog1_5.jpg

 

Step 5: Customize the search experience

HCI allows you to quickly customize the search experience for specific sets of users.

I customized a results display configuration in the index query setting so that:

  • The value of the HDI file path field is used as the title of each result
  • The link takes you directly to the object in the HCP system
  • "Snippet" text containing raw extracted text from the document appears below each result
  • An expandable document metadata panel, named "HDI Metadata", accompanies each result


Customizing search result display:

Blog1_6.jpg


Built-in default HCI autocomplete support uses phrases within the content itself:

Blog1_7.jpg


In the index "Public" query setting, I enabled support for range queries on time fields, file names, etc.:

Blog1_8.jpg


You can also expose detailed document metadata within each search result:

Blog1_9.jpg


Minutes later, we had full-featured HDI file system search, just by crawling the HCP namespace! And we had implemented two new features in the process: a new "URL Encoder/Decoder" stage and UNIX timestamp support in the "Date Conversion" stage.

 

Hopefully you've learned how HCI content processing technologies can accelerate the development of full-featured search and categorization.

 

Thanks for reading!

-Ben
