Ben Isherwood

HCI: Best Practices Checklist

Blog Post created by Ben Isherwood on Mar 28, 2017

Giving users powerful, intelligent search tools can change and improve the way that everyone works.


Hitachi Content Intelligence provides a great toolkit for building and optimizing such solutions - but no one starts out as an expert in content processing and search engine index schema management. Don't worry, we're here to help!


Here is a checklist of HCI best practices to help you hit the ground running:


1.  Connect your source repositories


    • Configure your data source(s) in the Workflows > Data Connections UI

    • Run the “test” operation on each to ensure connectivity

    • Resolve any certificate issues up front by adding the certificates to the trust store when prompted


2. Build and test pipeline(s) with example document(s) loaded directly from the data source(s)


    • Begin with the default pipeline in the "Workflows > Pipelines" UI  (or clone it!)
    • In the pipeline test UI, click "Select Document" to browse your data connections for documents to test
    • Verify changes made to documents as they are processed by each stage in the pipeline (click "View Results" on each stage to see the changes each stage performed).
    • View the metadata fields available on each document by clicking the "Discovered Fields" tab.
    • Evolve the pipeline to customize desired processing behaviors
    • (Optional)  Define and use Content Classes
      • Extract field data from XML, JSON, and PATTERN matching regular expressions
      • Blend this data with other metadata fields to improve search quality
      • Re-use these definitions across systems, and enable testing before going into production
    • Leverage the community
      • Upload any custom plugins/packages you want to use via "System Configuration > Plugins" or "Packages".
      • Import workflow bundles containing pipelines, workflows, indexes, and content classes at "Workflows > Import/Export", or export your own workflow bundles to share!
      • Use the context-specific online help by pressing "(?) > Help", or access the HCI community any time using the UI link at "(?) > Community"
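
To make the Content Class idea above more concrete, here is a minimal Python sketch of the kind of extraction involved: a regular expression pulls values out of free text, and selected JSON values are blended in as extra metadata fields. This is not HCI code or the HCI API. The invoice pattern, the field names, and the `extract_fields` helper are all hypothetical and only illustrate the technique.

```python
import json
import re

# Hypothetical pattern: matches identifiers such as "INV-2017-00042".
INVOICE_PATTERN = re.compile(r"\bINV-(\d{4})-(\d+)\b")


def extract_fields(text, json_payload):
    """Return a flat dict of fields pulled from free text and a JSON string."""
    fields = {}

    # PATTERN-style extraction: capture groups become separate fields.
    match = INVOICE_PATTERN.search(text)
    if match:
        fields["invoice_year"] = match.group(1)
        fields["invoice_number"] = match.group(2)

    # JSON extraction: copy only top-level scalar values, prefixed to
    # avoid clashing with fields that already exist on the document.
    payload = json.loads(json_payload)
    for key, value in payload.items():
        if isinstance(value, (str, int, float, bool)):
            fields[f"json_{key}"] = value

    return fields


if __name__ == "__main__":
    print(extract_fields("Ref INV-2017-00042 attached",
                         '{"customer": "Acme", "tags": ["a"]}'))
```

Testing a pattern like this against real sample documents before production is the same habit the pipeline test UI encourages.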


3. Run (and test) a workflow


    • Configure a data source with a representative subset of your data for exploration and issue identification
    • Add the default pipeline to see what processing is possible, or use your own pipelines for custom processing
    • (Optional)  Create and add an index as a workflow output
      • Using a "schemaless" index can help to identify fields and any document failures that could occur during indexing
      • Test end user queries in the index “query” tab and the search console
    • Run the workflow (with or without outputs) to generate a workflow task report
      • Explore the field discoveries, their counts and values
        • This information can be used to build an optimized production index schema with a click!
        • Pipelines should combine fields containing the same information together into a single field (use the “Mapping” stage)
        • Pipelines should drop unnecessary fields and documents to minimize index size & maximize performance (use the “Filter” and “Drop” stages)
        • Normalize date fields using a "Date Conversion" stage and adding the field name to the config
    • Document failures can be identified here and addressed early through data cleansing or special processing
    • Consider “exploding” any container documents found such as PST, ZIP, email, etc. into individual files on a data source for direct query and linking.
      • Otherwise, search results containing hits of files within a ZIP would simply link back to the ZIP file (not to the individual files contained inside the ZIP) for download.
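
The field clean-up advice above (combine equivalent fields, drop unnecessary ones, normalize dates) can be sketched outside of HCI in a few lines of Python. This is only an illustration of the logic, not how the “Mapping”, “Filter”, or "Date Conversion" stages are implemented. The field names, the list of date formats, and the choice to treat dates as UTC are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical field names, for illustration only.
ALIASES = {"Author": "author", "creator": "author", "dc:creator": "author"}
DROP = {"X-Parsed-By", "thumbnail"}
DATE_FIELDS = {"created"}
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%m/%d/%Y")


def normalize_date(value):
    """Return an ISO 8601 UTC string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Assumption: source dates carry no zone and are treated as UTC.
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


def clean_document(doc):
    """Map aliases to one field, drop unwanted fields, normalize dates."""
    cleaned = {}
    for name, value in doc.items():
        if name in DROP:
            continue
        target = ALIASES.get(name, name)
        if target in DATE_FIELDS:
            # Keep the original value if it cannot be parsed.
            value = normalize_date(value) or value
        # The first value seen for a combined field wins.
        cleaned.setdefault(target, value)
    return cleaned
```

For example, a document with both "Author" and "creator" ends up with a single "author" field, and a "created" value of 03/28/2017 becomes 2017-03-28T00:00:00Z.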


4. Optimize Pipeline Performance


    • Check the "Workflow > Task > Performance" report to identify any expensive processing stages, and take steps to reduce this time:
      • Remove any unnecessary stages from the pipeline
      • Reduce stage processing time by adding conditional logic so that expensive files that produce no meaningful value are not processed
    • Add the known file extensions found in your data set to the MIME type detection stage configuration. This improves performance by avoiding an expensive “deep” detection for those files
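
The reasoning behind the file extension tip can be shown with a small Python sketch: look the extension up first, and only read file content when the extension is unknown. This is not the HCI MIME type detection stage. The extension table, the `deep_detect` stand-in, and the `read_bytes` callback are assumptions for illustration.

```python
import os

# Hypothetical table of extensions known to appear in the data set.
KNOWN_EXTENSIONS = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".zip": "application/zip",
}


def deep_detect(data):
    """Stand-in for expensive content-based detection (checks magic bytes)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"PK\x03\x04"):
        return "application/zip"
    return "application/octet-stream"


def detect_mime(filename, read_bytes):
    """Use the cheap extension lookup first; read content only if needed."""
    extension = os.path.splitext(filename)[1].lower()
    known = KNOWN_EXTENSIONS.get(extension)
    if known is not None:
        return known  # the file content is never read on this path
    return deep_detect(read_bytes())
```

Passing a callback rather than the bytes themselves means the content is only fetched when the cheap lookup misses, which is where the saving comes from.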


5.  Start with a new “Basic” index for production


    • The default "Schemaless" index template ensures that all fields discovered in the pipeline will get added to the index automatically.
      • This is great for getting started, but can result in bloated, inefficient indexes in production.
      • Dynamic field type "guessing" can sometimes guess wrong, which may result in document indexing failures
    • To avoid these issues completely, create a new index with the “Basic” template in "Workflows > Index Collections > Create Index"
      • This template defines only minimal required fields and eliminates dynamic fields for a controlled, predictable schema
    • Use the "Workflow > Task > Discoveries" tab to select and add specific fields (with optimized type recommendations) to your new production index with a single click. This ensures that you only index the fields you need, and keeps your index size as small as possible.
      • You can also use the pipeline/workflow test “Discovered Fields” UI to do the same


6.  Fine tune your index schema


    • Locate your index in "Workflows > Index Collections".
    • Remove any fields from the schema that will not be queried
    • Pay close attention to field attributes marked “HIGH” impact and evaluate if they are necessary
    • Eliminate any variable indexing configurations (e.g. dynamic fields) if not required
    • Back up all your system configurations (pipelines, data sources, index schemas, etc.) to a “package” for safekeeping.
    • To scale large indexes, split documents across multiple collections (but not hundreds of them) and use federated queries to search them all, rather than storing every document in a single collection. Split on whatever is logical for the use case: time, region, namespace, department, and so on.
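
As a rough illustration of a time-based split, the sketch below names one collection per year and lists the collections a federated query would need to cover for a given range. The naming scheme, the `docs` prefix, and both helper functions are hypothetical. How collections are created and queried in HCI is not shown here.

```python
def collection_for(created_iso, prefix="docs"):
    """Pick a collection name from the year of an ISO 8601 date string."""
    return f"{prefix}_{created_iso[:4]}"


def collections_for_range(first_year, last_year, prefix="docs"):
    """List the collections a federated query must cover for a year range."""
    return [f"{prefix}_{year}" for year in range(first_year, last_year + 1)]


if __name__ == "__main__":
    print(collection_for("2017-03-28T00:00:00Z"))
    print(collections_for_range(2015, 2017))
```

A yearly split keeps the number of collections small, in line with the advice to avoid hundreds of them.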


7.  Verify your enhanced search experience


    • Use the "Open Search App" button in "Workflows > Index Collections" to access the search console.
    • Ensure that queries match the desired behavior and performance characteristics
    • Customize index “query settings” for proper visibility to specific field information in results
    • Leverage the index “query” tab (in addition to the search console) to test the behavior of each query setting
    • Tweak faceting, refinement, field customization, and results display to match your use cases


8. Size your production system


    • Follow the documented HCI recommendations for specific document count targets and performance goals
      • Important Note: For the Index service to automatically repair & recover shards after intermittent outages, at least 50% of the disk space must be free on the instances running the Index service; otherwise, shards can get corrupted. It's strongly recommended to plan for the instances running the Index service to have 2x to 3x the space used.
    • Your mileage will vary!
      • The only way to be sure that your system can handle your use case is to try it out.
      • Check your test environments for expected index size and index service memory utilization with the given counts.
    • Compare your results to the recommendations and extrapolate accordingly.
      • It can be helpful to identify your largest sized container document
        • Some stages may require all data be loaded into memory for processing (PST Expansion, Email Expansion, MBOX Expansion).
        • Ensure that your configured workflow task “Executor Heap Memory” is sufficient to hold these documents (the default is 1 GB).
          • Check the instance Monitoring page to ensure that you have this memory available, and not allocated to other services.
        • If you run into workflow task “crawler” out of memory failures, increase the “Driver Heap Memory” (the default is 1 GB).
    • Decide on your availability needs
      • If you want your index to grow very large...
        • Index Shards
          • HCI allows you to break your index into smaller segments (called "shards") which can be dynamically distributed across the instances in your HCI cluster. This lets your system grow very large, because shards can be balanced to new instances if you run out of space or want to improve performance.
          • You set the index shard count when creating your index.
          • At least one shard per index service instance is ideal.
          • Increase shard count for an index to allow your index to grow very large (shards can be balanced to other Index service instances)
            • If your system will ever grow, you will want to over-shard so that extra shards may be seamlessly balanced to other instances in the future. For example, if your index instance count will double over time, double your initial shard count.
      • If you want your index to survive failures...
        • Index Replicas
          • HCI can create backup copies of your index shards, called "replicas", and store them on separate instances - allowing an index to survive a node outage and continue to support queries.
          • To create replicas, increase the “Index Protection Level” in "System Config > Services > Manage Services > Index > Configure" from the default of 1 to 2 in order for an index to survive single node outages. Increase further to create additional copies.
          • Increasing IPL automatically creates replica copies of each of your index shards on other instances, allowing the system to survive a single instance failure at the expense of additional resource utilization. Decreasing IPL will automatically delete replica copies.
          • Each new index copy will increase the resource requirements accordingly, and may require increasing index service instance counts.
      • The default service configuration recommendation for 4+ instance systems is typically 2 or 3 redundant service copies. This allows services to survive a node outage and continue working normally.
        • If you don’t need redundancy, consider scaling some services down to a single instance to free up resources (at the expense of HA)
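
The sizing rules in this step can be combined into a back-of-the-envelope calculation. The Python sketch below is only an estimate under stated assumptions: data is spread evenly across index instances, each additional Index Protection Level stores one more full copy of the index, and the shard count is set to one per planned future instance. The `plan_index` function and its parameters are hypothetical, and real results should still be checked against a test system.

```python
def plan_index(index_size_gb, instances_now, instances_future,
               protection_level=1, disk_factor=2.0):
    """Estimate shard count and disk per index instance.

    disk_factor is the 2x to 3x headroom recommended above.
    """
    if instances_now < 1 or instances_future < instances_now:
        raise ValueError("instances_future must be >= instances_now >= 1")

    # Over-shard: one shard per instance the system is expected to reach.
    shards = instances_future
    # Assumption: each protection level stores one full copy of every shard.
    shard_copies = shards * protection_level
    stored_gb = index_size_gb * protection_level
    # Assumption: copies are spread evenly across the current instances.
    disk_per_instance_gb = stored_gb * disk_factor / instances_now

    return {
        "shards": shards,
        "shard_copies": shard_copies,
        "stored_gb": stored_gb,
        "disk_per_instance_gb": disk_per_instance_gb,
    }


if __name__ == "__main__":
    # 600 GB index, 3 instances today, 6 planned, IPL 2, 2x headroom.
    print(plan_index(600, 3, 6, protection_level=2, disk_factor=2.0))
```

With these inputs the estimate is 6 shards, 12 shard copies, 1200 GB stored, and 800 GB of disk per instance.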


9. Optimize service distribution


    • As a general rule of thumb, you should never allocate more than 80% of system resources on each instance
      • ~20% of all system resources should be reserved for the operating system
      • ~1 GB of physical RAM should be reserved for the low level system services on WORKER instances
      • ~2 GB of physical RAM should be reserved for the low level system services on MASTER instances
    • Ensure that you have swap space enabled, and have enough to meet your needs (~5-10 GB partition is typical)
    • When possible, run each individual service you want to optimize by itself on dedicated instance(s)
    • The index service works best without any other services running on those instances (including workflow agent)
    • Maximize the RAM allocated to the index service – Solr requires as much RAM as you can spare in order to build & query a large index
      • Be careful not to meet/exceed the total physical RAM across your allocation of services!!
      • Limit the scaling of any JVM based service to 31.5 GB of RAM or less, to take advantage of JVM optimizations. Beyond this point, it likely makes sense to add instances rather than scaling up a service.
    • Check the Monitoring page
      • Provides detailed load and container metrics rolled up at the cluster, instance, and service levels
      • Can identify if a single instance is constantly busy - move services to instances that are not busy
      • Can identify if a service is utilizing all of its allocated resources - increase those allocations, or move the service to an instance with free resources
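
The memory rules of thumb above can be turned into a quick budget check. The sketch below is one possible reading of them, not an official formula: it takes 80% of physical RAM, subtracts the low level system service reserve for the instance role, subtracts whatever other services are allocated, and caps a JVM based service at 31.5 GB. The function names and the way the rules are combined are assumptions.

```python
JVM_HEAP_CAP_GB = 31.5  # per the guidance for JVM based services
SYSTEM_RESERVE_GB = {"worker": 1.0, "master": 2.0}


def allocatable_ram_gb(physical_gb, role):
    """RAM left for services after the 20% OS share and system reserve."""
    return max(0.0, physical_gb * 0.8 - SYSTEM_RESERVE_GB[role])


def jvm_service_ram_gb(physical_gb, role, other_services_gb=0.0):
    """Largest allocation for one JVM based service on this instance."""
    budget = allocatable_ram_gb(physical_gb, role) - other_services_gb
    return max(0.0, min(budget, JVM_HEAP_CAP_GB))


if __name__ == "__main__":
    # A dedicated 64 GB worker hits the JVM cap before the RAM budget.
    print(jvm_service_ram_gb(64, "worker"))
    # A 16 GB master already running 4 GB of other services has far less.
    print(jvm_service_ram_gb(16, "master", other_services_gb=4))
```

When the cap is reached before the budget is spent, that is a hint to add instances rather than scale the service up further.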


10. Test your production indexing


    • Check your indexing rate, and measure it against your target rate
    • Revisit stage performance and index schema fields to make further pipeline & index optimizations
    • Normalize any additional field data which did not match expectations
    • Check for any unexpected document failures, and take steps to resolve
    • Check the Monitoring page to help identify instances where heavy processing is occurring, and balance service load accordingly
    • Verify your end user search experience!
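
Measuring the indexing rate against a target is simple arithmetic, sketched below in Python. The document counts and elapsed time would come from your own workflow task report. The function names and the sample numbers are made up for illustration.

```python
def indexing_rate(docs_indexed, elapsed_seconds):
    """Documents indexed per second over the measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return docs_indexed / elapsed_seconds


def meets_target(docs_indexed, elapsed_seconds, target_docs_per_second):
    """True if the measured rate reaches the target rate."""
    return indexing_rate(docs_indexed, elapsed_seconds) >= target_docs_per_second


def hours_remaining(docs_left, docs_per_second):
    """Estimated hours to index the remaining documents at this rate."""
    return docs_left / docs_per_second / 3600


if __name__ == "__main__":
    rate = indexing_rate(90_000, 600)
    print(rate, meets_target(90_000, 600, 200), hours_remaining(1_080_000, rate))
```

If the measured rate falls short of the target, the stage performance and schema checks earlier in this list are the places to look first.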



That's it!  You're now an HCI search expert.


If you have any other questions, feel free to ask them in the HCI Community!