
Debugging an HCI pipeline: Helpful stages

Blog post created by Ben Isherwood on Apr 10, 2017

HCI pipelines provide an easy-to-use mechanism for analyzing, normalizing, and transforming data.


But how do I know what additional stages I need to add to my pipeline to process a set of data?


HCI introduces a "pipeline test" tool. This tool allows you to browse a data source, select any file, and process that file using the pipeline. The pipeline test UI shows the values of a document's metadata before it enters a processing stage and compares them with the metadata that exists after it leaves that stage. This lets you not only see exactly how each stage processes a document in the pipeline, but also gain visibility into the metadata values that are already there.

 

Workflow pipeline test tool, displaying the new metadata added by the MIME Type Detection stage:

Blog3_1.jpg

 

There are many useful built-in stages to utilize when testing a pipeline. Let's take a look at a few now...

 

Snippet Extraction

 

First, let's talk about the Snippet Extraction stage.

One convenient aspect of the Snippet Extraction stage is that it can pull raw content from a data stream. You can configure the stage to copy raw data into a metadata field of your choice, letting you visualize the content of that raw stream.
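To make the idea concrete, here is a minimal sketch of what the stage does conceptually: copy the first bytes of a stream into a metadata field you can then inspect in the pipeline test UI. This is plain Java with made-up names, not the HCI plugin SDK; the byte limit and stream contents are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class SnippetSketch {

    // Read at most maxBytes from the stream and return them as a UTF-8 string.
    static String extractSnippet(InputStream stream, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int read = stream.readNBytes(buffer, 0, maxBytes);
        return new String(buffer, 0, read, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Pretend this is the raw "HCI_content" stream of a document.
        InputStream hciContent =
            new ByteArrayInputStream("plain text, not XML ...".getBytes(StandardCharsets.UTF_8));

        Map<String, String> fields = new HashMap<>();
        // Store the snippet in a "$" field so it is never indexed.
        fields.put("$TestContents", extractSnippet(hciContent, 1024));

        System.out.println(fields.get("$TestContents"));
    }
}
```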


For example, this stage has helped to resolve issues in which content class extraction was failing.

The Content Class Extraction stage in the pipeline below was expecting to read raw XML data from the HCI_content stream, but it wasn't working. The stage's configuration looked correct, with a valid source stream name, so why didn't it work?


Why is the content class stage not extracting metadata? There are no changes!

Blog3_2.jpg

 

To figure this out, we first added a Snippet Extraction stage to the pipeline before the Content Class Extraction stage and configured it to read data from the HCI_content stream and store it in a "$TestContents" field.


Note: the "$" prefix can be used to name fields that should never be indexed but that may be used for debugging or stage-to-stage communication.


After running the pipeline test, voilà! The content dump indicates that the stream being processed was not XML, but raw text! It turns out we had configured the stage to process the raw "HCI_content" stream instead of the custom metadata stream we should have used: "HCP_customMetadata_default".

 

The dump indicates that the stream contains text instead of the expected XML:

Blog3_3.jpg

 

Fixing the content class stage to point to the correct XML content stream name ("HCP_customMetadata_default") resolved the issue, allowing the extraction to work as expected:

Blog3_4.jpg

 

Snippet extraction may be used whenever you need to gain insight into what content is actually inside the data streams you are processing.
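If you want to repeat the same sanity check outside the UI, a crude sniff of the dumped snippet is enough to tell XML apart from plain text. This is an illustrative sketch only, in plain Java with a hypothetical helper name, not an HCI API:

```java
public class ContentSniffSketch {

    // Very rough heuristic: XML custom metadata starts with a prolog or an opening tag.
    static boolean looksLikeXml(String snippet) {
        String s = snippet.stripLeading();
        return s.startsWith("<?xml") || s.startsWith("<");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXml("just some plain log text"));        // false
        System.out.println(looksLikeXml("<?xml version=\"1.0\"?><doc/>"));   // true
    }
}
```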

 

 

Reject Documents

 

Let's look at another stage that is extremely useful when building and testing pipelines: "Reject Documents".


The Reject Documents stage allows you to cause any Document to immediately fail processing with a custom error message of your choosing.


Adding Reject Documents to the pipeline:

Blog12_1.png

 

Document failures generated by the Reject Documents stage look like any other document failure and are reported in exactly the same way. The workflow halts all further processing of these rejected documents, but lists them for users to investigate. This stage can therefore serve a purpose similar to the assert statements found in many programming languages.
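As a rough analogy, rejection behaves much like an assert: if a required condition does not hold, the document immediately fails with your custom message. The sketch below is plain Java with hypothetical names, not the HCI plugin SDK.

```java
import java.util.Set;

public class RejectSketch {

    // Stand-in for the failure that gets reported for a rejected document.
    static class DocumentRejectedException extends RuntimeException {
        DocumentRejectedException(String message) { super(message); }
    }

    // Reject any document that does not satisfy the required condition.
    static void requireStream(Set<String> streams, String name, String message) {
        if (!streams.contains(name)) {
            throw new DocumentRejectedException(message);
        }
    }

    public static void main(String[] args) {
        Set<String> streams = Set.of("HCI_content");   // no "HCI_text" stream
        // Fails immediately with the custom message, just like an assert.
        requireStream(streams, "HCI_text",
            "Expected an HCI_text stream at this point in the pipeline");
    }
}
```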


The stage lets pipeline creators specify "required" criteria that a Document must meet to be processed further, enabling validation of specific document conditions.


Configure a "Reject Documents" stage by specifying a custom message:

Blog12_2.png

 

Consider a scenario in which you (the pipeline designer) expect all Documents to have a stream named "HCI_text" at a specific point in the pipeline. Since all further processing depends on this condition, you can introduce a "Reject" stage to enforce this.

Blog12_3.png

 

If any Document enters the pipeline at this position WITHOUT a stream named "HCI_text", that Document will fail processing and be reported as a document failure.


This behavior can be invaluable in identifying which documents in your data set do not conform to specific processing criteria. You can use this information either to process those failing documents further or to update the pipeline to handle them in special ways. In this specific case, you could add an additional "Text and Metadata Extraction" stage to generate the expected "HCI_text" stream when it does not already exist, as sketched below.
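A minimal sketch of that remediation, again in plain Java with hypothetical names rather than the HCI SDK: only generate the "HCI_text" stream when it is missing, so documents that already have it pass through untouched.

```java
import java.util.HashMap;
import java.util.Map;

public class EnsureTextStreamSketch {

    // Stand-in for running a "Text and Metadata Extraction" stage on the raw content.
    static String extractText(String rawContent) {
        return rawContent;   // real extraction would parse the raw stream
    }

    // Only run the extra extraction step when the expected stream is absent.
    static void ensureTextStream(Map<String, String> streams) {
        if (!streams.containsKey("HCI_text")) {
            streams.put("HCI_text", extractText(streams.get("HCI_content")));
        }
    }

    public static void main(String[] args) {
        Map<String, String> streams = new HashMap<>();
        streams.put("HCI_content", "raw document bytes ...");
        ensureTextStream(streams);
        System.out.println(streams.containsKey("HCI_text"));   // true
    }
}
```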


This stage is also useful for "pausing" the expansion of documents such as CSV, log, PST, ZIP, and TAR files. If you'd like to see how a pipeline is handling these sub-documents, you can add a "Reject Documents" stage to halt processing at any point in the pipeline, optionally conditioned on a specific file. Using the pipeline test stage diff tools, you can then determine how these expanded Documents were handled by the preceding pipeline stages and adjust pipeline logic and stages accordingly.

 

 

Conclusion


As always, use pipeline testing with example documents from your data set for analysis. Even a few minutes spent testing example documents can lead to vast improvements in index query performance, relevance, and accuracy.

 

More on additional stages that can be used to debug pipelines is coming in future blog posts.

Thanks for reading!

-Ben
