Fields and Streams: The HCI Document

Blog Post created by Ben Isherwood on Sep 20, 2017

When analyzing a collection of data of varying types, the first challenge you'll encounter is how to ensure that your processing tasks can all speak the same consistent language and provide common capabilities.


Can your system easily determine the difference between an email, image, PDF, or call recording? If so, how? Can the system make additional processing decisions automatically based only on the data provided?


Typically, these tasks are performed simply through the generation and evaluation of metadata. The MIME type of a file, for example, can help to determine how it should be processed. Geo-location metadata can help to identify where content was generated. Date/time metadata can identify when a document was created or last accessed. But can the system easily understand how to interact with, augment, and/or repair that metadata? Can it automatically determine the data types of those metadata values to assist in parsing and database entry? How does the system access the raw content streams for further analysis given only metadata? Is the data associated with any additional data streams? How are those accessed? Is the answer different each time?


In Content Intelligence, data from any data source is represented in the form of a data structure called a Document.


A Document is a representation of a specific piece of data and/or its associated metadata. Any data can be represented in this form, providing a normalization mechanism for advanced content processing and metadata analysis.


Documents are made up of any number of fields and/or streams.


A field is any individual metadata key/value pair associated with the data.


For example, a medical image can become a Document that contains field/value pairs such as "Doctor = John Smith" and "Location = City Hospital". These fields serve as the metadata for your files and can be used for general processing and to construct a searchable index. Fields may optionally be strongly typed, though all fields can still be evaluated in their native string form. Fields can also have a single value or multiple values associated with them.
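
For illustration, here's a minimal sketch of how those field/value pairs could be attached to a Document, using the DocumentBuilder and StringDocumentFieldValue APIs shown later in this post. The "Doctor" and "Location" field names are purely illustrative, not names the system requires:

     // A sketch only: representing the medical image example above as Document fields.
     // "Doctor" and "Location" are illustrative field names, not required by the system.
     Document emptyDocument = callback.documentBuilder().build();
     DocumentBuilder documentBuilder = callback.documentBuilder().copy(emptyDocument);
     documentBuilder.addMetadata("Doctor", StringDocumentFieldValue.builder().setString("John Smith").build());
     documentBuilder.addMetadata("Location", StringDocumentFieldValue.builder().setString("City Hospital").build());
     Document medicalImageDocument = documentBuilder.build();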


A stream is a pointer to a sequence of raw data bytes that live in another location, outside of the document itself, but that can be accessed on demand.


Streams typically point to larger data files that would be prohibitively expensive to load into memory as Document fields, such as the full text content of a large PDF file. Rather than spending system resources passing this large amount of data through a pipeline, Content Intelligence uses these streams to access the data and read it from where it lives, on demand. This is accomplished through stream metadata, which the connector evaluates to determine which data to pull into the system for streamed processing. These data streams are typically analyzed within the system without requiring the full contents of the stream to be loaded into memory.
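
To make that on-demand model concrete, here's a rough sketch of a stage reading a named stream in fixed-size chunks, using the openNamedStream call covered later in this post. The buffer size and the processing step are assumptions for illustration only:

     // A sketch only: read the named content stream in chunks rather than loading it all at once.
     // "document" is the Document being processed; "HCI_content" is the standard content stream
     // name referenced elsewhere in this post.
     try (InputStream inputStream = callback.openNamedStream(document, "HCI_content")) {
          byte[] buffer = new byte[8192];
          int bytesRead;
          while ((bytesRead = inputStream.read(buffer)) != -1) {
               // Feed buffer[0..bytesRead) to an incremental parser, hash, or scanner here
          }
     }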


Here's a visual example of a Document in Content Intelligence representing a PDF file:



Notice that this Document has a number of metadata fields defined, such as Content_Type and HCI_filename. Processing stages may add, remove, and change these metadata fields to build a self-describing entity. Tagging Documents with additional fields can direct other processing stages in how they should process each Document to extract additional value.


This Document also has a "streams" section, where it defines two named streams. First, there's the HCI_content stream, which points to the raw bytes of the PDF file. Second, because the file was stored on HCP, we see an additional custom metadata annotation stream named .metapairs, containing additional XML-formatted metadata associated with this Document.


At any time during processing, each individual data stream associated with this Document can be read by name from the original content source. When directed, the system streams the data from the original data connection back to the processing stage that requested it. This allows for tasks such as reading that XML annotation, parsing the information it contains, and adding that information to the Document as additional fields for further processing. Like fields, streams can also be added to or removed from the Document on demand, so that other processing stages can easily consume them.


Creating and Updating Documents


Content Intelligence data connectors and processing stages both enable flexible interactions with Documents. See a previous blog post on writing custom plugins for more details.


For example, a custom file system connector may perform a directory listing to identify metadata and create a Document for each file it finds. Each Document would contain fields representing that file's metadata. Creating a Document is accomplished through the use of the DocumentBuilder, obtained from the PluginCallback object:


     Document emptyDocument = callback.documentBuilder().build();


Adding fields to this existing Document is accomplished using the builder as follows:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(emptyDocument);
     documentBuilder.addMetadata("HCI_id", StringDocumentFieldValue.builder().setString("/file.pdf").build());
     documentBuilder.addMetadata("HCI_URI", StringDocumentFieldValue.builder().setString("file://file.pdf").build());
     documentBuilder.addMetadata("category", StringDocumentFieldValue.builder().setString("Business").build());
     Document myFileDocument = documentBuilder.build();


This Document now includes the required fields "HCI_id", containing the unique identifier of the file on that data source, and "HCI_URI", which has a single value "file://file.pdf" defining how to remotely access it. It also contains a custom field: "category = Business". You can do this with any information you obtain about this Document, effectively building a list of metadata associated with it that can be easily accessed by other parts of the system.


Now, let's allow callers to access the raw data stream from this Document by attaching a stream named "HCI_Content". Because we're only adding a pointer to the file (not actual stream contents), we use the setStreamMetadata method:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);
     documentBuilder.setStreamMetadata("HCI_Content", Collections.emptyMap());
     Document myFileDocumentWithStream = documentBuilder.build();


Notice that we haven't set any stream metadata for this new stream, and the provided Map is empty. This is because connectors already use the standard "HCI_Content" stream name to represent the raw data for this file. This directs the system to use the HCI_URI field to read the file (e.g. from the local filesystem) and present the stream contents to the caller.


If you have an InputStream, you can also write streams to system-managed temp space using setStream:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);
     documentBuilder.setStream("xmlAttachment", Collections.emptyMap(), inputStream);
     Document myFileDocumentWithStream = documentBuilder.build();


When writing this inputStream to HCI, the system will attach additional stream metadata containing the path of the local temp file the data was written to. Stream metadata can be used, for example, to store any additional details the connector needs in order to read this data when asked; in this case, it tells the system to load the file from the temp directory where it was stored. All temporary streams are deleted automatically when workflows complete.
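
As a concrete (hypothetical) example, a stage that builds an XML annotation in memory could hand it off to system-managed temp space like this. The XML content and the "xmlAttachment" stream name are illustrative only:

     // A sketch only: write an in-memory XML annotation to system-managed temp space.
     // Uses java.io.ByteArrayInputStream and java.nio.charset.StandardCharsets.
     String xml = "<annotation><category>Business</category></annotation>";
     InputStream annotationStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);
     documentBuilder.setStream("xmlAttachment", Collections.emptyMap(), annotationStream);
     Document documentWithAnnotation = documentBuilder.build();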


Working with Documents


Callers from other processing stages can read fields and streams from the provided Document as follows:


     // Reading fields
     String category = document.getMetadataValue("category").toString();

     // Reading streams
     try (InputStream inputStream = callback.openNamedStream(document, streamName)) {
          // Use the inputStream for processing
          // Add additional metadata fields to the Document based on the contents found
     }



Processing Example


Consider a virus detection stage, tasked with reading the content stream of each individual Document, and adding a metadata field to indicate "PASS" or "FAIL". This stage would follow the procedures above to first analyze the contents, and again to add additional metadata to the Document for evaluation by other stages.


     private Document scanForVirus(Document document) {
          DocumentBuilder documentBuilder = callback.documentBuilder().copy(document);
          // Analyze the content stream from this document
          try (InputStream inputStream = callback.openNamedStream(document, "HCI_Content")) {
               // Determine if there's a virus!
               boolean foundVirus = readStreamContentsAndCheckForVirus(inputStream);
               // Add the result to the document
               documentBuilder.addMetadata("VirusFound", BooleanDocumentFieldValue.builder().setBoolean(foundVirus).build());
          } catch (Exception e) {
               // Handle the failure as appropriate for the plugin (error field, retry, etc.)
               throw new RuntimeException("Failed to scan content stream", e);
          }
          // Return the updated Document for downstream stages
          return documentBuilder.build();
     }





A subsequent stage could then check the value of VirusFound on incoming Documents, and take steps to quarantine any files on the data sources where the virus was detected.
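
A minimal sketch of such a follow-on stage is shown below. The "QuarantineRequired" field name and the string-based boolean check are assumptions for illustration, not part of the product:

     // A sketch only: read the VirusFound field added above and tag the Document for quarantine.
     // "QuarantineRequired" is an illustrative field name; the toString()-based check assumes the
     // boolean field can be evaluated in its native string form ("true"/"false"), as described earlier.
     private Document flagForQuarantine(Document document) {
          boolean virusFound = document.getMetadataValue("VirusFound") != null
                    && Boolean.parseBoolean(document.getMetadataValue("VirusFound").toString());
          DocumentBuilder documentBuilder = callback.documentBuilder().copy(document);
          if (virusFound) {
               documentBuilder.addMetadata("QuarantineRequired",
                    BooleanDocumentFieldValue.builder().setBoolean(true).build());
          }
          return documentBuilder.build();
     }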


This work can be performed without directly interacting with the data sources themselves - just by interacting with the Document representations in the Content Intelligence environment. This eliminates much of the complexity of dealing directly with client SDKs, connection pools, and retry logic, simplifying the development of new processing solutions.


Standardizing on field and stream names (such as HCI_URI and HCI_content) can reduce the custom configuration required on each processing stage by leveraging built-in, out-of-the-box defaults. This can help to eliminate many common configuration mistakes, such as typos in field names, while promoting the re-use of stages.


I hope this demonstrates the flexibility and convenience provided by standardizing on a useful data structure such as the Content Intelligence Document. Whether the data is a tweet, a database row, or an office document, it can be represented, accessed, analyzed, and augmented in the same consistent way. By using a standard, normalized mechanism for accessing and consuming information, you can quickly generate reusable code that helps satisfy a number of use cases. Even those you haven't thought of yet...


Thanks for reading!