
Content Intelligence: Connector Modes

Blog Post created by Ben Isherwood on Feb 16, 2018

I was asked today about the role of connector plugins in Content Intelligence, so thought I'd pass along the details.

 

Content Intelligence connector plugins can (today) operate in one of two modes:

  • List-based (CRAWL_LIST)
  • Change-based (GET_CHANGES)

 

CRAWL_LIST mode

 

In this mode, users specify a “starting” Document by configuring the data source. The plugin's role is to perform container listings from the starting points requested by the HCI Crawler. HCI provides all of the bookkeeping and decides which starting points to ask for.

 

Example: Suppose the content structure is as follows:

/folder1
/folder1/file1
/folder1/subfolder1
/folder2
/folder2/file2
/folder2/subfolder1
/folder2/subfolder1/file3
/file4

 

First, a user configures a starting point of “/” in the data source configuration.

 

When the workflow is executed, HCI will call the “root()” method on the plugin, which should return a Document for the “/” starting point. This is typically a "container" Document, which works like a directory.
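
As a rough illustration of that idea, here's a minimal sketch. The Doc record below is a simplified stand-in for the SDK's Document/DocumentBuilder types (see DocumentBuilder.setIsContainer), not the real API:

public class RootSketch {

    // Simplified stand-in for the SDK's Document/DocumentBuilder types.
    record Doc(String uri, boolean isContainer) {}

    // root(): return a container Document for the configured "/" starting point.
    Doc root() {
        return new Doc("/", true);
    }
}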

 

HCI will call “list()” with the starting Document of “/”, which should return the Documents directly under “/”:

  • /folder1
  • /folder2
  • /file4

HCI will then call “list()” with a starting Document of “/folder1”, which should return the following Documents:

  • /folder1/file1
  • /folder1/subfolder1

 

The process continues until all objects have been crawled. HCI keeps track of the Documents that have already been visited, and will not crawl the same object again unless directed to later.
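
To make the crawl order and bookkeeping concrete, here's a small, self-contained simulation in plain Java (no HCI SDK types). The SimpleDoc record and the in-memory TREE map are hypothetical stand-ins that mirror the example structure above; the queue and visited set play the role of HCI's bookkeeping:

import java.util.*;

public class CrawlListSimulation {

    // Hypothetical stand-in for a Document: just a path and a container flag.
    record SimpleDoc(String path, boolean isContainer) {}

    // In-memory stand-in for the data source from the example above.
    static final Map<String, List<SimpleDoc>> TREE = Map.of(
        "/", List.of(new SimpleDoc("/folder1", true),
                     new SimpleDoc("/folder2", true),
                     new SimpleDoc("/file4", false)),
        "/folder1", List.of(new SimpleDoc("/folder1/file1", false),
                            new SimpleDoc("/folder1/subfolder1", true)),
        "/folder1/subfolder1", List.of(),
        "/folder2", List.of(new SimpleDoc("/folder2/file2", false),
                            new SimpleDoc("/folder2/subfolder1", true)),
        "/folder2/subfolder1", List.of(new SimpleDoc("/folder2/subfolder1/file3", false))
    );

    // The plugin's list(): return the immediate children of one container.
    static List<SimpleDoc> list(SimpleDoc container) {
        return TREE.getOrDefault(container.path(), List.of());
    }

    public static void main(String[] args) {
        // The plugin's root(): the configured "/" starting point.
        SimpleDoc root = new SimpleDoc("/", true);

        // HCI's bookkeeping: a work queue of containers and a visited set.
        Deque<SimpleDoc> containers = new ArrayDeque<>(List.of(root));
        Set<String> visited = new HashSet<>();

        while (!containers.isEmpty()) {
            SimpleDoc container = containers.poll();
            for (SimpleDoc doc : list(container)) {
                if (!visited.add(doc.path())) {
                    continue;              // already crawled; skip unless re-crawl is requested
                }
                System.out.println("discovered: " + doc.path());
                if (doc.isContainer()) {
                    containers.add(doc);   // list this container on a later pass
                }
            }
        }
    }
}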

 

In continuous mode, the entire process automatically starts again from the root container. In this case, HCI sends only the Documents that have changed since the previous pass to the processing pipelines.

 

 

GET_CHANGES mode

 

In this change-based mode, the connector plugin can collect and return Documents in any order and at any frequency it chooses.

 

HCI calls the “getChanges()” method on plugins in this mode to return Documents. The plugin can return a plugin-defined “token” with each response. The token is opaque and is interpreted only by the plugin. HCI stores this token, and will provide the token returned by the last getChanges() call to the next call. Plugins decide what to do (if anything) with the provided token. For example, if the getChanges() call executes a time-based query to return Documents, the token can include the timestamp of the last discovered Document. On the next getChanges() call, HCI will provide this token to the plugin, which can use it to build the next query.

 

It’s completely up to the plugin to determine what to return from getChanges(), such as a batch of Documents or a single Document. The method can also return no changes until the connector discovers a new Document to return.
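
As a concrete illustration of the token handshake, here's a hedged sketch of a time-based getChanges(). The method name comes from the SDK, but the signature, the ChangeSet result type, and the queryModifiedSince() helper are simplified assumptions for illustration only:

import java.time.Instant;
import java.util.List;

public class TimeBasedChanges {

    // Hypothetical result type: the Documents found plus the token HCI should store.
    record ChangeSet(List<String> documentUris, String nextToken) {}

    // Hypothetical query against the data source: everything modified after 'since'.
    static List<String> queryModifiedSince(Instant since) {
        // ... issue a time-based query against the repository ...
        return List.of();
    }

    // Sketch of a token-driven getChanges(): HCI passes back whatever token we
    // returned last time; a null token means "start from the beginning".
    ChangeSet getChanges(String token) {
        Instant since = (token == null) ? Instant.EPOCH : Instant.parse(token);

        List<String> changed = queryModifiedSince(since);

        // Record the high-water mark in the token so the next call resumes from here.
        String nextToken = Instant.now().toString();
        return new ChangeSet(changed, nextToken);
    }
}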

 

 

The Role of Connector Plugins

 

For details, I've included Alan Bryant's excellent overview of the role of connector plugins here:

 

"A quick overview:

 

There are currently two modes that ConnectorPlugins can use, either list-based or change-based. Since you are working with a filesystem, you probably want list-based. You should implement getMode() to return ConnectorMode.CRAWL_LIST. You can also then implement getChanges() to throw some exception since you won't be needing it.

 

The starting point is the getDefaultConfig() method. This should define any configuration that the user should specify. In this case you should have them specify the starting path that the connector should provide access to.

 

Once the user has specified the config, build() will be called. You should construct a new instance of your plugin here with the provided config and the callback. See the examples.

 

startSession() will then be called. You should put any state associated with your plugin on this session... anything that's expensive to create or clean up. There is no guarantee that your plugin instance will be kept around between calls. The session will be cached where possible.

 

To actually crawl the datasource, we start with the root() method. This should return a Document representing the root of your datasource. Generally this should be a container (see DocumentBuilder.setIsContainer).

 

After that, the Crawler will call list() for a particular container. List should return an Iterator of Documents. Each Document generally represents something that has content to be processed (see DocumentBuilder.setHasContent) or that is a container of other Documents, like root(). These containers generally correspond to real-world objects, like Directories, but can really be any grouping you want.

 

If you are returning large numbers of Documents, look at StreamingDocumentIterator so you don't cause OutOfMemory issues by holding all the Documents in memory at once.

 

As Documents are discovered, they will be processed in the workflow. During this time, stages may ask for streams from the Document. This is implemented by calling openNamedStream(). openNamedStream should use metadata that was added in list() to be able to open the streams. So, list() just adds stream metadata (see DocumentBuilder.setStreamMetadata) and it's used later when we call openNamedStream.

 

Other things you should do:

  • get() should operate like openNamedStream being passed the StandardFields.CONTENT.
  • getMetadata returns an up-to-date version of a Document based on just a URI. It is very important that this returns the same types of metadata, in the same format, as list(). getMetadata() is used for pipeline and workflow tests. If the data is different, then the test will be useless.
  • test() should be implemented to ensure that the config is correct... basically, make sure the configured directory exists. In plugins that do network access, this can trigger SSL certificate processes."
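
To tie Alan's steps together, here's a hedged, end-to-end sketch of a filesystem-style CRAWL_LIST connector. Everything below (the Doc record, the Map-based config, the Session class, and the method signatures) is a simplified stand-in for the SDK's Document/DocumentBuilder, configuration, and session objects rather than the real plugin interface; only the shape of each step is the point.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class FileConnectorSketch {

    enum ConnectorMode { CRAWL_LIST, GET_CHANGES }           // stand-in for the SDK enum

    record Doc(String uri, boolean isContainer, boolean hasContent) {}

    // Expensive per-session state lives here; HCI caches the session where possible.
    static class Session implements AutoCloseable {
        final Path startingPath;
        Session(Path startingPath) { this.startingPath = startingPath; }
        @Override public void close() { /* release connections or handles here */ }
    }

    private final Map<String, String> config;

    public FileConnectorSketch() { this(Map.of()); }
    private FileConnectorSketch(Map<String, String> config) { this.config = config; }

    // getMode(): this connector is crawled via root()/list().
    ConnectorMode getMode() { return ConnectorMode.CRAWL_LIST; }

    // getChanges(): never used in CRAWL_LIST mode, so fail loudly if it is called.
    Object getChanges(String token) {
        throw new UnsupportedOperationException("Not supported in CRAWL_LIST mode");
    }

    // getDefaultConfig(): the user supplies the starting path the connector exposes.
    Map<String, String> getDefaultConfig() { return Map.of("startingPath", "/"); }

    // build(): construct a fresh instance around the user-supplied config.
    FileConnectorSketch build(Map<String, String> userConfig) {
        return new FileConnectorSketch(userConfig);
    }

    // startSession(): create state that is expensive to set up; the plugin
    // instance itself may not survive between calls, but the session can.
    Session startSession() {
        return new Session(Paths.get(config.getOrDefault("startingPath", "/")));
    }

    // root(): a container Document for the configured starting point.
    Doc root(Session session) {
        return new Doc(session.startingPath.toString(), true, false);
    }

    // list(): the immediate children of one container. Directories become
    // containers to list later; files become Documents with content.
    Iterator<Doc> list(Session session, Doc container) throws IOException {
        List<Doc> children = new ArrayList<>();
        try (Stream<Path> entries = Files.list(Paths.get(container.uri()))) {
            entries.forEach(p -> {
                boolean isDir = Files.isDirectory(p);
                children.add(new Doc(p.toString(), isDir, !isDir));
            });
        }
        return children.iterator();
    }

    // openNamedStream(): reopen content later from what list() recorded
    // (here the URI doubles as the path; real plugins use stream metadata).
    InputStream openNamedStream(Session session, Doc doc, String streamName) throws IOException {
        return Files.newInputStream(Paths.get(doc.uri()));
    }

    // test(): make sure the configured directory actually exists.
    void test(Session session) throws IOException {
        if (!Files.isDirectory(session.startingPath)) {
            throw new IOException("Configured directory does not exist: " + session.startingPath);
        }
    }
}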

 

 

Thanks,

-Ben
