
Handling Incremental Crawls on custom connectors

Question asked by Eduardo Gamboa on Mar 20, 2017
Latest reply on Mar 21, 2017 by Alan Bryant

Hello All.

 

I want to write a custom connector for a certain content source. I already have some code that is able to fetch the files as Streams in any Java program.

As part of the requirements, we need to provide an "Incremental Crawl". This means that, after the first crawl (which includes all the documents by default), the connector needs to be able to detect the following events:

  1. When a new document is added
  2. When an existing document is updated
  3. When a document is deleted.

Event 1 is solved, since any crawl will return new documents. Events 2 and 3 are trickier. What I did in my code is to store a special signature for each document. If a document no longer appears in the crawl, it has been deleted; if its signature changes, it has been updated. To track this, I need to save those signatures in a persistent storage that is available from the connector. Based on this, only the added and changed documents are passed down to the pipeline, while unchanged documents are not processed.
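
To make the idea concrete, here is a rough sketch of the detection logic in plain Java. The class and method names (SignatureTracker, classify, deletedUris) and the choice of an MD5 digest are just my own illustration, not anything from the HCI plugin SDK:

```java
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not HCI SDK code: classify each crawled document
// by comparing its content digest with the one stored on the previous crawl.
public class SignatureTracker {

    public enum Change { ADDED, UPDATED, UNCHANGED }

    // uri -> hex digest from the previous crawl (loaded from persistent storage)
    private final Map<String, String> storedSignatures;
    private final Set<String> seenThisCrawl = new HashSet<>();

    public SignatureTracker(Map<String, String> storedSignatures) {
        this.storedSignatures = storedSignatures;
    }

    public Change classify(String uri, InputStream content) throws Exception {
        String signature = digest(content);
        seenThisCrawl.add(uri);
        String previous = storedSignatures.put(uri, signature);
        if (previous == null) {
            return Change.ADDED;
        }
        return previous.equals(signature) ? Change.UNCHANGED : Change.UPDATED;
    }

    // URIs that were stored before but not seen in this crawl are deletes.
    public Set<String> deletedUris() {
        Set<String> deleted = new HashSet<>(storedSignatures.keySet());
        deleted.removeAll(seenThisCrawl);
        return deleted;
    }

    private static String digest(InputStream content) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = content.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```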

 

For example, think of crawling a Web site. If a new document is added, it has a new URL. If a document is updated, the same URL is used but with different content, which produces a different signature for that URL and marks the document as modified (generating a new version of it). Finally, when a URL is no longer found, a Delete event can be passed to the pipeline to remove that document from the index.

 

To implement this type of behavior, I need to store those URLs and signatures in a persistent storage. With that, I can check whether a document has changed and process it accordingly. If I detect that the document has not changed, I can simply ignore it at the connector level, so no further processing is done down the pipeline.
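
For the persistence piece, the simplest thing I can picture is a plain properties file mapping each URL to its last signature, along the lines of the sketch below. Whether the connector actually has a writable location like this at runtime is exactly what I am unsure about:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: keep the uri -> signature map in a properties file.
// Whether the connector has a writable path available is part of my question.
public class SignatureStore {

    private final Path file;

    public SignatureStore(Path file) {
        this.file = file;
    }

    public Map<String, String> load() throws IOException {
        Map<String, String> signatures = new HashMap<>();
        if (Files.exists(file)) {
            Properties props = new Properties();
            try (Reader reader = Files.newBufferedReader(file)) {
                props.load(reader);
            }
            for (String uri : props.stringPropertyNames()) {
                signatures.put(uri, props.getProperty(uri));
            }
        }
        return signatures;
    }

    public void save(Map<String, String> signatures) throws IOException {
        Properties props = new Properties();
        signatures.forEach(props::setProperty);
        try (Writer writer = Files.newBufferedWriter(file)) {
            props.store(writer, "uri=signature from last crawl");
        }
    }
}
```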

 

Is there any API call in the HCI plugin SDK that provides this type of data storage? I know I could install an external service; however, that may be complicated in our environment.

Also, I tried to fit the getChanges method of the ConnectorPlugin interface into our crawl code; however, my source cannot be queried with a checkpoint as the documentation describes. Again, think of the web page example: the URLs and the body of the page may not contain any information about when they were updated.

 

Any help will be appreciated.

 

Regards
