I am considering writing a data connector for a data source where the only way to find content is to scan the data source continually. For instance, a file system does not necessarily have a notification mechanism for when files are modified or new files are created, so to find new files the connector must keep rescanning the file system. The HCP data connector (the non-MQE one) also scans the HCP namespace.
My understanding is that HCI keeps track of content it has already seen to avoid processing the same content twice. How does the HCI architecture provide this capability? Does it need to be built into the data connector, and if so, is there a good example? Or is the data connector blind to this, with the core HCI components keeping track and not sending already-seen content into pipelines?
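To make the question concrete, here is a minimal sketch of the scan-and-track pattern I have in mind if the connector had to do this itself. It is plain Python and does not use any HCI API; the function name `scan_for_changes` and the `seen` dictionary are hypothetical, and comparing modification time plus size is just one assumed way to detect a changed file.

```python
import os


def scan_for_changes(root, seen):
    """Walk `root` and return the paths that are new or changed since the last scan.

    `seen` maps path -> (mtime_ns, size) and is updated in place, so the
    caller must keep it (or persist it) between scans. Hypothetical helper,
    not part of any HCI SDK.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between the directory listing and stat
            signature = (st.st_mtime_ns, st.st_size)
            # Emit the file only if it is unknown or its signature differs.
            if seen.get(path) != signature:
                seen[path] = signature
                changed.append(path)
    return sorted(changed)
```

Calling `scan_for_changes(root, seen)` repeatedly with the same `seen` dictionary returns only new or modified files on each pass. The state holds one entry per file ever seen, so its size grows with the data source.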
Perhaps the most important question: what are the limits or guidelines for the maximum number of files that HCI can keep track of before processing issues start to appear?
I am looking for architectural details on this capability, both to understand how to build such a data connector and to avoid degrading system performance.