We can build a workflow with PST Expansion. However, this is a one-time process.
How can we build a system that continuously processes live PST files on the Exchange server side, where the content changes incrementally?
You will have to look at your workflow settings and possibly your data connectors. For data connectors you can choose HCP or MQE. In the workflow settings you can set the task to run on a schedule or continuously.
The problem with PSTs is that they are containers. Every time one changes, the data connector will report it and the workflow will crawl the container again. The PST Expansion stage should skip all the PST entries it has already visited, assuming they have not changed themselves (this needs to be verified). However, I am not sure whether the PST Expansion stage will pick it up if you delete messages from the PST. Unless there is an explicit delete recorded in the PST itself, the stage will have no idea that messages were deleted.
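If you need to catch deletions yourself, one workaround (outside HCI, purely illustrative) is to keep a manifest of the message identifiers seen on each expansion pass and diff it against the next pass. A minimal sketch, assuming you can collect some stable per-message ID while expanding; the manifest path and IDs below are placeholders:

```python
# Hypothetical helper (not an HCI feature): detect messages removed from a PST
# between expansion passes by diffing manifests of message identifiers.
import json
from pathlib import Path

MANIFEST = Path("pst_manifest.json")  # assumed location of the previous pass's manifest

def diff_against_previous(current_ids: set[str]) -> set[str]:
    """Return IDs seen in the previous pass but missing now (i.e. deleted)."""
    previous_ids = set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()
    deleted = previous_ids - current_ids
    MANIFEST.write_text(json.dumps(sorted(current_ids)))  # persist for the next pass
    return deleted

# current_ids would be collected while expanding the PST (e.g. message entry IDs)
deleted = diff_against_previous({"msg-0001", "msg-0003"})
print(f"Messages deleted since last pass: {sorted(deleted)}")
```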
This can depend on the use case. Is the end goal to search for PST files or for the individual emails and attachments contained inside?
The general best practice for search is to first expand any containers onto a storage platform that can serve as the target for indexed document links. The HCI "PST Expansion" stage can be used to expand PST containers and write the individual emails (and their attachments) back to a storage system such as HCP. You can then crawl the HCP system for indexing and make the individual elements searchable. You typically want end users to be able to click on an email and download it, rather than the entire PST it lives in.
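For reference, writing an expanded email out to an HCP namespace can be done over HCP's REST gateway. Here is a minimal sketch, assuming the expansion has already produced individual .eml files; the hostname, credentials, object path, and the "Authorization: HCP <base64 user>:<md5 password>" header form are assumptions to verify against your HCP namespace documentation:

```python
# Sketch: PUT one expanded email into an HCP namespace via the REST gateway.
import base64
import hashlib
from pathlib import Path

import requests

HCP_NAMESPACE_URL = "https://emails.tenant.hcp.example.com/rest"  # placeholder namespace URL
USER, PASSWORD = "dataaccess-user", "secret"                      # placeholder credentials

def hcp_auth_header(user: str, password: str) -> dict:
    # Assumed HCP data-access auth scheme: base64 username, md5 of password.
    token = base64.b64encode(user.encode()).decode()
    digest = hashlib.md5(password.encode()).hexdigest()
    return {"Authorization": f"HCP {token}:{digest}"}

def write_email(eml_path: Path) -> None:
    url = f"{HCP_NAMESPACE_URL}/expanded/{eml_path.name}"
    resp = requests.put(url, data=eml_path.read_bytes(),
                        headers=hcp_auth_header(USER, PASSWORD))
    resp.raise_for_status()

write_email(Path("0001_quarterly_report.eml"))
```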
A better option for emails is to leverage HCP's SMTP gateway to ingest the individual emails into a namespace. The HCP MQE connector for HCI could then be used to build a search index with the HCI default pipeline. This connector can operate in a continuous mode, always indexing emails as they are ingested to HCP. You can enable the "Workflow > Task > Actions... > Edit Settings > Document Discovery > Check for Updates" setting to direct the system to continuously query the data source for this workflow and pick up any changes.
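Ingesting through the SMTP gateway looks like ordinary mail delivery from the client side. A rough sketch using only the Python standard library; the gateway host, port, and recipient address below are assumptions for illustration, and HCP's SMTP gateway settings determine the real values:

```python
# Sketch: deliver an individual email to an HCP namespace via its SMTP gateway.
import smtplib
from email.message import EmailMessage

HCP_SMTP_HOST = "hcp-smtp.example.com"  # placeholder gateway address
HCP_SMTP_PORT = 25                      # assumed default SMTP port

msg = EmailMessage()
msg["From"] = "journal@example.com"
msg["To"] = "archive@emails.tenant.hcp.example.com"  # placeholder namespace address
msg["Subject"] = "Quarterly report"
msg.set_content("Body text of the archived email.")

with smtplib.SMTP(HCP_SMTP_HOST, HCP_SMTP_PORT) as smtp:
    smtp.send_message(msg)
```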
The end result of both approaches would be a search experience that lets you search for individual emails and their attachments. Clicking on a result would download the individual email or attachment, rather than the entire PST container the document originated from.
The same approach should be used for files found inside other container formats, such as zip and tar.
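Expanding those formats ahead of indexing is straightforward; a small standard-library sketch, where the output directory is a placeholder staging area before ingest to HCP and the containers are assumed to come from a trusted source:

```python
# Sketch: expand zip/tar containers to individual files so each member can be
# indexed and downloaded on its own.
import tarfile
import zipfile
from pathlib import Path

OUT_DIR = Path("expanded")  # placeholder staging area
OUT_DIR.mkdir(exist_ok=True)

def expand_container(path: Path) -> None:
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(OUT_DIR / path.stem)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            tf.extractall(OUT_DIR / path.stem)  # only use with trusted archives

expand_container(Path("archive.zip"))
```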
Hope this helps,
Both options that Ben suggests above are appropriate and would work quite well. The downside is that you have now increased your storage footprint (possibly significantly) for the purposes of searching email. That said, there are other benefits to pursuing Ben's approach: you might need to do this anyway for compliance/journaling purposes, you will benefit from compression/dedup of attachments, and you will definitely be able to do "cross domain search" as other applications will be doing the same (storing their data in HCP).