Hi Hedde,
After we dealt with that issue on your system, we added the notification stages to help with this in the future.
The basic process to catch documents getting stuck is:
- Spin up a syslog server somewhere that your cluster has network access to (can even be on one of the HCI instances if you want)
- Add a Syslog Notification stage to the beginning of your pipeline with a message similar to STARTED pipeline: ${HCI_URI}
- Add a Syslog Notification stage to the end of your pipeline with a message similar to ENDED pipeline: ${HCI_URI}
- Now you can run the workflow, and your syslog server will have logs of every document that entered and exited the pipeline.
- There are a number of ways to compare the two sets of lines and find the first document that entered but did not exit the pipeline; that's the one that got stuck. I think we put the syslog lines into an Excel sheet, sorted them somehow, and eyeballed it to find the outlier. The exact details are a bit rusty because it was a while ago.
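If you'd rather not do the comparison by hand, something like the following Python sketch could do it. It is not what we actually ran; it assumes each syslog line contains the message text from the two stages above ("STARTED pipeline: <uri>" or "ENDED pipeline: <uri>") and that the URI contains no spaces. The function name find_stuck is just for illustration.

```python
import re

# Assumed message format, based on the stage messages suggested above:
#   "<syslog prefix> STARTED pipeline: <uri>"
#   "<syslog prefix> ENDED pipeline: <uri>"
LINE_RE = re.compile(r"\b(STARTED|ENDED) pipeline: (\S+)")


def find_stuck(lines):
    """Return URIs that have a STARTED line but no ENDED line.

    Results are in the order the STARTED lines first appeared, so the
    first entry is the earliest document that never left the pipeline.
    """
    started = {}   # dict used as an ordered set of URIs
    ended = set()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated syslog line
        event, uri = match.groups()
        if event == "STARTED":
            started.setdefault(uri, None)
        else:
            ended.add(uri)
    return [uri for uri in started if uri not in ended]


# Example use against a saved log file (the file name is hypothetical):
#   with open("pipeline-syslog.log", encoding="utf-8") as f:
#       for uri in find_stuck(f):
#           print(uri)
```

Collecting all ENDED lines before filtering means it doesn't matter if the syslog lines arrive slightly out of order.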
One key to making this easier is to try to do this once you know you are close to the failing document. We had reduced the batch size and paused the workflow when we knew it was on the batch causing problems so that there weren't tons of documents in the batch. That makes finding the outlier in the logs quicker.
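Also, if you don't already have a syslog server for the first step, a throwaway listener is enough for this kind of debugging. Here is a minimal Python sketch, assuming the notification stage can send plain syslog messages over UDP (I haven't confirmed which transports it supports). The port 5514, the log file name, and the names SyslogHandler and make_server are all made up for illustration.

```python
import socketserver


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append each received UDP datagram to the server's log file."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        message = data.decode("utf-8", errors="replace").strip()
        with open(self.server.log_path, "a", encoding="utf-8") as log:
            log.write(message + "\n")


def make_server(log_path, host="0.0.0.0", port=5514):
    """Create a UDP listener that writes one line per message to log_path.

    Port 5514 is an arbitrary unprivileged choice; the standard syslog
    port 514 usually needs root.
    """
    server = socketserver.UDPServer((host, port), SyslogHandler)
    server.log_path = log_path
    return server


# To run it until interrupted (file name is hypothetical):
#   make_server("pipeline-syslog.log").serve_forever()
```

UDP can drop messages under load, so treat a missing ENDED line as a strong hint rather than proof, and re-run on a small batch to confirm.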
Hope this helps,
-Jared