Hitachi Content Platform​


 How to debug a workflow

  • Object Storage
  • Hitachi Content Intelligence HCI
  • Hitachi Content Platform HCP
Hedde van der Hoeven posted 09-27-2019 10:50

Dear community,

 

A while ago we had an issue where one single HCP object would kill a workflow.

I went through this with HCI engineering and they managed to isolate the actual object causing issues so we could exclude it.

 

I have a similar issue but I can't remember exactly how to set it up.

 

I think these were roughly the steps.

 

1) Set logging mode - I don't remember where?

2) Set the job to use one instance

3) Set the workflow to process 1 object at a time

4) Tail the log - which log, and where: in the container or on the host?

 

Can you please help, or if there is an easier way, please share :)

 

Hedde


#HitachiContentIntelligenceHCI
#HitachiContentPlatformHCP
#ObjectStorage
Jonathan Chinitz

Hedde:

 

The bad news -- I don't know all the details of how to do this. I will get them to you.

The good news -- in HCI 1.5 (next month) we added a feature called "Stall Detection" that will do all this for you :-). The same way that you can monitor the progress of jobs in the Task UI, it will now show you which document is taking "too long".

Jared Cohen

Hi Hedde,

After we dealt with that issue on your system, we added the notification stages to help with this in the future.

The basic process to catch documents getting stuck is:

  1. Spin up a syslog server somewhere that your cluster has network access to (can even be on one of the HCI instances if you want)
  2. Add a Syslog Notification stage to the beginning of your pipeline with a message similar to STARTED pipeline: ${HCI_URI}
  3. Add a Syslog Notification stage to the end of your pipeline with a message similar to ENDED pipeline: ${HCI_URI}
  4. Now you can run the workflow, and your syslog server will have logs of every document that entered and exited the pipeline.
  5. There are a number of ways to compare and find the first document that entered but did not exit the pipeline - that's the one that was stuck. I think we put the syslog lines into an Excel sheet, sorted them somehow, and eyeballed it to find the outlier. The exact details are a bit rusty because it was a while ago.
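The comparison in step 5 can also be scripted instead of eyeballed in Excel. Here's a minimal sketch in Python, assuming the syslog lines contain the STARTED/ENDED messages from the notification stages above (the sample lines and exact message format are just illustrative):

```python
import re

def find_stuck(lines):
    """Return URIs that entered the pipeline but never exited, in entry order."""
    started, ended = [], set()
    for line in lines:
        # Match the messages emitted by the two Syslog Notification stages
        m = re.search(r"(STARTED|ENDED) pipeline: (\S+)", line)
        if m:
            state, uri = m.groups()
            if state == "STARTED":
                started.append(uri)
            else:
                ended.add(uri)
    return [uri for uri in started if uri not in ended]

if __name__ == "__main__":
    # Illustrative sample log lines - obj2 entered but never exited
    sample = [
        "Sep 27 10:50:01 hci STARTED pipeline: hcp://ns/obj1",
        "Sep 27 10:50:02 hci ENDED pipeline: hcp://ns/obj1",
        "Sep 27 10:50:03 hci STARTED pipeline: hcp://ns/obj2",
    ]
    print(find_stuck(sample))
```

With a small batch (as Jared suggests below), the returned list should contain only the stuck document.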

 

One key to making this easier is to try to do this once you know you are close to the failing document. We had reduced the batch size and paused the workflow when we knew it was on the batch causing problems so that there weren't tons of documents in the batch. That makes finding the outlier in the logs quicker.

 

Hope this helps,

-Jared

Hedde van der Hoeven

Hi Jared, thanks for your reply.

 

We use Ansible-managed syslog configurations, so I can't make any changes to this on our Linux instances.

 

Is there a way to do this the "old fashioned" way? If not, I'll have to jump through some hoops :)

 

Cheers,

 

Hedde

Hedde van der Hoeven

Don't worry, got it working :)