check driver heap limit error
Marek Kaszycki posted 10-12-2018 08:06

I have a workflow that essentially works (tests complete just fine), but it's unable to run as a task.

I have an HCI cluster with 5 nodes, dedicated to this single workflow. Each of them has 4 vCPUs, 16 GB RAM, 24 GB of swap space, and enough disk for everything (the filesystem for Docker is between 26% and 49% used per node).

My settings are 3072m (or 3g) for both the driver heap limit and the executor heap limit, so well within the available memory and within the limit for swap space. But when I try to run the task, I run into this error:

Check Driver Heap limit

Please confirm the Driver Heap limit setting is tuned appropriately in the workflow task settings under Memory. If running on Fedora OS, also be sure to enable swap memory on all instances. Restart the workflow task. If the problem persists, contact your authorized service provider.

The error doesn't tell me if it's too little or too much. The files to be indexed are well within this limit (I don't think any of them are more than 30 MB).

While the task is running, the metrics don't budge from zero; nothing related to performance is listed, no aggregations or anything.

Also, the error doesn't appear until midnight, so I can't tell whether my changes to the settings are doing anything before I come into the office the next day, which wastes time unnecessarily.

I'm stuck and I don't know how to move forward with this. I'd appreciate anything to help me move forward.
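For what it's worth, a quick way to sanity-check the memory and swap actually visible on each node is to read /proc/meminfo. This is generic Linux, not anything HCI-specific:

```python
# Quick sanity check of the RAM and swap visible on a node.
# Generic Linux only (/proc/meminfo); nothing here is HCI-specific.

def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            # Values are reported as e.g. "16314880 kB"
            info[key.strip()] = int(rest.split()[0])
    return info

if __name__ == "__main__":
    mem = read_meminfo()
    total_gb = mem["MemTotal"] / (1024 * 1024)
    swap_gb = mem["SwapTotal"] / (1024 * 1024)
    print(f"RAM:  {total_gb:.1f} GB")
    print(f"Swap: {swap_gb:.1f} GB (0.0 means swap is not enabled)")
```

Running this on every node confirms the 16 GB / 24 GB figures are what the OS actually sees inside the instances.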


#HitachiContentIntelligenceHCI
Jonathan Chinitz

Marek:

1. Copy and Paste the error from the Task window into this note.

2. What connector are you using?

3. How many documents do you expect the workflow to process?

4. Is the Workflow Agent job configured to run on ALL instances?

5. What are the task settings? Default or Custom? If Custom pls paste them here.

Marek Kaszycki

1. Title: "Check Driver Heap limit"

Description:

"Please confirm the Driver Heap limit setting is tuned appropriately in the workflow task settings under Memory. If running on Fedora OS, also be sure to enable swap memory on all instances. Restart the workflow task. If the problem persists, contact your authorized service provider."

2. HCP MQE.

3. I think it's a few million, let's say 2.5 million now.

4. I think so, I never disabled it, but I don't know how to check.

5. I used the defaults, but I eventually customized them. Here are all the settings:

Check for Updates: No (disabled from default)

Workflow-Agent Recursion Enabled: Yes (limit: 50) (default)

Workflow Preprocessing Recursion Enabled: Yes (limit: 50) (default)

Performance: Default (used to be "Custom", but I didn't actually customize any settings)

Halt task after set amount of failures: No

Collect Aggregation Metrics: Yes

Collect Historical Metrics: No

Driver Heap Limit: 3g

Executor Heap Limit: 3g
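As a back-of-envelope check of those settings against a 16 GB node (the overhead figures below are illustrative assumptions, not HCI-documented values):

```python
# Back-of-envelope memory budget for one node. The overhead figures
# below are illustrative assumptions, not HCI-documented values.

node_ram_gb = 16
driver_heap_gb = 3       # Driver Heap Limit: 3g
executor_heap_gb = 3     # Executor Heap Limit: 3g
jvm_overhead = 0.10      # assumed ~10% off-heap overhead per JVM
os_and_services_gb = 4   # assumed OS, Docker, and other node services

jvm_total = (driver_heap_gb + executor_heap_gb) * (1 + jvm_overhead)
budget = jvm_total + os_and_services_gb
headroom = node_ram_gb - budget

print(f"JVM footprint: ~{jvm_total:.1f} GB")
print(f"Total budget:  ~{budget:.1f} GB of {node_ram_gb} GB")
print(f"Headroom:      ~{headroom:.1f} GB")
```

Under these assumptions the heap settings fit, but the headroom shrinks quickly if the other services on the node need more than assumed here.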

Jonathan Chinitz

Are you crawling the entire cluster? A specific Tenant? A specific namespace? Have you tried using the HCP connector instead of MQE?

Marek Kaszycki

Not the entire cluster, one specific machine. A specific tenant, and a specific namespace. The connection works.

I would prefer not to use the HCP connector in place of HCP MQE because we are definitely going to use MQE in the end, and it would be like switching horses midstream.

I can change this setting, but can it really help anything with this particular error? Could you point me to the support document that shows this as a potential solution?

Jonathan Chinitz

Basic troubleshooting -- swap the MQE connector for the HCP one and see if you get the same error. Should take you 60 seconds.

Marek Kaszycki

Ok, done. It's not doing anything.

As I mentioned, for some reason every HCI task I run doesn't do anything until midnight, at which point it informs me that there's an error.

It's been running for 5 minutes now, 0 bytes read. As much as I'd like to stay, I have to leave for today, so I'll pick this up on Monday.

Thanks for your help so far.

Jonathan Chinitz

Marek:

Open a Support Case please.

Jonathan Chinitz

And upload the logs from the nodes to the case.

Jared Cohen

The message you are reporting indicates a probable OutOfMemory error.

This is most likely a resource issue. While 16 GB is the minimum required memory footprint per node and will work for some use cases, it is not enough for most, which is why we recommend at least 32 GB of memory per node.

If you are able, test this out on a system with more memory. You can increase the driver and executor heap in the workflow, although that may not be necessary.

If this doesn't help, then as Jon said we'll need to triage via an escalation.
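One way to gather evidence for the OutOfMemory theory before escalating is to scan the node logs for JVM OOM markers. The log location varies by install, so the sketch below takes whatever directory you point it at (the demo creates a fabricated sample log just so it runs standalone):

```python
# Scan a directory of logs for JVM out-of-memory markers.
# The log directory path is whatever your install uses; the demo
# below fabricates a sample log so the script is self-contained.
import pathlib
import tempfile

OOM_MARKERS = ("java.lang.OutOfMemoryError", "GC overhead limit exceeded")

def find_oom_lines(log_dir):
    """Return (file, line_number, line) tuples mentioning an OOM marker."""
    hits = []
    for path in pathlib.Path(log_dir).rglob("*.log"):
        lines = path.read_text(errors="replace").splitlines()
        for i, line in enumerate(lines, 1):
            if any(marker in line for marker in OOM_MARKERS):
                hits.append((path.name, i, line.strip()))
    return hits

if __name__ == "__main__":
    # Demo with a fabricated log line; point log_dir at real logs instead.
    with tempfile.TemporaryDirectory() as d:
        sample = pathlib.Path(d) / "workflow-task.log"
        sample.write_text("INFO starting task\n"
                          "ERROR java.lang.OutOfMemoryError: Java heap space\n")
        for name, lineno, text in find_oom_lines(d):
            print(f"{name}:{lineno}: {text}")
```

If the scan turns up OOM lines, attach those log excerpts to the support case along with the full node logs.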