Pentaho


 Pentaho fills all available memory when running a simple job

Giuseppe La Rosa posted 10-21-2018 21:46

Hi everyone,

I would like to ask the Pentaho community for help in solving an issue I have with Pentaho.

I'm using Pentaho 7.1 - Community Edition.

The problem is that Pentaho very often fills all the available memory and gets stuck when running jobs that contain other jobs inside.

 

To clarify what I mean, I reproduced the issue using a simple example (you can find it attached to this post).


It is just a simple job (master_job.kjb) that calls two other jobs, each of which contains two transformations.

The first job reads an input file (1 million rows, 13 fields, about 90 MB), adds a new field, concatenates two of the fields into a new one, and writes everything to a file. The second job reads the previous output, does some string substitution, and then writes a second file.

 

The problem is that when I run this simple job in Pentaho, it gets stuck after executing the first job, filling all my memory and never advancing to the second job.

 


 

I really cannot understand what is happening here. To me, this looks like a pretty simple task, but for some reason Pentaho is not able to perform it.

 

FYI: I have already done some research and increased the Java Virtual Machine memory for PDI by setting the environment variable "PENTAHO_DI_JAVA_OPTIONS" to "-Xms2048m -Xmx8g -XX:MaxPermSize=1024m", without any benefit.
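In case it helps, this is roughly how I set it (the exact commands depend on the OS; the PDI launch scripts such as spoon.sh/Spoon.bat pick the variable up at startup):

# Linux/macOS: set before launching Spoon or Kitchen
export PENTAHO_DI_JAVA_OPTIONS="-Xms2048m -Xmx8g -XX:MaxPermSize=1024m"
./spoon.sh

rem Windows: set before launching Spoon.bat
set PENTAHO_DI_JAVA_OPTIONS=-Xms2048m -Xmx8g -XX:MaxPermSize=1024m
Spoon.bat

(As far as I understand, on Java 8 the JVM ignores -XX:MaxPermSize anyway and just prints a warning; the heap itself is governed by -Xmx.)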

 

I really hope you can give me some help/advice.

 

Best regards,

 

 

Giuseppe


#Kettle
#PentahoDataIntegrationPDI
#Pentaho
Virgilio Pierini

Hi Giuseppe,

Actually, it still crashes even on 8.0 :-) and it's a matter of cardinality: 1,000 rows are fine, 10,000 are OK with a bit more memory, 100,000 crash, and so on.

This issue has been around for quite a long time; you can read something about getting Results from Jobs here:

- How Do I calculate "Copy rows to result" memory limit?

- [PDI-7453] Java Heap Errors - Looping Over A Sub Job - Pentaho Platform Tracking

Can you say a bit more about your use case? In a standard ETL approach, the two transformations in the test you submitted would simply be merged together, but maybe I'm missing the bigger picture...

Regards

Virgilio

Diego Mainou

Poor design.

Use one job to calculate the parameters and a sub-job to execute them. Look for the Job Executor step.

Giuseppe La Rosa

First of all, thanks for your answers.

I would like to point out that what I provided here is just an example that reproduces the problem I usually have when I run jobs that call other jobs; I simply tried to simplify my use case.

For sure, if I put everything in the example into one single transformation, the problem doesn't show up and the transformation runs fast and flawlessly.

My question is: why do I get these memory problems when I refactor this simple transformation by splitting it into jobs that call jobs? What is happening inside Pentaho that creates this issue just by refactoring? Am I missing some knowledge about how nested jobs work?

I just thought that refactoring transformations into jobs was not a problem, that it was just a good way to reorganize things, but maybe I'm wrong here and I should use nested jobs only for specific needs. I am really puzzled.

Thanks again for your replies.

Diego Mainou

So, big picture: what you are doing is processing, say, 1 billion rows, loading them all into memory (your issue), passing them in one go to the next job and transformation, and so on. You may or may not have enough memory.

What I am suggesting is that you process the billion rows and output them into table A.

The next job reads table A, massages it, and outputs to table B, and so forth (let's call this process X).

Further to this, if you need a loop that runs process X with values 1, 2, 3 (tomorrow 4, 5, 6, and so on), you would:

1. Generate a transformation that determines the values for today (i.e. 1, 2, 3).

2. Pass the values from step 1, one row at a time, to sub-job/transformation Y.

Use "Copy rows to result" sparingly.
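To make that concrete (the names below are purely illustrative, not taken from your example): inside Spoon, the loop in step 2 is typically wired with a Job Executor step that runs sub-job Y once per incoming row; the command-line equivalent is to hand each value to Kitchen as a named parameter instead of copying large row sets to result:

# run sub-job Y once per value produced by the step-1 transformation (illustrative path and parameter name)
./kitchen.sh -file=/path/to/sub_job_Y.kjb -param:CURRENT_VALUE=1 -level=Basic
./kitchen.sh -file=/path/to/sub_job_Y.kjb -param:CURRENT_VALUE=2 -level=Basic
./kitchen.sh -file=/path/to/sub_job_Y.kjb -param:CURRENT_VALUE=3 -level=Basic

That way only a handful of parameter values travels between jobs, not the millions of data rows.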

Diego Mainou

Look, this job literally moves about 60-odd million rows every time it is run, without issues.

CERN is using Pentaho to calculate data on particles. The problem is the method you are using to link your data.

