
How to launch more than 2 executors when using AEL to run a KTR on Spark?

Fisher Hao posted 02-11-2018 02:42

Dear all,

I'm trying to use PDI AEL to run a transformation with the Spark engine on YARN.

But I found that the transformation was running very slowly and only two executors were launched. I tried adding the following two lines to the AEL application.properties:

sparkNumExecutors=8
sparkExecutorCores=4

It seems those two parameters were ignored by AEL. The AEL daemon.log shows the following:

[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    : Parsed arguments:
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   master                  yarn[5]
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   deployMode              null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   executorMemory          4g
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   executorCores           null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   totalExecutorCores      null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   propertiesFile          /root/spark-2.1.0-bin-hadoop2.7/kettleConf/spark-defaults.conf
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverMemory            4g
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverCores             null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraClassPath    null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraLibraryPath  null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraJavaOptions  -Duser.dir=/root/data-integration -Djava.library.path=/root/data-integration/libswt/win64 -Dlog4j.configuration=file:/root/data-integration/classes/log4j.xml
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   supervise               false
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   queue                   null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   numExecutors            null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   files                   null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   pyFiles                 null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   archives

How can I launch more executors to make the transformation run faster?

I appreciate your feedback and help!

Fisher Hao


#Kettle
#Pentaho
#PentahoDataIntegrationPDI
Christopher Caspanello

What version of PDI are you using? The 8.0 release?

If so, we did not expose those properties directly, so they cannot be used in the application.properties file. However, when the transformation is run, a configuration file is generated each time at $SPARK_HOME/kettleConf/spark-defaults.conf. If you set the overwriteConfig property to false, AEL will no longer write a new file, which lets you add any Spark property to that configuration file.
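For example, a minimal sketch of that change in the AEL application.properties (only the property named above; the rest of the file stays as shipped):

# Keep AEL from regenerating spark-defaults.conf on each run,
# so manual additions to that file are preserved
overwriteConfig=false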

 

If I recall correctly, these are the properties I added/modified when I was working on some performance testing:

  • spark.dynamicAllocation.enabled
  • spark.executor.memory
  • spark.executor.cores

 

See the Apache Spark documentation for more details: https://spark.apache.org/docs/latest/configuration.html
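To make that concrete, here is a hedged sketch of what the additions to $SPARK_HOME/kettleConf/spark-defaults.conf might look like. The property names come from the list above, but the values are illustrative assumptions that should be tuned to your cluster:

# Illustrative values only, not recommendations
# Dynamic allocation lets Spark scale the executor count up and down
# (on YARN it also requires the external shuffle service to be enabled)
spark.dynamicAllocation.enabled  true
spark.executor.memory            4g
spark.executor.cores             4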

 

In 8.1 we included the ability to add any spark.* property directly to the application.properties file, or as transformation parameters, to make this easier.
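For example (a sketch assuming the 8.1 pass-through behavior described above; the values are again illustrative), the same tuning could then go straight into application.properties:

# Sketch for 8.1+: spark.* keys are passed through to Spark as described above
spark.executor.memory=4g
spark.executor.cores=4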

Fisher Hao

Hey Christopher

Thanks for your reply. I tried again following your instructions and it works (PDI 8 + Hadoop 2.8.3 + JDK 8 + Spark 2.1). The AEL daemon starts 4 executors, each with 4 cores, and it runs faster than before.

But one problem remains. I'm actually trying to do a simple word-count transformation on the Spark engine, so at the end of the transformation I need a "Group By" step to count the rows per word. The "Group By" step can't run in parallel and is very slow, while all the preceding steps, such as "Hadoop File Input" and "Split Fields to Rows", run in parallel and are very fast.

Is there a way to make "Group By" run in parallel?

Fisher Hao

Diethard Steiner

Currently not, unfortunately. I created a Jira case some time ago requesting support for running the Group By step in parallel. You might want to watch that Jira case to hopefully see it resolved in the not-so-distant future.

Fisher Hao

Hey Diethard

Thanks for your comment! I'm looking forward to seeing this solved in PDI 8.1.

Christopher Caspanello

Chun Peng Hao, we did in fact build a Spark-native operation for the Group By step. Stay tuned for the 8.1 release.

Fisher Hao

Christopher

That's great, I'm looking forward to it, thanks!

Christopher Caspanello

Chun Peng Hao & Diethard Steiner - the 8.1 release will be available soon. Make sure to check out the documentation at that time; there are a few known issues that we will be working to address in the near future.

https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/Group_By

Note: this link will not be active until 8.1 is officially released.


Fisher Hao

Hi Christopher

Got it, thanks!

Diethard Steiner

Great news!