
How to launch more than 2 executors when using AEL to run a KTR on Spark?

Fisher Hao posted 02-11-2018 02:42

Dear all,

I'm trying to use PDI AEL to run a transformation with the Spark engine on YARN.

But I found that the transformation was running very slowly and only two executors were launched. I tried adding the following two lines to the AEL application.properties:

sparkNumExecutors=8
sparkExecutorCores=4

It seems those two parameters were ignored by AEL. The AEL daemon.log shows the following:

[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    : Parsed arguments:
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   master                  yarn[5]
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   deployMode              null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   executorMemory          4g
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   executorCores           null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   totalExecutorCores      null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   propertiesFile          /root/spark-2.1.0-bin-hadoop2.7/kettleConf/spark-defaults.conf
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverMemory            4g
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverCores             null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraClassPath    null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraLibraryPath  null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   driverExtraJavaOptions  -Duser.dir=/root/data-integration -Djava.library.path=/root/data-integration/libswt/win64 -Dlog4j.configuration=file:/root/data-integration/classes/log4j.xml
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   supervise               false
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   queue                   null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   numExecutors            null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   files                   null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   pyFiles                 null
[launcher-proc-1] o.a.s.launcher.app.SparkWebSocketMain    :   archives

How can I launch more executors to make the transformation run faster?

I appreciate your feedback and help!

Fisher Hao


#Kettle
#Pentaho
#PentahoDataIntegrationPDI
Christopher Caspanello

What version of PDI are you using? The 8.0 release?

If so, we did not expose those properties directly, so they cannot be used in the application.properties file. However, when the transformation is run, a configuration file is generated each time at $SPARK_HOME/kettleConf/spark-defaults.conf. If you set the overwriteConfig property to false, AEL will no longer write a new file, which lets you add any Spark property to that configuration file.
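For example, a minimal sketch of that change in the AEL application.properties (only the property named above; the rest of the file stays as shipped):

# Keep AEL from regenerating spark-defaults.conf on each run,
# so manual additions to that file are preserved
overwriteConfig=false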

 

If I recall correctly, these are the properties I added/modified when I was working on some performance testing:

  • spark.dynamicAllocation.enabled
  • spark.executor.memory
  • spark.executor.cores

 

See the Apache Spark documentation for more details: https://spark.apache.org/docs/latest/configuration.html
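To make that concrete, here is a hedged sketch of what the additions to $SPARK_HOME/kettleConf/spark-defaults.conf might look like. The property names come from the list above, but the values are illustrative assumptions that should be tuned to your cluster:

# Illustrative values only, not recommendations
# Dynamic allocation lets Spark scale the executor count up and down
# (on YARN it also requires the external shuffle service to be enabled)
spark.dynamicAllocation.enabled  true
spark.executor.memory            4g
spark.executor.cores             4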

 

In 8.1 we included the ability to add any spark.* property directly to the application.properties file, or as transformation parameters, to make this easier.
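For example (a sketch assuming the 8.1 pass-through behavior described above; the values are again illustrative), the same tuning could then go straight into application.properties:

# Sketch for 8.1+: spark.* keys are passed through to Spark as described above
spark.executor.memory=4g
spark.executor.cores=4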

Fisher Hao

Hey Christopher

Thanks for your reply. I tried again following your instructions and it works (PDI 8 + Hadoop 2.8.3 + JDK 8 + Spark 2.1). The AEL daemon starts 4 executors, each with 4 cores, and it runs faster than before.

But one problem remains. I'm actually trying to do a simple word-count transformation on the Spark engine, so at the end of the transformation I need a "Group By" step to count the rows per word. The "Group By" step can't run in parallel and is very slow, while all the preceding steps, such as "Hadoop File Input" and "Split Fields to Rows", run in parallel and are very fast.

Is there a way to make "Group By" run in parallel?

Fisher Hao

Diethard Steiner

Currently not, unfortunately. I created a Jira case some time ago requesting support for running the Group By step in parallel. You might want to watch that Jira case to hopefully see it resolved in the not-so-distant future.

Fisher Hao

Hey Diethard

Thanks for your comment! I'm looking forward to seeing this solved in PDI 8.1.

Christopher Caspanello

Chun Peng Hao, we did in fact build a Spark-native operation for the Group By step. Stay tuned for the 8.1 release.

Fisher Hao

Christopher

That's great, I'm looking forward to it, thanks!

Christopher Caspanello

Chun Peng Hao & Diethard Steiner - the 8.1 release will be available soon. Make sure to check out the documentation at that time; there are a few known issues that we will be working to address in the near future.

https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/Group_By

Note: this link will not be active until 8.1 is officially released.


Fisher Hao

Hi Christopher

Got it, thanks!

Diethard Steiner

Great news!