Customizing the pyspark app startup script

November 10, 2017


A nice feature of pyspark applications is that you don’t have to run spark-submit manually. Instead, when you create a SparkContext, it takes care of running spark-submit ... pyspark-shell for you.

Sometimes, however, you may want to customize that launch a bit. For instance, it is useful to tell spark-submit to include specific libraries in the classpath. That is a pretty cool feature in itself, because you can specify the libraries by their Maven coordinates.

In cases like that, you can set the PYSPARK_SUBMIT_ARGS environment variable, for instance:

import os
# Must be set before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.11.228,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
)

By default, pyspark-shell is the only argument appended to the spark-submit command. Setting PYSPARK_SUBMIT_ARGS replaces that default with whatever you provide, which is why the example above still ends with pyspark-shell.
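
If you set this variable in more than one place, it may be easier to build the string from a list of Maven coordinates than to edit it by hand. The following is only a minimal sketch: build_submit_args is a hypothetical helper, not part of pyspark, and it assumes that the value should always end with pyspark-shell, as in the example above.

```python
import os


def build_submit_args(packages, extra_args=()):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper, not a pyspark API).

    packages:   iterable of Maven coordinates, e.g. 'group:artifact:version'
    extra_args: any other spark-submit arguments, passed through unchanged
    """
    parts = []
    packages = list(packages)
    if packages:
        # spark-submit expects --packages as a single comma-separated list
        parts += ["--packages", ",".join(packages)]
    parts += list(extra_args)
    # Assumption: keep pyspark-shell as the final argument
    parts.append("pyspark-shell")
    return " ".join(parts)


# Set the variable before any SparkContext is created
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([
    "com.amazonaws:aws-java-sdk:1.11.228",
    "org.apache.hadoop:hadoop-aws:2.6.0",
])
```

With no packages and no extra arguments, the helper returns just pyspark-shell, which matches the default behaviour described above.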

