A nice feature of PySpark applications is that you don't have to invoke spark-submit manually. Instead, when you instantiate a SparkContext, it takes care of running spark-submit ... pyspark-shell for you.
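For example, in a plain Python script, simply creating the context is enough to trigger that launch. A minimal sketch (the app name is just an illustration):

from pyspark import SparkContext

# Instantiating the context is what launches
# `spark-submit ... pyspark-shell` behind the scenes.
sc = SparkContext(appName='demo')
print(sc.parallelize(range(10)).sum())  # 45
sc.stop()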
Sometimes, however, you may want to customize that launch a bit. For instance, it is useful to tell spark-submit to include specific libraries on the classpath, which is a pretty cool feature in and of itself because you can provide Maven coordinates. In cases like that, you can set PYSPARK_SUBMIT_ARGS, for instance:
import os

# Must be set before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.11.228,'
    'org.apache.hadoop:hadoop-aws:2.6.0 '
    'pyspark-shell'
)
By default, only pyspark-shell is appended to the spark-submit command; with PYSPARK_SUBMIT_ARGS you get to control the arguments that are used instead. Note that the value should still end with pyspark-shell, since that is what spark-submit ultimately runs.
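Putting it all together, here is a minimal sketch of the full flow. The environment variable has to be set before the SparkContext is created (that is when spark-submit is actually invoked), and the s3a bucket and path below are purely hypothetical:

import os

# Set before creating the SparkContext, since that is when
# spark-submit is launched.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.11.228,'
    'org.apache.hadoop:hadoop-aws:2.6.0 '
    'pyspark-shell'
)

from pyspark import SparkContext

sc = SparkContext(appName='s3-demo')

# hadoop-aws provides the s3a:// filesystem; this bucket is hypothetical.
lines = sc.textFile('s3a://my-bucket/some/data.txt')
print(lines.count())
sc.stop()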