A nice feature of PySpark applications is that you don't have to run spark-submit manually. Instead, when you instantiate a SparkContext, it takes care of running spark-submit ... pyspark-shell for you.
Sometimes, however, you may want to customize that launch a bit. For instance, it is useful to tell spark-submit to include specific libraries in the classpath; that is in and of itself a pretty cool feature, because you can provide them as Maven coordinates.
In cases like that, you can use the PYSPARK_SUBMIT_ARGS environment variable, for instance:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.11.228,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
By default, pyspark-shell would have been appended to the spark-submit command on its own. Using PYSPARK_SUBMIT_ARGS, you get to control what is passed instead of it; note that the value still needs to end with pyspark-shell.
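Putting it together, here is a minimal sketch. The key point is ordering: the variable must be set before pyspark is imported and the SparkContext is created, or the extra packages won't be picked up. The app name and package versions here are just illustrative placeholders.

```python
import os

# Must be set *before* pyspark is imported / the SparkContext is created,
# and the value has to end with 'pyspark-shell'.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk:1.11.228,'
    'org.apache.hadoop:hadoop-aws:2.6.0 '
    'pyspark-shell'
)

# Only after the variable is set:
# from pyspark import SparkContext
# sc = SparkContext(appName='s3-example')  # runs spark-submit with the args above
```

From this point on, classes from the listed packages (here, the AWS SDK and hadoop-aws connector) are available on the driver and executor classpaths.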