Add JAR files to a Spark job - spark-submit

Java, Scala, Apache Spark, Jar, Spark Submit

Java Problem Overview


True... it has been discussed quite a lot.

However, there is a lot of ambiguity in the answers provided, including some that duplicate JAR references in the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, or omitted details should be clarified for each option:

  • How ClassPath is affected
    • Driver
    • Executor (for the running tasks)
    • Both
    • Not at all
  • Separation character: comma, colon, semicolon
  • If provided files are automatically distributed
    • for the tasks (to each executor)
    • for the remote Driver (if run in cluster mode)
  • Type of URI accepted: local file, HDFS, HTTP, etc.
  • If copied into a common location, where that location is (HDFS, local?)
The options that it affects:
  1. --jars
  2. SparkContext.addJar(...) method
  3. SparkContext.addFile(...) method
  4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
  5. --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
  6. --conf spark.executor.extraClassPath=...
  7. --conf spark.executor.extraLibraryPath=...
  8. Not to forget, the last parameter of spark-submit is also a .jar file.

I am aware of where to find the main Apache Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left quite a few holes for me, even though it was partially answered there.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from the documentation, it seems that --jars and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that for simplicity, I can add additional application JAR files using the three main options at the same time?

spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

I found a nice article in an answer to another posting. However, I learned nothing new from it. The poster does make a good remark on the difference between a local driver (yarn-client) and a remote driver (yarn-cluster), which is definitely important to keep in mind.

Java Solutions


Solution 1 - Java

ClassPath:

ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:

  • spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the node running the driver.
  • spark.executor.extraClassPath to set extra class path on the Worker nodes.

If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.
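For example, a minimal sketch (the paths and names are placeholders) that puts the same dependency on both the driver and the executor classpath. Note that extraClassPath does not copy anything, so the JAR must already exist at that path on every node:

# Hypothetical path; the JAR must already be present at this location
# on the driver node and on every worker node.
spark-submit \
  --driver-class-path /opt/libs/mydep.jar \
  --conf spark.executor.extraClassPath=/opt/libs/mydep.jar \
  --class MyClass main-application.jar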

Separation character:

Following the same rules as the JVM:

  • Linux: A colon, :
    • e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
  • Windows: A semicolon, ;
    • e.g: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"

File distribution:

This depends on the mode which you're running your job under:

  1. Client mode - Spark fires up a Netty HTTP server which distributes the files on startup to each of the worker nodes. You can see that when you start your Spark job:

    16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
    16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
    16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
    
  2. Cluster mode - In cluster mode, Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JAR files available to all the worker nodes via HDFS, S3, or other sources which are available to all nodes.
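For example, a rough sketch (the HDFS paths are hypothetical) of staging a dependency on HDFS first, so that every node can fetch it in cluster mode:

# Stage the dependency somewhere all nodes can reach, e.g. HDFS
hdfs dfs -mkdir -p /libs
hdfs dfs -put additional1.jar /libs/

# Reference it by an HDFS URI when submitting in cluster mode
spark-submit --deploy-mode cluster \
  --jars hdfs:///libs/additional1.jar \
  --class MyClass main-application.jar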

Accepted URIs for files

In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

> When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:
>
> - file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
> - hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
> - local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

> Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
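To make those schemes concrete, here is a small sketch (the paths are purely illustrative) mixing the prefixes described above:

# file:  - served by the driver's file server; executors pull it from the driver
# hdfs:  - each node pulls it down from HDFS
# local: - expected to already exist at this path on every worker; nothing is copied
spark-submit \
  --jars file:///opt/libs/a.jar,hdfs:///libs/b.jar,local:/opt/shared/c.jar \
  --class MyClass main-application.jar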

As noted, JAR files are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work; you'll see them like this:

drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033

And when you look inside, you'll see all the JAR files you deployed along with the application:

[*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
[*@*]$ ll
total 89988
-rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
-rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
-rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
-rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
-rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
-rw-r--r-- 1 spark spark        0 May  8 17:34 stdout

Affected options:

The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

> Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
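As a hedged illustration of that precedence chain (the property and values are just examples):

# Lowest precedence: conf/spark-defaults.conf
#   spark.executor.memory  2g

# A spark-submit flag overrides the defaults file:
spark-submit --conf spark.executor.memory=4g --class MyClass main-application.jar

# Highest precedence: a property set directly on the SparkConf in application code,
# e.g. conf.set("spark.executor.memory", "8g"), overrides both of the above.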

Let’s analyze each option in the question:

  • --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and the other via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR file to your driver/executor classpath. You'll need to explicitly add them using the extraClassPath configuration on both.
  • SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code (a spark-submit-level analogue is sketched just after this list).
  • --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases, and it doesn't matter which one you choose.
  • --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...: Same as above, aliases.
  • --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an über JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
  • --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.
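As a rough spark-submit-level counterpart to the addJar/addFile distinction above (file names are placeholders): --jars ships a code dependency, while --files ships an arbitrary file to each executor's working directory:

# --jars:  a runtime code dependency, analogous to SparkContext.addJar
# --files: a plain data file, analogous to SparkContext.addFile
spark-submit \
  --jars mylib.jar \
  --files lookup-table.csv \
  --class MyClass main-application.jar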

> Would it be safe to assume that for simplicity, I can add additional application JAR files using the 3 main options at the same time?

You can safely assume this only for Client mode, not Cluster mode, as I've previously said. Also, the example you gave has some redundant arguments. For example, passing JAR files to --driver-library-path is useless. You need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, when you deploy external JAR files on both the driver and the workers, what you want is:

spark-submit --jars additional1.jar,additional2.jar \
  --driver-class-path additional1.jar:additional2.jar \
  --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

Solution 2 - Java

Another approach, in Apache Spark 2.1.0, is to use --conf spark.driver.userClassPathFirst=true during spark-submit, which changes the priority of dependency loading, and thus the behavior of the Spark job, by giving priority to the JAR files the user is adding to the classpath with the --jars option.
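A minimal sketch of that approach (the JAR name is a placeholder); note that there is also an executor-side counterpart, spark.executor.userClassPathFirst:

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars my-patched-dependency.jar \
  --class MyClass main-application.jar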

Solution 3 - Java

There is a restriction on using --jars: if you want to specify a directory for the location of the JAR/XML files, directory expansion is not allowed. This means you need to specify an absolute path for each JAR file.

If you specify --driver-class-path and you are executing in yarn cluster mode, then the driver classpath doesn't get updated. You can verify whether the classpath has been updated in the Spark UI or the Spark history server, under the Environment tab.

The option that worked for me to pass JAR file paths containing directory expansions, and that worked in yarn cluster mode, was the --conf option. It's better to pass the driver and executor class paths as --conf, which adds them to the Spark session object itself, and those paths are reflected in the Spark configuration. But please make sure to put the JAR files on the same path across the cluster.

spark-submit \
  --master yarn \
  --queue spark_queue \
  --deploy-mode cluster    \
  --num-executors 12 \
  --executor-memory 4g \
  --driver-memory 8g \
  --executor-cores 4 \
  --conf spark.ui.enabled=False \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapred.output.dir=/tmp \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp

Solution 4 - Java

Other configurable Spark options relating to JAR files and the classpath, in the case of yarn as the deploy mode, are as follows.

From the Spark documentation,

> spark.yarn.jars
>
> List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

> spark.yarn.archive
>
> An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

Users can configure these parameters to specify their JAR files, which in turn get included in the Spark driver's classpath.
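For example, a hedged sketch of the two variants (the HDFS locations are hypothetical and must be populated beforehand):

# Variant 1: point YARN at JARs already uploaded to HDFS (globs are allowed)
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.yarn.jars=hdfs:///spark/jars/*.jar" \
  --class MyClass main-application.jar

# Variant 2: ship a single archive instead; if set, it replaces spark.yarn.jars
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///spark/spark-libs.zip \
  --class MyClass main-application.jar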

Solution 5 - Java

When using spark-submit with --master yarn-cluster, the application JAR file, along with any JAR files included with the --jars option, will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths.

Example:

spark-submit --master yarn-cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

Reference

Submitting Applications

Solution 6 - Java

When we submit Apache Spark jobs using the spark-submit utility, there is an option, --jars. Using this option, we can pass JAR files to Spark applications.

Solution 7 - Java

I know that adding a JAR with the --jars option automatically adds it to the classpath as well.

https://spark.apache.org/docs/3.2.1/submitting-applications.html

> That list is included in the driver and executor classpaths.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | YoYo | View Question on Stackoverflow
Solution 1 - Java | Yuval Itzchakov | View Answer on Stackoverflow
Solution 2 - Java | Stanislav | View Answer on Stackoverflow
Solution 3 - Java | Tanveer | View Answer on Stackoverflow
Solution 4 - Java | DaRkMaN | View Answer on Stackoverflow
Solution 5 - Java | Shiva Garg | View Answer on Stackoverflow
Solution 6 - Java | bala | View Answer on Stackoverflow
Solution 7 - Java | Heedo Lee | View Answer on Stackoverflow