What does setMaster `local[*]` mean in spark?

Scala, Apache Spark

Scala Problem Overview


I found some code to start spark locally with:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ctx = new SparkContext(conf)

What does the [*] mean?

Scala Solutions


Solution 1 - Scala

From the doc:

./bin/spark-shell --master local[2]

> The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.

And from here:

> local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
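A minimal sketch of the same choice made in code rather than on the spark-shell command line (assuming Spark 2.x or later, where SparkSession is available):

import org.apache.spark.sql.SparkSession

// "local"    -> one worker thread
// "local[2]" -> two worker threads
// "local[*]" -> one worker thread per logical core
val spark = SparkSession.builder()
  .appName("test")
  .master("local[*]")
  .getOrCreate()

println(spark.sparkContext.defaultParallelism) // thread count used in local mode
spark.stop()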

Solution 2 - Scala

> Master URL : Meaning


local : Run Spark locally with one worker thread (i.e. no parallelism at all).


local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).


local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)


local[*] : Run Spark locally with as many worker threads as logical cores on your machine.


local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.


spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.


spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.


mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever you have configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.


yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

https://spark.apache.org/docs/latest/submitting-applications.html
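As a rough illustration, each of these master URLs is just a string handed to setMaster (or --master); the host name and port below are placeholders, not real endpoints:

import org.apache.spark.SparkConf

val localConf      = new SparkConf().setAppName("app").setMaster("local[4]")   // 4 worker threads
val retryingConf   = new SparkConf().setAppName("app").setMaster("local[4,3]") // 4 threads, spark.task.maxFailures = 3
val standaloneConf = new SparkConf().setAppName("app").setMaster("spark://master-host:7077") // hypothetical standalone master
val yarnConf       = new SparkConf().setAppName("app").setMaster("yarn")       // resolved via HADOOP_CONF_DIR / YARN_CONF_DIR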

Solution 3 - Scala

Some additional Info

Do not run Spark Streaming programs locally with the master configured as "local" or "local[1]". This allocates only one CPU for tasks, and if a receiver is running on it, there is no resource left to process the received data. Use at least "local[2]" to have more cores.

From *Learning Spark: Lightning-Fast Big Data Analysis*
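A small sketch of why at least two threads are needed: with a socket receiver occupying one thread, local[2] leaves one thread free to process the batches (the host and port below are just example values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one thread for the receiver, one for processing
val conf = new SparkConf().setAppName("streaming-test").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999) // example source
lines.count().print()

ssc.start()
ssc.awaitTermination()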

Solution 4 - Scala

Master URL

You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total:

local uses 1 thread only.

local[n] uses n threads.

local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.
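A quick sketch to see both numbers for yourself (the exact values will depend on your machine):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("thread-check").setMaster("local[*]"))

// Both should report the number of logical cores visible to the JVM
println(Runtime.getRuntime.availableProcessors())
println(sc.defaultParallelism)
sc.stop()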

Solution 5 - Scala

You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total:

local uses 1 thread only.

local[n] uses n threads.

local[*] uses as many threads as the machine you are running your application on has.

You can check this with lscpu on your Linux machine:

[ie@mapr2 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2

If your machine has 56 logical cores (CPUs), then your Spark jobs will be partitioned into 56 parts.

NOTE: it may be the case that the spark-defaults.conf file of your Spark installation limits the default parallelism (spark.default.parallelism, for example to 10); in that case the number of partitions will match the value set in the config.

local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.
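As a sketch of the partitioning point above (assuming spark.default.parallelism has not been overridden in spark-defaults.conf), an RDD created without an explicit partition count picks up one partition per worker thread:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partition-check").setMaster("local[*]"))

val rdd = sc.parallelize(1 to 1000)  // no partition count given
println(rdd.getNumPartitions)        // defaults to sc.defaultParallelism (the core count under local[*])
sc.stop()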

Solution 6 - Scala

Without *, Spark will use a single thread.

With *, Spark will use all the available threads to run the program.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | Freewind | View Question on Stackoverflow |
| Solution 1 - Scala | ccheneson | View Answer on Stackoverflow |
| Solution 2 - Scala | FreeMan | View Answer on Stackoverflow |
| Solution 3 - Scala | mat77 | View Answer on Stackoverflow |
| Solution 4 - Scala | Ram Ghadiyaram | View Answer on Stackoverflow |
| Solution 5 - Scala | Devbrat Shukla | View Answer on Stackoverflow |
| Solution 6 - Scala | simo | View Answer on Stackoverflow |