SparkR vs sparklyr

R, Apache Spark, SparkR, sparklyr

R Problem Overview


Does anyone have an overview of the advantages and disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and the two seem fairly similar. Having tried both, SparkR appears a lot more cumbersome, whereas sparklyr is pretty straightforward (both to install and to use, especially with the dplyr inputs). Can sparklyr only be used to run dplyr functions in parallel, or also "normal" R code?

Best

R Solutions


Solution 1 - R

The biggest advantage of SparkR is the ability to run arbitrary user-defined functions written in R on Spark:

https://spark.apache.org/docs/2.0.1/sparkr.html#applying-user-defined-function
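As a rough illustration of what that looks like, here is a minimal sketch using SparkR's dapply(), modelled on the example in the linked docs. It assumes Spark 2.x is installed locally and discoverable (e.g. via SPARK_HOME); the column name waiting_secs and the variable local_result are just illustrative choices.

```r
library(SparkR)

# Start a local Spark session (assumes a local Spark 2.x installation).
sparkR.session(master = "local[*]")

# Turn a built-in R data frame into a Spark DataFrame.
df <- createDataFrame(faithful)

# dapply() needs the schema of the data frame that the R function returns.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_secs", "double")
)

# The function is ordinary R code; it receives each partition as a data.frame.
result <- dapply(
  df,
  function(part) cbind(part, waiting_secs = part$waiting * 60),
  schema
)

# Bring the result back into the local R session and inspect it.
local_result <- collect(result)
head(local_result)

sparkR.session.stop()
```

The function body can call any R code available on the workers, which is the point being made above.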

Since sparklyr translates R to SQL, you can only use a very small set of functions in mutate statements:

http://spark.rstudio.com/dplyr.html#sql_translation
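To see the translation for yourself, something along these lines should work. It is only a sketch, assuming a local Spark install that sparklyr can find (e.g. one set up with spark_install()); the kpl column is a made-up example.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is already installed).
sc <- spark_connect(master = "local")

# Copy a small built-in data set to Spark.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Build a lazy dplyr pipeline; nothing is computed yet.
query <- mtcars_tbl %>%
  filter(cyl == 4) %>%
  mutate(kpl = mpg * 0.425)

# Print the Spark SQL that the pipeline is translated to.
show_query(query)

# Run the query and pull the rows back into R.
local_result <- collect(query)

spark_disconnect(sc)
```

Simple arithmetic like this translates fine; an R function with no SQL equivalent is where the limitation shows up.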

That deficiency is somewhat alleviated by Extensions (http://spark.rstudio.com/extensions.html#wrapper_functions).

Other than that, sparklyr is the winner (in my opinion). Aside from the obvious advantage of using familiar dplyr functions, sparklyr has a much more comprehensive API for MLlib (http://spark.rstudio.com/mllib.html) and the Extensions mentioned above.
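For a flavour of the MLlib side, here is a small sketch that fits a linear regression through sparklyr. It assumes a working local Spark install; the choice of mtcars and of the predictors wt and cyl is purely illustrative, and argument names have changed between sparklyr versions, so check the reference for the version you have.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is already installed).
sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Fit a linear model with Spark MLlib, using an R formula for the model spec.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

# Inspect the fitted model.
summary(fit)

spark_disconnect(sc)
```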

Solution 2 - R

For an overview and in-depth details, you may refer to the documentation. Quoting from the documentation, "the sparklyr package provides a complete dplyr backend". This reflects that sparklyr is NOT a replacement for Apache Spark but an extension to it.

As for installation on a standalone computer (I'm a Windows user), you either need to download and install the new RStudio Preview version or execute the following series of commands in the RStudio console:

> devtools::install_github("rstudio/sparklyr")

Install the readr and digest packages if you do not have them installed:

install.packages("readr")
install.packages("digest")
library(sparklyr)
spark_install(version = "1.6.2")

Once the packages are installed, try connecting to a local instance of Spark using the command:

sc <- spark_connect(master = "local")

You may see an error such as

> Created default hadoop bin directory under: C:\spark-1.6.2\tmp\hadoop Error:

To run Spark on Windows you need a copy of Hadoop winutils.exe:

  1. Download Hadoop winutils.exe from
  2. Copy winutils.exe to C:\spark-1.6.2\tmp\hadoop\bin

Alternatively, if you are using RStudio you can install the RStudio Preview Release which includes an embedded copy of Hadoop winutils.exe.

The error message itself tells you how to resolve it. Head over to the GitHub account, download the winutils.exe file, save it to C:\spark-1.6.2\tmp\hadoop\bin, and try creating the Spark context again. Last year I published a comprehensive post on my blog detailing how to install and work with SparkR in a Windows environment.

Having said that, I would recommend not going down this painful path of installing a local instance of Spark on the usual RStudio; rather, try the RStudio Preview version. It will save you much of the hassle of creating the Spark context. Continuing further, here is a detailed post on R-bloggers on how sparklyr can be used.

I hope this helps.

Cheers.

Solution 3 - R

I can give you the highlights for sparklyr:

In the current 0.4 version, it does not yet support arbitrary parallel code execution. However, extensions can easily be written in Scala to overcome this limitation; see sparkhello.

Solution 4 - R

Being a wrapper, sparklyr has some limitations. For example, using copy_to() to create a Spark DataFrame does not preserve columns formatted as dates. With SparkR, as.DataFrame() preserves dates.
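A quick way to check the SparkR side of this on your own setup is sketched below. It assumes Spark 2.x is installed locally; the data frame and its column names are invented for the example, and behaviour may differ between versions.

```r
library(SparkR)

# Start a local Spark session (assumes a local Spark 2.x installation).
sparkR.session(master = "local[*]")

# A small local data frame with a Date column.
local_df <- data.frame(
  id  = 1:2,
  day = as.Date(c("2016-01-01", "2016-06-30"))
)

# Convert to a Spark DataFrame and look at the column types Spark assigned.
sdf <- as.DataFrame(local_df)
printSchema(sdf)

# Bring it back and check the class of the column on the R side.
round_trip <- collect(sdf)
class(round_trip$day)

sparkR.session.stop()
```

Running the equivalent check with copy_to() in sparklyr and comparing the reported column types would show whether the limitation still applies to your version.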

Solution 5 - R

... adding to the above from Javier...

As far as I can tell so far, sparklyr does not support do(), which makes it useful only when you want to do what's permitted by mutate, summarise, etc. Under the hood, sparklyr translates to Spark SQL, but doesn't (yet?) translate do() to something like a UDF.

Also, as far as I can tell so far, sparklyr doesn't support tidyr, including unnest().

Solution 6 - R

As I don't see many answers in favour of SparkR, I just want to mention that, as a newbie, I started learning them both, and I see that the SparkR API is more closely related to the one I use with standard Scala Spark. Since I want to use both RStudio and Scala, I need to choose between SparkR and sparklyr. Learning SparkR together with the Scala Spark API seems to take less effort than learning sparklyr, which is much more different, at least from my perspective. However, sparklyr appears more powerful. So for me the question is: do you want to use the more powerful and commonly used library with more community support, or do you compromise and use the API that is more similar to Scala Spark? That, at least, is my perspective on choosing.

Solution 7 - R

I recently wrote an overview of the advantages/disadvantages of SparkR vs sparklyr, which may be of interest: https://eddjberry.netlify.com/post/2017-12-05-sparkr-vs-sparklyr/.

There's a table at the top of the post that gives a rough overview of the differences for a range of criteria.

I conclude that sparklyr is preferable to SparkR. The most notable advantages are:

  1. Better data manipulation through compatibility with dplyr
  2. Better function naming conventions
  3. Better tools for quickly evaluating ML models
  4. Easier to run arbitrary code on a Spark DataFrame
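On the last point, my assumption is that this refers to spark_apply(), which sparklyr gained in releases after 0.4. A minimal sketch, assuming a local Spark install and R available to the Spark workers (the wt_kg column is just an example):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is already installed).
sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# spark_apply() runs an ordinary R function on each partition,
# which arrives as a plain R data.frame.
converted <- spark_apply(mtcars_tbl, function(part) {
  # wt is in units of 1000 lbs; convert to kilograms.
  part$wt_kg <- part$wt * 453.592
  part
})

# Bring the result back into the local R session.
local_result <- collect(converted)
head(local_result)

spark_disconnect(sc)
```

Unlike SparkR's dapply(), no output schema has to be spelled out here, since sparklyr infers the columns when none are given.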

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | koVex | View Question on Stackoverflow
Solution 1 - R | Alex Vorobiev | View Answer on Stackoverflow
Solution 2 - R | mnm | View Answer on Stackoverflow
Solution 3 - R | Javier Luraschi | View Answer on Stackoverflow
Solution 4 - R | Reuben L. | View Answer on Stackoverflow
Solution 5 - R | Carl F. | View Answer on Stackoverflow
Solution 6 - R | Tomer Ben David | View Answer on Stackoverflow
Solution 7 - R | Eddd | View Answer on Stackoverflow