Why use purrr::map instead of lapply?
Problem Overview
Is there any reason why I should use

```r
map(<list-like-object>, function(x) <do stuff>)
```

instead of

```r
lapply(<list-like-object>, function(x) <do stuff>)
```

? The output should be the same, and the benchmarks I made seem to show that `lapply` is slightly faster (it should be, as `map` needs to evaluate all of its non-standard-evaluation input).

So is there any reason why, for such simple cases, I should actually consider switching to `purrr::map`? I am not asking here about one's likes or dislikes about the syntax, other functionality provided by purrr, etc., but strictly about a comparison of `purrr::map` with `lapply`, assuming standard evaluation, i.e. `map(<list-like-object>, function(x) <do stuff>)`. Does `purrr::map` have any advantage in terms of performance, exception handling, etc.? The comments below suggest that it does not, but maybe someone could elaborate a little bit more?
R Solutions
Solution 1 - R
If the only function you're using from purrr is `map()`, then no, the advantages are not substantial. As Rich Pauloo points out, the main advantage of `map()` is the helpers, which allow you to write compact code for common special cases:

- `~ . + 1` is equivalent to `function(x) x + 1` (and to `\(x) x + 1` in R 4.1 and newer).
- `list("x", 1)` is equivalent to `function(x) x[["x"]][[1]]`. These helpers are a bit more general than `[[`; see `?pluck` for details. For data rectangling, the `.default` argument is particularly helpful.
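As a small sketch of these helpers (the `users` list below is invented for illustration; assumes purrr is installed):

```r
library(purrr)

# A toy nested list, invented for illustration
users <- list(
  list(name = "ana", scores = list(10, 20)),
  list(name = "bob")                  # no 'scores' element
)

map_chr(users, "name")                # extractor shorthand: "ana" "bob"
map(users, ~ .x$name)                 # formula shorthand, same result
map(users, \(x) x$name)               # R >= 4.1 lambda, same result

# pluck-style extraction with .default avoids errors on missing elements
map(users, "scores", .default = NA)   # second result is NA, not an error
```

The `.default` argument is what makes the extractor shorthands safer than bare `[[` when some list elements are missing.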
But most of the time you're not using a single `*apply()`/`map()` function, you're using a bunch of them, and the advantage of purrr is much greater consistency between the functions. For example:
- The first argument to `lapply()` is the data; the first argument to `mapply()` is the function. The first argument to all map functions is always the data.
- With `vapply()`, `sapply()`, and `mapply()` you can choose to suppress names on the output with `USE.NAMES = FALSE`; but `lapply()` doesn't have that argument.
- There's no consistent way to pass constant arguments on to the mapper function. Most functions use `...`, but `mapply()` uses `MoreArgs` (which you'd expect to be called `MORE.ARGS`), and `Map()`, `Filter()` and `Reduce()` expect you to create a new anonymous function. In map functions, constant arguments always come after the function name.
- Almost every purrr function is type-stable: you can predict the output type exclusively from the function name. This is not true for `sapply()` or `mapply()`. Yes, there is `vapply()`, but there's no equivalent for `mapply()`.
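To illustrate the type-stability point, here is a small sketch: `sapply()` changes its return type depending on the input, while `map_dbl()` always returns a double vector or fails.

```r
library(purrr)

sapply(1:3, function(x) x * 2)     # simplifies to a numeric vector
sapply(list(), function(x) x * 2)  # but an empty input returns list()
sapply(1:3, function(x) c(x, x))   # and length-2 results give a matrix

map_dbl(1:3, ~ .x * 2)             # always a double vector: 2 4 6
map_dbl(integer(0), ~ .x * 2)      # still a double vector (length 0)
# map_dbl(1:3, ~ c(.x, .x))        # would error: each result must be length 1
```

Code downstream of `map_dbl()` can rely on receiving a double vector; code downstream of `sapply()` has to handle a list, vector, or matrix depending on the data.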
You may think that all of these minor distinctions are not important (just as some people think that there's no advantage to stringr over base R regular expressions), but in my experience they cause unnecessary friction when programming (the differing argument orders always used to trip me up), and they make functional programming techniques harder to learn because as well as the big ideas, you also have to learn a bunch of incidental details.
Purrr also fills in some handy map variants that are absent from base R:
- `modify()` preserves the type of the data, using `[[<-` to modify "in place". In conjunction with the `_if` variant this allows for (IMO beautiful) code like `modify_if(df, is.factor, as.character)`.
- `map2()` allows you to map simultaneously over `x` and `y`. This makes it easier to express ideas like `map2(models, datasets, predict)`.
- `imap()` allows you to map simultaneously over `x` and its indices (either names or positions). This makes it easy to (e.g.) load all `csv` files in a directory, adding a `filename` column to each: `dir("\\.csv$") %>% set_names() %>% map(read.csv) %>% imap(~ transform(.x, filename = .y))`.
- `walk()` returns its input invisibly, and is useful when you're calling a function for its side effects (i.e. writing files to disk).
Not to mention the other helpers like `safely()` and `partial()`.
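A brief sketch of those two helpers (the inputs are invented for illustration): `safely()` wraps a function so it returns a list with `result` and `error` components instead of throwing, and `partial()` pre-fills arguments of an existing function.

```r
library(purrr)

# safely(): capture errors instead of aborting the whole map
safe_log <- safely(log)
out <- map(list(10, "oops"), safe_log)
out[[1]]$result   # 2.302585
out[[2]]$error    # the condition object, instead of a thrown error

# partial(): pre-fill an argument
mean_narm <- partial(mean, na.rm = TRUE)
mean_narm(c(1, 2, NA))   # 1.5
```

`safely()` is particularly handy when mapping over inputs where a few elements may fail and you want to inspect the failures afterwards rather than lose the whole run.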
Personally, I find that when I use purrr, I can write functional code with less friction and greater ease; it decreases the gap between thinking up an idea and implementing it. But your mileage may vary; there's no need to use purrr unless it actually helps you.
Microbenchmarks
Yes, `map()` is slightly slower than `lapply()`. But the cost of using `map()` or `lapply()` is driven by what you're mapping, not the overhead of performing the loop. The microbenchmark below suggests that the cost of `map()` compared to `lapply()` is around 40 ns per element, which seems unlikely to materially impact most R code.
```r
library(purrr)
n <- 1e4
x <- 1:n
f <- function(x) NULL

mb <- microbenchmark::microbenchmark(
  lapply = lapply(x, f),
  map = map(x, f)
)
summary(mb, unit = "ns")$median / n
#> [1] 490.343 546.880
```
Solution 2 - R
Comparing `purrr` and `lapply` boils down to convenience and speed.

1. `purrr::map` is syntactically more convenient than `lapply`

Extract the second element of the list:

```r
map(list, 2)
```

which, as @F. Privé pointed out, is the same as:

```r
map(list, function(x) x[[2]])
```

With `lapply`:

```r
lapply(list, 2) # doesn't work
```

we need to pass an anonymous function...

```r
lapply(list, function(x) x[[2]]) # now it works
```

...or, as @RichScriven pointed out, we can pass `[[` as an argument into `lapply`:

```r
lapply(list, `[[`, 2) # a bit simpler syntactically
```

So if you find yourself applying functions to many lists using `lapply`, and tire of either defining a custom function or writing an anonymous function, convenience is one reason to favor `purrr`.
2. Type-specific map functions simplify many lines of code

- `map_chr()`
- `map_lgl()`
- `map_int()`
- `map_dbl()`
- `map_df()`

Each of these type-specific map functions returns a vector, rather than the lists returned by `map()` and `lapply()`. If you're dealing with nested lists of vectors, you can use these type-specific map functions to pull out the vectors directly, and coerce them directly into int, dbl, or chr vectors. The base R version would look something like `as.numeric(sapply(...))`, `as.character(sapply(...))`, etc.
The `map_<type>` functions also have the useful quality that if they cannot return an atomic vector of the indicated type, they fail. This is useful when defining strict control flow, where you want a function to fail if it [somehow] generates the wrong object type.
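For example (a minimal sketch, using an invented mixed-type list): `map_chr()` refuses to silently coerce, whereas `sapply()` quietly changes its output to whatever fits.

```r
library(purrr)

x <- list("a", "b", 3)      # mixed types

sapply(x, identity)         # silently coerces everything to character
# map_chr(x, identity)      # errors: element 3 is not a character

map_chr(x, as.character)    # explicit, intentional coercion: "a" "b" "3"
```

The error from `map_chr()` surfaces the type problem at the point where it occurs, rather than letting a silently coerced vector propagate into later code.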
3. `lapply` is [slightly] faster than `map`

Convenience aside, using `purrr`'s convenience functions, as @F. Privé pointed out, slows down processing a bit. Let's race each of the 4 cases I presented above.
```r
# devtools::install_github("jennybc/repurrrsive")
library(repurrrsive)
library(purrr)
library(microbenchmark)
library(ggplot2)

mbm <- microbenchmark(
  lapply = lapply(got_chars[1:4], function(x) x[[2]]),
  lapply_2 = lapply(got_chars[1:4], `[[`, 2),
  map_shortcut = map(got_chars[1:4], 2),
  map = map(got_chars[1:4], function(x) x[[2]]),
  times = 100
)
autoplot(mbm)
```
And the winner is....

```r
lapply(list, `[[`, 2)
```

In sum, if raw speed is what you're after: `base::lapply` (although it's not that much faster). For simple syntax and expressibility: `purrr::map`.

This excellent `purrr` tutorial highlights the convenience of not having to explicitly write out anonymous functions when using `purrr`, and the benefits of the type-specific `map` functions.
Solution 3 - R
If we do not consider aspects of taste (otherwise this question should be closed) or syntax consistency, style, etc., the answer is no, there's no special reason to use `map` instead of `lapply` or other variants of the apply family, such as the stricter `vapply`.
PS: To those people gratuitously downvoting, just remember the OP wrote:

> I am not asking here about one's likes or dislikes about the syntax, other functionalities provided by purrr etc., but strictly about comparison of purrr::map with lapply assuming using the standard evaluation
If you do not consider syntax nor the other functionalities of `purrr`, there's no special reason to use `map`. I use `purrr` myself and I'm fine with Hadley's answer, but it ironically goes over the very things the OP stated upfront he was not asking about.
Solution 4 - R
tl;dr
> I am not asking about one's likes or dislikes about syntax or other functionalities provided by purrr.
Choose the tool that matches your use case and maximizes your productivity. For production code that prioritizes speed, use `*apply`; for code that requires a small memory footprint, use `map`. Based on ergonomics, `map` is likely preferable for most users and most one-off tasks.
Convenience

Update October 2021: since both the accepted answer and the 2nd most voted post mention syntax convenience:

R versions 4.1 and higher now support the shorthand anonymous function `\(x)` and the native pipe `|>` syntax. To check your R version, use `version[['version.string']]`.
```r
library(purrr)
library(repurrrsive)

lapply(got_chars[1:2], `[[`, 2) |>
  lapply(\(.) . + 1)
#> [[1]]
#> [1] 1023
#>
#> [[2]]
#> [1] 1053

map(got_chars[1:2], 2) %>%
  map(~ . + 1)
#> [[1]]
#> [1] 1023
#>
#> [[2]]
#> [1] 1053
```
Syntax for the `purrr` approach is generally shorter to type if your task involves more than 2 manipulations of list-like objects.
```r
nchar(
"lapply(x, fun, y) |>
  lapply(\\(.) . + 1)")
#> [1] 45
nchar(
"library(purrr)
map(x, fun) %>%
  map(~ . + 1)")
#> [1] 45
```
Considering a person might write tens or hundreds of thousands of these calls in their career, this syntax length difference can equate to writing 1 or 2 novels (an average novel being ~80,000 letters), given the code is typed. Further, consider your code input speed (~65 words per minute?), your input accuracy (do you often mistype certain syntax, such as `(\"<`?), and your recall of function arguments; then you can make a fair comparison of your productivity using one style, or a combination of the two.
Another consideration might be your target audience. Personally, I found explaining how `purrr::map` works harder than explaining `lapply`, precisely because of its concise syntax.
```r
1 |>
  lapply(\(.z) .z + 1)
#> [[1]]
#> [1] 2

1 %>%
  map(~ .z + 1)
#> Error in .f(.x[[i]], ...) : object '.z' not found
```

but,

```r
1 %>%
  map(~ . + 1)
#> [[1]]
#> [1] 2
```
Speed

Often, when dealing with list-like objects, multiple operations are performed in sequence. A nuance to the claim that the overhead of `purrr` is insignificant in most code: with large lists and chained operations, the difference adds up.
```r
got_large <- rep(got_chars, 1e4) # 300 000 elements, 1.3 GB in memory
bench::mark(
  base = {
    lapply(got_large, `[[`, 2) |>
      lapply(\(.) . * 1e5) |>
      lapply(\(.) . / 1e5) |>
      lapply(\(.) as.character(.))
  },
  purrr = {
    map(got_large, 2) %>%
      map(~ . * 1e5) %>%
      map(~ . / 1e5) %>%
      map(~ as.character(.))
  }, iterations = 100,
)[c(1, 3, 4, 5, 7, 8, 9)]
#> # A tibble: 2 x 7
#>   expression   median `itr/sec` mem_alloc n_itr  n_gc total_time
#>   <bch:expr> <bch:tm>     <dbl> <bch:byt> <int> <dbl>   <bch:tm>
#> 1 base          1.19s     0.807    9.17MB   100   301      2.06m
#> 2 purrr         2.67s     0.363    9.15MB   100   919      4.59m
```
This diverges the more actions are performed. If you are writing code that is used routinely by other users, or that packages depend on, speed might be a significant factor to consider in your choice between base R and `purrr`. Notice that `purrr` has a slightly lower memory footprint.

There is, however, a counterargument: if you want speed, go to a lower-level language.