As has been mentioned by several folks, `mutate_each()` and `summarise_each()` are deprecated in favour of the new `across()` function. 

Answer as of `dplyr` version 1.0.5:

```
df %>%
  group_by(sex) %>%
  summarise(across(everything(), mean))
```

Original answer:

`dplyr` now has `summarise_each`:

    df %>%
      group_by(sex) %>%
      summarise_each(funs(mean))

The `data.table` idiom is `lapply(.SD, mean)`:

    DT <- data.table(df)
    DT[, lapply(.SD, mean), by = sex]
    #     sex age bmi  chol
    # 1:  boy  55  24 203.5
    # 2: girl  51  28 197.0

I'm not sure of a `dplyr` idiom for the same thing, but you can do something like

    dg <- group_by(df, sex)
    # the names of the columns you want to summarize
    cols <- names(dg)[-1]
    # the dots component of your call to summarise
    dots <- sapply(cols, function(x) substitute(mean(x), list(x = as.name(x))))
    do.call(summarise, c(list(.data = dg), dots))
    # Source: local data frame [2 x 4]
    
    #    sex age bmi  chol
    # 1  boy  55  24 203.5
    # 2 girl  51  28 197.0


Note that there is a GitHub issue [#178](https://github.com/hadley/dplyr/issues/178) to efficiently implement the `plyr` idiom `colwise` in `dplyr`.
  

I want to make a grouped filter using `dplyr`, so that within each group only the row with the minimum value of variable `x` is returned.

My problem is: As expected, in the case of multiple minima *all* rows with the minimum value are returned. But in my case, **I only want the first row** if multiple minima are present.

Here's an example:

    df <- data.frame(
      A = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
      x = c(1, 1, 2, 2, 3, 4, 5, 5, 5),
      y = rnorm(9)
    )

    library(dplyr)
    df.g <- group_by(df, A)
    filter(df.g, x == min(x))

As expected, all minima are returned:

    Source: local data frame [6 x 3]
    Groups: A
    
      A x           y
    1 A 1 -1.04584335
    2 A 1  0.97949399
    3 B 2  0.79600971
    4 C 5 -0.08655151
    5 C 5  0.16649962
    6 C 5 -0.05948012


With `ddply`, I would have approached the task this way:

    library(plyr)
    ddply(df, .(A), function(z) {
      z[z$x == min(z$x), ][1, ]
    })

... which works:

      A x           y
    1 A 1 -1.04584335
    2 B 2  0.79600971
    3 C 5 -0.08655151

**Q: Is there a way to approach this in dplyr?** (For speed reasons)
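One way this might be approached, sketched under the assumption that the installed `dplyr` provides `slice()`: base R's `which.min()` returns the position of the first minimum only, so slicing each group at that position keeps a single row per group. The data frame mirrors the example above, except that `y` uses fixed stand-in values instead of `rnorm(9)` so the result is reproducible.

```r
library(dplyr)

df <- data.frame(
  A = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  x = c(1, 1, 2, 2, 3, 4, 5, 5, 5),
  y = 1:9  # fixed stand-in for rnorm(9), for reproducibility
)

first_min <- df %>%
  group_by(A) %>%
  slice(which.min(x)) %>%  # position of the first minimum within each group
  ungroup()

first_min
```

Ties are resolved by row order within each group, so sorting the data beforehand changes which of the tied rows is kept.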

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

In the Pro Spring 3 book (Chapter 4, "Introduction IOC and DI in Spring", page 59), the section "Setter Injection vs. Constructor Injection" contains this paragraph:

> Spring included, provide a mechanism for ensuring that all dependencies are defined when
you use Setter Injection, but by using Constructor Injection, you assert the requirement for the dependency in a container-agnostic manner

Could you explain with examples?


Explain why constructor inject is better than other options

dplyr is amazingly fast, but I wonder if I'm missing something: is it possible to summarise over several variables? For example:

    library(dplyr)
    library(reshape2)

    (df=dput(structure(list(sex = structure(c(1L, 1L, 2L, 2L), .Label = c("boy", 
    "girl"), class = "factor"), age = c(52L, 58L, 40L, 62L), bmi = c(25L, 
    23L, 30L, 26L), chol = c(187L, 220L, 190L, 204L)), .Names = c("sex", 
    "age", "bmi", "chol"), row.names = c(NA, -4L), class = "data.frame")))

       sex age bmi chol
    1  boy  52  25  187
    2  boy  58  23  220
    3 girl  40  30  190
    4 girl  62  26  204
    
    dg=group_by(df,sex)

With this small dataframe, it's easy to write

    summarise(dg,mean(age),mean(bmi),mean(chol))

And I know that to get what I want, I could melt, get the means, and then dcast such as

    dm=melt(df, id.var='sex')
    dmg=group_by(dm, sex, variable)
    x=summarise(dmg, means=mean(value))
    dcast(x, sex~variable)

But what if I have >20 variables and a very large number of rows? Is there anything similar to `.SD` in data.table that would allow me to take the means of all variables in the grouped data frame? Or is it possible to somehow use `lapply` on the grouped data frame?

Thanks for any help

Can dplyr summarise over several variables without listing each one?

Stargazer produces very nice LaTeX tables for `lm` (and other) objects.  Suppose I've fit a model by maximum likelihood.  I'd like stargazer to produce an lm-like table for my estimates.  How can I do this?

Although it's a bit hacky, one way might be to create a "fake" lm object containing my estimates -- I think this would work as long as `summary(my.fake.lm.object)` works.  Is that easily doable?

An example:

    library(stargazer)

    N <- 200
    df <- data.frame(x=runif(N, 0, 50))
    df$y <- 10 + 2 * df$x + 4 * rt(N, 4)  # True params
    plot(df$x, df$y)

    model1 <- lm(y ~ x, data=df)
    stargazer(model1, title="A Model")  # I'd like to produce a similar table for the model below

    ll <- function(params) {
        ## Log likelihood for y ~ x + student's t errors
        params <- as.list(params)
        return(sum(dt((df$y - params$const - params$beta*df$x) / params$scale, df=params$degrees.freedom, log=TRUE) -
                   log(params$scale)))
    }

    model2 <- optim(par=c(const=5, beta=1, scale=3, degrees.freedom=5), lower=c(-Inf, -Inf, 0.1, 0.1),
                    fn=ll, method="L-BFGS-B", control=list(fnscale=-1), hessian=TRUE)
    model2.coefs <- data.frame(coefficient=names(model2$par), value=as.numeric(model2$par),
                               se=as.numeric(sqrt(diag(solve(-model2$hessian)))))

    stargazer(model2.coefs, title="Another Model", summary=FALSE)  # Works, but how can I mimic what stargazer does with lm objects?

To be more precise:  with lm objects, stargazer nicely prints the dependent variable at the top of the table, includes SEs in parentheses below the corresponding estimates, and has the R^2 and number of observations at the bottom of the table.  Is there a(n easy) way to obtain the same behavior with a "custom" model estimated by maximum likelihood, as above?

Here are my feeble attempts at dressing up my optim output as an lm object:

    model2.lm <- list()  # Mimic an lm object
    class(model2.lm) <- c(class(model2.lm), "lm")
    model2.lm$rank <- model1$rank  # Problematic?
    model2.lm$coefficients <- model2$par
    names(model2.lm$coefficients)[1:2] <- names(model1$coefficients)
    model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
    model2.lm$residuals <- df$y - model2.lm$fitted.values
    model2.lm$model <- df
    model2.lm$terms <- model1$terms  # Problematic?
    summary(model2.lm)  # Not working



Get coefficients estimated by maximum likelihood into a stargazer table

What are the main differences between `.RData`, `.Rda` and `.Rds` files? 

- Are there differences in compression, etc.?
- When should each type be used? 
- How can one type be converted to another?
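For the conversion part of the question, a minimal base R sketch may help frame things: `save()` and `load()` store one or more named objects (the `.RData` and `.rda` extensions both refer to this format), while `saveRDS()` and `readRDS()` store a single object without its name. The object names and temporary file paths below are illustrative assumptions, not part of the question.

```r
# Illustrative objects and temporary paths (assumptions, not from the question)
scores <- c(a = 1, b = 2)
labels <- c("x", "y")
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

# .RData / .rda: several objects, restored under their original names
save(scores, labels, file = rdata_path)

# Convert: load into a scratch environment, then write one object as .rds
scratch <- new.env()
load(rdata_path, envir = scratch)
saveRDS(scratch$scores, file = rds_path)

# .rds: one object, assigned to whatever name the caller chooses
restored <- readRDS(rds_path)
identical(restored, scores)
```

Loading into a scratch environment avoids overwriting objects of the same name in the current workspace, which is the usual hazard with a bare `load()`.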
  

What are the main differences between R data files?

### Overview ###

I'm relatively familiar with `data.table`, not so much with `dplyr`.  I've read through some [`dplyr` vignettes][1] and examples that have popped up on SO, and so far my conclusions are that:

 1. `data.table` and `dplyr` are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
 2. `dplyr` has more accessible syntax
 3. `dplyr` abstracts (or will) potential DB interactions
 4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with `data.table`, though I understand that for users new to both it will be a big factor.  I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with `data.table`.  I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

### Question ###

What I want to know is:

 1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
 2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

One [recent SO question][3] got me thinking about this a bit more, because up until that point I didn't think `dplyr` would offer much beyond what I can already do in `data.table`.  Here is the `dplyr` solution (data at end of Q):

    dat %.%
      group_by(name, job) %.%
      filter(job != "Boss" | year == min(year)) %.%
      mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a `data.table` solution.  That said, good `data.table` solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

    setDT(dat)[,
      .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
      by=list(id, job)
    ]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to `data.table` (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the `dplyr` or `data.table` way is substantially more concise or performs substantially better.

### Examples ###

#### Usage ####

 - `dplyr` does not allow grouped operations that return arbitrary number of rows (from **[eddi's question][6]**, note: this looks like it will be implemented in **[dplyr 0.5](https://github.com/hadley/dplyr/issues/154)**, also, @beginneR shows a potential work-around using `do` in the answer to @eddi's question).
 - `data.table` supports **[rolling joins](https://stackoverflow.com/questions/12030932/rolling-joins-data-table-in-r)** (thanks @dholstius) as well as **[overlap joins](https://stackoverflow.com/questions/23371747/range-join-data-frames-specific-date-column-with-date-ranges-intervals-in-r/23377309#23377309)**
 - `data.table` internally optimises expressions of the form `DT[col == value]` or `DT[col %in% values]` for *speed* through *automatic indexing* which uses *binary search* while using the same base R syntax. [See here](https://gist.github.com/arunsrinivasan/dacb9d1cac301de8d9ff) for some more details and a tiny benchmark.
 - `dplyr` offers standard evaluation versions of functions (e.g. `regroup`, `summarize_each_`) that can simplify the programmatic use of `dplyr` (note programmatic use of `data.table` is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)

#### Benchmarks ####

- I ran **[my own benchmarks](https://www.brodieg.com/2014/04/18/datatable-vs-dplyr-in-split-apply-comgine/)** and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point `data.table` becomes substantially faster.
- @Arun ran some **[benchmarks on joins](https://gist.github.com/arunsrinivasan/db6e1ce05227f120a2c9)**, showing that `data.table` scales better than `dplyr` as the number of groups increases (updated with recent enhancements in both packages and recent version of R).  Also, a benchmark when trying to get **[unique values](https://stackoverflow.com/questions/23668593/using-fastmatch-package-in-r/23680560?noredirect=1#comment36389040_23680560)** has `data.table` ~6x faster.
- An unverified benchmark has `data.table` 75% faster on larger versions of a group/apply/sort while `dplyr` was 40% faster on the smaller ones (**[another SO question from comments][7]**, thanks danas).
- Matt, the main author of `data.table`, has [**benchmarked grouping operations on `data.table`, `dplyr` and python `pandas` on up to 2 billion rows (~100GB in RAM)**](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping).
- An **[older benchmark on 80K groups][2]** has `data.table` ~8x faster

### Data ###

This is for the first example I showed in the question section.

    dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
    "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
    "Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
    1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
    1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
    "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
    "Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
    1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
    "name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
    -16L))


  [1]: http://rpubs.com/hadley/dplyr-intro
  [2]: http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/
  [3]: https://stackoverflow.com/questions/21421004/how-to-cumulatively-add-values-in-one-vector-in-r
  [4]: https://stackoverflow.com/questions/21295936/can-dplyr-summarise-over-several-variables-without-listing-each-one
  [5]: https://stackoverflow.com/questions/22644804/how-can-i-use-dplyr-to-apply-a-function-to-all-non-group-by-columns
  [6]: https://stackoverflow.com/questions/21737815/grouped-operations-that-result-in-length-not-equal-to-1-or-length-of-group-in-dp
  [7]: https://stackoverflow.com/questions/21477525/fast-frequency-and-percentage-table-with-dplyr/

data.table vs dplyr: can one do something well the other can't or does poorly?

In a shiny app (by RStudio), on the _server_ side, I have a reactive that returns a list of variables by parsing the content of a `textInput`. The list of variables is then used in `selectInput` and/or `updateSelectInput`.

I can't make it work. Any suggestions?

I have made two attempts. The first approach is to use the reactive `outVar` directly into `selectInput`. The second approach is to use the reactive `outVar` in `updateSelectInput`. Neither works.

### server.R

    shinyServer(
      function(input, output, session) {
    
        outVar <- reactive({
            vars <- all.vars(parse(text=input$inBody))
            vars <- as.list(vars)
            return(vars)
        })

        output$inBody <- renderUI({
            textInput(inputId = "inBody", label = h4("Enter a function:"), value = "a+b+c")
        })

        output$inVar <- renderUI({  ## works but the choices are non-reactive
            selectInput(inputId = "inVar", label = h4("Select variables:"), choices =  list("a","b"))
        })

        observe({  ## doesn't work
            choices <- outVar()
            updateSelectInput(session = session, inputId = "inVar", choices = choices)
        })

    })


### ui.R

    shinyUI(
      basicPage(
        uiOutput("inBody"),
        uiOutput("inVar")
      )
    )

A short while ago, I posted the same question at shiny-discuss, but it has generated little interest, so I'm asking again here, with apologies: https://groups.google.com/forum/#!topic/shiny-discuss/e0MgmMskfWo


__Edit 1__

@Ramnath has kindly posted a solution that appears to work, denoted _Edit 2_ by him. But that solution does not address the problem because the `textInput` is on the `ui` side instead of on the `server` side as it is in my problem. If I move the `textInput` of Ramnath's second edit to the `server` side, the problem crops up again, namely: nothing shows and RStudio crashes. I found that wrapping `input$text` in `as.character` makes the problem disappear.

__Edit 2__

In further discussion, Ramnath has shown me that the problem arises when the server attempts to apply the dynamic function `outVar` before its arguments have been returned by `textInput`. The solution is to first check that `input$inBody` exists, i.e. that `is.null(input$inBody)` is `FALSE`.

_**Checking for existence of arguments is a crucial aspect of building a shiny app**_, so why did I not think of it? Well, I did, but I must have done something wrong! Considering the amount of time I spent on the problem, it's a bitter experience. I show after the code how to check for existence.

Below is Ramnath's code with `textInput` moved to the `server` side. It crashes RStudio so don't try it at home. (I have used his notation)

    library(shiny)
    runApp(list(
      ui = bootstrapPage(
        uiOutput('textbox'),  ## moving Ramnath's textInput to the server side
        uiOutput('variables')
      ),
      server = function(input, output){
        outVar <- reactive({
          vars <- all.vars(parse(text = input$text))  ## existence check needed here to prevent a crash
          vars <- as.list(vars)
          return(vars)
        })

        output$textbox = renderUI({
          textInput("text", "Enter Formula", "a=b+c")
        })

        output$variables = renderUI({
          selectInput('variables2', 'Variables', outVar())
        })
      }
    ))

The way I usually check for existence is like this:

    if (is.null(input$text) || is.na(input$text)){
      return()
    } else {
      vars <- all.vars(parse(text = input$text))
      return(vars)
    }

Ramnath's code is shorter:

    mytext <- input$text  # read the input first, then test it
    if (!is.null(mytext)){
      vars <- all.vars(parse(text = mytext))
      return(vars)
    }

Both seem to work, but I'll be doing it Ramnath's way from now on: maybe an unbalanced bracket in my construct had earlier prevented me from making the check work? Ramnath's check is more direct.

Lastly, I'd like to note a couple of things about my various attempts to debug.

In my debugging quest, I discovered that there is an option to "rank" the priority of "outputs" on the server side, which I explored in an attempt to solve my problem, but it didn't work since the problem was elsewhere. Still, it's interesting to know and seems not very well known at this time:

    outputOptions(output, "textbox", priority = 1)
    outputOptions(output, "variables", priority = 2)

In that quest, I also _tried_ `try`:

    try(vars <- all.vars(parse(text = input$text)))

That was pretty close, but still did not fix it.

The first solution I stumbled upon was:

    vars <- all.vars(parse(text = as.character(input$text)))

I suppose it would be interesting to know why it worked: is it because it slows things down enough? Is it because `as.character` "waits" for `input$text` to be non-null?

Whatever the case may be, I am extremely grateful to Ramnath for his effort, patience and guidance.

R shiny passing reactive to selectInput choices

I like plyr's renaming function `rename`.  I have recently started using dplyr, and was wondering if there is an easy way to rename variables using a function from dplyr that is as easy to use as plyr's `rename`?

Replacement for "rename" in dplyr
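As a point of comparison, here is a minimal sketch assuming a `dplyr` version that exports its own `rename()`. The data frame and column names are made up for illustration. Note that the argument order is `new_name = old_name`, whereas plyr's `rename()` takes a named vector of the form `c(old_name = "new_name")`.

```r
library(dplyr)

# Hypothetical data frame, only for illustration
df <- data.frame(old_name = 1:3, other = letters[1:3])

# dplyr::rename() takes new_name = old_name pairs
renamed <- rename(df, new_name = old_name)
names(renamed)
```

If both plyr and dplyr are attached, whichever was loaded last masks the other's `rename()`, so calling `dplyr::rename()` explicitly avoids ambiguity.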

Is there a more succinct way to get one column of a dplyr tbl as a vector, from a tbl with database back-end (i.e. the data frame/table can't be subset directly)?

    require(dplyr)
    db <- src_sqlite(tempfile(), create = TRUE)
    iris2 <- copy_to(db, iris)
    iris2$Species
    # NULL

That would have been too easy, so

    collect(select(iris2, Species))[, 1]
    # [1] "setosa"     "setosa"     "setosa"     "setosa"  etc.

But it seems a bit clumsy.
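Later `dplyr` releases (0.7.0 onward, as far as I know) added `pull()`, which extracts a single column as a vector. The sketch below uses the local `iris` data frame as a stand-in so that it runs without a database. For a database-backed table the same call is expected to collect the column first, but that depends on the installed backend package, so treat it as an assumption to verify.

```r
library(dplyr)

# Local data frame used as a stand-in for the database-backed table
species <- iris %>% pull(Species)

head(as.character(species), 3)
```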


Extract a dplyr tbl column as a vector

I'm struggling a bit with the dplyr syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.

    n <- 100  # sample size; not defined in the original, value assumed
    df <- data.frame(
        a = sample(1:5, n, replace = TRUE), 
        b = sample(1:5, n, replace = TRUE), 
        c = sample(1:5, n, replace = TRUE), 
        d = sample(1:5, n, replace = TRUE), 
        grp = sample(1:3, n, replace = TRUE)
    )
    df %>% group_by(grp) %>% summarise(mean(a))

This gives me the mean for column "a" for each group indicated by "grp".

My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat `df %>% group_by(grp) %>% summarise(mean(a))` for each column?

What I would like to have is something like

    df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
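With `dplyr` 1.0.0 or later, the `across()` function mentioned at the top of this page accepts a column range, which comes close to the syntax wished for above. This is a sketch rather than a definitive answer: the seed and the sample size `n` are arbitrary assumptions, since the question leaves `n` undefined.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 30      # assumed sample size; the question leaves n undefined

df <- data.frame(
  a = sample(1:5, n, replace = TRUE),
  b = sample(1:5, n, replace = TRUE),
  c = sample(1:5, n, replace = TRUE),
  d = sample(1:5, n, replace = TRUE),
  grp = sample(1:3, n, replace = TRUE)
)

# across(a:d, mean) applies mean() to every column from a through d, per group
means <- df %>%
  group_by(grp) %>%
  summarise(across(a:d, mean))

means
```

The result has one row per group and keeps the original column names; `across(everything(), mean)` would cover all non-grouping columns instead of a range.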

Content Type	Original Author	Original Content on Stackoverflow
Question	David F	View Question on Stackoverflow
Solution 1 - R	rrs	View Answer on Stackoverflow
Solution 2 - R	mnel	View Answer on Stackoverflow
