Summarize all group values and a conditional subset in the same call

RDplyrSqldf

R Problem Overview


I'll illustrate my question with an example.

Sample data:

 df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B =     c(1, 5, 7, 23, 54, 202))

df
  ID   A   B
1  1 foo   1
2  1 bar   5
3  2 foo   7
4  2 foo  23
5  3 bar  54
6  5 bar 202

What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:

require(magrittr)
require(dplyr)

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))

df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))

left_join(df1, df2)

  ID sumB sumBfoo
1  1    6       1
2  2   30      30
3  3   54      NA
4  5  202      NA

However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.

require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)

I thought of using mutate to define a new Bfoo column:

df_sqlite %>%
  mutate(Bfoo = ifelse(A=="foo", B, 0))

Unfortunately, this doesn't work on the database end of things.

Error in sqliteExecStatement(conn, statement, ...) : 
  RS-DBI driver: (error in statement: no such function: IFELSE)

R Solutions


Solution 1 - R

You can do both sums in a single dplyr statement:

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A=="foo"]))

And here is a data.table version:

library(data.table)

dt = setDT(df) 

dt1 = dt[ , .(sumB = sum(B),              sumBfoo = sum(B[A=="foo"])), 
          by = ID]

dt1

> ID sumB sumBfoo > 1: 1 6 1 > 2: 2 30 30 > 3: 3 54 0 > 4: 5 202 0

Solution 2 - R

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect

Solution 3 - R

If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.

df1 <- df %>%
    group_by(ID) %>%
    summarize(countB = n(),
              countBfoo = sum(A=="foo"))

df1
Source: local data frame [4 x 3]

  ID countB countBfoo
1  1      2         1
2  2      2         2
3  3      1         0
4  5      1         0

Solution 4 - R

If you wanted to count the rows, instead of summing them, can you pass a variable to the function:

    df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
          RowCountBfoo = n(A=="foo"))

I get an error both with n() and nrow().

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionkevinykuoView Question on Stackoverflow
Solution 1 - Reipi10View Answer on Stackoverflow
Solution 2 - RkevinykuoView Answer on Stackoverflow
Solution 3 - RLauriKView Answer on Stackoverflow
Solution 4 - Ruser3116297View Answer on Stackoverflow