Applying group_by and summarise on data while keeping all the columns' info

R Problem Overview

I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the problem is that it only gives me two columns containing the grouped column and the column having the min value... but I need all the information of other columns related to the rows with the min values. Here is a simple example just to make it reproducible:

    data<- data.frame(a=1:10, b=c("a","a","a","b","b","c","c","d","d","d"), c=c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d= c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    d<- data %>%
    group_by(b) %>%
    summarise(min_values= min(c))
    d
    b min_values
    1 a        1.2
    2 b        1.7
    3 c        3.1
    4 d        2.2

So, I need to have also the information related to columns a and d, however, since I have duplications in the values in column c I cannot merge them based on the min_value column... I was wondering if there is any way to keep other columns' information when we are using dplyr package.

I have found some explanation here "https://stackoverflow.com/questions/29597970/dplyr-group-by-subset-and-summarise" and here "https://stackoverflow.com/questions/29549731/dplyr-finding-percentage-in-a-sub-group-using-group-by-and-summarise" but none of the addresses my problem.

R Solutions

Solution 1 - R

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups and so the results of a) and b) are the same. If there were duplicated minima, approach a) would return each minima per group while b) would only return one minimum (the first) in each group.

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Solution 2 - R

You can use group_by without summarize:

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  ungroup()

Solution 3 - R

Using sqldf:

library(sqldf)
 # Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output:

   a b   c     d
1  1 a 1.2 small
2  4 b 1.7  larg
3  6 c 3.1   med
4 10 d 2.2   med

Content Type	Original Author	Original Content on Stackoverflow
Question	Momeneh Foroutan	View Question on Stackoverflow
Solution 1 - R	talat	View Answer on Stackoverflow
Solution 2 - R	bergant	View Answer on Stackoverflow
Solution 3 - R	mpalanco	View Answer on Stackoverflow

Applying group_by and summarise on data while keeping all the columns' info

R Problem Overview

R Solutions

Solution 1 - R

Solution 2 - R

Solution 3 - R

How to center items of a recyclerview?

Can Java 8 Streams operate on an item in a collection, and then remove it?

Attributions