Select the first row by group

RDataframeSqldf

R Problem Overview


From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

R Solutions


Solution 1 - R

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))jply <- function() ddply(test,.(id),function(x) head(x,1))jdt <- function() {  testd <- as.data.table(test)  setkey(testd,id)  # Initial solution (slow)  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15

Solution 2 - R

I favor the dplyr approach.

group_by(id) followed by either

  • filter(row_number()==1) or
  • slice(1) or
  • slice_head(1) #(dplyr => 1.0)
  • top_n(n = -1)
  • top_n() internally uses the rank function. Negative selects from the bottom of rank.

In some instances arranging the ids after the group_by can be necessary.

library(dplyr)

# using filter(), top_n() or slice()

m1 <-
test %>% 
  group_by(id) %>% 
  filter(row_number()==1)

m2 <-
test %>% 
  group_by(id) %>% 
  slice(1)

m3 <-
test %>% 
  group_by(id) %>% 
  top_n(n = -1)

All three methods return the same result

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct> 
1     1 A     
2     2 B     
3     3 C     
4     4 D     
5     5 E

Solution 3 - R

What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the the first row by key

jdtu <- function() unique(DT)

I think, if you are ordering test outside the benchmark, then you can removing the setkey and data.table conversion from the benchmark as well (as the setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(DT, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']


 library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

and with more data

** Edit with unique method**

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
       test replications elapsed relative user.self sys.self 
2  jdt()            5    0.09     2.25      0.09     0.00    
3 jdtu()            5    0.04     1.00      0.05     0.00      
1   ju()            5    0.22     5.50      0.19     0.03        

The unique method is fastest here.

Solution 4 - R

A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
setkey(testd,id)
testd[,.SD[1],by = key(testd)]

or this might be considerably faster:

testd[testd[, .I[1], by = key(testd]$V1]

Solution 5 - R

now, for dplyr, adding a distinct counter.

df %>%
    group_by(aa, bb) %>%
    summarise(first=head(value,1), count=n_distinct(value))

You create groups, them summarise within groups.

If data is numeric, you can use:
first(value) [there is also last(value)] in place of head(value, 1)

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
>   group_by(aa, bb) %>%
>   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2

Solution 6 - R

(1) SQLite has a built in rowid pseudo-column so this works:

sqldf("select min(rowid) rowid, id, string 
               from test 
               group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) Also sqldf itself has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string 
              from test 
              group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative which mixes the elements of the above two might be even better:

sqldf("select min(rowid) row_names, id, string 
               from test 
               group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)

Solution 7 - R

A base R option is the split()-lapply()-do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The comma-space 1, ) at the end of the lapply() call is essential as this is equivalent of calling [1, ] to select first row and all columns.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiondmviannaView Question on Stackoverflow
Solution 1 - RJoshua UlrichView Answer on Stackoverflow
Solution 2 - RKrestenView Answer on Stackoverflow
Solution 3 - RmnelView Answer on Stackoverflow
Solution 4 - RjoranView Answer on Stackoverflow
Solution 5 - RPaulView Answer on Stackoverflow
Solution 6 - RG. GrothendieckView Answer on Stackoverflow
Solution 7 - RGavin SimpsonView Answer on Stackoverflow