Extract rows for the first occurrence of a variable in a data frame

R

R Problem Overview


I have a data frame with two variables, Date and Taxa and want to get the date for the first time each taxa occurs. There are 9 different dates and 40 different taxa in the data frame consisting of 172 rows, but my answer should only have 40 rows.

Taxa is a factor and Date is a date.

For example, my data frame (called 'species') is set up like this:

Date          Taxa
2013-07-12    A
2011-08-31    B
2012-09-06    C
2012-05-17    A
2013-07-12    C
2012-09-07    B

and I would be looking for an answer like this:

Date          Taxa
2012-05-17    A
2011-08-31    B
2012-09-06    C

I tried using:

t.first <-  species[unique(species$Taxa),]

and it gave me the correct number of rows but there were Taxa repeated. If I just use unique(species$Taxa) it appears to give me the right answer, but then I don't know the date when it first occurred.

Thanks for any help.

R Solutions


Solution 1 - R

t.first <- species[match(unique(species$Taxa), species$Taxa),]

should give you what you're looking for. match returns indices of the first match in the compared vectors, which give you the rows you need.

Solution 2 - R

In the following command, duplicated creates a logical index for duplicated data$Taxa values. A subset of the data frame without the corresponding rows is created with:

data[!duplicated(data$Taxa), ]

The result:

        Date Taxa
1 2012-05-17    A
2 2011-08-31    B
3 2012-09-06    C

Solution 3 - R

Here is a dplyr option that is not dependent on the data being sorted in date order and accounts for ties:

library(dplyr)
df %>% 
  mutate(Date = as.Date(Date)) %>% 
  group_by(Taxa) %>% 
  filter(Date == min(Date)) %>% 
  slice(1) %>% # takes the first occurrence if there is a tie
  ungroup()

# A tibble: 3 x 2
  Date       Taxa 
  <date>     <chr>
1 2012-05-17 A    
2 2011-08-31 B    
3 2012-09-06 C 

# sample data:
df <- read.table(text = 'Date          Taxa
                         2013-07-12    A
                         2011-08-31    B
                         2012-09-06    C
                         2012-05-17    A
                         2013-07-12    C
                         2012-09-07    B', header = TRUE, stringsAsFactors = FALSE)

And you could get the same by sorting by date as well:

df %>% 
  mutate(Date = as.Date(Date)) %>% 
  group_by(Taxa) %>% 
  arrange(Date) %>% 
  slice(1) %>% 
  ungroup()

Solution 4 - R

Here is a solution using data.table:

library(data.table)
setDT(species)
species[, .SD[which.min(Date)], by = Taxa]
#    Taxa       Date
# 1:    A 2012-05-17
# 2:    B 2011-08-31
# 3:    C 2012-09-06

Data:

species <- data.frame(
  Date = as.Date(c("2013-07-12", "2011-08-31", "2012-09-06", 
                   "2012-05-17", "2013-07-12", "2012-09-07")), 
  Taxa = c("A", "B", "C", "A", "C", "B")
)

Solution 5 - R

This should do the trick:

# Create some dummy data:

# Create some dates 
Date=as.POSIXct(c("2013-07-12","2011-08-31","2012-09-06","2009-01-01",
                  "2012-05-17","2013-07-12","2012-09-07","2013-02-02"))

# Create unique taxa
Taxa=rep(c("A","B","C","D"),2)

# Combine the two into a dataframe
data=as.data.frame(list(Date=Date,Taxa=Taxa))

# this returns a numeric vector of the minimum dates
xx=tapply(data$Date,list(data$Taxa),min)

# And this will return a dataframe with the first occurence
# of your taxa (or variables)
as.data.frame(list(Date=as.POSIXct(xx,origin="1970-01-01"),
                   Taxa=names(xx)))

Note: You can add simplify=T in tapply to return a POSIXt object but it returns a list. More info can be found here: Unexpected behaviour of min, tapply and POSIXct/POSIXlt classes?

Solution 6 - R

This is a good question. First of all, I want to highlight that the output you've mentioned is not correct as per your requirement. It should be:

Date          Taxa
2013-07-12    A
2011-08-31    B
2012-09-06    C

i.e., the first entry was not correct. Now talking about the code for this, all these are good answers but the solution I propose is more robust. To demonstrate better, I have used a new dataframe.

d <- data.frame(a = c(rep("A", 4), c(rep("B",4)), rep("C",4)), b=c(0,0,1,1,0,1,1,1,0,0,0,1))
d %>% group_by(a) %>% mutate(c = detect_index(.x = b, .f = p), d = row_number()) %>% mutate(e = ifelse(c==d,1,0)) %>% ungroup()

Keep Coding!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser2614883View Question on Stackoverflow
Solution 1 - RalexwhanView Answer on Stackoverflow
Solution 2 - RSven HohensteinView Answer on Stackoverflow
Solution 3 - RsbhaView Answer on Stackoverflow
Solution 4 - Rsindri_baldurView Answer on Stackoverflow
Solution 5 - Rs_scolaryView Answer on Stackoverflow
Solution 6 - RVishal SharmaView Answer on Stackoverflow