Finding ALL duplicate rows, including "elements with smaller subscripts"

Tags: r, duplicates, r-faq

Problem Overview


R's duplicated returns a logical vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are identical, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.
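For concreteness, a minimal sketch of the default behaviour described above (the five-row data frame here is invented for illustration):

df <- data.frame(x = c(1, 2, 3, 3, 3))   # rows 3, 4, and 5 are identical
duplicated(df)
## [1] FALSE FALSE FALSE  TRUE  TRUE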

Solutions


Solution 1 - R

duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.


A late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c","c","c") 
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c

Solution 2 - R

You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

> vec <- c("a", "b", "c", "c", "c")
> vec[duplicated(vec)]
[1] "c" "c"
> unique(vec[duplicated(vec)])
[1] "c"
> vec %in% unique(vec[duplicated(vec)])
[1] FALSE FALSE  TRUE  TRUE  TRUE
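The same trick extends to data frames if you first collapse each row into a single string key, e.g. with paste, so the vector logic applies. A sketch (the "\r" separator is an arbitrary choice to avoid collisions between column values):

df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
key <- do.call(paste, c(df, sep = "\r"))   # one string key per row
df[key %in% unique(key[duplicated(key)]), ]
##   X1 X2
## 3  c  c
## 4  c  c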

Solution 3 - R

Duplicated rows in a data frame can be obtained with dplyr by doing

library(tidyverse)
df = bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n()>1) %>% ungroup()

To exclude certain columns, group_by_at(vars(-var1, -var2)) can be used instead to group the data.
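For instance, with the iris-based test data above, you could ignore the Species column like this (column choice is illustrative only):

df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()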

If the row indices, and not just the data, are actually needed, you can add them first, as in:

df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)

Solution 4 - R

I've had the same question, and if I'm not mistaken, this is also an answer.

# Assuming vec is a data frame with a column named col:
vec[vec$col %in% vec[duplicated(vec$col), ]$col, ]

I don't know which one is faster, though; the dataset I'm currently using isn't big enough to run tests that produce meaningful timing differences.
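If you do want to compare speed, a rough sketch using base R's system.time on a larger synthetic vector (the size is arbitrary; with only 26 distinct letters, nearly every element is a duplicate):

vec <- sample(letters, 1e6, replace = TRUE)
system.time(vec %in% unique(vec[duplicated(vec)]))              # Solution 2's approach
system.time(duplicated(vec) | duplicated(vec, fromLast = TRUE)) # Solution 1's approach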

Solution 5 - R

Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():

allDuplicated <- function(vec){
  # TRUE where an element duplicates an earlier one
  front <- duplicated(vec)
  # TRUE where an element duplicates a later one
  back <- duplicated(vec, fromLast = TRUE)
  # TRUE if either direction flags the element
  all_dup <- front + back > 0
  return(all_dup)
}

Using the same example:

vec <- c("a", "b", "c","c","c") 
allDuplicated(vec) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
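Since duplicated() is generic and also works on data frames, the same function can be applied unchanged to the data frame example from Solution 1:

df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
df[allDuplicated(df), ]
##   X1 X2
## 3  c  c
## 4  c  c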

Solution 6 - R

I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:

df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()

The code groups the rows by specific columns. If the length of a group is greater than 1, the code marks all of the rows in the group as duplicated. Once that is done, you can use the Duplicated column for filtering, etc.
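For example, to keep only the duplicated rows:

df %>% filter(Duplicated == "Yes")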

Solution 7 - R

If you are interested in which rows are duplicated for certain columns, you can use a plyr approach:

library(plyr)
ddply(df, .(col1, col2), function(df) if(nrow(df) > 1) df else c())

Adding a count variable with dplyr:

df %>% add_count(col1, col2) %>% filter(n > 1)  # data frame of duplicated rows
df %>% add_count(col1, col2) %>% pull(n) > 1    # logical vector

For duplicate rows (considering all columns):

df %>% group_by_all %>% add_tally %>% ungroup %>% filter(n > 1)
df %>% group_by_all %>% add_tally %>% ungroup %>% pull(n) > 1

The benefit of these approaches is that you can specify the number of occurrences to use as a cutoff.
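For example, to keep only rows whose col1/col2 combination occurs more than twice (the cutoff of 2 is chosen purely for illustration):

df %>% add_count(col1, col2) %>% filter(n > 2)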

Solution 8 - R

This updates @Holger Brandl's answer (Solution 3) to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.

Thus, to get all rows for which there is a duplicate, you can do this:

iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()

To include the indices of such rows, add a 'rowid' column but exclude it from the grouping:

iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()

Append %>% pull(rowid) after the above and you'll get a vector of the indices.
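Putting the pieces together, a sketch of the full pipeline (rowid_to_column() comes from tibble, which the tidyverse loads):

library(dplyr)
library(tibble)

iris %>%
  rowid_to_column() %>%
  group_by(across(!rowid)) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  pull(rowid)
## [1] 102 143   # iris happens to contain one pair of identical rows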

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author    Original Content on Stackoverflow
Question          Lauren Samuels     View Question on Stackoverflow
Solution 1 - R    Joshua Ulrich      View Answer on Stackoverflow
Solution 2 - R    IRTFM              View Answer on Stackoverflow
Solution 3 - R    Holger Brandl      View Answer on Stackoverflow
Solution 4 - R    François M.        View Answer on Stackoverflow
Solution 5 - R    canderson156       View Answer on Stackoverflow
Solution 6 - R    Adnan Hajizada     View Answer on Stackoverflow
Solution 7 - R    qwr                View Answer on Stackoverflow
Solution 8 - R    MCornejo           View Answer on Stackoverflow