How to check if two data frames are equal


Database Problem Overview


Say I have large datasets in R and I just want to know whether two of them are the same. I use this often when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:

df1 <- data.frame(num = 1:5, let = letters[1:5])
df2 <- df1
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

So this is what I do to compare them:

table(x == y, useNA = 'ifany')

Which works great when the datasets have no NAs:

> table(df1 == df2, useNA = 'ifany')
TRUE 
  10 

But not so much when they have NAs:

> table(df3 == df4, useNA = 'ifany')
TRUE <NA> 
  11    1 

In the example, it's easy to dismiss the NA as not a problem since we know that both dataframes are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, the result at that position is always going to be NA, no matter what the other dataset has there.
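To make the NA behaviour concrete, here is a minimal sketch. The vectors a and b are made up for illustration and are not part of the datasets above:

```r
# Comparing anything with NA gives NA, even when the other side is also NA.
na_vs_value <- NA == 1
na_vs_na <- NA == NA
print(na_vs_value)  # NA
print(na_vs_na)     # NA

# So an element-wise comparison cannot tell "both missing" apart from
# "missing on one side only": both cases show up as NA.
a <- c(1, NA, 3)
b <- c(1, 2, 3)
elementwise <- a == b
print(elementwise)  # TRUE NA TRUE
```

Here a and b clearly differ in the second position, yet the comparison reports NA rather than FALSE.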

So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?

P.S.: Note this is not a duplicate of https://stackoverflow.com/questions/12567524/r-comparing-several-datasets, https://stackoverflow.com/questions/7459138/comparing-2-datasets-in-r or https://stackoverflow.com/questions/8872333/compare-datasets-in-r

Database Solutions


Solution 1 - Database

Look up all.equal(). It comes with some caveats, but it might work for you.

all.equal(df3,df4)
# [1] TRUE
all.equal(df2,df1)
# [1] TRUE
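Two caveats worth knowing, sketched here assuming default arguments: all.equal() never returns FALSE (it returns a character vector describing the differences instead), and it compares numeric values with a small tolerance. The data frames x and y below are made up for illustration:

```r
df1 <- data.frame(num = 1:5, let = letters[1:5])
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])

# all.equal() returns TRUE or a character vector, never FALSE,
# so wrap it in isTRUE() before using it in a condition.
same_1_3 <- isTRUE(all.equal(df1, df3))  # FALSE: the lengths differ
same_3_3 <- isTRUE(all.equal(df3, df3))  # TRUE: matching NAs are fine

# Numeric values are compared up to a tolerance, so tiny differences
# are treated as equal, whereas identical() requires exact equality.
x <- data.frame(v = 1)
y <- data.frame(v = 1 + 1e-10)
near_equal <- isTRUE(all.equal(x, y))  # TRUE
exactly_same <- identical(x, y)        # FALSE
```

Whether the tolerance is a feature or a problem depends on whether the algorithms being compared are expected to agree exactly or only up to floating-point error.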

Solution 2 - Database

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or hints about the differences between the objects. For instance, consider the following:

> identical(df1, df3)
[1] FALSE

> all.equal(df1, df3)
[1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"                                
[2] "Component 1: Numeric: lengths (5, 6) differ"                                                
[3] "Component 2: Lengths: 5, 6"                                                                 
[4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
[5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"   
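As a further sketch of how the two functions behave: identical() treats NAs in matching positions as equal, which is what the question asks for, but it is also strict about storage types. The data frames int_df and dbl_df are made up for illustration:

```r
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

# NAs in the same positions count as equal for identical().
with_na <- identical(df3, df4)  # TRUE

# identical() distinguishes integer from double columns;
# all.equal() does not, as long as the values match.
int_df <- data.frame(num = 1:3)         # integer column
dbl_df <- data.frame(num = c(1, 2, 3))  # double column
strict <- identical(int_df, dbl_df)           # FALSE
lenient <- isTRUE(all.equal(int_df, dbl_df))  # TRUE
```

So if two algorithms produce the same values but one returns integers and the other doubles, identical() will report FALSE.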

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
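One rough way to check the speed claim on your own machine is system.time(). This is only a sketch: the data size and the names big1 and big2 are arbitrary assumptions, and timings will vary between systems and R versions.

```r
# Hypothetical benchmark data; the size is arbitrary.
set.seed(1)
n <- 1e5
big1 <- data.frame(num = rnorm(n),
                   let = sample(letters, n, replace = TRUE))

# Build the second data frame from fresh vectors with the same values,
# so the comparison has to look at the contents rather than at shared memory.
big2 <- data.frame(num = big1$num + 0,
                   let = paste0(big1$let, ""))

t_identical <- system.time(res_identical <- identical(big1, big2))
t_all_equal <- system.time(res_all_equal <- isTRUE(all.equal(big1, big2)))

# Elapsed seconds for each approach.
print(t_identical[["elapsed"]])
print(t_all_equal[["elapsed"]])
```

For more reliable numbers, repeat each call several times rather than relying on a single run.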

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Waldir Leoncio | View Question on Stackoverflow
Solution 1 - Database | TheComeOnMan | View Answer on Stackoverflow
Solution 2 - Database | Waldir Leoncio | View Answer on Stackoverflow