Merge two data frames while keeping the original row order

RSortingDataframeMerge

R Problem Overview


I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).

Here are some sample data (all values from class column are defined in both data frames):

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))

If I do:

merge(df.2, df.1)

Output is:

  class object prob
1     1      B  0.5
2     1      C  0.5
3     2      A  0.7
4     2      D  0.7
5     3      F  0.3

If I add sort = FALSE:

merge(df.2, df.1, sort = F)                                                        

Result is:

  class object prob
1     2      A  0.7
2     2      D  0.7
3     1      B  0.5
4     1      C  0.5
5     3      F  0.3

But what I would like is:

  class object prob
1     2      A  0.7
2     1      B  0.5
3     2      D  0.7
4     3      F  0.3    
5     1      C  0.5

R Solutions


Solution 1 - R

You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id  <- 1:nrow(df.2)
out  <- merge(df.2,df.1, by = "class")
out[order(out$id), ]

Solution 2 - R

Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.

Using your example data, we would use join like this:

> join(df.2,df.1)
Joining by: class
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5

Here are a couple of links describing fixes to the merge function for keeping the row order:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html

Solution 3 - R

You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.

Using your example above, the inner_join result looks like this:

inner_join(df.2,df.1)
Joining by: "class"
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5

Solution 4 - R

From data.table v1.9.5+, you can do:

require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]

The performs a join on column class by finding out matching rows in df.1 for each row in df.2 and extracting corresponding columns.

Solution 5 - R

For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:

library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]

> object class prob > 1: A 2 0.7 > 2: B 1 0.5 > 3: D 2 0.7 > 4: F 3 0.3 > 5: C 1 0.5

Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.

Solution 6 - R

The accepted answer proposes a manual way to keep order when using merge, which works most of the times but requires unnecessary manual work. This solution comes on the back of https://stackoverflow.com/questions/7235421/how-to-ddply-without-sorting/7235591#7235591, which deals with the issue of keeping order but in a split-apply-combine context:

> This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:

#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) { 
  col <- ".sortColumn"
  data[,col] <- 1:nrow(data) 
  out <- fn(data, ...) 
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function") 
  out <- out[order(out[,col]),] 
  out[,col] <- NULL 
  out 
} 

So now you can use this generic keeping.order function to keep the original row order of a merge call:

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")

Which will yield, as requested:

> keeping.order(df.2, merge, y=df.1, by = "class")
  class object id prob
3     2      A  1  0.7
1     1      B  2  0.5
4     2      D  3  0.7
5     3      F  4  0.3
2     1      C  5  0.5

So keeping.order effectively automates the approach in the accepted answer.

Solution 7 - R

Thanks to @PAC , I came up with something like this:

merge_sameord = function(x, y, ...) {
    UseMethod('merge_sameord')
}

merge_sameord.data.frame = function(x, y, ...) {
    rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
    x[, rstr] = 1:nrow(x)
    res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
    res = res[order(res[, rstr]), ]
    res[, rstr] = NULL
    res
}

This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.

Solution 8 - R

In this specific case you could us factor for a compact base solution:

df.2$prob = factor(df.2$class,labels=df.1$prob)

df.2
#   object class prob
# 1      A     2  0.7
# 2      B     1  0.5
# 3      D     2  0.7
# 4      F     3  0.3
# 5      C     1  0.5

Not a general solution however, it works if:

  1. You have a lookup table containing unique values
  2. You want to update a table, not create a new one
  3. the lookup table is sorted by the merging column
  4. The lookup table doesn't have extra levels
  5. You want a left_join
  6. If you're fine with factors

1 is not negotiable, for the rest we can do:

df.3  <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.

Solution 9 - R

For package developers

As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.

To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).

left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all = TRUE, ...)
}
full_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, ...)
}

# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- 1:nrow(x)
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get suffices (yes, I prefer this over suffixes)
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c("", "")
  }
  # get columns names in right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x), 
                   paste0(colnames(x), suffixes[1]), 
                   colnames(y), 
                   paste0(colnames(y), suffixes[2])))
  # get the original row and column index
  joined[order(joined$join_id),         cols[cols %in% colnames(joined) & cols != "join_id_"]]
}

Solution 10 - R

The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))  
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))  
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])

I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is

data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
           row.names = NULL)

I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.

Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.

Solution 11 - R

There are several uses cases in which a simple subset will do:

# Use the key variable as row.names
row.names(df.1) = df.1$key

# Sort df.1 so that it's rows match df.2
df.3 = df.1[df.2$key, ]

# Create a data.frame with cariables from df.1 and (the sorted) df.2
df.4 = cbind(df.1, df.3)

This code will preserve df.2 and it's order and add only matching data from df.1

If only one variable is to be added, the cbind() ist not required:

row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]

Solution 12 - R

There may be a more efficient way in base. This would be fairly simple to make into a function.

varorder <- names(mydata)  # --- Merge 
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])

mydata[c(varorder,restOfvars)]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDJackView Question on Stackoverflow
Solution 1 - RPACView Answer on Stackoverflow
Solution 2 - Ruser2635373View Answer on Stackoverflow
Solution 3 - Ralex23lemmView Answer on Stackoverflow
Solution 4 - RArunView Answer on Stackoverflow
Solution 5 - RUweView Answer on Stackoverflow
Solution 6 - RlandroniView Answer on Stackoverflow
Solution 7 - RqedView Answer on Stackoverflow
Solution 8 - RmoodymudskipperView Answer on Stackoverflow
Solution 9 - RMS BerendsView Answer on Stackoverflow
Solution 10 - RDan MurphyView Answer on Stackoverflow
Solution 11 - RBurninLeoView Answer on Stackoverflow
Solution 12 - RJames HollandView Answer on Stackoverflow