Replace values in a dataframe based on lookup table

RDataframeLookup

R Problem Overview


I am having some trouble replacing values in a dataframe. I would like to replace values based on a separate table. Below is an example of what I am trying to do.

I have a table where every row is a customer and every column is an animal they purchased. Lets call this dataframe table.

> table
#       P1     P2     P3
# 1    cat lizard parrot
# 2 lizard parrot    cat
# 3 parrot    cat lizard

I also have a table that I will reference called lookUp.

> lookUp
#      pet   class
# 1    cat  mammal
# 2 lizard reptile
# 3 parrot    bird

What I want to do is create a new table called new with a function replaces all values in table with the class column in lookUp. I tried this myself using an lapply function, but I got the following warnings.

new <- as.data.frame(lapply(table, function(x) {
  gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)

Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used

Any ideas on how to make this work?

R Solutions


Solution 1 - R

You posted an approach in your question which was not bad. Here's a smiliar approach:

new <- df  # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])

An alternative approach which will be faster is:

new <- df
new[] <- look$class[match(unlist(df), look$pet)]

Note that I use empty brackets ([]) in both cases to keep the structure of new as it was (a data.frame).

(I'm using df instead of table and look instead of lookup in my answer)

Solution 2 - R

Another options is a combination of tidyr and dplyr

library(dplyr)
library(tidyr)
table %>%
   gather(key = "pet") %>%
   left_join(lookup, by = "pet") %>%
   spread(key = pet, value = class)
   

Solution 3 - R

Anytime you have two separate data.frames and are trying to bring info from one to the other, the answer is to merge.

Everyone has their own favorite merge method in R. Mine is data.table.

Also, since you want to do this to many columns, it'll be faster to melt and dcast -- rather than loop over columns, apply it once to a reshaped table, then reshape again.

library(data.table)

#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE) 
setDT(lookUp)

#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')     
       # merging
       ][lookup, new_value := i.class, on = c(value = 'pet') 
         #reform back to original shape
         ][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
#    rn      P1      P2      P3
# 1:  1  mammal reptile    bird
# 2:  2 reptile    bird  mammal
# 3:  3    bird  mammal reptile
             

In case you find the dcast/melt bit a bit intimidating, here's an approach that just loops over columns; dcast/melt is simply sidestepping the loop for this problem.

setDT(table) #don't need row names this time
setDT(lookUp)

sapply(names(table), #(or to whichever are the relevant columns)
       function(cc) table[lookUp, (cc) := #merge, replace                            #need to pass a _named_ vector to 'on', so use setNames                            i.class, on = setNames("pet", cc)])

Solution 4 - R

Make a named vector, and loop through every column and match, see:

# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1    
#      cat    lizard    parrot 
# "mammal" "reptile"    "bird" 

# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL

# res
#        P1      P2      P3
# 1  mammal reptile    bird
# 2 reptile    bird  mammal
# 3    bird  mammal reptile
data
df1 <- read.table(text = "
       P1     P2     P3
 1    cat lizard parrot
 2 lizard parrot    cat
 3 parrot    cat lizard", header = TRUE)

lookUp <- read.table(text = "
      pet   class
 1    cat  mammal
 2 lizard reptile
 3 parrot    bird", header = TRUE)

Solution 5 - R

I did it using the factor built-in.

table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)

Solution 6 - R

The answer above showing how to do this in dplyr doesn't answer the question, the table is filled with NAs. This worked, I would appreciate any comments showing a better way:

# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>% 
    # put in long format, naming column filled with P1, P2, P3 "petCount"
    gather(key="petCount", value="pet", -customer) %>% 
    # add a new column based on the pet's class in data frame "lookup"
    left_join(lookup, by="pet") %>%
    # since you wanted to replace the values in "table" with their
    # "class", remove the pet column
    select(-pet) %>% 
    # put data back into wide format
    spread(key="petCount", value="class")

Note that it would likely be useful to keep the long table that contains the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediary save to a variable:

table$customer = seq(nrow(table))
petClasses <- table %>% 
    gather(key="petCount", value="pet", -customer) %>% 
    left_join(lookup, by="pet")

custPetClasses <- petClasses %>%
    select(-pet) %>% 
    spread(key="petCount", value="class")

Solution 7 - R

I tried other approaches and they took a really long time with my very large dataset. I used the following instead:

    # make table "new" using ifelse. See data below to avoid re-typing it
    new <- ifelse(table1 =="cat", "mammal",
                        ifelse(table1 == "lizard", "reptile",
                               ifelse(table1 =="parrot", "bird", NA)))

This method requires you to write more text for your code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, if you want to spend more time writing code or waiting for your computer to run. If you want to make sure it worked (you didn't have any typos in your iflese commands), you can use apply(new, 2, function(x) mean(is.na(x))).

data

    # create the data table
    table1 <- read.table(text = "
       P1     P2     P3
     1    cat lizard parrot
     2 lizard parrot    cat
     3 parrot    cat lizard", header = TRUE)

Solution 8 - R

Benchmark

Out of burning curiosity, I just ran a benchmark with some of the approaches that I want to share with you. I couldn't quite believe some of the statements about performance in the answers and am trying to clarify this herewith. In order not to be misled by different rows/columns ratios, I consider three scenarios:

  1. ncol == nrow

  2. ncol << nrow

  3. ncol >> nrow.

It might be beneficial to coerce as.matrix beforehand, so I included this as an additional solution (unlist_mat).

microbenchmark::microbenchmark(
  lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
  unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
  unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)],  ## added
  ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
                                 ifelse(Dat3[col_set] == "lizard", "reptile",
                                        ifelse(Dat3[col_set] == "parrot", "bird", NA))),
  look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
  times=3L
)

## 1e3 x 1e3
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval cld
#     lapply  40.42905  63.47053  78.03831  86.51201  96.84294 107.17387     3  a 
#     unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297     3   b
# unlist_mat  45.91743  56.51087  68.50595  67.10432  79.80021  92.49611     3  a 
#     ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581     3  ab
#   look_vec  58.54449  88.40293 112.91165 118.26137 140.09522 161.92908     3  a 

## 1e4 x 1e4
# Unit: seconds
#       expr       min        lq      mean    median         uq        max neval cld
#     lapply  2.427077  3.558234  3.992481  4.689390   4.775183   4.860977     3  a 
#     unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084     3   b
# unlist_mat  4.940254  5.011684  5.576553  5.083114   5.894703   6.706291     3  a 
#     ifelse  9.714553 14.444899 36.176777 19.175244  49.407889  79.640535     3  a 
#   look_vec  8.460969  8.558600  8.784463  8.656230   8.946209   9.236188     3  a 

## 1e5 x 1e3
# Unit: seconds
#       expr       min        lq      mean    median        uq        max neval cld
#     lapply  2.314427  2.403001  3.270708  2.491575  3.748848   5.006120     3  a 
#     unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586     3   b
# unlist_mat  5.018869  5.060865  5.638499  5.102861  5.948314   6.793767     3  a 
#     ifelse  6.244744 16.488266 39.208119 26.731788 55.689807  84.647825     3  ab
#   look_vec  4.512672  6.434651  7.496267  8.356630  8.988064   9.619498     3  a 

## 1e3 x 1e5
# Unit: seconds
#       expr        min         lq       mean     median         uq        max neval cld
#     lapply  52.833019  55.373432  71.308981  57.913845  80.546963 103.180080     3 ab 
#     unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819     3   c
# unlist_mat   3.872551   4.422904   4.695393   4.973257   5.106814   5.240372     3 a  
#     ifelse  72.592437  76.473418 103.930063  80.354399 119.598876 158.843354     3  b 
#   look_vec  56.444824  58.904604  62.677267  61.364383  65.793488  70.222593     3 ab 

Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz using R --vanilla.

all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1))  ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!

Conclusion

Using lapply with a lookup matrix appears to be a good choice if the number of columns is rather low/lower than the number of rows. If we have many columns, especially compared to rows, we might benefit from coercing the respective columns of the data frame into a matrix first, which should only take a blink of an eye.


set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
                  class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
  Mat <- as.matrix(Dat)
)
#  user  system elapsed 
# 0.844   0.318   1.161 
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionjbunkView Question on Stackoverflow
Solution 1 - RtalatView Answer on Stackoverflow
Solution 2 - RThierryView Answer on Stackoverflow
Solution 3 - RMichaelChiricoView Answer on Stackoverflow
Solution 4 - Rzx8754View Answer on Stackoverflow
Solution 5 - RulidtkoView Answer on Stackoverflow
Solution 6 - RdannitView Answer on Stackoverflow
Solution 7 - RmikeyView Answer on Stackoverflow
Solution 8 - Rjay.sfView Answer on Stackoverflow