Remove rows conditionally from a data.table in R
Rdata.tableR Problem Overview
I have a data.table with fields {id, menuitem, amount}.
This is transaction data - so, ids are unique, but menuitem repeats. Now, I want to remove all entries where menuitem == 'coffee'
.
Also, want to delete all rows where amount <= 0
;
What is the right way to do this in data.table?
I can use data$menuitem!='coffee'
and then index int into data[] - but that is not necessarily efficient and does not take advantage of data.table.
Any pointers in the right direction are appreciated.
R Solutions
Solution 1 - R
In this scenario it is not so different than data.frame
data <- data[ menuitem != 'coffee' | amount > 0]
Delete/add row by reference it is to be implemented. You find more info in this question
Regarding speed:
1 You can benefit from keys by doing something like:
setkey(data, menuitem)
data <- data[!"coffee"]
which will be faster than data <- data[ menuitem != 'coffee']
. However to apply the same filters you asked in the question you'll need a rolling join (I've finished my lunch break I can add something later :-)).
2 Even without key data.table is much faster for relatively big table (similar speed for handful amount of rows)
dt<-data.table(id=sample(letters,1000000,T),var=rnorm(1000000))
df<-data.frame(id=sample(letters,1000000,T),var=rnorm(1000000))
library(microbenchmark)
> microbenchmark(dt[ id == "a"], df[ df$id == "a",])
Unit: milliseconds
expr min lq median uq max neval
dt[id == "a"] 24.42193 25.74296 26.00996 26.35778 27.36355 100
df[df$id == "a", ] 138.17500 146.46729 147.38646 149.06766 154.10051 100
Solution 2 - R
try this:
data <- data[ !(menuitem == 'coffee' | amount <= 0),]
Generally:
dt <- data.table(a=c(1,1,1,2,2,2,3,3,3),b=c(4,2,3,1,5,3,4,7,6))
dt
#> a b
#> 1: 1 4
#> 2: 1 2
#> 3: 1 3
#> 4: 2 1
#> 5: 2 5
#> 6: 2 3
#> 7: 3 4
#> 8: 3 7
#> 9: 3 6
dt[a!=1,]
#> a b
#> 1: 2 1
#> 2: 2 5
#> 3: 2 3
#> 4: 3 4
#> 5: 3 7
#> 6: 3 6