Left join using data.table

Tags: R, Join, Merge, data.table

R Problem Overview


Suppose I have two data.tables:

A:

   A  B
1: 1 12
2: 2 13
3: 3 14
4: 4 15

B:

   A  B
1: 2 13
2: 3 14

and I have the following code:

merge_test = merge(dataA, dataB, by="A", all.data=TRUE)

I get:

   A B.x B.y
1: 2  13  13
2: 3  14  14

However, I want all the rows in dataA in the final merged table. Is there a way to do this?

R Solutions


Solution 1 - R

If you want to add the b values of B to A, then it's best to join A with B and update A by reference as follows:

A[B, on = 'a', bb := i.b]

which gives:

> A
   a  b bb
1: 1 12 NA
2: 2 13 13
3: 3 14 14
4: 4 15 NA

This is a better approach than using B[A, on = 'a'], because the latter just prints the result to the console without changing A. To keep the result, you would have to assign it back with A <- B[A, on = 'a'], which gives you the same rows but creates a new object.
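To make the comparison concrete, here is a minimal sketch of both forms, using the small A and B defined in the used-data section below. Note that in B[A, on = 'a'] the clashing b column coming from A shows up with data.table's i. prefix:

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# right join: a new table with all rows of A;
# B's b is NA where a has no match, A's b appears as i.b
res <- B[A, on = 'a']

# update-by-reference join: adds bb to A in place, NA where a has no match
A[B, on = 'a', bb := i.b]
```

Both keep all four rows of A; the difference is that the first allocates a new table, while the second modifies A in place.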

The reason why A[B, on = 'a', bb := i.b] is better than A <- B[A, on = 'a'] is memory efficiency. With A[B, on = 'a', bb := i.b] the location of A in memory stays the same:

> address(A)
[1] "0x102afa5d0"
> A[B, on = 'a', bb := i.b]
> address(A)
[1] "0x102afa5d0"

While on the other hand with A <- B[A, on = 'a'], a new object is created and saved in memory as A and hence has another location in memory:

> address(A)
[1] "0x102abae50"
> A <- B[A, on = 'a']
> address(A)
[1] "0x102aa7e30"

Using merge (merge.data.table) results in a similar change in memory location:

> address(A)
[1] "0x111897e00"
> A <- merge(A, B, by = 'a', all.x = TRUE)
> address(A)
[1] "0x1118ab000"

For memory efficiency it is thus better to use an 'update-by-reference-join' syntax:

A[B, on = 'a', bb := i.b] 

Although this doesn't make a noticeable difference with small datasets like these, it does make a difference on large datasets for which data.table was designed.

Probably also worth mentioning is that the order of A stays the same.
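A small sketch of that last point, with a deliberately unsorted A (same column names as in the used-data section): the update-by-reference join leaves A's row order untouched, whereas merge() sorts the result by the join column by default:

```r
library(data.table)

A <- data.table(a = c(4L, 2L, 1L, 3L), b = c(15L, 13L, 12L, 14L))
B <- data.table(a = 2:3, b = 13:14)

# update join: A keeps its original row order 4, 2, 1, 3
A[B, on = 'a', bb := i.b]

# merge: the result is sorted by 'a' (1, 2, 3, 4)
m <- merge(A, B, by = 'a', all.x = TRUE)
```

(merge also accepts sort = FALSE when sorting by the join column is not wanted.)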


To see the effect on speed and memory use, let's benchmark with some larger datasets (for the data, see the second part of the used-data section below):

library(bench)
bm <- mark(AA <- BB[AA, on = .(aa)],
           AA[BB, on = .(aa), cc := cc],
           iterations = 1)

which gives (only relevant measurements shown):

> bm[, c(1,3,5)]
# A tibble: 2 x 3
  expression                        median mem_alloc
  <bch:expr>                      <bch:tm> <bch:byt>
1 AA <- BB[AA, on = .(aa)]           4.98s     4.1GB
2 AA[BB, on = .(aa), `:=`(cc, cc)] 560.88ms   384.6MB

So, in this setup the 'update-by-reference-join' is about 9 times faster and consumes 11 times less memory.

NOTE: Gains in speed and memory use might differ in different setups.


Used data:

# initial datasets
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# large datasets for the benchmark
set.seed(2019)
AA <- data.table(aa = 1:1e8, bb = sample(12:19, 1e8, TRUE))
BB <- data.table(aa = sample(AA$aa, 2e5), cc = sample(2:8, 2e5, TRUE))

Solution 2 - R

You can try this:

# used data
A <- data.table(a = 1:4, b = 12:15)
# set the key in 'B' to the column on which you join
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]
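The same right join, sketched with setkey() instead of the key argument of data.table(); either way all four rows of A are kept, and A's clashing b column appears with the i. prefix:

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

setkey(B, a)  # equivalent to key = 'a' at creation time
res <- B[A]   # joins on B's key; all rows of A are kept
```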

Solution 3 - R

For the sake of completeness, here is a table.express version of an answer to your question. table.express brings tidyverse-style verbs to data.table, making it a handy tool for working quickly with large datasets. Here is the solution using the datasets from the question above:

merge_test = dataA %>% left_join(dataB, by = "A")

A left_join keeps all rows from dataA in the joined dataset.

Note: You must load the packages data.table and table.express.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type     Original Author   Original Content on Stackoverflow
Question         lord12            View Question on Stackoverflow
Solution 1 - R   Jaap              View Answer on Stackoverflow
Solution 2 - R   danas.zuokas      View Answer on Stackoverflow
Solution 3 - R   ToWii             View Answer on Stackoverflow