What can you do with a data.frame that you can't with a data.table?
R Problem Overview
I just started using R, and came across data.table. I found it brilliant.
A very naive question: can I ignore data.frame and use only data.table, to avoid confusing the syntax of the two?
R Solutions
Solution 1 - R
From the data.table FAQ
## FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

> As FAQ 1.1 highlights, `j` in `[.data.table` is fundamentally different from `j` in `[.data.frame`. Even something as simple as `DF[,1]` would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
>
> Furthermore, `data.table` inherits from `data.frame`. It is a `data.frame`, too. A `data.table` can be passed to any package that only accepts `data.frame` and that package can use `[.data.frame` syntax on the `data.table`.
>
> We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
>
> > `unique()` and `match()` are now faster on character vectors where all elements are in the global `CHARSXP` cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
>
> A second proposal was to use `memcpy` in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
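The inheritance point in the quote above can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; the toy columns `x` and `y` and the helper `first_column_mean()` are made up for illustration and are not part of any package.

```r
library(data.table)

# Toy data; the column names x and y are invented for this example
DT <- data.table(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# A data.table carries both classes
class(DT)            # "data.table" "data.frame"
is.data.frame(DT)    # TRUE

# Hypothetical helper that only checks for a data.frame
first_column_mean <- function(df) {
  stopifnot(is.data.frame(df))
  mean(df[[1]])
}
first_column_mean(DT)   # 3

# Base R modelling functions accept it as the data argument too
fit <- lm(y ~ x, data = DT)
coef(fit)
```

Because the class vector ends in `"data.frame"`, S3 dispatch falls back to data.frame methods wherever data.table does not define its own.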
## What are the smaller syntax differences between data.frame and data.table?
- `DT[3]` refers to the 3rd row, but `DF[3]` refers to the 3rd column
- `DT[3, ] == DT[3]`, but `DF[ , 3] == DF[3]` (somewhat confusingly in data.frame, whereas data.table is consistent)
- For this reason we say the comma is optional in `DT`, but not optional in `DF`
- `DT[[3]] == DF[, 3] == DF[[3]]`
- `DT[i, ]`, where `i` is a single integer, returns a single row, just like `DF[i, ]`, but unlike a matrix single-row subset which returns a vector.
- `DT[ , j]` where `j` is a single integer returns a one-column data.table, unlike `DF[, j]` which returns a vector by default
- `DT[ , "colA"][[1]] == DF[ , "colA"]`.
- `DT[ , colA] == DF[ , "colA"]` (currently in data.table v1.9.8 but is about to change, see release notes)
- `DT[ , list(colA)] == DF[ , "colA", drop = FALSE]`
- `DT[NA]` returns 1 row of `NA`, but `DF[NA]` returns an entire copy of `DF` containing `NA` throughout. The symbol `NA` is type `logical` in R and is therefore recycled by `[.data.frame`. The user's intention was probably `DF[NA_integer_]`. `[.data.table` diverts to this probable intention automatically, for convenience.
- `DT[c(TRUE, NA, FALSE)]` treats the `NA` as `FALSE`, but `DF[c(TRUE, NA, FALSE)]` returns `NA` rows for each `NA`
- `DT[ColA == ColB]` is simpler than `DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]`
- `data.frame(list(1:2, "k", 1:4))` creates 3 columns, data.table creates one `list` column.
- `check.names` is by default `TRUE` in `data.frame` but `FALSE` in data.table, for convenience.
- `stringsAsFactors` is by default `TRUE` in `data.frame` but `FALSE` in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to `factor`.
- Atomic vectors in `list` columns are collapsed when printed using `", "` in `data.frame`, but `","` in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In `[.data.frame` we very often set `drop = FALSE`. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column `data.frame`. In `[.data.table` we took the opportunity to make it consistent and dropped `drop`.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
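A few of the differences listed above are easier to see side by side. This is a small sketch rather than an exhaustive check; it assumes data.table 1.9.8 or later is installed, and every column name in it is invented for illustration.

```r
library(data.table)

# Toy data; all column names here are made up for the example
DF <- data.frame(ColA = c(1, NA, 3, 4),
                 ColB = c(1, 2, NA, 5),
                 ColC = c("w", "x", "y", "z"))
DT <- as.data.table(DF)

# Single index: a column for data.frame, a row for data.table
DF[3]     # one-column data.frame holding ColC
DT[3]     # one-row data.table holding the 3rd row

# Single-column selection with a comma
DF[, 1]                 # plain vector (drop = TRUE by default)
DF[, 1, drop = FALSE]   # one-column data.frame
DT[, 1]                 # one-column data.table
DT[[1]]                 # plain vector

# NA in a logical row filter
DF[DF$ColA == DF$ColB, ]   # keeps an all-NA row for every NA comparison
DT[ColA == ColB]           # NA is treated as FALSE, only the matching row remains

# Column-name checking at construction
names(data.frame(`my col` = 1))   # "my.col"
names(data.table(`my col` = 1))   # "my col"
```

One item in the list has since gone stale: from R 4.0.0 onward, `data.frame()` also defaults to `stringsAsFactors = FALSE`, so that point no longer separates the two in current R.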
Small caveat
There may be cases where a package uses code that fails when given a data.table. However, data.table is actively maintained to avoid such problems, so any that do arise are likely to be fixed promptly.
For example, from the NEWS for v1.8.2:

> - base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
> - An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.