Practical limits of R data frame

Tags: R, Performance, Dataframe, Rcpp

R Problem Overview


I have been reading about how read.table is not efficient for large data files, and about how R is not suited for large data sets. So I was wondering where I can find what the practical limits are, and any performance charts for (1) reading in data of various sizes and (2) working with data of varying sizes.

In effect, I want to know when performance deteriorates and when I hit a roadblock. Any comparison against C++/MATLAB or other languages would also be really helpful. Finally, if there is any specific performance comparison for Rcpp and RInside, that would be great!

R Solutions


Solution 1 - R

R is suited for large data sets, but you may have to change your way of working somewhat from what the introductory textbooks teach. I did a post on Big Data for R which crunches a 30 GB data set and which you may find useful for inspiration.

The usual sources of information to get started are the High-Performance Computing Task View on CRAN and the R-SIG HPC mailing list.

The main limit you have to work around is a historic cap on the length of a vector at 2^31 - 1 elements, which wouldn't be so bad if R did not store matrices as vectors. (The limit is kept for compatibility with some BLAS libraries.)
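For reference, that number is simply the largest value a 32-bit signed index can hold, which you can check from any R session:

```r
# The historic vector-length cap equals the largest 32-bit signed integer
.Machine$integer.max   # 2147483647
2^31 - 1               # the same number
```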

We regularly analyse telco call data records and marketing databases with multi-million customers using R, so would be happy to talk more if you are interested.

Solution 2 - R

The physical limits arise from the use of 32-bit indexes on vectors. As a result, vectors of up to 2^31 - 1 elements are allowed. Matrices are vectors with dimensions, so the product of nrow(mat) and ncol(mat) must be within 2^31 - 1. Data frames and lists are generic vectors, so each component can hold 2^31 - 1 entries, which for data frames means you can have that many rows and that many columns. For lists you can have 2^31 - 1 components, each of 2^31 - 1 elements. This is drawn from a recent posting by Duncan Murdoch in reply to a question on R-Help.
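To make the matrix constraint concrete, here is a small illustration (the 50,000 x 50,000 size is just an example, not anything from the original answer):

```r
# A matrix is a vector with a dim attribute, so nrow * ncol must stay
# within 2^31 - 1. A 50,000 x 50,000 matrix already exceeds that:
50000 * 50000             # 2.5e9 elements
50000 * 50000 > 2^31 - 1  # TRUE

# A data frame, by contrast, is a list of column vectors, so the
# 2^31 - 1 limit applies to each column's length (the number of rows),
# not to rows * columns.
```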

Now, all of that has to fit in RAM with standard R, so that may be a more pressing limit, but the High-Performance Computing Task View that others have mentioned contains details of packages that can circumvent the in-memory issue.

Solution 3 - R

  1. The R Import / Export manual should be the first port of call for questions about importing data - there are many options, and what will work for you can be very specific.

http://cran.r-project.org/doc/manuals/R-data.html

read.table specifically has greatly improved performance if you use the options provided to it, particularly colClasses, comment.char, and nrows - otherwise this information has to be inferred from the data itself, which can be costly.
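A sketch of what that looks like in practice (the file name, separator, and column types below are made up for illustration):

```r
# Supplying up front what read.table would otherwise have to infer
dat <- read.table("big_file.txt",
                  header       = TRUE,
                  sep          = "\t",
                  colClasses   = c("integer", "numeric", "character"),  # one per column
                  nrows        = 1000000,   # a modest over-estimate is fine
                  comment.char = "")        # disable comment scanning
```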

  2. There is a specific limit on the length (total number of elements) of any vector, matrix, array, column in a data.frame, or list. This is due to a 32-bit index used under the hood, and it is true for both 32-bit and 64-bit R. The number is 2^31 - 1. This is the maximum number of rows for a data.frame, but it is so large that you are far more likely to run out of memory even for single vectors before you start collecting several of them.

See help("Memory-limits") and help("Memory") for details.

A single vector of that maximum length will take many gigabytes of memory (it depends on the type and storage mode of the vector - roughly 17 GB for a numeric vector), so it is unlikely to be a practical limit unless you are really pushing things. If you really need to push past the available system memory (64-bit is mandatory here), then the standard database techniques discussed in the Import/Export manual, or memory-mapped file options (like the ff package), are worth considering. The CRAN Task View on High-Performance Computing is a good resource for this end of things.
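The back-of-the-envelope arithmetic behind that figure:

```r
# A maximal vector of 2^31 - 1 elements, at 8 bytes per double
(2^31 - 1) * 8 / 1e9   # ~17.2 GB for a numeric vector
(2^31 - 1) * 4 / 1e9   # ~8.6 GB for an integer vector
```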

Finally, if you have stacks of RAM (16 GB or more) and need 64-bit indexing, it might come in a future release of R. http://www.mail-archive.com/[email protected]/msg92035.html

Also, Ross Ihaka discusses some of the historical decisions and future directions for an R-like language in papers and talks here: http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks

Solution 4 - R

I can only answer the question about read.table, since I don't have any experience with large data sets. read.table performs poorly if you don't provide a colClasses argument. Without it, colClasses defaults to NA and read.table tries to guess the class of every column, which can be slow, especially when you have a lot of columns.
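A rough way to see the difference on your own data (the file name and column classes here are hypothetical):

```r
# Time the same read with and without colClasses
system.time(read.csv("big.csv"))                    # classes guessed from the data
system.time(read.csv("big.csv",
                     colClasses = c("integer", "numeric", "character")))
```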

Solution 5 - R

When reading large CSV files (a few GB, i.e. millions of rows), I think data.table::fread (as of version 1.8.7) is the quickest alternative. You can get it by doing install.packages("data.table", repos="http://R-Forge.R-project.org").
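Basic usage is a one-liner (the file name is assumed):

```r
library(data.table)
DT <- fread("big_file.csv")   # separator, header, and column classes detected automatically
```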

You usually gain a factor of 5 to 10 (and sep, row.names, etc. are all handled by the function itself). If you have many files and a decent enough computer (several cores), I recommend using the parallel package (part of R since 2.14) to load one file per core; see the sketch below.

The last time I did this, going from single-threaded loading with read.csv to fread run in parallel on 4 cores took me from 5 minutes down to 20 seconds.
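A minimal sketch of that parallel approach, assuming one CSV file per core sitting under a data/ directory (the paths and core count are made up):

```r
library(data.table)
library(parallel)

# Collect the files to load
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# One worker per core; each worker reads one file with fread
cl  <- makeCluster(4)
dts <- parLapply(cl, files, function(f) data.table::fread(f))
stopCluster(cl)

# Stack the pieces into a single data.table
DT <- rbindlist(dts)
```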

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type   | Original Author  | Original Content on Stackoverflow
Question       | Egon             | View Question on Stackoverflow
Solution 1 - R | Allan Engelhardt | View Answer on Stackoverflow
Solution 2 - R | Gavin Simpson    | View Answer on Stackoverflow
Solution 3 - R | mdsumner         | View Answer on Stackoverflow
Solution 4 - R | aL3xa            | View Answer on Stackoverflow
Solution 5 - R | statquant        | View Answer on Stackoverflow