How to organize large R programs?

Tags: R, Package, Conventions, Code Organization, Project Organization

R Problem Overview


When I undertake an R project of any complexity, my scripts quickly get long and confusing.

What are some practices I can adopt so that my code will always be a pleasure to work with? I'm thinking about things like

  • Placement of functions in source files
  • When to break something out to another source file
  • What should be in the master file
  • Using functions as organizational units (whether this is worthwhile given that R makes it hard to access global state)
  • Indentation / line break practices.
    • Treat ( like {?
    • Put things like )} on 1 or 2 lines?

Basically, what are your rules of thumb for organizing large R scripts?

R Solutions


Solution 1 - R

The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

Packages give you

  • a quasi-automatic way to organize your code by topic
  • a strong incentive to write help files, which makes you think about the interface
  • a lot of sanity checks via R CMD check
  • a chance to add regression tests
  • namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package, even if you do not plan to publish it, since you can write internal packages for internal repositories.
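As a rough illustration of the regression-test point: R CMD check runs the R scripts it finds in a package's tests/ directory and fails if one of them raises an error. The sketch below shows what such a script might contain. The package name mytools and the function rescale01() are made up for the example, and the function is defined inline only so that the snippet runs on its own.

```r
# Sketch of tests/regression.R for a hypothetical package "mytools".
# In a real package the first line would be library(mytools); the function
# is defined inline here so the example is self-contained.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Each stopifnot() call raises an error if the expected result changes,
# which makes R CMD check fail.
stopifnot(isTRUE(all.equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))))
stopifnot(is.na(rescale01(c(1, NA, 3))[2]))
```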

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use defaults in Emacs' ESS mode.

Update 2008-Aug-13: David Smith just blogged about the Google R Style Guide.

Solution 2 - R

I like putting each piece of functionality in its own file.

But I don't like R's package system. It's rather hard to use.

I prefer a lightweight alternative: place a file's functions inside an environment (what every other language calls a "namespace") and attach it. For example, I made a 'util' group of functions like so:

util = new.env()

util$bgrep = function [...]

util$timeit = function [...]

while("util" %in% search())
  detach("util")
attach(util)

This is all in a file util.R. When you source it, you get the environment 'util' so you can call util$bgrep() and such; but furthermore, the attach() call makes it so just bgrep() and such work directly. If you didn't put all those functions in their own environment, they'd pollute the interpreter's top-level namespace (the one that ls() shows).
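For readers who want something they can run, here is a minimal sketch of the same pattern with the function bodies filled in. The helpers clamp() and rescale() are hypothetical stand-ins, not the functions from the original util.R.

```r
# util.R: sketch of the environment-as-module pattern.
# clamp() and rescale() are hypothetical helpers used only for illustration.
util <- new.env()

# Restrict values to the interval [lo, hi].
util$clamp <- function(x, lo, hi) pmin(pmax(x, lo), hi)

# Map values linearly onto [0, 1].
util$rescale <- function(x) (x - min(x)) / (max(x) - min(x))

# Remove copies attached by earlier source() calls, then attach a fresh one.
while ("util" %in% search())
  detach("util")
attach(util)

util$clamp(c(-1, 0.5, 2), 0, 1)  # qualified call through the environment
clamp(c(-1, 0.5, 2), 0, 1)       # unqualified call found on the search path
```

One thing to keep in mind: attach() copies the objects into a new environment on the search path, so a function added to util later is only visible unqualified after the file is sourced again.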

I was trying to simulate Python's system, where every file is a module. That would be better to have, but this seems OK.

Solution 3 - R

This might sound a little obvious especially if you're a programmer, but here's how I think about logical and physical units of code.

I don't know if this is your case, but when I'm working in R, I rarely start out with a large complex program in mind. I usually start in one script and separate code into logically separable units, often using functions. Data manipulation and visualization code get placed in their own functions, etc. And such functions are grouped together in one section of the file (data manipulation at the top, then visualization, etc). Ultimately you want to think about how to make it easier for you to maintain your script and lower the defect rate.

How fine- or coarse-grained you make your functions will vary and there are various rules of thumb: e.g. 15 lines of code, or "a function should be responsible for doing one task which is identified by its name", etc. Your mileage will vary. Since R doesn't support call-by-reference, I'm usually wary of making my functions too fine-grained when it involves passing data frames or similar structures around. But this may be overcompensation for some silly performance mistakes when I first started out with R.
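To make the call-by-value point concrete, here is a small sketch (add_ratio() is a hypothetical helper). A function works on its own copy of a data frame, so the caller has to capture the returned value. As far as I know, R defers the actual copying until the object is modified, so passing a data frame to a function that only reads it is cheap.

```r
# add_ratio() is a hypothetical helper; it works on its own copy of df.
add_ratio <- function(df) {
  df$ratio <- df$x / df$y   # modifies the local copy only
  df                        # the caller must capture the returned value
}

d <- data.frame(x = c(1, 2), y = c(2, 4))
d2 <- add_ratio(d)

names(d)   # "x" "y": the original is unchanged
names(d2)  # "x" "y" "ratio"
```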

When to extract logical units into their own physical units (like source files and bigger groupings like packages)? I have two cases. First, if the file gets too large and scrolling around among logically unrelated units is an annoyance. Second, if I have functions that can be reused by other programs. I usually start out by placing some grouped unit, say data manipulation functions, into a separate file. I can then source this file from any other script.
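A sketch of that last step, assuming a helper file called data_helpers.R with one function, drop_incomplete() (both names are made up). The file is written to a temporary directory here only so the example runs anywhere; in a real project it would simply live next to the main script.

```r
# Write a stand-in helper file to a temporary directory so the example runs
# anywhere; in a real project this would be a file such as data_helpers.R
# kept next to the main script (file and function names are hypothetical).
helper_file <- file.path(tempdir(), "data_helpers.R")
writeLines(c(
  "# Keep only rows without missing values.",
  "drop_incomplete <- function(df) df[complete.cases(df), , drop = FALSE]"
), helper_file)

# Main script: load the shared helpers, then use them.
source(helper_file)

raw <- data.frame(id = 1:3, value = c(10, NA, 30))
clean <- drop_incomplete(raw)
nrow(clean)  # 2
```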

If you're going to deploy your functions, then you need to start thinking about packages. I don't deploy R code in production or for re-use by others for various reasons (briefly: org culture prefers other languages, concerns about performance, GPL, etc). Also, I tend to constantly refine and add to my collections of sourced files, and I'd rather not deal with packages when I make a change. So you should check out the other package-related answers, like Dirk's, for more details on this front.

Finally, I think your question isn't necessarily particular to R. I would really recommend reading Code Complete by Steve McConnell which contains a lot of wisdom about such issues and coding practices at large.

Solution 4 - R

My concise answer:

  1. Write your functions carefully, identifying general enough outputs and inputs;
  2. Limit the use of global variables;
  3. Use S3 objects and, where appropriate, S4 objects;
  4. Put the functions in packages, especially when your functions are calling C/Fortran.
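To illustrate point 3, here is a minimal sketch of an S3 class with a constructor, a print() method, and one new generic. The class name fit_summary and the generic conf_int() are hypothetical, chosen only for the example.

```r
# Constructor for a hypothetical S3 class "fit_summary".
new_fit_summary <- function(estimate, se) {
  stopifnot(is.numeric(estimate), is.numeric(se),
            length(estimate) == length(se))
  structure(list(estimate = estimate, se = se), class = "fit_summary")
}

# print() method, dispatched automatically for objects of this class.
print.fit_summary <- function(x, ...) {
  cat("fit_summary with", length(x$estimate), "estimates\n")
  invisible(x)
}

# A new generic with a method for the class.
conf_int <- function(x, ...) UseMethod("conf_int")
conf_int.fit_summary <- function(x, level = 0.95, ...) {
  z <- qnorm(1 - (1 - level) / 2)
  cbind(lower = x$estimate - z * x$se, upper = x$estimate + z * x$se)
}

fs <- new_fit_summary(estimate = c(1.2, -0.4), se = c(0.1, 0.2))
print(fs)
conf_int(fs)
```

Keeping construction in one function and behavior in methods means the rest of the code only depends on the interface, which fits points 1 and 2 as well.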

I believe R is more and more used in production, so the need for reusable code is greater than before. I find the interpreter much more robust than before. There is no doubt that R is 100-300x slower than C, but usually the bottleneck is concentrated around a few lines of code, which can be delegated to C/C++. I think it would be a mistake to delegate the strengths of R in data manipulation and statistical analysis to another language. In these instances, the performance penalty is low, and in any case well worth the savings in development effort. If execution time alone were the matter, we'd all be writing assembler.

Solution 5 - R

I've been meaning to figure out how to write packages but haven't invested the time. For each of my mini-projects I keep all of my low-level functions in a folder called 'functions/', and source them into a separate namespace that I explicitly create.

The following lines of code will create an environment named "myfuncs" on the search path if it doesn't already exist (using attach), and populate it with the functions contained in the .r files in my 'functions/' directory (using sys.source). I usually put these lines at the top of my main script meant for the "user interface" from which high-level functions (invoking the low-level functions) are called.

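# Create an empty "myfuncs" environment on the search path if it is missing.
# attach(NULL, name = ...) is needed: attach("myfuncs") would look for a file.
if( length(grep("^myfuncs$",search()))==0 )
  attach(NULL,pos=2,name="myfuncs")
# Source every .r file in functions/ into that environment.
for( f in list.files("functions","\\.r$",full.names=TRUE) )
  sys.source(f,pos.to.env(grep("^myfuncs$",search())))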

When you make changes you can always re-source it with the same lines, or use something like

evalq(f <- function(x) x * 2, pos.to.env(grep("^myfuncs$",search())))

to evaluate additions/modifications in the environment you created.

It's kludgey, I know, but it avoids having to be too formal about it (though if you get the chance, I do encourage using the package system; hopefully I will migrate that way in the future).

As for coding conventions, this is the only thing I've seen regarding aesthetics (I like them and loosely follow but I don't use too many curly braces in R):

http://www1.maths.lth.se/help/R/RCC/

There are other "conventions" regarding the use of [,drop=FALSE] and <- as the assignment operator suggested in various presentations (usually keynote) at the useR! conferences, but I don't think any of these are strict (though the [,drop=FALSE] is useful for programs in which you are not sure of the input you expect).
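A quick sketch of why [,drop=FALSE] matters when you are not sure of the input (count_cols() is a hypothetical helper): selecting a single row from a matrix returns a plain vector unless you ask R to keep the dimensions.

```r
m <- matrix(1:6, nrow = 2)

# Selecting a single row silently drops the matrix structure.
is.matrix(m[1, ])                 # FALSE
# drop = FALSE keeps the result a 1 x 3 matrix.
dim(m[1, , drop = FALSE])         # 1 3

# A hypothetical helper that relies on ncol(); without drop = FALSE,
# ncol() on the plain vector would return NULL.
count_cols <- function(x, rows) ncol(x[rows, , drop = FALSE])
count_cols(m, 1)                  # 3
count_cols(m, 1:2)                # 3
```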

Solution 6 - R

Count me as another person in favor of packages. I'll admit to being pretty poor at writing man pages and vignettes until I have to (i.e. when the package is being released), but it makes for a really handy way to bundle source code. Plus, if you get serious about maintaining your code, the points that Dirk brings up all come into play.

Solution 7 - R

I also agree. Use the package.skeleton() function to get started. Even if you think your code may never be run again, it may help motivate you to create more general code that could save you time later.
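A minimal sketch of getting started with package.skeleton(), assuming a made-up package name mytools and a single hypothetical function. It writes into a temporary directory so nothing in your working directory is touched.

```r
# Collect the functions to package in their own environment.
# "mytools" and celsius_to_kelvin() are hypothetical names.
pkg_env <- new.env()
pkg_env$celsius_to_kelvin <- function(x) x + 273.15

# Create the skeleton in a temporary directory so nothing in the
# working directory is touched.
out_dir <- tempfile("skeleton_demo")
dir.create(out_dir)
package.skeleton(
  name = "mytools",
  list = "celsius_to_kelvin",
  environment = pkg_env,
  path = out_dir
)

# The generated directory contains DESCRIPTION, NAMESPACE, R/ and man/.
list.files(file.path(out_dir, "mytools"))
```

From there you would edit DESCRIPTION and the generated help files before building and checking the package.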

As for accessing the global environment, that is easy with the <<- operator, though it is discouraged.
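For what it's worth, here is a small sketch of what <<- does and the alternative that is usually easier to reason about. The function names are hypothetical.

```r
total <- 0

# Discouraged: <<- reaches out of the function and rebinds `total` in an
# enclosing environment (the global one when run at top level), which is
# a hidden side effect.
add_to_total <- function(x) {
  total <<- total + x
  invisible(total)
}
add_to_total(5)
total  # 5

# Usually clearer: take the state as an argument and return the new value.
add <- function(total, x) total + x
total <- add(total, 2)
total  # 7
```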

Solution 8 - R

Having not learned how to write packages yet, I have always organized by sourcing sub-scripts. It's similar to writing classes but not as involved. It's not programmatically elegant, but I find I build up analyses over time. Once I have a big section that works, I often move it to a different script and just source it, since it will use the workspace objects. Perhaps I need to import data from several sources, sort all of them and find the intersections. I might put that section into an additional script. However, if you want to distribute your "application" to other people, or it uses some interactive input, a package is probably a good route. As a researcher I rarely need to distribute my analysis code but I OFTEN need to augment or tweak it.

Solution 9 - R

I have also been searching for the holy grail of the right workflow for putting together a large R project. Last year, I found a package called rsuite, and it was certainly what I was looking for. This R package was explicitly developed for deployment of large R projects, but I found that it can be used for small, medium-sized, and large R projects alike. I will give links to real-world examples in a minute (below), but first, I want to explain the new paradigm of building R projects with rsuite.

Note. I am not the creator or developer of rsuite.

  1. We have been doing projects all wrong with RStudio; the goal shouldn't be the creation of a project or a package but of a larger scope. In rsuite you create a super-project or master project, which holds the standard R projects and R packages, in all combinations possible.

  2. By having an R super-project you no longer need Unix make to manage the lower levels of the R projects underneath; you use R scripts at the top. Let me show you. When you create an rsuite master project, you get this folder structure:

[Image: folder structure of a new rsuite master project]

  1. The folder R is where you put your project management scripts, the ones that will replace make.

  2. The folder packages is the folder where rsuite holds all the packages that compose the super-project. You can also copy paste a package that is not accessible from the internet, and rsuite will build it as well.

  3. The folder deployment is where rsuite will write all the package binaries that were indicated in the packages' DESCRIPTION files. This, by itself, makes your project totally reproducible across time.

  4. rsuite comes with a client for all operating systems. I have tested them all. But you can also install it as an addin for RStudio.

  5. rsuite also lets you build an isolated conda installation in its own folder conda. This is not an environment but a physical Python installation derived from Anaconda in your machine. This works together with R's SystemRequirements, from which you could install all the Python packages you want, from any conda channel you want.

  6. You can also create local repositories to pull R packages when you are offline, or want to build the whole thing faster.

  7. If you want, you can also build the R project as a zip file and share it with colleagues. It will run, provided your colleagues have the same R version installed.

  8. Another option is building a container of the whole project in Ubuntu, Debian, or CentOS. So, instead of sharing a zip file with your project build, you share the whole Docker container with your project ready to run.

I have been experimenting a lot with rsuite, looking for full reproducibility and trying to avoid depending on the packages installed in the global environment. Depending on them is a mistake because, as soon as you install a package update, the project, more often than not, stops working, especially when it relies on very specific calls to a function with certain parameters.

The first thing I experimented with was bookdown ebooks. I have never been lucky enough to have a bookdown project survive the test of time for longer than six months. So I converted the original bookdown project to follow the rsuite framework. Now I don't have to worry about updating my global R environment, because the project has its own set of packages in the deployment folder.

The next thing I did was create machine learning projects the rsuite way: a master, orchestrating project at the top, with all sub-projects and packages under its control. It really changes the way you code with R, making you more productive.

After that I started working on a new package of mine called rTorch. This was possible, in large part, because of rsuite; it lets you think and go big.

One piece of advice, though: learning rsuite is not easy. Because it presents a new way of creating R projects, it feels hard. Do not be discouraged by the first attempts; keep climbing the slope until you make it. It requires advanced knowledge of your operating system and file system.

I hope that one day RStudio will let us generate orchestrating projects from the menu, as rsuite does. It would be awesome.

Links:

RSuite GitHub repo

r4ds bookdown

keras and shiny tutorial

moderndive-book-rsuite

interpretable_ml-rsuite

IntroMachineLearningWithR-rsuite

clark-intro_ml-rsuite

hyndman-bookdown-rsuite

statistical_rethinking-rsuite

fread-benchmarks-rsuite

dataviz-rsuite

retail-segmentation-h2o-tutorial

telco-customer-churn-tutorial

sclerotinia_rsuite

Solution 10 - R

R is OK for interactive use and small scripts, but I wouldn't use it for a large program. I'd use a mainstream language for most of the programming and wrap it in an R interface.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Dan Goldstein | View Question on Stack Overflow
Solution 1 - R | Dirk Eddelbuettel | View Answer on Stack Overflow
Solution 2 - R | Brendan OConnor | View Answer on Stack Overflow
Solution 3 - R | ars | View Answer on Stack Overflow
Solution 4 - R | gappy | View Answer on Stack Overflow
Solution 5 - R | hatmatrix | View Answer on Stack Overflow
Solution 6 - R | geoffjentry | View Answer on Stack Overflow
Solution 7 - R | cameron.bracken | View Answer on Stack Overflow
Solution 8 - R | kpierce8 | View Answer on Stack Overflow
Solution 9 - R | f0nzie | View Answer on Stack Overflow
Solution 10 - R | John D. Cook | View Answer on Stack Overflow