What's the biggest R-gotcha you've run across?

R Problem Overview


Is there a certain R gotcha that really surprised you one day? I think we'd all gain from sharing these.

Here's mine: in list indexing, my.list[[1]] is not my.list[1]. Learned this in the early days of R.
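
A minimal illustration of the difference (my own example, not from the original question):

> my.list <- list(a = 1, b = "two")
> my.list[1]    # single bracket: a sub-list containing the first element
$a
[1] 1

> my.list[[1]]  # double bracket: the first element itself
[1] 1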

R Solutions


Solution 1 - R

[Hadley pointed this out in a comment.]

When using a sequence as an index for iteration, it's better to use the seq_along() function rather than something like 1:length(x).

Here I create a vector and both approaches return the same thing:

> x <- 1:10
> 1:length(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq_along(x)
 [1]  1  2  3  4  5  6  7  8  9 10

Now make the vector NULL:

> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # counts backwards from 1 to 0; this is bad
[1] 1 0

This can cause some confusion in a loop:

> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>

Solution 2 - R

The automatic creation of factors when you load data. You unthinkingly treat a column in a data frame as character, and it works well until you try to change a value to one that isn't an existing level. This generates a warning but leaves NAs in your data frame ...

When something goes unexpectedly wrong in your R script, check that factors aren't to blame.
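
A minimal sketch of the trap (the exact warning wording varies by R version, and since R 4.0.0 you need stringsAsFactors = TRUE to reproduce it, because character columns are no longer converted by default):

df <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
df$x[1] <- "c"   # "c" is not an existing level
# Warning message: invalid factor level, NA generated
df$x
# [1] <NA> b
# Levels: a b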

Solution 3 - R

Forgetting the drop=FALSE argument when subsetting a matrix down to a single dimension, thereby dropping the object's class as well:

R> X <- matrix(1:4,2)
R> X
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
     [,1]
[1,]    1
[2,]    2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R> 

Solution 4 - R

Removing rows from a data frame keeps the original row names, so adding a row later can create a duplicate row name, which then errors out:

> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
  c.1..2..3..4. c.4..3..2..1.
1             1             4
2             2             3
4             4             1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4",  : 
  duplicate row.names: 4

So what is going on here is:

  1. A four row data.frame is created, so the rownames are c(1,2,3,4)

  2. The third row is deleted, so the rownames are c(1,2,4)

  3. A fourth row is added, and R automatically sets the row name equal to the index, i.e. 4, so the row names are c(1,2,4,4). This is illegal because row names should be unique. I don't see why this type of behavior should be allowed by R. It seems to me that R should provide a unique row name. A workaround is sketched below.
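
A minimal workaround, assuming you just want fresh sequential row names before growing the data frame:

a <- data.frame(x = c(1, 2, 3, 4), y = c(4, 3, 2, 1))
a <- a[-3, ]
rownames(a) <- NULL  # reset row names to 1..n
a[4, 1] <- 1         # now succeeds; the new row gets the unique name "4"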

Solution 5 - R

First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could easily be improved is the printing of numbers whose decimal part falls beyond R's default display precision.

x <- 10.2 * 100
x
1020
as.integer(x)
1019

I don't mind the result being shown as an integer when it really is an integer; if the value really were 1020, then printing that for x would be fine. But printing something as simple as 1020.0 in this case would have made it obvious that the value was neither an integer nor representable as one. R should default to some kind of indication when there is an extremely small decimal component that isn't displayed.
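
To see what's actually stored, you can ask for more digits (a quick check, not a fix; the printed digits are approximate):

x <- 10.2 * 100
x == 1020            # FALSE -- the product is not exactly 1020
sprintf("%.15f", x)  # "1019.999999999999886"
round(x)             # 1020; safer than as.integer(), which truncates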

Solution 6 - R

It can be annoying to have to allow for combinations of NA, NaN and Inf. They behave differently, and tests for one won't necessarily work for the others:

> x <- c(NA,NaN,Inf)
> is.na(x)
[1]  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE  TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE  TRUE

However, the safest way to test for any of these trouble-makers is:

> is.finite(x)
[1] FALSE FALSE FALSE
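
If you do need to tell them apart, one way (my own sketch) is to combine the tests:

x <- c(NA, NaN, Inf)
is.na(x) & !is.nan(x)  # TRUE only for the "real" NA
!is.finite(x)          # TRUE for all three trouble-makers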

Solution 7 - R

Always test what happens when you have an NA!

One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no manner of programming will overcome issues with your data.

For instance, any vectorized operation involving an NA evaluates to NA. This is "surprising" on the face of it:

> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA

This behavior propagates into other, higher-level functions.

In other words, missing values frequently carry as much importance as measured values by default. Many functions have na.rm=TRUE/FALSE arguments; it's worth spending some time deciding how to interpret these default settings.
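
Continuing the example above, na.rm = TRUE drops missing values before computing:

> sum(x, na.rm = TRUE)
[1] 4
> mean(x, na.rm = TRUE)
[1] 1.333333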

Edit 1: Marek makes a great point. NA values can also cause confusing behavior with logical operators. For instance:

> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA

This is also true when you're trying to create a conditional expression (for an if statement):

> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA

When these NA values end up in your vector indexes, many unexpected things can follow (see the example below). This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.
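
A small illustration of an NA sneaking into a logical index (my own sketch):

> x <- c(10, 20, 30)
> idx <- c(FALSE, NA, TRUE)
> x[idx]        # the NA in the index yields an NA element
[1] NA 30
> x[which(idx)] # which() drops the NA
[1] 30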

Solution 8 - R

Forgetting that strptime() and friends return POSIXlt objects, which are built on a list of nine components, so length() is always nine -- converting to POSIXct helps:

R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R> 
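
Peeking under the hood shows where the nine comes from (my own check; newer R versions may carry extra components such as zone and gmtoff):

R> names(unclass(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"  "yday"  "isdst"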

Solution 9 - R

The round() function rounds halves to the nearest even number ("banker's rounding", per the IEC 60559 standard).

> round(3.5)
[1] 4  

> round(4.5)
[1] 4
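
If you need "round half away from zero" instead, a minimal sketch:

round_half_up <- function(x) sign(x) * trunc(abs(x) + 0.5)
round_half_up(3.5)  # 4
round_half_up(4.5)  # 5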

Solution 10 - R

Math on integers is subtly different from math on doubles (and sometimes complex numbers are weird too).

UPDATE: Some of these were fixed in R 2.15:

1^NA      # 1
1L^NA     # NA
(1+0i)^NA # NA 

0L %/% 0L # 0L  (NA from R 2.15)
0 %/% 0   # NaN
4L %/% 0L # 0L  (NA from R 2.15)
4 %/% 0   # Inf

Solution 11 - R

I'm surprised that no one has mentioned this, but:

T and F can be overridden; TRUE and FALSE can't.

Example:

x <- sample(c(0,1,NA), 100, T)
T <- 0:10

mean(x, na.rm=T)
# Warning in if (na.rm) x <- x[!is.na(x)] :
#   the condition has length > 1 and only the first element will be used
# Calls: mean -> mean.default
# [1] NA

plot(rnorm(7), axes=T)
# Warning in if (axes) { :
#   the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
# Warning in if (frame.plot) localBox(...) :
#   the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
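
If you've already clobbered T, removing the masking variable restores the base binding:

rm(T)
mean(x, na.rm = T)  # works again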

[edit] Ctrl+F tricked me. Shane mentions this in his comment.

Solution 12 - R

Reading in data can be more problematic than you may think. Today I found that read.csv() automatically skips blank lines in the .csv file. This makes sense for most applications, but if you're automatically extracting data from (for example) row 27 of several thousand files, and some of the preceding rows may or may not be blank, things can go horribly wrong if you're not careful.

I now use

data1 <- read.table(file_name, blank.lines.skip = FALSE, sep = ",")

When you're importing data, check that you're doing what you actually think you're doing again and again and again...

Solution 13 - R

The tricky behaviour of the all.equal() function.

One of my recurring errors is comparing sets of floating point numbers. I have a CSV like:

... mu,  tau, ...
... 0.5, 1.7, ...

Reading the file and trying to subset the data sometimes works and sometimes fails, due, of course, to falling into the pits of the floating point trap again and again. At first the data contains only integer values, then later it transforms into real values; you know the story. Comparisons should be done with the all.equal() function instead of the == operator, but of course the code I first wrote used the latter approach.

Yeah, cool, but all.equal() returns TRUE for equal numbers, and a textual error message if it fails:

> all.equal(1,1)
[1] TRUE
> all.equal(1:10, 1:5)
[1] "Numeric: lengths (10, 5) differ"
> all.equal(1:10, c(1:5,1:5))
[1] "Mean relative difference: 0.625"

The solution is to wrap it in the isTRUE() function:

if (!isTRUE(all.equal(x, y, tolerance = doubleErrorRate))) {  # doubleErrorRate: your own tolerance constant
    ...
}

How many times have I had to read the all.equal() description...

Solution 14 - R

This one hurt so much that I spent hours adding comments to a bug-report. I didn't get my wish, but at least the next version of R will generate an error.

R> nchar(factor(letters))
 [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Update: As of R 3.2.0 (probably earlier), this example now generates an error message. As mentioned in the comments below, a factor is NOT a vector and nchar() requires a vector.

R> nchar(factor(letters))
Error in nchar(factor(letters)) : 'nchar()' requires a character vector
R> is.vector(factor(letters))
[1] FALSE

Solution 15 - R

  1. accidentally listing the source code of a function by forgetting to include the empty parentheses: e.g. "ls" versus "ls()"

  2. true and false don't cut it as pre-defined constants, as they do in Matlab, C++, Java, Python; you must use TRUE and FALSE

  3. invisible return values: e.g. ".packages()" appears to return nothing, while "(.packages())" returns a character vector of package base names (see the sketch after this list)
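
A minimal illustration of point 3 (invisible return values):

f <- function() invisible(42)
f()       # prints nothing
(f())     # wrapping in parentheses forces printing: [1] 42
x <- f()  # the value is assigned all the same: x is 42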

Solution 16 - R

For instance, the number 3.14 is a numerical constant, but the expressions +3.14 and -3.14 are calls to the functions + and -:

> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"

See Section 13.2 in John Chambers' book Software for Data Analysis: Programming with R.

Solution 17 - R

Partial matching in the $ operator: this applies to lists, but also to data.frames.

df1 <- data.frame(foo=1:10, foobar=10:1)
df2 <- data.frame(foobar=10:1)

df1$foo # Correctly gets the foo column
df2$foo # Expect NULL, but this returns the foobar column!!!

# So, should use double bracket instead:
df1[["foo"]]
df2[["foo"]]

The [[ operator also has an exact flag, but it is thankfully TRUE by default.

Partial matching also affects attr:

x1 <- structure(1, foo=1:10, foobar=10:1)
x2 <- structure(2, foobar=10:1)

attr(x1, "foo") # Correctly gets the foo attribute
attr(x2, "foo") # Expect NULL, but this returns the foobar attribute!!!

# So, should use exact=TRUE
attr(x1, "foo", exact=TRUE)
attr(x2, "foo", exact=TRUE)

Solution 18 - R

Automatic repeating of vectors ("recycling") used as indices:

R> all.numbers <- c(1:5)
R> all.numbers
[1] 1 2 3 4 5
R> good.idxs <- c(T,F,T)
R> #note unfortunate length mismatch
R> good.numbers <- all.numbers[good.idxs]
R> good.numbers
[1] 1 3 4
R> #wtf? 
R> #why would you repeat the vector used as an index 
R> #without even a warning?
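
A cheap guard, if you'd rather fail fast than recycle silently:

R> stopifnot(length(good.idxs) == length(all.numbers))
Error: length(good.idxs) == length(all.numbers) is not TRUE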

Solution 19 - R

Zero-length vectors have some quirks:

R> kk=vector(mode="numeric",length=0)
R> kk
numeric(0)
R> sum(kk)
[1] 0
R> var(kk)
[1] NA

Solution 20 - R

Working with lists, there are a couple of unintuitive things:

Of course, the difference between [ and [[ takes some getting used to. For lists, the [ returns a list of (potentially 1) elements whereas the [[ returns the element inside the list.

List creation:

# When you're used to this:
x <- numeric(5) # A vector of length 5 with zeroes
# ... this might surprise you
x <- list(5)    # A list with a SINGLE element: the value 5
# This is what you have to do instead:
x <- vector('list', 5) # A list of length 5 with NULLs

So, how to insert NULL into a list?

x <- list("foo", 1:3, letters, LETTERS) # A sample list
x[[2]] <- 1:5        # Put 1:5 in the second element
# The obvious way doesn't work: 
x[[2]] <- NULL       # This DELETES the second element!
# This doesn't work either: 
x[2] <- NULL       # This DELETES the second element!

# The solution is NOT very intuitive:
x[2] <- list(NULL) # Put NULL in the second element

# Btw, now that we think we know how to delete an element:
x <- 1:10
x[[2]] <- NULL  # Nope, gives an ERROR!
x <- x[-2]    # This is the only way for atomic vectors (works for lists too)

Finally some advanced stuff like indexing through a nested list:

x <- list(a=1:3, b=list(c=42, d=13, e="HELLO"), f='bar')
x[[c(2,3)]] # HELLO (first selects the second element, then its third element)
x[c(2,3)]   # The second and third elements (b and f)

Solution 21 - R

One of the big confusions in R is that [i, drop = TRUE] does drop factor levels, but [i, j, drop = TRUE] does not!

> df = data.frame(a = c("europe", "asia", "oceania"), b = c(1, 2, 3))
> df$a[1:2, drop = TRUE]
[1] europe asia  
Levels: asia europe          <---- drops factor levels, works fine
> df[1:2,, drop = TRUE]$a
[1] europe asia  
Levels: asia europe oceania  <---- does not drop factor levels!
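
If you hit this, droplevels() (available since R 2.12) explicitly drops unused levels:

> df2 <- droplevels(df[1:2, ])
> levels(df2$a)
[1] "asia"   "europe"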

For more info see: https://stackoverflow.com/q/14123792/684229

Solution 22 - R

Coming from compiled languages and Matlab, I've occasionally gotten confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them merely to be parsed by the R interpreter. This mostly rears its head when you use nested functions.

In Matlab you can do:

function f1()
  v1 = 1;
  v2 = f2();
  fprintf('2 == %d\n', v2);

  function r1 = f2()
    r1 = v1 + 1 % nested function scope
  end
end

If you try to do the same thing in R, you have to put the nested function first, or you get an error! Defining the function isn't enough; it's not in the namespace until it's assigned to a variable! On the other hand, the function can refer to a variable that has not been defined yet.

f1 <- function() {
  f2 <- function() {
    v1 + 1
  }

  v1 <- 1

  v2 = f2()

  print(sprintf("2 == %d", v2))
}

Solution 23 - R

Mine from today: qnorm() takes probabilities and pnorm() takes quantiles.
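
A quick check of the mnemonic:

> qnorm(0.975)    # probability in, quantile out
[1] 1.959964
> pnorm(1.959964) # quantile in, probability out
[1] 0.975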

Solution 24 - R

For me, it's the counterintuitive way in which, after you export a data.frame to a text file using write.csv, importing it back requires an additional argument to get exactly the same data.frame, like this:

write.csv(m, file = 'm.csv')
read.csv('m.csv', row.names = 1) # Note the row.names argument

I originally posted this as a question on SO, and @BenBolker suggested it as an answer to this one.
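
Alternatively, a sketch that sidesteps the round trip by not writing row names at all:

write.csv(m, file = 'm.csv', row.names = FALSE)
read.csv('m.csv') # no extra argument needed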

Solution 25 - R

The apply family of functions doesn't only work on matrices; it scales up to multi-dimensional arrays. In my research I often have a dataset of, for example, the temperature of the atmosphere. This is stored in a multi-dimensional array with dimensions x, y, level, time, from now on called multi_dim_array. A mockup example would be:

multi_dim_array = array(runif(96 * 48 * 6 * 100, -50, 50), 
                        dim = c(96, 48, 6, 100))
> str(multi_dim_array)
#     x     y     lev  time    
 num [1:96, 1:48, 1:6, 1:100] 42.4 16 32.3 49.5 24.9 ...

Using apply one can easily get the:

# temporal mean value
> str(apply(multi_dim_array, 4, mean))
 num [1:100] -0.0113 -0.0329 -0.3424 -0.3595 -0.0801 ...
# temporal mean value per gridcell (x,y location)
> str(apply(multi_dim_array, c(1,2), mean))
 num [1:96, 1:48] -1.506 0.4553 -1.7951 0.0703 0.2915 ...
# temporal mean value per gridcell and level (x,y location, level)
> str(apply(multi_dim_array, c(1,2,3), mean))
 num [1:96, 1:48, 1:6] -3.839 -3.672 0.131 -1.024 -2.143 ...
# Spatial mean per level
> str(apply(multi_dim_array, c(3,4), mean))
 num [1:6, 1:100] -0.4436 -0.3026 -0.3158 0.0902 0.2438 ...

This makes the margin argument to apply seem much less counterintuitive. I first thought, why not use "row" and "col" instead of 1 and 2? But the fact that it also works for arrays with more dimensions makes it clear why using margin like this is preferred.

Solution 26 - R

which.min() and which.max() behave opposite to what you might expect when applied to the result of a comparison operator, and can even give incorrect answers. On a logical vector, which.max() returns the index of the first TRUE and which.min() the index of the first FALSE. So, for example, when trying to figure out which element in a sorted vector is the largest number less than a threshold (i.e. in a sequence from 100 to 200, which is the largest number less than 110):

set.seed(420)
x <- seq(100, 200)
which(x < 110)
# [1]  1  2  3  4  5  6  7  8  9 10
which.max(x < 110)  # index of the first TRUE
# [1] 1
which.min(x < 110)  # index of the first FALSE -- not what you wanted!
# [1] 11
x[11]
# [1] 110
max(which(x < 110)) # the correct idiom
# [1] 10
x[10]
# [1] 109

Solution 27 - R

The dirtiest gotcha, and one that can be really hard to find! Splitting multi-line expressions like this one:

K <- hyperpar$intcept.sigma2
		+ cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope)
		+ hyperpar$env.sigma2 * K.cache$k.env

R will only evaluate the first line; the other two just go to waste! And it gives no warning, nothing! This is nasty treachery on the unsuspecting user. It must actually be written like this:

K <- hyperpar$intcept.sigma2 +
		cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope) +
		hyperpar$env.sigma2 * K.cache$k.env

which is not quite a natural way of writing it.
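
Another option is to wrap the whole expression in parentheses; R then keeps parsing across lines even with leading operators:

K <- (hyperpar$intcept.sigma2
      + cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope)
      + hyperpar$env.sigma2 * K.cache$k.env)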

Solution 28 - R

This one!

all(c(1,2,3,4) == NULL)
[1] TRUE

I had this check in my code; I really need both tables to have the same column names:

stopifnot(all(names(x$x$env) == names(x$obsx$env)))

But the check passed (evaluated to TRUE) even when x$x$env didn't exist! The reason, and a safer check, are sketched below.
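
What happens: comparing with NULL yields a zero-length logical vector, and all() of an empty vector is vacuously TRUE:

> c(1, 2, 3, 4) == NULL
logical(0)
> all(logical(0))
[1] TRUE

A safer check, as a sketch, is identical(), which is FALSE when one side is NULL:

stopifnot(identical(names(x$x$env), names(x$obsx$env)))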

Solution 29 - R

You can use options(warn = 2), which, according to the manual:

> If warn is two or larger all warnings are turned into errors.

Indeed, the warnings are turned into errors, but, gotcha! The code still continues running after such errors!!!

source("script.R")
# ...
# Loading required package: bayesmeta
# Failed with error:  ‘(converted from warning) there is no package called ‘bayesmeta’’
# computing posterior (co)variances ... 
# (script continues running)
...

PS: but some other errors converted from warnings do stop the script... so I don't know, I'm confused. This one did stop the script:

Error in optimise(psiline, c(0, 2), adiff, a, as.matrix(K), y, d0, mn,  :
  (converted from warning) NA/Inf replaced by maximum positive value

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type: Original Author (on Stack Overflow)

Question: Vince
Solution 1: Shane
Solution 2: edward
Solution 3: Dirk Eddelbuettel
Solution 4: Ian Fellows
Solution 5: John
Solution 6: nullglob
Solution 7: Shane
Solution 8: Dirk Eddelbuettel
Solution 9: Milktrader
Solution 10: Tommy
Solution 11: Marek
Solution 12: Andrew
Solution 13: rlegendi
Solution 14: Kevin Wright
Solution 15: user186060
Solution 16: rcs
Solution 17: Tommy
Solution 18: user116293
Solution 19: Kevin Wright
Solution 20: Tommy
Solution 21: Tomas
Solution 22: Harlan
Solution 23: Adam SO
Solution 24: Juan
Solution 25: Paul Hiemstra
Solution 26: DChaps
Solution 27: Tomas
Solution 28: Tomas
Solution 29: Tomas