That Way Madness Lies: Arithmetic on data.frames

2014-07-21 3-minute read

tl;dr

Please do not use arithmetic on data.frame objects when programming in R. It’s a hack that only works if you know everything about your datasets. If anything happens to change the order of the rows in your data set, previously safe data.frame arithmetic operations will produce incorrect answers. If you learn to always explicitly merge two tables together before performing arithmetic on their shared columns, you’ll produce code that is both more reliable and more powerful.

Arithmetic between tables: getting wrong answers quickly

You may not be aware of it, but R allows you to do arithmetic on data.frame objects. For example, the following code works in R as of version 3.0.2:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2), Obs = c(2.0, 3.0))
> df3 <- (df1 + df2) / 2
> df3
  ID Obs
1  1 1.5
2  2 2.5

If you discover that you can do this, you might think that it’s a really cool trick. You might even start using data.frame arithmetic without realizing that your specific example had a bunch of special structure that was directly responsible for you getting the right answer.

Unfortunately, other examples that you didn’t see would have produced rather less pleasant outputs and led you to realize that arithmetic operations on data.frame objects don’t really make sense:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- (df1 + df2) / 2
> df3
   ID Obs
1 1.5   2
2 1.5   2

What happened here is obvious in retrospect: R added all of the columns together and then divided the result by two. The problem is that you didn’t actually want to add all of the columns together and then divide the result by two, because you had forgotten that the matching rows in df1 and df2 were not in the same index positions in the two tables.

Getting right answers with just a little more typing

Thankfully, it turns out that doing the right thing just requires a few more characters. What you should have done was to call merge before doing any arithmetic:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     2    1.5
2  2     2     3    2.5

What makes merge so unequivocally superior to data.frame arithmetic is that it still works when the two inputs have different numbers of rows:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2, 3), Obs = c(5.0, 6.0, 7.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     5      3
2  2     2     6      4

Knowledge is half the battle

Now that you know why performing arithmetic operations on data.frame objects is generally unsafe, I implore you to stop doing it. Learn to love merge.