Understanding vectorization
One of the real advantages of R is vectorization: the ability of R to do the same operation on each element of a vector.
Think of it this way: in R, we can add two objects using simple syntax
like this: y <- a + b. This works if a and b are each a single
number:
a <- 2
b <- 3
y <- a + b
y
[1] 5
…but it works just as well if we add two vectors:
a <- 1:5
b <- 11:15
y <- a + b
y
[1] 12 14 16 18 20
This is kind of amazing! In a lot of programming languages we would have
to tell the computer something like, “For each element of a, find the
corresponding element of b. Add the two. Create an object y of
the same length as a and b, and put the results into that.”
This is cumbersome! R makes it easy for us!
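To make the contrast concrete, here is a rough sketch of what those spelled-out instructions might look like as an explicit loop in R, next to the vectorized version. The names y_loop and y_vec are only illustrative and do not appear in the text above.

```r
a <- 1:5
b <- 11:15

# Explicit loop: create y first, then fill it one element at a time.
y_loop <- numeric(length(a))
for (i in seq_along(a)) {
  y_loop[i] <- a[i] + b[i]
}

# Vectorized version: a single expression does the same work.
y_vec <- a + b

# Both approaches produce the same values.
all(y_loop == y_vec)
```

The loop is legal R, but the vectorized form is shorter and harder to get wrong.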
Vector recycling
It’s kind of weird, when you think about it, that this also works:
a <- 1:5
b <- 1
a + b
[1] 2 3 4 5 6
Why is that weird? Well, check this out:
length(a)
[1] 5
length(b)
[1] 1
a has a length of 5, b has a length of 1, but R is not thrown
off by this discrepancy.
It might seem obvious that this would work, but consider that R has had
to make a decision here: how to handle a vectorized operation for
objects of different lengths? For instance, R could treat the missing
values in b as 0, yielding:
a + b
[1] 2 2 3 4 5
Or it could treat them as NA. Since anything plus NA is NA, we
would get:
a + b
[1] 2 NA NA NA NA
Or it could just refuse to do the operation, since it isn’t obvious how to handle the length disparity.
But instead, R figures out that what you really want to do is to add 1
to every element of a. R silently recycles the vector b so that it
is the same length as a, as if we had set b <- c(1, 1, 1, 1, 1) (or
b <- rep(1, times = 5)).
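One way to convince yourself of this is to do the recycling by hand and compare the results. This is just a small check; b_long is an illustrative name, not something R creates for you.

```r
a <- 1:5
b <- 1

# Write the recycling out by hand: stretch b to the length of a.
b_long <- rep(b, times = length(a))

# The hand-recycled version gives exactly the same result as a + b.
identical(a + b, a + b_long)
```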
Vectors are usually (always?) recycled
In fact, R will recycle any shorter vector to the length of the longer vector. Observe:
a <- rep(1, times = 3)
b <- c(2, 3)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 3 4 3
Here, R has “recycled” the first element of b so that it is added to
the third element of a. This is a bit dangerous. We do get a
warning (always pay attention to your warnings!) but I can’t think of
that many reasons we would want to do this, and I can think of a lot of
reasons we wouldn’t want to do this. Usually, in my life, if I am
performing vectorized operations on vectors of different lengths, it is
because I am making a mistake.
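If it helps to see what R did behind the scenes, the sketch below rebuilds the recycled vector explicitly with rep_len(), which repeats a vector until it reaches a requested length. The name b_recycled is only for illustration.

```r
a <- rep(1, times = 3)
b <- c(2, 3)

# Repeat b until it is as long as a, stopping partway through a cycle.
b_recycled <- rep_len(b, length.out = length(a))
b_recycled
# [1] 2 3 2

# Same values as a + b, but no warning, because the lengths now match.
a + b_recycled
# [1] 3 4 3
```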
Pay particular attention to cases where the length of the longer vector is an exact multiple of the length of the shorter one:
a <- rep(1, 4)
b <- c(2, 3)
a + b
[1] 3 4 3 4
This situation is even more dangerous because we don’t even get a warning.
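Since R will not complain here, one option is to check the lengths yourself before doing the arithmetic. The helper below, add_strict(), is hypothetical (it is not part of base R or any package) and is only a minimal sketch of the idea: allow equal lengths or a single value, and stop on anything else.

```r
# add_strict() is a made-up helper for illustration, not a base R function.
add_strict <- function(x, y) {
  # Accept equal lengths, or a single value recycled across the other vector.
  ok <- length(x) == length(y) || length(x) == 1 || length(y) == 1
  if (!ok) {
    stop("length mismatch: ", length(x), " vs ", length(y))
  }
  x + y
}

# Fine: a length-1 value is recycled across all five elements.
add_strict(1:5, 1)

# Stops with an error instead of silently recycling; tryCatch() captures
# the error message so the script keeps running.
tryCatch(
  add_strict(rep(1, 4), c(2, 3)),
  error = function(e) conditionMessage(e)
)
```

For a one-off check, a plain stopifnot(length(a) == length(b)) before the operation does a similar job.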
There is some good news: if you do all your work in data frames, with a tidyverse-based approach, it is much harder to run into this kind of problem, because tibbles recycle only length-1 values and raise an error for any other length mismatch.
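As a rough illustration, here is how tibble() reacts to mismatched column lengths. This sketch assumes the tibble package (part of the tidyverse) is installed; the text above does not name a specific package, so that choice is an assumption. The names ok and result are illustrative.

```r
# Only run the example if the tibble package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  # A length-1 value is still recycled, which is usually what we want.
  ok <- tibble::tibble(a = 1:4, b = 1)

  # A length-2 column next to a length-4 column is rejected with an error
  # rather than being silently recycled. tryCatch() captures that error.
  result <- tryCatch(
    tibble::tibble(a = 1:4, b = c(2, 3)),
    error = function(e) e
  )
  inherits(result, "error")
}
```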