Understanding vectorization
One of the real advantages of R is vectorization: the ability of R to do the same operation on each element of a vector.
Think of it this way: in R, we can add two objects using simple syntax like this: `y <- a + b`. This works if each of `a` and `b` is a single number:
a <- 2
b <- 3
y <- a + b
y
[1] 5
…but it works just as well if we add two vectors:
a <- 1:5
b <- 11:15
y <- a + b
y
[1] 12 14 16 18 20
This is kind of amazing! In a lot of programming languages we would have to tell the computer something like, “for each element of `a`, find the corresponding element of `b`. Add the two, create an object `y` of the same length as `a` and `b`, and put the results into that.”
This is cumbersome! R makes it easy for us!
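To see what the cumbersome version looks like, here is a minimal sketch of that loop written out in R itself, next to the vectorized form. The names `y_loop` and `y_vec` are made up for this comparison.

```r
a <- 1:5
b <- 11:15

# Explicit loop: create y first, then fill it one element at a time
y_loop <- integer(length(a))
for (i in seq_along(a)) {
  y_loop[i] <- a[i] + b[i]
}

# Vectorized version: a single expression does the same work
y_vec <- a + b

# Both approaches give the same vector
identical(y_loop, y_vec)
```

The loop is not wrong, it is just more to write and more to get wrong.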
Vector recycling
It’s kind of weird, when you think about it, that this also works:
a <- 1:5
b <- 1
a + b
[1] 2 3 4 5 6
Why is that weird? Well, check this out:
length(a)
[1] 5
length(b)
[1] 1
`a` has a length of 5 and `b` has a length of 1, but R is not thrown off by this discrepancy.
It might seem obvious that this would work, but consider that R has had
to make a decision here: how to handle a vectorized operation for
objects of different lengths? For instance, R could treat the missing values in `b` as 0, yielding:
a + b
[1] 2 2 3 4 5
Or it could treat them as `NA`. Since anything plus `NA` is `NA`, we would get:
a + b
[1] 2 NA NA NA NA
Or it could just refuse to do the operation, since it isn’t obvious how to handle the length disparity.
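If you ever did want one of these padding behaviours, you would have to ask for it explicitly. The sketch below pads `b` by hand; the names `b_zero` and `b_na` are made up for this illustration, and none of this is what R does by default.

```r
a <- 1:5
b <- 1

# Pad b out to the length of a with zeros
b_zero <- c(b, rep(0, length(a) - length(b)))
a + b_zero

# Pad b out to the length of a with NA
b_na <- c(b, rep(NA, length(a) - length(b)))
a + b_na
```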
But instead, R figures out that what you really want to do is to add 1 to every element of `a`. R silently recycles the vector `b` so that it is the same length as `a`, as if we had set `b <- c(1, 1, 1, 1, 1)` (or `b <- rep(1, times = 5)`).
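One way to convince yourself of this is to expand `b` by hand and compare the results. This is only a sketch of the idea: `b_full` is a name made up here, and `rep_len()` is the base R function that repeats a vector out to a requested length.

```r
a <- 1:5
b <- 1

# Expand b by hand to the length of a
b_full <- rep_len(b, length(a))
b_full

# The recycled and hand-expanded versions give the same result
identical(a + b, a + b_full)
```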
Vectors are usually (always?) recycled
In fact, R will recycle any shorter vector to the length of the longer vector. Observe:
a <- rep(1, times = 3)
b <- c(2, 3)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 3 4 3
Here, R has “recycled” the first element of `b` so that it is added to
the third element of `a`. This is a bit dangerous. We do get a
warning (always pay attention to your warnings!) but I can’t think of
that many reasons we would want to do this, and I can think of a lot of
reasons we wouldn’t want to do this. Usually, in my life, if I am
performing vectorized operations on vectors of different lengths, it is
because I am making a mistake.
Pay particular attention to vectors whose lengths are multiples of each other:
a <- rep(1, 4)
b <- c(2, 3)
a + b
[1] 3 4 3 4
This situation is even more dangerous because we don’t even get a warning.
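If you want to be protected in both situations, one option is to check the lengths yourself before doing the arithmetic. The helper below is a hypothetical sketch, not a built-in: `add_strict()` is a name invented here, and it assumes that recycling a single value is fine but any other length mismatch is a mistake.

```r
# Hypothetical helper: add two vectors, allowing recycling only
# when one of them has length 1
add_strict <- function(x, y) {
  same_length <- length(x) == length(y)
  has_scalar <- length(x) == 1 || length(y) == 1
  if (!same_length && !has_scalar) {
    stop("lengths differ: ", length(x), " vs ", length(y))
  }
  x + y
}

# Adding a single number to a vector is still allowed
add_strict(1:5, 1)

# The silent case from above now stops with an error;
# tryCatch() captures the message so the script keeps running
res <- tryCatch(
  add_strict(rep(1, 4), c(2, 3)),
  error = function(e) conditionMessage(e)
)
res
```

A plain `stopifnot(length(a) == length(b))` before the operation is an even shorter version of the same idea, if you never want recycling at all.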
There is some good news: if you do all your work in data frames, with a tidyverse-based approach, it is not that easy to run into this kind of problem.
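As a small illustration, here is a sketch that assumes the tibble package (part of the tidyverse) is installed. `tibble()` only recycles values of length 1, so the mismatch that slipped through silently above is rejected. The name `tbl_result` and the replacement message are made up for this example.

```r
# Only run the example if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl_result <- tryCatch(
    tibble::tibble(a = rep(1, 4), b = c(2, 3)),
    # Replace tibble's own error with a short message for this sketch
    error = function(e) "error: incompatible sizes"
  )
  print(tbl_result)
}
```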