Understanding vectorization

One of the real advantages of R is vectorization: the ability of R to do the same operation on each element of a vector.

Think of it this way: in R, we can add two objects using simple syntax like this: y <- a + b. This works if a and b are each a single number:

a <- 2
b <- 3
y <- a + b
y
[1] 5

…but it works just as well if we add two vectors:

a <- 1:5
b <- 11:15
y <- a + b
y
[1] 12 14 16 18 20

This is kind of amazing! In a lot of programming languages we would have to tell the computer something like, “for each element of a, find the corresponding element of b. Add the two, create an object y of the same length as a and b, and put the results into that.”

This is cumbersome! R makes it easy for us!
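To make that concrete, here is a rough sketch of what the spelled-out version might look like if we wrote it in R itself. The names y_loop and y_vec are just made up for this illustration.

```r
a <- 1:5
b <- 11:15

# The long way: create y ahead of time, then fill it in one element at a time
y_loop <- numeric(length(a))
for (i in seq_along(a)) {
  y_loop[i] <- a[i] + b[i]
}

# The vectorized way: one line, no loop
y_vec <- a + b

# Both approaches give the same numbers
all(y_loop == y_vec)
```

The loop is not wrong, it is just a lot more typing (and more places to make a mistake) than a + b.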

Vector recycling

It’s kind of weird, when you think about it, that this also works:

a <- 1:5
b <- 1
a + b
[1] 2 3 4 5 6

Why is that weird? Well, check this out:

length(a)
[1] 5
length(b)
[1] 1

a has a length of 5, b has a length of 1, but R is not thrown off by this discrepancy.

It might seem obvious that this would work, but consider that R has had to make a decision here: how to handle a vectorized operation for objects of different lengths? For instance, R could treat the missing values in b as 0, yielding:

a + b

[1] 2 2 3 4 5

Or it could treat them as NA. Since anything plus NA is NA, we would get:

a + b

[1] 2 NA NA NA NA

Or it could just refuse to do the operation, since it isn’t obvious how to handle the length disparity.

But instead, R figures out that what you really want to do is to add 1 to every element of a. R silently recycles the vector b so that it is the same length as a, as if we had set b <- c(1, 1, 1, 1, 1) (or b <- rep(1, times = 5)).
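One way to convince yourself of this is to do the recycling by hand and compare. This is only a sketch: b_full is a name I made up, and rep() with length.out is just one way to stretch b out to the length of a.

```r
a <- 1:5
b <- 1

# Stretch b out to the same length as a, by hand
b_full <- rep(b, length.out = length(a))
length(b_full)

# Adding the short b and adding the stretched-out b give the same answer
same <- identical(a + b, a + b_full)
same
```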

Vectors are usually (always?) recycled

In fact, R will recycle any shorter vector to the length of the longer vector. Observe:

a <- rep(1, times = 3)
b <- c(2, 3)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length

[1] 3 4 3

Here, R has “recycled” the first element of b so that it is added to the third element of a. This is a bit dangerous. We do get a warning (always pay attention to your warnings!) but I can’t think of that many reasons we would want to do this, and I can think of a lot of reasons we wouldn’t want to do this. Usually, in my life, if I am performing vectorized operations on vectors of different lengths, it is because I am making a mistake.

Pay particular attention to vectors whose lengths are multiples of each other:

a <- rep(1, 4)
b <- c(2, 3)
a + b
[1] 3 4 3 4

This situation is even more dangerous because we don’t even get a warning.
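If you want R to be stricter than it is by default, one option is a small wrapper that checks lengths before adding. The function below, add_strict(), is hypothetical (it is not part of base R or any package). It assumes you still want the handy case where one vector has length 1, and that everything else should be an error.

```r
# Hypothetical helper: add two vectors, but refuse to recycle
# unless one of them has length 1
add_strict <- function(x, y) {
  if (length(x) != length(y) && length(x) != 1 && length(y) != 1) {
    stop("Lengths differ (", length(x), " vs ", length(y),
         "); refusing to recycle.", call. = FALSE)
  }
  x + y
}

# A length-1 vector is still recycled
add_strict(1:5, 1)

# The silent case from above now fails; tryCatch() captures the message
outcome <- tryCatch(
  add_strict(rep(1, 4), c(2, 3)),
  error = function(e) conditionMessage(e)
)
outcome
```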

There is some good news: if you do all your work in data frames, with a tidyverse-based approach, it is not that easy to run into this kind of problem.
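Here is a small sketch of that protection in action. It assumes the tibble package is installed (the code skips itself if it is not). The names ok and bad are made up for this example.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  # A length-1 value is recycled to fill the column
  ok <- tibble::tibble(a = 1:4, b = 1)
  print(nrow(ok))

  # A length-2 value is not recycled to length 4: tibble() throws an error,
  # which tryCatch() captures so that the script keeps running
  bad <- tryCatch(
    tibble::tibble(a = rep(1, 4), b = c(2, 3)),
    error = function(e) e
  )
  print(inherits(bad, "error"))
}
```

So the case that base R handles silently, with lengths that are multiples of each other, is stopped with an error instead.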

Drew Steen
Assistant Professor of Microbiology and Earth and Planetary Sciences

We in the Steen Lab want to understand how microbes interact with organic matter in aquatic systems. To do that, I use the tools of organic geochemistry as well as microbial ecology. These questions have led us to work on new approaches to analyze DNA sequences from environmental microbiomes and to study the distribution of taxa and functions across all of microbial life.