R code vectorization best practices
By row operations
The key to vectorizing R code is to reduce or eliminate “by row operations” and repeated method dispatching of R functions.
That means that when approaching a problem that at first glance requires “by row operations”, such as calculating the means of each row, one needs to ask oneself:
- What are the classes of the data sets I’m dealing with?
- Is there existing compiled code that can achieve this without the need for repetitive evaluation of R functions?
- If not, can I do these operations by columns instead of by rows?
- Finally, is it worth spending a lot of time on developing complicated vectorized code instead of just running a simple `apply` loop? In other words, is the data big/sophisticated enough that R can’t handle it efficiently using a simple loop?
Putting aside the memory pre-allocation issue and growing objects in loops, we will focus in this example on how to possibly avoid `apply` loops, method dispatching, and re-evaluating R functions within loops.
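(As an aside, the growing-object problem just mentioned looks like this; a minimal sketch with an arbitrary toy loop:)
n <- 1e4
res <- numeric(n)             # pre-allocate the full result once
for (i in 1:n) res[i] <- i^2  # fill in place
# as opposed to growing the object, which copies it on every iteration:
# res <- c()
# for (i in 1:n) res <- c(res, i^2)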
A standard/easy way of calculating the mean by row would be:
apply(mtcars, 1, mean)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227 18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
But can we do better? Let’s see what happened here:
- First, we converted a `data.frame` to a `matrix`. (Note that this happens within the `apply` function.) This is both inefficient and dangerous: a `matrix` can’t hold several column types at a time, so such a conversion will probably lead to loss of information and sometimes to misleading results (compare `apply(iris, 2, class)` with `str(iris)` or with `sapply(iris, class)`; see the snippet after this list).
- Second of all, we performed an operation repetitively, one time for each row. That means we had to evaluate some R function `nrow(mtcars)` times. In this specific case, `mean` is not a computationally expensive function, so R could likely handle it easily even for a big data set, but what would happen if we needed to calculate the standard deviation by row (which involves an expensive square root operation)? Which brings us to the next point:
- We evaluated the R function many times, but maybe there already is a compiled version of this operation?
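To make the first point concrete, here is that `iris` comparison (a quick demonstration; the printout is omitted): `apply` coerces the `data.frame`, whose `Species` column is a factor, into a character `matrix`, so every column is reported as character.
apply(iris, 2, class)   # every column reported as "character" (misleading)
sapply(iris, class)     # the true classes: numeric columns plus a factor
str(iris)               # same information, with the structure laid out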
Indeed, there already is a compiled version; we could simply do:
rowMeans(mtcars)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227 18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
This involves no by row operations and therefore no repetitive evaluation of R functions. However, we still converted a `data.frame` to a `matrix`. Though `rowMeans` has an error handling mechanism and won’t run on a data set that it can’t handle, this still has an efficiency cost.
rowMeans(iris)
Error in rowMeans(iris) : 'x' must be numeric
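One possible workaround (a sketch, not part of the original example) is to keep only the numeric columns before the call:
rowMeans(iris[sapply(iris, is.numeric)])  # drop the factor column first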
But still, can we do better? Instead of a matrix conversion with error handling, we could try a different method that allows us to use `mtcars` as a vector (because a `data.frame` is essentially a `list`, and a `list` is a `vector`).
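To see this (a quick check):
typeof(mtcars)              # "list": a data.frame is a list of column vectors
is.vector(as.list(mtcars))  # TRUE: and a list is a (generic) vector
With that in mind, `Reduce` can fold `+` over the columns: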
Reduce(`+`, mtcars)/ncol(mtcars)
[1] 29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000 66.23273 66.05855
[17] 65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027 60.97182 34.50818 63.15545 26.26273
Now, for a possible speed gain, we lost the row names and error handling (including `NA` handling).
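If the names matter, they are easy enough to reattach by hand (a small sketch):
res <- Reduce(`+`, mtcars)/ncol(mtcars)
names(res) <- rownames(mtcars)  # restore the car names ourselves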
Another example would be calculating the mean by group; using base R we could try:
aggregate(. ~ cyl, mtcars, mean)
cyl mpg disp hp drat wt qsec vs am gear carb
1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
Still, we are basically evaluating an R function in a loop; the loop is now hidden in an internal C function, but it matters little whether the loop itself is written in C or in R, since the cost lies in the repeated R function evaluations.
Could we avoid it? Well, there is a compiled function in R called `rowsum`, hence we could do:
rowsum(mtcars[-2], mtcars$cyl)/table(mtcars$cyl)
mpg disp hp drat wt qsec vs am gear carb
4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
Though we had to convert to a `matrix` first here, too.
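The same computation with that conversion written out explicitly (just a sketch to make the step visible):
gs <- rowsum(as.matrix(mtcars[-2]), group = mtcars$cyl)  # group-wise column sums
gs/as.vector(table(mtcars$cyl))                          # divide each row by its group size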
At this point we may question whether our current data structure is the most appropriate one. Is a `data.frame` the best practice? Or should one just switch to a `matrix` data structure in order to gain efficiency?
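A rough way to check (a timing sketch; exact numbers will vary by machine) is to compare `rowMeans` on the `data.frame` against the same call on a pre-converted `matrix`:
mm <- as.matrix(mtcars)
system.time(for (i in 1:1e4) rowMeans(mtcars))  # pays the conversion cost on every call
system.time(for (i in 1:1e4) rowMeans(mm))      # conversion done once, outside the loop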
By row operations get more and more expensive (even on matrices) as we start to evaluate expensive functions each time. Let us consider a by row variance calculation as an example.
Let’s say we have a matrix `m`:
set.seed(100)
m <- matrix(sample(1e2), 10)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 33 39 86 71 100 81 68 89 84
[2,] 12 16 57 80 32 82 69 11 41 92
[3,] 62 91 53 13 42 31 60 70 98 79
[4,] 66 94 29 67 45 59 20 96 64 1
[5,] 36 63 76 6 10 48 85 75 99 2
[6,] 18 4 27 19 44 56 37 95 26 40
[7,] 3 24 21 25 52 51 83 28 49 17
[8,] 46 5 22 43 47 74 35 97 77 65
[9,] 55 54 78 34 50 90 30 61 14 58
[10,] 88 73 38 15 9 72 7 93 23 87
One could simply do:
apply(m, 1, var)
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889 1255.6111
On the other hand, one could also completely vectorize this operation by following the formula of the sample variance, sum((x - mean(x))^2) / (n - 1):
RowVar <- function(x) {
  # subtract the row means, square, sum across columns, divide by (ncol - 1)
  rowSums((x - rowMeans(x))^2) / (dim(x)[2] - 1)
}
RowVar(m)
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889 1255.6111
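As a quick sanity check (not part of the original), the two approaches agree, while the vectorized one avoids any per-row function calls:
all.equal(RowVar(m), apply(m, 1, var))  # TRUE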