*apply family of functions (functionals)
Remarks
A function in the `*apply` family is an abstraction of a `for` loop. Compared with `for` loops, `*apply` functions have the following advantages:
- They require less code.
- They have no iteration counter to manage.
- They don't need temporary variables to store intermediate results.
However, `for` loops are more general and give us more control, which makes it possible to carry out complex computations that are not always trivial to express with `*apply` functions.
The relationship between `for` loops and `*apply` functions is explained in the documentation for `for` loops.
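As a minimal sketch of the comparison above, the following computes the same per-element means once with a `for` loop and once with `sapply`. The list `values` and the result names are made up for illustration.

```r
# Hypothetical example data: a named list of numeric vectors
values <- list(a = 1:10, b = c(2.5, 3.5), c = c(10, 20, 30))

# for loop: needs a pre-allocated result and an iteration counter
means_loop <- numeric(length(values))
for (i in seq_along(values)) {
  means_loop[i] <- mean(values[[i]])
}
names(means_loop) <- names(values)

# sapply: the same computation in a single call
means_apply <- sapply(values, mean)

# Both approaches give the same named numeric vector
all.equal(means_loop, means_apply)
```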
Members of the *apply Family
The `*apply` family of functions contains several variants of the same principle that differ based primarily on the kind of output they return.
Function | Input | Output |
---|---|---|
apply | matrix, data.frame, or array | vector or matrix (depending on the length of each element returned) |
sapply | vector or list | vector or matrix (depending on the length of each element returned) |
lapply | vector or list | list |
vapply | vector or list | vector or matrix (depending on the length of each element returned) of the user-designated class |
mapply | multiple vectors, lists or a combination | vector, matrix, or list (always a list with `SIMPLIFY = FALSE`) |
See “Examples” to see how each of these functions is used.
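As a quick illustration of the table, the sketch below runs each function on small made-up inputs (`x` and `m` are hypothetical) and stores the results so their types can be inspected with `class()` or `str()`.

```r
x <- list(a = 1:3, b = 4:6)   # hypothetical input list
m <- matrix(1:6, nrow = 2)    # hypothetical 2 x 3 matrix

out_lapply   <- lapply(x, sum)              # always a list
out_sapply   <- sapply(x, sum)              # one value per element: a vector
out_sapply_m <- sapply(x, range)            # two values per element: a matrix
out_vapply   <- vapply(x, sum, integer(1))  # vector of the declared type
out_apply    <- apply(m, 1, sum)            # one value per row: a vector
out_mapply   <- mapply(function(u, v) u + v, 1:3, 4:6)   # simplified to a vector
out_mapply_l <- mapply(function(u, v) u + v, 1:3, 4:6,
                       SIMPLIFY = FALSE)                 # kept as a list

# Inspect one of the results
str(out_sapply_m)
```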
Use anonymous functions with apply
`apply` is used to evaluate a function (possibly an anonymous one) over the margins of an array or matrix.
Let’s use the `iris` dataset to illustrate this idea. The `iris` dataset has measurements of 150 flowers from 3 species. Let’s see how this dataset is structured:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Now, imagine that you want to know the mean of each of these variables. One way to solve this might be to use a `for` loop, but R programmers will often prefer to use `apply` (for reasons why, see Remarks):
> apply(iris[1:4], 2, mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
- In the first parameter, we subset `iris` to include only the first 4 columns, because `mean` only works on numeric data.
- The second parameter value of `2` indicates that we want to work on the columns only (the second subscript of the r×c array); `1` would give the row means.
In the same way we can calculate more meaningful values:
# standard deviation
apply(iris[1:4], 2, sd)
# variance
apply(iris[1:4], 2, var)
Caveat: R has built-in functions that are better suited to calculating column and row sums and means: `colSums`, `rowSums`, `colMeans`, and `rowMeans`.
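To make the caveat concrete, here is a small sketch comparing `apply` with `colMeans` on the same four `iris` columns. The variable names are illustrative only.

```r
# Column means of the numeric columns, computed two ways
via_apply    <- apply(iris[1:4], 2, mean)
via_colMeans <- colMeans(iris[1:4])

# The two results agree
all.equal(via_apply, via_colMeans)
```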
Now, let’s do a different and more meaningful task: let’s calculate the mean only for those values which are bigger than `0.5`. For that, we will create our own `mean` function.
> our.mean.function <- function(x) { mean(x[x > 0.5]) }
> apply(iris[1:4], 2, our.mean.function)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.665347
(Note the difference in the mean of `Petal.Width`.)
But, what if we don’t want to use this function in the rest of our code? Then, we can use an anonymous function, and write our code like this:
apply(iris[1:4], 2, function(x) { mean(x[x > 0.5]) })
So, as we have seen, we can use `apply` to execute the same operation on columns or rows of a dataset using only one line.
Caveat: Since `apply` returns very different kinds of output depending on the length of the results of the specified function, it may not be the best choice in cases where you are not working interactively. Some of the other `*apply` family functions are a bit more predictable (see Remarks).
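The sketch below illustrates this caveat on a small made-up matrix `m`: the same `apply` call shape returns a vector, a matrix, or a list depending only on what the supplied function returns.

```r
m <- matrix(1:6, nrow = 2)   # hypothetical 2 x 3 matrix: columns (1,2), (3,4), (5,6)

# Each call returns one value per column: the result is a vector
r_vec <- apply(m, 2, sum)

# Each call returns two values per column: the result is a 2 x 3 matrix
r_mat <- apply(m, 2, range)

# Calls return results of different lengths: the result is a list
r_list <- apply(m, 2, function(x) x[x > 2])
```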
Bulk File Loading
This approach suits a large number of files that have well-structured file names and need to be processed in the same way.
First, create a vector of the file names to be accessed. There are multiple options for this:
- Creating the vector manually with `paste0()`, where the middle argument (here `1:100`) is a vector holding the varying part of the file naming format:
files <- paste0("file_", 1:100, ".rds")
- Using `list.files()` with a regex search term for the file type. This requires knowledge of regular expressions (regex) if other files of the same type are in the directory:
files <- list.files("./", pattern = "\\.rds$", full.names = TRUE)
`lapply` will output each response as an element of a list.
`readRDS` is specific to `.rds` files; use whichever reader function suits the files in your own process.
my_file_list <- lapply(files, readRDS)
In testing, this is not necessarily faster than a `for` loop, but it collects every file as an element of a list without assigning each one explicitly.
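The whole process can be sketched end to end as follows. To stay self-contained, the example first writes three small `.rds` files to a temporary directory; the directory name, file contents, and the final `rbind` step are assumptions made for illustration.

```r
# Write three small example files to a temporary directory (illustration only)
tmp_dir <- file.path(tempdir(), "bulk_rds_example")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  saveRDS(data.frame(id = i, value = i^2),
          file.path(tmp_dir, paste0("file_", i, ".rds")))
}

# Collect the paths, read every file, and name each element after its file
files <- list.files(tmp_dir, pattern = "\\.rds$", full.names = TRUE)
my_file_list <- lapply(files, readRDS)
names(my_file_list) <- basename(files)

# Optionally stack the data frames into a single one
combined <- do.call(rbind, my_file_list)
```

Naming the list elements after their files makes it easier to trace each element back to its source.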
Finally, we often need to load multiple packages at once.
This trick does it quite easily by applying `library()` to all the packages that we wish to load:
lapply(c("jsonlite", "stringr", "igraph"), library, character.only = TRUE)
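A slightly more defensive variant is sketched below. It checks which packages are installed before attaching them; the package vector `pkgs` is a placeholder that uses two packages shipped with R so the example runs anywhere.

```r
pkgs <- c("stats", "utils")   # hypothetical list; these two ship with R

# requireNamespace() returns TRUE/FALSE instead of stopping on a missing package
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach only the available packages, hiding the list that lapply() returns
invisible(lapply(pkgs[available], library, character.only = TRUE))
```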
Combining multiple data.frames (lapply, mapply)
In this exercise, we will generate four bootstrap linear regression models and combine the summaries of these models into a single data frame.
library(broom)

#* Create the bootstrap data sets
BootData <- lapply(1:4,
                   function(i) mtcars[sample(1:nrow(mtcars),
                                             size = nrow(mtcars),
                                             replace = TRUE), ])

#* Fit the models
Models <- lapply(BootData,
                 function(BD) lm(mpg ~ qsec + wt + factor(am),
                                 data = BD))

#* Tidy the output into a data.frame
Tidied <- lapply(Models,
                 tidy)

#* Give each element in the Tidied list a name
Tidied <- setNames(Tidied, paste0("Boot", seq_along(Tidied)))
At this point, we can take two approaches to inserting the names into the data.frame.
#* Insert the element name into the summary with `lapply`
#* Requires passing the names attribute to `lapply` and referencing `Tidied` within
#* the applied function.
Described_lapply <-
  lapply(names(Tidied),
         function(nm) cbind(nm, Tidied[[nm]]))

Combined_lapply <- do.call("rbind", Described_lapply)

#* Insert the element name into the summary with `mapply`
#* Allows us to pass the names and the elements as separate arguments.
Described_mapply <-
  mapply(
    function(nm, dframe) cbind(nm, dframe),
    names(Tidied),
    Tidied,
    SIMPLIFY = FALSE)

Combined_mapply <- do.call("rbind", Described_mapply)
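`Map()` offers a third way to pair names with elements, since it behaves like `mapply` with `SIMPLIFY = FALSE`. The sketch below uses a small hand-made list (`Pieces`, a hypothetical stand-in for the tidied model summaries) so that it runs without `broom`.

```r
# Hypothetical stand-in for the tidied summaries: a named list of data frames
Pieces <- list(
  Boot1 = data.frame(term = c("a", "b"), estimate = c(1.0, 2.0)),
  Boot2 = data.frame(term = c("a", "b"), estimate = c(1.5, 2.5))
)

# Map() passes the names and the data frames in parallel and always returns a list
Described_Map <- Map(function(nm, dframe) cbind(nm = nm, dframe),
                     names(Pieces), Pieces)

# Stack the pieces and drop the row names that rbind derives from the list names
Combined_Map <- do.call("rbind", Described_Map)
rownames(Combined_Map) <- NULL
```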
If you’re a fan of `magrittr` style pipes, you can accomplish the entire task in a single chain (though it may not be prudent to do so if you need any of the intermediary objects, such as the model objects themselves):
library(magrittr)
library(broom)

Combined <- lapply(1:4,
                   function(i) mtcars[sample(1:nrow(mtcars),
                                             size = nrow(mtcars),
                                             replace = TRUE), ]) %>%
  lapply(function(BD) lm(mpg ~ qsec + wt + factor(am), data = BD)) %>%
  lapply(tidy) %>%
  setNames(paste0("Boot", seq_along(.))) %>%
  mapply(function(nm, dframe) cbind(nm, dframe),
         nm = names(.),
         dframe = .,
         SIMPLIFY = FALSE) %>%
  do.call("rbind", .)
Using built-in functionals
Built-in functionals: lapply(), sapply(), and mapply()
R comes with built-in functionals, of which perhaps the most well-known are the apply family of functions. Here is a description of some of the most common apply functions:
- `lapply()` = takes a list as an argument and applies the specified function to each element of the list.
- `sapply()` = the same as `lapply()` but attempts to simplify the output to a vector or a matrix.
  - `vapply()` = a variant of `sapply()` in which the output object’s type must be specified.
- `mapply()` = like `lapply()` but can pass multiple vectors as input to the specified function. Can be simplified like `sapply()`.
  - `Map()` is a wrapper around `mapply()` with `SIMPLIFY = FALSE`.
lapply()
`lapply()` can be used with two different styles of iteration:
- `lapply(variable, FUN)`
- `lapply(seq_along(variable), FUN)`
Two ways of finding the mean of each column of `df`:
set.seed(1)
df <- data.frame(x = rnorm(25), y = rnorm(25))
lapply(df, mean)
lapply(seq_along(df), function(x) mean(df[[x]]))
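Iterating over positions is mainly useful when the function needs the index or the element name as well as the value. A minimal sketch, with a made-up data frame so it runs on its own:

```r
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))   # hypothetical data

# Iterating over positions gives the function access to the index and the name
labelled <- lapply(seq_along(df), function(i) {
  paste0("column ", i, " (", names(df)[i], ") has mean ", mean(df[[i]]))
})

# Iterating over the elements directly keeps the names on the result
by_element <- lapply(df, mean)
names(by_element)
```

Note that the result of the `seq_along` form is unnamed, because the names of `df` are not carried over from the integer index.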
sapply()
`sapply()` will attempt to resolve its output to either a vector or a matrix.
# Two examples to show the different outputs of sapply()
sapply(letters, print) ## produces a vector
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
sapply(x, quantile) ## produces a matrix
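Because the shape of the `sapply()` result depends on the data, `vapply()` is often preferred in non-interactive code. The sketch below declares the expected result of each call up front; the list `x` is a shortened, illustrative version of the one above.

```r
x <- list(a = 1:10, beta = exp(-3:3))   # illustrative input

# Each result must be a single double; anything else is an error
means <- vapply(x, mean, numeric(1))

# Each result must be a numeric vector of length 5 (the five quantiles)
quants <- vapply(x, quantile, numeric(5))

# On empty input sapply() falls back to an empty list, vapply() keeps the type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, integer(1))
```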
mapply()
`mapply()` works much like `lapply()` except it can take multiple vectors as input (hence the m for multivariate).
mapply(sum, 1:5, 10:6, 3) # 3 will be "recycled" by mapply
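A few further calls sketch how `mapply()` pairs its inputs element by element and when it simplifies the result. The inputs are made up for illustration.

```r
# Element-wise call with two arguments: the result is simplified to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Results of different lengths cannot be simplified, so a list comes back
reps <- mapply(rep, 1:3, 3:1)

# MoreArgs holds arguments that are passed unchanged to every call
pasted <- mapply(paste, c("a", "b"), c("x", "y"), MoreArgs = list(sep = "-"))
```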
Using user-defined functionals
User-defined functionals
Users can create their own functionals to varying degrees of complexity. The following examples are from Functionals by Hadley Wickham:
randomise <- function(f) f(runif(1e3))

lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
In the first case, `randomise` accepts a single argument `f` and calls it on a sample of uniform random variables. To demonstrate equivalence, we call `set.seed` below:
set.seed(123)
randomise(mean)
#[1] 0.4972778
set.seed(123)
mean(runif(1e3))
#[1] 0.4972778
set.seed(123)
randomise(max)
#[1] 0.9994045
set.seed(123)
max(runif(1e3))
#[1] 0.9994045
The second example is a re-implementation of `base::lapply` that applies an operation (`f`) to each element in a list (`x`). The `...` parameter allows the user to pass additional arguments to `f`, such as the `na.rm` option in the `mean` function:
lapply(list(c(1, 3, 5), c(2, NA, 6)), mean)
# [[1]]
# [1] 3
#
# [[2]]
# [1] NA
lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean)
# [[1]]
# [1] 3
#
# [[2]]
# [1] NA
lapply(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
# [[1]]
# [1] 3
#
# [[2]]
# [1] 4
lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
# [[1]]
# [1] 3
#
# [[2]]
# [1] 4
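The same pattern extends to other return types. As a sketch, the hypothetical helper `sapply_dbl` below (not part of base R or of the cited examples) follows the structure of `lapply2` but collects the results in a numeric vector and checks that each call returns exactly one number.

```r
# Hypothetical helper: like lapply2, but always returns a numeric vector
sapply_dbl <- function(x, f, ...) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    result <- f(x[[i]], ...)
    # Enforce the contract: f must return exactly one numeric value
    stopifnot(is.numeric(result), length(result) == 1)
    out[i] <- result
  }
  out
}

sapply_dbl(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
# [1] 3 4
```

This is roughly the guarantee that `vapply(x, f, numeric(1))` provides in base R.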