Subsetting

Introduction#

Given an R object, we may require separate analysis for one or more parts of the data contained in it. The process of obtaining these parts of the data from a given object is called subsetting.

Remarks#

Missing values:

Missing values (NAs) used in subsetting with [ return NA since a NA index

picks an unknown element and so returns NA in the corresponding element..

The “default” type of NA is “logical” (typeof(NA)) which means that, as any “logical” vector used in subsetting, will be recycled to match the length of the subsetted object. So x[NA] is equivalent to x[as.logical(NA)] which is equivalent to x[rep_len(as.logical(NA), length(x))] and, consequently, it returns a missing value (NA) for each element of x. As an example:

x <- 1:3
x[NA]
## [1] NA NA NA

While indexing with “numeric”/“integer” NA picks a single NA element (for each NA in index):

x[as.integer(NA)]
## [1] NA

x[c(NA, 1, NA, NA)]
## [1] NA  1 NA NA

Subsetting out of bounds:

The [ operator, with one argument passed, allows indices that are > length(x) and returns NA for atomic vectors or NULL for generic vectors. In contrast, with [[ and when [ is passed more arguments (i.e. subsetting out of bounds objects with length(dim(x)) > 2) an error is returned:

(1:3)[10]
## [1] NA
(1:3)[[10]]
## Error in (1:3)[[10]] : subscript out of bounds
as.matrix(1:3)[10]
## [1] NA
as.matrix(1:3)[, 10]
## Error in as.matrix(1:3)[, 10] : subscript out of bounds
list(1, 2, 3)[10]
## [[1]]
## NULL
list(1, 2, 3)[[10]]
## Error in list(1, 2, 3)[[10]] : subscript out of bounds

The behaviour is the same when subsetting with “character” vectors, that are not matched in the “names” attribute of the object, too:

c(a = 1, b = 2)["c"]
## <NA> 
##   NA 
list(a = 1, b = 2)["c"]
## <NA>
## NULL

Help topics:

See ?Extract for further information.

Atomic vectors

Atomic vectors (which excludes lists and expressions, which are also vectors) are subset using the [ operator:

# create an example vector
v1 <- c("a", "b", "c", "d")

# select the third element
v1[3]
## [1] "c"

The [ operator can also take a vector as the argument. For example, to select the first and third elements:

v1 <- c("a", "b", "c", "d")

v1[c(1, 3)]
## [1] "a" "c"

Some times we may require to omit a particular value from the vector. This can be achieved using a negative sign(-) before the index of that value. For example, to omit to omit the first value from v1, use v1[-1]. This can be extended to more than one value in a straight forward way. For example, v1[-c(1,3)].

> v1[-1]
[1] "b" "c" "d"
> v1[-c(1,3)]
[1] "b" "d"

On some occasions, we would like to know, especially, when the length of the vector is large, index of a particular value, if it exists:

> v1=="c"
[1] FALSE FALSE  TRUE FALSE
> which(v1=="c")
[1] 3

If the atomic vector has names (a names attribute), it can be subset using a character vector of names:

v <- 1:3
names(v) <- c("one", "two", "three")

v
##  one   two three 
##    1     2     3 

v["two"]
## two 
##   2

The [[ operator can also be used to index atomic vectors, with differences in that it accepts a indexing vector with a length of one and strips any names present:

v[[c(1, 2)]]
## Error in v[[c(1, 2)]] : 
##  attempt to select more than one element in vectorIndex

v[["two"]]
## [1] 2

Vectors can also be subset using a logical vector. In contrast to subsetting with numeric and character vectors, the logical vector used to subset has to be equal to the length of the vector whose elements are extracted, so if a logical vector y is used to subset x, i.e. x[y], if length(y) < length(x) then y will be recycled to match length(x):

v[c(TRUE, FALSE, TRUE)]
##  one three 
##    1     3 

v[c(FALSE, TRUE)]  # recycled to 'c(FALSE, TRUE, FALSE)'
## two 
##   2 

v[TRUE]   # recycled to 'c(TRUE, TRUE, TRUE)'
##  one   two three 
##    1     2     3 

v[FALSE]   # handy to discard elements but save the vector's type and basic structure
## named integer(0)

Lists

A list can be subset with [:

l1 <- list(c(1, 2, 3), 'two' = c("a", "b", "c"), list(10, 20))
l1
## [[1]]
## [1] 1 2 3
## 
## $two
## [1] "a" "b" "c"
##
## [[3]]
## [[3]][[1]]
## [1] 10
##
## [[3]][[2]]
## [1] 20

l1[1]
## [[1]]
## [1] 1 2 3

l1['two']
## $two
## [1] "a" "b" "c"

l1[[2]]
## [1] "a" "b" "c"

l1[['two']]
## [1] "a" "b" "c"

Note the result of l1[2] is still a list, as the [ operator selects elements of a list, returning a smaller list. The [[ operator extracts list elements, returning an object of the type of the list element.

Elements can be indexed by number or a character string of the name (if it exists). Multiple elements can be selected with [ by passing a vector of numbers or strings of names. Indexing with a vector of length > 1 in [ and [[ returns a “list” with the specified elements and a recursive subset (if available), respectively:

l1[c(3, 1)]
## [[1]]
## [[1]][[1]]
## [1] 10
## 
## [[1]][[2]]
## [1] 20
## 
## 
## [[2]]
## [1] 1 2 3

Compared to:

l1[[c(3, 1)]]
## [1] 10

which is equivalent to:

l1[[3]][[1]]
## [1] 10

The $ operator allows you to select list elements solely by name, but unlike [ and [[, does not require quotes. As an infix operator, $ can only take a single name:

l1$two
## [1] "a" "b" "c"

Also, the $ operator allows for partial matching by default:

l1$t
## [1] "a" "b" "c"

in contrast with [[ where it needs to be specified whether partial matching is allowed:

l1[["t"]]
## NULL
l1[["t", exact = FALSE]]
## [1] "a" "b" "c"

Setting options(warnPartialMatchDollar = TRUE), a “warning” is given when partial matching happens with $:

l1$t
## [1] "a" "b" "c"
## Warning message:
## In l1$t : partial match of 't' to 'two'

Matrices

For each dimension of an object, the [ operator takes one argument. Vectors have one dimension and take one argument. Matrices and data frames have two dimensions and take two arguments, given as [i, j] where i is the row and j is the column. Indexing starts at 1.

## a sample matrix
mat <- matrix(1:6, nrow = 2, dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))

mat
#      col1 col2 col3
# row1    1    3    5
# row2    2    4    6

mat[i,j] is the element in the i-th row, j-th column of the matrix mat. For example, an i value of 2 and a j value of 1 gives the number in the second row and the first column of the matrix. Omitting i or j returns all values in that dimension.

mat[ , 3]
## row1 row2 
##    5    6 

mat[1, ]
# col1 col2 col3 
#    1    3    5

When the matrix has row or column names (not required), these can be used for subsetting:

mat[ , 'col1']
# row1 row2 
#    1    2

By default, the result of a subset will be simplified if possible. If the subset only has one dimension, as in the examples above, the result will be a one-dimensional vector rather than a two-dimensional matrix. This default can be overriden with the drop = FALSE argument to [:

## This selects the first row as a vector
class(mat[1, ])
# [1] "integer"

## Whereas this selects the first row as a 1x3 matrix:
class(mat[1, , drop = F])
# [1] "matrix"

Of course, dimensions cannot be dropped if the selection itself has two dimensions:

mat[1:2, 2:3]  ## A 2x2 matrix
#      col2 col3
# row1    3    5
# row2    4    6

Selecting individual matrix entries by their positions

It is also possible to use a Nx2 matrix to select N individual elements from a matrix (like how a coordinate system works). If you wanted to extract, in a vector, the entries of a matrix in the (1st row, 1st column), (1st row, 3rd column), (2nd row, 3rd column), (2nd row, 1st column) this can be done easily by creating a index matrix with those coordinates and using that to subset the matrix:

mat
#      col1 col2 col3
# row1    1    3    5
# row2    2    4    6

ind = rbind(c(1, 1), c(1, 3), c(2, 3), c(2, 1))
ind
#      [,1] [,2]
# [1,]    1    1
# [2,]    1    3
# [3,]    2    3
# [4,]    2    1

mat[ind]
# [1] 1 5 6 2

In the above example, the 1st column of the ind matrix refers to rows in mat, the 2nd column of ind refers to columns in mat.

Data frames

Subsetting a data frame into a smaller data frame can be accomplished the same as subsetting a list.

> df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

> df3
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

> df3[1]   # Subset a variable by number
##   x
## 1 1
## 2 2
## 3 3

> df3["x"]   # Subset a variable by name 
##   x
## 1 1
## 2 2
## 3 3

> is.data.frame(df3[1])
## TRUE

> is.list(df3[1])
## TRUE

Subsetting a dataframe into a column vector can be accomplished using double brackets [[ ]] or the dollar sign operator $.

> df3[[2]]    # Subset a variable by number using [[ ]]
## [1] "a" "b" "c"

> df3[["y"]]  # Subset a variable by name using [[ ]]
## [1] "a" "b" "c"

> df3$x    # Subset a variable by name using $
## [1] 1 2 3

> typeof(df3$x)
## "integer"

> is.vector(df3$x)
## TRUE

Subsetting a data as a two dimensional matrix can be accomplished using i and j terms.

> df3[1, 2]    # Subset row and column by number
## [1] "a"

> df3[1, "y"]  # Subset row by number and column by name
## [1] "a"

> df3[2, ]     # Subset entire row by number  
##   x y
## 2 2 b

> df3[ , 1]    # Subset all first variables 
## [1] 1 2 3

> df3[ , 1, drop = FALSE]
##   x
## 1 1
## 2 2
## 3 3

Note: Subsetting by j (column) alone simplifies to the variable’s own type, but subsetting by i alone returns a data.frame, as the different variables may have different types and classes. Setting the drop parameter to FALSE keeps the data frame.

> is.vector(df3[, 2])
## TRUE

> is.data.frame(df3[2, ])
## TRUE

> is.data.frame(df3[, 2, drop = FALSE])
## TRUE

Other objects

The [ and [[ operators are primitive functions that are generic. This means that any object in R (specifically isTRUE(is.object(x)) —i.e. has an explicit “class” attribute) can have its own specified behaviour when subsetted; i.e. has its own methods for [ and/or [[.

For example, this is the case with “data.frame” (is.object(iris)) objects where [.data.frame and [[.data.frame methods are defined and they are made to exhibit both “matrix”-like and “list”-like subsetting. With forcing an error when subsetting a “data.frame”, we see that, actually, a function [.data.frame was called when we -just- used [.

iris[invalidArgument, ]
## Error in `[.data.frame`(iris, invalidArgument, ) : 
##   object 'invalidArgument' not found

Without further details on the current topic, an example[ method:

x = structure(1:5, class = "myClass")
x[c(3, 2, 4)]
## [1] 3 2 4
'[.myClass' = function(x, i) cat(sprintf("We'd expect '%s[%s]' to be returned but this a custom `[` method and should have a `?[.myClass` help page for its behaviour\n", deparse(substitute(x)), deparse(substitute(i))))

x[c(3, 2, 4)]
## We'd expect 'x[c(3, 2, 4)]' to be returned but this a custom `[` method and should have a `?[.myClass` help page for its behaviour
## NULL

We can overcome the method dispatching of [ by using the equivalent non-generic .subset (and .subset2 for [[). This is especially useful and efficient when programming our own “class”es and want to avoid work-arounds (like unclass(x)) when computing on our “class”es efficiently (avoiding method dispatch and copying objects):

.subset(x, c(3, 2, 4))
## [1] 3 2 4

Vector indexing

Elementwise Matrix Operations

Let A and B be two matrices of same dimension. The operators +,-,/,*,^ when used with matrices of same dimension perform the required operations on the corresponding elements of the matrices and return a new matrix of the same dimension. These operations are usually referred to as element-wise operations.

Operator	A op B	Meaning
+	A + B	Addition of corresponding elements of A and B
-	A - B	Subtracts the elements of B from the corresponding elements of A
/	A / B	Divides the elements of A by the corresponding elements of B
*	A * B	Multiplies the elements of A by the corresponding elements of B
^	A^(-1)	For example, gives a matrix whose elements are reciprocals of A
----------------------------------
For “true” matrix multiplication, as seen in Linear Algebra, use `%%`. For example, multiplication of A with B is: `A %% B`. The dimensional requirements are that the `ncol()` of `A` be the same as `nrow()` of `B`

Some Functions used with Matrices

Function	Example	Purpose
nrow()	nrow(A)	determines the number of rows of A
ncol()	ncol(A)	determines the number of columns of A
rownames()	rownames(A)	prints out the row names of the matrix A
colnames()	colnames(A)	prints out the column names of the matrix A
rowMeans()	rowMeans(A)	computes means of each row of the matrix A
colMeans()	colMeans(A)	computes means of each column of the matrix A
upper.tri()	upper.tri(A)	returns a vector whose elements are the upper
		triangular matrix of square matrix A
lower.tri()	lower.tri(A)	returns a vector whose elements are the lower
		triangular matrix of square matrix A
det()	det(A)	results in the determinant of the matrix A
solve()	solve(A)	results in the inverse of the non-singular matrix A
diag()	diag(A)	returns a diagonal matrix whose off-diagnal elemts are zeros and
		diagonals are the same as that of the square matrix A
t()	t(A)	returns the the transpose of the matrix A
eigen()	eigen(A)	retuens the eigenvalues and eigenvectors of the matrix A
is.matrix()	is.matrix(A)	returns TRUE or FALSE depending on whether A is a matrix or not.
as.matrix()	as.matrix(x)	creates a matrix out of the vector x