Run-length encoding

Remarks#

A run is a consecutive sequence of repeated values or observations. For repeated values, R’s “run-length encoding” concisely describes a vector in terms of its runs. Consider:

dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)

We have a length-one run of 1s; then a length-three run of 2s; then a length-one run of 3s; and so on. R’s run-length encoding captures all the lengths and values of a vector’s runs.

Extensions

A run can also refer to consecutive observations in a tabular data. While R doesn’t have a natural way of encoding these, they can be handled with rleid from the data.table package (currently a dead-end link).

Run-length Encoding with `rle`

Run-length encoding captures the lengths of runs of consecutive elements in a vector. Consider an example vector:

dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)

The rle function extracts each run and its length:

r <- rle(dat)
r
# Run Length Encoding
#   lengths: int [1:6] 1 3 1 1 2 2
#   values : num [1:6] 1 2 3 1 4 1

The values for each run are captured in r$values:

r$values
# [1] 1 2 3 1 4 1

This captures that we first saw a run of 1’s, then a run of 2’s, then a run of 3’s, then a run of 1’s, and so on.

The lengths of each run are captured in r$lengths:

r$lengths
# [1] 1 3 1 1 2 2

We see that the initial run of 1’s was of length 1, the run of 2’s that followed was of length 3, and so on.

Identifying and grouping by runs in base R

One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the following simple dataset:

(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#   x y
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 1 6

The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean values are 1.5, 4, and 6).

In base R, we would first compute the run-length encoding of the x variable using rle:

(r <- rle(dat$x))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : num [1:3] 1 2 1

The next step is to compute the run number of each row of our dataset. We know that the total number of runs is length(r$lengths), and the length of each run is r$lengths, so we can compute the run number of each of our runs with rep:

(run.id <- rep(seq_along(r$lengths), r$lengths))
# [1] 1 1 2 2 2 3

Now we can use tapply to compute the mean y value for each run by grouping on the run id:

data.frame(x=r$values, meanY=tapply(dat$y, run.id, mean))
#   x meanY
# 1 1   1.5
# 2 2   4.0
# 3 1   6.0

Identifying and grouping by runs in data.table

The data.table package provides a convenient way to group by runs in data. Consider the following example data:

library(data.table)
(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#    x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6

The data.table rleid function provides an id indicating the run id of each element of a vector:

rleid(DT$x)
# [1] 1 1 2 2 2 3

One can then easily group on this run ID and summarize the y data:

DT[,mean(y),by=.(x, rleid(x))]
#    x rleid  V1
# 1: 1     1 1.5
# 2: 2     2 4.0
# 3: 1     3 6.0

Run-length encoding to compress and decompress vectors

Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1’s and only a small number of 0’s:

set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
#       0       1 
#     103 9999897

Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length encoding of this vector:

rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
# [1] 207   2
head(rle.df)
#   values lengths
# 1      1   52818
# 2      0       1
# 3      1  219329
# 4      0       1
# 5      1  318306
# 6      0       1

From the run-length encoding, we see that the first 52,818 values in the vector are 1’s, followed by a single 0, followed by 219,329 consecutive 1’s, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As rle.df is a data frame, it can be stored using standard functions like write.csv.

Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call rep, passing the values element of the run-length encoding as the first argument and the lengths element of the run-length encoding as the second argument:

decompressed <- rep(rle.df$values, rle.df$lengths)

We can confirm that our decompressed data is identical to our original data:

identical(decompressed, dat)
# [1] TRUE

The second method is to use R’s built-in inverse.rle function on the rle object, for instance:

rle.obj <- rle(dat)                            # create a rle object here
class(rle.obj)
# [1] "rle"

dat.inv <- inverse.rle(rle.obj)               # apply the inverse.rle on the rle object

We can confirm again that this produces exactly the original dat:

identical(dat.inv, dat)
# [1] TRUE