Hashmaps
Environments as hash maps
Note: in the subsequent passages, the terms hash map and hash table are used interchangeably and refer to the same concept, namely, a data structure providing efficient key lookup through use of an internal hash function.
Introduction
Although R does not provide a native hash table structure, similar functionality can be achieved by leveraging the fact that the environment
object returned from new.env
(by default) provides hashed key lookups. The following two statements are equivalent, as the hash
parameter defaults to TRUE
:
H <- new.env(hash = TRUE)
H <- new.env()
Additionally, one may specify that the internal hash table is pre-allocated with a particular size via the size
parameter, which has a default value of 29. Like all other R objects, environment
s manage their own memory and will grow in capacity as needed, so while it is not necessary to request a non-default value for size
, there may be a slight performance advantage in doing so if the object will (eventually) contain a very large number of elements. It is worth noting that allocating extra space via size
does not, in itself, result in an object with a larger memory footprint:
object.size(new.env())
# 56 bytes
object.size(new.env(size = 10e4))
# 56 bytes
Insertion
Insertion of elements may be done using either of the [[<-
or $<-
methods provided for the environment
class, but not by using “single bracket” assignment ([<-
):
H <- new.env()
H[["key"]] <- rnorm(1)
key2 <- "xyz"
H[[key2]] <- data.frame(x = 1:3, y = letters[1:3])
H$another_key <- matrix(rbinom(9, 1, 0.5) > 0, nrow = 3)
H["error"] <- 42
#Error in H["error"] <- 42 :
# object of type 'environment' is not subsettable
Like other facets of R, the first method (object[[key]] <- value
) is generally preferred to the second (object$key <- value
) because in the former case, a variable maybe be used instead of a literal value (e.g key2
in the example above).
As is generally the case with hash map implementations, the environment
object will not store duplicate keys. Attempting to insert a key-value pair for an existing key will replace the previously stored value:
H[["key3"]] <- "original value"
H[["key3"]] <- "new value"
H[["key3"]]
#[1] "new value"
Key Lookup
Likewise, elements may be accessed with [[
or $
, but not with [
:
H[["key"]]
#[1] 1.630631
H[[key2]] ## assuming key2 <- "xyz"
# x y
# 1 1 a
# 2 2 b
# 3 3 c
H$another_key
# [,1] [,2] [,3]
# [1,] TRUE TRUE TRUE
# [2,] FALSE FALSE FALSE
# [3,] TRUE TRUE TRUE
H[1]
#Error in H[1] : object of type 'environment' is not subsettable
Inspecting the Hash Map
Being just an ordinary environment
, the hash map can be inspected by typical means:
names(H)
#[1] "another_key" "xyz" "key" "key3"
ls(H)
#[1] "another_key" "key" "key3" "xyz"
str(H)
#<environment: 0x7828228>
ls.str(H)
# another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
# key : num 1.63
# key3 : chr "new value"
# xyz : 'data.frame': 3 obs. of 2 variables:
# $ x: int 1 2 3
# $ y: chr "a" "b" "c"
Elements can be removed using rm
:
rm(list = c("key", "key3"), envir = H)
ls.str(H)
# another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
# xyz : 'data.frame': 3 obs. of 2 variables:
# $ x: int 1 2 3
# $ y: chr "a" "b" "c"
Flexibility
One of the major benefits of using environment
objects as hash tables is their ability to store virtually any type of object as a value, even other environment
s:
H2 <- new.env()
H2[["a"]] <- LETTERS
H2[["b"]] <- as.list(x = 1:5, y = matrix(rnorm(10), 2))
H2[["c"]] <- head(mtcars, 3)
H2[["d"]] <- Sys.Date()
H2[["e"]] <- Sys.time()
H2[["f"]] <- (function() {
H3 <- new.env()
for (i in seq_along(names(H2))) {
H3[[names(H2)[i]]] <- H2[[names(H2)[i]]]
}
H3
})()
ls.str(H2)
# a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
# b : List of 5
# $ : int 1
# $ : int 2
# $ : int 3
# $ : int 4
# $ : int 5
# c : 'data.frame': 3 obs. of 11 variables:
# $ mpg : num 21 21 22.8
# $ cyl : num 6 6 4
# $ disp: num 160 160 108
# $ hp : num 110 110 93
# $ drat: num 3.9 3.9 3.85
# $ wt : num 2.62 2.88 2.32
# $ qsec: num 16.5 17 18.6
# $ vs : num 0 0 1
# $ am : num 1 1 1
# $ gear: num 4 4 4
# $ carb: num 4 4 1
# d : Date[1:1], format: "2016-08-03"
# e : POSIXct[1:1], format: "2016-08-03 19:25:14"
# f : <environment: 0x91a7cb8>
ls.str(H2$f)
# a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
# b : List of 5
# $ : int 1
# $ : int 2
# $ : int 3
# $ : int 4
# $ : int 5
# c : 'data.frame': 3 obs. of 11 variables:
# $ mpg : num 21 21 22.8
# $ cyl : num 6 6 4
# $ disp: num 160 160 108
# $ hp : num 110 110 93
# $ drat: num 3.9 3.9 3.85
# $ wt : num 2.62 2.88 2.32
# $ qsec: num 16.5 17 18.6
# $ vs : num 0 0 1
# $ am : num 1 1 1
# $ gear: num 4 4 4
# $ carb: num 4 4 1
# d : Date[1:1], format: "2016-08-03"
# e : POSIXct[1:1], format: "2016-08-03 19:25:14"
Limitations
One of the major limitations of using environment
objects as hash maps is that, unlike many aspects of R, vectorization is not supported for element lookup / insertion:
names(H2)
#[1] "a" "b" "c" "d" "e" "f"
H2[[c("a", "b")]]
#Error in H2[[c("a", "b")]] :
# wrong arguments for subsetting an environment
Keys <- c("a", "b")
H2[[Keys]]
#Error in H2[[Keys]] : wrong arguments for subsetting an environment
Depending on the nature of the data being stored in the object, it may be possible to use vapply
or list2env
for assigning many elements at once:
E1 <- new.env()
invisible({
vapply(letters, function(x) {
E1[[x]] <- rnorm(1)
logical(0)
}, FUN.VALUE = logical(0))
})
all.equal(sort(names(E1)), letters)
#[1] TRUE
Keys <- letters
E2 <- list2env(
setNames(
as.list(rnorm(26)),
nm = Keys),
envir = NULL,
hash = TRUE
)
all.equal(sort(names(E2)), letters)
#[1] TRUE
Neither of the above are particularly concise, but may be preferable to using a for
loop, etc. when the number of key-value pairs is large.
package:hash
The hash package offers a hash structure in R. However, it terms of timing for both inserts and reads it compares unfavorably to using environments as a hash. This documentation simply acknowledges its existence and provides sample timing code below for the above stated reasons. There is no identified case where hash is an appropriate solution in R code today.
Consider:
# Generic unique string generator
unique_strings <- function(n){
string_i <- 1
string_len <- 1
ans <- character(n)
chars <- c(letters,LETTERS)
new_strings <- function(len,pfx){
for(i in 1:length(chars)){
if (len == 1){
ans[string_i] <<- paste(pfx,chars[i],sep='')
string_i <<- string_i + 1
} else {
new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
}
if (string_i > n) return ()
}
}
while(string_i <= n){
new_strings(string_len,'')
string_len <- string_len + 1
}
sample(ans)
}
# Generate timings using an enviornment
timingsEnv <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht1 <- new.env(hash=TRUE)
lapply(strings, function(s){ ht1[[s]] <<- 0L})
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3]),
type = c('1_hashedEnv')
)
})
timingsHash <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
ht <- hash::hash()
lapply(strings, function(s) ht[[s]] <<- 0L)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3]),
type = c('3_stringHash')
)
})
package:listenv
Although package:listenv
implements a list-like interface to environments, its performance relative to environments for hash-like purposes is poor on hash retrieval. However, if the indexes are numeric, it can be quite fast on retrieval. However, they have other advantages, e.g. compatibility with package:future
. Covering this package for that purpose goes beyond the scope of the current topic. However, the timing code provided here can be used in conjunction with the example for package:hash for write timings.
timingsListEnv <- plyr::adply(2^(10:15),.mar=1,.fun=function(i){
strings <- unique_strings(i)
le <- listenv::listenv()
lapply(strings, function(s) le[[s]] <<- 0L)
data.frame(
size=c(i,i),
seconds=c(
system.time(for (k in 1:i) le[[k]]==0L)[3]),
type = c('2_numericListEnv')
)
})