On the strengths and perils of traditional memoization
Memoization is the practice of storing the return values of function calls for later use, a well-established technique across many languages for skipping redundant work and saving time. In R, the memoise package by Hadley Wickham, Jim Hester, Kirill Müller, and others is one of the most elegant and useful packages I know.
Efficiency gains
A memoized function simply returns a prior result if called a second time with the same inputs. Here is just a taste of the time savings.
library(memoise)
f <- function(n) mean(rnorm(n))
mf <- memoise(f)
system.time(x1 <- mf(1e8))
## user system elapsed
## 4.968 0.000 4.973
system.time(x2 <- mf(1e8))
## user system elapsed
## 0 0 0
identical(x1, x2)
## [1] TRUE
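Under the hood, a memoized function just pairs the original function with a lookup table keyed by the arguments. Here is a minimal hand-rolled sketch in base R (the helper name memoize_simple() is my own, and a real implementation like memoise hashes the arguments far more robustly than deparse() does):

```r
# Minimal sketch of memoization: wrap f so results are stored in an
# environment keyed by the deparsed arguments. Illustration only.
memoize_simple <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = " ")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

slow_square <- function(x) {
  Sys.sleep(0.5) # Simulate expensive work.
  x^2
}
fast_square <- memoize_simple(slow_square)
fast_square(4) # Slow the first time.
fast_square(4) # Instant: returned from the cache.
```

The environment acts as a hash table, so repeated calls with the same arguments skip the expensive body entirely.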
However, there are good reasons to apply caution when using traditional memoization.
Dependencies
What if you define multiple functions and nest them? Does a memoized function update the results when non-local dependencies change?
g <- function(x) {
  2 * x + rnorm(1)
}
f <- function(x) g(x)
mf <- memoise(f)
mf(1)
## [1] 0.9391441 # Correct
g <- function(x) {
  1e4 * x + rnorm(1)
}
mf(1)
## [1] 0.9391441 # Memoise does not know that g() changed!
forget(mf) # Throw away old results, get the next one from scratch.
## [1] TRUE
mf(1)
## [1] 9999.867 # Correct
Fortunately, with the memoise package, you can manually force mf() to react to changes to g().
mf <- memoise(f, ~g)
mf(1)
## [1] 10000.56
mf(1)
## [1] 10000.56
g <- function(x) {
  2 * x + rnorm(1)
}
mf(1)
## [1] 0.08486043
mf(1)
## [1] 0.08486043
To look for the immediate dependencies of a function, you can use findGlobals() from Luke Tierney’s long-established codetools package. Alternatively, CodeDepends is a more recent effort by Gabe Becker, Duncan Temple Lang, and others, and it goes beyond simply finding dependencies. Whatever tool you use, just keep in mind that no static code analysis tool is perfect, and dependencies like g() may have dependencies of their own.
library(codetools)
findGlobals(f)
## [1] "g"
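Since dependencies can have dependencies of their own, one option is to walk the call graph recursively. Below is a sketch of such a walker (deep_globals() and the example functions leaf_fn(), inner_fn(), and outer_fn() are my own illustrations, not part of codetools):

```r
library(codetools)

# Sketch of a recursive dependency walker: find the global functions a
# function calls, then recurse into any that are user-defined closures.
deep_globals <- function(fun_name, envir = globalenv(), seen = character(0)) {
  if (fun_name %in% seen) return(seen) # Already visited: avoid cycles.
  seen <- c(seen, fun_name)
  fun <- get0(fun_name, envir = envir, mode = "function")
  if (is.null(fun) || is.primitive(fun)) return(seen) # Nothing to recurse into.
  for (dep in findGlobals(fun, merge = FALSE)$functions) {
    seen <- deep_globals(dep, envir = envir, seen = seen)
  }
  seen
}

leaf_fn <- function(x) x + 1
inner_fn <- function(x) leaf_fn(x) * 2
outer_fn <- function(x) inner_fn(x)
deep_globals("outer_fn") # Includes outer_fn, inner_fn, and leaf_fn.
```

Even this walker inherits the blind spots of static analysis: it cannot see functions called via do.call() with a computed name, for example.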
library(CodeDepends)
str(getInputs(body(f)))
## Formal class 'ScriptNodeInfo' [package "CodeDepends"] with 11 slots
## ..@ files : chr(0)
## ..@ strings : chr(0)
## ..@ libraries : chr(0)
## ..@ inputs : chr "x"
## ..@ outputs : chr(0)
## ..@ updates : chr(0)
## ..@ functions : Named logi NA
## .. ..- attr(*, "names")= chr "g"
## ..@ removes : chr(0)
## ..@ nsevalVars : chr(0)
## ..@ sideEffects: chr(0)
## ..@ code : language g(x)
Parallel computing
What if your code has multiple simultaneous calls to mf(1)? First, notice how memoization can fail if you use the default in-memory cache.
library(parallel)
f <- function(n) rnorm(n)
mf <- memoise(f)
run <- function() {
  output <- mclapply(c(1, 1), mf, mc.cores = 2)
  unlist(output)
}
run()
## [1] -0.1669109 1.1784395
run()
## [1] -0.4694902 1.8671464 # new results computed
At minimum, I recommend the file system cache.
mf <- memoise(f, cache = cache_filesystem("my_cache"))
run()
## [1] -0.01171047 -0.01171047
run()
## [1] -0.01171047 -0.01171047
But even then, you should avoid calling the same memoized function on the same inputs twice simultaneously. As RStudio's Jim Hester explains, multiple processes could write to the same file at the same time and corrupt the results. In general, this sort of unpredictable behavior is called a race condition, and it is the bane of any kind of parallel computing. For the memoise package specifically, there is ongoing discussion about preventing race conditions in future development, possibly with Gábor Csárdi's filelock or Ivan Popivanov's flock.
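Packages like filelock and flock solve this properly. Just to illustrate the idea, here is a naive base-R sketch of a lockfile (with_lock() is my own toy helper and is not production-safe: for example, it spins forever if a process dies while holding the lock):

```r
# Naive lockfile sketch: dir.create() fails if the directory already exists,
# so it doubles as a crude test-and-set on most file systems. Illustration
# only -- use filelock or flock for real work.
with_lock <- function(lock_path, code) {
  while (!dir.create(lock_path, showWarnings = FALSE)) {
    Sys.sleep(0.05) # Another process holds the lock: wait and retry.
  }
  on.exit(unlink(lock_path, recursive = TRUE)) # Always release the lock.
  force(code) # Run the critical section while holding the lock.
}

with_lock("my_cache.lock", {
  # Reads and writes to the shared cache would go here.
  mean(rnorm(10))
})
```

With a real lock around each cache access, two workers calling mf(1) at once would serialize instead of scribbling over the same file.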
Other solutions
Make and its spinoffs resemble memoise, but they go the extra mile: they automatically account for dependencies and unlock implicit parallel computing. Make-like packages already exist just for R.