On the strengths and perils of traditional memoization
Memoization is the practice of storing the return values of function calls for later use, a well-established technique across many languages for skipping redundant work and saving time. In R, the memoise package by Hadley Wickham, Jim Hester, Kirill Müller, and others is one of the most elegant and useful packages I know.
Efficiency gains
A memoized function simply returns a prior result if called a second time with the same inputs. Here is just a taste of the time savings.
library(memoise)
f <- function(n) mean(rnorm(n))
mf <- memoise(f)
system.time(x1 <- mf(1e8))
## user system elapsed
## 4.968 0.000 4.973
system.time(x2 <- mf(1e8))
## user system elapsed
## 0 0 0
identical(x1, x2)
## [1] TRUE
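Under the hood, a memoized function just pairs the original function with a lookup table keyed by the arguments. Here is a minimal hand-rolled sketch in base R (the helper name memoize_simple() is my own, and a real implementation like memoise hashes the arguments far more robustly than deparse() does):

```r
# Minimal sketch of memoization: wrap f so results are stored in an
# environment keyed by the deparsed arguments. Illustration only.
memoize_simple <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = " ")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

slow_square <- function(x) {
  Sys.sleep(0.5) # Simulate expensive work.
  x^2
}
fast_square <- memoize_simple(slow_square)
fast_square(4) # Slow the first time.
fast_square(4) # Instant: returned from the cache.
```

The environment acts as a hash table, so repeated calls with the same arguments skip the expensive body entirely.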
However, there are good reasons to apply caution when using traditional memoization.
Dependencies
What if you define multiple functions and nest them? Does a memoized function update the results when non-local dependencies change?
g <- function(x) {
  2 * x + rnorm(1)
}
f <- function(x) g(x)
mf <- memoise(f)
mf(1)
## [1] 0.9391441 # Correct
g <- function(x) {
  1e4 * x + rnorm(1)
}
mf(1)
## [1] 0.9391441 # Memoise does not know that g() changed!
forget(mf) # Throw away old results, get the next one from scratch.
## [1] TRUE
mf(1)
## [1] 9999.867 # Correct
Fortunately, with the memoise package, you can manually force mf() to react to changes to g().
mf <- memoise(f, ~g)
mf(1)
## [1] 10000.56
mf(1)
## [1] 10000.56
g <- function(x) {
  2 * x + rnorm(1)
}
mf(1)
## [1] 0.08486043
mf(1)
## [1] 0.08486043
To look for the immediate dependencies of a function, you can use findGlobals() from Luke Tierney’s long-established codetools package. Alternatively, CodeDepends is a more recent effort by Gabe Becker, Duncan Temple Lang, and others, and it goes beyond simply finding dependencies. Whatever tool you use, just keep in mind that no static code analysis tool is perfect, and dependencies like g() may have dependencies of their own.
library(codetools)
findGlobals(f)
## [1] "g"
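Since dependencies can have dependencies of their own, one option is to walk the call graph recursively. Below is a sketch of such a walker (deep_globals() and the example functions leaf_fn(), inner_fn(), and outer_fn() are my own illustrations, not part of codetools):

```r
library(codetools)

# Sketch of a recursive dependency walker: find the global functions a
# function calls, then recurse into any that are user-defined closures.
deep_globals <- function(fun_name, envir = globalenv(), seen = character(0)) {
  if (fun_name %in% seen) return(seen) # Already visited: avoid cycles.
  seen <- c(seen, fun_name)
  fun <- get0(fun_name, envir = envir, mode = "function")
  if (is.null(fun) || is.primitive(fun)) return(seen) # Nothing to recurse into.
  for (dep in findGlobals(fun, merge = FALSE)$functions) {
    seen <- deep_globals(dep, envir = envir, seen = seen)
  }
  seen
}

leaf_fn <- function(x) x + 1
inner_fn <- function(x) leaf_fn(x) * 2
outer_fn <- function(x) inner_fn(x)
deep_globals("outer_fn") # Includes outer_fn, inner_fn, and leaf_fn.
```

Even this walker inherits the blind spots of static analysis: it cannot see functions called via do.call() with a computed name, for example.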
library(CodeDepends)
str(getInputs(body(f)))
## Formal class 'ScriptNodeInfo' [package "CodeDepends"] with 11 slots
## ..@ files : chr(0)
## ..@ strings : chr(0)
## ..@ libraries : chr(0)
## ..@ inputs : chr "x"
## ..@ outputs : chr(0)
## ..@ updates : chr(0)
## ..@ functions : Named logi NA
## .. ..- attr(*, "names")= chr "g"
## ..@ removes : chr(0)
## ..@ nsevalVars : chr(0)
## ..@ sideEffects: chr(0)
## ..@ code : language g(x)
Parallel computing
What if your code has multiple simultaneous calls to mf(1)? First, notice how memoization can fail if you use the default in-memory cache.
library(parallel)
f <- function(n) rnorm(n)
mf <- memoise(f)
run <- function() {
  output <- mclapply(c(1, 1), mf, mc.cores = 2)
  unlist(output)
}
run()
## [1] -0.1669109 1.1784395
run()
## [1] -0.4694902 1.8671464 # new results computed
At minimum, I recommend the file system cache.
mf <- memoise(f, cache = cache_filesystem("my_cache"))
run()
## [1] -0.01171047 -0.01171047
run()
## [1] -0.01171047 -0.01171047
But even then, you should avoid calling the same memoized function on the same inputs twice simultaneously. As RStudio's Jim Hester explains, multiple processes could write to the same file at the same time and corrupt the results. In general, this sort of unpredictable behavior is called a race condition, and it is the bane of any kind of parallel computing. For the memoise package specifically, there is ongoing discussion about preventing race conditions in future development, possibly with Gábor Csárdi's filelock or Ivan Popivanov's flock.
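Packages like filelock and flock solve this properly. Just to illustrate the idea, here is a naive base-R sketch of a lockfile (with_lock() is my own toy helper and is not production-safe: for example, it spins forever if a process dies while holding the lock):

```r
# Naive lockfile sketch: dir.create() fails if the directory already exists,
# so it doubles as a crude test-and-set on most file systems. Illustration
# only -- use filelock or flock for real work.
with_lock <- function(lock_path, code) {
  while (!dir.create(lock_path, showWarnings = FALSE)) {
    Sys.sleep(0.05) # Another process holds the lock: wait and retry.
  }
  on.exit(unlink(lock_path, recursive = TRUE)) # Always release the lock.
  force(code) # Run the critical section while holding the lock.
}

with_lock("my_cache.lock", {
  # Reads and writes to the shared cache would go here.
  mean(rnorm(10))
})
```

With a real lock around each cache access, two workers calling mf(1) at once would serialize instead of scribbling over the same file.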
Other solutions
Make and its spinoffs resemble memoise, but they go the extra mile: they automatically account for dependencies and unlock implicit parallel computing. Make-like packages already exist just for R.