class: center, middle, inverse, title-slide

# Reproducible computation at scale in R

### Will Landau

---

<style>
.inverse {
  background-color: transparent;
  text-shadow: 0 0 0px transparent;
}
.title-slide {
  vertical-align: bottom !important;
  text-align: center !important;
}
.title-slide h1 {
  position: absolute;
  top: 0;
  left: 0;
  right: 0;
  width: 100%;
  line-height: 4em;
  color: #666666;
}
.title-slide h3 {
  line-height: 6em;
  color: #666666;
}
.title-slide {
  background-color: white;
  background-image: url('images/logo.png');
  background-repeat: no-repeat;
  background-size: 25%;
}
.remark-slide-content:after {
  content: "Copyright Eli Lilly and Company";
  position: absolute;
  bottom: -5px;
  left: 10px;
  height: 40px;
  width: 100%;
  font-family: Helvetica, Arial, sans-serif;
  font-size: 0.7em;
  color: gray;
  background-repeat: no-repeat;
  background-size: contain;
}
.remark-slide-content .nocopyright:after {
  content: "";
}
.small {
  font-size: 85%;
}
</style>

## Demanding computation in R

* Deep learning: `keras`, `tensorflow`, `torch`
* Machine learning: `tidymodels`
* Bayesian data analysis: JAGS, Stan, NIMBLE, `greta`
* PK/PD: `nlmixr`, `mrgsolve`
* Clinical trial simulation: `rpact`, `Mediana`
* Statistical genomics
* Social network analysis
* Permutation tests
* Database queries: `DBI`
* ETL on large data

---

## Repetition: the overlooked bane of long computation

<br>

---

## Workflows have interconnected steps.

---

## If you change code or data...

---

## ...the downstream steps are no longer valid.

---

## Dilemma: short runtimes or reproducible results?

---

## Let a pipeline tool figure out what to rerun.

* Save time while ensuring computational reproducibility.
* Automatic parallel/distributed computing based on the directed acyclic graph.

---

## Pipeline tools

<center>
<img src="./images/infographic.png" height = "125px">
</center>

* Existing pipeline tools: https://github.com/pditommaso/awesome-pipeline
* Most are language-agnostic or designed for Python or the shell.

## {targets}

* Fundamentally designed for R.
* Supports a clean, modular, function-oriented programming style.
* Abstracts files as R objects and automatically manages data.
* Surpasses the permanent limitations of its predecessor, [`drake`](https://github.com/ropensci/drake): <https://books.ropensci.org/targets/drake.html>.
* Continuation of the ideas from `remake` by Rich FitzJohn: <https://github.com/richfitz/remake>.

---

## {targets} embraces functions.

> Nearly everything that happens in R results from a function call. Therefore, basic programming centers on creating and refining functions.
>
> [John Chambers (2008)](https://www.springer.com/gp/book/9780387759357)

* Functions are fundamental building blocks in most computer languages.
* Advantages of functions:
    * Clearly define relationships among inputs and outputs.
    * Help `targets` decide which tasks to run and which ones to skip.
    * Break down complicated ideas into manageable pieces.
    * Custom shorthand to make code easier to read.
* Common uses of R functions in data science:
    1. Preprocess a dataset.
    2. Analyze a dataset.
    3. Summarize an analysis.

---

## Example: customer churn

* Deep learning with Keras.
* IBM Cognos Analytics Telco customer churn dataset.
* Find the model that best predicts which customers will cancel their telecom subscriptions.
* Example comes from Matt Dancho's 2018 RStudio AI Blog post: https://blogs.rstudio.com/ai/posts/2018-01-11-keras-customer-churn/.

---

## Project files

* Many variations on this theme, only `_targets.R` is strictly required.
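* Aside (not part of the original example): if you are starting from scratch, `targets::tar_script()` can write a starter `_targets.R` for you to edit.

```r
# Sketch: bootstrap a minimal _targets.R in the working directory,
# then edit it to declare the targets for your own project.
library(targets)
tar_script() # with no arguments, writes an example _targets.R to modify
```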
```r
_targets.R
R/
└── functions.R
data/
└── customer_churn.csv
```

---

## functions.R file

```r
split_data <- function(file) {
  read_csv(file, col_types = cols()) %>%
    initial_split(prop = 0.7)
}

prepare_recipe <- function(data) {
  # ...
}

test_model <- function(data, recipe, act1) {
  # ...
}
```

---

## _targets.R file

```r
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(file, "data/customer_churn.csv", format = "file"),
  tar_target(data, split_data(file)),
  tar_target(recipe, prepare_recipe(data)),
  tar_target(model1, test_model(data, recipe, act1 = "relu")),
  tar_target(model2, test_model(data, recipe, act1 = "sigmoid")),
  tar_target(model3, test_model(data, recipe, act1 = "linear"))
)
```

---

## Check the graph.

```r
tar_visnetwork(targets_only = TRUE)
```

---

## Run the pipeline.

```r
tar_make()
#> • start target file
#> • built target file
#> • start target data
#> • built target data
#> • start target recipe
#> • built target recipe
#> • start target model1
#> • built target model1
#> • start target model2
#> • built target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Check the results.

```r
tar_read(model1)
#> # A tibble: 1 x 2
#>   accuracy act1
#>      <dbl> <chr>
#> 1    0.794 relu

tar_read(model3)
#> # A tibble: 1 x 2
#>   accuracy act1
#>      <dbl> <chr>
#> 1    0.798 linear
```

---

## Change a target.

```r
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(file, "data/customer_churn.csv", format = "file"),
  tar_target(data, split_data(file)),
  tar_target(recipe, prepare_recipe(data)),
  tar_target(model1, test_model(data, recipe, act1 = "relu")),
  tar_target(model2, test_model(data, recipe, act1 = "sigmoid")),
* tar_target(model3, test_model(data, recipe, act1 = "softmax"))
)
```

---

## Only model3 will rerun next time.

```r
tar_visnetwork(targets_only = TRUE)
```

---

## The other models are skipped to save time.

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> ✓ skip target model1
#> ✓ skip target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Change function code.

```r
define_model <- function(churn_recipe, units1, act1) {
  input_shape <- ncol(
    juice(churn_recipe, all_predictors(), composition = "matrix")
  )
  out <- keras_model_sequential() %>%
    layer_dense(
      units = units1,
      kernel_initializer = "uniform",
      activation = act1,
      input_shape = input_shape
    ) %>%
*   layer_dropout(rate = 0.2) # previously 0.1
  # ...
}
```

---

## {targets} understands dependencies among functions.

```r
tar_visnetwork(targets_only = FALSE)
```

<img src="./images/keras-function.png" height = "400px">

---

## Run the models but skip the data.

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> • start target model1
#> • built target model1
#> • start target model2
#> • built target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Tangible evidence of reproducibility

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> ✓ skip target model1
#> ✓ skip target model2
#> ✓ skip target model3
#> ✓ skip pipeline
```

---

## Extending {targets}

---

## Target factories

* A target factory is a reusable function that creates target objects.
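* The factory below uses `tar_target_raw()`, the standard-evaluation version of `tar_target()`: the target name is a character string and the command is a quoted expression, which is convenient when a function builds targets programmatically. For illustration, the two calls in the sketch below declare essentially the same `data` target from the churn pipeline:

```r
library(targets)
tar_target(data, split_data(file))              # NSE form used directly in _targets.R
tar_target_raw("data", quote(split_data(file))) # equivalent standard-evaluation form
```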
```r
target_factory <- function(file) {
  list(
    tar_target_raw("file", file, format = "file", deployment = "main"),
    tar_target_raw("data", quote(read_data(file)), format = "fst_tbl", deployment = "main"),
    tar_target_raw("model", quote(run_model(data)), format = "qs")
  )
}
```

---

## Target factories simplify pipeline construction.

```r
# _targets.R
library(targets)
library(yourExamplePackage)
list(
  target_factory("data.csv")
)
```

```r
# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name  command
#>   <chr> <chr>
#> 1 file  "\"data.csv\""
#> 2 data  "read_data(file)"
#> 3 model "run_model(data)"
```

---

## Example: {stantargets}

<center>
<img src="./images/stantargets.png" height = "300px">
</center>

* Easy pipeline construction for Stan statistical models.
* Uses R packages [`cmdstanr`](https://mc-stan.org/cmdstanr/) and [`posterior`](https://mc-stan.org/posterior/).

---

## About Stan

* Probabilistic programming language ([Carpenter et al. 2017](https://www.jstatsoft.org/article/view/v076i01)).
* Markov chain Monte Carlo (MCMC) with HMC and NUTS.
* Often more efficient than Gibbs sampling.
* Flexible specification of posterior distributions.
* Indifferent to conjugacy.
* Variational inference (ADVI)
* Penalized MLE (L-BFGS)

---

## Example Stan model

```c
// model.stan
data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * beta[2], 1);
  beta ~ normal(0, 1);
}
```

---

## Interval-based validation study

* For thousands of independent replications:
    * Simulate data from the prior predictive distribution.
    * Fit the model to the simulated data using MCMC.
    * Calculate x% posterior intervals for each of `beta[1]` and `beta[2]`.
* For each of `beta[1]` and `beta[2]`, roughly x% of the posterior intervals should cover the corresponding parameter draws from the joint prior.
* Simulation-based calibration extends this idea further ([Cook et al. 2006](https://www.jstor.org/stable/27594203); [Talts et al. 2020](https://arxiv.org/abs/1804.06788)).

---

## Function for prior predictive simulations

```r
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    # Hold on to draws from the prior:
    .join_data = list(beta = beta)
  )
}
```

---

## Function to calculate coverage statistics

```r
calculate_coverage <- function(results) {
  results %>%
    group_by(variable) %>%
    summarize(
      coverage_50 = mean(q25 < .join_data & .join_data < q75),
      coverage_95 = mean(q2.5 < .join_data & .join_data < q97.5)
    )
}
```

---

## _targets.R file setup

```r
library(targets)
library(stantargets)
source("R/functions.R")
tar_option_set(packages = "tidyverse")
*options(clustermq.scheduler = "multicore")
list(
  tar_stan_mcmc_rep_summary(
    name = model,
    stan_files = "model.stan",
    data = simulate_data(),
    batches = 160,
    reps = 25,
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.75, 0.975)),
      rhat = posterior::rhat
    )
  ),
  tar_target(coverage, calculate_coverage(model))
)
```

---

## Multiple targets

---

## Run MCMC batches in parallel

```r
tar_make_clustermq(workers = 4) # 4 local cores
#> • start target model_batch
#> • built target model_batch
#> • start branch model_data_b0b9380a
#> • start branch model_data_ffcdb73c
#> • start branch model_data_b968a03a
#> • start branch model_data_f8763cb2
#> ...
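# Beyond this local run (a sketch, not from the original output): the same
# pipeline can scale out to a cluster by pointing clustermq at a scheduler,
# e.g. SLURM, assuming you supply a suitable template file "slurm.tmpl":
#   options(clustermq.scheduler = "slurm", clustermq.template = "slurm.tmpl")
#   tar_make_clustermq(workers = 100)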
```

* High-performance computing with `targets`: <https://books.ropensci.org/targets/hpc.html>
* Distributed computing available, not just multicore: SLURM, SGE, TORQUE, PBS, LSF.
* Persistent workers with the [`clustermq`](https://mschubert.github.io/clustermq/) package: `tar_make_clustermq()`
* Transient workers with the [`future`](https://future.futureverse.org/) package: `tar_make_future()`
* Monitor progress with [`tar_watch()`](https://docs.ropensci.org/targets/reference/tar_watch.html), [`tar_poll()`](https://docs.ropensci.org/targets/reference/tar_poll.html), or [`tar_progress_branches()`](https://docs.ropensci.org/targets/reference/tar_progress_branches.html).

---

## Simulation results

```r
tar_read(model)
#> # A tibble: 8,000 x 10
#>    variable    q2.5     q25    q75   q97.5  rhat .join_data .rep     .file      .name
#>    <chr>      <dbl>   <dbl>  <dbl>   <dbl> <dbl>      <dbl> <chr>    <chr>      <chr>
#>  1 beta[1]   0.411   0.798   1.21   1.58    1.00      0.768 14556f0a model.stan model
#>  2 beta[2]  -0.816  -0.249   0.336  0.916   1.00      0.163 14556f0a model.stan model
#>  3 beta[1]  -0.610  -0.220   0.186  0.582   1.00     -0.602 a72e116b model.stan model
#>  4 beta[2]  -0.527   0.0733  0.692  1.27    1.00      1.15  a72e116b model.stan model
#>  5 beta[1]  -0.0102  0.357   0.772  1.13    1.00      0.353 df298e43 model.stan model
#>  6 beta[2]  -0.0254  0.544   1.14   1.71    1.00      0.128 df298e43 model.stan model
#>  7 beta[1]   0.346   0.732   1.13   1.51    1.00      1.50  7267cadc model.stan model
#>  8 beta[2]  -1.66   -1.07   -0.475  0.103   1.00     -0.624 7267cadc model.stan model
#>  9 beta[1]  -1.24   -0.866  -0.455 -0.0642  1.00     -0.682 6ca726a2 model.stan model
#> 10 beta[2]  -1.68   -1.12   -0.502  0.0777  1.00     -0.139 6ca726a2 model.stan model
#> # … with 7,990 more rows
```

---

## No evidence of lack of convergence.

```r
max(tar_read(model)$rhat)
#> [1] 1.0078
```

## Coverage is nominal.

```r
tar_read(coverage)
#> # A tibble: 2 x 3
#>   variable coverage_50 coverage_95
#>   <chr>          <dbl>       <dbl>
#> 1 beta[1]        0.506       0.949
#> 2 beta[2]        0.495       0.949
```

---

## Resources

* `targets`: <https://docs.ropensci.org/targets/>
* `stantargets`: <https://docs.ropensci.org/stantargets/>
* Stan: <https://mc-stan.org/>

## References

.small[
* Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. "Stan: A Probabilistic Programming Language." Journal of Statistical Software 76 (1). [10.18637/jss.v076.i01](https://www.jstatsoft.org/article/view/v076i01).
* Samantha R. Cook, Andrew Gelman, and Donald B. Rubin. 2006. "Validation of Software for Bayesian Models Using Posterior Quantiles." Journal of Computational and Graphical Statistics 15 (3): 675–92. <http://www.jstor.org/stable/27594203>.
* John M. Chambers. 2008. <u>Software for Data Analysis: Programming with R</u>. Chapter 3. Springer, ISBN 978-1441926128. <https://www.springer.com/gp/book/9780387759357>.
* Matt Dancho. 2018. "Deep Learning With Keras To Predict Customer Churn." RStudio AI Blog, 2018-01-11. <https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/>.
* Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. "Validating Bayesian Inference Algorithms with Simulation-Based Calibration." <http://arxiv.org/abs/1804.06788>.
]