class: center, middle, inverse, title-slide

# Reproducible computation at scale in R

### Will Landau

---

<style>
.inverse {
  background-color: transparent;
  text-shadow: 0 0 0px transparent;
}
.title-slide {
  vertical-align: bottom !important;
  text-align: center !important;
}
.title-slide h1 {
  position: absolute;
  top: 0;
  left: 0;
  right: 0;
  width: 100%;
  line-height: 4em;
  color: #666666;
}
.title-slide h3 {
  line-height: 6em;
  color: #666666;
}
.title-slide {
  background-color: white;
  background-image: url('images/logo.png');
  background-repeat: no-repeat;
  background-size: 25%;
}
.remark-slide-content:after {
  content: "Copyright Eli Lilly and Company";
  position: absolute;
  bottom: -5px;
  left: 10px;
  height: 40px;
  width: 100%;
  font-family: Helvetica, Arial, sans-serif;
  font-size: 0.7em;
  color: gray;
  background-repeat: no-repeat;
  background-size: contain;
}
.remark-slide-content .nocopyright:after {
  content: "";
}
.small {
  font-size: 85%;
}
</style>

## Demanding computation in R

* Deep learning: `keras`, `tensorflow`, `torch`
* Machine learning: `tidymodels`
* Bayesian data analysis: JAGS, Stan, NIMBLE, `greta`
* PK/PD: `nlmixr`, `mrgsolve`
* Clinical trial simulation: `rpact`, `Mediana`
* Statistical genomics
* Social network analysis
* Permutation tests
* Database queries: `DBI`
* ETL on large data

---

## Repetition: the overlooked bane of long computation

<br>

---

## Workflows have interconnected steps.

---

## If you change code or data...

---

## ...the downstream steps are no longer valid.

---

## Dilemma: short runtimes or reproducible results?

---

## Let a pipeline tool figure out what to rerun.

* Save time while ensuring computational reproducibility.
* Automatic parallel/distributed computing based on the directed acyclic graph.

---

## Pipeline tools

<center>
<img src="./images/infographic.png" height = "125px">
</center>

* Existing pipeline tools: https://github.com/pditommaso/awesome-pipeline
* Most are language-agnostic or designed for Python or the shell.

## {targets}

* Fundamentally designed for R.
* Supports a clean, modular, function-oriented programming style.
* Abstracts files as R objects and automatically manages data.
* Surpasses the permanent limitations of its predecessor, [`drake`](https://github.com/ropensci/drake): <https://books.ropensci.org/targets/drake.html>.
* Continuation of the ideas from `remake` by Rich FitzJohn: <https://github.com/richfitz/remake>.

---

## {targets} embraces functions.

> Nearly everything that happens in R results from a function call. Therefore, basic programming centers on creating and refining functions.
>
> [John Chambers (2008)](https://www.springer.com/gp/book/9780387759357)

* Functions are fundamental building blocks in most computer languages.
* Advantages of functions:
    * Clearly define relationships among inputs and outputs.
    * Help `targets` decide which tasks to run and which ones to skip.
    * Break down complicated ideas into manageable pieces.
    * Custom shorthand to make code easier to read.
* Common uses of R functions in data science:
    1. Preprocess a dataset.
    2. Analyze a dataset.
    3. Summarize an analysis.

---

## Example: customer churn

* Deep learning with Keras.
* IBM Cognos Analytics Telco customer churn dataset.
* Find the model that best predicts which customers will cancel their telecom subscriptions.
* Example comes from Matt Dancho's 2018 RStudio AI Blog post: https://blogs.rstudio.com/ai/posts/2018-01-11-keras-customer-churn/.

---

## Project files

* Many variations on this theme, only `_targets.R` is strictly required.
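* Aside (not part of the original example): if you are starting from scratch, `targets::tar_script()` can write a starter `_targets.R` for you to edit.

```r
# Sketch: bootstrap a minimal _targets.R in the working directory,
# then edit it to declare the targets for your own project.
library(targets)
tar_script() # with no arguments, writes an example _targets.R to modify
```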
```r
_targets.R
R/
└── functions.R
data/
└── customer_churn.csv
```

---

## functions.R file

```r
split_data <- function(file) {
  read_csv(file, col_types = cols()) %>%
    initial_split(prop = 0.7)
}

prepare_recipe <- function(data) {
  # ...
}

test_model <- function(data, recipe, act1) {
  # ...
}
```

---

## _targets.R file

```r
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(file, "data/customer_churn.csv", format = "file"),
  tar_target(data, split_data(file)),
  tar_target(recipe, prepare_recipe(data)),
  tar_target(model1, test_model(data, recipe, act1 = "relu")),
  tar_target(model2, test_model(data, recipe, act1 = "sigmoid")),
  tar_target(model3, test_model(data, recipe, act1 = "linear"))
)
```

---

## Check the graph.

```r
tar_visnetwork(targets_only = TRUE)
```

---

## Run the pipeline.

```r
tar_make()
#> • start target file
#> • built target file
#> • start target data
#> • built target data
#> • start target recipe
#> • built target recipe
#> • start target model1
#> • built target model1
#> • start target model2
#> • built target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Check the results.

```r
tar_read(model1)
#> # A tibble: 1 x 2
#>   accuracy act1
#>      <dbl> <chr>
#> 1    0.794 relu

tar_read(model3)
#> # A tibble: 1 x 2
#>   accuracy act1
#>      <dbl> <chr>
#> 1    0.798 linear
```

---

## Change a target.

```r
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(file, "data/customer_churn.csv", format = "file"),
  tar_target(data, split_data(file)),
  tar_target(recipe, prepare_recipe(data)),
  tar_target(model1, test_model(data, recipe, act1 = "relu")),
  tar_target(model2, test_model(data, recipe, act1 = "sigmoid")),
* tar_target(model3, test_model(data, recipe, act1 = "softmax"))
)
```

---

## Only model3 will rerun next time.

```r
tar_visnetwork(targets_only = TRUE)
```

---

## The other models are skipped to save time.

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> ✓ skip target model1
#> ✓ skip target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Change function code.

```r
define_model <- function(churn_recipe, units1, act1) {
  input_shape <- ncol(
    juice(churn_recipe, all_predictors(), composition = "matrix")
  )
  out <- keras_model_sequential() %>%
    layer_dense(
      units = units1,
      kernel_initializer = "uniform",
      activation = act1,
      input_shape = input_shape
    ) %>%
*   layer_dropout(rate = 0.2) # previously 0.1
  # ...
}
```

---

## {targets} understands dependencies among functions.

```r
tar_visnetwork(targets_only = FALSE)
```

<img src="./images/keras-function.png" height = "400px">

---

## Run the models but skip the data.

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> • start target model1
#> • built target model1
#> • start target model2
#> • built target model2
#> • start target model3
#> • built target model3
#> • end pipeline
```

---

## Tangible evidence of reproducibility

```r
tar_make()
#> ✓ skip target file
#> ✓ skip target data
#> ✓ skip target recipe
#> ✓ skip target model1
#> ✓ skip target model2
#> ✓ skip target model3
#> ✓ skip pipeline
```

---

## Extending {targets}

---

## Target factories

* A target factory is a reusable function that creates target objects.
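* The factory below uses `tar_target_raw()`, the standard-evaluation version of `tar_target()`: the target name is a character string and the command is a quoted expression, which is convenient when a function builds targets programmatically. For illustration, the two calls in the sketch below declare essentially the same `data` target from the churn pipeline:

```r
library(targets)
tar_target(data, split_data(file))              # NSE form used directly in _targets.R
tar_target_raw("data", quote(split_data(file))) # equivalent standard-evaluation form
```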
```r
target_factory <- function(file) {
  list(
    tar_target_raw("file", file, format = "file", deployment = "main"),
    tar_target_raw("data", quote(read_data(file)), format = "fst_tbl", deployment = "main"),
    tar_target_raw("model", quote(run_model(data)), format = "qs")
  )
}
```

---

## Target factories simplify pipeline construction.

```r
# _targets.R
library(targets)
library(yourExamplePackage)
list(
  target_factory("data.csv")
)
```

```r
# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name  command
#>   <chr> <chr>
#> 1 file  "\"data.csv\""
#> 2 data  "read_data(file)"
#> 3 model "run_model(data)"
```

---

## Example: {stantargets}

<center>
<img src="./images/stantargets.png" height = "300px">
</center>

* Easy pipeline construction for Stan statistical models.
* Uses R packages [`cmdstanr`](https://mc-stan.org/cmdstanr/) and [`posterior`](https://mc-stan.org/posterior/).

---

## About Stan

* Probabilistic programming language ([Carpenter et al. 2017](https://www.jstatsoft.org/article/view/v076i01)).
* Markov chain Monte Carlo (MCMC) with HMC and NUTS.
* Often more efficient than Gibbs sampling.
* Flexible specification of posterior distributions.
* Indifferent to conjugacy.
* Variational inference (ADVI)
* Penalized MLE (L-BFGS)

---

## Example Stan model

```c
// model.stan
data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * beta[2], 1);
  beta ~ normal(0, 1);
}
```

---

## Interval-based validation study

* For thousands of independent replications:
    * Simulate data from the prior predictive distribution.
    * Fit the model to the simulated data using MCMC.
    * Calculate x% posterior intervals for each of `beta[1]` and `beta[2]`.
* For each of `beta[1]` and `beta[2]`, roughly x% of the posterior intervals should cover the corresponding parameter draws from the joint prior.
* Simulation-based calibration extends this idea further ([Cook et al. 2006](https://www.jstor.org/stable/27594203); [Talts et al. 2020](https://arxiv.org/abs/1804.06788)).

---

## Function for prior predictive simulations

```r
simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    # Hold on to draws from the prior:
    .join_data = list(beta = beta)
  )
}
```

---

## Function to calculate coverage statistics

```r
calculate_coverage <- function(results) {
  results %>%
    group_by(variable) %>%
    summarize(
      coverage_50 = mean(q25 < .join_data & .join_data < q75),
      coverage_95 = mean(q2.5 < .join_data & .join_data < q97.5)
    )
}
```

---

## _targets.R file setup

```r
library(targets)
library(stantargets)
source("R/functions.R")
tar_option_set(packages = "tidyverse")
*options(clustermq.scheduler = "multicore")
list(
  tar_stan_mcmc_rep_summary(
    name = model,
    stan_files = "model.stan",
    data = simulate_data(),
    batches = 160,
    reps = 25,
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.75, 0.975)),
      rhat = posterior::rhat
    )
  ),
  tar_target(coverage, calculate_coverage(model))
)
```

---

## Multiple targets

---

## Run MCMC batches in parallel

```r
tar_make_clustermq(workers = 4) # 4 local cores
#> • start target model_batch
#> • built target model_batch
#> • start branch model_data_b0b9380a
#> • start branch model_data_ffcdb73c
#> • start branch model_data_b968a03a
#> • start branch model_data_f8763cb2
#> ...
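# Beyond this local run (a sketch, not from the original output): the same
# pipeline can scale out to a cluster by pointing clustermq at a scheduler,
# e.g. SLURM, assuming you supply a suitable template file "slurm.tmpl":
#   options(clustermq.scheduler = "slurm", clustermq.template = "slurm.tmpl")
#   tar_make_clustermq(workers = 100)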
```

* High-performance computing with `targets`: <https://books.ropensci.org/targets/hpc.html>
* Distributed computing available, not just multicore: SLURM, SGE, TORQUE, PBS, LSF.
* Persistent workers with the [`clustermq`](https://mschubert.github.io/clustermq/) package: `tar_make_clustermq()`
* Transient workers with the [`future`](https://future.futureverse.org/) package: `tar_make_future()`
* Monitor progress with [`tar_watch()`](https://docs.ropensci.org/targets/reference/tar_watch.html), [`tar_poll()`](https://docs.ropensci.org/targets/reference/tar_poll.html), or [`tar_progress_branches()`](https://docs.ropensci.org/targets/reference/tar_progress_branches.html).

---

## Simulation results

```r
tar_read(model)
#> # A tibble: 8,000 x 10
#>    variable    q2.5     q25    q75   q97.5  rhat .join_data .rep     .file      .name
#>    <chr>      <dbl>   <dbl>  <dbl>   <dbl> <dbl>      <dbl> <chr>    <chr>      <chr>
#>  1 beta[1]   0.411   0.798   1.21   1.58    1.00      0.768 14556f0a model.stan model
#>  2 beta[2]  -0.816  -0.249   0.336  0.916   1.00      0.163 14556f0a model.stan model
#>  3 beta[1]  -0.610  -0.220   0.186  0.582   1.00     -0.602 a72e116b model.stan model
#>  4 beta[2]  -0.527   0.0733  0.692  1.27    1.00      1.15  a72e116b model.stan model
#>  5 beta[1]  -0.0102  0.357   0.772  1.13    1.00      0.353 df298e43 model.stan model
#>  6 beta[2]  -0.0254  0.544   1.14   1.71    1.00      0.128 df298e43 model.stan model
#>  7 beta[1]   0.346   0.732   1.13   1.51    1.00      1.50  7267cadc model.stan model
#>  8 beta[2]  -1.66   -1.07   -0.475  0.103   1.00     -0.624 7267cadc model.stan model
#>  9 beta[1]  -1.24   -0.866  -0.455 -0.0642  1.00     -0.682 6ca726a2 model.stan model
#> 10 beta[2]  -1.68   -1.12   -0.502  0.0777  1.00     -0.139 6ca726a2 model.stan model
#> # … with 7,990 more rows
```

---

## No evidence of lack of convergence.

```r
max(tar_read(model)$rhat)
#> [1] 1.0078
```

## Coverage is nominal.

```r
tar_read(coverage)
#> # A tibble: 2 x 3
#>   variable coverage_50 coverage_95
#>   <chr>          <dbl>       <dbl>
#> 1 beta[1]        0.506       0.949
#> 2 beta[2]        0.495       0.949
```

---

## Resources

* `targets`: <https://docs.ropensci.org/targets/>
* `stantargets`: <https://docs.ropensci.org/stantargets/>
* Stan: <https://mc-stan.org/>

## References

.small[
* Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. "Stan: A Probabilistic Programming Language." Journal of Statistical Software 76 (1). [10.18637/jss.v076.i01](https://www.jstatsoft.org/article/view/v076i01).
* Samantha R. Cook, Andrew Gelman, and Donald B. Rubin. 2006. "Validation of Software for Bayesian Models Using Posterior Quantiles." Journal of Computational and Graphical Statistics 15 (3): 675–92. <http://www.jstor.org/stable/27594203>.
* John M. Chambers. 2008. <u>Software for Data Analysis: Programming with R</u>. Chapter 3. Springer, ISBN 978-1441926128. <https://www.springer.com/gp/book/9780387759357>.
* Matt Dancho. 2018. "Deep Learning With Keras To Predict Customer Churn." RStudio AI Blog, 2018-01-11. <https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/>.
* Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. "Validating Bayesian Inference Algorithms with Simulation-Based Calibration." <http://arxiv.org/abs/1804.06788>.
]