class: center, middle, inverse, title-slide

# Reproducible computation at scale with targets

### Will Landau

---

<style>
.inverse {
  background-color: transparent;
  text-shadow: 0 0 0px transparent;
}

.title-slide {
  vertical-align: bottom !important;
  text-align: center !important;
}

.title-slide h1 {
  position: absolute;
  top: 0;
  left: 0;
  right: 0;
  width: 100%;
  line-height: 4em;
  color: #666666;
}

.title-slide h3 {
  line-height: 6em;
  color: #666666;
}

.title-slide {
  background-color: white;
  background-image: url('images/logo.png');
  background-repeat: no-repeat;
  background-size: 25%;
}

.remark-slide-content:after {
  content: "Copyright Eli Lilly and Company https://wlandau.github.io/rpharma2020 https://github.com/wlandau/rpharma2020";
  position: absolute;
  bottom: -5px;
  left: 10px;
  height: 40px;
  width: 100%;
  font-family: Helvetica, Arial, sans-serif;
  font-size: 0.7em;
  white-space: pre;
  color: gray;
  background-repeat: no-repeat;
  background-size: contain;
}
</style>

## Large statistical computation

* [Bayesian data analysis](https://mc-stan.org/)
* [Bayesian network meta-analysis](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/bayesian-network-meta-analysis.html)
* [Graph-based multiple comparison procedures](https://github.com/kornl/gMCP)
* [Subgroup identification](https://cran.r-project.org/web/packages/TSDT/index.html)
* [Predictive modeling](http://appliedpredictivemodeling.com/computing)
* [Deep neural networks](https://keras.rstudio.com/)
* [PK/PD modeling](https://github.com/nlmixrdevelopment/nlmixr)
* Clinical trial simulation
* Target identification

???

In the life sciences, we develop ambitious computation for statistics and data science. There's a lot of Bayesian analysis, machine learning, etc.

---

## Common features

1. Heavy use of the [R language](https://www.r-project.org/).
2. Long runtimes.
3. Multiple sub-tasks.
4. Frequent changes to code and data.

<img src = "./images/sisyphus.svg" align="left" style="border: none; box-shadow: none; height: 375px; text-align: center;">

<br>

<!--https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette-->

???

These projects require long runtimes. MCMC and deep neural nets are expensive. It could take several minutes or hours just to fit a single model. So when the code is under rapid development, we run into trouble.

---

## Interconnected tasks

<center>
<img src = "./images/workflow.png" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

A large workflow has a bunch of moving parts.

---

## Changes

<center>
<img src = "./images/change.png" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

If you change any one of these stages,

---

## Consequences

<center>
<img src = "./images/downstream.png" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

then everything that depends on it is no longer valid, and you need to rerun the computation to bring the output back up to date. Changes like this happen all the time, and they usually happen much faster than the project itself can run, so there's no way the results can keep up...

---

## Pipeline tools and workflow managers

<center>
<img src = "./images/infographic.svg" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

- Several exist already: [github.com/pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline).
- Most are language-agnostic or designed for Python or the shell.

???
...unless you use a Make-like pipeline tool to avoid repeating yourself. There are some great pipeline tools for workflow automation, but historically not a whole lot for R.

---

## What distinguishes `targets`?

<center>
<img src = "./images/R.png" align="middle" style="border: none; box-shadow: none; text-align: center; height: 80px">
</center>

* Fundamentally designed for R.
* Supports a clean, modular, function-oriented programming style.
* Abstracts files as R objects and automatically manages data.
* Surpasses the permanent limitations of its predecessor, [`drake`](https://github.com/ropensci/drake): <https://wlandau.github.io/targets/articles/need.html>

???

That's where the new targets package comes in. targets is a Make-like pipeline tool fundamentally designed for R.

---

## What about `drake`?

* `drake` is still an excellent choice for pipeline management, but it has permanent user-side limitations.
* `targets` was created to overcome these limitations and create a smoother user experience.

1. Stronger guardrails by design.
2. A friendlier, lighter, more transparent data management system.
3. Seamless AWS S3 integration.
4. Show which *functions* are up to date.
5. More flexible dynamic branching, including compatibility with `dplyr::group_by()`.
6. Improved parallel efficiency.
7. Designed for custom user-side [metaprogramming](https://wlandau.github.io/targets-manual/branching.html#metaprogramming) and target archetypes: <https://wlandau.github.io/tarchetypes/>.

???

But what about drake? drake already does this. It's never going away, and I will still maintain it. However, drake has improved so much over the last four years that the only way left to make nontrivial progress for users is to finally confront systemic limitations that drake is stuck with forever, limitations so deep in the architecture that we can't fix them without breaking the tool or the existing projects that use it. So to move the general capability forward, beyond what drake will ever be capable of, we need a new package. And that package is called "targets".

---

## Guardrails in `targets`

* The only way to use `targets` is the correct way.
* Main guardrails:

1. Always run in a fresh R process (unless you deliberately configure `targets` for debugging).
2. Require a `_targets.R` configuration file in the project root.
3. Require the `_targets/` data store to always be in the project root.

???

First of all, targets is deliberately less flexible. Less flexibility is actually a good thing here. targets runs the pipeline in a clean new process by default, and it has strict policies about your working directory and data store. So it's more reproducible, more dependable, and ultimately smoother for the user.

---

## `drake`'s cache

```
.drake/
├── config/
├── data/
├───── 17bfcef645301416.rds
├───── 21935c86f12692e2.rds
├───── 37caf5df2892cfc4.rds
├───── ...
├── drake/
├── keys/
├───── memoize/
├───── meta/
├───── objects/
├───── ...
└── scratch/
```

???

Now for data management. drake's cache has a huge number of cryptically-named files. It's not portable, and it's brittle.

---

## The data store in `targets`

```
_targets/
├── meta/
├───── meta
├───── progress
├── objects/
├───── target_name_1
├───── target_name_2
├───── target_name_3
└───── ...
```

???

targets simplifies storage. The data store is light, portable, resilient, and easy to understand. It can recover if a data file breaks, and third-party products like Git, Dropbox, and OneDrive have a much easier time.
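
---

## Reading from the data store

* A minimal sketch, assuming a pipeline has already run and produced a target called `target_name_1` (a made-up name from the tree on the previous slide).
* Helper functions look up objects by target name, so you never handle the files in `_targets/` directly.

```r
# Return the value of a target stored in _targets/objects/.
tar_read(target_name_1)

# Load one or more targets into the global environment.
tar_load(target_name_1)

# Inspect the metadata stored in _targets/meta/meta.
tar_meta()
```

???

And because the layout is so simple, everyday helpers like tar_read(), tar_load(), and tar_meta() are all it takes to work with the store.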
---

## Seamless AWS S3 integration

<https://wlandau.github.io/targets-manual/cloud.html>

```r
# _targets.R
tar_option_set(resources = list(bucket = "my-bucket-name"))
tar_pipeline(
  tar_target(dataset, get_large_dataset(), format = "aws_fst_tbl"),
  tar_target(analysis, analyze_dataset(dataset), format = "aws_qs")
)
```

```r
# R session
tar_make()
tar_read(dataset)
```

???

The simpler data store paved the way for seamless, Metaflow-like integration with Amazon S3. After just a little config, targets can automatically upload output to an S3 bucket and track it. And retrieving cloud data feels exactly the same as reading local data.

---

## Show which functions are out of date

![](./images/graph_imports.png)

???

drake tells you which targets are out of date, but it says less about functions. The targets package shows the status of functions too, which makes it easier to find out why things are not up to date.

---

## Dynamic branching with `dplyr`

```r
library(targets)
source("functions.R") # Defines get_data() and analyze_group().
tar_option_set(packages = "dplyr")
tar_pipeline(
  tar_target(
    grouped_data,
    get_data() %>% group_by(id) %>% tar_group(),
    iteration = "group"
  ),
  tar_target(
    model,
    analyze_group(grouped_data),
    pattern = map(grouped_data) # Maps over the groups defined by group_by().
  )
)
```

???

Dynamic branching is more flexible. targets lets you take a data frame, group it with dplyr, and then dynamically branch over the subsets. Folks have been asking for this in drake for years, and it's only possible now because, unlike drake, targets is dynamic to the core.

---

## Inefficient dynamic branching in drake

![](./images/dynamic_drake.png)

???

Dynamic branching is also more efficient. Because of the original design of the architecture, drake was forced to approach dynamic branching with a traditional map-reduce mindset.

---

## Efficient dynamic branching in targets

![](./images/dynamic_targets.png)

???

But in targets, downstream branches can start even if some of the upstream branches are still running. So your work gets done faster.

---

## Metaprogramming

* `tar_target_raw()` avoids non-standard evaluation and supports third-party metaprogramming.
* The following are equivalent ways to define a target.

```r
# For most users:
tar_target(data, simulate_data(), pattern = map(index))

# For developers who metaprogram reusable pipeline archetypes:
tar_target_raw(
  "data",
  quote(simulate_data()),
  pattern = quote(map(index))
)
```

???

Lastly, targets is easier to extend and build on. There's a way to declare a target while avoiding non-standard evaluation, which opens the door to third-party interface development.

---

## Target archetypes

* The `tarchetypes` package has helpers for commonly used targets: <https://wlandau.github.io/tarchetypes/>

Function | Target archetype
---|---
`tar_render()` | Render a dependency-aware R Markdown report.
`tar_knit()` | Run a dependency-aware `knitr` report.
`tar_change()` | Always run a target when a custom object changes.
`tar_force()` | Always run a target when a custom condition is true.
`tar_suppress()` | Never run a target when a custom condition is true.
`tar_plan()` | Simplified `drake`-like syntax for `targets` pipelines.

???

The tarchetypes package takes advantage of this. It's full of little archetypes for commonly used targets. tar_render(), for example, seamlessly integrates dependency-aware R Markdown into pipelines, including but not limited to parameterized reports where the parameters are actually upstream targets.
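
---

## Example: a `tar_render()` report

* A minimal sketch, assuming a hypothetical `report.Rmd` that calls `tar_read(dataset)` in a code chunk, with `get_large_dataset()` standing in for your own data code.

```r
# _targets.R
library(targets)
library(tarchetypes)
source("R/functions.R") # Assumed to define get_large_dataset().
tar_pipeline(
  tar_target(dataset, get_large_dataset()),
  tar_render(report, "report.Rmd") # Reruns only when report.Rmd or dataset changes.
)
```

???

tar_render() statically detects the tar_read() and tar_load() calls inside the report, so the rendered report becomes a dependency-aware target at the end of the pipeline.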
---

## Example: COVID-19 clinical trial simulation

* Motivation: design a randomized placebo-controlled phase 2 clinical trial of a potential new treatment for COVID-19.
* Goal: understand the operating characteristics of a 200-patient trial under different effect size scenarios.
* Enroll newly hospitalized patients (within 3 days of admission) who do not need supplemental oxygen or ventilation.
* Endpoint: days until the patient is discharged from the hospital.
* Efficacy rule: graduate to phase 3 only if Prob(hazard ratio of hospital discharge > 1.5) > 0.6.
* Simulation:
    1. Simulate left-truncated, right-censored time-to-event data for each arm (1 treatment and 1 placebo) from normal distributions.
    2. Analyze with a Bayesian survival model by [Zhou, Hanson, and Zhang](https://www.jstatsoft.org/article/view/v092i09) (2020; R package [`spBayesSurv`](https://cran.r-project.org/web/packages/spBayesSurv/index.html)).
    3. Aggregate over simulations to calculate operating characteristics.

???

Let's go to an example. I am part of a capabilities team at Lilly, and we spend a lot of time designing and simulating clinical trials. And of course this year, we're busy with COVID-19. We've been using drake to manage the computation almost the entire time, and here's how it works in targets. This made-up, oversimplified example is not tied to any one real trial in particular, but it does get across how we think about the computation in general.

---

## File structure

```r
run.sh
run.R
_targets.R
sge.tmpl
R/
└── functions.R
```

???

We start by setting up an organized file structure. That part is up to us as users and is outside the scope of the tool. So we borrow from established conventions around R packages and research compendia.

---

## `functions.R`

```r
simulate_trial <- function(
  mean_control = 15,
  mean_treatment = 10,
  patients_per_arm = 100,
  censor = 30
) {
  bind_rows(
    simulate_arm(mean_control, censor, patients_per_arm, "control"),
    simulate_arm(mean_treatment, censor, patients_per_arm, "treatment")
  ) %>%
    mutate(
      patients_per_arm = patients_per_arm,
      mean_control = mean_control,
      mean_treatment = mean_treatment
    )
}
```

???

Just like in drake, for most of our code, we're going to write our own functions. Each function should have no side effects, and the return value should be a nice clean R object that humans can understand and the computer can serialize.

---

## `functions.R`

```r
model_hazard <- function(patients, iterations) {
  samples <- replicate(4L, run_chain(patients, iterations), simplify = FALSE)
  summarize_samples(samples, patients)
}

summarize_samples <- function(samples, patients) {
  hazard_ratio_list <- map(samples, ~as.mcmc(t(exp(.x$beta))))
  hazard_ratio <- unlist(hazard_ratio_list)
  tibble(
    prob_effect = mean(hazard_ratio > 1.5),
    median = median(hazard_ratio),
    psrf = gelman.diag(hazard_ratio_list, multivariate = FALSE)$psrf[, 1],
    patients_per_arm = patients$patients_per_arm[1],
    mean_control = patients$mean_control[1],
    mean_treatment = patients$mean_treatment[1]
  )
}

# And a few more...
```

???

User-defined functions for data analysis usually have one of three purposes: get data, analyze data, or summarize things. So you're usually going to return a dataset, a fitted model object, or a table or figure.

---

## `_targets.R`

* The `_targets.R` script defines the pipeline (see `tar_script()`).
```r
library(targets)
source("R/functions.R")
tar_option_set(packages = c("coda", "spBayesSurv", "tidyverse", "truncnorm"))
options(clustermq.scheduler = "sge", clustermq.template = "sge.tmpl")
tar_pipeline(
  tar_target(sim, seq_len(1000), deployment = "local"),
  tar_target(mean_treatment, c(10, 20), deployment = "local"),
  tar_target(
    patients,
    simulate_trial(
      mean_control = 20,
      mean_treatment = mean_treatment,
      patients_per_arm = 100,
      censor = 30
    ),
    pattern = cross(sim, mean_treatment),
    format = "fst_tbl"
  ),
```

???

The _targets.R file is special. That's where we define the pipeline. It has a bunch of targets, and each target is a skippable step of the computation: again, usually a dataset, analysis, or summary.

---

## `_targets.R`

```r
  tar_target(
    models,
    model_hazard(patients, 2000),
    pattern = map(patients),
    format = "fst_tbl"
  ),
  tar_target(
    summaries,
    summarize_models(models),
    format = "fst_tbl"
  ),
  tar_target(
    results,
    summaries,
    format = "fst_tbl",
    deployment = "local"
  )
)
```

???

A target should be large enough to take a bite out of the runtime, but small enough so that some targets can be skipped even if others need to run.

---

## Inspect the pipeline.

```r
tar_visnetwork()
```

![](./images/graph1.png)

???

Now before you actually run the pipeline, always look at the graph. Targets automatically figures out dependency relationships and execution order using static code analysis, and you want to make sure things are connected properly before you deploy a long set of jobs.

---

## Run the pipeline on a cluster.

```r
tar_make_clustermq(workers = 100)
#> ● run target mean_treatment
#> ● run target sim
#> ● run branch patients_db68b7ea
#> ● run branch patients_9e31afca
#> ● run branch models_eba1673a
#> ● run branch models_212ba124
#> ...
#> ● run target summaries
#> ● run target results
```

???

We actually run the pipeline with one of the tar_make() functions. This one talks to computing clusters. It runs the correct targets in the correct order, farms out to a bunch of remote nodes, and writes the return values to storage. All this is automatic, so your custom functions can focus on the actual methodology instead of micromanaging the details of parallel computing.

---

## Inspect the results

```r
tar_read(results)
#> # A tibble: 2 x 6
#>   prob_success mean_treatment mean_control patients_per_arm median max_psrf
#>          <dbl>          <dbl>        <dbl>            <dbl>  <dbl>    <dbl>
#> 1        0.997             10           20              100  2.38      1.02
#> 2        0.003             20           20              100  0.996     1.02
```

???

After that, all the targets are in a special data store, and there are helper functions to retrieve the data based on the variable name.

---

## Add a new effect size scenario in `_targets.R`

```r
tar_pipeline(
  # ...
  tar_target(mean_treatment, c(10, 15, 20), deployment = "local"),
  # ...
)
```

???

Now let's say we add new targets to the pipeline.

---

## Some targets are outdated.

```r
tar_visnetwork()
```

![](./images/graph2.png)

???

The new ones are out of date, and all the downstream ones are called into question.

---

## Only the new patients and branches run.

* Skips 2000 up-to-date models (8000 MCMC chains you do not have to run).

```r
tar_make_clustermq(workers = 100)
#> ✔ skip target mean_treatment
#> ✔ skip target sim
#> ✔ skip branch patients_db68b7ea
#> ✔ skip branch patients_9e31afca
#> ...
#> ● run branch patients_80f54768
#> ● run branch patients_e24a9e83
#> ...
#> ✔ skip branch models_eba1673a
#> ✔ skip branch models_212ba124
#> ...
#> ● run branch models_f324af00
#> ● run branch models_0689710c
#> ...
#> ● run target summaries
#> ● run target results
```

???
When we run the pipeline again, only the new simulation scenarios actually get computed. Anything already up to date gets skipped. In our case, that's 2000 expensive Bayesian models we don't need to run.

---

## New combined results

```r
tar_read(results)
#> # A tibble: 3 x 6
#>   prob_success mean_treatment mean_control patients_per_arm median max_psrf
#>          <dbl>          <dbl>        <dbl>            <dbl>  <dbl>    <dbl>
#> 1        0.997             10           20              100  2.38      1.02
#> 2        0.51              15           20              100  1.57      1.02
#> 3        0.003             20           20              100  0.996     1.02
```

???

And our final results reflect the new targets we added.

---

## Tangible evidence of reproducibility.

```r
tar_make_clustermq(workers = 100)
#> ...
#> ✔ skip target summaries
#> ✔ skip target results
#> ✓ Already up to date.
```

```r
tar_outdated()
#> character(0)
```

???

At the end of the day, the targets package can tell you if everything is up to date. This is tangible evidence that your output matches the code and data it's supposed to come from. It's evidence that someone else running the same code would get the same results, and you don't have to waste time rerunning the project to prove it. So workflow automation with targets is just as important to reproducibility as literate programming with R Markdown, or version control with Git, or the way you organize the files in a project with devtools or usethis.

---

## Links

* Development repository: <https://github.com/wlandau/targets>
* Reference website: <https://wlandau.github.io/targets/>
* User manual: <https://wlandau.github.io/targets-manual/>
* Workshop: <https://github.com/wlandau/targets-tutorial>

## Examples

* Minimal: <https://github.com/wlandau/targets-minimal>
* Validating a Stan model: <https://github.com/wlandau/targets-stan>
* Machine learning with Keras: <https://github.com/wlandau/targets-keras>

## Helpers

* Target archetypes: <https://wlandau.github.io/tarchetypes/>
* Shiny app to help sketch pipelines: <https://wlandau.shinyapps.io/targetsketch>

???

targets is under review at rOpenSci, and it should go to CRAN in a couple of months. In the meantime, you can get it from GitHub, and there's a lot of documentation and examples, including the half-day workshop I gave at this conference last week. Finally, I'd like to thank R/Pharma for the opportunity, and I'd like to thank all the early adopters of this package for the incredibly useful feedback.

---

## References

1. Zhou, Haiming, Timothy Hanson, and Jiajia Zhang. "spBayesSurv: Fitting Bayesian Survival Models Using R." *Journal of Statistical Software* 92 (9), 2020. [doi:10.18637/jss.v092.i09](https://doi.org/10.18637/jss.v092.i09).