class: center, middle, inverse, title-slide

# Target Markdown and stantargets for Bayesian model validation pipelines

### Will Landau

---

<style>
.inverse { background-color: transparent; text-shadow: 0 0 0px transparent; }
.title-slide { vertical-align: bottom !important; text-align: center !important; }
.title-slide h1 { position: absolute; top: 0; left: 0; right: 0; width: 100%; line-height: 4em; color: #666666; font-size: 2em; }
.title-slide h3 { line-height: 2em; color: #666666; }
.title-slide { background-color: white; background-image: url('images/logos.png'); background-repeat: no-repeat; background-size: 50%; }
.remark-slide-content:after { content: "Copyright Eli Lilly and Company"; position: absolute; bottom: -5px; left: 10px; height: 40px; width: 100%; font-family: Helvetica, Arial, sans-serif; font-size: 0.7em; color: gray; background-repeat: no-repeat; background-size: contain; }
.remark-slide-content .nocopyright:after { content: ""; }
.small { font-size: 65%; }
</style>

## Demanding computation in R

* **Bayesian data analysis: JAGS, Stan, NIMBLE, `greta`**
* Deep learning: `keras`, `tensorflow`, `torch`
* Machine learning: `tidymodels`
* PK/PD: `nlmixr`, `mrgsolve`
* Clinical trial simulation: `rpact`, `Mediana`
* Statistical genomics
* Social network analysis
* Permutation tests
* Database queries: `DBI`
* ETL on large data

???

Every workflow system has tradeoffs. R Markdown is easy to use, but it struggles to handle a lot of code and a lot of runtime. On the other hand, the {targets} package handles a ton of work, but on its own, it's a bit pedantic for routine data analysis. This talk is the debut of a new system called Target Markdown, which has all the convenience of R Markdown and all the power of {targets}. I will also discuss {stantargets}, an extension to {targets} for Bayesian statistics. Working together, these tools tackle daunting tasks that come up a lot in the life sciences. Machine learning, Markov chain Monte Carlo, simulation, prediction, genomics, PK/PD, and database queries are just some examples.

---

## Repetition: the overlooked bane of long computation

<br>

![](./images/reality.png)

???

Part of what makes these tasks so daunting is that, number one, the code is usually slow, and number two, you're never really done running it. There are always bugs to fix, follow-up questions to answer, new data, and all sorts of other reasons to rerun all of that slow code. It is easy to get stuck in a Sisyphean loop where you spend a lot of time waiting for jobs to finish and you struggle to get hold of a complete set of results that are current and up to date. Even a single minute of delay between updates is long enough to feel, because it's compounded by every update you make.

---

## Workflows have interconnected steps.

![](./images/workflow.png)

???

To get out of the death loop, we have to break it down and think of a data analysis workflow as a pipeline. A pipeline is a collection of interconnected steps, or targets, with clearly identified inputs and outputs.

---

## If you change code or data...

![](./images/change.png)

???

If you change the code for a model, for example,

---

## ...the downstream steps are no longer valid.

![](./images/downstream.png)

???

that change invalidates everything that uses that model. The post-processing and summaries for that model need to rerun to get the latest answers. But you shouldn't have to waste time recomputing the upstream data.

---

## Dilemma: short runtimes or reproducible results?

![](./images/decisions.png)

???
So in real life, which targets can we skip to save time, and which targets really need to rerun? Unfortunately, no human can be trusted to answer that question reliably, especially for a project you haven't touched in several months. To ensure our results are correct while saving as much time as possible, we need automated tools to make objective decisions about what to run and what to skip.

---

## Let a pipeline tool figure out what to rerun.

![](./images/pipeline_graph.png)

* Save time while ensuring computational reproducibility.
* Automatic parallel/distributed computing based on the directed acyclic graph.

???

This is exactly the job of a Make-like pipeline tool. By identifying the inputs and outputs of each target, a Make-like pipeline tool arranges the targets in a directed acyclic graph, and it runs the correct targets in the correct order. And by analyzing the graph, it even detects opportunities to use parallel computing or distributed computing to run multiple targets simultaneously. And of course, it automatically skips any targets whose code or upstream dependencies have not changed. Not only does this let you adapt to changes quickly, it also gives you tangible evidence of the status of the results. If the pipeline tool tells you everything is up to date, that's telling you that someone else could run your code from scratch and get the same results as you. That is a definition of reproducibility.

---

## Pipeline tools

<center>
<img src="./images/infographic.png" height = "125px">
</center>

* Existing pipeline tools: https://github.com/pditommaso/awesome-pipeline
* Most are language-agnostic or designed for Python or the shell.

## {targets}

* Fundamentally designed for R.
* Supports a clean, modular, function-oriented programming style.
* Abstracts files as R objects and automatically manages data.
* Surpasses the permanent limitations of its predecessor, [`drake`](https://github.com/ropensci/drake): <https://books.ropensci.org/targets/drake.html>.
* Continuation of the ideas from `remake` by Rich FitzJohn: <https://github.com/richfitz/remake>.

???

There are hundreds of pipeline tools for other languages, but historically not a whole lot for R. The {targets} package, which builds on its predecessors {drake} and {remake}, is designed to work seamlessly within R itself. It encourages good programming practices, abstracts files as variables, and natively integrates with R Markdown, as you will see soon. So the {targets} package lets you work more naturally in R than a language-agnostic pipeline tool would.

---

## Challenge

* Most pipelines have a lot of user-side code.
* `targets` prefers code to be in pure user-defined functions (see the sketch on the next slide).
* Leads to a lot of user-side software engineering.

## Solutions for Bayesian workflows

* `stantargets` reduces the volume of user-side code and automates entire validation pipelines: <https://docs.ropensci.org/stantargets/articles/simulation.html>
* Target Markdown is a comfortable interface for interactive prototyping and non-interactive pipeline construction: <https://books.ropensci.org/targets/markdown.html>

???

But there are drawbacks. If you work with {targets} on its own, constructing a pipeline requires you to have a clear idea of the exact inputs and the exact outputs of each target. You don't have to explicitly declare them like in other tools, because {targets} automatically analyzes your code to detect them, but you still have to be disciplined about the way you structure your R code.
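---

## Sketch: a hand-written {targets} pipeline

A minimal sketch (with hypothetical file and function names) of what this user-side engineering looks like in a plain `_targets.R` file, without any target factories.

```r
# _targets.R
library(targets)
source("R/functions.R") # user-defined simulate_data(), fit_model(), summarize_fit()

list(
  tar_target(data, simulate_data()),
  tar_target(fit, fit_model(data)),
  tar_target(results, summarize_fit(fit, data))
)
```

Multiply this by many models and simulation scenarios, and the user ends up writing and maintaining a lot of functions and targets by hand.

???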
And that means writing pure functions to produce datasets, run models on those datasets, and summarize those models. That's a lot of software engineering to ask of a statistician or data analyst. But today, I will share two major breakthroughs that address this usability problem and democratize pipelines, and demonstrate their usefulness with an example in Bayesian statistics for clinical trials.

---

## Extending {targets}

![](./images/targetopia.png)

???

The first is the R Targetopia, an emerging collection of packages like `stantargets` that produce ready-made pipelines for specialized situations. These packages already have functions and targets built in, so users do not have to write nearly as much code or think as hard about engineering the pipeline at a low level.

---

## Target factories

* A target factory is a reusable function that creates target objects.

```r
#' @title Example target factory in an R package.
#' @export
#' @description A target factory to analyze data.
#' @return A list of 3 target objects to:
#'   1. Track the file for changes,
#'   2. Read the data in the file, and
#'   3. Analyze the data.
#' @param file Character of length 1, path to the file.
target_factory <- function(file) {
  list(
    tar_target_raw("file", file, format = "file", deployment = "main"),
    tar_target_raw("data", quote(read_data(file)), format = "fst_tbl", deployment = "main"),
    tar_target_raw("model", quote(run_model(data)), format = "qs")
  )
}
```

???

The mechanism behind these packages is a domain-specific pattern called the target factory. A target factory is a function, usually in a package, that produces one or more target objects. A target object is just the definition of a target, represented as a special instance of an S3 class.

---

## Target factories simplify pipeline construction.

```r
# _targets.R
library(targets)
library(yourExamplePackage)
list(
  target_factory("data.csv")
)
```

```r
# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name  command
#>   <chr> <chr>
#> 1 file  "\"data.csv\""
#> 2 data  "read_data(file)"
#> 3 model "run_model(data)"
```

???

For the user, target factories simplify pipeline construction. In this example, you feel like you're only writing one target, but you actually get three. So the details of constructing those targets and connecting them together are all abstracted away.

---

## Example: {stantargets}

<center>
<img src="./images/stantargets.png" height = "300px">
</center>

* Easy pipeline construction for Stan statistical models.
* Uses R packages [`cmdstanr`](https://mc-stan.org/cmdstanr/) and [`posterior`](https://mc-stan.org/posterior/).

???

{stantargets} leverages this idea for Bayesian data analysis with Stan. It has target factories for everything from single-run workflows to large-scale simulation studies. If you use R and you use Stan, {stantargets} is worth trying out. I will sketch a minimal example in a moment.

---

## About Stan

* Probabilistic programming language ([Carpenter et al. 2017](https://www.jstatsoft.org/article/view/v076i01)).
* Markov chain Monte Carlo (MCMC) with HMC and NUTS.
* Often more efficient than Gibbs sampling.
* Flexible specification of posterior distributions.
* Indifferent to conjugacy.
* Variational inference (ADVI).
* Penalized MLE (L-BFGS).

???

A bit of background: Stan is a probabilistic programming language for all kinds of statistical modeling. It is most famous for Hamiltonian Monte Carlo to fit Bayesian models, but it also supports variational inference and optimization.
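---

## Sketch: a minimal {stantargets} pipeline

A sketch of the simplest kind of {stantargets} pipeline: one Stan model fit to one simulated dataset. The file `model.stan` and the function `simulate_data()` below are hypothetical placeholders, not the model from this talk.

```r
# _targets.R
library(targets)
library(stantargets)

simulate_data <- function(n = 10) {
  beta <- rnorm(1)
  x <- seq(-1, 1, length.out = n)
  y <- rnorm(n, x * beta, 1)
  list(n = n, x = x, y = y)
}

list(
  tar_stan_mcmc(fit, "model.stan", simulate_data())
)
```

Roughly speaking, `tar_stan_mcmc()` creates targets to track and compile the Stan file, run MCMC, and produce posterior draws, summaries, and diagnostics.

???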
It can be really fast, but as with anything Bayesian, it inevitably comes with a nontrivial computational burden, which is exactly where {targets} and {stantargets} can help.

---

## Example: Bayesian longitudinal model for clinical trials

<!--
$$
\begin{aligned}
& y \sim \text{MVN}(X_{(n \cdot t) \times p} \beta, \ I_{n \times n} \otimes \Sigma_{t \times t} ) \\
& \qquad \beta \sim \text{MVN} (0, 10^2 I_{p \times p})\\
& \qquad \Sigma_{t \times t} = \left (I_{t \times t} \sigma \right ) \Lambda_{t \times t} \Lambda_{t \times t}' \left (I_{t \times t} \sigma \right ) \\
& \qquad \qquad \sigma_1, \ldots, \sigma_t \stackrel{\text{ind}}{\sim} \text{Cauchy}^+(0, 5) \\
& \qquad \qquad \Lambda_{t \times t}\Lambda_{t \times t}' \sim \text{LKJ}(\text{shape} = 1, \text{order} = t)
\end{aligned}
$$
-->

![](./images/model.png)

* A common variant of this model uses an inverse-Wishart prior for the covariance, which induces troublesome prior relationships among covariance components ([Alvarez et al. 2016](https://arxiv.org/abs/1408.4050)).
* The above LKJ-based model could help refine some existing models currently used on real clinical trial data.
* **But first, we need to ensure the above model is implemented correctly.**

???

I'm going to show how this works using a Bayesian longitudinal linear model. What you see here is like an MMRM but without random effects. I don't claim it's the best model, but it is extremely popular in clinical trial data analysis. Most of the time, you see it with an inverse-Wishart prior if the covariance is unstructured, and this induces problematic associations among variances and correlations, which motivates the model you see here. A solution is to model the variances and the correlations separately and to use an LKJ prior or similar on the correlation matrix. It was straightforward to write this model in Stan because Stan is so flexible and because the algorithms are indifferent to conditional conjugacy. Before using this model in the topline analysis of an actual clinical trial, we need to validate it. That includes checking all sorts of issues, one of which is the correctness of the implementation. In other words, does the Stan code accurately follow the model specification?

---

## Interval-based validation study

* For several independent replications:
    * Simulate data from the prior predictive distribution.
    * Fit the model to the simulated data using MCMC.
    * Calculate x% posterior intervals for each scalar parameter.
* For each scalar parameter, roughly x% of the posterior intervals should cover the corresponding parameter draws from the joint prior (see the sketch on the next slide).
* 50% and 95% are common choices for x%.
* Based on the concept of calibration ([Carpenter 2017](https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/)).
* Simulation-based calibration extends this idea further ([Cook et al. 2006](https://www.jstor.org/stable/27594203); [Talts et al. 2020](https://arxiv.org/abs/1804.06788)).

???

One way to do that is with calibration. The idea is to simulate thousands of datasets from the prior predictive distribution, analyze each dataset with the model, and find out how well the posterior parameter samples agree with the parameter samples from the prior. So if we compute a 50% posterior interval for each simulation, then 50% of those intervals should contain the corresponding prior draws of that parameter. Likewise for 95% intervals or whatever credible level you choose.
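---

## Sketch: computing interval coverage

A minimal sketch of the coverage calculation, assuming a hypothetical data frame `summaries` with one row per simulation replication and parameter, a `variable` column identifying the parameter, the prior draw in a `truth` column, and posterior interval bounds in columns `q25`, `q75`, `q2.5`, and `q97.5`.

```r
library(dplyr)

coverage <- summaries %>%
  group_by(variable) %>%
  summarize(
    coverage_50 = mean(q25 <= truth & truth <= q75),   # should be close to 0.50
    coverage_95 = mean(q2.5 <= truth & truth <= q97.5) # should be close to 0.95
  )
```

In the pipeline that follows, {stantargets} assembles a summary data frame much like this one, so the coverage calculation reduces to one small target.

???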
If coverage is nominal, that's evidence that the analysis model and the data-generating model agree. In practice, I have found this exercise to uncover a lot of bugs, and I will walk through it in a moment.

---

## Write the pipeline in Target Markdown

* R Markdown interface for `targets`.
* Interactive mode for prototyping and emulation.
* Non-interactive mode for pipeline construction.
* Template available through RStudio and [`use_targets()`](https://docs.ropensci.org/targets/reference/use_targets.html).

<center>
<img src="./images/target_markdown.png" height="350" align = "center">
</center>

???

When I do, I will use Target Markdown, a brand new system that combines the best of {targets} with the best of R Markdown. We want the power of `stantargets` to run a huge simulation pipeline, and we also want everything to live inside R Markdown documents because it's convenient and because we can explain the details of the methodology right next to the actual code that runs it. There are two ways to use Target Markdown. There's an interactive mode for testing and prototyping, and there's a non-interactive mode for pipeline construction. What this looks like is you'll just use R Markdown pretty much like you would in other situations, but you have a special {targets} language engine that creates a pipeline behind the scenes, one code chunk at a time. This works whether you have a single R Markdown report or multiple reports like in a `bookdown` project. If you have the latest version of {targets}, you can get an example Target Markdown document either through the RStudio template system or through the use_targets() function.

---

## One function to simulate prior predictive data

* No other user-defined function required.
* Interactive mode emulates `targets`' behavior in your local environment.

<center>
<img src="./images/target_markdown_globals.png" height = "400">
</center>

???

So we're inside an R Markdown report. And to study the calibration of our Bayesian model of clinical trial data, we want to define a function to simulate data from the prior predictive distribution. Thanks to {stantargets}, this is the only user-defined function that the pipeline needs us to write by hand. To make this function and other global objects available to the pipeline as dependencies, we use the {targets} language engine instead of the R language engine, and we set the tar_globals chunk option to TRUE. If you're in the notebook interface and you click the green "play" button on the right, Target Markdown will just run the code and assign it to the environment of the pipeline like it says here at the bottom. You can do this if you want to test and prototype the function locally in your R session, and you can divide up your functions and globals among multiple code chunks like this one. Interactive mode for a target or list of targets is similar. Target Markdown will resolve the directed acyclic graph, run the correct targets in the correct order out of the ones in the chunk, test that the targets can be stored and retrieved properly, and then assign them to memory as variables in your R session.

---

## Simulation and MCMC with {stantargets}

* Non-interactive mode writes the `_targets.R` file and supporting scripts.
* Declares targets but does not run them.

<center>
<img src="./images/target_markdown_targets.png" height = "400">
</center>

???

So interactive mode runs code in your environment and saves nothing to persistent storage. It's for testing and prototyping only, and the results go away when you restart your R session. But non-interactive mode, which runs when you knit the entire document, is the opposite. Non-interactive mode does not actually execute the code in the chunk. Instead, it saves that code to a script file to define part of the pipeline. That goes for functions like in the previous slide, as well as targets and target factories like you see here. The idea is to incrementally define a pipeline now, and then do a serious run of the pipeline outside of Target Markdown later on once all the R scripts are established. In this chunk, we invoke a target factory called tar_stan_mcmc_rep_summary() to define the bulk of the work of this simulation study: draw prior predictive data, run the model, and compute summary statistics and convergence diagnostics. You don't need to worry about what R script file to put it in or how targets and functions are organized within scripts. Thanks to Target Markdown, all you need to focus on is the chunks inside the report.
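---

## Sketch: the {stantargets} simulation chunk

Roughly what a `tar_stan_mcmc_rep_summary()` call like the one on the previous slide might look like. The Stan file, the simulation function, the parameter names, and the batch sizes below are hypothetical placeholders; see the rendered pipeline report for the real code.

```r
tar_stan_mcmc_rep_summary(
  mcmc,                         # name stem for the generated targets
  "model.stan",                 # Stan model file
  simulate_prior_predictive(),  # user-defined prior predictive simulation
  batches = 100,                # number of dynamic branches
  reps = 10,                    # simulated datasets per branch
  variables = c("beta", "sigma", "lambda"),
  summaries = list(
    ~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.75, 0.975))
  )
)
```

In broad strokes, this one factory call declares the targets that simulate the prior predictive datasets, run MCMC on each one, and combine the posterior summaries into a single data frame.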
---

## Simple target for convergence diagnostics

<center>
<img src="./images/target_markdown_convergence.png">
</center>

???

For even more convenience, you can turn arbitrary code chunks into targets with the tar_simple chunk option, with the chunk code as the command and the chunk label as the target name. It will work as long as the chunk acts like a pure function, meaning it returns a value and does not cause any side effects. So here we have one target to summarize convergence diagnostics,

---

## Simple target for coverage statistics

<center>
<img src="./images/target_markdown_coverage.png">
</center>

???

and another target to calculate coverage. At this point, our entire pipeline is defined. When we run the report from end to end, the script files get written, and the report renders quickly. We then have the option of running the pipeline using the `tar_make()` function or similar.

---

## Optional R code chunk to run the pipeline

* Either run the pipeline in an ordinary R code chunk (below) or invoke `tar_make_clustermq()` outside the R Markdown report.

<center>
<img src="./images/target_markdown_run.png">
</center>

???

In this example, we run the pipeline in the same report that defines the pipeline. We write an ordinary R code chunk after all the targets chunks, so we're not using the {targets} engine anymore, and we call the `tar_make_clustermq()` function to distribute our simulations across 100 workers on a computing cluster.

---

## Optional R code chunks to read the results

<center>
<img src="./images/target_markdown_results.png">
</center>

???

It's also nice to have ordinary R code chunks to read the results from the {targets} data store.

---

## First run takes a long time.

<center>
<img src="./images/tar_make_clustermq.png">
</center>

???

The first time I ran this pipeline, it took almost 7 hours to finish, despite the heavy-duty distributed computing. That's how large Bayesian computation can get sometimes.

---

## Subsequent runs skip up-to-date targets.

<center>
<img src="./images/tar_make_clustermq_skip.png">
</center>

???

But the second time around, all the results were up to date. Thanks to {targets}, it took only a few seconds for the whole report to re-render. So Target Markdown is kind of like the caching system in {knitr}, but taking it to the next level.

---

## Convergence diagnostics

<center>
<img src="./images/tar_read_convergence.png">
</center>

???

And the last section of the rendered report shows the results.
Convergence diagnostics looked good: only one simulation out of 1000 had any potential scale reduction factor above 1.01.

---

## Coverage is nominal.

<center>
<img src="./images/tar_read_coverage.png" height = "500">
</center>

???

And coverage looks nominal. On average, 50% of the 50% posterior intervals covered the truth, and 95% of the 95% posterior intervals covered the truth, which is evidence of good calibration. So we have a complete, well-documented story wrapped up in an R Markdown document backed by a powerful {targets} pipeline.

---

## Resources

Resource | Link
---|---
Slides | <https://wlandau.github.io/rmedicine2021-slides/>
Slide source | <https://github.com/wlandau/rmedicine2021-slides>
Pipeline report | <https://wlandau.github.io/rmedicine2021-pipeline/>
Pipeline source | <https://github.com/wlandau/rmedicine2021-pipeline>
`targets` | <https://docs.ropensci.org/targets/>
Target Markdown | <https://books.ropensci.org/targets/markdown.html>
`stantargets` | <https://docs.ropensci.org/stantargets/>
Stan | <https://mc-stan.org/>
`cmdstanr` | <https://mc-stan.org/cmdstanr/>
`posterior` | <https://mc-stan.org/posterior/>

???

These slides are publicly available, and so is the fully rendered version of the Target Markdown report I showed today. Through these links you can also find the source code of those materials and the various packages I mentioned.

---

## Thanks

* `stantargets`: Melina Vidoni served as editor, and Krzysztof Sakrejda and Matt Warkentin served as reviewers during the rOpenSci software review process.
* Target Markdown: Christophe Dervieux and Yihui Xie provided crucial advice during initial development.
* Richard Payne and Karen Price reviewed this Bayesian model validation project.

???

I would like to thank everyone who helped out with {targets} and its ecosystem, especially the folks listed here who helped make {stantargets} and Target Markdown possible. Thanks also to the R Medicine Program Committee for allowing me to speak, and thank you to everyone listening.

---

## References

.small[

* Alvarez, Ignacio, Jarad Niemi, and Matt Simpson. 2016. "Bayesian Inference for a Covariance Matrix." http://arxiv.org/abs/1408.4050.
* Bürkner, Paul-Christian, Jonah Gabry, Matthew Kay, and Aki Vehtari. 2021. "posterior: Tools for Working with Posterior Distributions." R package version 0.1.6. <https://mc-stan.org/posterior>.
* Carpenter, Bob. 2017. "Bayesian Posteriors are Calibrated by Definition." Statistical Modeling, Causal Inference, and Social Science. <https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/>.
* Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. "Stan: A Probabilistic Programming Language." Journal of Statistical Software 76 (1). [10.18637/jss.v076.i01](https://www.jstatsoft.org/article/view/v076i01).
* Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. "Validation of Software for Bayesian Models Using Posterior Quantiles." Journal of Computational and Graphical Statistics 15 (3): 675–92. http://www.jstor.org/stable/27594203.
* Gabry, Jonah, and Rok Češnovar. 2021. "cmdstanr: R Interface to 'CmdStan'." https://mc-stan.org/cmdstanr, https://discourse.mc-stan.org.
* Gelman, Andrew. 2006. "Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)." Bayesian Analysis 1 (3): 515–34. https://doi.org/10.1214/06-BA117A.
* Gelman, Andrew, John B. Carlin, Hal S. Stern, David B.
Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Edited by Francesca Dominici, Julian J. Faraway, Martin Tanner, and Jim Zidek. Third edition. CRC Press.
* Landau, William Michael. 2021a. "The Stantargets R Package: A Workflow Framework for Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines." Journal of Open Source Software 6 (60): 3193. https://doi.org/10.21105/joss.03193.
* ———. 2021b. "The Targets R Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing." Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.
* Schubert, Michael. 2019. "clustermq enables efficient parallelization of genomic analyses." Bioinformatics 35 (21): 4493–95. https://doi.org/10.1093/bioinformatics/btz284.
* Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. "Validating Bayesian Inference Algorithms with Simulation-Based Calibration." http://arxiv.org/abs/1804.06788.
* Xie, Yihui, J.J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. https://bookdown.org/yihui/rmarkdown.
* Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Chapman and Hall/CRC. ISBN 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.

]