class: center, middle, inverse, title-slide # Reproducible computation at scale in R ### Will Landau --- <style> .inverse { background-color: transparent; text-shadow: 0 0 0px transparent; } .title-slide { vertical-align: bottom !important; text-align: center !important; } .title-slide h1 { position: absolute; top: 0; left: 0; right: 0; width: 100%; line-height: 4em; color: #666666; } .title-slide h3 { line-height: 6em; color: #666666; } .title-slide { background-color: white; background-image: url('images/logo.png'); background-repeat: no-repeat; background-size: 25%; } .remark-slide-content:after { content: "Copyright Eli Lilly and Company"; position: absolute; bottom: -5px; left: 10px; height: 40px; width: 100%; font-family: Helvetica, Arial, sans-serif; font-size: 0.7em; color: gray; background-repeat: no-repeat; background-size: contain; } </style> ## Large statistical computation * [Bayesian data analysis](https://mc-stan.org/) * [Bayesian network meta-analysis](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/bayesian-network-meta-analysis.html) * [Graph-based multiple comparison procedures](https://github.com/kornl/gMCP) * [Subgroup identification](https://cran.r-project.org/web/packages/TSDT/index.html) * [Predictive modeling](http://appliedpredictivemodeling.com/computing) * [Deep neural networks](https://keras.rstudio.com/) * [PK/PD modeling](https://github.com/nlmixrdevelopment/nlmixr) * Clinical trial simulation * Target identification ??? Thank you all for coming, and thank you to R/Pharma for the opportunity to speak today. I come from the life sciences, and we develop ambitious computational workflows for Statistics and data science. There's a lot of Bayesian analysis, machine learning, simulation, and prediction. Other domains have similar workloads, and we need to think about both efficiency and reproducibility right from the start. --- ## Repetition: the overlooked bane of long computation <br> ![](./images/reality.png) --- ## Interconnected tasks <center> <img src = "./images/workflow.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> ??? A large workflow has a large number of moving parts. We have datasets that we preprocess or simulate, analyses of those datasets, and summaries of the analyses. --- ## Changes <center> <img src = "./images/change.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> ??? If you change any one of these parts - whether it's a bugfix, a tweak to a model, or some new data - --- ## Consequences <center> <img src = "./images/downstream.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> ??? Then everything that depends on it is no longer valid, and you need to rerun the computation to bring the results back up to date. This is seriously frustrating when you're in development and you're still making a constant stream of changes to code and data in real time. If every change means you need to rerun the project, there's no way the results can keep up... --- ## Dilemma: short runtimes or reproducible results? ![](./images/decisions.png) --- ## Let a pipeline tool figure out what to rerun. ![](./images/pipeline_graph.png) * Save time while ensuring computational reproducibility. * Automatic parallel/distributed computing based on the directed acyclic graph. 
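* For example, a sketch with the {targets} package (introduced next), assuming its optional clustermq parallel backend is installed and configured:

```r
# Hypothetical: build the outdated targets with up to 2 persistent workers,
# scheduled in the order implied by the dependency graph.
tar_make_clustermq(workers = 2)
```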
---

## Pipeline tools

<center>
<img src="./images/infographic.png" height = "125px">
</center>

* Existing pipeline tools: https://github.com/pditommaso/awesome-pipeline
* Most are language-agnostic or designed for Python or the shell.

## {targets}

* Fundamentally designed for R.
* Supports a clean, modular, function-oriented programming style.
* Abstracts files as R objects and automatically manages data.
* Surpasses the permanent limitations of its predecessor, [`drake`](https://github.com/ropensci/drake): <https://books.ropensci.org/targets/drake.html>.
* Continuation of the ideas from `remake` by Rich FitzJohn: <https://github.com/richfitz/remake>.

---

## Example targets workflow

* Find the model that best predicts which customers will cancel their telecom subscriptions.
* IBM Cognos Analytics Telco customer churn dataset (links [here](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113) and [here](https://github.com/IBM/telco-customer-churn-on-icp4d/blob/main/data/Telco-Customer-Churn.csv)).
* Example from [Matt Dancho's 2018 RStudio AI Blog post](https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/).
* Workflow principles generalize to other fields, such as the life sciences.

???

Let's look at an example. We're going to set up a machine learning project to predict customer dropout, or churn, based on the publicly available IBM Watson Telco customer churn dataset. We're going to train a bunch of deep neural nets on the data and pick the one with the best testing accuracy.

---

background-image: url(./images/not.png)

## <img src="./images/no.png" width="40" height="40"> Move away from numbered imperative scripts.

```r
run_everything.R
R/
├── 01-data.R
├── 02-munge.R
├── 03-model.R
├── 04-results.R
└── 05-plot.R
data/
└── customer_churn.csv
```

???

Before we get started, let's talk about the implementation strategy. We're going to move away from numbered scripts and R Markdown as a way to manage the computation end to end. It's an okay strategy for small projects, but it falls apart quickly as a project grows.

---

## <img src="./images/yes.png" width="60" height="40"> Embrace **functions**.

> - Everything that exists is an object.
> - Everything that happens is a function call.
>
> John Chambers

```r
add_things <- function(argument1, argument2) {
  argument1 + argument2
}

add_things(1, 2)
#> [1] 3

add_things(c(3, 4), c(5, 6))
#> [1] 8 10
```

???

Functions scale much better for big stuff. A function is just a reusable set of instructions with multiple inputs and a single return value. Usually those inputs are explicitly defined and easy to create, and usually the function has an informative name. Functions are a fundamental built-in feature of almost every programming language we have, and they are particularly well suited to R, which was designed with formal functional programming principles in mind.

The most obvious use for functions is as a way to avoid copies of repeated code scattered throughout your project. So instead of copying and pasting the same code block everywhere, you just call the function. But functions are not just for code you want to reuse; they're for code you want to understand. Functions are custom shorthand. They make your work easier to read, understand, break down into manageable pieces, document, test, and validate for serious research.
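---

## From numbered script to function: a sketch

A hypothetical before-and-after, with made-up object and function names, just to make the idea concrete (assumes the tidyverse is loaded):

```r
# Before: lines buried in a numbered script, mutating global state.
# churn <- read_csv("data/customer_churn.csv")
# churn <- na.omit(churn)

# After: the same work as a named, reusable, testable unit.
load_clean_data <- function(file) {
  read_csv(file, col_types = cols()) %>%
    na.omit()
}
```

???

The function version has an informative name, explicit inputs, and a single return value, which is exactly the shape the targets package expects later on.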
---

## Functions for customer churn

```r
split_data <- function(churn_file) {
  read_csv(churn_file, col_types = cols()) %>%
    initial_split(prop = 0.7)
}

prepare_recipe <- function(churn_data) {
  churn_data %>%
    training() %>%
    recipe(Churn ~ .) %>%
    step_rm(customerID) %>%
    step_naomit(all_outcomes(), all_predictors()) %>%
    step_discretize(tenure, options = list(cuts = 6)) %>%
    step_log(TotalCharges) %>%
    step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>%
    step_center(all_predictors(), -all_outcomes()) %>%
    step_scale(all_predictors(), -all_outcomes()) %>%
    prep()
}
```

???

Most of your functions revolve around three kinds of tasks: preparing datasets, analyzing datasets, and summarizing analyses. These two functions are all about the data. The first one accepts a file path, reads in the data, and splits it into training and test sets. The prepare_recipe() function accepts the return value from split_data() and converts the data into a format that our Keras models can accept in subsequent steps.

---

## Functions for customer churn

```r
define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

train_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

test_accuracy <- function(churn_data, churn_recipe, churn_model) {
  # ...
}

test_model <- function(churn_data, churn_recipe, units1, units2, act1, act2, act3) {
  # ...
}

retrain_run <- function(churn_run, churn_recipe) {
  # ...
}
```

???

Similarly, you write functions to define, train, and test the Keras models.

---

## Typical project structure

* There are many variations on this theme.

```r
_targets.R # Required top-level configuration file.
R/
*└── functions.R
data/
└── customer_churn.csv
```

???

You can organize your functions however you want, but it's common practice to put them in scripts inside an R/ folder. Similarly, input datasets can go anywhere, but a data/ folder just helps keep things clean.

And at this point, you have a good clean function-oriented project. Even if you decide not to use targets, this function-oriented style still has a lot of value. And if you do want to use targets, converting to functions is most of the work: once that's done, you're almost there. All you need now is to outline the specific steps of the computation in a formal pipeline. And that's where this _targets.R script comes into play. It always lives at the root of the project, and it's always called "_targets.R".

---

## Build up your workflow as a list of targets.

```r
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data))
)
```

???

The purpose of _targets.R is to set up the project at a high level. It loads the packages required to define the pipeline, it loads your custom functions and global objects, it sets high-level options such as the packages the targets are going to need, and it defines the pipeline at the very end.

At the bottom of _targets.R, you list out objects called targets. Each target is an individual step in the workflow. It has an informative name like "churn_data" or "churn_recipe", and it has an R command that invokes your custom functions and returns a value.

---

## The pipeline is a collection of skippable *targets*.
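Each target in the pipeline pairs a name with the R command that produces its value. For example, straight from the pipeline above:

```r
# A target is a name plus an R command.
tar_target(
  churn_recipe,               # name: how downstream targets refer to the value
  prepare_recipe(churn_data)  # command: mentions the upstream target churn_data
)
```

`tar_manifest()` lists the targets you declared: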
```r
tar_manifest(fields = c("name", "command"))
#> # A tibble: 3 × 2
#>   name         command
#>   <chr>        <chr>
#> 1 churn_file   "\"data/customer_churn.csv\""
#> 2 churn_data   "split_data(churn_file)"
#> 3 churn_recipe "prepare_recipe(churn_data)"
```

???

There are several utility functions that inspect the pipeline for correctness. The tar_manifest() function shows you all the names of the targets and the R commands associated with them.

---

## targets understands code and data dependencies.

```r
tar_visnetwork()
```

<center>
<img src="images/graph1.png" height = "400px" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

It's always good practice to visualize the dependency graph of the pipeline. targets has functions to do this for you, and it really demystifies how the package works. So here you see the flow of the project from left to right. We reproducibly track an input data file, we load that data to split into training and test, and we prepare that data for the models using a Tidymodels recipe.

But how does targets deduce this flow? How does it know that churn_recipe depends on churn_data? The order you write targets in the pipeline does not matter. targets knows that churn_recipe depends on churn_data because the symbol "churn_data" is mentioned in the command for "churn_recipe" in the pipeline. targets scans your commands and functions without actually running them in order to detect changes and understand dependency relationships. This is called static code analysis.

---

## Build your first targets.

```r
tar_make()
```

```
#> • start target churn_file
#> • built target churn_file
#> • start target churn_data
#> • built target churn_data
#> • start target churn_recipe
#> • built target churn_recipe
#> • end pipeline
```

???

To actually run the workflow, we use a function called tar_make(). tar_make() creates a fresh reproducible R process, runs _targets.R to populate the new session and define the pipeline, resolves the dependency graph, runs the correct targets in the correct order, and writes the return values to storage.

---

## Check the targets for problems

- `tar_load()` and `tar_read()` get targets from the `_targets/` data store.

```r
ncol(training(tar_read(churn_data)))
#> [1] 21

tar_load(churn_recipe)
ncol(juice(churn_recipe))
#> [1] 36
```

???

Afterwards, all the targets are in storage. There's a special key-value store in a hidden _targets/ folder, and targets has functions tar_load() and tar_read() to retrieve data from the store. targets abstracts artifacts as ordinary objects: you don't need to worry about where the files are located; you just need to know the target names. This is the exploratory analysis phase. Always inspect your targets for issues between calls to tar_make().

---

## Build up the pipeline *gradually*.

.large[

1. Add a couple targets.
2. Run the pipeline with `tar_make()`.
3. Inspect the new targets with `tar_load()` and `tar_read()`.
4. Repeat often. Not very time-consuming because `tar_make()` skips up-to-date targets.

]

???

This is part of an iterative process for building up a pipeline. We add a couple targets, run the pipeline, inspect the new targets, and repeat. We do this early and often. And because targets skips up-to-date work, you can iterate frequently and still save time.

---

## Add some models.
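The elided model functions from earlier drive this step. As a rough illustrative sketch (not the tutorial's exact code), `test_model()` might train a model with the given hyperparameters and report its test accuracy next to them:

```r
# Hypothetical sketch of test_model(): the real body is elided in the
# slides above and lives in the tutorial repository.
test_model <- function(churn_data, churn_recipe, units1 = 16, units2 = 16,
                       act1 = "relu", act2 = "relu", act3 = "sigmoid") {
  churn_model <- train_model(churn_recipe, units1, units2, act1, act2, act3)
  accuracy <- test_accuracy(churn_data, churn_recipe, churn_model)
  tibble(accuracy, units1, units2, act1, act2, act3)
}
```

With a function like that, each model run is one new target: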
```r
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
* tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
* tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe))
)
```

???

With the data in place, we are now ready to add some models. We start with two models fit to the same data using different hyperparameters.

---

## Previous work is still up to date.

```r
tar_outdated()
```

```
#> [1] "run_relu"    "run_sigmoid"
```

???

When we inspect the pipeline, we see that some work needs to be done. But all our data processing steps are already up to date.

---

## Previous work is still up to date.

```r
tar_visnetwork()
```

<center>
<img src="images/graph2.png" height = "400px" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

We can see this even more clearly in the dependency graph.

---

## Up-to-date targets are skipped.

```r
tar_make()
```

```
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> • start target run_relu
#> • built target run_relu
#> • start target run_sigmoid
#> • built target run_sigmoid
#> • end pipeline
```

???

So when we run the pipeline again, only the models run. The tool skips the data targets because they are already up to date.

---

## Inspect the newest targets.

```r
tar_read(run_relu)
#> # A tibble: 1 × 6
#>   accuracy units1 units2 act1  act2  act3
#>      <dbl>  <dbl>  <dbl> <chr> <chr> <chr>
#> 1    0.800     16     16 relu  relu  sigmoid

tar_read(run_sigmoid)
#> # A tibble: 1 × 6
#>   accuracy units1 units2 act1    act2  act3
#>      <dbl>  <dbl>  <dbl> <chr>   <chr> <chr>
#> 1    0.803     16     16 sigmoid relu  sigmoid
```

???

And as usual, we read our newest targets and verify there are no obvious issues.

---

## Find the best model

```r
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
* tar_target(
*   best_run,
*   bind_rows(run_relu, run_sigmoid) %>%
*     top_n(1, accuracy) %>%
*     head(1)
* ),
* tar_target(
*   best_model,
*   retrain_run(best_run, churn_recipe),
*   format = "keras"
* )
)
```

???

And we repeat the process to gradually build up the pipeline. Here, we add new targets to find the model run with the highest accuracy so far and retrain that model to return a fitted model object. For fitted Keras models, we need to write 'format = "keras"' in tar_target() because Keras models cannot be saved with R's ordinary serialization functionality.

---

## Find the best model

```r
tar_make()
```

```
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> • start target best_run
#> • built target best_run
#> • start target best_model
#> • built target best_model
#> • end pipeline
```

???
As before, only the new targets run because we didn't change any other code or data,

---

## Find the best model

```r
tar_read(best_model)
#> Model
#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #
#> ================================================================================
#> dense_8 (Dense)                     (None, 16)                      576
#> ________________________________________________________________________________
#> dropout_5 (Dropout)                 (None, 16)                      0
#> ________________________________________________________________________________
#> dense_7 (Dense)                     (None, 16)                      272
#> ________________________________________________________________________________
#> dropout_4 (Dropout)                 (None, 16)                      0
#> ________________________________________________________________________________
#> dense_6 (Dense)                     (None, 1)                       17
#> ================================================================================
#> Total params: 865
#> Trainable params: 865
#> Non-trainable params: 0
#> ________________________________________________________________________________
```

???

and now one of our targets is actually a trained Keras model.

---

## Try another model.

```r
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  ...,
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
* tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
*   bind_rows(run_relu, run_sigmoid, run_softmax) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  ...
)
```

???

If we try another model, we just need to add another target and reference that new target in any downstream targets that need it.

---

## What gets done stays done.

```r
tar_outdated()
```

```
#> [1] "run_softmax" "best_run"    "best_model"
```

???

Not only are the new model and the best_run target outdated; the tool automatically calls everything downstream into question, so best_model is suspect as well.

---

## What gets done stays done.

```r
tar_visnetwork()
```

<center>
<img src="images/graph3.png" height = "400px" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

Again, we see this clearly in the dependency graph. run_softmax is new, best_run is outdated because it uses run_softmax, and best_model is suspect because it depends on best_run.

---

## New best model?

- Only if the new run beats the old runs, which would invalidate target `best_run`.
- Otherwise, `targets` does not bother to retrain the best model.

```r
tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ✓ skip target best_model
```

???

When we run the pipeline again, of course run_softmax and best_run both execute. But if the value of best_run doesn't change, we skip best_model. targets looks at the actual fingerprint of the data to make decisions, unlike GNU Make, which just uses timestamps.

---

## What if we need to change a function?
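Suppose we tweak the architecture of the neural network. Below, the dropout rate in the first layer increases from 0.1 to 0.2 (highlighted line):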
```r
define_model <- function(churn_recipe, units1, units2, act1, act2, act3) {
  input_shape <- ncol(
    juice(churn_recipe, all_predictors(), composition = "matrix")
  )
  keras_model_sequential() %>%
    layer_dense(
      units = units1,
      kernel_initializer = "uniform",
      activation = act1,
      input_shape = input_shape
    ) %>%
*   layer_dropout(rate = 0.2) %>% # previously 0.1
    layer_dense(
      units = units2,
      kernel_initializer = "uniform",
      activation = act2
    ) %>%
    layer_dropout(rate = 0.1) %>%
    ...
```

???

If we change a function in a nontrivial way, targets notices. Let's increase the dropout rate in one of the layers of the neural net.

---

## Show invalidated functions and targets.

```r
tar_visnetwork()
```

<center>
<img src="images/graph4.png" height = "400px" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

The define_model() function is no longer up to date. That means neither is train_model(), because it calls define_model(), and likewise for the functions downstream of those. So the targets downstream, which include all our model runs, are also invalidated.

---

## Rerun invalidated targets.

```r
tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model
```

???

But we can bring everything back up to date without rerunning the data processing.

---

## Similar story if the data file changes.

```r
tar_make()
#> ● run target churn_file
#> ● run target churn_data
#> ● run target churn_recipe
#> ● run target run_relu
#> ● run target run_sigmoid
#> ● run target run_softmax
#> ● run target best_run
#> ● run target best_model
```

???

But the data processing does rerun if we change our data file. This is because in our upstream target churn_file, we set 'format = "file"' in the tar_target() function. That tells targets to treat the return value of the command as a vector of file and directory paths, and it watches that data for changes.

---

## Evidence of reproducibility

```r
tar_make()
#> ✓ skip target churn_file
#> ✓ skip target churn_data
#> ✓ skip target churn_recipe
#> ✓ skip target run_relu
#> ✓ skip target run_sigmoid
#> ✓ skip target run_softmax
#> ✓ skip target best_run
#> ✓ skip target best_model
#> ✓ Already up to date.
```

???

At the end of the day, targets can tell you if all your targets are up to date. This is tangible evidence that your output matches the code and data it's supposed to come from. It's evidence that someone else running the same code would get the same results. That's reproducibility. It's certainly not the only form of reproducibility, but it does increase the trust we can place in the conclusions of the project.

---

## Evidence of reproducibility

```r
tar_outdated()
#> character(0)

tar_visnetwork()
```

<center>
<img src="images/graph5.png" height = "350px" align="middle" style="border: none; box-shadow: none; text-align: center;">
</center>

???

And again, utility functions can confirm the status of targets without needing to run tar_make().
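---

## The full pipeline, assembled

Putting the snippets above together, the final `_targets.R` looks roughly like this (assembled from the slides; the tutorial repository has the canonical version):

```r
# _targets.R
library(targets)
source("R/functions.R")
tar_option_set(packages = c("keras", "tidyverse", "rsample", "recipes", "yardstick"))
list(
  tar_target(churn_file, "data/customer_churn.csv", format = "file"),
  tar_target(churn_data, split_data(churn_file)),
  tar_target(churn_recipe, prepare_recipe(churn_data)),
  tar_target(run_relu, test_model(act1 = "relu", churn_data, churn_recipe)),
  tar_target(run_sigmoid, test_model(act1 = "sigmoid", churn_data, churn_recipe)),
  tar_target(run_softmax, test_model(act1 = "softmax", churn_data, churn_recipe)),
  tar_target(
    best_run,
    bind_rows(run_relu, run_sigmoid, run_softmax) %>%
      top_n(1, accuracy) %>%
      head(1)
  ),
  tar_target(best_model, retrain_run(best_run, churn_recipe), format = "keras")
)
```

???

This slide just collects the pipeline pieces from the previous slides into one file.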
---

## Resources

* Get [`targets`](https://github.com/ropensci/targets):

```r
install.packages("remotes")
remotes::install_github("ropensci/targets")
```

* Code: <https://github.com/wlandau/targets-keras>
* RStudio Cloud workspace: <https://rstudio.cloud/project/1430828/>
* These slides: <https://wlandau.github.io/targets-tutorial>
* Tutorial materials: <https://github.com/wlandau/targets-tutorial>
* Development repository: <https://github.com/ropensci/targets>
* Full user manual: <https://books.ropensci.org/targets/>
* Reference website: <https://docs.ropensci.org/targets/>

???

There are several resources to learn about targets. There's a reference website, an online user manual, and a repository with the example code from today.

targets is nearing the end of its beta phase. I am about to submit it to rOpenSci for peer review, and I hope to have it on CRAN by the end of this year or early next year. Now is a great time for feedback because there is no formal release yet and the interface is not yet set in stone.

---

## The tutorial

1. Sign up for a free account at <https://rstudio.cloud>.
2. Log into <https://rstudio.cloud/project/1699460>.
3. Work through the R notebooks in order.
4. Optional: revisit the material again later at <https://github.com/wlandau/targets-tutorial>.

Topic | Notebook
---|---
Functions | [`1-functions.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/1-functions.Rmd)
Pipelines | [`2-pipelines.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/2-pipelines.Rmd)
Changes | [`3-changes.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/3-changes.Rmd)
Debugging | [`4-debugging.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/4-debugging.Rmd)
Files | [`5-files.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/5-files.Rmd)
Branching | [`6-branching.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/6-branching.Rmd)
Challenge | [`7-challenge.Rmd`](https://github.com/wlandau/targets-tutorial/blob/main/7-challenge.Rmd)

???

The workshop is publicly available and deployed to RStudio Cloud. Just sign up for a cloud account and log into a free instance of RStudio Server. We will spend our remaining time working through interactive exercises in the R notebooks here, with some breaks for Q&A and live coding demos to go over the solutions. If you missed part of the exercises or just want to go back and study in your own time, the workshop will still be available. I included links to both the source and the cloud project here.

---

## References

* Dancho, Matt. "Deep Learning With Keras To Predict Customer Churn." RStudio AI Blog, 2018-01-11. <https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/>
* D'Angelo, Scott, Sandhya Nayak, and Steve Martinelli. "Predict Customer Churn using Watson Machine Learning and Jupyter Notebooks on Cloud Pak for Data." Released to GitHub by IBM under the Apache Software License version 2.0. <https://github.com/IBM/telco-customer-churn-on-icp4d>. Last accessed 2020-08-29.
* IBM Cognos Analytics. "Telco customer churn (11.1.3+)." IBM Community, 2019-07-11. <https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113>