class: center, middle, inverse, title-slide # Reproducible computation at scale in R ### Will Landau --- <style> .inverse { background-color: transparent; text-shadow: 0 0 0px transparent; } .title-slide { vertical-align: bottom !important; text-align: center !important; } .title-slide h1 { position: absolute; top: 0; left: 0; right: 0; width: 100%; line-height: 4em; color: #666666; } .title-slide h3 { line-height: 6em; color: #666666; } .title-slide { background-color: white; background-image: url('images/logo.png'); background-repeat: no-repeat; background-size: 25%; } .remark-slide-content:after { content: "Copyright Eli Lilly and Company"; position: absolute; bottom: -5px; left: 10px; height: 40px; width: 100%; font-family: Helvetica, Arial, sans-serif; font-size: 0.7em; color: gray; background-repeat: no-repeat; background-size: contain; } </style> ## Large statistical computation * [Bayesian data analysis](https://mc-stan.org/) * [Bayesian network meta-analysis](https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/bayesian-network-meta-analysis.html) * [Graph-based multiple comparison procedures](https://github.com/kornl/gMCP) * [Subgroup identification](https://cran.r-project.org/web/packages/TSDT/index.html) * [Predictive modeling](http://appliedpredictivemodeling.com/computing) * [Deep neural networks](https://keras.rstudio.com/) * [PK/PD modeling](https://github.com/nlmixrdevelopment/nlmixr) * Clinical trial simulation * Target identification --- ## Common features 1. Heavy use of the [R language](https://www.r-project.org/). 2. Long runtimes. 3. Multiple sub-tasks. 4. Frequent changes to code and data. <img src = "./images/sisyphus.svg" align="left" style="border: none; box-shadow: none; height: 375px; text-align: center;"> <br> * <https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette> --- ## Interconnected tasks <center> <img src = "./images/workflow.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## Changes <center> <img src = "./images/change.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## Consequences <center> <img src = "./images/downstream.png" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## Pipeline tools and workflow managers <center> <img src = "./images/infographic.svg" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> - Tons exist already: [github.com/pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline). - Most are language-agnostic or designed for Python or the shell. --- ## What distinguishes `drake`? <center> <img src = "./images/R.png" align="middle" style="border: none; box-shadow: none; text-align: center; height: 200px"> </center> .large[ * Respects the way R works. * Better code, i.e. functions. * Time savings allow for an incremental development strategy: 1. Change a couple things. 2. Run the workflow. 3. Inspect results. 4. *Repeat often*. ] --- ## Example drake workflow - Find the model that best predicts which customers will cancel their telecom subscriptions. - [IBM Watson Telco Customer Churn dataset](https://www.ibm.com/communities/analytics/watson-analytics-blog/predictive-insights-in-the-telco-customer-churn-data-set/). - Workflow principles generalize to the life sciences. <img src = "./images/combine.png" style="border: none; box-shadow: none; height: 200px"> <div style="font-size: 0.5em;"><a href="https://openclipart.org/detail/90739/newplus">https://openclipart.org/detail/90739/newplus</a>, <a href="https://github.com/rstudio/keras">https://github.com/rstudio/keras</a></div> ??? To dive into drake, we're going to use a machine learning example. We've got a deep neural net, and we're going to use it to predict "customer churn", which is another way of saying attrition, or dropout. --- background-image: ./images/not.png ## <img src="./images/no.png" width="40" height="40"> Say goodbye to numbered imperative scripts! ```r run_everything.R R/ ├── 01-data.R ├── 02-munge.R ├── 03-model.R ├── 04-results.R └── 05-plot.R data/ └── customer_churn.csv ``` --- ## <img src="./images/yes.png" width="60" height="40"> drake wants you to write **functions**. > - Everything that exists is an object. > - Everything that happens is a function call. > > John Chambers ```r add_things <- function(argument1, argument2) { argument1 + argument2 } add_things(1, 2) #> [1] 3 add_things(c(3, 4), c(5, 6)) #> [1] 8 10 ``` --- ## Functions in a drake workflow ```r split_data <- function(churn_file) { read_csv(churn_file, col_types = cols()) %>% initial_split(prop = 0.3) } prepare_recipe <- function(churn_data) { churn_data %>% training() %>% recipe(Churn ~ .) %>% step_rm(customerID) %>% step_naomit(all_outcomes(), all_predictors()) %>% step_discretize(tenure, options = list(cuts = 6)) %>% step_log(TotalCharges) %>% step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_center(all_predictors(), -all_outcomes()) %>% step_scale(all_predictors(), -all_outcomes()) %>% prep() } ``` --- ## Functions in a drake workflow ```r define_model <- function(churn_recipe, units1, units2, act1, act2, act3) { # ... } train_model <- function(churn_recipe, units1, units2, act1, act2, act3) { # ... } test_accuracy <- function(churn_data, churn_recipe, model) { # ... } test_model <- function(churn_data, churn_recipe, units1, units2, act1, act2, act3) { # ... } train_best_model <- function(best_run, churn_recipe) { # ... } ``` --- ## Typical project structure * There are *many* variations on this theme. ```r make.R # Top-level script R/ ├── packages.R # Calls to library(drake), library(keras), etc. *├── functions.R └── plan.R data/ └── customer_churn.csv .drake/ # drake's cache └── # Output automatically appears here. ``` --- ## Build up your workflow in a **drake plan**. * Usually goes in the `R/plan.R` script. ```r # R/plan.R plan <- drake_plan( churn_recipe = prepare_recipe(churn_data), churn_data = split_data(file_in("data/customer_churn.csv")) ) ``` --- ## The plan is a data frame of skippable *targets*. ```r plan #> # A tibble: 2 x 2 #> target command #> <chr> <expr> #> 1 churn_recipe prepare_recipe(churn_data) #> 2 churn_data split_data(file_in("data/customer_churn.csv")) ``` --- ## drake understands code and data dependencies. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r vis_drake_graph(plan) ``` <center> <img src="images/graph1.png", height = "450px" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## Build your first targets. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r make(plan) #> ▶ target churn_data #> ▶ target churn_recipe ``` --- ## Check the targets for problems - `loadd()` and `readd()` get targets from the cache. ```r ncol(training(readd(churn_data))) #> [1] 21 loadd(churn_recipe) ncol(juice(churn_recipe)) #> [1] 36 ``` --- ## Build up the plan *gradually*. .large[ 1. Add a couple targets. 2. Build targets with `make(plan)`. 3. Inspect targets with `loadd()` and `readd()`. 4. *Repeat often*. Not as time-consuming as you might think! ] --- ## Add some model runs. ```r # R/plan.R plan <- drake_plan( churn_recipe = prepare_recipe(churn_data), churn_data = split_data(file_in("data/customer_churn.csv")), * run_relu = test_model(act1 = "relu", churn_data, churn_recipe), * run_sigmoid = test_model(act1 = "sigmoid", churn_data, churn_recipe) ) ``` --- ## Previous work is still up to date. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r outdated(plan) #> [1] "run_relu" "run_sigmoid" ``` --- ## Previous work is still up to date. ```r vis_drake_graph(plan) ``` <center> <img src="images/graph2.png", height = "550px" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## drake skips up-to-date targets. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r make(plan) #> ▶ target run_relu #> ▶ target run_sigmoid ``` --- ## Inspect the newest targets. ```r readd(run_relu) #> # A tibble: 1 x 6 #> accuracy units1 units2 act1 act2 act3 #> <dbl> <dbl> <dbl> <chr> <chr> <chr> #> 1 0.803 16 16 relu relu sigmoid readd(run_sigmoid) #> # A tibble: 1 x 6 #> accuracy units1 units2 act1 act2 act3 #> <dbl> <dbl> <dbl> <chr> <chr> <chr> #> 1 0.799 16 16 sigmoid relu sigmoid ``` --- ## Find the best model ```r plan <- drake_plan( churn_recipe = prepare_recipe(churn_data), churn_data = split_data(file_in("data/customer_churn.csv")), run_relu = test_model(act1 = "relu", churn_data, churn_recipe), run_sigmoid = test_model(act1 = "sigmoid", churn_data, churn_recipe), * best_run = bind_rows(run_relu, run_sigmoid) %>% * top_n(1, accuracy) %>% * head(1), * best_model = target( * train_best_model(best_run, churn_recipe), * format = "keras" * ) ) ``` --- ## Find the best model ```r make(plan) #> ▶ target best_run #> ▶ target best_model ``` --- ## Find the best model ```r readd(best_model) #> Model #> ________________________________________________________________________________ #> Layer (type) Output Shape Param # #> ================================================================================ #> dense_6 (Dense) (None, 16) 576 #> ________________________________________________________________________________ #> dropout_4 (Dropout) (None, 16) 0 #> ________________________________________________________________________________ #> dense_7 (Dense) (None, 16) 272 #> ________________________________________________________________________________ #> dropout_5 (Dropout) (None, 16) 0 #> ________________________________________________________________________________ #> dense_8 (Dense) (None, 1) 17 #> ================================================================================ #> Total params: 865 #> Trainable params: 865 #> Non-trainable params: 0 #> ________________________________________________________________________________ ``` --- ## Try another model. ```r # R/plan.R plan <- drake_plan( churn_recipe = prepare_recipe(churn_data), churn_data = split_data(file_in("data/customer_churn.csv")), run_relu = test_model(act1 = "relu", churn_data, churn_recipe), run_sigmoid = test_model(act1 = "sigmoid", churn_data, churn_recipe), * run_softmax = test_model(act1 = "softmax", churn_data, churn_recipe), * best_run = bind_rows(run_relu, run_sigmoid, run_softmax) %>% top_n(1, accuracy) %>% head(1), best_model = target( train_best_model(best_run, churn_recipe), format = "keras" ) ) ``` --- ## What gets done stays done. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r outdated(plan) #> [1] "best_model" "best_run" "run_softmax" ``` --- ## What gets done stays done. ```r vis_drake_graph(plan) ``` <center> <img src="images/graph3.png", height = "550px" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## New best model? - Only if the new run beats the old runs. - Otherwise, `drake` does not bother to retrain the best model. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r make(plan) #> target run_softmax #> target best_run ``` --- ## What if we need to change a function? ```r define_model <- function(churn_recipe, units1, units2, act1, act2, act3) { input_shape <- ncol( juice(churn_recipe, all_predictors(), composition = "matrix") ) keras_model_sequential() %>% layer_dense( units = units1, kernel_initializer = "uniform", activation = act1, input_shape = input_shape ) %>% * layer_dropout(rate = 0.2) %>% # previously 0.1 layer_dense( units = units2, kernel_initializer = "uniform", activation = act2 ) %>% layer_dropout(rate = 0.1) %>% layer_dense( units = 1, kernel_initializer = "uniform", activation = act3 ) } ``` --- ## drake knows what to do. ```r source("R/functions.R") vis_drake_graph(plan) ``` <center> <img src="images/graph4.png", height = "550px" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## drake knows what to do. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r make(plan) #> ▶ target run_relu #> ▶ target run_softmax #> ▶ target run_sigmoid #> ▶ target best_run #> ▶ target best_model ``` --- ## Undo the change. ```r define_model <- function(churn_recipe, units1, units2, act1, act2, act3) { input_shape <- ncol( juice(churn_recipe, all_predictors(), composition = "matrix") ) keras_model_sequential() %>% layer_dense( units = units1, kernel_initializer = "uniform", activation = act1, input_shape = input_shape ) %>% * layer_dropout(rate = 0.1) %>% # Changed back to 0.1. layer_dense( units = units2, kernel_initializer = "uniform", activation = act2 ) %>% layer_dropout(rate = 0.1) %>% layer_dense( units = 1, kernel_initializer = "uniform", activation = act3 ) } ``` --- ## Undo the change. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` * Recover old targets instead of building new ones. ```r make(plan, recover = TRUE) #> ✓ recover run_relu #> ✓ recover run_softmax #> ✓ recover run_sigmoid #> ✓ recover best_run #> ✓ recover best_model ``` --- ## Similar story if the data file changes. ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") # Requires file_in() in the plan. ``` ```r make(plan) #> target churn_data #> target churn_recipe #> target run_relu #> target run_sigmoid #> target run_softmax #> target best_run #> target best_model ``` --- ## Evidence of reproducibility ```r source("R/packages.R") source("R/functions.R") source("R/plan.R") ``` ```r make(plan) #> ✓ All targets are already up to date. outdated(plan) #> character(0) ``` --- ## Evidence of reproducibility ```r vis_drake_graph(plan) ``` <center> <img src="images/graph5.png", height = "550px" align="middle" style="border: none; box-shadow: none; text-align: center;"> </center> --- ## History of past model runs ```r history <- drake_history() %>% select(target, act1, exists, hash) %>% filter(!is.na(act1)) %>% print() #> # A tibble: 6 x 4 #> target act1 exists hash #> <chr> <chr> <lgl> <chr> #> 1 run_relu relu TRUE 693f38d0cca6df48 #> 2 run_relu relu TRUE d18e61338781810b #> 3 run_sigmoid sigmoid TRUE bfcea2797b1845c8 #> 4 run_sigmoid sigmoid TRUE 2b28e18f2a2a88bc #> 5 run_softmax softmax TRUE 0fba209a1b82f8a1 #> 6 run_softmax softmax TRUE 907f0a76c2d5da5b ``` --- ## Recover an old model run. ```r drake_cache()$get_value(history$hash[1]) #> # A tibble: 1 x 6 #> accuracy units1 units2 act1 act2 act3 #> <dbl> <dbl> <dbl> <chr> <chr> <chr> #> 1 0.803 16 16 relu relu sigmoid ``` --- ## More highlights * [High-performance computing](https://books.ropensci.org/drake/hpc.html) * [Efficient data formats](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets) * [Dynamic branching](https://books.ropensci.org/drake/dynamic.html) * [Dynamic files](https://books.ropensci.org/drake/plans.html#dynamic-files) * [Memory management](https://books.ropensci.org/drake/memory.html) ```r # Run targets in parallel on a cluster: options(clustermq.scheduler = "slurm") make(plan, parallelism = "clustermq", jobs = 64) ``` --- ## Resources * Get [`drake`](https://github.com/ropensci/drake): ```r install.packages("drake") ``` * Example code from these slides: ```r drake::drake_example("customer-churn") ``` * Workshop materials: ```r remotes::install_github("wlandau/learndrake") ``` --- ## Links - Development repository: <https://github.com/ropensci/drake> - Full user manual <https://books.ropensci.org/drake/> - Reference website: <https://docs.ropensci.org/drake> - Hands-on workshop: <https://github.com/wlandau/learndrake> - Code examples: <https://github.com/wlandau/drake-examples> - Discuss at rOpenSci.org: <https://discuss.ropensci.org> ## rOpenSci use cases - Use [`drake`](https://github.com/ropensci/drake)? Share your use case at <https://ropensci.org/usecases>. <center> <img src = "./images/ropensci.png" style="border: none; box-shadow: none; height: 150px"> </center> --- ## Thanks <br> <br> <table style = "border: none"> <tr> <td style = "padding-right: 125px"> <ul style> <img src = "./images/edgar.jpg" style="border: none; box-shadow: none; height: 150px"> <li><a href = "https://github.com/edgararuiz">Edgar Ruiz</a></li> <li><a href = "https://github.com/sol-eng/tensorflow-w-r/blob/master/workflow/tensorflow-drake.Rmd">example code</a></li> </ul> </td> <td> <ul> <img src = "./images/matt.jpg" style="border: none; box-shadow: none; height: 150px"> <li><a href = "https://github.com/mdancho84">Matt Dancho</a></li> <li><a href = "https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/">blog post</a></li> </ul> </td> </tr> </table> --- ## Thanks <table style = "border: none"> <tr> <td> <br> <ul> <img src = "./images/ropensci.png" style="border: none; box-shadow: none; height: 150px"> <li><a href = "https://github.com/maelle">Maëlle Salmon</a></li> <li><a href = "https://github.com/benmarwick">Ben Marwick</a></li> <li><a href = "https://github.com/jules32">Julia Lowndes</a></li> <li><a href = "https://github.com/gothub">Peter Slaughter</a></li> <li><a href = "https://github.com/jennybc">Jenny Bryan</a></li> <li><a href = "https://github.com/richfitz">Rich FitzJohn</a></li> <li><a href = "https://github.com/stefaniebutland">Stefanie Butland</a></li> </ul> </td> <td> <ul> <li><a href = "https://github.com/jarad">Jarad Niemi</a></li> <li><a href = "https://github.com/krlmlr">Kirill Müller</a></li> <li><a href = "https://github.com/HenrikBengtsson">Henrik Bengtsson</a></li> <li><a href = "https://github.com/mschubert">Michael Schubert</a></li> <li><a href = "https://github.com/kendonB">Kendon Bell</a></li> <li><a href = "https://github.com/milesmcbain">Miles McBain</a></li> <li><a href = "https://github.com/pat-s">Patrick Schratz</a></li> <li><a href = "https://github.com/AlexAxthelm">Alex Axthelm</a></li> <li><a href = "https://github.com/dapperjapper">Jasper Clarkberg</a></li> <li><a href = "https://github.com/tiernanmartin">Tiernan Martin</a></li> <li><a href = "https://github.com/BListyg">Ben Listyg</a></li> <li><a href = "https://github.com/tjmahr">TJ Mahr</a></li> <li><a href = "https://github.com/bpbond">Ben Bond-Lamberty</a></li> <li><a href = "https://github.com/tmastny">Tim Mastny</a></li> <li><a href = "https://github.com/billdenney">Bill Denney</a></li> <li><a href = "https://github.com/aedobbyn">Amanda Dobbyn</a></li> <li><a href = "https://github.com/dfalster">Daniel Falster</a></li> <li><a href = "https://github.com/rkrug">Rainer Krug</a></li> <li><a href = "https://github.com/bmchorse">Brianna McHorse</a></li> <li><a href = "https://github.com/mrchypark">Chan-Yub Park</a></li> </ul> </td> </tr> </table> --- ## The workshop 1. Sign up for a free account at <https://rstudio.cloud>. 2. Log into <https://rstudio.cloud/project/627076>. 3. Work through the R notebooks in order. Topic | Notebook ---|--- Custom functions | [`1-functions/1-functions.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/1-functions/1-functions.Rmd) `drake` plans | [`2-plans/2-plans.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/2-plans/2-plans.Rmd) Changing workflows | [`3-changes/3-changes.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/3-changes/3-changes.Rmd) Static branching | [`4-static/4-static.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/4-static/4-static.Rmd) Dynamic branching | [`5-dynamic/5-dynamic.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/5-dynamic/5-dynamic.Rmd) Files and R Markdown | [`6-files/6-files.Rmd`](https://github.com/wlandau/learndrake/blob/master/inst/notebooks/6-files/6-files.Rmd)