Copyright Eli Lilly and Company

Reproducible computation at scale in R

drake

Will Landau

Goals


Data analysis has interconnected steps.

Each update can invalidate other work.

Do you hunt for all the changes yourself?

  • Messy and prone to human error.
  • Not reproducible.

https://openclipart.org/detail/216179/messy-desk

Do you rerun everything from scratch?

  • Takes too long.
  • Too frustrating.

https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette

Pipeline toolkits for large computation

Small example analysis project: GDP

  • Which quantity is a better predictor of GDP per capita, life expectancy or population?
    • Gapminder data: observations on multiple countries from 1952 to 2007.
    • Bayesian regression with rstanarm: straightforward inference and interpretation.
  • But Bayesian methods can be computationally expensive!

Let’s use the drake R package.

The drake plan: steps of the workflow.

The plan is just a data frame.
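
A minimal sketch of what such a plan could look like for the GDP example. The target names, the helper `fit_r2()`, and the use of `lm()` in place of the rstanarm fit are assumptions made to keep the example short and fast; they are not taken from the original slides.

```r
library(drake)

# Hypothetical helper (not from the slides): lm() is a fast stand-in
# for the Bayesian regression fitted with rstanarm.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

# Each argument names one target (left) and the R command that builds it (right).
plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)

print(plan)          # one row per target
is.data.frame(plan)  # the plan is an ordinary data frame (a tibble)
```

Nothing is computed at this point: `drake_plan()` only records the commands.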

Support for the plan

Run your workflow.
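
As a sketch, running the workflow comes down to one call to `make()`. The plan below is the same hypothetical stand-in as elsewhere in these notes (`fit_r2()` and the target names are assumptions), and it needs the gapminder package to be installed.

```r
library(drake)

# Hypothetical helper: lm() stands in for the rstanarm model.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)

# Build every target in dependency order and store the results
# in a hidden .drake/ cache in the working directory.
make(plan)
```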

Output file report.html

Get targets from the cache.
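
A short sketch of the two usual ways to retrieve a finished target, `readd()` and `loadd()`. The plan, helper, and target names are hypothetical stand-ins for the GDP example.

```r
library(drake)

# Hypothetical helper: lm() stands in for the rstanarm model.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)
make(plan)

readd(r2_life_exp)  # return the cached value of one target
loadd(r2_pop)       # assign the cached value to a variable named r2_pop
r2_pop
```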

Find things to improve.

Go back and change a function.

Which targets need an update?
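
The sketch below edits a function after a first run and then asks drake what is stale. The helper `fit_r2()`, the change made to it, and the target names are assumptions for illustration. The `config` argument is passed by name here; newer drake releases also accept the plan directly.

```r
library(drake)

# Hypothetical helper: lm() stands in for the rstanarm model.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)
make(plan)

# Go back and change the function: report adjusted R-squared instead.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$adj.r.squared
}

# Targets whose commands call fit_r2() are now stale; gdp_data is not.
config <- drake_config(plan)
outdated(config = config)
```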

vis_drake_graph()
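
A minimal sketch of drawing the dependency graph. It assumes the visNetwork package is installed, and the plan and helper are hypothetical stand-ins for the GDP example.

```r
library(drake)

# Hypothetical helper: lm() stands in for the rstanarm model.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)

config <- drake_config(plan)

# Interactive graph of functions and targets; outdated targets are
# drawn in a different colour from up-to-date ones.
graph <- vis_drake_graph(config = config)
graph
```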

Run only the parts that need to change.

Life expectancy has a stronger association with GDP per capita than population does.

Reproducibility is about confidence and trust.


  • Tangible evidence that your results match the underlying code and data:
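
One way to produce that evidence, sketched with a hypothetical plan: run `make()` a second time without changing anything and confirm that drake finds nothing to rebuild. The helper and target names are assumptions.

```r
library(drake)

# Hypothetical helper: lm() stands in for the rstanarm model.
fit_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

plan <- drake_plan(
  gdp_data = gapminder::gapminder,
  r2_life_exp = fit_r2(log(gdpPercap) ~ lifeExp, gdp_data),
  r2_pop = fit_r2(log(gdpPercap) ~ log(pop), gdp_data)
)

make(plan)  # first run: builds whatever is missing or stale
make(plan)  # second run: code and data are unchanged, so nothing is rebuilt

config <- drake_config(plan)
stale <- outdated(config = config)
length(stale)  # zero stale targets means results match code and data
```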

Scale up to many targets.

  • Larger example: Produc dataset.
  • Predict gross state product using all possible 3-covariate models.
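
To make "all possible 3-covariate models" concrete, the base R sketch below enumerates the combinations. The covariate names are read off the formulas in the plan printed in this deck; the object names (`predictors`, `combos`, `formulas`, `targets`) are illustrative only.

```r
# Covariates that appear in the model formulas of the printed plan.
predictors <- c("state", "year", "pcap", "hwy", "water",
                "util", "pc", "emp", "unemp")

# Every unordered choice of 3 covariates: a 3-row matrix, one column per model.
combos <- combn(predictors, 3)

# One formula string and one target name per combination.
formulas <- apply(combos, 2, function(x) {
  paste("gsp ~", paste(x, collapse = " + "))
})
targets <- apply(combos, 2, function(x) {
  paste0("model_", paste(x, collapse = "_"))
})

length(formulas)  # choose(9, 3) = 84 models
head(formulas, 3)
```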

Scale up to many targets.

  • Experimental interface coming to drake >= 7.0.0.
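
drake 7.0.0 introduced a transformation syntax inside `drake_plan()`; whether it matches what this slide originally showed is an assumption. The sketch below uses `map()` to expand one templated command into several targets. `fit_gsp_model()` comes from the printed plan in this deck, while `Ecdat::Produc` is a guess at the truncated data argument. The plan is only generated here, not run.

```r
library(drake)  # requires drake >= 7.0.0

plan <- drake_plan(
  model = target(
    # `covariate` is a placeholder that map() replaces in each copy.
    fit_gsp_model(gsp ~ state + year + covariate, data = Ecdat::Produc),
    transform = map(covariate = c(pcap, hwy, water))
  )
)

plan$target  # one target per value of `covariate`
```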

Scale up to many targets.

## # A tibble: 170 x 2
##    target              command                                             
##    <chr>               <chr>                                               
##  1 model_state_year_p… fit_gsp_model(gsp ~ state + year + pcap, data = Ecd…
##  2 model_state_year_h… fit_gsp_model(gsp ~ state + year + hwy, data = Ecda…
##  3 model_state_year_w… fit_gsp_model(gsp ~ state + year + water, data = Ec…
##  4 model_state_year_u… fit_gsp_model(gsp ~ state + year + util, data = Ecd…
##  5 model_state_year_pc fit_gsp_model(gsp ~ state + year + pc, data = Ecdat…
##  6 model_state_year_e… fit_gsp_model(gsp ~ state + year + emp, data = Ecda…
##  7 model_state_year_u… fit_gsp_model(gsp ~ state + year + unemp, data = Ec…
##  8 model_state_pcap_h… fit_gsp_model(gsp ~ state + pcap + hwy, data = Ecda…
##  9 model_state_pcap_w… fit_gsp_model(gsp ~ state + pcap + water, data = Ec…
## 10 model_state_pcap_u… fit_gsp_model(gsp ~ state + pcap + util, data = Ecd…
## # … with 160 more rows

Persistent parallel workers
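
Persistent workers are launched once and stay alive for the whole run, picking up targets as their dependencies finish. A minimal local sketch, assuming the clustermq package (and ZeroMQ) is installed and a Unix-like system for the "multicore" scheduler; `slow_square()` and the plan are hypothetical.

```r
library(drake)

# Run clustermq workers as local processes. On a cluster this option
# would name the scheduler instead, for example "slurm".
options(clustermq.scheduler = "multicore")

# Hypothetical slow step.
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

plan <- drake_plan(
  a = slow_square(1),
  b = slow_square(2),
  total = a + b
)

# Two persistent workers: a and b can be built at the same time.
make(plan, parallelism = "clustermq", jobs = 2)
```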

Transient parallel workers
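
Transient workers are the other model: each target gets its own short-lived worker, which exits when the target is done. A minimal local sketch, assuming the future package is installed; `slow_square()` and the plan are hypothetical.

```r
library(drake)

# Tell future to run each job in a separate background R session.
future::plan(future::multisession)

# Hypothetical slow step.
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

plan <- drake_plan(
  a = slow_square(1),
  b = slow_square(2),
  total = a + b
)

# Up to two transient workers at a time, one per target.
make(plan, parallelism = "future", jobs = 2)
```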

Distributed computing: transient workers


Get drake and get help.

Thanks

References