A new R package ecosystem for democratized reproducible pipelines at scale
The targets R package is a Make-like pipeline toolkit for reproducible data science. It tackles copious workflows and demanding runtimes to accelerate research papers, simulation studies, and other computationally intense projects in fields such as Bayesian statistics and machine learning. Relative to its predecessor, drake, targets is not only more efficient, but also more extensible. The modular interface and object-oriented design allow package developers to write reusable target factories.[1] If you want to help other data scientists create a certain specialized kind of pipeline, you can write a function that creates a list of target objects.
# yourExamplePackage/R/example_target_factory.R
target_factory <- function(data) {
  list(
    # Track the input data file so that changes to it invalidate downstream targets.
    tar_target_raw("file", data, format = "file"),
    # tar_target_raw() takes each command as an already-quoted expression.
    tar_target_raw("simple_model", quote(run_simple(file))),
    tar_target_raw("flexible_model", quote(run_flexible(file))),
    tar_target_raw("conclusions", quote(summarize(simple_model, flexible_model)))
  )
}
Then, when users of your package write _targets.R, the pipeline becomes much easier to express.
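As a rough illustration, the sketch below builds such a pipeline from a single factory call. To keep the snippet self-contained, it redefines target_factory() locally as a stand-in for the version exported by the hypothetical yourExamplePackage, and "data.csv" is a placeholder path. In a real project, _targets.R would load the package with library() and end with the list of targets.

```r
library(targets)

# Stand-in for the factory exported by the hypothetical yourExamplePackage.
# A real _targets.R would call library(yourExamplePackage) instead.
target_factory <- function(data) {
  list(
    tar_target_raw("file", data, format = "file"),
    tar_target_raw("simple_model", quote(run_simple(file))),
    tar_target_raw("flexible_model", quote(run_flexible(file))),
    tar_target_raw("conclusions", quote(summarize(simple_model, flexible_model)))
  )
}

# The whole user-facing pipeline: one factory call inside the target list.
# "data.csv" is a placeholder path chosen for illustration.
pipeline <- list(
  target_factory("data.csv")
)

# Number of target objects the factory contributes to the pipeline.
length(pipeline[[1]])
```

The user never touches tar_target_raw(), quote(), or the file format: the factory call is the only pipeline code they write.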
tar_visnetwork(targets_only = TRUE)
With pre-packaged target factories, end users do not need to write as much code, and they do not need to be familiar with the advanced features of targets.
The R Targetopia is the Pandora’s Box of low-hanging fruit that dangles from target factories, and its goal is to democratize reproducible pipelines across more of the R community. It is a growing ecosystem of R packages that abstract away the most difficult parts of targets and make workflows simple and quick to write.
At the time of writing, the newest R Targetopia package is stantargets, a domain-specific workflow framework for Bayesian data analysis with Stan. With stantargets, writing a complex simulation study is as simple as one call to tar_stan_mcmc_rep_summary(). This complicated pipeline condenses down to the simple one below. Not only is the code shorter, but advanced concepts like file tracking, dynamic branching, and batching are abstracted away from the user. Bayesian statisticians can spend less time on software development and more time on model development.
# _targets.R
library(targets)
library(stantargets)
stan_targets <- tar_stan_mcmc_rep_summary(
  model,
  "model.stan",
  generate_stan_data(), # custom function
  batches = 40, # Batching reduces overhead.
  reps = 25, # reps per batch
  variables = c("beta", "true_beta_value"),
  summaries = list(
    ~posterior::quantile2(.x, probs = c(0.025, 0.5, 0.975))
  )
)
stan_targets
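To see why batching reduces overhead, it may help to picture what batches = 40 and reps = 25 mean: 40 units of work that each run 25 simulation repetitions, rather than 1000 units that each run one. The base R sketch below only mimics that grouping. simulate_once() and run_batch() are hypothetical stand-ins, and none of this is meant to show how stantargets is actually implemented.

```r
batches <- 40 # number of batches, as in the pipeline above
reps <- 25    # repetitions per batch

# Hypothetical stand-in for one simulation repetition:
# simulate a small dataset and return a one-row summary.
simulate_once <- function(rep) {
  x <- rnorm(n = 10)
  data.frame(rep = rep, estimate = mean(x))
}

# One batch runs all of its repetitions in a single unit of work,
# so per-unit overhead is paid once per batch instead of once per rep.
run_batch <- function(batch, reps) {
  out <- do.call(rbind, lapply(seq_len(reps), simulate_once))
  out$batch <- batch
  out
}

set.seed(1)
results <- do.call(rbind, lapply(seq_len(batches), run_batch, reps = reps))

nrow(results)                 # total repetitions: batches * reps
length(unique(results$batch)) # units of work actually scheduled
```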
tarchetypes is a broader R Targetopia package that simplifies general-purpose tasks such as static branching and parameterized R Markdown. It makes it straightforward to reproducibly render a parameterized R Markdown report across a large grid of parameters.
# _targets.R
library(targets)
library(tarchetypes)
library(tibble)
list(
  tar_target(x, "value_of_x"),
  tar_render_rep(
    report,
    "report.Rmd",
    params = generate_large_param_grid(), # custom function
    batches = 50 # Batching reduces overhead.
  )
)
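The custom function generate_large_param_grid() is not shown in the pipeline above. One plausible shape for it is sketched here, assuming tar_render_rep() expects a data frame with one row per rendered report and one column per R Markdown parameter. The parameter names sample_size and effect_size are hypothetical and would have to match the params field declared in report.Rmd.

```r
# Hypothetical parameter grid: one row per report to render,
# one column per parameter declared in the YAML header of report.Rmd.
generate_large_param_grid <- function() {
  expand.grid(
    sample_size = c(50, 100, 200, 400),
    effect_size = c(0, 0.25, 0.5),
    stringsAsFactors = FALSE
  )
}

param_grid <- generate_large_param_grid()
nrow(param_grid) # number of reports that would be rendered
```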
If you like developing R packages, please consider contributing an R Targetopia package for your own field of data science. I do plan to post detailed guidance in early 2021. But for now, the main piece is a target factory that calls tar_target_raw(). Functions substitute(), tar_sub(), and tar_eval() (the latter two from tarchetypes) can help create language objects for the command argument of tar_target_raw(). Functions tar_manifest(), tar_network(), tar_dir(), and tar_test() help write examples and tests. Feel free to borrow the source code of tarchetypes or stantargets, and do not hesitate to reach out.
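As one possible starting point, the sketch below uses substitute() to build the command argument from names chosen at run time. model_factory(), fit_command(), read_data(), and fit_model() are all hypothetical names invented for this illustration. They are not part of targets, tarchetypes, or stantargets.

```r
library(targets)

# Build the command for the model target. substitute() swaps the placeholder
# symbol `data` for the name of the upstream target.
fit_command <- function(name_data) {
  substitute(fit_model(data), env = list(data = as.symbol(name_data)))
}

# Hypothetical target factory. read_data() and fit_model() are placeholder
# user functions that the end user would have to define.
model_factory <- function(name, path) {
  name_data <- paste0(name, "_data")
  name_fit <- paste0(name, "_fit")
  # Insert the literal file path into the command that reads the data.
  command_data <- substitute(read_data(path), env = list(path = path))
  list(
    tar_target_raw(name_data, command_data),
    tar_target_raw(name_fit, fit_command(name_data))
  )
}

# Inspect a generated command, then create the targets.
fit_command("trial_data")
trial_targets <- model_factory("trial", "trial.csv")
```

Keeping command construction in a small helper such as fit_command() makes it easy to check the generated language objects in unit tests before any pipeline runs.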
[1] In early 2020, my colleague Richard Payne wrote a package to support a specialized drake plan factory, an idea that I previously underestimated. His package helped users create pipelines of their own, but it struggled against the constraints of drake_plan(), which is a major reason I decided to design targets with target factories in mind.
Figure from https://openclipart.org/image/2000px/188840.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".