Make-like pipeline toolkits enhance the reproducibility, efficiency, and hygiene of large data science projects. Pipelines orchestrate tasks such as data munging, model fitting, summaries, and R Markdown reports, while skipping tasks that are already up to date with the underlying code and upstream data. With pipelines, we save time, and we have evidence to inform which results we can trust.
targets is such a pipeline tool, and its major benefits are as follows.
|Efficiency||Orchestrates the pipeline, running computational tasks in parallel when possible and skipping tasks that are already up to date.|
|Reproducibility||Detects when all targets are already up to date, which provides tangible evidence that results match the underlying code and data.|
|Maintainability||Nudges users toward a clean function-oriented programming style, and it abstracts output data as R objects.|
|Portability||Saves only one file per target in the data store and gives files informative names to allow users to more easily understand their data and collaborate on projects.|
|Domain specificity||Designed to work entirely within the R language.|
Myriad workflow tools already exist, but most of them have different goals and behaviors. Different tools suit different situations.
|Tool||Issues and differences|
|pditmmaso/awesome-pipeline||Most tools listed are not for R.|
|Metaflow||Resembles Airflow more than GNU Make. Features include versioned runs and seamless cloud integration, but not automated decision-making about which steps to skip.|
|GNU Make||Not designed for a function-oriented style in R.|
||No longer maintained.|
||Permanent systemic limitations on portability, reproducibility, parallel efficiency, dynamic branching, user-side guardrails, and metaprogramming.|
||Operates on individual functions, not on a pipeline as a whole.|
||Designed for literate programming and communication. Not designed to scale in the same ways as pipeline tools, but useful for small components of pipelines.|
|Git||Designed for version control, not pipeline management specifically, but recommended for use with pipelines.|
||Designed for package environment management and containerization. Not pipeline management specifically, but recommended for use with pipelines.|
drake package is an older and more established R-focused pipeline toolkit. It is has become a key piece of the R ecosystem, and development and support will continue. Future efforts will focus on the issues that bring the most concrete value to users: requested features, performance improvements for known inefficiencies, bug fixes, documentation, education, and prompt one-on-one help. Existing
drake-powered projects can safely continue to use
drake, and there is no urgency to migrate to
However, nearly four years of community feedback have exposed major user-side limitations regarding data management, collaboration, dynamic branching, and parallel efficiency. Unfortunately, these limitations are permanent. Solutions in
drake itself would make the package incompatible with existing projects that use it, and the internal architecture is too copious, elaborate, and mature for such extreme refactoring. That is why
targets was created. The
targets package borrows from past learnings, user suggestions, discussions, complaints, success stories, and feature requests, and it improves the user experience in ways that will never be possible in
drake leaves ample room for user-side mistakes, and some of these mistakes require extra awareness or advanced knowledge of R to consistently avoid. The example behaviors below are too systemic to solve and still preserve back-compatibility.
make()looks for functions and global objects in the parent environment of the calling R session. Because the global environment is often old and stale in practical situations, which causes targets to become incorrectly invalidated. Users need to remember to restart the session before calling
make(). The issue is discussed here, and the discussion led to functions like
r_make()which always create a fresh session to do the work. However,
r_make()is not a complete replacement for
make(), and beginner users still run into the original problems.
make()does not find the intended functions and global objects if it is called in a different environment. Edge cases like this one and this one continue to surprise users.
drakeis extremely flexible about the location of the
.drake/cache. When a user calls
make(), and similar functions,
drakesearches up through the parent directories until it finds a
.drake/folder. This flexibility seldom helps, and it creates uncertainty and inconsistency when it comes to initializing and accessing projects, especially if there are multiple projects with nested file systems.
targets package solves all these issues by design. Functions
tar_make_future() all create fresh new R sessions by default. They all require a
_targets.R configuration file in the project root (working directory of the
tar_make() call) so that the functions, global objects, and settings are all populated in the exact same way each session, leading to less frustration, greater consistency, and greater reproducibility. In addition, the
_targets/ data store always lives in the project root.
drake’s cache is an intricate file system in a hidden
.drake folder. It contains multiple files for each target, and those names are not informative. (See the files in the
data/ folder in the diagram below.) Users often have trouble understanding how
drake manages data, resolving problems when files are corrupted, placing the data under version control, collaborating with others on the same pipeline, and clearing out superfluous data when the cache grows large in storage.
.drake/ ├── config/ ├── data/ ├───── 17bfcef645301416.rds ├───── 21935c86f12692e2.rds ├───── 37caf5df2892cfc4.rds ├───── ... ├── drake/ ├───── history/ ├───── return/ ├───── tmp/ ├── keys/ # A surprisingly large number of tiny text files live here. ├───── memoize/ ├───── meta/ ├───── objects/ ├───── progress/ ├───── recover/ ├───── session/ └── scratch/ # This folder should be temporary, but it gets egregiously large.
target takes a friendlier, more transparent, less mysterious approach to data management. Its data store is a visible
_targets folder, and it contains far fewer files: a spreadsheet of metadata, a spreadsheet of target progress, and one informatively named data file for each target. It is much easier to understand the data management process, identify and diagnose problems, place projects under version control, and avoid consuming unnecessary storage resources. Sketch:
_targets/ ├── meta/ ├───── meta ├───── progress ├── objects/ ├───── target_name_1 ├───── target_name_2 ├───── target_name_3 ├───── ... └── scratch/ # Deleted when the pipeline finishes.
Thanks to the simplified data store and simplified internals,
targets can automatically upload data to the Amazon S3 bucket of your choice. Simply configure
aws.s3, create a bucket, and select one of the AWS-powered storage formats. Then,
targets will automatically upload the return values to the cloud.
# _targets.R tar_option_set(resources = list(bucket = "my-bucket-name")) tar_pipeline( tar_target(dataset, get_large_dataset(), format = "aws_fst_tbl"), tar_target(analysis, analyze_dataset(dataset), format = "aws_qs") )
Data retrieval is still super easy.
drake has several utilities that inform users which targets are up to date and which need to rerun. However, those utilities are limited by how
drake manages functions and other global objects. Whenever
drake inspects globals, it stores their values in its cache and loses track of their previous state from the last run of the pipeline. As a result, it has trouble informing users exactly why a given target is out of date. And because the system for tracking global objects is tightly coupled with the cache, this limitation is permanent.
targets, the metadata management system only updates information on global objects when the pipeline actually runs. This makes it possible to understand which specific changes to your code could have invalided your targets. In large projects with long runtimes, this feature contributes significantly to reproducibility and peace of mind.
Dynamic branching was an architecturally difficult fit in
drake, and it can only support one single (
vctrs-based) method of slicing and aggregation for processing sub-targets. This limitation has frustrated members of the community, as discussed here and here.
targets, on the other hand, is more flexible regarding slicing and aggregation. When it branches over an object, it can iterate over vectors, lists, and even data frames grouped with
dplyr::group_by(). To branch over chunks of a data frame, our data frame target needs to have a special
tar_group column. We can create this column in our target’s return value with the
## # A tibble: 6 x 3 ## # Groups: id  ## x id tar_group ## <int> <chr> <int> ## 1 1 a 1 ## 2 2 a 1 ## 3 3 b 2 ## 4 4 b 2 ## 5 5 c 3 ## 6 6 c 3
Our actual target has the command above and
iteration = "group".
tar_target( data, data.frame( x = seq_len(6), id = rep(letters[seq_len(3)], each = 2) ) %>% group_by(id) %>% tar_group(), iteration = "group" )
Now, any target that maps over
data is going to define one branch for each group in the data frame. The following target creates three branches when run in a pipeline: one returning 3, one returning 7, and one returning 11.
Dynamic branching in
drake is staged. In other words, all the sub-targets of a dynamic target must complete before the pipeline moves on to downstream targets. The diagram below illustrates this behavior in a pipeline with a dynamic target B that maps over another dynamic target A. For thousands of dynamic sub-targets with highly variable runtimes, this behavior consumes unnecessary runtime and computing resources. And because
drake’s architecture was designed at a fundamental level for static branching only, this limitation is permanent.
By contrast, the internal data structures in
targets are dynamic by design, which allows for a dynamic branching model with more flexibility and parallel efficiency. Branches can always start as soon as their upstream dependencies complete, even if some of those upstream dependencies are branches. This behavior reduces runtime and reduces consumption of computing resources.
drake, pipelines are defined with the
drake_plan() supports an elaborate domain specific language that diffuses user-supplied R expressions. This makes it convenient to assign commands to targets in the vast majority of cases, but it also obstructs custom metaprogramming by users (example here). Granted, it is possible to completely circumvent
drake_plan() and create the whole data frame from scratch, but this is hardly ideal and seldom done in practice.
targets package tries to make customization easier. Relative to
targets takes a decentralized approach to setting up pipelines, moving as much custom configuration as possible to the target level rather than the whole pipeline level. In addition, the
tar_target_raw() function avoids non-standard evaluation while mirroring
tar_target() in all other respects. All this makes it much easier to create custom metaprogrammed pipelines and target archetypes while avoiding an elaborate domain specific language for static branching, which was extremely difficult to understand and error prone in