R Targetopia: Contributing

The R Targetopia has the potential to cover many fields of statistics and data science, so community contributions are extremely valuable. This guide explains how to create your own R Targetopia package.

Before you begin

Prerequisites

Domain expertise in a subfield of data science.
Familiarity with targets. (Resources linked here.)
Experience with R package development, including documentation and testing. The rOpenSci development guide is super helpful.

Scope

R Targetopia packages are highly specialized, and each is tailored to an existing implementation of the underlying methodology. For example, stantargets builds on cmdstanr, and the former inherits interface patterns and documentation from the latter. brms compatibility is out of scope for stantargets and would need to be implemented in its own R Targetopia package (discussion here).

Implementation

Target factories

R Targetopia packages leverage target factories to make pipeline construction easier. A target factory is a function that accepts simple inputs, calls tar_target_raw(), and produces a list of target objects. Sketch:

# R/factory.R
#' @title Example target factory.
#' @description Define 3 targets:
#' 1. Track the user-supplied data file.
#' 2. Read the data using `read_data()` (defined elsewhere).
#' 3. Fit a model to the data using `run_model()` (defined elsewhere).
#' @return A list of target objects.
#' @export
#' @param file Character, data file path.
target_factory <- function(file) {
  list(
    tar_target_raw("file", file, format = "file", deployment = "main"),
    tar_target_raw("data", quote(read_data(file)), format = "fst_tbl", deployment = "main"),
    tar_target_raw("model", quote(run_model(data)), format = "qs")
  )
}

In _targets.R, the user writes one call to the factory instead of multiple calls to tar_target().¹ This shorthand makes user-side code simpler and more concise, and it abstracts away low-level configuration settings like format = "file" and deployment = "main".

# _targets.R
library(targets)
library(yourExamplePackage)
target_factory("data.csv") # End with a list of targets.

# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name  command          
#>   <chr> <chr>            
#> 1 file  "\"data.csv\""   
#> 2 data  "read_data(file)"           
#> 3 model "run_model(data)"

Metaprogramming

Target factories invoke the tar_target_raw() function. Whereas tar_target() is for end users, tar_target_raw() is for developers. tar_target_raw() expects a character string for the name argument and expression objects for the command and pattern arguments. The base R functions quote(), deparse(), and substitute(), along with tar_sub() and tar_eval() from tarchetypes, can help you create these arguments.²

The quote() function captures arbitrary expressions.

quote(f(x + y))

f(x + y)

str(quote(f(x + y)))

 language f(x + y)

The deparse() function turns expressions into characters.

deparse(quote(f(x + y)))

[1] "f(x + y)"

The substitute() function quotes code and replaces symbols in the resulting expression with arbitrary values.

substitute(f(arg = arg), env = list(arg = quote(x + y)))

f(arg = x + y)

If you call substitute() from inside a function (or other non-global environment), then env defaults to the current evaluation environment, so each argument symbol is replaced by the unevaluated expression the caller supplied.

f <- function(arg) substitute(f(arg = arg))
f(arg = f(x + y))

f(arg = f(x + y))
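
The difference between quote() and substitute() matters most inside a function body. The following base R sketch uses two hypothetical helpers (with_quote() and with_substitute(), neither of which is part of targets) to show that quote() captures the literal argument symbol, while substitute() captures the expression the caller supplied.

```r
# Hypothetical helpers for illustration only.
with_quote <- function(arg) quote(arg)           # Always returns the symbol `arg`.
with_substitute <- function(arg) substitute(arg) # Returns the caller's expression.

quoted <- with_quote(x + y)
substituted <- with_substitute(x + y)

print(quoted)
print(substituted)
print(deparse(substituted))
```

This is why a factory that accepts an unquoted target name usually reaches for substitute() rather than quote().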

Together, quote(), deparse(), and substitute() help you create factories that accept friendly user inputs and supply safe arguments to tar_target_raw().

# R/factory.R
#' @title Example target factory.
#' @description Define 3 targets:
#' 1. Track the user-supplied data file.
#' 2. Read the data using `read_data()` (defined elsewhere).
#' 3. Fit a model to the data using `run_model()` (defined elsewhere).
#' @return A list of target objects.
#' @export
#' @param name Symbol, name for the collection of targets.
#' @param file Character, data file path.
target_factory <- function(name, file) {
  name_model <- deparse(substitute(name))
  name_file <- paste0(name_model, "_file")
  name_data <- paste0(name_model, "_data")
  sym_file <- as.symbol(name_file)
  sym_data <- as.symbol(name_data)
  command_data <- substitute(read_data(file), env = list(file = sym_file))
  command_model <- substitute(run_model(data), env = list(data = sym_data))
  list(
    tar_target_raw(name_file, file, format = "file", deployment = "main"),
    tar_target_raw(name_data, command_data, format = "fst_tbl", deployment = "main"),
    tar_target_raw(name_model, command_model, format = "qs")
  )
}

# R console
tar_manifest(fields = command)
#> # A tibble: 3 x 2
#>   name        command                  
#>   <chr>       <chr>                    
#> 1 custom_file "\"data.csv\""           
#> 2 custom_data "read_data(custom_file)"
#> 3 custom      "run_model(custom_data)"
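
The manifest above presumably comes from a _targets.R file that calls target_factory(custom, "data.csv"); that call is inferred from the output and is not shown in the original. As a rough check of the bookkeeping, the sketch below repeats the name and command construction in base R and leaves out the tar_target_raw() calls, so it runs without targets installed. The helper factory_specs() is hypothetical.

```r
# Hypothetical helper: same metaprogramming as target_factory(),
# but it returns names and deparsed commands instead of target objects.
factory_specs <- function(name, file) {
  name_model <- deparse(substitute(name))
  name_file <- paste0(name_model, "_file")
  name_data <- paste0(name_model, "_data")
  command_data <- substitute(
    read_data(file),
    env = list(file = as.symbol(name_file))
  )
  command_model <- substitute(
    run_model(data),
    env = list(data = as.symbol(name_data))
  )
  data.frame(
    name = c(name_file, name_data, name_model),
    command = c(
      deparse(file),
      deparse(command_data),
      deparse(command_model)
    ),
    stringsAsFactors = FALSE
  )
}

specs <- factory_specs(custom, "data.csv")
print(specs)
```

The names and commands in specs should line up with the rows of the manifest.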

Settings

Situational knowledge helps us supply optimal arguments to tar_target_raw() that the user should not need to bother with. We have four such examples in target_factory() above.

deployment = "main": the data file lives on the user’s local machine or login node, so remote workers in high-performance computing scenarios may not be able to access it. Targets like these should not run on remote compute nodes.
format = "file": track the input data file and invalidate the appropriate targets when the contents of the file change.
format = "fst_tbl": a specialized format that efficiently stores and retrieves tibbles.
format = "qs": efficient general-purpose storage format for R objects.

Many of the remaining arguments to tar_target_raw() should be exposed as arguments to the factory (omitted from our example target_factory() for brevity) with default values from tar_option_get(). Examples may include priority and cue because users may have good reasons to set these. However, arguments like command, pattern, deps, and string are low level and should not be supported.
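
One way to expose settings is sketched below, assuming the targets package is installed. The choice of error and cue as the exposed arguments is only an illustration, and storage formats other than "file" are left at their defaults so the sketch needs no extra packages. read_data() and run_model() remain the hypothetical user-level functions from the earlier examples.

```r
library(targets)

# Sketch of a factory that exposes selected settings to the user.
# Defaults come from tar_option_get() so tar_option_set() is respected.
target_factory <- function(
  name,
  file,
  error = targets::tar_option_get("error"),
  cue = targets::tar_option_get("cue")
) {
  name_model <- deparse(substitute(name))
  name_file <- paste0(name_model, "_file")
  name_data <- paste0(name_model, "_data")
  command_data <- substitute(
    read_data(file),
    env = list(file = as.symbol(name_file))
  )
  command_model <- substitute(
    run_model(data),
    env = list(data = as.symbol(name_data))
  )
  list(
    # Low-level settings stay hard-coded; exposed ones are forwarded.
    tar_target_raw(
      name_file, file,
      format = "file", deployment = "main", error = error, cue = cue
    ),
    tar_target_raw(
      name_data, command_data,
      deployment = "main", error = error, cue = cue
    ),
    tar_target_raw(name_model, command_model, error = error, cue = cue)
  )
}

# The user overrides one setting and inherits the rest.
pipeline <- target_factory(custom, "data.csv", cue = tar_cue(mode = "always"))
```

Arguments such as command and pattern are deliberately absent from the signature, in line with the advice above.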

Branching

Dynamic branching and static branching are difficult for most end users, so the mechanics of branching should happen behind the scenes. Simplification and guardrails are critical.

Static branching

Static branching works best with a small number of potentially heterogeneous tasks. Functions tar_map(), tar_combine_raw(), tar_sub(), and tar_eval() can help with the implementation internally. User-side inputs should be as simple as possible. For example, the stantargets::tar_stan_mcmc() factory accepts a character vector of Stan model files and internally calls tar_map() to create a group of targets for each model.
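
The internal bookkeeping for static branching can be sketched in base R. The example below is not how stantargets is implemented: it only derives one target name and one command expression per model file, which a real factory would then hand to tar_target_raw() or tar_map(). The helper static_branch_specs(), the "mcmc_" prefix, and fit_model() are all hypothetical.

```r
# Hypothetical helper: one target specification per Stan model file.
static_branch_specs <- function(stan_files) {
  model_names <- tools::file_path_sans_ext(basename(stan_files))
  lapply(seq_along(stan_files), function(i) {
    list(
      # Derive a readable target name from the file name.
      name = paste0("mcmc_", model_names[i]),
      # Insert the file path into the command as a literal string.
      command = substitute(
        fit_model(file),
        env = list(file = stan_files[i])
      )
    )
  })
}

specs <- static_branch_specs(c("models/a.stan", "models/b.stan"))
spec_names <- vapply(specs, function(spec) spec$name, character(1))
spec_commands <- vapply(
  specs,
  function(spec) deparse(spec$command),
  character(1)
)
print(spec_names)
print(spec_commands)
```

The user only supplies a character vector of files; the target names and commands are generated behind the scenes.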

Dynamic branching

Dynamic branching is best suited to larger collections of homogeneous tasks whose inputs are not necessarily known in advance. A factory with dynamic branching should create the pattern argument of tar_target_raw() with behind-the-scenes metaprogramming, and it should support batching to sensibly partition the work. Users should control the number of batches and reps per batch, but they should not be able to control the pattern argument. Examples of batching include tar_rep_raw(), tar_stan_mcmc_rep_summary(), and the targets-stan workflow.
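
A base R sketch of that metaprogramming follows. It is not the implementation of tar_rep_raw(): it only builds the expressions that a batched factory might pass to the command and pattern arguments of tar_target_raw(). The helper batched_specs(), the "_batch" suffix, and run_batch() are hypothetical.

```r
# Hypothetical helper: expressions for a batched dynamic branching target.
# `run_batch()` is a placeholder for user-level simulation code.
batched_specs <- function(name, batches = 2, reps = 5) {
  stopifnot(batches >= 1, reps >= 1) # Simple guardrails on user input.
  name_batch <- paste0(name, "_batch")
  sym_batch <- as.symbol(name_batch)
  list(
    name_batch = name_batch,
    # Upstream target: one element per batch.
    command_batch = substitute(seq_len(n), env = list(n = batches)),
    # Downstream target: each branch runs `reps` replications.
    command_reps = substitute(
      run_batch(batch, reps = r),
      env = list(batch = sym_batch, r = reps)
    ),
    # Pattern built internally, so the user never writes map() by hand.
    pattern = substitute(map(batch), env = list(batch = sym_batch))
  )
}

specs <- batched_specs("sim", batches = 4, reps = 25)
print(lapply(specs[-1], deparse))
```

The user controls batches and reps, while the pattern expression is fixed by the factory.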

Documentation

Examples

The @examples field of the roxygen2 docstring should run quickly and avoid creating non-temporary files, which is why the examples in stantargets are mostly just sketches of pipelines. If you want to actually run a pipeline in an example, consider enclosing it inside tar_dir() to run the code in a temporary directory.

README.Rmd

Feel free to include a README badge to let others know your package is part of the R Targetopia.

[![R Targetopia](https://img.shields.io/badge/R_Targetopia-member-blue?style=flat&labelColor=gray)](https://wlandau.github.io/targetopia/)

Testing

What to test

Results: write a pipeline with tar_script(), run it with tar_make(), and inspect the output with tar_read().
Manifest: use tar_manifest() to check that the pipeline has the correct number of targets with the correct commands and configuration settings.
Dependencies: use the graph edges from tar_network() to check the dependency relationships among the targets. For example, in our target factory from earlier, there should be a directed edge from the input file target to the data target.
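
The three kinds of checks can be combined in one test, sketched below under the assumption that the targets package is installed. The miniature factory mini_factory() is hypothetical and defined inline so the example is self-contained; a real test would call the factory exported by your package and would likely sit inside tar_test() or test_that().

```r
library(targets)

tar_dir({ # Run everything in a temporary directory.
  tar_script({
    # Hypothetical miniature factory, defined inline for the sketch.
    mini_factory <- function(name) {
      name_data <- paste0(name, "_data")
      command <- substitute(
        sum(data),
        env = list(data = as.symbol(name_data))
      )
      list(
        tar_target_raw(name_data, quote(c(1, 2, 3))),
        tar_target_raw(name, command)
      )
    }
    mini_factory("custom")
  })
  # Manifest: target names and commands.
  manifest <- tar_manifest(fields = command, callr_function = NULL)
  # Dependencies: edges of the dependency graph.
  edges <- tar_network(targets_only = TRUE, callr_function = NULL)$edges
  # Results: run the pipeline and read the output.
  tar_make(callr_function = NULL, reporter = "silent")
  result <- tar_read(custom)
})

print(manifest)
print(edges)
print(result)
```

Here callr_function = NULL keeps everything in the current R session, which is faster but makes the run sensitive to objects in the testing environment.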

Speed

Unit tests should run quickly if possible. To increase testing speed, you may wish to set callr_function = NULL in functions like tar_make(), but be warned that the result will be sensitive to functions you define in the testing environment. CRAN has strict policies about total check time, and testthat::skip_on_cran() can help.

Environment

Tests should avoid creating non-temporary files, and they should avoid permanently changing target-specific options that could affect other tests. tar_test() is a drop-in replacement for test_that() which solves these problems. It runs the test in a temporary directory, and it automatically calls tar_option_reset() when the test is over. Tests using tar_test() can freely create local files and set target options.

rOpenSci

R Targetopia packages support workflow automation, making them excellent candidates for rOpenSci software review. The review process is a valuable source of feedback, and the rOpenSci community is welcoming and supportive. More details are available here.

Contact

If you have a package idea or are actively working on one, please feel free to reach out.

¹ Users can still write their own downstream tar_target() calls in the pipeline for custom postprocessing.
² For more information about metaprogramming in base R, see the “Computing on the Language” chapter of the Advanced R book.