The remakeGenerator package is a helper add-on for remake, a Make-like reproducible build system for R. If you haven’t done so already, go learn remake! Once you do that, you will be ready to use remakeGenerator. With remakeGenerator, your long and cumbersome workflows will be

  • Quick to set up. You can plan a large workflow with a small amount of code.
  • Reproducible. Reproduce computation with remake::make() or GNU Make.
  • Development-friendly. Thanks to remake, whenever you change your code, your next computation will only run the parts that are new or out of date.
  • Parallelizable. Distribute your workflow over multiple parallel processes with a single flag in GNU Make.

The remakeGenerator package accomplishes this by generating YAML files for remake that would be too big to type manually.

Rtools for Windows users

Windows users may need Rtools to take full advantage of remakeGenerator’s features, specifically to run Makefiles with system("make").
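
If you are unsure whether make is available, you can check from R before relying on system("make"). A minimal sketch using only base R:

Sys.which("make") # returns "" if make is not on the search path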

Help and troubleshooting

Use the help_remakeGenerator() function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page.
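
For example:

library(remakeGenerator)
help_remakeGenerator() # displays a collection of helpful links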

Basic example

Write the files for the basic example using

library(remakeGenerator)
example_remakeGenerator("basic")
# list_examples_remakeGenerator() # Shows the names of available examples.

Run workflow.R to produce the remake file remake.yml and an overarching Makefile, and to run the workflow with 2 parallel processes.

source("workflow.R")

To use remake directly in a single process, use

workflow(..., run = FALSE)
remake::make()

Do not call the Makefile directly from the Linux command line. As explained in the parallelRemake vignette, you must use workflow(..., command = "make", args = "--jobs=2"), parallelRemake::makefile(..., command = "make", args = "--jobs=4"), or a similar call. parallelRemake runs a quick overhead step to configure hidden files for the Makefile before running it.
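
For example, the following sketch reuses the targets object and arguments from the basic example; workflow() performs the setup and then invokes GNU Make itself.

workflow(targets, sources = "code.R", packages = "MASS",
  command = "make", args = "--jobs=2")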

Notice how workflow.R and remake.yml rely on the functions defined in code.R. To see how remakeGenerator saves you time, change the body of one of these functions (something more significant than whitespace or comments) and then run remake::make() again. Only the targets that depend on that function, along with their downstream output, are recomputed. If you only change whitespace or comments in code.R, the next call to remake::make() will change nothing, so you can tidy and document your code without triggering unnecessary rebuilds.
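
For example, assuming you have just run the basic example, try the following sketch by hand.

# 1. Edit the body of, say, linear_analysis() in code.R.
# 2. Rerun remake:
remake::make() # only the linear_* targets and their downstream output rebuild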

A walk through workflow.R

workflow.R is the master plan of the analysis. It arranges the helper functions in code.R to

  1. Generate some datasets.

    library(remakeGenerator)
    datasets = commands(
      normal16 = normal_dataset(n = 16),
      poisson32 = poisson_dataset(n = 32),
      poisson64 = poisson_dataset(n = 64)
    )
  2. Analyze each dataset with each of two methods of analysis.

    analyses = analyses(
      commands = commands(
        linear = linear_analysis(..dataset..),
        quadratic = quadratic_analysis(..dataset..)), 
      datasets = datasets)
  3. Summarize each analysis of each dataset and gather the summaries into manageable objects.

    summaries = summaries(
      commands = commands(
        mse = mse_summary(..dataset.., ..analysis..),
        coef = coefficients_summary(..analysis..)), 
      analyses = analyses, datasets = datasets, gather = strings(c, rbind))
  4. Compute output on the summaries.

    output = commands(coef.csv = write.csv(coef, target_name))
  5. Generate plots.

    plots = commands(mse.pdf = hist(mse, col = I("black")))
    plots$plot = TRUE
  6. Compile knitr reports.

    reports = data.frame(target = strings(markdown.md, latex.tex),
      depends = c("poisson32, coef, coef.csv", ""))
    reports$knitr = TRUE

With these stages of the workflow planned, workflow.R collects all the remake targets into one YAML-like list.

targets = targets(datasets = datasets, analyses = analyses, 
  summaries = summaries, output = output, plots = plots, reports = reports)

Finally, it generates the remake.yml file and an overarching Makefile. Then, unless run = FALSE, it runs the Makefile to deploy your workflow. In this case, you can see from the args argument that the work is distributed over at most 2 parallel jobs.

workflow(targets, sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make",
  args = "--jobs=2")

Running intermediate stages

You can run each intermediate stage by itself with the make_these argument of workflow(...).

workflow(targets, make_these = "summaries", 
  sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")

Bypassing the Makefile by setting run = FALSE and then calling remake::make() directly does the same thing in a single R process.

workflow(targets, make_these = "summaries", run = FALSE,
  sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
  prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
remake::make("summaries")

To remove the intermediate files and final results, run

remake::make("clean")

The framework

At each stage (datasets, analyses, summaries, etc.), the user supplies named R commands. The commands are then arranged into a data frame, such as the datasets data frame from the basic example.

> datasets
     target                 command
1  normal16  normal_dataset(n = 16)
2 poisson32 poisson_dataset(n = 32)
3 poisson64 poisson_dataset(n = 64)

Above, each row stands for an individual remake target, and the target column contains the name of the target. Each command is the R function call that produces its respective target. With the exception of “target”, each column of each data frame represents a target-specific field in the remake.yml file. If additional fields are needed, just append the appropriate columns to the data frame. In workflow.R, the plot and knitr fields were added this way to the plots and reports data frames, respectively. Recall from remake that setting plot to TRUE automatically sends the output of the command to a file so you do not have to bother writing the code to save it.
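
For example, the sketch below appends a hypothetical depends column to the output data frame, declaring an extra dependency that is not an argument of the command.

output$depends = "mse" # comma-separated string if there are several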

> plots
   target                     command plot
1 mse.pdf hist(mse, col = I("black")) TRUE

In addition, setting knitr to TRUE knits .md and .tex target files from .Rmd and .Rnw files, respectively.

> reports
       target                   depends knitr
1 markdown.md poisson32, coef, coef.csv  TRUE
2   latex.tex                            TRUE

Above, and in the general case, each depends field is a character string of comma-separated remake dependencies. Dependencies that are arguments to commands are automatically resolved and should not be restated in depends. However, for knitr reports, every dependency must be explicitly given in the depends field.

In generating the analyses and summaries data frames, you may have noticed the ..dataset.. and ..analysis.. symbols. Those are wildcard placeholders indicating that the respective commands will iterate over each dataset and each analysis of each dataset. The analyses() function turns

> commands(linear = linear_analysis(..dataset..), quadratic = quadratic_analysis(..dataset..))
     target                         command
1    linear    linear_analysis(..dataset..)
2 quadratic quadratic_analysis(..dataset..)

into

               target                       command
1     linear_normal16     linear_analysis(normal16)
2    linear_poisson32    linear_analysis(poisson32)
3    linear_poisson64    linear_analysis(poisson64)
4  quadratic_normal16  quadratic_analysis(normal16)
5 quadratic_poisson32 quadratic_analysis(poisson32)
6 quadratic_poisson64 quadratic_analysis(poisson64)

and summaries(..., gather = NULL) turns

> commands(mse = mse_summary(..dataset.., ..analysis..), coef = coefficients_summary(..analysis..))
  target                                command
1    mse mse_summary(..dataset.., ..analysis..)
2   coef     coefficients_summary(..analysis..)

into

                     target                                     command
1       mse_linear_normal16      mse_summary(normal16, linear_normal16)
2      mse_linear_poisson32    mse_summary(poisson32, linear_poisson32)
3      mse_linear_poisson64    mse_summary(poisson64, linear_poisson64)
4    mse_quadratic_normal16   mse_summary(normal16, quadratic_normal16)
5   mse_quadratic_poisson32 mse_summary(poisson32, quadratic_poisson32)
6   mse_quadratic_poisson64 mse_summary(poisson64, quadratic_poisson64)
7      coef_linear_normal16       coefficients_summary(linear_normal16)
8     coef_linear_poisson32      coefficients_summary(linear_poisson32)
9     coef_linear_poisson64      coefficients_summary(linear_poisson64)
10  coef_quadratic_normal16    coefficients_summary(quadratic_normal16)
11 coef_quadratic_poisson32   coefficients_summary(quadratic_poisson32)
12 coef_quadratic_poisson64   coefficients_summary(quadratic_poisson64)

Setting the gather argument in summaries() to c("c", "rbind") prepends the following two rows to the above data frame.

  target command
1   coef rbind(coef_linear_normal16 = coef_linear_normal16, coef_linear_poisson32 = coef_linear_poisson32, coef_linear_poisson64 = coef_linear_poisson64, coef_quadratic_normal16 = coef_quadratic_normal16, coef_quadratic_poisson32 = coef_quadratic_poisson32, coef_quadratic_poisson64 = coef_quadratic_poisson64)
2    mse c(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)

These two new rows gather the summaries together into manageable objects. The default value of gather is a character vector with entries "list", so by default each group of summaries is gathered into a list.
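
In other words, omitting gather in the summaries() call from workflow.R should be equivalent to the following sketch.

summaries = summaries(
  commands = commands(
    mse = mse_summary(..dataset.., ..analysis..),
    coef = coefficients_summary(..analysis..)),
  analyses = analyses, datasets = datasets,
  gather = c("list", "list")) # the default: one "list" entry per command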

The “commands” functions

The commands(), commands_string(), and commands_batch() functions help create data frames of targets and commands such as the datasets data frame from the previous section.

commands(x = f(1), y = g(2))
##   target command
## 1      x    f(1)
## 2      y    g(2)
a = "f(1)"
b = "g(2)"
commands_string(x = a, y = b)
##   target command
## 1      x    f(1)
## 2      y    g(2)
batch = c(x = a, y = b)
commands_batch(batch)
##   target command
## 1      x    f(1)
## 2      y    g(2)
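
The strings() helper used in workflow.R (as in gather = strings(c, rbind)) is a related convenience: it converts its unevaluated arguments into a character vector.

strings(c, rbind)
## [1] "c"     "rbind"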

Where is my output?

When your workflow runs, intermediate objects such as datasets, analyses, and summaries are maintained in remake’s storr cache, located in the hidden .remake/objects/ folder. To inspect your workflow, you can list the cached objects with parallelRemake::recallable() and load them with parallelRemake::recall(). After running the basic example, we see the following.

> library(parallelRemake)
> recallable()
 [1] "coef"                     "coef_linear_normal16"    
 [3] "coef_linear_poisson32"    "coef_linear_poisson64"   
 [5] "coef_quadratic_normal16"  "coef_quadratic_poisson32"
 [7] "coef_quadratic_poisson64" "linear_normal16"         
 [9] "linear_poisson32"         "linear_poisson64"        
[11] "mse"                      "mse_linear_normal16"     
[13] "mse_linear_poisson32"     "mse_linear_poisson64"    
[15] "mse_quadratic_normal16"   "mse_quadratic_poisson32" 
[17] "mse_quadratic_poisson64"  "normal16"                
[19] "poisson32"                "poisson64"               
[21] "quadratic_normal16"       "quadratic_poisson32"     
[23] "quadratic_poisson64"   
> recall("normal16")
            x        y
1   1.5500328 4.226192
2   1.4714371 4.374820
3   0.4906371 6.228053
4   1.0086720 4.945609
5   1.3360642 5.619259
6   1.4899272 4.920836
7   0.7046544 4.926668
8   1.4092923 4.030779
9   2.5636956 6.026149
10 -0.5202316 4.368160
11  0.5540340 4.760691
12  1.6256007 4.722436
13  1.3210316 3.838017
14  0.8247446 2.708511
15  2.7262725 5.878415
16  2.3565342 4.445811
> out = recall("normal16", "poisson32")
> str(out)
List of 2
 $ normal16 :'data.frame':  16 obs. of  2 variables:
  ..$ x: num [1:16] 0.9728 1.0688 1.4152 -0.4313 0.0912 ...
  ..$ y: num [1:16] 6.76 6.48 5.59 5.03 3.01 ...
 $ poisson32:'data.frame':  32 obs. of  2 variables:
  ..$ x: int [1:32] 0 2 1 0 0 2 1 0 0 1 ...
  ..$ y: int [1:32] 4 4 5 4 3 7 4 4 5 2 ...

The functions create_bindings() and make_environment() are alternatives from remake itself. Just be careful with create_bindings() if your project has a lot of data.
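
A minimal sketch of the make_environment() alternative; the exact interface may differ across remake versions, so see ?remake::make_environment.

env = remake::make_environment() # assumption: defaults load your remake objects
ls(env)      # names of the available objects
env$normal16 # e.g., a dataset from the basic example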

Do not use recall() or recallable() in serious production-level workflows because operations on the storr cache are not reproducibly tracked.

High-performance computing

If you want to run Make to distribute tasks over multiple nodes of a Slurm cluster, you should generate a Makefile like the one in this post. To do this, add the following to an R script (say, my_script.R)

workflow(..., command = "make", args = "--jobs=8",
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS= <ARGS> bash -c"))

where <ARGS> stands for additional arguments to srun. Then, deploy your parallelized workflow to the cluster using the following Linux command.

nohup nice -19 R CMD BATCH my_script.R &

For other task managers such as PBS, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. From R, call

workflow(..., command = "make", args = "--jobs=8",
  prepend = "SHELL = ./shell.sh")

where the file shell.sh contains

#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y

Now, in the Linux command line, enable execution with

chmod +x shell.sh

and then distribute the work over at most 8 simultaneous jobs (set by args = "--jobs=8") with

nohup nice -19 R CMD BATCH my_script.R &

The same approach should work on LSF systems, with make replaced by lsmake, provided the Makefile is compatible with lsmake.

Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache. For the Univa Grid Engine, the -cwd flag for qsub accomplishes this.

downsize

You can use the downsize package in conjunction with remakeGenerator. First, write an R script (say, downsize.R) to set test or production mode.

# downsize::test_mode()
downsize::production_mode()

Source downsize.R inside workflow.R so that your analysis plan responds to downsize().

library(remakeGenerator)
source("downsize.R")
datasets = commands_string(
  data1 = paste0("long_job(number_of_samples = ", downsize(1000, 2), ")")
)

If your custom code.R functions call downsize() internally, remake needs to know.

workflow(sources = c("downsize.R", "code.R", ...), packages = c("downsize", ...))

Unfortunately, remake does not rebuild targets in response to changes to global options, so you should manually run remake::make("clean") to start from scratch whenever you change downsize.R.
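
For example, after switching downsize.R between modes, a safe sequence is the following sketch, assuming workflow.R is your master script.

remake::make("clean") # discard the targets built under the old mode
source("workflow.R")  # regenerate remake.yml and rebuild from scratch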

Flexibility

Some workflows do not fit the rigid structure of the basic example but could still benefit from the automated generation of remake.yml files and Makefiles. If you supply the appropriate data frames to the targets() function, you can customize your own analyses. Here, the expand() and evaluate() functions are essential for flexibility. The expand() function replicates targets generated by the same commands, and the evaluate() function lets you create and evaluate your own wildcard placeholders. With the rules argument, the evaluate() function can also evaluate multiple wildcards in a single function call. (In this case, rules takes precedence, and the wildcard and values arguments are ignored.) Here are some examples.

df = commands(data = simulate(center = MU, scale = SIGMA))
df
##   target                              command
## 1   data simulate(center = MU, scale = SIGMA)
df = expand(df, values = c("rep1", "rep2"))
df
##      target                              command
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2)
##        target                             command
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
##      target                             command
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
##      target                           command
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2   simulate(center = 2, scale = 1)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
##             target                           command
## 1  data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2    data_rep1_1_1   simulate(center = 1, scale = 1)
## 3   data_rep1_1_10  simulate(center = 1, scale = 10)
## 4  data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5    data_rep1_2_1   simulate(center = 2, scale = 1)
## 6   data_rep1_2_10  simulate(center = 2, scale = 10)
## 7  data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8    data_rep2_1_1   simulate(center = 1, scale = 1)
## 9   data_rep2_1_10  simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11   data_rep2_2_1   simulate(center = 2, scale = 1)
## 12  data_rep2_2_10  simulate(center = 2, scale = 10)
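
To use a hand-built data frame like the ones above, pass it to targets() just as in the basic example. The sketch below assumes simulate() is defined in code.R; the stage name data is arbitrary.

df = commands(data = simulate(center = MU, scale = SIGMA))
df = expand(df, values = c("rep1", "rep2"))
df = evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
targets = targets(data = df)
workflow(targets, sources = "code.R", run = FALSE)
remake::make()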

For another demonstration, see the flexible example, which is almost the same as the basic example except that it uses expand() and evaluate() explicitly.