The `remakeGenerator` package is a helper add-on for `remake`, a Makefile-like reproducible build system for R. If you haven’t done so already, go learn `remake`! Once you do that, you will be ready to use `remakeGenerator`. With `remakeGenerator`, your long and cumbersome workflows will be

- **Quick to set up.** You can plan a large workflow with a small amount of code.
- **Reproducible.** Reproduce computation with `remake::make()` or GNU Make.
- **Development-friendly.** Thanks to `remake`, whenever you change your code, your next computation will only run the parts that are new or out of date.
- **Parallelizable.** Distribute your workflow over multiple parallel processes with a single flag in GNU Make.

The `remakeGenerator` package accomplishes this by generating YAML files for `remake` that would be too big to type manually.

Windows users may need `Rtools` to take full advantage of `remakeGenerator`’s features, specifically to run Makefiles with `system("make")`.

Use the `help_remakeGenerator()` function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page for instructions.

Write the files for the basic example using

```
library(remakeGenerator)
example_remakeGenerator("basic")
# list_examples_remakeGenerator() # Shows the names of available examples.
```

Run `workflow.R` to produce the `remake` file `remake.yml` and an overarching Makefile, and to run the workflow using 2 parallel processes.

`source("workflow.R")`

To use `remake` directly in a single process, use

```
workflow(..., run = FALSE)
remake::make()
```

**Do not call the Makefile directly in the Linux command line.** As explained in the parallelRemake vignette, you must use `workflow(..., command = "make", args = "--jobs=2")` or `parallelRemake::makefile(..., command = "make", args = "--jobs=4")`, etc. parallelRemake uses a quick overhead step to configure hidden files for the Makefile before running it.

Notice how `workflow.R` and `remake.yml` rely on the functions defined in `code.R`. To see how `remakeGenerator` saves you time, change the body of one of these functions (something more significant than whitespace or comments) and then run `remake::make()` again. Only the targets that depend on that function and downstream output are recomputed. If you only change whitespace or comments in `code.R`, the next call to `remake::make()` will change nothing, so you can tidy and document your code without triggering unnecessary rebuilds.
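For instance, here is a hypothetical session after the basic example has completed, using the function and target names from the example:

```r
# Suppose you edit the body of poisson_dataset() in code.R. Then:
remake::make()
# Only poisson32, poisson64, and the analyses, summaries, output,
# and plots downstream of them are recomputed; the normal16 targets
# are skipped because they are still up to date.
```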

`workflow.R` is the master plan of the analysis. It arranges the helper functions in `code.R` to

Generate some datasets.

```
library(remakeGenerator)
datasets = commands(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32),
  poisson64 = poisson_dataset(n = 64)
)
```

Analyze each dataset with each of two methods of analysis.
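This step can be sketched with the `analyses()` function and the `..dataset..` wildcard explained later in this vignette; the argument names here are assumptions, so check the generated `workflow.R` for the exact call:

```r
# Apply each of the two analysis methods to every dataset.
# ..dataset.. is a wildcard that analyses() replaces with each
# target in the datasets data frame.
analyses = analyses(
  commands = commands(
    linear = linear_analysis(..dataset..),
    quadratic = quadratic_analysis(..dataset..)
  ),
  datasets = datasets
)
```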

Summarize each analysis of each dataset and gather the summaries into manageable objects.
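This step plausibly uses the `summaries()` function with the `gather` argument described below; again, the exact argument names are assumptions based on the generated example:

```r
# Summarize each analysis of each dataset. ..dataset.. and ..analysis..
# are wildcards, and gather = c("c", "rbind") prepends targets that
# collect the mse and coef summaries into single objects.
summaries = summaries(
  commands = commands(
    mse = mse_summary(..dataset.., ..analysis..),
    coef = coefficients_summary(..analysis..)
  ),
  analyses = analyses,
  datasets = datasets,
  gather = c("c", "rbind")
)
```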

Compute output on the summaries.

```
output = commands(coef.csv = write.csv(coef, target_name))
```

Generate plots.

```
plots = commands(mse.pdf = hist(mse, col = I("black")))
plots$plot = TRUE
```

Compile `knitr` reports.

```
reports = data.frame(
  target = strings(markdown.md, latex.tex),
  depends = c("poisson32, coef, coef.csv", "")
)
reports$knitr = TRUE
```

With these stages of the workflow planned, `workflow.R` collects all the `remake` targets into one YAML-like list.

```
targets = targets(datasets = datasets, analyses = analyses,
summaries = summaries, output = output, plots = plots, reports = reports)
```

Finally, it generates the `remake.yml` file and an overarching Makefile. Then, unless `run = FALSE`, it runs the Makefile to deploy your workflow. In this case, from the `args` argument, you can see that the work is distributed over at most 2 parallel jobs.

```
workflow(targets, sources = "code.R", packages = "MASS",
  remake_args = list(verbose = FALSE),
  prepend = c("# Prepend this", "# to the Makefile."),
  command = "make", args = "--jobs=2")
```

You can run each intermediate stage by itself with the `make_these` argument in `workflow(...)`.

```
workflow(targets, make_these = "summaries",
  sources = "code.R", packages = "MASS", remake_args = list(verbose = FALSE),
  prepend = c("# Prepend this", "# to the Makefile."),
  command = "make", args = "--jobs=2")
```

Bypassing the Makefile with `run = FALSE` and then calling `remake::make()` directly does the same thing in a single R process.

```
workflow(targets, make_these = "summaries", run = FALSE,
  sources = "code.R", packages = "MASS", remake_args = list(verbose = FALSE),
  prepend = c("# Prepend this", "# to the Makefile."),
  command = "make", args = "--jobs=2")
remake::make("summaries")
```

To remove the intermediate files and final results, run

`remake::make("clean")`

At each stage (`datasets`, `analyses`, `summaries`, `mse`, etc.), the user supplies named R commands. The commands are then arranged into a data frame, such as the `datasets` data frame from the basic example.

```
> datasets
target command
1 normal16 normal_dataset(n = 16)
2 poisson32 poisson_dataset(n = 32)
3 poisson64 poisson_dataset(n = 64)
```

Above, each row stands for an individual `remake` target, and the `target` column contains the name of the target. Each command is the R function call that produces its respective target. With the exception of “`target`”, each column of each data frame represents a target-specific field in the `remake.yml` file. If additional fields are needed, just append the appropriate columns to the data frame. In `workflow.R`, the `plot` and `knitr` fields were added this way to the `plots` and `reports` data frames, respectively. Recall from `remake` that setting `plot` to `TRUE` automatically sends the output of the command to a file so you do not have to bother writing the code to save it.

```
> plots
target command plot
1 mse.pdf hist(mse, col = I("black")) TRUE
```

In addition, setting `knitr` to `TRUE` knits `.md` and `.tex` target files from `.Rmd` and `.Rnw` files, respectively.

```
> reports
       target                   depends knitr
1 markdown.md poisson32, coef, coef.csv  TRUE
2   latex.tex                            TRUE
```

Above, and in the general case, each `depends` field is a character string of comma-separated `remake` dependencies. Dependencies that are arguments to commands are automatically resolved and should not be restated in `depends`. However, for `knitr` reports, every dependency must be explicitly given in the `depends` field.

In generating the `analyses` and `summaries` data frames, you may have noticed the `..dataset..` and `..analysis..` symbols. Those are wildcard placeholders indicating that the respective commands will iterate over each dataset and each analysis of each dataset. The `analyses()` function turns

```
> commands(linear = linear_analysis(..dataset..), quadratic = quadratic_analysis(..dataset..))
target command
1 linear linear_analysis(..dataset..)
2 quadratic quadratic_analysis(..dataset..)
```

into

```
target command
1 linear_normal16 linear_analysis(normal16)
2 linear_poisson32 linear_analysis(poisson32)
3 linear_poisson64 linear_analysis(poisson64)
4 quadratic_normal16 quadratic_analysis(normal16)
5 quadratic_poisson32 quadratic_analysis(poisson32)
6 quadratic_poisson64 quadratic_analysis(poisson64)
```

and `summaries(..., gather = NULL)` turns

```
> commands(mse = mse_summary(..dataset.., ..analysis..), coef = coefficients_summary(..analysis..))
target command
1 mse mse_summary(..dataset.., ..analysis..)
2 coef coefficients_summary(..analysis..)
```

into

```
target command
1 mse_linear_normal16 mse_summary(normal16, linear_normal16)
2 mse_linear_poisson32 mse_summary(poisson32, linear_poisson32)
3 mse_linear_poisson64 mse_summary(poisson64, linear_poisson64)
4 mse_quadratic_normal16 mse_summary(normal16, quadratic_normal16)
5 mse_quadratic_poisson32 mse_summary(poisson32, quadratic_poisson32)
6 mse_quadratic_poisson64 mse_summary(poisson64, quadratic_poisson64)
7 coef_linear_normal16 coefficients_summary(linear_normal16)
8 coef_linear_poisson32 coefficients_summary(linear_poisson32)
9 coef_linear_poisson64 coefficients_summary(linear_poisson64)
10 coef_quadratic_normal16 coefficients_summary(quadratic_normal16)
11 coef_quadratic_poisson32 coefficients_summary(quadratic_poisson32)
12 coef_quadratic_poisson64 coefficients_summary(quadratic_poisson64)
```

Setting the `gather` argument in `summaries()` to `c("c", "rbind")` prepends the following two rows to the above data frame.

```
target
1 coef
2 mse
command
1 rbind(coef_linear_normal16 = coef_linear_normal16, coef_linear_poisson32 = coef_linear_poisson32, coef_linear_poisson64 = coef_linear_poisson64, coef_quadratic_normal16 = coef_quadratic_normal16, coef_quadratic_poisson32 = coef_quadratic_poisson32, coef_quadratic_poisson64 = coef_quadratic_poisson64)
2 c(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)
```

These top two rows contain instructions to gather the summaries together into manageable objects. The default value of `gather` is a character vector with entries `"list"`.

The functions `commands()`, `commands_string()`, and `commands_batch()` help create these data frames of targets and commands, as in the previous section.

`commands(x = f(1), y = g(2))`

```
## target command
## 1 x f(1)
## 2 y g(2)
```

```
a = "f(1)"
b = "g(2)"
commands_string(x = a, y = b)
```

```
## target command
## 1 x f(1)
## 2 y g(2)
```

```
batch = c(x = a, y = b)
commands_batch(batch)
```

```
## target command
## 1 x f(1)
## 2 y g(2)
```

When your workflow runs, intermediate objects such as datasets, analyses, and summaries are maintained in `remake`’s hidden `storr` cache, located in the `.remake/objects/` folder. To inspect your workflow, you can list the cached objects using `parallelRemake::recallable()` and load objects using `parallelRemake::recall()`. After running the basic example, we see the following.

```
> library(parallelRemake)
> recallable()
[1] "coef" "coef_linear_normal16"
[3] "coef_linear_poisson32" "coef_linear_poisson64"
[5] "coef_quadratic_normal16" "coef_quadratic_poisson32"
[7] "coef_quadratic_poisson64" "linear_normal16"
[9] "linear_poisson32" "linear_poisson64"
[11] "mse" "mse_linear_normal16"
[13] "mse_linear_poisson32" "mse_linear_poisson64"
[15] "mse_quadratic_normal16" "mse_quadratic_poisson32"
[17] "mse_quadratic_poisson64" "normal16"
[19] "poisson32" "poisson64"
[21] "quadratic_normal16" "quadratic_poisson32"
[23] "quadratic_poisson64"
> recall("normal16")
x y
1 1.5500328 4.226192
2 1.4714371 4.374820
3 0.4906371 6.228053
4 1.0086720 4.945609
5 1.3360642 5.619259
6 1.4899272 4.920836
7 0.7046544 4.926668
8 1.4092923 4.030779
9 2.5636956 6.026149
10 -0.5202316 4.368160
11 0.5540340 4.760691
12 1.6256007 4.722436
13 1.3210316 3.838017
14 0.8247446 2.708511
15 2.7262725 5.878415
16 2.3565342 4.445811
> out = recall("normal16", "poisson32")
> str(out)
List of 2
$ normal16 :'data.frame': 16 obs. of 2 variables:
..$ x: num [1:16] 0.9728 1.0688 1.4152 -0.4313 0.0912 ...
..$ y: num [1:16] 6.76 6.48 5.59 5.03 3.01 ...
$ poisson32:'data.frame': 32 obs. of 2 variables:
..$ x: int [1:32] 0 2 1 0 0 2 1 0 0 1 ...
..$ y: int [1:32] 4 4 5 4 3 7 4 4 5 2 ...
```

The functions `create_bindings()` and `make_environment()` from `remake` itself are alternatives. Just be careful with `create_bindings()` if your project has a lot of data.
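A minimal sketch of these alternatives, assuming the `remake` API and a completed run of the basic example:

```r
# Load the cached objects into a fresh environment for inspection.
e = remake::make_environment()
ls(e)

# Or bind every cached object in the calling environment.
# This can be slow and memory-hungry if the project has a lot of data.
remake::create_bindings()
```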

**Do not use `recall()` or `recallable()` in serious production-level workflows because operations on the `storr` cache are not reproducibly tracked.**

If you want to run Make to distribute tasks over multiple nodes of a Slurm cluster, you should generate a Makefile like the one in this post. To do this, add the following to an R script (say, `my_script.R`).

```
workflow(..., command = "make", args = "--jobs=8",
prepend = c(
"SHELL=srun",
".SHELLFLAGS= <ARGS> bash -c"))
```

where `<ARGS>` stands for additional arguments to `srun`. Then, deploy your parallelized workflow to the cluster using the following Linux command.

`nohup nice -19 R CMD BATCH my_script.R &`

For other task managers such as PBS, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. From R, call

```
workflow(..., command = "make", args = "--jobs=8",
  begin = "SHELL = ./shell.sh")
```

where the file `shell.sh`

contains

```
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
```

Now, in the Linux command line, enable execution with

`chmod +x shell.sh`

and then distribute the work over `[N]` simultaneous jobs with

`nohup nice -19 R CMD BATCH my_script.R &`

The same approach should work for LSF systems, where `make` is replaced by `lsmake` and the Makefile is compatible.

Regardless of the system, be sure that all nodes point to the same working directory so that they share the same `.remake` storr cache. For the Univa Grid Engine, the `-cwd` flag for `qsub` accomplishes this.

You can use the `downsize` package in conjunction with `remakeGenerator`. First, write an R script (say, `downsize.R`) to set test or production mode.

```
# downsize::test_mode()
downsize::production_mode()
```

Load `downsize.R` into `workflow.R` to make your analysis plan respond to `downsize()`.

```
library(remakeGenerator)
source("downsize.R")
datasets = commands_string(
target = "data1",
command = paste0("long_job(number_of_samples = ", downsize(1000, 2), ")")
)
```
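The idea, per the `downsize` interface used above, is that `downsize(a, b)` chooses between a full-size value and a small test value depending on the current mode; a sketch (the exact return values are assumptions about the package's behavior):

```r
library(downsize)

test_mode()
downsize(1000, 2)        # presumably the small test value, 2

production_mode()
downsize(1000, 2)        # presumably the full-size value, 1000
```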

If your custom `code.R` functions call `downsize()` internally, `remake` needs to know.

`workflow(sources = c("downsize.R", "code.R", ...), packages = c("downsize", ...))`

Unfortunately, `remake` does not rebuild targets in response to changes to global options, so you should manually run `remake::make("clean")` to start from scratch whenever you change `downsize.R`.

Some workflows do not fit the rigid structure of the basic example but could still benefit from the automated generation of `remake.yml` files and Makefiles. If you supply the appropriate data frames to the `targets()` function, you can customize your own analyses. Here, the `expand()` and `evaluate()` functions are essential to flexibility. The `expand()` function replicates targets generated by the same commands, and the `evaluate()` function lets you create and evaluate your own wildcard placeholders. With the `rules` argument, the `evaluate()` function is also capable of evaluating multiple wildcards in a single function call. (In this case, `rules` takes precedence, and the `wildcard` and `values` arguments are ignored.) Here are some examples.

```
df = commands(data = simulate(center = MU, scale = SIGMA))
df
```

```
## target command
## 1 data simulate(center = MU, scale = SIGMA)
```

```
df = expand(df, values = c("rep1", "rep2"))
df
```

```
## target command
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
```

`evaluate(df, wildcard = "MU", values = 1:2)`

```
## target command
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
```

`evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)`

```
## target command
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
```

`evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)`

```
## target command
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2 simulate(center = 2, scale = 1)
```

`evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))`

```
## target command
## 1 data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2 data_rep1_1_1 simulate(center = 1, scale = 1)
## 3 data_rep1_1_10 simulate(center = 1, scale = 10)
## 4 data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5 data_rep1_2_1 simulate(center = 2, scale = 1)
## 6 data_rep1_2_10 simulate(center = 2, scale = 10)
## 7 data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8 data_rep2_1_1 simulate(center = 1, scale = 1)
## 9 data_rep2_1_10 simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11 data_rep2_2_1 simulate(center = 2, scale = 1)
## 12 data_rep2_2_10 simulate(center = 2, scale = 10)
```

For another demonstration, see the flexible example, which is almost the same as the basic example except that it uses `expand()` and `evaluate()` explicitly.
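To generate the files for the flexible example, the same `example_remakeGenerator()` mechanism from the basic example should apply (the example name `"flexible"` is assumed to match the listing):

```r
library(remakeGenerator)
example_remakeGenerator("flexible")
# list_examples_remakeGenerator() shows the available example names.
```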