The remakeGenerator package is a helper add-on for remake, a Makefile-like reproducible build system for R. If you haven’t done so already, go learn remake! Once you do that, you will be ready to use remakeGenerator. With remakeGenerator, your long and cumbersome workflows become easy to deploy with remake::make() or GNU Make. Thanks to remake, whenever you change your code, your next computation will only run the parts that are new or out of date. The remakeGenerator package accomplishes this by generating YAML files for remake that would be too big to type manually.
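For a sense of what gets generated, here is a hypothetical excerpt of such a remake.yml in remake’s YAML format, using a target from the basic example below.
sources:
  - code.R
targets:
  all:
    depends:
      - normal16
  normal16:
    command: normal_dataset(n = 16)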
Windows users may need Rtools to take full advantage of remakeGenerator’s features, specifically to run Makefiles with system("make").
Use the help_remakeGenerator() function to obtain a collection of helpful links. For troubleshooting, refer to TROUBLESHOOTING.md on the GitHub page.
Write the files for the basic example using
library(remakeGenerator)
example_remakeGenerator("basic")
# list_examples_remakeGenerator() # Shows the names of available examples.
Run workflow.R to produce the remake file remake.yml and an overarching Makefile, and to run the workflow using 2 parallel processes.
source("workflow.R")
To use remake directly in a single process, use
workflow(..., run = FALSE)
remake::make()
Do not call the Makefile directly from the Linux command line. As explained in the parallelRemake vignette, you must use workflow(..., command = "make", args = "--jobs=2"), parallelRemake::makefile(..., command = "make", args = "--jobs=4"), etc. parallelRemake uses a quick overhead step to configure hidden files for the Makefile before running it.
Notice how workflow.R and remake.yml rely on the functions defined in code.R. To see how remakeGenerator saves you time, change the body of one of these functions (something more significant than whitespace or comments) and then run remake::make() again. Only the targets that depend on that function, along with everything downstream, are recomputed. If you only change whitespace or comments in code.R, the next call to remake::make() will change nothing, so you can tidy and document your code without triggering unnecessary rebuilds.
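For instance, a minimal edit-and-rebuild cycle might look like this, assuming you edit normal_dataset(), one of the functions in code.R.
# After changing the body of normal_dataset() in code.R:
remake::make() # rebuilds normal16 and everything downstream of it; all other targets are skipped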
workflow.R
workflow.R is the master plan of the analysis. It arranges the helper functions in code.R to
Generate some datasets.
library(remakeGenerator)
datasets = commands(
normal16 = normal_dataset(n = 16),
poisson32 = poisson_dataset(n = 32),
poisson64 = poisson_dataset(n = 64)
)
Analyze each dataset with each of two methods of analysis.
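A sketch of this stage, assuming analyses() accepts the commands data frame along with the datasets data frame from above; linear_analysis() and quadratic_analysis() are helper functions from code.R, and analyses() and the ..dataset.. wildcard are explained later in this vignette.
analyses = analyses(
  commands = commands(
    linear = linear_analysis(..dataset..),
    quadratic = quadratic_analysis(..dataset..)),
  datasets = datasets)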
Summarize each analysis of each dataset and gather the summaries into manageable objects.
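Similarly, a sketch of this stage, assuming summaries() accepts a commands data frame plus the analyses and datasets data frames from the previous stages; the gather argument is described later in this vignette.
summaries = summaries(
  commands = commands(
    mse = mse_summary(..dataset.., ..analysis..),
    coef = coefficients_summary(..analysis..)),
  analyses = analyses,
  datasets = datasets,
  gather = c("c", "rbind"))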
Compute output on the summaries.
output = commands(coef.csv = write.csv(coef, target_name))
Generate plots.
plots = commands(mse.pdf = hist(mse, col = I("black")))
plots$plot = TRUE
Compile knitr reports.
reports = data.frame(target = strings(markdown.md, latex.tex),
depends = c("poisson32, coef, coef.csv", ""))
reports$knitr = TRUE
With these stages of the workflow planned, workflow.R collects all the remake targets into one YAML-like list.
targets = targets(datasets = datasets, analyses = analyses,
summaries = summaries, output = output, plots = plots, reports = reports)
Finally, it generates the remake.yml file and an overarching Makefile. Then, unless run = FALSE, it runs the Makefile to deploy your workflow. In this case, as the command and args arguments show, the work is distributed over at most 2 parallel jobs.
workflow(targets, sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make",
args = "--jobs=2")
You can run intermediate stages by themselves with the make_these argument of workflow().
workflow(targets, make_these = "summaries",
sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
Bypassing the Makefile with run = FALSE and then calling remake::make() directly does the same thing in a single R process.
workflow(targets, make_these = "summaries", run = FALSE,
sources = "code.R", packages = "MASS", remake_args = list(verbose = F),
prepend = c("# Prepend this", "# to the Makefile."), command = "make", args = "--jobs=2")
remake::make("summaries")
To remove the intermediate files and final results, run
remake::make("clean")
At each stage (datasets, analyses, summaries, mse, etc.), the user supplies named R commands. The commands are then arranged into a data frame, such as the datasets data frame from the basic example.
> datasets
target command
1 normal16 normal_dataset(n = 16)
2 poisson32 poisson_dataset(n = 32)
3 poisson64 poisson_dataset(n = 64)
Above, each row stands for an individual remake target, and the target column contains the name of the target. Each command is the R function call that produces its respective target. With the exception of “target”, each column of each data frame represents a target-specific field in the remake.yml file. If additional fields are needed, just append the appropriate columns to the data frame. In workflow.R, the plot and knitr fields were added this way to the plots and reports data frames, respectively. Recall from remake that setting plot to TRUE automatically sends the output of the command to a file, so you do not have to write the code to save it.
> plots
   target                     command plot
1 mse.pdf hist(mse, col = I("black")) TRUE
In addition, setting knitr to TRUE knits .md and .tex target files from .Rmd and .Rnw source files, respectively.
> reports
       target                   depends knitr
1 markdown.md poisson32, coef, coef.csv  TRUE
2   latex.tex                            TRUE
Above, and in the general case, each depends field is a character string of comma-separated remake dependencies. Dependencies that are arguments to commands are automatically resolved and should not be restated in depends. However, for knitr reports, every dependency must be explicitly given in the depends field.
In generating the analyses and summaries data frames, you may have noticed the ..dataset.. and ..analysis.. symbols. These are wildcard placeholders indicating that the respective commands will iterate over each dataset and each analysis of each dataset. The analyses() function turns
> commands(linear = linear_analysis(..dataset..), quadratic = quadratic_analysis(..dataset..))
target command
1 linear linear_analysis(..dataset..)
2 quadratic quadratic_analysis(..dataset..)
into
target command
1 linear_normal16 linear_analysis(normal16)
2 linear_poisson32 linear_analysis(poisson32)
3 linear_poisson64 linear_analysis(poisson64)
4 quadratic_normal16 quadratic_analysis(normal16)
5 quadratic_poisson32 quadratic_analysis(poisson32)
6 quadratic_poisson64 quadratic_analysis(poisson64)
and summaries(..., gather = NULL) turns
> commands(mse = mse_summary(..dataset.., ..analysis..), coef = coefficients_summary(..analysis..))
target command
1 mse mse_summary(..dataset.., ..analysis..)
2 coef coefficients_summary(..analysis..)
into
target command
1 mse_linear_normal16 mse_summary(normal16, linear_normal16)
2 mse_linear_poisson32 mse_summary(poisson32, linear_poisson32)
3 mse_linear_poisson64 mse_summary(poisson64, linear_poisson64)
4 mse_quadratic_normal16 mse_summary(normal16, quadratic_normal16)
5 mse_quadratic_poisson32 mse_summary(poisson32, quadratic_poisson32)
6 mse_quadratic_poisson64 mse_summary(poisson64, quadratic_poisson64)
7 coef_linear_normal16 coefficients_summary(linear_normal16)
8 coef_linear_poisson32 coefficients_summary(linear_poisson32)
9 coef_linear_poisson64 coefficients_summary(linear_poisson64)
10 coef_quadratic_normal16 coefficients_summary(quadratic_normal16)
11 coef_quadratic_poisson32 coefficients_summary(quadratic_poisson32)
12 coef_quadratic_poisson64 coefficients_summary(quadratic_poisson64)
Setting the gather argument of summaries() to c("c", "rbind") prepends the following two rows to the above data frame.
target
1 coef
2 mse
command
1 rbind(coef_linear_normal16 = coef_linear_normal16, coef_linear_poisson32 = coef_linear_poisson32, coef_linear_poisson64 = coef_linear_poisson64, coef_quadratic_normal16 = coef_quadratic_normal16, coef_quadratic_poisson32 = coef_quadratic_poisson32, coef_quadratic_poisson64 = coef_quadratic_poisson64)
2 c(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)
These top two rows contain instructions to gather the summaries together into manageable objects. The default value of gather is a character vector with entries "list".
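For comparison, a sketch of what the default gather would prepend instead: the command for the gathered mse target would use list() rather than c().
list(mse_linear_normal16 = mse_linear_normal16, mse_linear_poisson32 = mse_linear_poisson32, mse_linear_poisson64 = mse_linear_poisson64, mse_quadratic_normal16 = mse_quadratic_normal16, mse_quadratic_poisson32 = mse_quadratic_poisson32, mse_quadratic_poisson64 = mse_quadratic_poisson64)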
The functions commands(), commands_string(), and commands_batch() all help create data frames of targets and commands like the ones in the previous section.
commands(x = f(1), y = g(2))
## target command
## 1 x f(1)
## 2 y g(2)
a = "f(1)"
b = "g(2)"
commands_string(x = a, y = b)
## target command
## 1 x f(1)
## 2 y g(2)
batch = c(x = a, y = b)
commands_batch(batch)
## target command
## 1 x f(1)
## 2 y g(2)
When your workflow runs, intermediate objects such as datasets, analyses, and summaries are maintained in remake’s storr cache, located in the hidden .remake/objects/ folder. To inspect your workflow, you can list the generated objects with parallelRemake::recallable() and load objects with parallelRemake::recall(). After running the basic example, we see the following.
> library(parallelRemake)
> recallable()
[1] "coef" "coef_linear_normal16"
[3] "coef_linear_poisson32" "coef_linear_poisson64"
[5] "coef_quadratic_normal16" "coef_quadratic_poisson32"
[7] "coef_quadratic_poisson64" "linear_normal16"
[9] "linear_poisson32" "linear_poisson64"
[11] "mse" "mse_linear_normal16"
[13] "mse_linear_poisson32" "mse_linear_poisson64"
[15] "mse_quadratic_normal16" "mse_quadratic_poisson32"
[17] "mse_quadratic_poisson64" "normal16"
[19] "poisson32" "poisson64"
[21] "quadratic_normal16" "quadratic_poisson32"
[23] "quadratic_poisson64"
> recall("normal16")
x y
1 1.5500328 4.226192
2 1.4714371 4.374820
3 0.4906371 6.228053
4 1.0086720 4.945609
5 1.3360642 5.619259
6 1.4899272 4.920836
7 0.7046544 4.926668
8 1.4092923 4.030779
9 2.5636956 6.026149
10 -0.5202316 4.368160
11 0.5540340 4.760691
12 1.6256007 4.722436
13 1.3210316 3.838017
14 0.8247446 2.708511
15 2.7262725 5.878415
16 2.3565342 4.445811
> out = recall("normal16", "poisson32")
> str(out)
List of 2
$ normal16 :'data.frame': 16 obs. of 2 variables:
..$ x: num [1:16] 0.9728 1.0688 1.4152 -0.4313 0.0912 ...
..$ y: num [1:16] 6.76 6.48 5.59 5.03 3.01 ...
$ poisson32:'data.frame': 32 obs. of 2 variables:
..$ x: int [1:32] 0 2 1 0 0 2 1 0 0 1 ...
..$ y: int [1:32] 4 4 5 4 3 7 4 4 5 2 ...
The functions create_bindings() and make_environment() from remake itself are alternatives. Just be careful with create_bindings() if your project has a lot of data.
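A sketch of those alternatives, assuming remake’s default arguments:
e = remake::make_environment() # an environment containing the built targets
ls(e)                          # lists targets such as "coef", "mse", and "normal16"
remake::create_bindings()      # creates bindings to the targets in your workspace instead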
Do not use recall() or recallable() in serious production-level workflows, because operations on the storr cache are not reproducibly tracked.
If you want to run Make to distribute tasks over multiple nodes of a Slurm cluster, you should generate a Makefile like the one in this post. To do this, add the following to an R script (say, my_script.R)
workflow(..., command = "make", args = "--jobs=8",
prepend = c(
"SHELL=srun",
".SHELLFLAGS= <ARGS> bash -c"))
where <ARGS> stands for additional arguments to srun. Then deploy your parallelized workflow to the cluster using the following Linux command.
nohup nice -19 R CMD BATCH my_script.R &
For other task managers such as PBS, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. From R, call
workflow(..., command = "make", args = "--jobs=8",
begin = "SHELL = ./shell.sh")
where the file shell.sh contains
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
Now, in the Linux command line, enable execution with
chmod +x shell.sh
and then distribute the work over [N]
simultaneous jobs with
nohup nice -19 R CMD BATCH my_script.R &
The same approach should work for LSF systems, with make replaced by lsmake, provided the Makefile is compatible.
Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache. For the Univa Grid Engine, the -cwd flag for qsub accomplishes this.
You can use the downsize package in conjunction with remakeGenerator. First, write an R script (say, downsize.R) to set test or production mode.
# downsize::test_mode()
downsize::production_mode()
Source downsize.R inside workflow.R so that your analysis plan responds to downsize(). (A call downsize(a, b) returns a in production mode and b in test mode.)
library(remakeGenerator)
source("downsize.R")
datasets = commands_string(
data1 = paste0("long_job(number_of_samples = ", downsize(1000, 2), ")")
)
If your custom functions in code.R call downsize() internally, remake needs to know.
workflow(sources = c("downsize.R", "code.R", ...), packages = c("downsize", ...))
Unfortunately, remake does not rebuild targets in response to changes in global options, so you should manually run remake::make("clean") to start from scratch whenever you change downsize.R.
Some workflows do not fit the rigid structure of the basic example but could still benefit from the automated generation of remake.yml files and Makefiles. If you supply the appropriate data frames to the targets() function, you can customize your own analyses. Here, the expand() and evaluate() functions are essential for flexibility. The expand() function replicates targets generated by the same commands, and the evaluate() function lets you create and evaluate your own wildcard placeholders. With the rules argument, the evaluate() function can also evaluate multiple wildcards in a single function call. (In this case, rules takes precedence, and the wildcard and values arguments are ignored.) Here are some examples.
df = commands(data = simulate(center = MU, scale = SIGMA))
df
## target command
## 1 data simulate(center = MU, scale = SIGMA)
df = expand(df, values = c("rep1", "rep2"))
df
## target command
## 1 data_rep1 simulate(center = MU, scale = SIGMA)
## 2 data_rep2 simulate(center = MU, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2)
## target command
## 1 data_rep1_1 simulate(center = 1, scale = SIGMA)
## 2 data_rep1_2 simulate(center = 2, scale = SIGMA)
## 3 data_rep2_1 simulate(center = 1, scale = SIGMA)
## 4 data_rep2_2 simulate(center = 2, scale = SIGMA)
evaluate(df, wildcard = "MU", values = 1:2, expand = FALSE)
## target command
## 1 data_rep1 simulate(center = 1, scale = SIGMA)
## 2 data_rep2 simulate(center = 2, scale = SIGMA)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1)), expand = FALSE)
## target command
## 1 data_rep1 simulate(center = 1, scale = 0.1)
## 2 data_rep2 simulate(center = 2, scale = 1)
evaluate(df, rules = list(MU = 1:2, SIGMA = c(0.1, 1, 10)))
## target command
## 1 data_rep1_1_0.1 simulate(center = 1, scale = 0.1)
## 2 data_rep1_1_1 simulate(center = 1, scale = 1)
## 3 data_rep1_1_10 simulate(center = 1, scale = 10)
## 4 data_rep1_2_0.1 simulate(center = 2, scale = 0.1)
## 5 data_rep1_2_1 simulate(center = 2, scale = 1)
## 6 data_rep1_2_10 simulate(center = 2, scale = 10)
## 7 data_rep2_1_0.1 simulate(center = 1, scale = 0.1)
## 8 data_rep2_1_1 simulate(center = 1, scale = 1)
## 9 data_rep2_1_10 simulate(center = 1, scale = 10)
## 10 data_rep2_2_0.1 simulate(center = 2, scale = 0.1)
## 11 data_rep2_2_1 simulate(center = 2, scale = 1)
## 12 data_rep2_2_10 simulate(center = 2, scale = 10)
For another demonstration, see the flexible example, which is almost the same as the basic example except that it uses expand() and evaluate() explicitly.
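To try it, fetch it the same way as the basic example, assuming it is listed under the name "flexible" by list_examples_remakeGenerator().
example_remakeGenerator("flexible")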