The parallelRemake
package is a helper add-on for remake
, a Makefile-like reproducible build system for R. If you haven’t done so already, go learn remake
! You should also familiarize yourself with GNU make enough to use the -j
flag. The whole key to this package is that make -j <N>
runs a Makefile and distributes the work over <N>
simultaneous tasks. The makefile()
function takes in a remake.yml
file (or collection of collated files), generates a Makefile, and runs it using parallel processes. As explained later in this tutorial, you can launch these processes as formal jobs on a cluster or supercomputer.
If you’re using the Windows operating system, you will need to install the Rtools
package. makefile(run = TRUE)
requires it.
Use the help_parallelRemake()
function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page for instructions.
Run list_examples_parallelRemake()
to see the available examples, and run example_parallelRemake()
to output an example to your current working directory. To run the basic example,
example_parallelRemake("basic")
in an R session.setwd("basic")
to enter the "basic"
folder.makefile()
to generate the Makefile
and run it. Use makefile(..., command = "make", args = "--jobs=4")
to distribute the work over 4 parallel processes.makefile()
Here, the Makefile
is more than just a job scheduler. At initialization, it generates dummy files to store timestamps, so it knows which jobs to skip and which jobs to run. The downside is that you cannot run the Makefile
on its own. You have to run it through makefile(..., run = TRUE)
.
The makefile()
function has additional arguments. The remake_args
argument passes additional arguments to remake::make()
. For example, makefile(remake_args = list(verbose = FALSE))
is equivalent to remake::make(..., verbose = F)
for each target. You cannot set target_names
or remake_file
this way. Also, prepend
argument writes arbitrary lines of code to the top of the Makefile
. As explained in the next section, this is the key to submitting processes as jobs on a cluster.
If you want to run make -j
to distribute tasks over multiple nodes of a Slurm cluster, you should generate a Makefile like the one in this post. To do this, run the following in an R session
makefile(..., command = "make", args = "--jobs=4",
prepend = c(
"SHELL=srun",
".SHELLFLAGS= <ARGS> bash -c"))
where <ARGS>
stands for additional arguments to srun
. To ensure that jobs continue to get submitted after you log out, put the call to makefile()
in an R script (say, my_script.R
) and run the following in the Linux command line.
nohup nice -19 R CMD BATCH my_script.R &
For other task managers such as PBS, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. From R, call
makefile(.., command = "make", args = "--jobs=8",
prepend = "SHELL = ./shell.sh")
where the file shell.sh
contains
#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y
Now, in the Linux command line, enable execution with
chmod +x shell.sh
As before, write your call to makefile()
in an R script and run
nohup nice -19 R CMD BATCH my_script.R &
The same approach should work for LSF systems, where make
replaced by lsmake and the Makefile is compatible.
Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake
storr cache. For the Univa Grid Engine, the -cwd
flag for qsub
accomplishes this.
The downsize package is compatible with parallelRemake and remakeGenerator workflows, and the remakeGenerator vignette explains how to use these packages together.
remake
cache for debugging and testingIntermediate remake
objects are maintained in remake
’s hidden storr
cache. At any point in the workflow, you can reload them using recall(name)
, where name
is a character string denoting the name of a cached object, and you can see the names of all the available objects with the recallable()
function. Enter multiple names to return a named list of multiple objects (i.e., recall(name1, name2)
). Important: this is only recommended for debugging and testing. Changes to the cache are not tracked and thus not reproducible.
The functions create_bindings()
and make_environment()
are alternatives from remake
itself. Just be careful with create_bindings()
if your project has a lot of data.
remake
/YAML
filesremake
has the option to split the workflow over multiple YAML
files and collate them with the “include:” field. If that’s the case, just specify all the root nodes in the remakefiles
argument to makefile()
. (You could also specify every single YAML
file, but that’s tedious.) If needed, makefile()
will recursively combine the targets, sources, etc. in the constituent remakefiles
and output a new collated YAML
file that the master Makefile
will then use.