The parallelRemake package is a helper add-on for remake, a Makefile-like reproducible build system for R. If you haven’t done so already, go learn remake! You should also familiarize yourself with GNU make enough to use the -j flag. The whole key to this package is that make -j <N> runs a Makefile and distributes the work over <N> simultaneous tasks. The makefile() function takes in a remake.yml file (or collection of collated files), generates a Makefile, and runs it using parallel processes. As explained later in this tutorial, you can launch these processes as formal jobs on a cluster or supercomputer.

Rtools for Windows users

If you’re using the Windows operating system, you will need to install the Rtools package. makefile(run = TRUE) requires it.

Help and troubleshooting

Use the help_parallelRemake() function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page for instructions.

Examples

Run list_examples_parallelRemake() to see the available examples, and run example_parallelRemake() to output an example to your current working directory. To run the basic example,

  1. Call example_parallelRemake("basic") in an R session.
  2. Call setwd("basic") to enter the "basic" folder.
  3. Call makefile() to generate the Makefile and run it. Use makefile(..., command = "make", args = "--jobs=4") to distribute the work over 4 parallel processes.

More on makefile()

Here, the Makefile is more than just a job scheduler. At initialization, it generates dummy files to store timestamps, so it knows which jobs to skip and which jobs to run. The downside is that you cannot run the Makefile on its own. You have to run it through makefile(..., run = TRUE).

The makefile() function has additional arguments. The remake_args argument passes additional arguments to remake::make(). For example, makefile(remake_args = list(verbose = FALSE)) is equivalent to remake::make(..., verbose = F) for each target. You cannot set target_names or remake_file this way. Also, prepend argument writes arbitrary lines of code to the top of the Makefile. As explained in the next section, this is the key to submitting processes as jobs on a cluster.

High-performance computing

If you want to run make -j to distribute tasks over multiple nodes of a Slurm cluster, you should generate a Makefile like the one in this post. To do this, run the following in an R session

makefile(..., command = "make", args = "--jobs=4",
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS= <ARGS> bash -c"))

where <ARGS> stands for additional arguments to srun. To ensure that jobs continue to get submitted after you log out, put the call to makefile() in an R script (say, my_script.R) and run the following in the Linux command line.

nohup nice -19 R CMD BATCH my_script.R &

For other task managers such as PBS, you may have to create a custom stand-in for a shell. For example, suppose we are using the Univa Grid Engine. From R, call

makefile(.., command = "make", args = "--jobs=8",
  prepend = "SHELL = ./shell.sh")

where the file shell.sh contains

#!/bin/bash
shift
echo "module load R; $*" | qsub -sync y -cwd -j y

Now, in the Linux command line, enable execution with

chmod +x shell.sh

As before, write your call to makefile() in an R script and run

nohup nice -19 R CMD BATCH my_script.R &

The same approach should work for LSF systems, where make replaced by lsmake and the Makefile is compatible.

Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache. For the Univa Grid Engine, the -cwd flag for qsub accomplishes this.

Use with the downsize package

The downsize package is compatible with parallelRemake and remakeGenerator workflows, and the remakeGenerator vignette explains how to use these packages together.

Accessing the remake cache for debugging and testing

Intermediate remake objects are maintained in remake’s hidden storr cache. At any point in the workflow, you can reload them using recall(name), where name is a character string denoting the name of a cached object, and you can see the names of all the available objects with the recallable() function. Enter multiple names to return a named list of multiple objects (i.e., recall(name1, name2)). Important: this is only recommended for debugging and testing. Changes to the cache are not tracked and thus not reproducible.

The functions create_bindings() and make_environment() are alternatives from remake itself. Just be careful with create_bindings() if your project has a lot of data.

Multiple remake/YAML files

remake has the option to split the workflow over multiple YAML files and collate them with the “include:” field. If that’s the case, just specify all the root nodes in the remakefiles argument to makefile(). (You could also specify every single YAML file, but that’s tedious.) If needed, makefile() will recursively combine the targets, sources, etc. in the constituent remakefiles and output a new collated YAML file that the master Makefile will then use.