The targets R package enhances the reproducibility, scale, and maintainability of data science projects in computationally intense fields such as machine learning, Bayesian statistics, and statistical genomics (Landau 2021b). Recent breakthroughs in the targets ecosystem make it easy to create ambitious, domain-specific, reproducible data analysis pipelines. Two highlights include Target Markdown, an R Markdown interface to transparently communicate the entire process of pipeline construction and prototyping, and stantargets, a new rOpenSci package that generates specialized workflows for Stan models while reducing the required volume of user-side R code (Landau 2021a).
This report demonstrates Target Markdown and stantargets in a workflow to validate a Bayesian longitudinal linear model common in clinical trial data analysis. Using multiple simulated datasets from the prior predictive distribution of the model, we evaluate how well the model recovers the true parameters drawn from the data-generating process.
The model assumes a normally distributed response variable y, which could be a clinical endpoint, biomarker, or other continuous measure of efficacy or safety. The mean response is expressed in terms of a model matrix and independent model coefficients. Patients are conditionally independent, and different study visits from the same patient are correlated. We apply half-Cauchy priors to the standard deviations (Gelman 2006), and we apply an LKJ prior to the correlations among study visits to avoid the known pathologies of the inverse-Wishart prior (Alvarez, Niemi, and Simpson 2016).
\[ \begin{aligned} & y \sim \text{MVN}(X_{(n \cdot t) \times p} \beta, \ I_{n \times n} \otimes \Sigma_{t \times t} ) \\ & \qquad \beta \sim \text{MVN} (0, 10^2 I_{p \times p})\\ & \qquad \Sigma_{t \times t} = \left (I_{t \times t} \sigma \right ) \Lambda_{t \times t} \Lambda_{t \times t}' \left (I_{t \times t} \sigma \right ) \\ & \qquad \qquad \sigma_1, \ldots, \sigma_t \stackrel{\text{ind}}{\sim} \text{Cauchy}^+(0, 5) \\ & \qquad \qquad \Lambda_{t \times t}\Lambda_{t \times t}' \sim \text{LKJ}(\text{shape} = 1, \text{order} = t) \end{aligned} \]
writeLines(readLines("model.stan"))
#> data {
#> int<lower=0> n_arms; // number of clinical trial study arms, e.g. treatment groups
#> int<lower=0> n_beta; // number of model coefficients beta
#> int<lower=0> n_observations; // number of data points (n_patients * n_visits) including missing values
#> int<lower=0> n_missing; // number of missing data points
#> int<lower=0> n_patients; // number of patients in the clinical trial
#> int<lower=0> n_visits; // number of scheduled site visits / repeated measures per patient
#> int<lower=0> s_beta; // prior standard deviation of the model coefficients.
#> int<lower=0> s_sigma; // prior half-Cauchy scale for the visit-specific standard deviations of the response
#> int<lower=0> missing[n_observations]; // 0-1 vector to indicate which elements of y are missing values
#> int<lower=0> count_missing[n_observations]; // index vector to map missing to an index in y_missing
#> vector[n_observations] y; // clinical endpoint, including missing data points
#> matrix[n_observations, n_beta] x; // model matrix
#> }
#> parameters {
#> vector[n_beta] beta; // vector of model coefficients
#> vector<lower=0>[n_visits] sigma; // standard deviations of the response at each visit
#> cholesky_factor_corr[n_visits] lambda; // Cholesky factor of the correlation matrix among visits within patients
#> vector[n_missing] y_missing; // missing values to impute
#> }
#> model {
#> int first_visit; // first visit/observation of the current patient
#> int last_visit; // last visit/observation of the current patient
#> vector[n_observations] mu = x * beta; // mean response
#> vector[n_observations] y_imputed; // data vector with imputed missing values
#> matrix[n_visits, n_visits] sigma_lambda = diag_pre_multiply(sigma, lambda);
#> for (observation in 1:n_observations){
#> // Impute the missing data:
#> y_imputed[observation] = missing[observation] == 1 ? y_missing[count_missing[observation]] : y[observation];
#> }
#> for (patient in 1:n_patients) { // model each patient separately
#> last_visit = patient * n_visits; // Find the last visit of the current patient.
#> first_visit = last_visit - n_visits + 1; // Find the first visit of the current patient.
#> // Each patient is multivariate normal with the same unstructured covariance:
#> y_imputed[first_visit:last_visit] ~ multi_normal_cholesky(mu[first_visit:last_visit], sigma_lambda);
#> }
#> beta ~ normal(0, s_beta); // independent fixed effects
#> sigma ~ cauchy(0, s_sigma); // diffuse prior on the visit-specific standard deviations
#> lambda ~ lkj_corr_cholesky(1); // LKJ prior on the correlations among visits
#> }
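stantargets fits the model through cmdstanr behind the scenes. Before committing to the full pipeline, it can be worth compiling the Stan file once interactively so that syntax or type errors surface early; a minimal sketch, assuming cmdstanr and CmdStan are installed:

# Compile model.stan once; errors here are much cheaper to fix than mid-pipeline.
model <- cmdstanr::cmdstan_model("model.stan")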
The goal of this validation exercise is to verify that the fitted model recovers the parameter values used to generate the original data. This serves as evidence that the Stan code was implemented correctly. Steps:

1. Draw model parameters from their priors and simulate a dataset from the prior predictive distribution.
2. Fit the model to each simulated dataset and compute 50% and 95% posterior intervals for each parameter.
3. Check how often those intervals cover the parameter draws that generated the data. If the model is implemented correctly, long-run coverage should be close to the nominal 50% and 95% levels.
This technique is based on the concept of calibration explained at https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/ (Carpenter 2017), and it is similar to simulation-based calibration (SBC) (Cook, Gelman, and Rubin 2006; Talts et al. 2020). SBC may be more robust than the interval-based validation technique above, but interval-based validation is currently easier to perform with stantargets and easier to communicate for high-dimensional models.1
This section constructs the pipeline and uses Target Markdown to explain the purpose of each target along the way. In interactive mode in the RStudio IDE, you can run the inexpensive {targets} code chunks in order to emulate a section of the targets pipeline. If you run the report in a non-interactive R session, the {targets} code chunks write scripts to set up the pipeline for a tar_make_clustermq() call later on. When it comes time to actually run the pipeline, call tar_make_clustermq(), either inside an ordinary {r} code chunk or outside the R Markdown report altogether.
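For readers who only see the rendered report, it may help to see what a {targets} chunk looks like in the R Markdown source. A minimal sketch, where the chunk label and target are illustrative and not part of this pipeline:

```{targets example}
tar_target(example_target, simulate_data())
```

A chunk like this defines one or more target objects, while chunks with the tar_globals = TRUE option (such as the package and function chunks below) define global settings and functions instead.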
First, we load the targets package to register the Target Markdown knitr language engine.
library(targets)
Next, we remove any transient scripts previously generated by Target Markdown non-interactively (optional).
tar_unscript()
In our first Target Markdown code chunk, we load the packages required to define the targets in the pipeline: in this case, targets and stantargets.
library(targets)
library(stantargets)
#> Establish _targets.R and _targets_r/globals/packages.R.
Next, we register the packages required to run the pipeline, as well as global settings to control the storage and retrieval of data.
tar_option_set(
packages = c("extraDistr", "tidyverse", "trialr"),
deployment = "main", # Set to "worker" in tar_stan_mcmc_rep_summary().
storage = "worker",
retrieval = "worker",
memory = "transient",
garbage_collection = TRUE
)
#> Establish _targets.R and _targets_r/globals/settings.R.
We also register a Sun Grid Engine (SGE) cluster with the clustermq package to enable distributed computing via tar_make_clustermq(). You can choose a different scheduler or local multicore computing by following the instructions at https://mschubert.github.io/clustermq/articles/userguide.html (Schubert 2019).
options(clustermq.scheduler = "sge", clustermq.template = "sge.tmpl")
#> Establish _targets.R and _targets_r/globals/clustermq.R.
The following function simulates a single clinical trial dataset from the prior predictive distribution of the model. The function returns a Stan data list and accepts arguments n_patients (number of patients in the trial), n_visits (number of scheduled study visits, i.e. repeated measures per patient), and n_arms (number of study arms, e.g. treatment groups). Five randomly chosen response values are marked missing to test the imputation. To mitigate numerical stability issues, the covariance matrix is restricted (via rejection sampling) so that each response standard deviation is less than 8. The .join_data list at the end tells stantargets to append the prior predictive draws of the model parameters to the MCMC output.
simulate_data <- function(n_patients = 50, n_visits = 3, n_arms = 2) {
structure <- tibble(patient = as.character(seq_len(n_patients))) %>%
mutate(arm = as.character(rep(seq_len(n_arms), each = n_patients / n_arms))) %>%
mutate(covariate1 = rnorm(n()), covariate2 = rnorm(n())) %>%
expand_grid(visit = as.character(seq_len(n_visits))) %>%
mutate(group = paste0(arm, visit))
x <- model.matrix(~ covariate1 + covariate2 + group, data = structure) %>%
as.matrix()
beta <- rnorm(n = ncol(x), mean = 0, sd = 2)
sigma <- NULL
while (is.null(sigma) || max(sigma) > 8) {
rho <- rlkjcorr(n = 1, K = n_visits, eta = 1)
lambda <- t(chol(rho))
sigma <- rhcauchy(n = n_visits, sigma = 1)
covariance <- diag(sigma) %*% rho %*% diag(sigma)
}
data <- structure %>%
mutate(x_beta = as.numeric(x %*% beta)) %>%
group_by(patient) %>%
mutate(response = MASS::mvrnorm(n = 1L, mu = x_beta, Sigma = covariance)) %>%
ungroup() %>%
select(response, arm, patient, visit, covariate1, covariate2, group)
which_missing <- sample.int(n = length(data$response), size = 5)
missing <- rep(0L, length(data$response))
missing[which_missing] <- 1L
data$response[which_missing] <- -99999
list(
data = data,
y = data$response,
x = x,
n_arms = n_arms,
n_beta = length(beta),
n_observations = n_patients * n_visits,
n_patients = n_patients,
n_visits = n_visits,
s_beta = 2,
s_sigma = 1,
missing = missing,
count_missing = cumsum(missing),
n_missing = sum(missing),
.join_data = list(
beta = beta,
sigma = sigma,
lambda = lambda
)
)
}
#> Establish _targets.R and _targets_r/globals/data.R.
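As a quick sanity check outside the pipeline, we can call the function interactively and inspect the structure of the returned Stan data list (assuming the chunk above has already been run, e.g. in interactive mode). The small dimensions below are illustrative only, chosen to keep the printout short:

library(extraDistr)
library(tidyverse)
library(trialr)
sim <- simulate_data(n_patients = 4, n_visits = 2, n_arms = 2)
str(sim$y)          # response vector, with -99999 placeholders for the missing values
str(sim$.join_data) # true parameter draws that stantargets joins to the MCMC summaries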
Now, we define the targets for the steps of the workflow. We use the stantargets R package to manage most of the work: simulate 1000 datasets, analyze each dataset with the model, and calculate the coverage status of each simulation. Following the advice at https://books.ropensci.org/targets/dynamic.html#batching, we divide these 1000 replications among 100 batches for computational efficiency. As explained at https://docs.ropensci.org/stantargets/articles/simulation.html, all this can be done with the tar_stan_mcmc_rep_summary() function. The targets created by tar_stan_mcmc_rep_summary() can take a long time to run, so it is recommended to suppress interactive mode in this particular code chunk unless you reduce iter_warmup and iter_sampling to very small values and batches, reps, chains, and parallel_chains each down to 1.
tar_stan_mcmc_rep_summary(
name = mcmc,
stan_files = "model.stan",
data = simulate_data(), # Runs once per rep.
batches = 100, # Number of branch targets.
reps = 10, # Number of model reps per branch target.
chains = 4, # Number of MCMC chains.
parallel_chains = 4, # How many MCMC chains to run in parallel.
iter_warmup = 4e4, # Number of MCMC warmup iterations to run.
iter_sampling = 4e4, # Number of MCMC post-warmup iterations to run.
summaries = list(
# Compute posterior intervals at levels 50% and 95%.
# The 50% intervals should cover prior predictive parameter draws
# 50% of the time. The 95% intervals are similar.
# We also calculate posterior medians so we can compare them
# directly to the prior predictive draws.
~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)),
# We use Gelman-Rubin potential scale reduction factors to
# assess convergence:
rhat = ~posterior::rhat(.x)
),
deployment = "worker"
)
#> Establish _targets.R and _targets_r/targets/mcmc.R.
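If you do want to prototype this chunk interactively, a scaled-down version of the same call finishes quickly. The settings below are illustrative only and are not the settings used for the results in this report:

tar_stan_mcmc_rep_summary(
  name = mcmc,
  stan_files = "model.stan",
  data = simulate_data(),
  batches = 1,          # a single branch target
  reps = 1,             # a single simulated dataset
  chains = 1,
  parallel_chains = 1,
  iter_warmup = 100,    # far too few iterations for real inference
  iter_sampling = 100,
  summaries = list(
    ~posterior::quantile2(.x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)),
    rhat = ~posterior::rhat(.x)
  )
)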
After we fit the model, we assess convergence using univariate Gelman-Rubin potential scale reduction factors (Gelman et al. 2014) on all scalar parameters across all simulations. Below, we report the maximum potential scale reduction factor of each simulation rep with apparent convergence problems (i.e. any factor above 1.01). Because this is a single target and easy to define, we can set the tar_simple chunk option to TRUE. That way, the chunk label becomes the target name, and the chunk code becomes the target command.
tar_target(convergence, {
mcmc %>%
filter(!is.na(rhat) & rhat > 1.01) %>%
group_by(.rep) %>%
summarize(max_rhat = max(rhat), .groups = "drop") %>%
arrange(desc(max_rhat))
})
#> Define target convergence from chunk code.
#> Establish _targets.R and _targets_r/targets/convergence.R.
Our last target calculates coverage averaged over the prior predictive distribution.
tar_target(coverage, {
mcmc %>%
filter(!is.na(rhat), !grepl("lp__|y_missing", variable)) %>%
group_by(variable) %>%
summarize(
coverage_50 = mean(q25 < .join_data & .join_data < q75),
coverage_95 = mean(q2.5 < .join_data & .join_data < q97.5),
.groups = "drop"
)
})
#> Define target coverage from chunk code.
#> Establish _targets.R and _targets_r/targets/coverage.R.
Up to this point, we have only constructed the pipeline. The {targets} chunks write R scripts to _targets.R and _targets_r/ (in non-interactive mode) but they do not actually run the pipeline. To run the pipeline, we use tar_make_clustermq(), either outside the report or in a subsequent ordinary {r} code chunk.
But first, it is good practice to visualize the dependency graph to verify that the workflow is correctly specified. All targets should be connected by graph edges, and they should appear in the correct order in relation to one another.
tar_visnetwork()
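As a text-based complement to the graph, tar_manifest() lists each target and its command in topological order, which is convenient on a headless system or for a quick diff:

tar_manifest(fields = command)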
Because there are so many simulations to run, the pipeline could take a long time. Below, we use tar_make_clustermq() to run different targets on different parallel workers on a Sun Grid Engine (SGE) cluster. Because of the long computation time, if there are any outdated targets according to the graph above, it is recommended to run this report in a non-interactive background process (e.g. via the included shell script run.sh) rather than clicking the “Knit” button in the RStudio IDE. While the pipeline is running, you may wish to monitor the SGE workers with qstat and monitor the progress of the targets using tar_watch() and/or tar_poll().
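For example, from a separate R session in the project directory, either of the following calls reports progress while the pipeline runs (the refresh intervals are illustrative):

tar_watch(seconds = 10, targets_only = TRUE) # Shiny dashboard with progress and graph
tar_poll(interval = 10)                      # plain-text progress summary in the console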
unlink("logs", recursive = TRUE)
tar_make_clustermq(workers = 50, reporter = "silent")
#> Master: [21203.0s 0.1% CPU]; Worker: [avg 2.1% CPU, max 901.9 Mb]
#> Warning message:
#> 100 targets produced warnings. Run tar_meta(fields = warnings) for the messages.
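As the warning suggests, the messages themselves live in the pipeline's metadata. A quick way to inspect them (base R subsetting, so no extra packages need to be attached):

meta <- tar_meta(fields = warnings)
meta[!is.na(meta$warnings), ]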
The following output shows how many targets were skipped, how many ran, and how many had issues.
tar_progress_summary()
#> # A tibble: 1 × 6
#> skipped started built errored canceled since
#> <int> <int> <int> <int> <int> <chr>
#> 1 1 0 204 0 0 0.091 seconds
First, we examine convergence diagnostics. Only a small number of simulation reps have any potential scale reduction factor above 1.01, so convergence issues seem unlikely to affect the coverage statistics.
library(gt)
gt(tar_read(convergence))
| .rep | max_rhat |
|---|---|
| 01cf306e | 1.010922 |
Finally, we report coverage statistics. Coverage does not appear to systematically deviate from nominal: coverage_50 is around 0.5, and coverage_95 is around 0.95.
gt(tar_read(coverage))
| variable | coverage_50 | coverage_95 |
|---|---|---|
| beta[1] | 0.501 | 0.953 |
| beta[2] | 0.498 | 0.953 |
| beta[3] | 0.495 | 0.944 |
| beta[4] | 0.499 | 0.949 |
| beta[5] | 0.500 | 0.957 |
| beta[6] | 0.512 | 0.952 |
| beta[7] | 0.533 | 0.948 |
| beta[8] | 0.506 | 0.955 |
| lambda[2,1] | 0.500 | 0.949 |
| lambda[2,2] | 0.491 | 0.949 |
| lambda[3,1] | 0.488 | 0.953 |
| lambda[3,2] | 0.506 | 0.961 |
| lambda[3,3] | 0.504 | 0.956 |
| sigma[1] | 0.504 | 0.949 |
| sigma[2] | 0.478 | 0.955 |
| sigma[3] | 0.516 | 0.952 |
The results above provide evidence that the model was implemented correctly. With that established, we can move on to the real task at hand: comparing this model with a legacy inverse-Wishart-based model to explore how the choice of prior distribution affects inference on real clinical datasets.
Alvarez, Ignacio, Jarad Niemi, and Matt Simpson. 2016. “Bayesian Inference for a Covariance Matrix.”
Carpenter, Bob. 2017. “Bayesian Posteriors are Calibrated by Definition.” https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/.
Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3). [American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America]:675–92. http://www.jstor.org/stable/27594203.
Gelman, Andrew. 2006. “Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).” Bayesian Analysis 1 (3). International Society for Bayesian Analysis:515–34. https://doi.org/10.1214/06-BA117A.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Edited by Francesca Dominici, Julian J. Faraway, Martin Tanner, and Jim Zidek. Third. CRC Press.
Landau, William Michael. 2021a. “The Stantargets R Package: A Workflow Framework for Efficient Reproducible Stan-Powered Bayesian Data Analysis Pipelines.” Journal of Open Source Software 6 (60). The Open Journal:3193. https://doi.org/10.21105/joss.03193.
———. 2021b. “The Targets R Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57). The Open Journal:2959. https://doi.org/10.21105/joss.02959.
Schubert, Michael. 2019. “clustermq enables efficient parallelization of genomic analyses.” Bioinformatics 35 (21):4493–5. https://doi.org/10.1093/bioinformatics/btz284.
Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.”
Parameter-specific rank statistic histograms can be difficult to inspect when there are many parameters.