Run all MCMCs on a Sun Grid Engine (SGE) cluster. Different models run in different jobs, and different chains run on different cores.
hbl_mcmc_sge(
data,
response = "response",
study = "study",
study_reference = max(data[[study]]),
group = "group",
group_reference = min(data[[group]]),
patient = "patient",
rep = "rep",
rep_reference = min(data[[rep]]),
covariates = grep("^covariate", colnames(data), value = TRUE),
constraint = FALSE,
s_alpha = 30,
s_delta = 30,
s_beta = 30,
s_sigma = 30,
s_lambda = 1,
s_mu = 30,
s_tau = 30,
covariance_current = "unstructured",
covariance_historical = "unstructured",
control = list(max_treedepth = 17, adapt_delta = 0.99),
log = "/dev/null",
scheduler = "sge",
chains = 1,
cores = chains,
...
)
Tidy data frame with one row per patient per rep, indicator columns for the response variable, study, group, patient, rep, and covariates. All columns must be atomic vectors (e.g. not lists).
Character of length 1,
name of the column in data
with the response/outcome variable.
data[[response]]
must be a continuous variable,
and it should be the change from baseline of a
clinical endpoint of interest, as opposed to just
the raw response. Treatment differences
are computed directly from this scale, please supply
change from baseline unless you are absolutely certain
that treatment differences computed directly from
this quantity are clinically meaningful.
Character of length 1,
name of the column in data
with the study ID.
Atomic of length 1,
element of the study
column that indicates
the current study.
(The other studies are historical studies.)
Character of length 1,
name of the column in data
with the group ID.
Atomic of length 1,
element of the group
column that indicates
the control group.
(The other groups may be treatment groups.)
Character of length 1,
name of the column in data
with the patient ID.
Character of length 1,
name of the column in data
with the rep ID.
Atomic of length 1,
element of the rep
column that indicates
baseline, i.e. the first rep chronologically.
(The other reps may be post-baseline study visits or time points.)
Character vector of column names
in data
with the columns with baseline covariates.
These can be continuous, categorical, or binary.
Regardless, historicalborrowlong
derives the appropriate
model matrix.
Each baseline covariate column must truly be a baseline covariate: elements must be equal for all time points within each patient (after the steps in the "Data processing" section). In other words, covariates must not be time-varying.
A large number of covariates, or a large number of levels in a categorical covariate, can severely slow down the computation. Please consider carefully if you really need to include such complicated baseline covariates.
Logical of length 1, whether to pool all study arms at baseline (first rep). Appropriate when the response is the raw response (as opposed to change from baseline) and the first rep (i.e. time point) is prior to treatment.
Numeric of length 1, prior standard deviation
of the study-specific control group mean parameters alpha
.
Numeric of length 1, prior standard deviation
of the study-by-group effect parameters delta
.
Numeric of length 1, prior standard deviation
of the fixed effects beta
.
Numeric of length 1, prior upper bound of the residual standard deviations.
shape parameter of the LKJ priors on the unstructured correlation matrices.
Numeric of length 1,
prior standard deviation of mu
.
Non-negative numeric of length 1.
If prior_tau
is "half_t"
, then s_tau
is the scale parameter of
the Student t prior of tau
and analogous to the sigma
parameter of
the Student-t parameterization given at
https://mc-stan.org/docs/functions-reference/unbounded_continuous_distributions.html. # nolint
If prior_tau
is "uniform"
, then s_tau
is the upper bound of tau
.
Upper bound on tau
if prior_tau
is "uniform"
.
Character of length 1,
covariance structure of the current study.
Possible values are "unstructured"
for fully parameterized
covariance matrices, "ar1"
for AR(1) covariance matrices,
and "diagonal"
for residuals independent across time within
each patient. In MCMC (e.g. hbl_mcmc_hierarchical()
),
the covariance structure affects computational speed.
Unstructured covariance is slower than AR(1), and AR(1) is slower
than diagonal. This is particularly true for covariance_historical
if there are many historical studies in the data.
Same as covariance_current
,
but for the covariance structure of each separate historical study.
Each historical study has its own separate covariance matrix.
A named list
of parameters to control the sampler's
behavior. It defaults to NULL
so all the default values are used.
First, the following are adaptation parameters for sampling algorithms.
These are parameters used in Stan with similar names here.
adapt_engaged
(logical
)
adapt_gamma
(double
, positive, defaults to 0.05)
adapt_delta
(double
, between 0 and 1, defaults to 0.8)
adapt_kappa
(double
, positive, defaults to 0.75)
adapt_t0
(double
, positive, defaults to 10)
adapt_init_buffer
(integer
, positive, defaults to 75)
adapt_term_buffer
(integer
, positive, defaults to 50)
adapt_window
(integer
, positive, defaults to 25)
In addition, algorithm HMC (called 'static HMC' in Stan) and NUTS share the following parameters:
stepsize
(double
, positive, defaults to 1)
Note: this controls the initial stepsize only, unless adapt_engaged=FALSE
.
stepsize_jitter
(double
, [0,1], defaults to 0)
metric
(string
, one of "unit_e", "diag_e", "dense_e",
defaults to "diag_e")
For algorithm NUTS, we can also set:
max_treedepth
(integer
, positive, defaults to 10)
For algorithm HMC, we can also set:
int_time
(double
, positive)
For test_grad
mode, the following parameters can be set:
epsilon
(double
, defaults to 1e-6)
error
(double
, defaults to 1e-6)
Character of length 1, path to a directory (with a trailing /
)
or a single file path. The SGE log files go here. Only works if
scheduler
is "sge"
.
Either "sge"
or "local"
, high-performance computing
scheduler / resource manager to use. Choose "sge"
for serious use cases
with a Sun Grid Engine (SGE) cluster. Otherwise, to run models
sequentially on the current node, choose "local"
.
A positive integer specifying the number of Markov chains. The default is 4.
The number of cores to use when executing the Markov chains in parallel.
The default is to use the value of the "mc.cores"
option if it
has been set and otherwise to default to 1 core. However, we recommend
setting it to be as many processors as the hardware and RAM allow
(up to the number of chains). See detectCores
if you don't know this number for your system.
Other optional parameters:
chain_id
(integer
)
init_r
(double
, positive)
test_grad
(logical
)
append_samples
(logical
)
refresh
(integer
)
save_warmup
(logical
)
deprecated: enable_random_init
(logical
)
chain_id
can be a vector to specify the chain_id for all
chains or an integer. For the former case, they should be unique.
For the latter, the sequence of integers starting from the given
chain_id
are used for all chains.
init_r
is used only for generating random initial values,
specifically when init="random"
or not all parameters
are initialized in the user-supplied list or function. If specified,
the initial values are simulated uniformly from interval
[-init_r
, init_r
] rather than using the default interval
(see the manual of (cmd)Stan).
test_grad
(logical
).
If test_grad=TRUE
, Stan will not do any sampling. Instead,
the gradient calculation is tested and printed out and the fitted
stanfit
object is in test gradient mode. By default, it is
FALSE
.
append_samples
(logical
).
Only relevant if sample_file
is specified and is an
existing file. In that case, setting append_samples=TRUE
will append
the samples to the existing file rather than overwriting the contents of
the file.
refresh
(integer
) can be used to
control how often the progress of the sampling is reported (i.e.
show the progress every refresh
iterations).
By default, refresh = max(iter/10, 1)
.
The progress indicator is turned off if refresh <= 0
.
Deprecated: enable_random_init
(logical
) being TRUE
enables specifying initial values randomly when the initial
values are not fully specified from the user.
save_warmup
(logical
) indicates whether to
save draws during the warmup phase and defaults to TRUE
. Some
memory related problems can be avoided by setting it to FALSE
,
but some diagnostics are more limited if the warmup draws are not
stored.
A list of tidy data frames of parameter samples from the
posterior distribution.
Columns .chain
, .iteration
,
and .draw
have the meanings documented in the
posterior
package.
Before running the MCMC, dataset is pre-processed. This includes expanding the rows of the data so every rep of every patient gets an explicit row. So if your original data has irregular rep IDs, e.g. unscheduled visits in a clinical trial that few patients attend, please remove them before the analysis. Only the most common rep IDs should be added.
After expanding the rows, the function fills in missing values for every column except the response. That includes covariates. Missing covariate values are filled in, first with last observation carried forward, then with last observation carried backward. If there are still missing values after this process, the program throws an informative error.
Other mcmc:
hbl_convergence()
,
hbl_mcmc_hierarchical()
,
hbl_mcmc_independent()
,
hbl_mcmc_pool()
if (identical(Sys.getenv("HBL_SGE"), "true")) {
if (!identical(Sys.getenv("HBL_TEST", unset = ""), "")) {
set.seed(0)
data <- hbl_sim_hierarchical(
n_study = 2,
n_group = 2,
n_patient = 5,
n_rep = 3
)$data
tmp <- utils::capture.output(
suppressWarnings(
mcmc <- hbl_mcmc_sge(
data,
chains = 2,
warmup = 10,
iter = 20,
seed = 0,
scheduler = "local" # change to "sge" for serious runs
)
)
)
mcmc
}
}