MCMC
One final common task in statistical computing that is relatively easy to
parallelize is the estimation of posterior distributions using Markov chain
Monte Carlo (MCMC) simulation. Although an individual chain is difficult to
parallelize, one can carry out the estimation by running multiple chains, and
this is easily accomplished in parallel.
There are several programs (BUGS, JAGS, Stan) that accept a specified model and
carry out sampling from the posterior. We will focus here on Stan for several reasons, one of which being that it is easy to install: just run install.packages('rstan')
in R.
We will use data from a two-factor study of pilot response to loss of aircraft
control carried out in a flight simulator (raw data
here).
The two factors are type of training that the pilot received and the scenario
that was simulated. A reasonable model for this data is the following:
flights.stan
data {
int n; // Sample size
int J; // Number of scenarios
int K; // Number of groups
int sid[n]; // Scenario ID
int gid[n]; // Group ID
int y[n]; // Outcome
}
parameters {
real mu; // Intercept
real a[J]; // Scenario effect
real b[K]; // Group effect
real<lower=0,upper=5> sigma[2]; // SD for random effects
}
model {
// Likelihood
for (i in 1:n) {
y[i] ~ bernoulli_logit(mu + a[sid[i]] + b[gid[i]]);
}
// Priors
mu ~ normal(0, 100);
for (j in 1:J) {
a[j] ~ normal(0, sigma[1]);
}
for (k in 1:K) {
b[k] ~ normal(0, sigma[2]);
}
for (i in 1:2) {
sigma[i] ~ uniform(0, 5);
}
}
The following code fits a posterior to the above model model from R (either interactively or not), here specifying that we want 10 chains, each with 10,000
draws (and 2,000 warm-up iterations, so we’re drawing 12,000 samples from each chain, but only keeping the last 10,000) for a total of 100,000 draws across all 10 chains. This is overkill for such a simple model, but useful for demonstrating the HPC cluster.
flights.R
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# Data
Data <- read.delim("https://iowabiostat.github.io/hpc/misc/flights.txt")
Data$Scenario <- factor(Data$Scenario)
Data$Group <- factor(Data$Group)
stanData <- list(sid = as.numeric(Data$Scenario),
gid = as.numeric(Data$Group),
J = length(levels(Data$Scenario)),
K = length(levels(Data$Group)),
y = Data$Recovered,
n = nrow(Data))
# fit
fit <- stan('flights.stan',
data = stanData,
iter = 12000,
warmup = 2000,
chains = 10)
# Posterior summaries
print(fit)
stan_plot(fit, 'a') # Scenario effects
stan_plot(fit, 'b') # Training group effects
The above code can be run either interactively through qlogin
or as a batch
job with qsub
. Either way, you will want to specify the -pe smp 10
flag to
ensure that you have 10 processors available (if you only have one processor
available, you’ll have to run the 10 chains sequentially, which will of course
be about 10 times slower).
For the purposes of this tutorial, I would perhaps suggest running this through qlogin
so that you can witness firsthand the chains running in parallel:
$
qlogin -q BIOSTAT -pe smp 10
For models that will take longer to fit, however, you would want to submit a batch job so that you don’t have to sit there and wait for it to finish. In this case, the qsub
command would look like:
$
qsub -pe smp 10 -q BIOSTAT "R CMD BATCH --no-save --no-restore flights.R .flights.Rout"
As the summaries inform us, this study indicates that there is little reason to think that type of training had any impact on pilot’s ability to recover aircraft control, although scenario certainly had an effect (some simulated scenarios were much more difficult than others).