MCMC

One final common task in statistical computing that is relatively easy to parallelize is the estimation of posterior distributions using Markov chain Monte Carlo (MCMC) simulation. Although an individual chain is difficult to parallelize, one can carry out the estimation by running multiple chains, and this is easily accomplished in parallel.

There are several programs (BUGS, JAGS, Stan) that accept a specified model and carry out sampling from the posterior. We will focus here on Stan for several reasons, one of which being that it is easy to install: just run install.packages('rstan') in R.

We will use data from a two-factor study of pilot response to loss of aircraft control carried out in a flight simulator (raw data here). The two factors are type of training that the pilot received and the scenario that was simulated. A reasonable model for this data is the following:

flights.stan
data {
  int n;      // Sample size
  int J;      // Number of scenarios
  int K;      // Number of groups
  int sid[n]; // Scenario ID
  int gid[n]; // Group ID
  int y[n];   // Outcome
}

parameters {
  real mu;                        // Intercept
  real a[J];                      // Scenario effect
  real b[K];                      // Group effect
  real<lower=0,upper=5> sigma[2]; // SD for random effects
}

model {
  // Likelihood
  for (i in 1:n) {
    y[i] ~ bernoulli_logit(mu + a[sid[i]] + b[gid[i]]);
  }

  // Priors
  mu ~ normal(0, 100);
  for (j in 1:J) {
    a[j] ~ normal(0, sigma[1]);
  }
  for (k in 1:K) {
    b[k] ~ normal(0, sigma[2]);
  }
  for (i in 1:2) {
    sigma[i] ~ uniform(0, 5);
  }
}

The following code fits a posterior to the above model model from R (either interactively or not), here specifying that we want 10 chains, each with 10,000 draws (and 2,000 warm-up iterations, so we’re drawing 12,000 samples from each chain, but only keeping the last 10,000) for a total of 100,000 draws across all 10 chains. This is overkill for such a simple model, but useful for demonstrating the HPC cluster.

flights.R
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Data
Data <- read.delim("https://iowabiostat.github.io/hpc/misc/flights.txt")
Data$Scenario <- factor(Data$Scenario)
Data$Group <- factor(Data$Group)
stanData <- list(sid = as.numeric(Data$Scenario),
                 gid = as.numeric(Data$Group),
                 J = length(levels(Data$Scenario)),
                 K = length(levels(Data$Group)),
                 y = Data$Recovered,
                 n = nrow(Data))

# fit
fit <- stan('flights.stan',
            data = stanData,
            iter = 12000,
            warmup = 2000,
            chains = 10)

# Posterior summaries
print(fit)
stan_plot(fit, 'a')  # Scenario effects
stan_plot(fit, 'b')  # Training group effects

The above code can be run either interactively through qlogin or as a batch job with qsub. Either way, you will want to specify the -pe smp 10 flag to ensure that you have 10 processors available (if you only have one processor available, you’ll have to run the 10 chains sequentially, which will of course be about 10 times slower).

For the purposes of this tutorial, I would perhaps suggest running this through qlogin so that you can witness firsthand the chains running in parallel:

$
qlogin -q BIOSTAT -pe smp 10

For models that will take longer to fit, however, you would want to submit a batch job so that you don’t have to sit there and wait for it to finish. In this case, the qsub command would look like:

$
qsub -pe smp 10 -q BIOSTAT "R CMD BATCH --no-save --no-restore flights.R .flights.Rout"

As the summaries inform us, this study indicates that there is little reason to think that type of training had any impact on pilot’s ability to recover aircraft control, although scenario certainly had an effect (some simulated scenarios were much more difficult than others).