Customization
One great thing about the Linux command line is how easy it is to customize. For example, submitting an array of R jobs like the one on the previous page is something I (PB) do all the time, so I simply added the following commands to ~/.local/bin/:
r_job
#!/bin/bash
# Run the R script given as the first argument, passing the task ID and
# last task index through to R; output is logged to a hidden .log file
R CMD BATCH --no-save --no-restore "--args $SGE_TASK_ID $SGE_TASK_LAST" $1 .$1.log
r_batch
#!/bin/bash
# Submit an R script via r_job: a single job if given one argument,
# or an array of jobs if also given a task count
if [[ $# == 1 ]]; then
  qsub -cwd -e ~/err -o ~/out -q BIOSTAT -b y ~/.local/bin/r_job $1
elif [[ $# == 2 ]]; then
  qsub -cwd -e ~/err -o ~/out -q BIOSTAT -b y -t 1-$2 ~/.local/bin/r_job $1
else
  echo "r_batch takes either 1 or 2 arguments"
  exit 1
fi
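For example, with sim.R in the current directory, the first command below submits a single job and the second submits a 15-task array (15 is just for illustration; any count works):

r_batch sim.R
r_batch sim.R 15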
Now we can submit arrays of R jobs from the command line without writing any extra scripts like sim and batch-sim (which we will do shortly).
Furthermore, it’s extremely useful to have modular, versatile code. In
particular, I don’t like writing R code, then rewriting R code to run on the
cluster, then re-re-writing it if I want to run it again on my machine. All of
these rewrites are (a) annoying and (b) an opportunity to make a mistake. For
example, comparing the versions of sim.R here and here, you’ll note that one of them works in an interactive session but not a batch session, while the other works in a batch session but won’t run in an interactive session. To avoid switching between the two, I use a pair of simple tools:
- bsave(): An R function, called within your script, that saves results with a structured naming convention so that they are easy to combine later (a rough sketch of how it works appears after this list). To make it available, add source("bsave.r") to your ~/.Rprofile file (specifying a path as necessary, depending on where you save the file).
- gather: A command-line script that combines multiple files (one from each job) into a single file containing all the results (to do the gathering, it uses the abind package, which you will have to install). To make it available, save it to the directory where you keep your executable files (~/.local/bin, if you’re following the naming convention used in this tutorial) and make sure it is executable (chmod u+x gather).
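To give a sense of what bsave() does, here is a minimal sketch; the actual bsave.r may differ in details such as the exact file names and suffix handling:

bsave <- function(obj, suffix='') {
  # r_job passes SGE_TASK_ID and SGE_TASK_LAST to R via --args;
  # SGE sets SGE_TASK_ID to "undefined" for non-array jobs
  args <- commandArgs(trailingOnly=TRUE)
  if (length(args) == 0 || args[1] == 'undefined') {
    # interactive (or single-job) session: file named with today's date
    fname <- paste0(Sys.Date(), suffix, '.rds')
  } else {
    # batch array job: one tmp file per task
    fname <- paste0('tmp', args[1], suffix, '.rds')
  }
  saveRDS(obj, fname)
}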
To see why these tools are useful, let’s watch them in action by rewriting sim.R one last time:
sim.R
N <- 10000   # number of simulated data sets
p <- numeric(N)
n <- 10      # sample size per group
for (i in 1:N) {
  x <- rnorm(n)
  y <- rnorm(n, sd=3)
  p[i] <- t.test(x, y, var.equal=TRUE)$p.value
}
bsave(p)     # save using the convention described above
If we run this in an interactive session of R, p will be saved in a file named with today’s date. When we run it non-interactively on the cluster with

r_batch sim.R 15

the results are saved in tmp1.rds, tmp2.rds, and so on. To combine them, submit:

gather
Now all the tmp and log files are gone and we are left with a single file, 2021-06-03.rds (or whatever the date is). If you load it into R, p <- readRDS('2021-06-03.rds'), you’ll see that it contains all 150,000 results. This is about as low a barrier as you can hope for, short of running jobs on multiple processors from within R itself: no need to modify any code or to write any scripts; just run r_batch and then gather when you’re done.
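For the curious, the heart of a gather-style script might look roughly like the sketch below; the real gather is more general (handling suffixes, lists, log-file cleanup, and so on):

#!/usr/bin/env Rscript
# Rough sketch of gathering per-job results into one file
library(abind)
files <- list.files(pattern='^tmp[0-9]+\\.rds$')
pieces <- lapply(files, readRDS)
results <- do.call(abind, c(pieces, along=1))  # merge along the first dimension
saveRDS(results, paste0(Sys.Date(), '.rds'))
file.remove(files)  # clean up the per-job tmp files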
In this particular example, each saved result was a vector (one scalar per simulation), but bsave()/gather work for an array of any dimensions (under the assumption that the first dimension is the one we’re merging on) as well as on lists. You can also add a suffix (e.g., bsave(p, 'a')) if, for example, you want to save different results in different files; see the comments in bsave.r for documentation and examples.
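For instance, here is a hypothetical variation on sim.R in which each job saves an N x 2 matrix of p-values (the 'a' suffix is arbitrary); after gathering, you would have one combined matrix with a row per simulation across all jobs:

N <- 10000
n <- 10
# two p-values per simulation: equal-variance and Welch t tests
p <- matrix(NA, N, 2, dimnames=list(NULL, c('equal', 'welch')))
for (i in 1:N) {
  x <- rnorm(n)
  y <- rnorm(n, sd=3)
  p[i,1] <- t.test(x, y, var.equal=TRUE)$p.value
  p[i,2] <- t.test(x, y)$p.value
}
bsave(p, 'a')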
This has just been an illustration of some personal things I’ve done to customize Argon and make transitioning code to and from Argon go smoothly. You’re of course welcome to use these tools yourself, further customize them to your own needs, or take a completely different approach. The main point I want to get across is that writing helper scripts to make your work easier is a very powerful habit.