Customization
One great thing about the Linux command line is how easy it is to customize. For example, submitting an array of R jobs like the one on the previous page is something I (PB) do all the time, so I simply added the following commands to ~/.local/bin/:
r_job
#!/bin/bash
# Run the R script given as the first argument, passing the task ID and
# last task index through to R; output is logged to a hidden .log file
R CMD BATCH --no-save --no-restore "--args $SGE_TASK_ID $SGE_TASK_LAST" $1 .$1.log
r_batch
#!/bin/bash
# Submit an R script via r_job: a single job if given one argument,
# or an array of jobs if also given a task count
if [[ $# == 1 ]]; then
  qsub -cwd -e ~/err -o ~/out -q BIOSTAT -b y ~/.local/bin/r_job $1
elif [[ $# == 2 ]]; then
  qsub -cwd -e ~/err -o ~/out -q BIOSTAT -b y -t 1-$2 ~/.local/bin/r_job $1
else
  echo "r_batch takes either 1 or 2 arguments"
  exit 1
fi
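For example, with sim.R in the current directory, the first command below submits a single job and the second submits a 15-task array (15 is just for illustration; any count works):

r_batch sim.R
r_batch sim.R 15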
Now we can submit arrays of R jobs from the command line without writing any extra scripts like sim and batch-sim (which we will do shortly).
Furthermore, it’s extremely useful to have modular, versatile code. In
particular, I don’t like writing R code, then rewriting R code to run on the
cluster, then re-re-writing it if I want to run it again on my machine. All of
these rewrites are (a) annoying and (b) an opportunity to make a mistake. For
example, comparing the versions of sim.R here and here, you’ll note that one of them works in an interactive session but not a batch session, while the other works in a batch session but won’t run in an interactive session. To avoid switching between the two, I use a pair of simple tools:
- bsave(): An R function, called within your script, that saves results with a structured naming convention so that they are easy to combine later (a rough sketch of how it works appears after this list). To make it available, add source("bsave.r") to your ~/.Rprofile file (specifying a path as necessary, depending on where you save the file).
- gather: A command-line script that combines multiple files (one from each job) into a single file containing all the results (to do the gathering, it uses the abind package, which you will have to install). To make it available, save it to the directory where you keep your executable files (~/.local/bin, if you’re following the naming convention used in this tutorial) and make sure it is executable (chmod u+x gather).
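To give a sense of what bsave() does, here is a minimal sketch; the actual bsave.r may differ in details such as the exact file names and suffix handling:

bsave <- function(obj, suffix='') {
  # r_job passes SGE_TASK_ID and SGE_TASK_LAST to R via --args;
  # SGE sets SGE_TASK_ID to "undefined" for non-array jobs
  args <- commandArgs(trailingOnly=TRUE)
  if (length(args) == 0 || args[1] == 'undefined') {
    # interactive (or single-job) session: file named with today's date
    fname <- paste0(Sys.Date(), suffix, '.rds')
  } else {
    # batch array job: one tmp file per task
    fname <- paste0('tmp', args[1], suffix, '.rds')
  }
  saveRDS(obj, fname)
}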
To see why these tools are useful, let’s watch them in action by rewriting sim.R one last time:
sim.R
N <- 10000   # number of simulated data sets
p <- numeric(N)
n <- 10      # sample size per group
for (i in 1:N) {
  x <- rnorm(n)
  y <- rnorm(n, sd=3)
  p[i] <- t.test(x, y, var.equal=TRUE)$p.value
}
bsave(p)     # save using the convention described above
If we run this in an interactive session of R, p will be saved in a file named with today’s date. When we run it non-interactively on the cluster with

r_batch sim.R 15

the results are saved in tmp1.rds, tmp2.rds, and so on. To combine them, submit:

gather
Now all the tmp and log files are gone and we are left with a single file, 2021-06-03.rds (or whatever the date is). If you load it into R, p <- readRDS('2021-06-03.rds'), you’ll see that it contains all 150,000 results. This is about as low a barrier as you can hope for, short of running jobs on multiple processors from within R itself: no need to modify any code or to write any scripts; just run r_batch and then gather when you’re done.
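For the curious, the heart of a gather-style script might look roughly like the sketch below; the real gather is more general (handling suffixes, lists, log-file cleanup, and so on):

#!/usr/bin/env Rscript
# Rough sketch of gathering per-job results into one file
library(abind)
files <- list.files(pattern='^tmp[0-9]+\\.rds$')
pieces <- lapply(files, readRDS)
results <- do.call(abind, c(pieces, along=1))  # merge along the first dimension
saveRDS(results, paste0(Sys.Date(), '.rds'))
file.remove(files)  # clean up the per-job tmp files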
In this particular example, each saved result was a vector (one scalar per simulation), but bsave()/gather work for an array of any dimensions (under the assumption that the first dimension is the one we’re merging on) as well as on lists. You can also add a suffix (e.g., bsave(p, 'a')) if, for example, you want to save different results in different files; see the comments in bsave.r for documentation and examples.
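For instance, here is a hypothetical variation on sim.R in which each job saves an N x 2 matrix of p-values (the 'a' suffix is arbitrary); after gathering, you would have one combined matrix with a row per simulation across all jobs:

N <- 10000
n <- 10
# two p-values per simulation: equal-variance and Welch t tests
p <- matrix(NA, N, 2, dimnames=list(NULL, c('equal', 'welch')))
for (i in 1:N) {
  x <- rnorm(n)
  y <- rnorm(n, sd=3)
  p[i,1] <- t.test(x, y, var.equal=TRUE)$p.value
  p[i,2] <- t.test(x, y)$p.value
}
bsave(p, 'a')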
This has just been an illustration of some personal things I’ve done to customize Argon and make transitioning code to and from Argon go smoothly. You’re of course welcome to use these tools yourself, further customize them to your own needs, or take a completely different approach. The main point I want to get across is that writing helper scripts to make your work easier is a very powerful habit.