Monitoring jobs

You’ll feel a bit helpless submitting a job and then just waiting for the output to appear, so it’s important to be able to track and monitor the jobs you’ve submitted to the scheduler. The most useful command for tracking the status of submitted jobs is qstat. Try running q_sub sim 5 and then submitting the following:

$
qstat
job-ID  prior   name       user         state submit/start at     queue  slots
------------------------------------------------------------------------------
<...more...>
1007632 0.50118 Ets1.sh    gaolong      qw    02/19/2014 17:17:38        4
1007633 0.50118 Mxi1.sh    gaolong      qw    02/19/2014 17:17:38        4
1022907 0.00000 sim        pbreheny     qw    02/20/2014 07:42:40        1

Assuming you submitted qstat immediately after submitting your job, you’ll be at the end of the list, and your ‘state’ will be ‘qw’, which stands for ‘queued and waiting’. This means that the batch scheduler is still in the process of deciding where to run the job and allocating resources to do so. After a few seconds, your job will transition to a state of ‘r’, meaning ‘running’. For the most part, these are the only two states you will see, although you may happen to catch your job in a ‘t’ state, meaning that it is in the process of transitioning to a compute node. There are also various error states that your job could enter if something goes wrong, such as ‘Eqw’ – if you see anything other than ‘qw’, ‘r’, or ‘t’, it’s an indication that something has gone wrong.

Running qstat with no options will tell you about all the jobs currently submitted to the entire Argon cluster. This is sometimes useful to see, but typically a bit of information overload. To see only your jobs, you can submit:

$
qstat -u pbreheny

(obviously, replacing pbreheny with your own hawkID). Perhaps more helpful, assuming you’re planning to use the BIOSTAT queue, is to see all the jobs submitted to that queue:

$
qstat -q BIOSTAT

One final monitoring command that may be useful is qhost, which provides more information about the processor and memory usage of specific hosts (qstat tells you which host your job is running on):

$
qhost -h argon-compute-7-11

In particular, this command is useful as a way to see if the node you are working on is running out of memory.

Deleting jobs: qdel

For various reasons, you may wish to delete a job from the queue (usually, because you realize there is a mistake in your code or your qsub command). By running qstat, you can learn the ID for the job you submitted (1022907 in the example above). To kill it, simply use the qdel command:

$
qdel 1022907

You will then get a confirmation message confirming that you have deleted the job.