Monitoring jobs
You’ll feel a bit helpless submitting a job and then just waiting for the output to appear, so it’s important to be able to track and monitor the jobs you’ve submitted to the scheduler. The most useful command for tracking the status of submitted jobs is qstat
. Try running q_sub sim 5
and then submitting the following:
$
qstat
job-ID prior name user state submit/start at queue slots
------------------------------------------------------------------------------
<...more...>
1007632 0.50118 Ets1.sh gaolong qw 02/19/2014 17:17:38 4
1007633 0.50118 Mxi1.sh gaolong qw 02/19/2014 17:17:38 4
1022907 0.00000 sim pbreheny qw 02/20/2014 07:42:40 1
Assuming you submitted qstat
immediately after submitting your job, you’ll be at the end of the list, and your ‘state’ will be ‘qw’, which stands for ‘queued and waiting’. This means that the batch scheduler is still in the process of deciding where to run the job and allocating resources to do so. After a few seconds, your job will transition to a state of ‘r’, meaning ‘running’. For the most part, these are the only two states you will see, although you may happen to catch your job in a ‘t’ state, meaning that it is in the process of transitioning to a compute node. There are also various error states that your job could enter if something goes wrong, such as ‘Eqw’ – if you see anything other than ‘qw’, ‘r’, or ‘t’, it’s an indication that something has gone wrong.
Running qstat
with no options will tell you about all the jobs currently submitted to the entire Argon cluster. This is sometimes useful to see, but typically a bit of information overload. To see only your jobs, you can submit:
(obviously, replacing pbreheny
with your own hawkID). Perhaps more helpful, assuming you’re planning to use the BIOSTAT queue, is to see all the jobs submitted to that queue:
One final monitoring command that may be useful is qhost
, which provides more
information about the processor and memory usage of specific hosts (qstat
tells you which host your job is running on):
$
qhost -h argon-compute-7-11
In particular, this command is useful as a way to see if the node you are working on is running out of memory.
Deleting jobs: qdel
For various reasons, you may wish to delete a job from the queue (usually,
because you realize there is a mistake in your code or your qsub
command). By
running qstat
, you can learn the ID for the job you submitted (1022907 in the
example above). To kill it, simply use the qdel
command:
You will then get a confirmation message confirming that you have deleted the
job.