Queues

When we say that a compute node “belongs” to the Department of Biostatistics, what does that mean? The basic idea is that the node is there for us to use whenever we want, although if we’re not using it, other people are allowed to. This works both ways: if there are compute nodes not being used, we can use them in addition to our own node(s).

All this is managed through a batch queuing system – you place your computing request with the scheduler, and it decides whose jobs run on which node. As part of the request, you specify at least one queue (essentially, you declare which line or lines you want your jobs to wait in). There are three important queues to know about:

BIOSTAT: This is the name of the general-purpose Department of Biostatistics queue. It currently has 192 slots (cores), spread out over several nodes. Submission to this queue requires approval from one of the queue managers.
UI: This is the name of the general-purpose University of Iowa queue. As a member of the University of Iowa, these compute nodes belong to you just as much as the BIOSTAT queue does, although there is a limit of 5 jobs you can run at the same time on the UI queue (as with all queues, more than this limit can be submitted, though only 5 will run simultaneously).
all.q: Submitting jobs to this queue means that they will run on any compute node that is available. In particular, it means that your job could run on a node belonging to someone else, and if they want it back, your job would be immediately terminated.

There are also some other queues that could be worth knowing about if you have specific resource requests, such as high memory or GPU cards. See the HPC’s Queues and Policies page for additional details on campus queues and their limits, including guidelines for selecting a queue.

More on the BIOSTAT queue

The BIOSTAT queue consists of three machines (“nodes”), two of which are older and one of which is newer:

56 cores, 128G memory
56 cores, 128G memory
80 cores, 384G memory

Typically, it doesn’t make a great deal of difference which machine your program runs on, but note that one node has more memory. The next page discusses how to force your program to run on one type of machine or the other.

SGE, Ettiquette, and the BIOSTAT Queue Usage Policy

Background

There are a variety of batch schedulers (software tools for allocating tasks to computational hardware); the one used by the Argon cluster is the Sun Grid Engine (SGE), and it provides the primary model for sharing access to the nodes in Argon between the users. The next several pages discuss several powerful SGE commands for submitting, controlling, and monitoring jobs submitted to the compute nodes.

Ultimately, we want to provide the best compromise between maximizing the productive use of our HPC resources and equitably sharing the finite resources available within our BIOSTAT queue and on the Argon cluster generally. This is a complicated environment, so the best way to make sure you’re being a good computational citizen is to have a good grasp of the different moving parts of the HPC system. As you read through the next few sections, keep this goal in mind. In general,

When using qlogin, a lot of care is needed both in terms of the resources requested (TL;DR; qlogin -pe smp 2 is more polite than qlogin), and in terms of the runtime (don’t leave interactive sessions logged in).
The scheduler’s job is to keep things fair, but its ability to do so depends on the jobs you’re submitting. If you submit 10,000 jobs that all take 2 minutes to complete, the longest another user will have to wait before their job is considered for submission is 2 minutes. If you submit 500 jobs that all take 13 hours to complete, it may be a full 13 hours before anyone else has a chance to use the resources involved (for example, this could lock up the entire BIOSTAT queue)

Usage Policy

“With great power comes great responsibility”

Very Short Jobs (<15 min): Submit as many as you like
Short Jobs (15-60min): Try not to exceed 2/3 of available slots in the queue
Medium Jobs (1-4 hours): Try not to exceed 1/3 of available slots in the queue
Long Jobs (>4 hours): Try not to exceed 10% of the available slots in the queue

During periods of low usage, it is ok to exceed the limits described in this policy, but users who do so should be prepared to throttle back and remove some of their jobs if requested. The more resources you use (e.g., slots, minutes/job), the greater your responsibility will be to make sure your work is not impacting others.

When to use other queues?

There’s no universal answer as to which queue is best for a given circumstance. Here are a few general thoughts:

If you have lots of short jobs, then all.q may be a good fit, because it allows unlimited submissions. If any jobs are evicted, they should be able to be quickly re-run.
If you have only a few very long running jobs, it may be a good idea to use the UI queue. This queue is much larger than BIOSTAT, and if you have fewer than 5 jobs to run you should be able to get them going very quickly.
Give some thought as to how your work is distributed between jobs. You can often obtain the same result by running many small jobs, or fewer, larger jobs. In general, running more, small, jobs is a better fit for the Argon cluster.