== Overview ==

The UT NE Cluster now uses TORQUE for job submission and Maui for job scheduling. For users who are used to going to Ganglia, selecting an underutilized node, logging into that node, and manually managing their jobs, this will be a large change. This wiki article gives a brief overview of how TORQUE/Maui are implemented on our cluster, how to use the commands to manage your jobs, and some examples with codes that are in common use on the cluster.
===Queue Structure===

Currently we have 6 queues available on the cluster. Some are available to all users, and some are restricted to certain users. Here is a brief overview of the queues. Note that there is no default queue; you will have to specify a queue for every job you want to run. Some queues line up closely with the information on the [[Nodes]] page.

* gen1

These nodes line up with the AMD nodes on the cluster. There are currently 7 nodes in this queue, and each node has 2 cores. These nodes are accessible to all users, and users may run interactive jobs on them.

* gen2

These nodes line up with the Core2Quad nodes on the cluster plus Ondrej's server node [http://necluster.engr.utk.edu/ganglia/?c=NE%20Cluster&h=node18 node18]. There are currently 8 nodes in this queue and each node has 4 cores, EXCEPT node18, which has 8. These nodes are accessible to all users, and users may run interactive jobs on them.

* gen3

These nodes line up with the first-generation Core i7 nodes on the cluster. There are currently 11 nodes in this queue, and each node has 8 cores. Access to these nodes is restricted to users who have contacted the cluster admins with a reason to use this computational power, and users may NOT run interactive jobs on them.

* gen4

These nodes line up with the second-generation (Sandy Bridge Core i7) nodes on the cluster. There are currently 3 nodes in this queue, and each node has 8 cores. Access to these nodes is restricted to students with higher-priority, large computational tasks, and users may NOT run interactive jobs on them.

* corei7

This queue consists of both the gen3 and gen4 queues and carries their associated restrictions.

* all

This queue consists of the gen1 through gen4 queues and carries their associated restrictions.
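If you want to check which queues are defined and how busy they are before submitting, the TORQUE client tools can list them. This is a minimal sketch, assuming the standard client tools are on your PATH; the exact output columns vary with the TORQUE version installed on the cluster:

<source lang="bash"># Summarize all queues (limits and running/queued job counts)
qstat -q

# Show the full configuration of a single queue, e.g. gen2
qstat -Q -f gen2</source>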
+ | |||
+ | ==Job Submission== | ||
+ | |||
+ | Job submission is done by the <code>qsub</code> command. The easiest way to create a job is to create a job script and submit it to the queuing system. A job script is merely a text file with some #PBS directives and the command you want to run. Important variables are shown in the table below. | ||
+ | |||
+ | {| border="1" | ||
+ | ! PBS Flag !! Description !! Example | ||
+ | |- | ||
+ | | -I (upper case i) || Runs the job interactively. This is somewhat similar to logging into a node the old way. || N/A | ||
+ | |- | ||
+ | | -l (lower case l) || Defines the resources that you want for the job. This is probably one of the more important flags as it allows you to specific how many nodes you want, and how many processes you want on that node. || -l nodes=4:ppn=4 | ||
+ | |- | ||
+ | | -N || Give the job a name. Not required, but it will name the screen output file and error file after the job name if it is given. || -N MyMCNPCase | ||
+ | |- | ||
+ | | -q || What queue you want to submit the job to. || -q gen3 | ||
+ | |- | ||
+ | | -V || Export your environment variables to the job. Needed most of the time for OpenMPI to work (PATH, etc.) || N/A | ||
+ | |} | ||
+ | |||
+ | Many other flags can be found in the [http://http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/commands/qsub.htm Admin Guide for TORQUE]. | ||
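As a quick illustration of how these flags combine on the command line (the script name <code>myrun.sh</code> is just a placeholder, and whether you may use gen3 depends on your access), a non-interactive submission might look like this:

<source lang="bash"># Ask for 4 nodes with 4 processes per node on the gen3 queue,
# export the current environment, and name the job MyMCNPCase
qsub -V -N MyMCNPCase -q gen3 -l nodes=4:ppn=4 myrun.sh</source>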
+ | |||
+ | ===Example Interactive Job=== | ||
+ | |||
+ | The following command requests all processors on an Intel Core2Quad core node (queue gen2). You can then use this node for various quick little tests, compiling, anything really. You don't have to request all the processors on the node, but if you plan on compiling or doing anything in parallel it's probably beneficial so that other people also don't come on your node at the same time. | ||
+ | |||
+ | <pre>shart6@necluster ~ $ qsub -I -V -q gen2 -l nodes=1:ppn=4 | ||
+ | qsub: waiting for job 0.necluster.engr.utk.edu to start | ||
+ | qsub: job 0.necluster.engr.utk.edu ready | ||
+ | |||
+ | shart6@node2 ~ $ mpirun hostname | ||
+ | node2 | ||
+ | node2 | ||
+ | node2 | ||
+ | node2 | ||
+ | shart6@node2 ~ $ logout | ||
+ | |||
+ | qsub: job 0.necluster.engr.utk.edu completed</pre> | ||
+ | |||
+ | Two imporant things to note are: | ||
+ | |||
+ | * I used -V to pass through my environment variables to the interactive job (PATH, LD_LIBRARY_PATH, etc.) | ||
+ | |||
+ | * When I used mpirun in my job, I DID NOT need to specify -np or -machinefile. The job will inherently know what cores you have access to. | ||
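If you ever do need the node list explicitly (for example, with an MPI installation that was not built with TORQUE support), the scheduler writes it to the file named by the <code>PBS_NODEFILE</code> environment variable inside the job. A minimal sketch, assuming an OpenMPI-style <code>mpirun</code>:

<source lang="bash"># $PBS_NODEFILE lists one line per core allocated to the job
cat $PBS_NODEFILE

# Pass it to mpirun explicitly if your MPI stack does not pick it up on its own
mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE hostname</source>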
+ | |||
+ | ===Example Script=== | ||
+ | |||
+ | The following script does the exact same things I did in the interactive job but non-interactively. | ||
+ | |||
+ | <source lang="bash">#!/bin/bash | ||
+ | |||
+ | #PBS -V | ||
+ | #PBS -l nodes=1:ppn=4 | ||
+ | #PBS -q gen2 | ||
+ | |||
+ | mpirun hostname</source> | ||
+ | |||
+ | The job is then submitted with qsub directly. Since I didn't give a name, output (and error messages) will be in the form <scriptname>.o<job#> and <scriptname>.e<job#> respectively. | ||
+ | |||
+ | <pre>shart6@necluster ~/pbstest $ qsub myrun.sh | ||
+ | 1.necluster.engr.utk.edu | ||
+ | shart6@necluster ~/pbstest $ cat myrun.sh.o1 | ||
+ | node2 | ||
+ | node2 | ||
+ | node2 | ||
+ | node2</pre> | ||
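If you do give the job a name with -N (either on the qsub command line or as a #PBS directive in the script), the output and error files are named after the job instead of the script. A small sketch; the job number in the resulting file names will of course differ:

<source lang="bash"># Output and error files will be named MyMCNPCase.o<job#> and MyMCNPCase.e<job#>
qsub -N MyMCNPCase myrun.sh</source>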
+ | |||
+ | ==Job Control== |