TORQUE/Maui

Overview

The UT NE Cluster now uses TORQUE for job submission and Maui for job scheduling. For the user who is used to going to Ganglia, selecting an underutilized node, logging into that node, and manually managing their job, this will be a large change. This Wiki article gives a brief overview of how TORQUE/Maui are implemented on our cluster, how to use the commands to manage your jobs, and some examples with codes that are in common use on the cluster.

Queue Structure

Currently we have nine queues available on the cluster. Some are available to all users, and some are restricted to certain users. Note that there is no default queue; you will have to specify a queue for every job you want to run. Some queues line up closely with information on the Nodes page.
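For example (script name hypothetical), a queue is chosen with the -q flag when the job is submitted:

qsub -q gen2 myrun.sh

Here is a brief overview of the queues.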

  • gen1

These nodes line up with the AMD nodes on the cluster. There are currently 7 nodes in this queue and each node has 2 cores. These nodes are accessible to all users, and users may run interactive jobs on them.

  • gen2

These nodes line up with the Core2Quad nodes on the cluster, plus Ondrej's server node node18. There are currently 8 nodes in this queue and each node has 4 cores, EXCEPT node18, which has 8. These nodes are accessible to all users, and users may run interactive jobs on them.

  • gen3

These nodes line up with the first generation Core i7 nodes on the cluster. There are currently 11 nodes in this queue and each node has 8 cores. Access to these nodes is restricted to users who have contacted the cluster admins with a reason to use this computational power, and users may NOT run interactive jobs on them.

  • gen4

These nodes line up with the second generation (Sandy Bridge Core i7) nodes on the cluster. There are currently 3 nodes in this queue and each node has 8 cores. Access to these nodes is restricted to users with high-priority, large computational tasks, and users may NOT run interactive jobs on them.

  • vgen

These are the virtual nodes on the cluster. There are currently 5 nodes in this queue and each node has 3 cores. There are no access restrictions to this queue. Users may NOT currently run interactive jobs on these nodes, but they CAN log in to them through SSH.

  • corei7

This queue consists of both the gen3 and gen4 queues and carries those queues' associated restrictions.

  • super

This is the queue for node1. Please ask for permission to use it.

  • all

This queue consists of the gen1 through gen4 queues and carries their associated restrictions.

  • students

This queue consists of the gen1, gen2, and vgen nodes and carries their associated restrictions.

It is somewhat preferable to stay within a single queue, since the nodes within a queue are usually homogeneous (except for Ondrej's node18). For example, if you ran a job in the corei7 queue, you could get some faster nodes and some slower nodes, which will impact load balancing in MCNP. While the end product will still be faster, there will be some inefficiencies that the user should be aware of.

Job Submission

Job submission is done with the qsub command. The easiest way to create a job is to write a job script and submit it to the queuing system. A job script is merely a text file with some #PBS directives and the command you want to run. Important flags are shown below.

  • -I (upper case i): Runs the job interactively. This is somewhat similar to logging into a node the old way. Example: N/A
  • -l (lower case L): Defines the resources that you want for the job. This is probably one of the more important flags, as it allows you to specify how many nodes you want and how many processes you want on each node. Example: -l nodes=4:ppn=4
  • -N: Gives the job a name. Not required, but if given, the screen output file and error file will be named after the job. Example: -N MyMCNPCase
  • -q: The queue you want to submit the job to. Example: -q gen3
  • -V: Exports your environment variables to the job. Needed most of the time for OpenMPI to work (PATH, etc.). Example: N/A

Many other flags can be found in the TORQUE Administrator's Guide.
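As a quick illustration (the job and script names here are hypothetical), these flags can also be given directly on the qsub command line rather than in a script:

qsub -V -N MyMCNPCase -q gen3 -l nodes=4:ppn=4 run_mcnp.sh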

Example Interactive Job

The following command requests all processors on an Intel Core2Quad node (queue gen2). You can then use this node for quick little tests, compiling, anything really. You don't have to request all the processors on the node, but if you plan on compiling or doing anything in parallel it's probably beneficial, since it also keeps other people from landing on your node at the same time.

shart6@necluster ~ $ qsub -I -V -q gen2 -l nodes=1:ppn=4
qsub: waiting for job 0.necluster.engr.utk.edu to start
qsub: job 0.necluster.engr.utk.edu ready

shart6@node2 ~ $ mpirun hostname
node2
node2
node2
node2
shart6@node2 ~ $ logout

qsub: job 0.necluster.engr.utk.edu completed

Two important things to note are:

  • I used -V to pass through my environment variables to the interactive job (PATH, LD_LIBRARY_PATH, etc.)
  • When I used mpirun in my job, I DID NOT need to specify -np or -machinefile. The job inherently knows which cores you have access to.

Example Script

The following script does exactly the same thing I did in the interactive job, but non-interactively.

#!/bin/bash

#PBS -V
#PBS -l nodes=1:ppn=4
#PBS -q gen2

mpirun hostname

The job is then submitted with qsub directly. Since I didn't give a name, output (and error messages) will be in the form <scriptname>.o<job#> and <scriptname>.e<job#> respectively.

shart6@necluster ~/pbstest $ qsub myrun.sh 
1.necluster.engr.utk.edu
shart6@necluster ~/pbstest $ cat myrun.sh.o1 
node2
node2
node2
node2

Job Control

There are numerous commands to control and view your job. Two of the big ones, demonstrated in the sketch below, are:

  • qdel <#> - Remove the job from the queue.
  • qstat - View the queued jobs.
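A minimal sketch of both (the job number 1234 is hypothetical):

qstat          # list your queued and running jobs
qstat -f 1234  # show full details for one job
qdel 1234      # remove job 1234 from the queue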

More information on these commands (and others!) can be found in the TORQUE Administrator's Guide.

Also, when you are running a job on a node, you CAN ssh into that node like normal. This lets you do things like top and ps (see the quick sketch after the note below). However, when your job finishes, the node will forcibly evict you (so don't have many things open when your job is about to finish)!

NOTE: If you are running two jobs on a node and one ends, you won't be kicked off the node. You will only be kicked off the node when all your jobs running on that node end.
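For example (node name hypothetical), while one of your jobs is running on node2:

ssh node2     # allowed while your job occupies node2
top           # watch your processes
logout        # return to the head node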

Examples

MCNPX

Running MCNPX with the queuing system is relatively straightforward since MCNPX only uses MPI.

Here is an example script file where I run MCNPX on two nodes with 8 cores each.

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR
mpirun mcnpx name=LWR_test_pin_old

Again, note how I DID NOT give mpirun -np or -machinefile; it determines all of this automatically. Also, one other important thing is that I change to the directory $PBS_O_WORKDIR before I execute any commands. This is because when your job starts, it doesn't start in the directory you submitted from. You need to tell it to go to where your MCNPX input files are. The shortcut for the directory where you invoked qsub is $PBS_O_WORKDIR.

One drawback is that TORQUE makes it difficult to monitor your job's output as it runs. The <jobname>.o<job#> file doesn't appear until the job has finished (or failed!). One way around this is to redirect your output to a file; the output will then show up in that file as the job runs. As an example, we can change the run line above to redirect output to myjoboutput.txt:

mpirun mcnpx name=LWR_test_pin_old > myjoboutput.txt
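Since $PBS_O_WORKDIR only works because your working directory sits on a filesystem shared with the compute nodes, you can then watch the file from the head node (or from the node itself) while the job runs:

tail -f myjoboutput.txt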

MCNP5

As in the MCNP section of this wiki, we have 3 different ways to run MCNP5:

1) MPI only.
2) OpenMP only.
3) OpenMP and MPI.

Only the 3rd situation is complicated, but I will cover all three for completeness.

MPI Only

This is very much the same as MCNPX. We request nodes and processors per node in a PBS script and run our case in the exact same way. An example would be:

#!/bin/bash

#PBS -q gen4
#PBS -V
#PBS -l nodes=3:ppn=4

cd $PBS_O_WORKDIR
mpirun mcnp.mpi name=super_complicated_case

This will run our case on three gen4 nodes, with 4 MPI processes on each. Note that this is somewhat inefficient, because it would be better to use ALL 8 CPUs on the gen4 nodes, but I only used 4 to show how the command could be used.

Example-specific NOTE: Maui is set up to use each node only once per job. The logical reader could imagine a case where, since we have 8 CPU cores per gen4 node, we could schedule the case with 4 processes on node9, 4 on node21, and then 4 more on node9 again. However, Maui is set up to spread the cases out, so this will not happen. For the same reason, you cannot specify nodes=4:ppn=4 and have Maui schedule 4 on node9, 4 on node21, 4 on node31, and 4 more on node9. Your job will either not work or stay in the queue forever, even though the core count is technically available.

OpenMP Only

This is even easier, because with only OpenMP we are limited to just one node. However, MCNP5 doesn't detect how many threads it has available, so we will have to set that up in our script.

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=1:ppn=8

cd $PBS_O_WORKDIR
mcnp5.mpi TASKS 8 name=less_complicated_case

Here we are running on one node with 8 threads. If we had specified nodes=2:ppn=8, then the case would only run on the first node (called Mother Superior) and the other node would be marked as utilized but would not calculate anything. This would waste resources!

MPI and OpenMP

This is a little bit complicated, since we want to allocate ourselves sufficient nodes to run with OpenMP, but, if you remember, mpirun automatically starts a process on every core we have allocated. The trick here is running mpirun with the -npernode flag as follows:

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=7:ppn=8

cd $PBS_O_WORKDIR
mpirun -npernode 1 mcnp5.mpi TASKS 8 name=very_complicated_case

This will allocate 7 nodes, with all 8 cores on each, to us. Then, mpirun will run ONE task per node, but thread that one task across all 8 CPUs using OpenMP. Note that, due to the way MPI works, we will have 6 nodes running 8 threads, and 1 node (the master node) running 1 thread. The astute mathematician will notice that this leaves 7 cores on the master node unutilized (6 x 8 + 1 = 49 of the 56 allocated cores in use). This is something we'll have to put up with for now. However, the next example shows a way to lower this impact.

Another example would be to run on 7 nodes, but run 2 MPI tasks per node, giving each MPI task 4 threads.

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=7:ppn=8

cd $PBS_O_WORKDIR
mpirun -npernode 2 mcnp5.mpi TASKS 4 name=very_complicated_case

Again, the astute mathematician will notice that this still leaves 3 cores unutilized (13 runner tasks at 4 threads each, plus the single-threaded master task, uses 53 of the 56 allocated cores). Oh well.

A final example would be to half-use a few nodes:

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=4:ppn=4

cd $PBS_O_WORKDIR
mpirun -npernode 1 mcnp5.mpi TASKS 4 name=very_complicated_case

With this example you would have 3 runner nodes using 4 threads each, and one master node using 1 thread (3 of the 16 allocated cores wasted). Note that the Maui scheduling explanation discussed above for MCNPX would also apply in this case.

Scale

90% of Scale runs are serial, so allocating one node and one CPU to the task will be sufficient.

#!/bin/bash

#PBS -q gen2
#PBS -V
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR
batch6.1 -m my_case.inp

So far I have only been able to run parallel KENO on a single node. Since Scale doesn't use mpirun, you'll have to do some of that work yourself.

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=1:ppn=8

cd $PBS_O_WORKDIR
batch6.1 -m -N 8 -M ${PBS_NODEFILE} my_keno.inp

The important thing to note here is that TORQUE will create your machine file for you, and you can access it with the environment variable PBS_NODEFILE.
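If you're curious what that machine file contains, you can print it from inside any job; TORQUE writes one line per allocated core, so nodes=1:ppn=8 produces the same hostname repeated 8 times:

cat ${PBS_NODEFILE}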

FAQ

I'm not getting error/output files!

This problem also manifests as mail sent to your account that looks like the following:

PBS Job Id: 421.necluster.engr.utk.edu
Job Name:   depl_scaling
Exec host:  node8/7
An error has occurred processing your job, see below.
Post job file processing error; job 421.necluster.engr.utk.edu on host node8/7

Unable to copy file /var/spool/torque/spool/421.necluster.engr.utk.edu.OU to bmervin@necluster.engr.utk.edu:/home/bmervin/work/shift/full_depletion/test/test1.out
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/421.necluster.engr.utk.edu.OU

Unable to copy file /var/spool/torque/spool/421.necluster.engr.utk.edu.ER to bmervin@necluster.engr.utk.edu:/home/bmervin/work/shift/full_depletion/test/test1.err
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/421.necluster.engr.utk.edu.ER

To prevent this error, you need to ensure that the FQDN of the cluster (necluster.engr.utk.edu) is in your ${HOME}/.ssh/known_hosts file. To do this, carry out the following steps:

1) Create an interactive session.

2) SSH into the head node using its FQDN.

3) Log off the head node.

4) End your interactive session.

As an example, here is what I did for user cgentry7:

[cgentry7@necluster ~]$ qsub -I -V -q gen2
qsub: waiting for job 429.necluster.engr.utk.edu to start
qsub: job 429.necluster.engr.utk.edu ready

[cgentry7@node30 ~]$ ssh necluster.engr.utk.edu
The authenticity of host 'necluster.engr.utk.edu (160.36.8.194)' can't be established.
RSA key fingerprint is 2e:4a:57:21:2e:23:72:73:39:07:65:80:db:8c:aa:d7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'necluster.engr.utk.edu,160.36.8.194' (RSA) to the list of known hosts.
Last login: Tue Jan 29 12:22:19 2013 from cgentry7.nomads.utk.edu
_  _ ____    ____ _    _  _ ____ ___ ____ ____    _  _ ____ _ _ _ ____ 
|\ | |___    |    |    |  | [__   |  |___ |__/    |\ | |___ | | | [__  
| \| |___    |___ |___ |__| ___]  |  |___ |  \    | \| |___ |_|_| ___]

01/25/13 All compute nodes are now only accessible from TORQUE/Maui.
         READ http://necluster.engr.utk.edu/wiki/index.php/TORQUE/Maui !
         E-Mail questions to shart6@utk.edu.
01/18/13 Cluster has been updated to Fedora 18.  Let cluster admins know of any
         problems.
10/24/12 Intel Compilers 2013 is now the default Intel compiler suite on the
         cluster.  Intel Compilers 2011 is still available as a module and 
         will be until April 2013.  Ensure your codes work with the new
         version before then.
[cgentry7@necluster ~]$ logout
Connection to necluster.engr.utk.edu closed.
[cgentry7@node30 ~]$ logout

qsub: job 429.necluster.engr.utk.edu completed

Answers to such hard-hitting questions as:

My job takes oodles of memory! What should I do?

and

How do I alter my job once it's in the queue?

COMING SOON!