TORQUE/Maui

Overview

The UT NE Cluster now uses TORQUE for job submission and Maui for job scheduling. For users who are used to going to Ganglia, selecting an underutilized node, logging into that node, and manually managing their jobs, this will be a large change. This Wiki article will give a brief overview of how TORQUE/Maui are implemented on our cluster, how to use the commands to manage your job, and some examples with codes that are in common use on the cluster. If your problem needs a significant amount of compute time, please first make sure that you have a script that runs well for a tiny problem size, then talk to the admins to work out a strategy that makes your problem run as fast as possible given the available resources.

A good summary of Torque job options: https://wikispaces.psu.edu/display/CyberLAMP/Scheduling+jobs+with+qsub+and+PBS+command+files

Queue Structure

Currently we have 10 queues available on the cluster. Some are available to all users, and some are restricted to certain users. Here is a brief overview of the queues. Note that there is no default queue; you will have to specify a queue for every job you want to run. Some queues line up closely with information on the Nodes page.
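If you want to check the current queue list for yourself, Torque can report it directly (a standard Torque command, shown here as a sketch run from the head node):

qstat -Q

Each row shows a queue name together with its current job counts and whether it is enabled and started.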

NOTE: As of May 12, 2014, we are testing having no restrictions on queues. Please select a queue that you feel is appropriate for your job.

  • interactive

This queue consists of the gen1 and gen2 nodes and is a mechanism for users to submit interactive jobs. This is the ONLY queue that users may run interactive jobs on.

  • gen1

These nodes line up with the AMD nodes on the cluster. There are currently 7 nodes in this queue and each node has 2 cores. These nodes are accessible for all users.

  • gen2

These nodes line up with the Core2Quad nodes on the cluster plus Ondrej's server node node18. There are currently 8 nodes in this queue and each node has 4 cores, EXCEPT node18 which has 8. These nodes are accessible for all users.

  • gen3

These nodes line up with the first generation (Nehalem) Core i7 nodes on the cluster. There are currently 11 nodes in this queue and each node has 8 cores. Access to these nodes is restricted to users who have contacted the cluster admins with a reason to use this computational power.

  • gen4

These nodes line up with the second generation (Sandy Bridge) Core i7 nodes on the cluster. There are currently 3 nodes in this queue and each node has 8 cores and 32GB of RAM. Access to these nodes is restricted to students with higher-priority, large computational tasks.

  • gen5

These nodes line up with the fourth generation (Haswell) Core i7 nodes on the cluster. There are currently 12 nodes in this queue and each node has 8 cores. Access to these nodes is restricted to those with gen4 access. Almost all gen5 nodes have 32GB of RAM, but this is not guaranteed. If you need memory, please specify it in your request.

  • corei7

This queue consists of both the gen3 and gen4 queues and has those associated restrictions.

  • super

This is the queue for node0 and node1. Please ask for permission to use it.

  • students

This queue consists of the gen1-gen2 and vgen nodes and has those associated restrictions.

  • fill

This queue is for large, low-priority jobs. Jobs here are queued with a lower priority than jobs in the other queues, which means they will only run when there are no other jobs in the queue. All nodes are in this queue.

It is somewhat preferable to stay within a queue since the nodes within a queue are usually homogeneous (except for Ondrej's node18). For example, if you ran a job in the corei7 queue, you could get some faster nodes and some slower nodes which will impact load balancing in MCNP. While the end product will still be faster, there will be some inefficiencies that the user should be aware of.
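Since there is no default queue, every submission has to name one. As a minimal sketch (gen2 and myrun.sh are placeholders), the queue can be given either on the qsub command line or as a directive inside the script:

# on the command line at submission time
qsub -q gen2 myrun.sh

# or, equivalently, inside the job script itself
#PBS -q gen2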

Node-Queue Matrix

The Node-Queue Matrix is discouraged; use the nodes.py script instead.

Job Submission

Job submission is done with the qsub command. The easiest way to create a job is to write a job script and submit it to the queuing system. A job script is merely a text file with some #PBS directives and the command you want to run. Important flags are shown in the table below.

PBS Flag | Description | Example
-I (upper case i) | Runs the job interactively. This is somewhat similar to logging into a node the old way. | N/A
-l (lower case L) | Defines the resources that you want for the job. This is probably one of the more important flags, as it allows you to specify how many nodes you want and how many processes you want on each node. | -l nodes=4:ppn=4
-N | Gives the job a name. Not required, but if given, the screen output file and error file will be named after the job. | -N MyMCNPCase
-q | The queue you want to submit the job to. | -q gen3
-V | Exports your environment variables to the job. Needed most of the time for OpenMPI to work (PATH, etc.). | N/A

Many other flags can be found in the qsub documentation (http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm). Specifically, for requesting resources with the -l flag, see http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/2-jobs/requestingRes.htm. A short handout summarizing the qsub command is available at https://support.adaptivecomputing.com/wp-content/media/pdf/Handout_TorqueTutorial_qsub.pdf.
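As a quick illustration of how the flags in the table combine, a hypothetical submission of a script called myrun.sh might look like:

qsub -V -N MyMCNPCase -q gen3 -l nodes=2:ppn=8 myrun.sh

On success, qsub prints the job ID (for example 1234.necluster.ne.utk.edu), which you then use with qstat and qdel.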

Example Interactive Job

The following command requests one processor in the interactive queue. Use this for short program runs.

qsub -I -V -X -q interactive

For convenience there is an alias for the above command:

qsubi

If you do not need X windows, you can use:

qsubinox

The following command requests four processors on a node in the interactive queue (all of the processors on an Intel Core2Quad gen2 node). You can then use this node for various quick little tests, compiling, anything really. You generally shouldn't request all of the processors on a node; I did here to illustrate the use of mpirun. If you know you are going to use parallel codes, or you want to run multiple things at once, make sure you request the correct number of CPUs!

shart6@necluster ~ $ qsub -I -V -q interactive -l nodes=1:ppn=4
qsub: waiting for job 0.necluster.ne.utk.edu to start
qsub: job 0.necluster.ne.utk.edu ready

shart6@node2 ~ $ mpirun hostname
node2
node2
node2
node2
shart6@node2 ~ $ logout

qsub: job 0.necluster.ne.utk.edu completed

Two important things to note are:

  • I used -V to pass through my environment variables to the interactive job (PATH, LD_LIBRARY_PATH, etc.)
  • When I used mpirun in my job, I DID NOT need to specify -np or -machinefile. The job will inherently know what cores you have access to.

Example Script

The following script does the exact same things I did in the interactive job but non-interactively.

#!/bin/bash

#PBS -V
#PBS -l nodes=1:ppn=4
#PBS -q gen2

mpirun hostname

The job is then submitted with qsub directly. Since I didn't give a name, output (and error messages) will be in the form <scriptname>.o<job#> and <scriptname>.e<job#> respectively.

shart6@necluster ~/pbstest $ qsub myrun.sh 
1.necluster.ne.utk.edu
shart6@necluster ~/pbstest $ cat myrun.sh.o1 
node2
node2
node2
node2

Requesting memory for your job

There are two ways of requesting memory resources for your job: either you request the total memory for the job, or memory per process.

The PBS flag

-l mem=XXXX

requests the total memory available for the job.

An example script to request one node, 32 cores, and 64GB of RAM:

#!/bin/bash
#PBS -V
#PBS -l nodes=1:ppn=32
#PBS -l mem=64gb
#PBS -q fill
hostname

The PBS flags

-l pmem=XX,pvmem=XX

set the amount of available memory (pmem = physical, pvmem = virtual) to XX per process.

An example script to request 4 cores and (4x16=)64GB of RAM:

#!/bin/bash
#PBS -V
#PBS -l nodes=1:ppn=4
#PBS -l pmem=16gb,pvmem=16gb
#PBS -q fill
hostname

Note that if your calculation tries to allocate more memory than was requested using the -l mem= or -l pmem= flags, your job will be terminated.
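If you are not sure how much memory to request, one approach (a sketch; 1234 is a placeholder job ID) is to run a small test case first and then check what it actually used with qstat:

# full job information; resources_used.mem / resources_used.vmem show actual usage
qstat -f 1234 | grep resources_used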

Job Control

There are numerous commands to control and view your job. Two of the big ones are:

  • qdel <#> - Remove the job from the queue.
  • qstat - View the queued jobs.

More information on these commands (and others!) can be found in the TORQUE Administrator's Guide.

Also, when you are running a job on a node, you CAN ssh into that node like normal. This will let you do things like top and ps. However, when your job finishes, the node will forcibly evict you (so don't have many things open when your job is about to finish)!

NOTE: If you are running two jobs on a node and one ends, you won't be kicked off the node. You will only be kicked off the node when all your jobs running on that node end.
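A few common job control invocations (the job ID is a placeholder):

# list your own jobs along with the nodes they are running on
qstat -n -u $USER

# show all jobs in the queue with extra detail
qstat -a

# remove job 1234 from the queue
qdel 1234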

Examples

Serpent2

Running SERPENT 2 in the queuing system requires the following modules be loaded:
module load mpi
module load serpent

An example script file for running SERPENT 2 on 1 node with 8 cores:

#!/bin/bash
#PBS -V
#PBS -q corei7
#PBS -l nodes=1:ppn=8

#### cd working directory (where you submitted your job)
cd ${PBS_O_WORKDIR}

#### Executable Line
sss2 -omp 8  serpentinput.inp > nohup.out

For large cases, and if you know why you need MPI and how to use it, an example script file for running SERPENT 2 on 5 nodes with 8 cores each is provided below:

#!/bin/bash
#PBS -V
#PBS -q gen3
#PBS -l nodes=5:ppn=8
#PBS -l pmem=1500mb

#### cd working directory (where you submitted your job)
cd ${PBS_O_WORKDIR}

#### Executable Line
mpirun  -npernode 1 sss2 -omp 8  serpentinput.inp > nohup.out

The cd ${PBS_O_WORKDIR} statement is meant to change the working directory to the directory that the user was in at the time they launched the job. Remember, by default, no working directory is specified when a job initiates in the queue, therefore one must have the script change directories to the working directory. The shortcut for the directory where you invoked qsub is $PBS_O_WORKDIR.

Originally mpirun used the -npernode flag to indicate how many MPI tasks to run on a given node; this is being replaced by the --map-by flag, where ppr:1:node indicates 1 MPI task per node. -bind-to none ensures that the MPI task is not bound to a single processor but is allowed to use all processors on a given node. The SERPENT 2 flag -omp specifies the number of OpenMP threads per MPI task (in this case 8 threads per task). Leaving the -omp specification out will result in a single thread per MPI task.
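Putting that together, a sketch of the executable line using the newer syntax (a drop-in replacement for the -npernode 1 line in the 5-node script above):

# 1 MPI task per node, unbound so its 8 OpenMP threads can use all cores on the node
mpirun --map-by ppr:1:node -bind-to none sss2 -omp 8 serpentinput.inp > nohup.out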

TORQUE does not allow for the viewing of job screen output directly and therefore requires one to specify a file to write the screen output to. This is accomplished with the "> nohup.out" redirection, which writes all screen output to the file named "nohup.out". One can of course name this file anything, but nohup.out was chosen as an example. Currently, however, this screen output has been having problems with the most recent update and is not written to the output file until job completion. Hopefully, this will be resolved soon.

In regards to how best to run SERPENT 2 on multiple nodes, keep in mind that dividing the SERPENT 2 job among multiple MPI tasks results in the model being copied for each MPI task and the number of neutrons per generation (npg) being divided among the MPI tasks. This can introduce problems if the model is large or if the problem is spread over too many tasks. In the case of a large model, each copy of the model requires enough memory to hold the entire model, so 8 MPI tasks on one node would be expected to require roughly 8 times the memory of 1 MPI task with 8 OpenMP threads on that node. Also, if I request 6000 npg in the 8 MPI task run, each task would get 6000/8 npg, whereas the single MPI task with 8 threads would get all 6000 npg. Typically, 6000 npg is sufficient to ensure that enough fissions occur to provide new fission neutron source sites in a single assembly (maybe not accurate, but it will run). However, fewer than 1000 npg runs the risk of having too few new source sites and gradually finding fewer and fewer new fission neutrons until the job dies. Therefore, to avoid both these problems, it is best to specify 1 MPI task per node and use OpenMP threads to use all processors on that node.

Nuclear data

The latest nuclear data from ENDF/B-VIII.0 (https://www.nndc.bnl.gov/endf/b8.0/index.html) are available in /opt/MCNP_DATA. The most visible change from the previous versions is the absence of data for natural carbon; it is therefore necessary to use data for C-12 and C-13. There is also a change in the thermal data for graphite. You can use data for crystalline graphite (gr00) with density approx. 2.3 g/cm3, graphite with 10 % porosity (gr10) and approx. density 2.0 g/cm3, and graphite with 30 % porosity (gr30) with density approx. 1.6 g/cm3. The following lines will tell Serpent to use these data in your calculations:

set acelib "/opt/MCNP_DATA/sss_endfb80.xsdir"
set declib "/opt/MCNP_DATA/sss_endfb80.dec"
set nfylib "/opt/MCNP_DATA/sss_endfb80.nfy"
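To confirm which of these graphite libraries (or any other nuclide) are present before starting a long run, you can simply search the xsdir file; a small sketch (the exact entry names inside the file may differ slightly):

# list the gr00/gr10/gr30 thermal scattering entries mentioned above
grep -i "gr[013]0" /opt/MCNP_DATA/sss_endfb80.xsdir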

MCNPX

Running MCNPX with the queuing system is relatively straightforward since MCNPX only uses MPI. Load the required modules using these commands:
module load mpi
module load intel
module load MCNPX


Here is an example script file where I run MCNPX on two nodes with 8 cores each.

#!/bin/bash

#PBS -q gen3
#PBS -V
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR
mpirun --bind-to none mcnpx name=LWR_test_pin_old

Again, note how I DID NOT give mpirun -np or -machinefile; it determines this all automatically. Also, one other important thing is that I change to the directory $PBS_O_WORKDIR before I execute any commands. This is because when your job starts, it does not start in the directory you submitted it from. You need to tell it to go to where your input files are for MCNPX. The shortcut for the directory where you invoked qsub is $PBS_O_WORKDIR.

One drawback is that TORQUE makes it difficult to monitor your job's output as it runs. The <jobname>.o<job#> file doesn't appear until the job has finished (or failed!). One way around this is to redirect your output to a file. This will have the output show up there as the job runs. As an example, we can change the run line above to redirect output to myjoboutput.txt:

mpirun --bind-to none mcnpx name=LWR_test_pin_old > myjoboutput.txt
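While the job is running, you can then follow that file from the head node:

tail -f myjoboutput.txt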

MCNP5

First see a Beginners Guide to MCNP with Torque here. Load the required modules using these commands:
module load openmpi
module load MCNP5

As in the MCNP section of this wiki, we have 3 different ways to run MCNP5:

1) MPI only.
2) OpenMP only. (PREFERRED for KCODE runs)
3) OpenMP and MPI.

Only the 3rd situation is complicated, but I will cover all three for completeness.

MPI Only

This is very much the same as MCNPX. We request nodes and processors per node in a PBS script and run our case in the exact same way. An example would be:

#!/bin/bash

#PBS -q gen4
#PBS -V
#PBS -l nodes=3:ppn=4

cd $PBS_O_WORKDIR
mpirun mcnp5.mpi name=super_complicated_case

This will run our case on three gen4 nodes, with 4 MPI processes on each. Note that this is somewhat inefficient, because it would be better to use ALL 8 CPUs on the gen4 nodes, but I only used 4 to show how the command could be used.

Example-specific NOTE: Maui is set up to only use each node once. The logical reader could imagine a case where, since we have 8 CPU cores per gen4 node, we could schedule the case with 4 on node9, 4 on node21, and then 4 on node9 again. However, Maui is set up to spread the cases out, so this would not happen. In addition, because of this, you cannot specify nodes=4:ppn=4 and have Maui schedule 4 on node9, 4 on node21, 4 on node31, and 4 on node9. Your job will either not work or stay in queue forever, even though such a layout is technically possible.

OpenMP Only

This is even easier because with only OpenMP we are limited to just one node. However, MCNP5 doesn't detect how many threads it has available to it, so we will have to set that up in our script.

#!/bin/bash

#PBS -q corei7
#PBS -V
#PBS -l nodes=1:ppn=8

cd $PBS_O_WORKDIR
mcnp5 TASKS 8 name=less_complicated_case

Here we are running on one node with 8 threads. If we had specified nodes=2:ppn=8, then the case would only run on the first node (called Mother Superior) and the other node would be marked utilized, but not calculate anything. This would waste resources!

Note: This is usually the most efficient way to run MCNP cases.

MPI and OpenMP

This is a little bit complicated since we want to allocate ourselves sufficient nodes to run with OpenMP, but, if you remember, mpirun automatically runs on every thread we have allocated. The trick here is running mpirun with the -npernode flag as follows:

#!/bin/bash

#PBS -q gen1
#PBS -V
#PBS -l nodes=1:ppn=1+7:gen3:ppn=8

cd $PBS_O_WORKDIR
mpirun -npernode 1 mcnp5.mpi TASKS 8 name=very_complicated_case

This will allocate 1 CPU on the gen1 queue to act as a master, and then 7 nodes (8 CPUs per node) on the gen3 queue to act as the compute nodes. It is okay to use a "slow" computer as the master because it doesn't do any calculations; it just sends out cross-sections and consolidates results from the compute nodes. Note that it is very important that you list the master CPU first and that you also use the -npernode 1 flag. If you don't, you will start many, many MPI jobs, each using 8 CPUs, and will quickly oversubscribe the node. Also note that the queue specified by -q is only for the first "block", and that you can specify a different queue for the compute nodes. See the FAQ entry "How can I request different CPU counts on different nodes/How can I use multiple queues?" for more multi-queue requests.

Another example (again using a slow gen1 as the master) with gen2 nodes:

#!/bin/bash

#PBS -q gen1
#PBS -V
#PBS -l nodes=1:ppn=1+4:gen2:ppn=4

cd $PBS_O_WORKDIR
mpirun -npernode 1 mcnp5.mpi TASKS 4 name=another_complicated_case

In general this is for extreme cases only, and you ought to use this only if you are sure you know what you are doing.

MCNP6.1

Much like MCNP5, but in the first step load these modules:
module load intel/12.1.6
module load openmpi/1.6.5-intel-12.1
module load MCNP6/1.0

The binary is then mcnp6 for OpenMP version and mcnp6.mpi for MPI version. Do not use MPI unless you know what you are doing!

Example script, with module load commands:

#!/bin/bash
#PBS -q fill
#PBS -V
#PBS -l nodes=1:ppn=8

module load intel/12.1.6
module load openmpi/1.6.5-intel-12.1
module load MCNP6/1.0

cd $PBS_O_WORKDIR
mcnp6 TASKS 8 name=less_complicated_case

MCNP6.2

For the moment, you can use mcnp6 with OpenMP only:
module load MCNP6/2.0
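A minimal OpenMP-only MCNP6.2 script, mirroring the MCNP6.1 example above (the queue and input name are placeholders):

#!/bin/bash
#PBS -q fill
#PBS -V
#PBS -l nodes=1:ppn=8

module load MCNP6/2.0

cd $PBS_O_WORKDIR
mcnp6 TASKS 8 name=my_case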


The MPI version was compiled with gcc-5.4.0, which is sub-optimal. Contact system administrator before you do this. Use it like MCNP6.1, but load these modules:
module load gcc/5.4.0
module load mpi
module load MCNP6/2.0

MCNP: Delete unneeded runtapes

Please delete runtape files which you don't need; they take a huge amount of disk space. Below is an example modification of the above script to do this automatically. It also writes the runtape locally to /tmp, saving interconnect bandwidth.

#!/bin/bash

#PBS -q fill
#PBS -V
#PBS -N MyMCNPjob
#PBS -l nodes=1:ppn=8

RTP=runtape_$(date "+%s%N")
cd $PBS_O_WORKDIR
mcnp6 TASKS 8 name=my_complicated_case runtpe=/tmp/$RTP
rm /tmp/$RTP

MCNP6.2 with MPI

MPI with MCNP6.2 is tricky, so please test your jobs on small runs, and make sure all MPI threads initialize correctly.

#!/bin/bash
#PBS -V
#PBS -q xeon
#PBS -l nodes=1:ppn=32
#PBS -N MCNP-MPI

hostname
module load mpi
module load MCNP6/2.0-mpi

RTP="runtp--".`date "+%R%N"`
cd $PBS_O_WORKDIR
/opt/intel/oneapi/mpi/latest/bin/mpirun -np $PBS_NUM_PPN mcnp6.mpi name=my_complicated_case.inp runtpe=$RTP
rm $RTP

Scale

90% of Scale runs are serial, so allocating one node and one CPU to the task will be sufficient. Use local /tmp for temporary files (TMPDIR in the example below) to speed computation up.

#!/bin/bash

#PBS -q gen2
#PBS -V
#PBS -l nodes=1:ppn=1

TMPDIR=$(mktemp -d -t myproject.XXXXXX) || exit 1

cd $PBS_O_WORKDIR
scalerte -m -T $TMPDIR my_case.inp

rm -rf $TMPDIR

You can use parallel KENO through the scalerte driver (batch6.1 in SCALE6.1) as well, but since it doesn't use mpirun you'll have to do some of the grunt work yourself. The easiest way is to grep for the number of nodes you're running on, store that in a variable, and pass that through to scalerte/batch6.1:

#!/bin/bash
#PBS -q gen3
#PBS -V
#PBS -l nodes=3:ppn=8

NP=$(grep -c node ${PBS_NODEFILE})
TMPDIR=/home/tmp_scale/$USER/scale.$$ 

cd $PBS_O_WORKDIR
scalerte -m -N ${NP} -M ${PBS_NODEFILE} -T $TMPDIR my_keno.inp

rm -rf $TMPDIR

Two important things to note here: Torque creates your machine file for you (you access it through the environment variable PBS_NODEFILE), and parallel Scale across multiple nodes requires a shared temporary area. For this reason, the directory /home/tmp_scale is set aside for users who need it for parallel Scale across multiple nodes. Copying my example above ensures that each parallel Scale case runs in its own directory. NOTE: If you are running on a single node, use node-local /tmp instead, which is much faster.

For parallel TRITON it is very similar, but you have to remember to allocate one EXTRA CPU for the master thread (unlike MCNP, Scale doesn't do this for you). For example, if your case has 18 branches, you would request 19 CPUs but pass only 18 through to scalerte/batch6.1:

#!/bin/bash
#PBS -q gen3
#PBS -V
#PBS -l nodes=2:ppn=8+1:gen3:ppn=3

NP=$(grep -c node ${PBS_NODEFILE})
TMPDIR=/home/tmp_scale/$USER/scale.$$ 

cd $PBS_O_WORKDIR
scalerte -m -N $(($NP-1)) -M ${PBS_NODEFILE} -T $TMPDIR my_triton.inp

rm -rf $TMPDIR

Again, you subtract one CPU from the total allocated, pass through the machinefile, and give it a shared temporary directory.

Scale 6.3.1

When you load it using module load scale/6.3.1, it overrides system libraries and qstat/qsub etc. break. These binaries come directly from ORNL and are preferred unless you need MPI.

Please put module load scale/6.3.1 in your submission script, or use Scale interactively via an interactive job (qsubi). Note that module unload scale will restore your system libraries.
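
A minimal sketch of such a submission script, modeled on the serial Scale example above (the queue and input name are placeholders):

 #!/bin/bash
 #PBS -q gen2
 #PBS -V
 #PBS -l nodes=1:ppn=1
 
 # load Scale 6.3.1 inside the job so the overridden libraries
 # affect only this job, not your login shell
 module load scale/6.3.1
 
 TMPDIR=$(mktemp -d -t myproject.XXXXXX) || exit 1
 
 cd $PBS_O_WORKDIR
 scalerte -m -T $TMPDIR my_case.inp
 
 rm -rf $TMPDIR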

Scale 6.3.1 MPI

If you need MPI Scale 6.3.1, use module unload mpi && module load openmpi/2.1.6 && module load scale/6.3.1-mpi. This build was compiled in the NEcluster environment and does not suffer from the issue above; however, it swaps the system MPI library for the Scale-compatible one, so you can run into similar conflicts with other MPI programs. Also, the MPI binaries likely have more overhead for non-MPI jobs, and they do not come from ORNL.
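
A sketch of a parallel KENO job with Scale 6.3.1, assuming the -N/-M/-T scalerte options shown for SCALE 6.1 above still apply (queue, node counts, and input name are placeholders):

 #!/bin/bash
 #PBS -q gen3
 #PBS -V
 #PBS -l nodes=3:ppn=8
 
 # swap in the Scale-compatible MPI stack, then load MPI Scale
 module unload mpi
 module load openmpi/2.1.6
 module load scale/6.3.1-mpi
 
 # one machine-file line per allocated core
 NP=$(grep -c node ${PBS_NODEFILE})
 TMPDIR=/home/tmp_scale/$USER/scale.$$
 
 cd $PBS_O_WORKDIR
 scalerte -m -N ${NP} -M ${PBS_NODEFILE} -T $TMPDIR my_keno.inp
 
 rm -rf $TMPDIR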

Advantg

Advantg depends on MCNP5, so you need to load that first:
module load openmpi
module load MCNP5
module load advantg
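
Here is a minimal submission-script sketch, assuming the driver is invoked as advantg <input>; the queue, CPU count, and input name my_case.adv are placeholders, so check the ADVANTG documentation for the exact invocation:

 #!/bin/bash
 #PBS -q fill
 #PBS -V
 #PBS -l nodes=1:ppn=4
 
 module load openmpi
 module load MCNP5
 module load advantg
 
 cd $PBS_O_WORKDIR
 # run ADVANTG on the (hypothetical) input file
 advantg my_case.adv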

FAQ

How can I setup unique temporary directory for my job?

Use:

mktemp -d -t myproject.XXXXXX

Make sure you remove it before the job exits.

See the Scale example above for details and a complete script.
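
As a minimal sketch inside a job script (the command run is a placeholder; the mktemp call and the cleanup are the important parts):

 TMPDIR=$(mktemp -d -t myproject.XXXXXX) || exit 1
 
 cd $PBS_O_WORKDIR
 ./run_my_code input        # placeholder; point your code's scratch files at $TMPDIR
 
 rm -rf $TMPDIR             # remove the temporary directory before the job exits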

I'm not getting error/output files!

This problem also manifests as mail sent to your account that looks like the following:

PBS Job Id: 421.necluster.ne.utk.edu
Job Name:   depl_scaling
Exec host:  node8/7
An error has occurred processing your job, see below.
Post job file processing error; job 421.necluster.ne.utk.edu on host node8/7

Unable to copy file /var/spool/torque/spool/421.necluster.ne.utk.edu.OU to bmervin@necluster.ne.utk.edu:/home/bmervin/work/shift/full_depletion/test/test1.out
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/421.necluster.ne.utk.edu.OU

Unable to copy file /var/spool/torque/spool/421.necluster.ne.utk.edu.ER to bmervin@necluster.ne.utk.edu:/home/bmervin/work/shift/full_depletion/test/test1.err
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/421.necluster.ne.utk.edu.ER

To prevent this error you need to ensure that the FQDN of the cluster (necluster.ne.utk.edu) is in your ${HOME}/.ssh/known_hosts file. To do this, ssh to both the short hostname and the FQDN of the head node:

ssh necluster
ssh necluster.ne.utk.edu


and answer yes when it prompts you to add the key to known_hosts.

How can I request different CPU counts on different nodes/How can I use multiple queues?

Advanced usage of Torque's -l flag lets you make more complex allocation requests. This is useful when you have an MPI case that requires 9 CPUs and you want to use 8 CPUs on one node and 1 CPU on another (instead of 1 CPU on each of 9 nodes!). Note that when you do this, your -q queue request only applies to the first chunk!

The following example requests 8 CPUs on one gen4 node, and 1 CPU on another gen4 node:

#PBS -q gen4
#PBS -l nodes=1:ppn=8+1:gen4:ppn=1

Notice how I had to "reapply" the queue for the 2nd chunk! You have to specify the queue for any node requests past the first in this way.

Another example: if you want 4 nodes with 4 CPUs each on the fill queue, and 1 node with 8 CPUs on the gen3 queue:

#PBS -q fill
#PBS -l nodes=4:ppn=4+1:gen3:ppn=8

How can I submit a job to a specific node?

The node name can be specified with the "-l nodes=" option, assuming the node is part of the queue. The example below requests 2 cores on node32:

#PBS -q fill
#PBS -l nodes=node32:2


How can I ensure that I have enough *local* disk space (in /tmp)?

Torque reports the amount of free disk space in /tmp for every node. If you know how much disk space you'll need in this directory, you can request it using file=<size> like:

#PBS -q corei7
#PBS -l nodes=2:ppn=8,file=50gb

This is useful for making parallel TRITON runs avoid nodes that only have 10 GB of space in /tmp.

Remember that this is only for local disk access. Network file access to your $HOME or /home/tmp_scale (for Scale runs) isn't affected by this parameter.

I messed up my node allocation request! How do I fix it?

If you request more nodes or CPUs per node than is physically possible, Torque won't tell you. Rather, your job will just sit in the queue until the end of time. This occurs when, for example, you request 8 CPUs per node on a gen2 node (gen2 nodes only have 4 CPUs per node).

To fix problems like this there are three solutions. Continuing with the above example, say you had submitted a job to gen2 requesting 2 nodes at 8 CPUs each. Your PBS script would have looked like this:

 #PBS -q gen2
 #PBS -V
 #PBS -l nodes=2:ppn=8

 cd ${PBS_O_WORKDIR}
 ./run_my_code input

As mentioned, this will never, ever run: gen2 nodes only have 4 CPUs per node, not 8! Let's say we submitted this job and it was assigned the job ID 1234. The three ways to fix this problem are:

1) Delete the job using qdel, rewrite your submission script, and resubmit the job. This is quick, but not very elegant.

qdel 1234

2) Move the job to a queue where it will actually run. In our example, we know that the gen3 queue has 8 CPUs per node, so we can move it there. This is done with qmove:

qmove gen3 1234

3) Change the request to use 4 CPUs per node using qalter. We can also bump the number of nodes up to 4 to keep the total CPU count the same:

qalter -l nodes=4:ppn=4 1234

Of course, for all of these examples replace 1234 with your job's ID number.

Where are my jobs?

You can list your submitted job IDs and the directories they were submitted from by:

 listjobs

Admin stuff

 qstat -a -n -- Torque stats
 showstats -n -- Maui stats
 momctl
 /etc/init.d/pbs_mom restart -- if a node seems down in showstats but should not be
 pbsnodes -o nodename / pbsnodes -c nodename -- set/clear the OFFLINE flag on a node