The key to using the clusters is to remember that all work must be given to a program called a batch system, which schedules and runs your tasks as resources become available. Except in rare cases there is no real-time interaction and, even in those cases, access still goes through the batch system.
The clusters all use SLURM, which is widely used and open source: http://slurm.schedmd.com
Running jobs with SLURM
The normal way of working is to create a short script that describes what you need to do and submit this to the batch system using the "sbatch" command.
For example here's a script to run a code called moovit:
Any line beginning with #SBATCH is a directive to the batch system (see man sbatch for the full list).
The six options shown are more or less mandatory and do the following:
Working directory: the directory in which the job will be run and the standard output files written. This should ideally point to your scratch space.
ntasks: the number of tasks (in an MPI sense) to run per job.
Cores per task: the number of cores for each of these tasks.
Nodes: the number of nodes to use. On Castor this is limited to 1, but it's good practice to request it anyway!
Memory: the memory required per node, in MB.
Time: the time required. There are a number of formats, so see "man sbatch" for the details:
--time 12:00:00 # 12 hours
--time 2-6 # two days and six hours
If the time and memory are not specified then default values will be imposed - these may well be lower than required!
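As a rough sketch, the six options described above could be combined as follows. This is not the original moovit script: the scratch path, the resource values and the way moovit is launched are placeholders, and the working-directory option is spelled --chdir here (older SLURM releases call it --workdir). The snippet writes the job script to a file so that it can be inspected before submission.

```shell
# Write an illustrative job script to moojob1.run (all values are placeholders).
cat > moojob1.run <<'EOF'
#!/bin/bash
#SBATCH --chdir /scratch/<username>/moovit-run
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --nodes 1
#SBATCH --mem 4096
#SBATCH --time 12:00:00

# Assumption: moovit is an executable available in the job's PATH.
srun moovit
EOF

# Show the directives that the batch system will read.
grep '^#SBATCH' moojob1.run
```

Replace <username> and the resource values with ones that match your own job before submitting.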
This script is saved as moojob1.run and in order to submit it we run the following command from one of the login nodes:
The output will look something like
The number returned is the Job ID and is the key to finding out further information or modifying the task.
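Since the Job ID is needed by most of the other commands, it can be convenient to capture it at submission time. The sketch below assumes the usual sbatch reply of the form "Submitted batch job <id>"; the helper name extract_job_id and the sample ID are made up for illustration, and the fallback branch only exists so that the snippet also runs on a machine without SLURM.

```shell
# Pull the job ID out of sbatch's "Submitted batch job <id>" line.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

if command -v sbatch >/dev/null 2>&1 && [ -f moojob1.run ]; then
    # On a login node: submit the script and keep the ID for later use.
    job_id=$(sbatch moojob1.run | extract_job_id)
else
    # Elsewhere: a made-up sample line, for illustration only.
    job_id=$(echo "Submitted batch job 123456" | extract_job_id)
fi
echo "Job ID: ${job_id}"
```

sbatch also has a --parsable option that prints the ID without the surrounding text, which may be simpler in scripts.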
To cancel a specific job:
scancel <JOB_ID>
To cancel all your jobs (use with care!):
scancel -u username
To cancel all your jobs that are not yet running:
scancel -u username -t PENDING
Getting job information
There are a number of different tools that can be used to query jobs, depending on exactly what information is needed. If the name of a tool begins with a capital S then it is a SCITAS-specific tool. Any tool whose name starts with a lowercase s is part of the base SLURM distribution.
Squeue shows information about all your jobs, be they running or pending.
By default squeue will show you all the jobs from all users. The output can be restricted by passing options to squeue.
To see all the running jobs from the scitas group we run:
See man squeue for all the options.
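One plausible way to list the running jobs of the scitas group is sketched here. It assumes that the group corresponds to a SLURM account named "scitas"; if the clusters map groups differently, the --account option would need to be replaced by the appropriate filter.

```shell
# Assumption: the scitas group corresponds to a SLURM account called "scitas".
account="scitas"
squeue_cmd="squeue --states=RUNNING --account=${account}"

if command -v squeue >/dev/null 2>&1; then
    # Unquoted on purpose so that the string splits into command and options.
    ${squeue_cmd}
else
    echo "squeue not found; on a cluster this would run: ${squeue_cmd}"
fi
```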
For example, the Squeue command described above is actually a script that calls:
scontrol will show you everything that the system knows about a running or pending job.
Sjob is particularly useful to find out information about jobs that have finished.
Modules and Provided software
Modules (LMod) is a utility that allows multiple, often incompatible, tools and libraries to exist on a cluster. Scientific tools and libraries are provided as modules and you can see what is available by running "module avail":
Initially you will only see the base modules - these are either compilers or stand alone packages such as MATLAB. In order to see more modules including libraries and MPI distributions you need to load a compiler:
The full guide to how to use module can be found here.
In your submission script we strongly recommend that you begin with a "module purge" and then load the modules you need, so as to ensure that you always have the correct environment.
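A minimal sketch of this recommendation is given below. The module names gcc and openmpi are assumptions (check "module avail" for the names actually installed), and the file name modjob.run and the resource values are placeholders.

```shell
# Write an illustrative job script that resets the environment before loading modules.
cat > modjob.run <<'EOF'
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --time 00:10:00

# Start from a clean environment, then load exactly what the job needs.
module purge
# Assumption: module names "gcc" and "openmpi"; check "module avail" for the real ones.
module load gcc
module load openmpi

# Record the loaded modules in the job output for later reference.
module list
EOF

# The purge must come before any load for the environment to be predictable.
grep -n '^module' modjob.run
```

Loading the compiler before the MPI library matters here, since the MPI modules only become visible once a compiler is loaded.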
Examples of submission scripts
There are a number of examples available on our GIT repository. To download these run the following command from the clusters:
git clone https://<gaspar-username>@git.epfl.ch/repo/scitas-examples.git
Enter the directory scitas-examples and choose the example to run by navigating the folders. We have three categories of examples: Basic (examples to get you started), Advanced (including hybrid jobs and job arrays) and Modules (specific examples of installed software).
To run an example (here: hybrid HPL), do
sbatch --partition=debug hpl-hybrid.run
or, if you do not wish to run on the debug partition,
sbatch hpl-hybrid.run
Running MPI jobs
MPI is the acronym for Message Passing Interface and is now the de facto standard for distributed memory parallelisation.
It's an open standard with multiple implementations and we are now at version 3.
There are multiple MPI flavours that comply with the specification and each claims to have some advantage over the other. Some are vendor specific and others are open source.
On the SCITAS clusters we only support the following compiler/MPI combinations (July 2016 until July 2017):
Intel Composer 2016 with Intel MPI 2016
GCC 5.3 with MVAPICH2 version 2.2
GCC 5.3 with OpenMPI version 1.10
This is a SCITAS restriction to prevent chaos: nothing technically stops you from mixing compilers and MPI implementations. All of the supported combinations work well and have good performance.
If we have an MPI code we need some way of correctly launching it across multiple nodes. To do this we use srun, which is SLURM’s built-in job launcher.
To specify how many ranks and the number of nodes we add the relevant #SBATCH directives to the job script. For example to launch our code on 4 nodes with 16 ranks per node we specify:
There is no need to specify the number of ranks when you call srun!
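The 4 nodes with 16 ranks per node case could be written along these lines. The executable name my_mpi_code, the file name mpijob.run and the time limit are hypothetical; only the two directives that set the node and rank counts come from the description above.

```shell
# Write an illustrative MPI job script: 4 nodes x 16 ranks per node = 64 ranks.
cat > mpijob.run <<'EOF'
#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 16
#SBATCH --time 01:00:00

# srun takes the rank count from the allocation, so no -n option is given.
srun ./my_mpi_code
EOF

grep '^#SBATCH' mpijob.run
```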
Running OpenMP jobs
When running an OpenMP or hybrid OpenMP/MPI job the important thing to set is the number of OpenMP threads per process via the variable OMP_NUM_THREADS. If this is not specified it often defaults to the number of processors in the system.
We can integrate this with SLURM as seen for the following hybrid (4 ranks, 4 threads per rank) task:
This takes the environment variable set by SLURM and assigns the value to OMP_NUM_THREADS.
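A sketch of such a hybrid script, with 4 ranks and 4 threads per rank, might look like this. It assumes that the SLURM variable in question is SLURM_CPUS_PER_TASK, which SLURM sets when --cpus-per-task is requested; the single node, the time limit, the file name hybridjob.run and the executable my_hybrid_code are placeholders.

```shell
# Write an illustrative hybrid MPI/OpenMP job script: 4 ranks x 4 threads.
cat > hybridjob.run <<'EOF'
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 4
#SBATCH --time 01:00:00

# SLURM sets SLURM_CPUS_PER_TASK from --cpus-per-task (here 4).
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_hybrid_code
EOF

grep 'OMP_NUM_THREADS' hybridjob.run
```

Tying OMP_NUM_THREADS to the SLURM variable means the thread count follows the --cpus-per-task request automatically if that value is changed later.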
If you run such hybrid jobs we advise you to read the page on CPU affinity.
The Debug Partition
All the clusters have a few nodes that only allow short jobs and are intended to give you quick access to allow you to debug jobs or quickly test input files.
To use these nodes you can either add the #SBATCH -p debug directive to your job script or specify it on the command line:
sbatch -p debug myjob.run
Please note that the debug nodes must not be used for production runs of short jobs. Any such use will result in access to the clusters being revoked.
There are two main methods of getting interactive (rather than batch) access to the machines. They have different use cases and advantages.
The Sinteract command allows one to log onto a compute node and run applications directly on it. This can be especially useful for graphical applications such as Matlab and Comsol.
Please note that to use a graphical application you must have connected to the login node with "ssh -Y". Sinteract can also be used with the debug partition if appropriate.
salloc creates a reservation on the system that you can then access via srun. It allows one to run multi-node MPI jobs in an interactive manner and is very useful for debugging problems with such tasks.
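As an illustration of the salloc approach, the following sketch requests two nodes for ten minutes and runs hostname once on each of them via srun. The resource values are arbitrary examples, and the fallback branch only exists so that the snippet can be run on a machine without SLURM.

```shell
# Illustrative request: 2 nodes, 1 task per node, 10 minutes.
salloc_cmd="salloc --nodes 2 --ntasks-per-node 1 --time 00:10:00"

if command -v salloc >/dev/null 2>&1; then
    # salloc runs the given command inside the allocation and releases it afterwards.
    ${salloc_cmd} srun hostname
else
    echo "salloc not found; on a cluster this would run: ${salloc_cmd} srun hostname"
fi
```

Running salloc without a trailing command normally starts a shell inside the allocation, from which srun can be called repeatedly while debugging.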