Structure of SCITAS filesystems

  • The structure and purpose of each filesystem are described in the File systems documentation
  • $HOME and $WORK are shared across the site, while $SCRATCH is local to each cluster
  • On $SCRATCH, files older than two weeks may be deleted automatically and without notice
  • Production jobs should use $SCRATCH

What to do when CPU time is significantly less than WALL time?

  • $SCRATCH is a GPFS parallel filesystem, designed to perform well with parallel I/O
  • In certain cases a large number of files is produced at runtime. Such I/O patterns put stress on the $SCRATCH filesystem metadata service and are generally much slower than using a local disk.
  • A long-term solution would require changing the code to use external libraries such as HDF5 or ADIOS, which give more flexibility in how data is saved and handled.
  • A workaround is to use the node-local filesystem $TMPDIR
  • Beware: $TMPDIR is set only once resources are allocated.
    If you query its value on a login node, nothing is returned:

    [user@fidis ~]$ echo $TMPDIR
    [user@fidis ~]$

    However, within a job allocation:

    [user@fidis ~]$ Sinteract
    Cores:            1
    Tasks:            1
    Time:             00:30:00
    Memory:           4G
    Partition:        parallel
    Account:          scitas-ge
    Jobname:          interact
    QOS:              normal
    Reservation:
    salloc: Pending job allocation 159671
    salloc: job 159671 queued and waiting for resources
    salloc: job 159671 has been allocated resources
    salloc: Granted job allocation 159671
    srun: Job step created
    [nvarini@f061 ~]$ echo $TMPDIR

    The variables $TMPDIR, $WORK and $SCRATCH are set by the SLURM scheduler while preparing the environment for each job.
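    As a quick check, a minimal batch script (job limits here are arbitrary) can print the variables SLURM sets for the job; the ${VAR:-<not set>} expansion prints an explicit marker when a variable is missing, as $TMPDIR is on a login node:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --time=00:05:00
    # $TMPDIR, $WORK and $SCRATCH are set by SLURM when the job starts;
    # the :- expansion prints a marker instead of an empty string if unset.
    echo "TMPDIR:  ${TMPDIR:-<not set>}"
    echo "WORK:    ${WORK:-<not set>}"
    echo "SCRATCH: ${SCRATCH:-<not set>}"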

How to use $TMPDIR in your simulations

Quantum ESPRESSO

  • The following example shows how to use $TMPDIR with Quantum ESPRESSO (QE).
  • QE relies on Fortran namelists to read certain parameters used during the simulation.
  • The only change needed in a standard pw.x input concerns outdir in the &CONTROL namelist. For example, in the input below outdir is set to the placeholder 'fakeoutdir':

    &CONTROL
      calculation = 'scf',
      restart_mode = 'from_scratch',
      prefix = 'lgps_diel',
      tstress = .false.,
      tprnfor = .false.,
      outdir = 'fakeoutdir',
      pseudo_dir = '/scratch/nvarini/pseudo',
    /
  • The submission script would look like:

    #!/bin/bash
    #SBATCH --nodes 2
    #SBATCH --time=1:00:00
    #SBATCH -p debug
    module purge
    module load intel/16.0.3
    module load intelmpi/5.1.3
    module load fftw/3.3.4-mpi
    module load mkl/11.3.3
    # Replace the placeholder with the job-specific $TMPDIR
    sed "s|fakeoutdir|${TMPDIR}|g" temp_pw > ${TMPDIR}/${SLURM_JOB_ID}_pw
    srun pw.x < ${TMPDIR}/${SLURM_JOB_ID}_pw > ${TMPDIR}/${SLURM_JOB_ID}.tmp.out
    # Copy the results out of $TMPDIR before the job ends: it is wiped afterwards
    tar cvf ${SLURM_JOB_ID}.archive.tar -C ${TMPDIR} .
  • After the sed command the &CONTROL namelist looks like:

    &CONTROL
      calculation = 'scf',
      restart_mode = 'from_scratch',
      prefix = 'lgps_diel',
      tstress = .false.,
      tprnfor = .false.,
      outdir = '/tmp/1325324',
      pseudo_dir = '/scratch/nvarini/pseudo',
    /
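  • The substitution itself can be checked outside a job. This sketch uses a hypothetical $TMPDIR value, /tmp/159671, in place of the one SLURM would assign:

    #!/bin/bash
    # Stand-alone check of the placeholder substitution used in the job script;
    # /tmp/159671 stands in for the job-specific $TMPDIR.
    printf "outdir = 'fakeoutdir'\n" > temp_pw
    sed "s|fakeoutdir|/tmp/159671|g" temp_pw
    # prints: outdir = '/tmp/159671'
    rm temp_pw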
  • The table below reports, for a single 100 GB file, the write bandwidth into $TMPDIR and the bandwidth of the subsequent copy from $TMPDIR to /scratch (all results in MB/s):

    Cluster    Architecture    Write into $TMPDIR    Copy from $TMPDIR to /scratch
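  • A rough write-bandwidth figure for $TMPDIR can be obtained with dd (a sketch: the file size here is 64 MB rather than 100 GB, and conv=fsync forces the data to disk before dd reports the rate):

    #!/bin/bash
    # Rough write-bandwidth test into $TMPDIR (falls back to /tmp outside a job).
    TESTFILE="${TMPDIR:-/tmp}/dd_test_$$"
    dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fsync 2>&1 | tail -n 1
    rm -f "$TESTFILE"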