1.  Swarm2 Configuration

Swarm2 is a cluster of computers (also referred to as "nodes") in the College of Information and Computer Sciences (CICS). Swarm2 was purchased with funds generously provided by the Massachusetts Technology Collaborative.

The swarm2 hardware cluster consists of:

  • 1 head node with 24 cores (2 processors, 12 cores each - 48 cores with hyperthreading) Xeon E5-2680 v4 @ 2.40GHz, 128GB RAM and 200GB local SSD disk.
  • 50 compute nodes with 28 cores (2 processors, 14 cores each - 56 cores with hyperthreading) Xeon E5-2680 v4 @ 2.40GHz, 128GB RAM and 200GB local SSD disk.
  • 2 data server nodes
  • 2 Dell S4048-ON 10 GbE network switches

The software consists of:

  • Bright Cluster Manager
  • CentOS Linux 7 (OS)
  • ZFS filesystem shared over NFS
  • Slurm (job scheduling and resource management)

Swarm2 Policy Document

2.  Accounts

All users who had an account on the old swarm cluster now have an account on swarm2. Graduate students who need a new account should send email to system@cs.umass.edu and CC the faculty member who will need to approve the request. Faculty members can simply send mail to system@cs.umass.edu to request an account.

3.  Quick Start

Important: Do not run jobs directly on the head node. The head node should be used for submitting jobs.

To log into swarm2, ssh into swarm2.cs.umass.edu. The compute nodes are accessible only from within the swarm2 local network, and should be used only through Slurm jobs. They are named swarm001 through swarm050.

Currently Slurm assigns job priority on a fair-share basis.

To begin submitting jobs to Swarm2's cluster you must use slurm and have disk space on one of the work directories.

Slurm will automatically start your job on the cluster nodes with available resources.

  • Submitting a job: Place your commands in a shell script, then submit it with:
   sbatch  myscript.sh args

3.1  Policy changes on swarm

  • Effective Apr 19, 2018, SWARM will begin automatically enforcing thread limits. If your job spawns more threads than requested, it will be automatically killed. You can adapt to this by either a) requesting more cores or b) ensuring that your program doesn't use extra threads. Examples of how to do both are below.

3.2  [Warning 1] Thread abuse

Most programs, including a simple python script, will figure out how many cores are on a node and try to use all of them, even if only 1 CPU is allocated. Justin found a solution to this: in his case the cause was code that used the OPENBLAS and/or MKL libraries, which are internally multithreaded. If the job script requests 7 cores as

#SBATCH -n 7

and the following lines are added to the script before the working code is called

export MKL_NUM_THREADS=7
export OPENBLAS_NUM_THREADS=7
export OMP_NUM_THREADS=7

then that code will limit itself to 7 threads.

3.3  [Warning 2] Wasteful resource allocation

Users should allocate their resources carefully, since over-allocation means that swarm is only operating at a very low capacity compared to what it is capable of.

I don't have good advice on how to properly profile a program, but I know you can check your actual usage (threads and memory) by logging into a node. E.g.:

[ksung@swarm2 ~]$ ssh swarm003
Last login: Thu Nov  2 10:28:45 2017 from swarm2.cm.cluster
[ksung@swarm003 ~]$ htop
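
If a non-interactive check is enough, ps can report thread counts and memory directly. The following is a minimal sketch, assuming the procps version of ps that ships with CentOS; the helper name thread_count is made up for this illustration.

```shell
#!/bin/bash
# Print the number of threads (NLWP column) used by a given process ID.
thread_count() {
    ps -o nlwp= -p "$1" | tr -d ' '
}

# Example: report the thread count of the current shell.
echo "this shell uses $(thread_count $$) thread(s)"

# List all of your own processes with thread count and resident memory (RSS, in KB).
ps -u "$(id -un)" -o pid,nlwp,rss,comm
```

A process whose NLWP value is larger than the number of CPUs you requested is a likely candidate for thread abuse.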

4.  Disk Space and FileSystem

Configuration

Home directories, work1 and scratch1 are provided by a Dell MD3060e enclosure connected to two PowerEdge servers running NexentaStor, a ZFS-based storage management system. The disk enclosure includes four SSDs for read/write caching and the servers connect with 10 GbE interfaces to the rest of the cluster for extremely fast I/O. The file systems are built on a raidz2 storage pool, similar to a RAID 60 configuration.

Backups for home directories and work1 are provided by a similar storage solution.

User disk space

The swarm2 cluster uses a large RAID storage device with ZFS file systems, with 3 partitions:

  • /home
  • /mnt/nfs/work1
  • /mnt/nfs/scratch1

Each user is allocated a disk quota of 10 GB in their /home directory. Users can see this quota using the 'quota -s' command.

Users are also given a directory on the /mnt/nfs/work1 disk, under a specific group (PI's username). This directory is restricted by a group quota using the PI username as the group. Each group has a disk quota on this disk of 2 TB. There is currently no command to view this quota. Each user has a directory in /mnt/nfs/work1/{primary-group}/{username}, and a link called work1 from their home directory to this directory.

Each user is given a /mnt/nfs/scratch1/{username} directory. There are no quotas. Files will be removed after a period of time (2 weeks?)

ZFS Filesystems

In the ZFS file systems compression is turned on. This means:

  • Copying data from an outside source to the home directory or work1 will often use less space.
  • The quotas reflect the amount of disk space used after compression, so a user can likely store 14-15 GB of data with a 10 GB quota.
  • Tools like 'du' will show the amount of space used on disk after compression. To see how much "actual" space the file(s) consume use 'du --apparent-size'.
  • It may be easier to work with uncompressed files (*.txt rather than *.txt.gz) since the file system is compressing the files anyway.
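
To see the effect of compression on your own files, the two du views can be printed side by side. This is only a sketch: compare_usage is a hypothetical helper name, and the --apparent-size option assumes GNU du (as on CentOS).

```shell
#!/bin/bash
# Show the on-disk (compressed) and apparent (uncompressed) size of a directory.
compare_usage() {
    local target="${1:-.}"
    echo "on disk:  $(du -sh "$target" | cut -f1)"
    echo "apparent: $(du -sh --apparent-size "$target" | cut -f1)"
}

# Example: inspect the current directory.
compare_usage .
```

The "on disk" figure is the one that counts against your quota.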

Backups

The intent is to back up home directories nightly to dedicated backup servers. Initial retention will be 30 days for home directories.

work1 will be backed up less often; once it starts filling we will see how often and what the retention period will be.

scratch1 will NOT be backed up.

Files from old Swarm cluster

The filesystems from the old swarm cluster are no longer mounted on the swarm2 head node. The old swarm will be going away soon so it would be best to copy any of the old files that you need to the new swarm2 filesystems.

5.  Slurm - Job Scheduler

What is Slurm

SLURM (Simple Linux Utility for Resource Management) is a workload manager that provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs. This replaces SGE on the old swarm. More information can be found at : http://slurm.schedmd.com/

NOTE: The scheduler is currently being configured - all settings such as memory and time limits, are liable to change.

Using SLURM

The cluster compute nodes are available in SLURM partitions (called queues in SGE). Users submit jobs to request node resources in a partition. The SLURM partitions on swarm are defq (the default partition) and longq.

defq jobs are restricted to 12 hours, with a default setting of 2 hours (02:00:00). When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. More compute resources are reserved for defq than longq jobs - so it is advantageous to tailor your jobs to be able to run in defq.

longq jobs are restricted to 21 days, with a default setting of 2 days (2-00:00:00) if you submit a job without specifying the runtime. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
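
The choice between the two partitions follows directly from the limits above. The sketch below encodes the 12-hour and 21-day limits stated in this section; pick_partition is a hypothetical helper, not a Slurm command, and the limits may change as the scheduler is reconfigured.

```shell
#!/bin/bash
# Suggest a partition for a job, given its expected run time in whole hours.
pick_partition() {
    local hours="$1"
    if [ "$hours" -le 12 ]; then
        echo defq                      # fits the 12-hour defq limit
    elif [ "$hours" -le $((21 * 24)) ]; then
        echo longq                     # fits the 21-day longq limit
    else
        echo "run time exceeds the 21-day longq limit" >&2
        return 1
    fi
}

pick_partition 2     # prints defq
pick_partition 48    # prints longq
```

The result can then be passed to sbatch with --partition, together with an explicit --time so the partition default is not used.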

Current limits / Defaults:

  • DefMemPerCPU=2048MB
  • MaxMemPerCPU=10GB (was 62000MB)
  • MaxJobCount=20000 # max number of jobs at one time. After this submitting jobs will fail.

Per User limits:

  • GrpMemory=1024000 # in MB (about 1 TB). Maximum amount of memory allowed to be requested for all running jobs for a user (= memory from 8 compute nodes)
  • GrpCpus = 2240 # Maximum number of CPUs allowed to be used at one time for a user. After this jobs will remain pending with the reason AssocGrpCpuLimit (= cpus from 40 nodes)

Note: If your process needs more memory than the MaxMemPerCPU you will need to request more CPUs using the cpus-per-task flag.
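
The number of CPUs to request for a memory-hungry process is a ceiling division. The sketch below assumes MaxMemPerCPU=10GB corresponds to 10240 MB; cpus_for_mem is a hypothetical helper and the 25000 MB figure is only an example.

```shell
#!/bin/bash
# Smallest number of CPUs whose combined per-CPU memory covers the need.
# $1 = memory needed in MB, $2 = per-CPU maximum in MB (assumed 10240).
cpus_for_mem() {
    local need_mb="$1" max_per_cpu_mb="${2:-10240}"
    echo $(( (need_mb + max_per_cpu_mb - 1) / max_per_cpu_mb ))
}

need_mb=25000
cpus=$(cpus_for_mem "$need_mb")
per_cpu=$(( (need_mb + cpus - 1) / cpus ))   # memory per CPU, rounded up

# Directives that could go into the job script:
echo "#SBATCH --cpus-per-task=$cpus"
echo "#SBATCH --mem-per-cpu=$per_cpu"
```

Keep in mind that the extra CPUs count against your per-user CPU limit even if the program only uses one of them.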

Overview of SLURM Commands

  • sbatch - submit a job script
  • srun - run a command on allocated compute nodes
  • squeue - show status of jobs in queue
  • scancel - delete a job
  • sinfo - show status of compute nodes
  • salloc - allocate compute nodes for interactive use

Submitting a SLURM Job

A job consists of two parts: resource requests and job steps. Resource requests specify the number of CPUs, the expected run time, the amount of RAM, etc. Job steps describe the tasks that must be done and the software that must be run. A sample submission script to request one CPU for 10 minutes, along with 100 MB of RAM, in the longq partition would look like:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_%j.txt  # output file
#SBATCH -e res_%j.err        # File to which STDERR will be written
#SBATCH --partition=longq    # Partition to submit to 
#
#SBATCH --ntasks=1
#SBATCH --time=10:00         # Maximum runtime; MM:SS here (10 minutes), D-HH:MM:SS also accepted
#SBATCH --mem-per-cpu=100    # Memory in MB per cpu allocated

hostname
sleep 1
exit

Slurm Job flags

The job flags are used with the sbatch command. The syntax for a SLURM directive in a script is: #SBATCH <flag>
Some of the possible flags, which can also be used with the srun and salloc commands:

Resource    | Flag Syntax         | Description                                                                  | Notes
partition   | --partition=defq    | Partition is a queue for jobs                                                | Default is defq
time        | --time=02-01:00:00  | Time limit for the job                                                       | 2 days and 1 hour; default is MaxTime for partition
nodes       | --nodes=2           | Number of compute nodes for the job                                          | Default is 1
cores       | --cpus-per-task=2   | Number of cpus for a multi-threaded task                                     | Default is 1
cpus/cores  | --ntasks-per-node=8 | Number of cores on the compute node.                                         | Default is 1
memory      | --mem=2400          | Memory limit per compute node for the job. Do not use with mem-per-cpu flag. | memory in MB; default limit is 2000MB per core
memory      | --mem-per-cpu=4000  | Per core memory limit. Do not use the mem flag                               | memory in MB; default limit is 2000MB per core
output file | --output=test.out   | Name of file for stdout.                                                     | default is the JobID

Multi core jobs

In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.

Tasks are requested/created with the --ntasks option, while CPUs, for the multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same amount of CPUs with the --ntasks option may lead to several CPUs being allocated on several, distinct compute nodes.
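
As an illustration of the multithreaded case, a job script might request one task with several CPUs. This is a sketch only: the job name and the CPU count of 4 are arbitrary, and the script just reports what Slurm granted.

```shell
#!/bin/bash
#SBATCH --job-name=threads_demo
#SBATCH --ntasks=1            # one process (task)
#SBATCH --cpus-per-task=4     # four CPUs for that task, all on one node
#
# For four independent processes you would instead request --ntasks=4
# with one CPU per task; those CPUs may land on different nodes.

# Slurm exports the request to the job; fall back to 1 when run outside Slurm.
summary="tasks=${SLURM_NTASKS:-1} cpus_per_task=${SLURM_CPUS_PER_TASK:-1}"
echo "$summary"
```

Because the #SBATCH lines are ordinary comments to bash, the script can also be run directly for a quick syntax check.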

Multithreaded jobs

Slurm is set up currently as a cooperative system, not an enforcement system. If you have a job that spawns more than one process per CPU or more than one thread per CPU, then swarm will let your jobs go over the resources allocated by your job parameters (Note this is not true of memory resources).

If each of your threads uses a full CPU, then your job should be submitted as requesting as many CPUs as threads. Otherwise, you are using more than your allocated resources on swarm, causing everyone else's jobs to run slower.

Some common applications are programs that use multi-threaded linear algebra libraries (MKL, BLAS, OpenBLAS, ATLAS, etc.), programs that use OpenMP, and programs that use pthreads. When using multi-threaded linear algebra libraries, you may need to additionally restrict the number of threads using environment variables such as OMP_NUM_THREADS. Please spend some time reading the documentation of the specific library you are using to understand which environment variables need to be changed.

For example, with OPENBLAS and/or MKL libraries, which are internally multithreaded, if you request 7 cores as

#SBATCH -n 7

then limit the code to 7 threads by adding the following lines to the script before calling the working code:

export MKL_NUM_THREADS=7
export OPENBLAS_NUM_THREADS=7
export OMP_NUM_THREADS=7
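
To avoid keeping the hard-coded 7 in sync with the request, the cap can be derived from the variables Slurm sets inside a job. This is a sketch under the assumption that the job was submitted with --cpus-per-task or -n; set_thread_cap is a hypothetical helper name.

```shell
#!/bin/bash
# Derive the thread cap from the Slurm allocation instead of hard-coding it.
# SLURM_CPUS_PER_TASK is set when --cpus-per-task is given, SLURM_NTASKS when
# -n is given; outside Slurm the cap falls back to 1.
set_thread_cap() {
    local threads="${SLURM_CPUS_PER_TASK:-${SLURM_NTASKS:-1}}"
    export MKL_NUM_THREADS="$threads"
    export OPENBLAS_NUM_THREADS="$threads"
    export OMP_NUM_THREADS="$threads"
}

set_thread_cap
echo "thread cap: $OMP_NUM_THREADS"
```

Call the function before the working code starts, since these libraries typically read the variables once at startup.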

Interactive logins

Though batch submission is best, foreground, interactive jobs can also be run. Jobs should be initiated with the srun command instead of sbatch.

srun --pty --mem 500 -t 0-01:00 /bin/bash

will start a command line shell (/bin/bash) on the defq queue with 500 MB of RAM for 1 hour. The --pty option allows the session to act like a standard terminal. For interactive logins which last longer than 12 hours, remember to use the longq.

After you enter the srun command you will be put into the normal queue, waiting for nodes to become available. When they do, you will get an interactive session on a compute node, in the directory from which you launched the session. You can then run commands.

If you need X11 forwarding for your interactive jobs, use the fisbatch command as a replacement for srun (after loading the stubl module).

module add stubl
fisbatch  [sbatch directives]

Job Priority

The slurm scheduler will assign a priority to waiting jobs based upon configuration parameters (age, size, fair-share allocation, etc.). We have configured multifactor priority based on fair-share (10000), job size (4000), and age (1000). The fair-share value is configured under an account hierarchy based on the user, group, department, and university.

# show priority:
sprio -w  # show configured weights
sprio  # show jobs weighting priority

# Show the current fairshare situation
sshare

# Getting the priority given to a job can be done either with squeue or with sprio
squeue -o %Q -j jobid
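
To get a feel for how the weights interact: the multifactor plugin combines normalized factors (each between 0 and 1) as a weighted sum. The sketch below assumes only the three weights listed above contribute; job_priority is a hypothetical helper and the factor values are invented for illustration.

```shell
#!/bin/bash
# Weighted sum of normalized priority factors, using the weights quoted above.
# $1 = fair-share factor, $2 = job size factor, $3 = age factor (each 0..1).
job_priority() {
    awk -v fs="$1" -v size="$2" -v age="$3" \
        'BEGIN { printf "%d\n", 10000 * fs + 4000 * size + 1000 * age }'
}

# A job with a middling fair-share, a small size factor, and maximum age:
job_priority 0.5 0.25 1.0   # prints 7000 (5000 + 1000 + 1000)
```

Since fair-share carries by far the largest weight, recent heavy usage lowers priority more than waiting time can raise it.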

6.  Software

Modules

The swarm cluster uses Environment Modules, which make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster. Each module file contains the information needed to configure the shell for an application. When a user "loads" an environment module for an application, all environment variables are set correctly for that particular application.

Use the following commands to adjust your environment:

module avail            - show available modules
module add <module>     - adds a module to your environment for this session
module initadd <module> - configure module to be loaded at every login

If a script is unable to find the module command, add a call to initialize it - for example in bash:

. /etc/profile.d/modules.sh
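
In a job script it can be convenient to make that initialization conditional, so the same script works whether or not the module command is already defined. A minimal sketch follows; the variable module_ready is introduced here only for illustration.

```shell
#!/bin/bash
# Load the module command only if the shell does not already provide it.
if ! type module >/dev/null 2>&1 && [ -r /etc/profile.d/modules.sh ]; then
    . /etc/profile.d/modules.sh
fi

if type module >/dev/null 2>&1; then
    module_ready=yes
    module avail 2>&1 | head -n 20   # show the first few available modules
else
    module_ready=no
    echo "module command not found on this machine" >&2
fi
```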

Matlab

Currently users of swarm2 are allowed to use a total of 2 of the department Matlab licenses, and only on one node, swarm050. To use one of the 2 Matlab licenses you must specify '-L matlab@slurmdb' when submitting the job. To show how many licenses are free use:

scontrol show lic

Once connected load the matlab/r2016a module.

You need to use "srun" or "fisbatch" before running an interactive Matlab session. This is restricted to the queue defq which has a time limit of 12 hours.

For example:

module add stubl
fisbatch --immediate --partition=defq \
    -L matlab@slurmdb --nodelist=swarm050 --mem=4096

The best way to use Matlab in the cluster is to compile the matlab code, and submit jobs to run the compiled matlab program. The compilation itself uses a matlab license but running the compiled code will not use a license.

Compiled matlab jobs

Note: Compiling using mcc in the unix terminal keeps the matlab license for 30 mins, whereas compiling using mcc in matlab keeps the matlab license until you quit. (So compile and then quit). The department only has 2 compiler licenses for all users.

A Matlab runtime environment for Matlab-R2016a is available in /cm/shared/apps/MATLAB/MATLAB_Runtime/v901 and can be used by loading the shared module mcr/v901:

module load shared mcr/v901

There is also a Matlab runtime environment for Matlab-R2016b, available by loading the module mcr/v91.

When running multiple compiled Matlab jobs, the MATLAB Compiler Runtime cache can become corrupted. To prevent this, specify a local, unique MCR cache directory (using the MCR_CACHE_ROOT variable) in the job and delete it after running the jobs.

A sample compiled matlab submit job:

#! /bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.out 
#SBATCH --error=test.err 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --mem=2048

module load shared mcr/v901
export MCR_CACHE_ROOT=/tmp/mcr_cache_root_${USER}_$RANDOM # local to host

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

mkdir -p $MCR_CACHE_ROOT 
./run_test.sh /cm/shared/apps/MATLAB/MATLAB_Runtime/v901
/bin/rm -r $MCR_CACHE_ROOT 
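
One possible variation on the cache handling in the script above: $RANDOM only yields values from 0 to 32767, so two jobs on the same node could in principle pick the same directory, and the final rm is skipped if the script aborts early. The sketch below uses mktemp -d and an EXIT trap instead; it is an alternative under those assumptions, not the method used by the sample script.

```shell
#!/bin/bash
# Create a unique, node-local cache directory; mktemp never reuses an existing name.
MCR_CACHE_ROOT=$(mktemp -d "/tmp/mcr_cache_root_${USER:-user}_XXXXXX")
export MCR_CACHE_ROOT

# Remove the cache when the script exits, even if the Matlab step fails.
trap 'rm -rf "$MCR_CACHE_ROOT"' EXIT

echo "MCR cache directory: $MCR_CACHE_ROOT"
# The compiled program would be started here, e.g. the run_test.sh call shown above.
```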

7.  Frequently asked questions (FAQ)

7.1  Q: My compiled job takes as input two integers, and prints out these two numbers. But my output is not as I expected. For example, I get 49 and 50 for the input 1 and 2. Do you know how I can solve this problem?

Input values are passed to your compiled program as strings. Thus, you should first convert each input to the data type that you want to use, for example with str2num.
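
The numbers 49 and 50 are the character codes of the digits '1' and '2', which is what appears when a string is treated as a number. The mapping can be checked from the shell; this is a side illustration rather than Matlab code, and char_code is a hypothetical helper.

```shell
#!/bin/bash
# A leading single quote makes printf print the numeric code of the character.
char_code() {
    printf '%d\n' "'$1"
}

char_code 1   # prints 49
char_code 2   # prints 50
```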

7.2  Q: When running my Matlab compiled job, it gets an error message An unknown error occurred while executing MATLAB code. MCL:Runtime:InternalFeval. What should I do?

You should wrap addpath functions as follows

if (~isdeployed)
    addpath('path');
end

You may also want to check the following page: https://www.mathworks.com/matlabcentral/answers/102228-why-does-addpath-cause-an-application-that-is-generated-with-matlab-compiler-4-0-r14-to-fail

7.3  Q: Where should I put my .sbatch file?

You need to put your sbatch file in the same location as your main entry code. For example, let's say you compiled main.m with sub1.m, sub2.m, and sub3.m, and your files need access to /datasets/.

/datasets/
/path1/sub1.m
/path2/sub2.m
/code/insidePath/sub3.m
/code/main.m

Then, your sbatch file should be located at /code/ as follows while /datasets/ is still accessible.

/datasets/
/code/my.sbatch

7.4  Q: I want to run my compiled job with multiple input values. Is there an easier way to run the jobs at once, instead of making multiple sbatch files?

You can use a job array as follows.

$ sbatch --array=1-10 my.sbatch

Also, you can use the number as one of your input arguments as follows so that you can submit multiple jobs at once with different input values.

./run_matlab.sh /cm/shared/apps/MATLAB/MATLAB_Runtime/v901 $SLURM_ARRAY_TASK_ID

Please find more details in the following page: http://slurm.schedmd.com/job_array.html
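
When the inputs are not simply the numbers 1 to 10, a common pattern is to keep one set of arguments per line in a text file and let each array task pick its own line. The sketch below assumes such a parameter file; args_for_task and the sample values are hypothetical.

```shell
#!/bin/bash
# Print line number $2 of parameter file $1 (one set of arguments per line).
args_for_task() {
    sed -n "${2}p" "$1"
}

# Demo with a throwaway parameter file; in a real job the file would already exist.
params_file=$(mktemp)
printf '%s\n' "1 2" "3 4" "5 6" > "$params_file"

task_id="${SLURM_ARRAY_TASK_ID:-1}"   # set by Slurm inside an array job
args=$(args_for_task "$params_file" "$task_id")
echo "task $task_id would run with arguments: $args"

rm -f "$params_file"
```

In the job script, $args would then be appended (unquoted, so it splits into separate arguments) to the run_matlab.sh call.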

7.5  Q: Can you give me an example .sbatch so that I can start from there?

#! /bin/bash

#SBATCH --job-name=your_job_name
#SBATCH --output=result-%A_%a.out
#SBATCH --error=result-%A_%a.err
#SBATCH --nodes=1
#SBATCH --ntasks=1 
#SBATCH --mem=5000
#SBATCH --partition=longq
#SBATCH --array=1-10

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

export MCR_CACHE_ROOT=/tmp/mcr_cache_root_${USER}_$RANDOM  
mkdir -p $MCR_CACHE_ROOT
module load shared mcr/v901

./run_matlab.sh /cm/shared/apps/MATLAB/MATLAB_Runtime/v901 $SLURM_ARRAY_TASK_ID 1 2 

/bin/rm -r $MCR_CACHE_ROOT

7.6  Q: Is it possible to install Tensorflow 0.11? The most up-to-date version available in the cluster is 0.10 and I specifically need features added in 0.11.

TensorFlow 0.11 is "hiding" in Python 2.7.12:

$ module load python/2.7.12
$ pip list | grep tensorflow
tensorflow (0.11.0rc0)

7.7  Q: How can I run a postgresql database server on the cluster?

You will need to run the server as your own process, as part of a slurm job, either an interactive job or a scripted job. You will need to determine how much memory to reserve for this process when you submit the job; if you go over, the job is killed. You should use a non-default port when running the postgresql server so as not to conflict with any others.

  An example of setting up the server (where mygroup=your group, and username=your username):
# get an interactive login to a compute node
srun --pty --mem 5000 -t 0-01:00 /bin/bash
# create database directory and initialize db
initdb -D  /mnt/nfs/work1/mygroup/username/mydb

# edit  mydb/postgresql.conf and  set
  listen_addresses = '*'
  port = xxxxx (some large port number; the commands below use 25432)
  unix_socket_directories = '/mnt/nfs/work1/mygroup/username/, /tmp'

# edit pg_hba.conf to allow access from 10.128.255.0/16
host    all             username             10.128.255.0/16 md5


# start database server
pg_ctl -D /mnt/nfs/work1/mygroup/username/mydb -l ~/psql.logfile start

# create a database
createdb  -p 25432 -h localhost mytestdb


# connect
psql -p 25432 -h localhost mytestdb
# set a password on user account
mytestdb=# alter user username with encrypted password 'Top!secret';

# edit pg_hba.conf to allow access from 10.128.255.0/16 to user username
host    all             username             10.128.255.0/16 md5

# restart database server
pg_ctl -D /mnt/nfs/work1/mygroup/username/mydb -l ~/psql.logfile restart

7.8  Q: How do I use OpenCV 3.2.0?

opencv 3.2.0 is installed as a module on swarm2. To use it you will need to do

module add shared
module add opencv/3.2.0

To use it with python you will need to create a python virtualenv, activate it, and install numpy in it (using pip).

Then you need to add a link to

/cm/shared/apps/opencv/3.2.0/lib/python2.7/site-packages/cv2.so

under the virtualenv in lib/python2.7/site-packages.

If compiling with it you need to include the flag

-I/cm/shared/apps/opencv/3.2.0/include

7.9  Q: How can I install python (or python3) modules?

The easiest thing is to configure a virtualenv for your python environment and install the packages you need under that. You can set up virtualenv by doing:

virtualenv ~/mypython
# each time you want to use this python you will need to do:
source ~/mypython/bin/activate

Then you can use your pip to install modules under your mypython configuration: pip install numpy

To use python3 you need to enable the module:

module add python/3.5.2 
which python 
  /cm/shared/apps/python/3.5.2/bin/python
which virtualenv 
  /cm/shared/apps/python/3.5.2/bin/virtualenv
# Then create a new virtualenv directory for this version: 
virtualenv ~/mypython3
source ~/mypython3/bin/activate

To use this in a job, make sure to add the module and source the activate script at the start of your slurm script.
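
The start of such a slurm script might look like the sketch below. It assumes the virtualenv lives in ~/mypython3 and that the activate script sits in its bin/ subdirectory (the usual virtualenv layout); the job name and the variables VENV_DIR and venv_status are made up for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=venv_demo
#SBATCH --ntasks=1

VENV_DIR="${VENV_DIR:-$HOME/mypython3}"

# Add the python module if the module command is available in this shell.
if type module >/dev/null 2>&1; then
    module add python/3.5.2
fi

# Activate the virtualenv, then run python from it.
if [ -f "$VENV_DIR/bin/activate" ]; then
    source "$VENV_DIR/bin/activate"
    venv_status=active
    python --version
else
    venv_status=missing
    echo "no virtualenv found at $VENV_DIR" >&2
fi
```

Your own python commands would follow the activation step.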

7.10  Q: How can I add something to the FAQ?

Please send the question and answer to SouYoung Jin.