Old Swarm cluster


Quick Start

Important - Do not run jobs directly on the head node; the head node should be used only for submitting jobs. Idle user processes on the head node will be killed after 14 days.

To log into Swarm, ssh to swarm.cs.umass.edu. The compute nodes are only accessible from within the Swarm local network and should only be used through Grid Engine jobs. They are named swarm-compute-1-0 through swarm-compute-1-21 and swarm-compute-2-0 through swarm-compute-2-36.

To begin submitting jobs to Swarm's cluster you must use Grid Engine and have disk space on one of the work directories.

Grid Engine will automatically start your job on the cluster node with the least load.

Grid Engine commands should be run from Swarm's head node.

  • Submitting a job: Place your commands in a shell script. Then do:
   qsub -cwd -o stdout.txt -e stderr.txt myscript.sh args

-cwd puts the output files in your current working directory rather than your home directory; this is advisable because you don't want to fill up your home directory. -o names the standard output file and -e names the standard error file. (A complete end-to-end sketch appears after this list.)

Alternatively, to run a binary rather than a script, you must use the flag -b y

   qsub -b y -cwd -o stdout.txt -e stderr.txt mybinary args
  • Deleting a job: You can get the job number using qstat
   qdel <job-id>
  • Viewing jobs: In general, use qstat. Perhaps the most useful thing to do is get detailed information about your jobs only:
  qstat -r -u USERNAME

If you see 'Eqw' next to a job, it is not running because of an error. Probably you misspelled the name of the script, or it can't find a file or directory that you reference. See the Debugging section below.

  • Viewing hosts: You can view all the hosts in the Grid Engine and what their current load is by doing

qstat -f

qstat -g c will give a brief summary of the status of all queues.
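
Putting the commands above together, here is a minimal end-to-end sketch; the script contents, data file, and argument are hypothetical placeholders. First, a trivial myscript.sh:

  #!/bin/sh
  # print the execution host and the argument we were given
  echo "running on `hostname` with argument $1"

Then, from the head node:

  qsub -cwd -o stdout.txt -e stderr.txt myscript.sh mydata.txt
  qstat -r -u USERNAME
  qdel <job-id>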

Working interactively.

Sometimes, you really just want a shell prompt.

  qlogin 

will give you an interactive shell on some available Grid Engine machine. This way you take up a slot on the Grid Engine queue, so if someone else submits a large array job, your interactive job will still be assigned a processor. This is great for running Matlab. Variations: To log in to a specific host, use qlogin -q all.q@compute-0-1. If all the slots are full, but you want your shell to wait in the queue for the next available slot, do qlogin -now n.

Important qsub arguments

Here are some neat tricks that the Grid Engine can do.

Requiring Some Amount of Physical RAM

Sometimes your job uses a lot of memory, and you want to restrict it to run on nodes where there is enough free memory, so you don't have swapping. You can do this using job resources, like this

 qsub -l mem_free=2G my-job.sh

If you want to use all 16GB of RAM on a Swarm node, please submit a threaded job (see section 3.2 below).

But what if you want to put constraints on things other than free memory? "Can I do that?" you ask. Boy can you. You can require all kinds of things about the grid node handling your job (e.g., that it has enough physical RAM). For a full list, try qstat -F. For example, you can constrain to a particular host using "-l hostname=compute-0-0" (although you probably should never want to do this).

Large memory jobs

As each execution node runs multiple jobs simultaneously, the job scheduler needs to be given a fairly accurate estimate of the job's size in order not to oversubscribe physical memory on a given node. When memory is oversubscribed, all jobs on that node will start to use virtual memory, which is much slower. To prevent this type of disruption, jobs exceeding 2 gigabytes in size must be submitted with a valid memory resource request. To find the memory usage of a running job you can use the command: qstat -j NNNNN | grep vmem

For jobs greater than 2 gigabytes, calculate the amount of memory your job requires as closely as possible (in 'M'egabytes or 'G'igabytes), then for memory size x submit the job with:

qsub -l mem_free=xG -l mem_token=xG myjob.sh

When submitting a number of large memory jobs at once, it is possible for jobs to oversubscribe the memory on a node due to the delay between the job starting and it finally claiming its full amount of memory - which is what mem_free reports. In the few seconds a job takes to claim its full amount of memory, a second job may launch on that node, unaware that the free memory it sees reported is soon to be claimed by the first job. To avoid such cases, it is necessary to add a request for memory 'tokens' - these are subtracted immediately from a node's available memory token count, avoiding the delayed oversubscription problem. An example command for a 4 gigabyte job:

qsub -l mem_free=4G -l mem_token=4G

Note: It is still necessary to use mem_free even when using mem_token.
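
For example, a hypothetical 4 gigabyte job could carry both requests inside the script itself using #$ directives (that syntax is described under "Embedding Command-line Arguments" below; my-big-memory-program is a placeholder for your own executable):

 #!/bin/sh
 #$ -cwd
 #$ -l mem_free=4G
 #$ -l mem_token=4G
 # my-big-memory-program stands in for whatever actually needs the 4G
 ./my-big-memory-program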

Array Jobs

You can start an array job, a bunch of jobs run in parallel that differ only in that each one is associated with a different numeric index. Each job can find out its index via the SGE_TASK_ID environment variable.

e.g., Suppose you want to do 10-fold cross validation in parallel. Create a script that runs one fold, using that environment variable to know which fold to run. Then doing

  qsub -t 1-10 myjob-cv.sh

would give you 10-fold cross-validation in parallel! This is way cool. I must try this.
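
A minimal sketch of myjob-cv.sh, assuming a hypothetical run-fold.sh program that takes the fold number as its only argument:

 #!/bin/sh
 #$ -cwd
 # SGE_TASK_ID runs from 1 to 10 because of -t 1-10
 ./run-fold.sh $SGE_TASK_ID

By default each task writes its own output file, named with the job id and task id, so the ten folds don't clobber each other's output.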

Passing Through Environment Variables

If you have custom shell variables in your environment and want to use them in Grid Engine, you need to export them to your Grid Engine runtime environment using the -v option. For example, say you have the variable $COLLECTION=/work1/mycollection:

 qsub -v COLLECTION=$COLLECTION my-job.sh

and the environment variable COLLECTION will be set in the job context. You may have problems if the value of COLLECTION contains a space. There's probably a workaround, but I haven't explored it.
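
A sketch of a job script that relies on the passed-through variable (the echo and ls are just placeholders for real work on the collection):

 #!/bin/sh
 #$ -cwd
 # COLLECTION was passed through by qsub -v and points at the data directory
 echo "processing collection in $COLLECTION"
 ls "$COLLECTION"

qsub also accepts -V to pass through your entire environment, though that usually exports far more than you need.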

Embedding Command-line Arguments

qsub has many command-line options. Rather than specifying them on the command line, you can keep them in the script file by putting them in with special #$ comments. This means the qsub command line is one less thing you have to log / keep track of.

For example, say you wanted a script to always run as an array job. You could use a command-line argument, as in the last section. Or you can do

 # my-job.sh
 #$ -t 1-10
 echo `uname` $SGE_TASK_ID

And then you just do qsub my-job.sh to submit it!

Here's another example of some parameters used at the top of a perl script:

 # ---------------------------
 # -- Grid Engine parameters 
 # -----------------
 # --- interpreter 
 #$ -S /usr/bin/perl
 # -----------------
 # --- run from current dir
 #$ -cwd 
 # -----------------
 # --- job name
 #$ -N my.job.name
 # -----------------
 # --- stdio redirection
 #$ -e /dev/null
 #$ -o /dev/null
 # -----------------
 # --- force jobs to linux 
 #$ -l arch=glinux
 # -----------------
 # --- restartable
 #$ -r y
 # --- run 20 instances of this script
 #$ -t 1-20

In the actual script, the task id is retrieved using:

 $taskid = $ENV{"SGE_TASK_ID"};

Note that options specified to qsub on the command-line override these, so if you want to, e.g., resubmit just one task in an array job, you can do something like

 qsub -t 3-3 my-large-array-job.sh

Waiting for Jobs to Finish

Sometimes you don't want qsub to return to the command line until the job actually finishes. (This happened to me when I was using Makefiles for data processing, but wanted to offload some processing steps to the GE.) To do this, use the -sync y option.
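
For example, reusing the hypothetical myscript.sh from the Quick Start, this does not return until the job has finished (handy inside a Makefile rule):

 qsub -sync y -cwd -o stdout.txt -e stderr.txt myscript.sh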

Suspending Jobs

If you want to temporarily suspend one of your jobs (e.g., to make way for someone else), you can do

  qmod -sj <job-id>

To start your job up again, you can do

  qmod -usj <job-id>

Debugging

Sometimes, you'll make a mistake somewhere, and you'll have a job in an error state. When you do qstat -f, you'll see errors like this:

 ############################################################################ 
  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
 ############################################################################
       5 0.00000 galago gauthier     Eqw   11/07/2005 11:50:42     1        

See the big E? That means for some reason, the Grid Engine couldn't even start my job. To find out more, I do qstat -j JOB_ID. In this case:

 $ qstat -j 5
 job_number:                 5
  ...LOTS OF INFORMATION ...
 error reason    1:          11/07/2005 11:50:56 [1567:17869]: error: can't chdir to /work1/gauthier/experiments/galgo
 scheduling info:            queue instance "all.q@swarm.cs.umass.edu" dropped because it is disabled
                             job is in error state

In this case, the problem was that SGE couldn't find the directory in which I wanted to run the script.
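
Once the underlying problem is fixed (here, creating the missing directory), you can clear the error state so the job gets scheduled again, instead of deleting and resubmitting it. A sketch; check the qmod man page if your SGE version spells the flag differently:

 qmod -cj <job-id>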

You can view Swarm status on the web (not available outside the department):

http://swarm.cs.umass.edu/ganglia/


Python 2.7 and 3.2

Python 2.7 is now installed as /opt/python/bin/python2.7 on all of the cluster nodes. Python 3.2 is installed at /opt/python/bin/python3.2.

virtualenv can be used to create a personal Python environment that you can easily add your own packages to.

Example:

$ /share/apps/virtualenv-2.7 ~/mypython
New python executable in mypython/bin/python2.7
Also creating executable in mypython/bin/python
Installing setuptools...........................done.
Installing pip.....................done.

$ ls ~/mypython/bin
activate  activate.csh  activate.fish  activate_this.py  easy_install
easy_install-2.7  pip  pip-2.7  python  python2  python2.7

As you can see, there is now a python executable in "~/mypython/bin". You can easily add your own modules to this environment, like so:

$ ~/mypython/bin/pip install CoreBio

When you run "~/mypython/bin/python", your newly installed packages will be available for import.

More info on virtualenv can be found here:

http://virtualenv.openplans.org/

NOTE: If this fails you can try the following:

 wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.9.tar.gz
 tar xvfz virtualenv-1.9.tar.gz
 cd virtualenv-1.9
 /opt/python/bin/python2.7 virtualenv.py ~/mypython
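
To use the virtualenv's Python inside a Grid Engine job, a sketch of a job script (analysis.py is a hypothetical program of yours):

 #!/bin/sh
 #$ -cwd
 # the virtualenv's python sees everything installed with ~/mypython/bin/pip
 ~/mypython/bin/python analysis.py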

Matlab

The users of Swarm are allowed to use up to 2 of the department Matlab licenses. Matlab should not be run on the head node; instead, use qlogin before running an interactive Matlab session. The best way to use Matlab on the cluster is to compile the Matlab code and submit jobs to run the compiled Matlab program. The compilation itself uses a Matlab license, but running the compiled code will not use a license.

 Matlab on swarm is installed in /share/apps/matlab/

 For information on compiling matlab code for use on swarm read this document: 

Compiling and submitting Matlab jobs on swarm

3. Special queues and advanced use of Grid Engine.

Some users will find that using qsub to submit a single job or multiple discrete jobs may not fit their purposes. For those cases we have created custom queues and parallel environments (PEs), such as the MPI queue and the threaded PE. Custom queues and PEs were configured according to the Policy document (PolicyDOC).

The limitations for the queues are:

  • all.q - 5 hour wallclock limit, 16G virtual memory limit, 12G max resident memory limit, 10K stack size limit
          480 cores
  • long.q - 168 hour wallclock limit, 12G max resident memory limit, max long slots per user: 60
          120 cores
  • mpi.q - 36 hour wallclock limit, 12G max resident memory limit, max mpi slots per user: 120
          120 cores

NOTE: The long job queue is meant for jobs that are likely to take more than 5 hours. Do not submit short jobs to the long queue just to avoid waiting in the short job queue; please only submit long jobs to it. Repeated infractions of this rule will incur a penalty - for example, the priority of the offending user may be lowered.

3.1 MPI Jobs and long jobs queue.

The MPI queues should only be used for parallel computing jobs using one of the installed parallel environments (mpich2, mpi). They should not be used to gain access to more cores for other types of jobs. If you plan to use these queues you will need to ask to have them enabled, and to have your username added to the 'protein' userset.

To submit an MPI job you must request the mpi resource with the -l option: qsub -l mpi=TRUE myjob.sh

To submit a job to long.q you must request the long resource with the -l option: qsub -l long=TRUE myjob.sh

From an email by T.J Brunette:

Now that mpich2 has been installed your program needs to be compiled with the mpicc or mpicxx that was installed above. You can accomplish this by putting /home/apps/compilers/mpich2_v1.05_64_smpd/bin in your PATH and /home/apps/compilers/mpich2_v1.05_64_smpd/lib in your LD_LIBRARY_PATH. You can verify this with the command "which mpicc".

Now you can run your code with the following two scripts.

example_start.csh:

 #!/bin/csh
 #$ -cwd
 #$ -j y
 #$ -S /bin/csh
 qsub -l mpi=TRUE -p 1.0 -R yes -pe mpich <number_of_processors> example.sh

example.sh:

 #!/bin/bash
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 export MPIEXEC_RSH=ssh
 code="<your code>"
 export PATH=/home/apps/compilers/mpich2_v1.05_64_smpd/bin:$PATH
 mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines $code

Warning: don't start the scripts with a number; Grid Engine will reject such scripts.

3.2 Threaded

To submit a threaded job which takes up an entire dedicated node, use the command: qsub -pe threaded 8 myjob.sh
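
Grid Engine sets the environment variable NSLOTS to the number of slots granted, so a sketch of myjob.sh for a multi-threaded program might look like this (the OMP_NUM_THREADS setting assumes an OpenMP program, and my-threaded-program is a placeholder; adjust for whatever threading mechanism your code actually uses):

 #!/bin/sh
 #$ -cwd
 # use one thread per granted slot (8 with "-pe threaded 8")
 export OMP_NUM_THREADS=$NSLOTS
 ./my-threaded-program

Submit it with qsub -pe threaded 8 myjob.sh as above.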


3.3 Hadoop Cluster

A small Hadoop cluster has been created as part of Swarm. To use it (research only), please contact swarm-request@cs.umass.edu

  • Hadoop MapReduce version is 0.20.2 (RPMs from Cloudera)
  • The configuration that all the nodes use is at: /share/apps/hadoop/hadoop.conf/
  • The machines have aliases: hadoop, hadoop{1,2,3,4,5,6,7,8}.
  • SGE queues have been turned off on these nodes.
  • The HDFS reports a size of 1529.9GB
  • Unlike what we have previously been using on swarm, the HDFS is persistent.
  • To compile and run a job, the user must ssh to the node "hadoop".
    • The hadoop binary is at /usr/bin/hadoop
    • The libraries, jars are at /usr/lib/hadoop/
  • Before people can use it, we need to create home directories for them in the HDFS.
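
Once your HDFS home directory has been created, here is a minimal sketch of running a job from the "hadoop" node; the examples jar name is an assumption about the Cloudera 0.20.2 packaging, so check /usr/lib/hadoop/ for the exact filename:

 ssh hadoop
 hadoop fs -mkdir input
 hadoop fs -put /etc/services input
 hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar wordcount input output
 hadoop fs -cat 'output/part-*' | head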

4. Disk space and file system

The file system is Lustre, which is made up of 2 dedicated metadata servers and 8 I/O nodes with attached storage. This is configured into 3 Lustre filesystems. /lustre/work1 is 60TB of shared storage with enough possible throughput for a fully loaded queue. /lustre/work2 (22TB) is used as temporary extra space as needed; users must request space on this partition. /lustre/work3 is reserved for the Million Books project.

 Each user is allocated 1GB of disk quota for their home directory /home2/{username}
 Users can show this quota using the 'quota' command.

 Users are also given a directory on the 60TB /lustre/work1 Lustre disk, under a specific group (the PI's username). This directory is restricted by a group quota using the PI name as the group. You can view this quota using the command:

    lfs quota -g `groups | awk ' { print $1 } '` /lustre/work1 

By default in the swarm lustre filesystems, the files are not striped across multiple OSTs. If you plan to use very large files, it can help to stripe them over the 8 Lustre OSTs to balance the load between the systems. You cannot convert a file to be striped - it needs to be set at file creation time. You can set a directory to be striped, and any files created after that in the directory will be striped:

     lfs setstripe -c 8 /lustre/work1/.../mydirectory 

To convert a file:

   mv myfile myfile.nostripe 
   lfs setstripe -c 8 myfile 
   cp myfile.nostripe myfile 
   lfs getstripe myfile 
   rm -f myfile.nostripe 

Some Lustre best practices:

   Avoid Using ls -l or du -s 
   Avoid Having a Large Number of Files in a Single Directory 
   Avoid Accessing Small Files on Lustre Filesystems 
   Use a Stripe Count of 1 for Directories with Many Small Files 
   Avoid Having Multiple Processes Open the Same File(s) at the Same Time 

Each compute node has a local /scratch directory (a link to /state/partition1/scratch), which can be used for staging data. The size varies by node: 8GB, 168GB, or 600GB.
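
A sketch of staging data through the node-local /scratch inside a job, which avoids hitting Lustre with many small reads and writes; the input path, group directory, and program are hypothetical placeholders:

 #!/bin/sh
 #$ -cwd
 # JOB_ID and SGE_O_WORKDIR are set by Grid Engine
 WORKDIR=/scratch/$USER/$JOB_ID
 mkdir -p $WORKDIR
 cp /lustre/work1/mygroup/input.dat $WORKDIR/
 cd $WORKDIR
 /lustre/work1/mygroup/bin/my-program input.dat > output.dat
 # copy the result back to the directory the job was submitted from, then clean up
 cp output.dat $SGE_O_WORKDIR/
 cd /
 rm -rf $WORKDIR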


Credits

The P.I.s that made it possible are Manmatha, James Allan, Emery Berger, and David Kulp.

The quick start guide was originally written by Charles Sutton for IESL's original cluster. The modifications and the rest of the document were composed by Andre Gauthier from CIIR.

The cluster design and purchasing were by Andre Gauthier.

Cluster configuration and maintenance by Valarie Caro and Andre Gauthier.

Special thanks to T.J. Brunette for assistance in setting up MPI, and to Yariv Levy for help in documenting the Matlab compiler.