NSF MRI cluster Policy Document - old swarm cluster
February 8, 2008
We expect that the NSF MRI cluster “swarm” will have many competing priorities for its use. This policy document is written in the hope of minimizing conflicts. Many of the policies will be automatic and enforced by software, which we hope will minimize the load on people. The NSF MRI proposal call specified that it would not fund general departmental resources, so the cluster should not be considered one. It is primarily meant to enable research.
CSCF (the Computer Science Computing Facility) will be the cluster managers and will handle regular requests (email should be sent to email@example.com ). Special requests should be sent to firstname.lastname@example.org; they will be forwarded to the Swarm Policy Committee.
The Swarm Policy Committee, consisting of R. Manmatha (chair), James Allan, and Emery Berger, will resolve conflicts and also decide on special requests.
All special requests should go to email@example.com.
2 Cluster Specifications
• 764 cores.
• 2 GB/core RAM.
• 65TB centralized RAID 5 storage.
Each node has about 250 GB of local storage. The nodes run Linux. The cluster file system is Lustre. Grid Engine is used to submit jobs.
• Each researcher laboratory will be a group and will be assigned a Group ID (GID). Resources will be allocated according to groups.
• New users/groups will be added per user request (contact firstname.lastname@example.org). The request should come from a faculty member or be approved by one.
• The procedure for non-CS individuals (who are part of the research laboratories of senior researchers on the grant) may be more complicated and we will try to figure this out as we proceed.
3.1 Home Directories
Each user will get 1 GB for a home directory, enforced by quotas, and home directories will be backed up. Users are encouraged to store code and other non-data files in home directories. Because CSCF has to back up the entire amount and there is an upper limit of 250 GB, inactive users will be removed every 6 months (or more frequently if necessary).
Jobs will be run in batch mode using queues. There will be 3 queues:
• Short Jobs: Each job (process on a core) is allowed to run for 5 hours. Jobs that run more than 5 hours will be automatically killed. Note that a user can submit many jobs, each of up to 5 hours duration. Thus, by subdividing their work, users can run many jobs.
• Long Jobs: The long job queue will be a smaller portion of the cluster. An individual user will be limited to 60 cores at one time. Long jobs will be allowed to run for at most 1 week and will be automatically killed at the end of this time. At most 230 cores in total will be allocated to long jobs.
• MPI jobs: An MPI queue will be created for MPI jobs. An individual user will be limited to 117 cores at one time. MPI jobs will be allowed to run for at most 36 hours and will be automatically killed at the end of this time (this kill is destructive: nothing will be saved). At most 117 cores in total will be allocated to the MPI queue.
• Special Requests: Special requests for running long jobs on many cores will need to be made in advance. We recommend at least a month’s notice, although if the Swarm is not heavily used shorter notice may be sufficient. An example of a special request: a user may want to run a job on 80 cores for a week. Note that it is very unlikely that special requests for use of the entire cluster for an extended period of time will be granted, since this would preempt all other users.
Special requests should be made to email@example.com.
Tutorial on Using the Cluster: We are hoping to organize a tutorial on how to use the cluster.
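The queue limits above can be summarized as a small lookup table. The following is a minimal sketch in Python, not the actual Grid Engine configuration: the queue labels ("short", "long", "mpi") and the function names are hypothetical, and only the numeric limits are taken from this policy. Where the policy states no cap (for example, cores per user in the short queue), the sketch records None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueueLimits:
    max_hours: float                   # run time after which a job is killed
    max_cores_per_user: Optional[int]  # None: no per-user cap stated in the policy
    max_cores_total: Optional[int]     # None: no queue-wide cap stated in the policy


# Numeric limits come from the policy text; the keys are hypothetical labels.
QUEUES = {
    "short": QueueLimits(max_hours=5, max_cores_per_user=None, max_cores_total=None),
    "long": QueueLimits(max_hours=7 * 24, max_cores_per_user=60, max_cores_total=230),
    "mpi": QueueLimits(max_hours=36, max_cores_per_user=117, max_cores_total=117),
}


def fits_queue(queue: str, hours: float, cores: int) -> bool:
    """Return True if a request stays within the stated limits of the queue."""
    limits = QUEUES[queue]
    if hours > limits.max_hours:
        return False
    if limits.max_cores_per_user is not None and cores > limits.max_cores_per_user:
        return False
    if limits.max_cores_total is not None and cores > limits.max_cores_total:
        return False
    return True


def needs_special_request(hours: float, cores: int) -> bool:
    """A request that fits no regular queue would have to go to the committee."""
    return not any(fits_queue(name, hours, cores) for name in QUEUES)
```

Under these assumptions, the example from the policy (80 cores for one week) fits no regular queue, so `needs_special_request(7 * 24, 80)` evaluates to True.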
Groups will have functional priorities that will be based on the faculty CPU share. Each faculty member on the grant and his/her group will be allocated a priority, i.e. a portion of cluster time. Software algorithms will automatically allocate cluster time based on shares and usage. However, if a queue is idle, it is possible for an individual to be allocated all of the queue’s capacity. Slight priority is given to the Swarm Policy Committee and to senior researchers associated with the grant proposal.
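To illustrate what allocating time "based on shares and usage" can mean, here is a minimal sketch in Python. It is not the algorithm Grid Engine uses, and the group names and numbers in the usage note are invented; it only shows the general idea that the group furthest below its entitled share is served next.

```python
from typing import Dict


def next_group(shares: Dict[str, float], usage: Dict[str, float]) -> str:
    """Pick the group whose recent usage is lowest relative to its share.

    shares: relative share per group (hypothetical values).
    usage:  recent usage per group, e.g. core-hours; missing groups count as 0.
    """
    total_share = sum(shares.values())
    total_usage = sum(usage.get(group, 0.0) for group in shares)

    def deficit(group: str) -> float:
        entitled = shares[group] / total_share
        used = usage.get(group, 0.0) / total_usage if total_usage > 0 else 0.0
        return entitled - used  # positive: the group has used less than its share

    # sorted() makes the result deterministic when two groups tie.
    return max(sorted(shares), key=deficit)
```

For example, with shares `{"a": 2, "b": 1}` and usage `{"a": 90, "b": 10}`, group "b" is chosen because group "a" has already consumed more than its two-thirds entitlement.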
6 Disk Space
Each research group will be allocated a quota (usually 500GB). Some preference will be given to senior researchers associated with the grant proposal and to the Swarm Policy Committee. While this space will not be backed up it will not be overwritten. In exceptional cases, the committee may give a researcher/group additional disk space. Temporary scratch space of the order of 20 TB will also be available. This space will be overwritten periodically.
Backup: We recommend that users back up their work, both what is in the temporary scratch space and any other data they need. Since transfers between LGRC and the CS building are limited to 1 Gbps for all clusters, we require that large file transfers be done by taking a USB disk (size = 500 GB) to the cluster site in LGRC (please arrange with firstname.lastname@example.org in advance).
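A rough estimate shows why large transfers over the shared link are discouraged. The sketch below is only back-of-the-envelope arithmetic in Python; it assumes decimal units (1 GB = 8 gigabits), ignores protocol overhead, and the `link_fraction` parameter is a hypothetical stand-in for how much of the 1 Gbps link one transfer actually gets.

```python
def transfer_hours(size_gb: float, link_gbps: float = 1.0, link_fraction: float = 1.0) -> float:
    """Estimate hours needed to move size_gb gigabytes over the link.

    link_fraction is the (assumed) portion of the link available to this
    transfer; 1.0 means the whole link, which is unlikely on a shared line.
    """
    gigabits = size_gb * 8                        # 1 byte = 8 bits
    seconds = gigabits / (link_gbps * link_fraction)
    return seconds / 3600
```

Under these assumptions, 500 GB takes a little over an hour with the entire link and roughly eleven hours with a tenth of it, during which every other cluster sharing the link is slowed.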
Additional Dedicated Disk Space: Researchers who would like dedicated disk space can purchase additional disk space (in multiples of 8 TB) alone or as part of a pool. To prevent degradation of the system, this requires buying an I/O node with 8 TB of disk space (in Fall 2007, the cost was about $11K for an I/O node with 8 TB of disk space). However, the initial startup costs are high, since an additional rack (cost $900 in Fall 2007) and network switch (cost $5000 in Fall 2007) are required. We are investigating how to amortize the cost of the rack and network switch.
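The cost structure described above can be worked through with a short calculation. This Python sketch uses the Fall 2007 prices quoted in this document; it assumes, hypothetically, that a single rack and a single network switch are enough regardless of how many I/O nodes are bought, and the function name is illustrative only.

```python
import math

IO_NODE_TB = 8         # disk space per I/O node, from the policy
IO_NODE_COST = 11_000  # "about $11K" per I/O node (Fall 2007)
RACK_COST = 900        # additional rack (Fall 2007)
SWITCH_COST = 5_000    # network switch (Fall 2007)


def dedicated_disk_cost(tb_needed: float, include_startup: bool = True) -> int:
    """Estimate the cost in dollars of buying at least tb_needed TB of dedicated disk."""
    nodes = math.ceil(tb_needed / IO_NODE_TB)  # space is sold in 8 TB multiples
    cost = nodes * IO_NODE_COST
    if include_startup:
        # Assumption: one rack and one switch suffice for all purchased nodes.
        cost += RACK_COST + SWITCH_COST
    return cost
```

With these figures, the first 8 TB costs about $16,900 including the rack and switch, while each further 8 TB adds about $11,000, which is why pooling purchases or amortizing the startup cost matters.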