SLURM
To list a user's queued jobs
squeue -u username
To show job details
scontrol show jobid 12345
To show partitions
sinfo
Submit job
sbatch --partition=sb --time=10-00:00:00 --chdir=/home/centos/stuff --output=output.txt --gres=gpu:1 --error=error.txt -w sb010 basic.sh
The basic.sh script must use absolute paths; otherwise it won't find its content, because --chdir makes the job run in the new working directory.
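A minimal sketch of what basic.sh could look like, assuming a Python entry point (run.py is a hypothetical name); the #SBATCH directives mirror the command-line flags above, and command-line flags take precedence when both are given:

#!/bin/bash
#SBATCH --partition=sb
#SBATCH --time=10-00:00:00
#SBATCH --gres=gpu:1
#SBATCH --output=output.txt
#SBATCH --error=error.txt
# --chdir puts the job in /home/centos/stuff, so refer to files by absolute path.
python /home/centos/stuff/run.py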
Update job time
scontrol update jobid=3 TimeLimit=10-00:00:00
sinfo and nodes - Node States
idle - all cores are available on the compute node; no jobs are running on it
mix - at least one core is available; the compute node has one or more jobs running on it
alloc - all cores on the compute node are assigned to jobs
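To see nodes in a given state, sinfo can filter with --states or show one line per node; a quick sketch:

sinfo --states=idle    # list only idle nodes
sinfo -N -l            # long output, one line per node with its state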
squeue - Job States
R - Job is running on compute nodes
PD - Job is pending, waiting for resources on compute nodes
CG - Job is completing
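squeue can filter on these states with -t; for example, to see only a user's pending jobs:

squeue -u username -t PD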
scontrol show node
This will show each node's details, such as CPU and RAM.
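Passing a node name narrows the output to a single node (sb010 is the node used in the sbatch example above):

scontrol show node sb010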
scancel 10
This will cancel job 10 itself, along with any array tasks within the job.
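Individual array tasks can also be cancelled with jobid_taskid notation, and -u cancels everything a user owns:

scancel 10_3          # cancel only task 3 of array job 10
scancel -u username   # cancel all of a user's jobs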
scontrol update nodename=worker005 state=drain reason="disabling it"
This will let the node finish its current jobs; afterwards it won't take on any new ones.
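Once maintenance is finished, the node can be returned to service by setting its state back:

scontrol update nodename=worker005 state=resume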
Environment variables
$SLURM_JOB_ID ID of job allocation
$SLURM_SUBMIT_DIR Directory where the job was submitted
$SLURM_JOB_NODELIST List of nodes allocated to the job
$SLURM_NTASKS Total number of tasks in the job (set by --ntasks)
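A minimal sketch of a batch script that prints these variables (env_demo.txt is a hypothetical output file):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --output=env_demo.txt
echo "Job ID:          $SLURM_JOB_ID"
echo "Submit dir:      $SLURM_SUBMIT_DIR"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
echo "Task count:      $SLURM_NTASKS"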