Slurm is a program that manages jobs on a cluster server.
https://slurm.schedmd.com/sbatch.html
There are two ways to install it: through a package, or by downloading the files and building them. Package installation is more convenient, but because the latest version is not available as a package, download the installation files from the website and build them.
Because Slurm performs job management through communication between nodes, the firewall must be disabled on every compute node. Munge is also required for secure communication, and the master node needs MySQL (MariaDB) configured for the database.
Starting with Slurm 20.02, compute nodes do not need their own slurm.conf; when slurmd is activated it fetches the configuration from the master node (configless mode).
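A minimal sketch of starting a compute node in configless mode, assuming the controller host is g-master (as in the slurm.conf shown later on this page, which sets SlurmctldParameters=enable_configless); the --conf-server option, or a DNS SRV record, tells slurmd where to fetch the configuration from:
$ sudo slurmd --conf-server g-master:6817   # one-off test: point slurmd directly at the controller
# To make it permanent, pass the option through the slurmd service instead (packaging-dependent), e.g.
# SLURMD_OPTIONS="--conf-server g-master:6817" in /etc/sysconfig/slurmd or /etc/default/slurmd
$ sudo systemctl restart slurmd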
Basic usage
- Munge must be running to encrypt and authenticate communication, and the firewall is disabled on the compute nodes.
$ sudo systemctl start munge
$ sudo systemctl stop firewalld && sudo systemctl disable firewalld
- The master node runs slurmctld, the Slurm central management (controller) daemon.
$ sudo systemctl start slurmctld
- Compute nodes run slurmd.
$ sudo systemctl start slurmd
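To have the daemons come back after a reboot, they can also be enabled rather than only started; a minimal sketch assuming systemd:
$ sudo systemctl enable --now munge        # every node
$ sudo systemctl enable --now slurmctld    # master node only
$ sudo systemctl enable --now slurmd       # compute nodes only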
Checking status
$ sinfo # Commonly used options are -l (--long) and -N (--Node), e.g. $ sinfo -lN
- When a compute node is in the down or drain state rather than idle, enter the following on the master node:
$ sudo scontrol update NodeName=name State=RESUME
- If a node stays in the completing state even though the job has finished:
$ sudo scontrol update NodeName=name State=DOWN Reason=hung_completing
"Could not resolve hostname SERVER: Name or service not known" 이라는 문구가 나오면 hostfile에 추가해준다.
# Sample /etc/hosts file
127.0.0.1 localhost
127.0.1.1 computerhostnamehere
10.0.2.15 server
$ scontrol show nodes
If a problem occurs, check the log files. (The actual paths are set by SlurmdLogFile and SlurmctldLogFile in slurm.conf; the configuration below uses /var/log/slurm-llnl/.)
Compute node bugs: tail /var/log/slurmd.log
Server node bugs: tail /var/log/slurmctld.log
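Daemon status and recent log lines can also be checked through systemd, which helps when the configured log files are empty or missing; a small sketch:
$ systemctl status slurmd            # on a compute node
$ systemctl status slurmctld         # on the master node
$ sudo journalctl -u slurmd -n 50    # last 50 journal entries for slurmd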
$ sinfo --help
Usage: sinfo [OPTIONS]
-a, --all show all partitions (including hidden and those
not accessible)
-b, --bg show bgblocks (on Blue Gene systems)
-d, --dead show only non-responding nodes
-e, --exact group nodes only on exact match of configuration
--federation Report federated information if a member of one
-h, --noheader no headers on output
--hide do not show hidden or non-accessible partitions
-i, --iterate=seconds specify an iteration period
--local show only local cluster in a federation.
Overrides --federation.
-l, --long long output - displays more information
-M, --clusters=names clusters to issue commands to. Implies --local.
NOTE: SlurmDBD must be up.
-n, --nodes=NODES report on specific node(s)
--noconvert don't convert units from their original type
(e.g. 2048M won't be converted to 2G).
-N, --Node Node-centric format
-o, --format=format format specification
-O, --Format=format long format specification
-p, --partition=PARTITION report on specific partition
-r, --responding report only responding nodes
-R, --list-reasons list reason nodes are down or drained
-s, --summarize report state summary only
-S, --sort=fields comma separated list of fields to sort on
-t, --states=node_state specify the what states of nodes to view
-T, --reservation show only reservation information
-v, --verbose verbosity level
-V, --version output version information and exit
Help options:
--help show this help message
--usage display brief usage message
- Running jobs
$ srun python test.py
$ srun bash -c "python test.py"
$ sbatch --wrap="python test.py" # stdout is written to a file named "slurm-" + job ID, e.g. slurm-1234.out
There is a difference between running a job with srun and running it with sbatch.
- srun waits on the master for the job it launched to finish, so it keeps holding resources the whole time; with a very large number of jobs this can hang the master node or keep jobs from running at all.
- sbatch, in contrast, hands the job off to the compute nodes and does not wait for it to finish, so the problems seen with srun do not occur.
Strictly speaking, this difference concerns the running job itself. (Both srun and sbatch first put the job into the queue and then allocate it to a node.)
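For anything more than a one-liner, sbatch is usually given a batch script containing #SBATCH directives rather than --wrap. A minimal sketch (test.sh is a hypothetical script name; the partition, memory value, and test.py follow the examples on this page):
$ cat test.sh
#!/bin/bash
#SBATCH --job-name=test        # same as -J test
#SBATCH --partition=part1      # same as -p part1
#SBATCH --mem=20000            # memory per node, in MB
#SBATCH --cpus-per-task=2      # same as -c 2
#SBATCH --output=display.out   # stdout (default: slurm-%j.out, %j = job ID)
#SBATCH --error=display.err    # stderr
python test.py
$ sbatch test.sh               # prints "Submitted batch job <jobid>"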
Convenience options
srun option | Description
-J jobname | Name displayed when the job is sent to the queue.
--mem=20000 | Memory allocated to the job (per node), in MB. The number of jobs running concurrently on a node is determined by how much of the node's total memory each job is given; for example, on a node with 128 GB of memory, --mem=20000 (~20 GB) allows 6 jobs to run at the same time.
-Q | Suppress informational messages such as job allocation; error messages are still shown.
-N 2 | Number of nodes to use; the same command is run on 2 nodes at once.
-n 2 | Run the same command as 2 tasks.
-c 2 | Number of CPUs used per task.
-w nd-2 | Run the command on the node named nd-2.
-p part1 | If partitions have been defined, run the command only on the given partition.
-o display.out | srun normally prints stdout to the screen; write it to this file instead.
-e display.err | If the job produces errors, save the error output to this file.
$ srun -J test python test.py
$ squeue -S LIST -l
Mon Aug 17 14:44:36 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
180193 part1 test user RUNNING 0:10 UNLIMITED 1 nd-1
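Several of these options can be combined in a single call; a sketch reusing the node, partition, and file names from the table above:
$ srun -J test -p part1 -w nd-2 -c 2 --mem=20000 -o display.out -e display.err python test.py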
$ for i in {1..6};do srun -J job${i} --mem=20000 bash -c "python test.py ${i}" & done
$ squeue -S LIST -l
Mon Aug 17 15:00:00 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
180193 part1 job1 user RUNNING 0:01 UNLIMITED 1 nd-1
180194 part1 job3 user RUNNING 0:01 UNLIMITED 1 nd-1
180195 part1 job4 user RUNNING 0:01 UNLIMITED 1 nd-2
180196 part1 job2 user RUNNING 0:01 UNLIMITED 1 nd-2
180197 part1 job5 user PENDING 0:01 UNLIMITED 1 nd-1
180198 part1 job6 user PENDING 0:01 UNLIMITED 1 nd-2
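As an alternative sketch to the shell for-loop above, the same six jobs could be submitted as one job array via sbatch's -a/--array option (shown in the sbatch help below); SLURM_ARRAY_TASK_ID is the index Slurm sets for each array element:
$ sbatch --array=1-6 -J jobarr --mem=20000 --wrap='python test.py ${SLURM_ARRAY_TASK_ID}'
$ squeue -l   # each element appears as <jobid>_1 ... <jobid>_6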
$ srun -h
Usage: srun [OPTIONS...] executable [args...]
Parallel run options:
-A, --account=name charge job to specified account
--acctg-freq=<datatype>=<interval> accounting and profiling sampling
intervals. Supported datatypes:
task=<interval> energy=<interval>
network=<interval> filesystem=<interval>
--bb=<spec> burst buffer specifications
--bbf=<file_name> burst buffer specification file
--bcast=<dest_path> Copy executable file to compute nodes
--begin=time defer job until HH:MM MM/DD/YY
-c, --cpus-per-task=ncpus number of cpus required per task
--checkpoint=time job step checkpoint interval
--checkpoint-dir=dir directory to store job step checkpoint image
files
--comment=name arbitrary comment
--compress[=library] data compression library used with --bcast
--cpu-freq=min[-max[:gov]] requested cpu frequency (and governor)
-d, --dependency=type:jobid defer job until condition on jobid is satisfied
--deadline=time remove the job if no ending possible before
this deadline (start > (deadline - time[-min]))
--delay-boot=mins delay boot for desired node features
-D, --chdir=path change remote current working directory
--export=env_vars|NONE environment variables passed to launcher with
optional values or NONE (pass no variables)
-e, --error=err location of stderr redirection
--epilog=program run "program" after launching job step
-E, --preserve-env env vars for node and task counts override
command-line flags
--get-user-env used by Moab. See srun man page.
--gres=list required generic resources
--gres-flags=opts flags related to GRES management
-H, --hold submit job in held state
-i, --input=in location of stdin redirection
-I, --immediate[=secs] exit if resources not available in "secs"
--jobid=id run under already allocated job
-J, --job-name=jobname name of job
-k, --no-kill do not kill job on node failure
-K, --kill-on-bad-exit kill the job if any task terminates with a
non-zero exit code
-l, --label prepend task number to lines of stdout/err
--launch-cmd print external launcher command line if not SLURM
--launcher-opts= options for the external launcher command if not
SLURM
-L, --licenses=names required license, comma separated
-M, --clusters=names Comma separated list of clusters to issue
commands to. Default is current cluster.
Name of 'all' will submit to run on all clusters.
NOTE: SlurmDBD must up.
-m, --distribution=type distribution method for processes to nodes
(type = block|cyclic|arbitrary)
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state
changes
--mcs-label=mcs mcs label if mcs plugin mcs/group is used
--mpi=type type of MPI being used
--multi-prog if set the program name specified is the
configuration specification for multiple programs
-n, --ntasks=ntasks number of tasks to run
--nice[=value] decrease scheduling priority by value
-N, --nodes=N number of nodes on which to run (N = min[-max])
-o, --output=out location of stdout redirection
-O, --overcommit overcommit resources
--pack-group=value pack job allocation(s) in which to launch
application
-p, --partition=partition partition requested
--power=flags power management options
--priority=value set the priority of the job to value
--prolog=program run "program" before launching job step
--profile=value enable acct_gather_profile for detailed data
value is all or none or any combination of
energy, lustre, network or task
--propagate[=rlimits] propagate all [or specific list of] rlimits
--pty run task zero in pseudo terminal
--quit-on-interrupt quit on single Ctrl-C
-q, --qos=qos quality of service
-Q, --quiet quiet mode (suppress informational messages)
--reboot reboot block before starting job
-r, --relative=n run job step relative to node n of allocation
--restart-dir=dir directory of checkpoint image files to restart
from
-s, --oversubscribe over-subscribe resources with other jobs
-S, --core-spec=cores count of reserved cores
--signal=[B:]num[@time] send signal when time limit within time seconds
--slurmd-debug=level slurmd debug level
--spread-job spread job across as many nodes as possible
--switches=max-switches{@max-time-to-wait}
Optimum switches and max time to wait for optimum
--task-epilog=program run "program" after launching task
--task-prolog=program run "program" before launching task
--thread-spec=threads count of reserved threads
-T, --threads=threads set srun launch fanout
-t, --time=minutes time limit
--time-min=minutes minimum time limit (if distinct)
-u, --unbuffered do not line-buffer stdout/err
--use-min-nodes if a range of node counts is given, prefer the
smaller count
-v, --verbose verbose mode (multiple -v's increase verbosity)
-W, --wait=sec seconds to wait after first task exits
before killing job
--wckey=wckey wckey to run job under
-X, --disable-status Disable Ctrl-C status feature
Constraint options:
--cluster-constraint=list specify a list of cluster-constraints
--contiguous demand a contiguous range of nodes
-C, --constraint=list specify a list of constraints
--mem=MB minimum amount of real memory
--mincpus=n minimum number of logical processors (threads)
per node
--reservation=name allocate resources from named reservation
--tmp=MB minimum amount of temporary disk
-w, --nodelist=hosts... request a specific list of hosts
-x, --exclude=hosts... exclude a specific list of hosts
-Z, --no-allocate don't allocate nodes (must supply -w)
Consumable resources related options:
--exclusive[=user] allocate nodes in exclusive mode when
cpu consumable resource is enabled
or don't share CPUs for job steps
--exclusive[=mcs] allocate nodes in exclusive mode when
cpu consumable resource is enabled
and mcs plugin is enabled
or don't share CPUs for job steps
--mem-per-cpu=MB maximum amount of real memory per allocated
cpu required by the job.
--mem >= --mem-per-cpu if --mem is specified.
--resv-ports reserve communication ports
Affinity/Multi-core options: (when the task/affinity plugin is enabled)
-B, --extra-node-info=S[:C[:T]] Expands to:
--sockets-per-node=S number of sockets per node to allocate
--cores-per-socket=C number of cores per socket to allocate
--threads-per-core=T number of threads per core to allocate
each field can be 'min' or wildcard '*'
total cpus requested = (N x S x C x T)
--ntasks-per-core=n number of tasks to invoke on each core
--ntasks-per-socket=n number of tasks to invoke on each socket
Help options:
-h, --help show this help message
--usage display brief usage message
Other options:
-V, --version output version information and exit
$ sbatch -h
Usage: sbatch [OPTIONS...] executable [args...]
Parallel run options:
-a, --array=indexes job array index values
-A, --account=name charge job to specified account
--bb=<spec> burst buffer specifications
--bbf=<file_name> burst buffer specification file
--begin=time defer job until HH:MM MM/DD/YY
--comment=name arbitrary comment
--cpu-freq=min[-max[:gov]] requested cpu frequency (and governor)
-c, --cpus-per-task=ncpus number of cpus required per task
-d, --dependency=type:jobid defer job until condition on jobid is satisfied
--deadline=time remove the job if no ending possible before
this deadline (start > (deadline - time[-min]))
--delay-boot=mins delay boot for desired node features
-D, --chdir=directory set working directory for batch script
-e, --error=err file for batch script's standard error
--export[=names] specify environment variables to export
--export-file=file|fd specify environment variables file or file
descriptor to export
--get-user-env load environment from local cluster
--gid=group_id group ID to run job as (user root only)
--gres=list required generic resources
--gres-flags=opts flags related to GRES management
-H, --hold submit job in held state
--ignore-pbs Ignore #PBS options in the batch script
-i, --input=in file for batch script's standard input
-I, --immediate exit if resources are not immediately available
--jobid=id run under already allocated job
-J, --job-name=jobname name of job
-k, --no-kill do not kill job on node failure
-L, --licenses=names required license, comma separated
-M, --clusters=names Comma separated list of clusters to issue
commands to. Default is current cluster.
Name of 'all' will submit to run on all clusters.
NOTE: SlurmDBD must up.
-m, --distribution=type distribution method for processes to nodes
(type = block|cyclic|arbitrary)
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state
changes
--mcs-label=mcs mcs label if mcs plugin mcs/group is used
-n, --ntasks=ntasks number of tasks to run
--nice[=value] decrease scheduling priority by value
--no-requeue if set, do not permit the job to be requeued
--ntasks-per-node=n number of tasks to invoke on each node
-N, --nodes=N number of nodes on which to run (N = min[-max])
-o, --output=out file for batch script's standard output
-O, --overcommit overcommit resources
--profile=value enable acct_gather_profile for detailed data
value is all or none or any combination of
energy, lustre, network or task
--propagate[=rlimits] propagate all [or specific list of] rlimits
-q, --qos=qos quality of service
-Q, --quiet quiet mode (suppress informational messages)
--reboot reboot compute nodes before starting job
--requeue if set, permit the job to be requeued
-s, --oversubscribe over subscribe resources with other jobs
-S, --core-spec=cores count of reserved cores
--signal=[B:]num[@time] send signal when time limit within time seconds
--spread-job spread job across as many nodes as possible
--switches=max-switches{@max-time-to-wait}
Optimum switches and max time to wait for optimum
--thread-spec=threads count of reserved threads
-t, --time=minutes time limit
--time-min=minutes minimum time limit (if distinct)
--uid=user_id user ID to run job as (user root only)
--use-min-nodes if a range of node counts is given, prefer the
smaller count
-v, --verbose verbose mode (multiple -v's increase verbosity)
-W, --wait wait for completion of submitted job
--wckey=wckey wckey to run job under
--wrap[=command string] wrap command string in a sh script and submit
Constraint options:
--cluster-constraint=[!]list specify a list of cluster constraints
--contiguous demand a contiguous range of nodes
-C, --constraint=list specify a list of constraints
-F, --nodefile=filename request a specific list of hosts
--mem=MB minimum amount of real memory
--mincpus=n minimum number of logical processors (threads)
per node
--reservation=name allocate resources from named reservation
--tmp=MB minimum amount of temporary disk
-w, --nodelist=hosts... request a specific list of hosts
-x, --exclude=hosts... exclude a specific list of hosts
Consumable resources related options:
--exclusive[=user] allocate nodes in exclusive mode when
cpu consumable resource is enabled
--exclusive[=mcs] allocate nodes in exclusive mode when
cpu consumable resource is enabled
and mcs plugin is enabled
--mem-per-cpu=MB maximum amount of real memory per allocated
cpu required by the job.
--mem >= --mem-per-cpu if --mem is specified.
Affinity/Multi-core options: (when the task/affinity plugin is enabled)
-B --extra-node-info=S[:C[:T]] Expands to:
--sockets-per-node=S number of sockets per node to allocate
--cores-per-socket=C number of cores per socket to allocate
--threads-per-core=T number of threads per core to allocate
each field can be 'min' or wildcard '*'
total cpus requested = (N x S x C x T)
--ntasks-per-core=n number of tasks to invoke on each core
--ntasks-per-socket=n number of tasks to invoke on each socket
Help options:
-h, --help show this help message
-u, --usage display brief usage message
Other options:
-V, --version output version information and exit
$ squeue -h
Usage: squeue [OPTIONS]
-A, --account=account(s) comma separated list of accounts
to view, default is all accounts
-a, --all display jobs in hidden partitions
--array-unique display one unique pending job array
element per line
--federation Report federated information if a member
of one
-h, --noheader no headers on output
--hide do not display jobs in hidden partitions
-i, --iterate=seconds specify an interation period
-j, --job=job(s) comma separated list of jobs IDs
to view, default is all
--local Report information only about jobs on the
local cluster. Overrides --federation.
-l, --long long report
-L, --licenses=(license names) comma separated list of license names to view
-M, --clusters=cluster_name cluster to issue commands to. Default is
current cluster. cluster with no name will
reset to default. Implies --local.
-n, --name=job_name(s) comma separated list of job names to view
--noconvert don't convert units from their original type
(e.g. 2048M won't be converted to 2G).
-o, --format=format format specification
-O, --Format=format format specification
-p, --partition=partition(s) comma separated list of partitions
to view, default is all partitions
-q, --qos=qos(s) comma separated list of qos's
to view, default is all qos's
-R, --reservation=name reservation to view, default is all
-r, --array display one job array element per line
--sibling Report information about all sibling jobs
on a federated cluster. Implies --federation.
-s, --step=step(s) comma separated list of job steps
to view, default is all
-S, --sort=fields comma separated list of fields to sort on
--start print expected start times of pending jobs
-t, --states=states comma separated list of states to view,
default is pending and running,
'--states=all' reports all states
-u, --user=user_name(s) comma separated list of users to view
--name=job_name(s) comma separated list of job names to view
-v, --verbose verbosity level
-V, --version output version information and exit
-w, --nodelist=hostlist list of nodes to view, default is
all nodes
Help options:
--help show this help message
--usage display a brief summary of squeue options
/etc/slurm/slurm.conf
SlurmctldHost=g-master
SlurmctldParameters=enable_configless
#SlurmctldHost=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPlugin=task/affinity
#TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#FastSchedule=1
SchedulerType=sched/backfill
#SelectType=select/cons_tres
#SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=max_rpc_cnt=0
MessageTimeout=30
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStoragePort=7031
AccountingStoreJobComment=YES
AccountingStorageUser=slurm
ClusterName=NGScluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
JobCompType=jobcomp/mysql
JobCompLoc=slurm_comp_db
JobCompUser=slurm
JobCompPass=SLMbio0912$
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
MaxArraySize=1000000
MaxJobCount=1000000
MaxStepCount=1000000
MaxTasksPerNode=65500
#OverSubscribe=FORCE:40
#MaxCPUsPerTask=unlimit
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
GresTypes=gpu
# COMPUTE NODES
#NodeName=g-master CPUs=4 RealMemory=28000 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=adm-022 CPUs=40 RealMemory=200000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=g-[11-12] CPUs=6 RealMemory=120000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:1
# Partition Configurations
PartitionName=all Nodes=adm-022,g-[11-12] Default=YES MaxTime=INFINITE State=UP OverSubscribe=Yes
#PartitionName=master Nodes=g-master Default=NO MaxTime=INFINITE State=UP OverSubscribe=Yes
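After editing slurm.conf on the master node (in configless mode the compute nodes pick up the change automatically), the running daemons can be told to re-read it; a minimal sketch:
$ sudo scontrol reconfigure                  # re-read slurm.conf without restarting the daemons
$ scontrol show config | grep -i selecttype  # verify the value the controller is actually using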
/etc/slurm/gres.conf
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices with MPS support
##################################################################
#AutoDetect=nvml
NodeName=g-[11-12] Name=gpu File=/dev/nvidia0
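To check that the GPU GRES defined above is visible to Slurm, the node record can be inspected and a GPU requested explicitly; a sketch assuming the node names above and that nvidia-smi is installed on the node:
$ scontrol show node g-11 | grep -i gres     # the node record should list Gres=gpu:1
$ srun -w g-11 --gres=gpu:1 nvidia-smi       # run on g-11 with one GPU allocated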
Reference
- Well-organized installation guide: https://wonwooddo.tistory.com/35
- KISTI SLURM administrator/user guide: https://repository.kisti.re.kr/bitstream/10580/6542/1/2014-147%20Slurm%20%EA%B4%80%EB%A6%AC%EC%9E%90%20%EC%9D%B4%EC%9A%A9%EC%9E%90%20%EA%B0%80%EC%9D%B4%EB%93%9C.pdf
- https://dandyrilla.github.io/2017-04-11/jobsched-slurm/
- https://curc.readthedocs.io/en/latest/running-jobs/slurm-commands.html