Hi, I don't think I have ever seen a signal 9 that wasn't user-initiated. Is it possible you have folks with Slurm coordinator/administrator privileges who may be killing jobs, or a cleanup script running somewhere that does? The only other thing I can think of is the user closing their remote session before the srun completes. I can't recall offhand how it shows up, but OOM might also be at work; run dmesg -T | grep oom on the node to see if the OS is killing jobs to recover memory.
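For the OOM check, something like the following on the node that ran the step (cl4 in your logs) should show whether the kernel's OOM killer fired. This is only a sketch: the journalctl variant assumes a systemd host, and the time window and job id are just taken from your example.

  dmesg -T | grep -iE 'oom|killed process'
  journalctl -k --since "2023-03-27 20:50" --until "2023-03-27 21:00" | grep -i oom

You can also compare the step's recorded memory high-water mark against the 4096MB cgroup limit, though the fields may be empty for steps that die within the 30-second accounting gather interval:

  sacct -j 31360187 -o JobID,ReqMem,MaxRSS,MaxVMSize,State,ExitCode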
Doug

On Mon, Apr 3, 2023, 8:56 AM Robert Barton <r...@realintent.com> wrote:

> Hello,
>
> I'm looking for help in understanding a problem we're having such that
> Slurm indicates that a job was killed, but not why. It's not clear what's
> actually killing the jobs; we've seen jobs killed for time limits and
> out-of-memory issues, and those reasons are obvious in the logs when they
> happen, and that's not happening here.
>
> In Googling for the error messages, it seems like the jobs are killed
> outside of Slurm, but the engineer insists that this is not the case.
>
> This happens sporadically, maybe every one or two million jobs, and is not
> reliably reproducible. I'm looking for any ways to gather more information
> about the cause of these issues.
>
> Slurm version: 20.11.9
>
> The relevant messages:
>
> slurmctld.log:
>
> [2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources JobId=31360187 NodeList=(null) usec=5871
> [2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4 #CPUs=1 Partition=build
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done
>
> slurmd.log:
>
> [2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from UID:255 GID:100 HOST:10.52.49.107 PORT:59370
> [2023-03-27T20:54:23.979] task/affinity: lllp_distribution: JobId=31360187 implicit auto binding: cores,one_thread, dist 1
> [2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: /slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: /slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
> [2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***
> [2023-03-27T20:54:27.099] [31360187.0] done with job
>
> srun output:
>
> srun: job 31360187 queued and waiting for resources
> srun: job 31360187 has been allocated resources
> srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
> srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
> srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before step completely launched.
> srun: Complete StepId=31360187.0+0 received
> slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***
> srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=31360187.0 (status=0x0009).
>
> accounting:
>
> # sacct -o jobid,elapsed,reason,state,exit -j 31360187
>        JobID    Elapsed                 Reason      State ExitCode
> ------------ ---------- ---------------------- ---------- --------
> 31360187       00:00:11                   None     FAILED      0:9
>
> These are compile jobs run via srun.
> The srun command is of this form
> (I've omitted the -I and -D parts as irrelevant and containing private
> information):
>
> ( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \*
> 100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n)
> $SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags ; '
> ; printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror
> -W -Wall -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
> -Wno-maybe-uninitialized -Wno-misleading-indentation
> -Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun -J rgrmake -p build
> -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose bash && fs_sync.sh
> objectfile.o
>
> Slurm config:
>
> Configuration data as of 2023-03-31T16:01:44
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = podarkes
> AccountingStorageExternalHost = (null)
> AccountingStorageParameters = (null)
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = No
> AuthAltTypes = (null)
> AuthAltParameters = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2023-02-21T10:02:56
> BurstBufferType = (null)
> CliFilterPlugins = (null)
> ClusterName = ri_cluster_v20
> CommunicationParameters = (null)
> CompleteWait = 0 sec
> CoreSpecPlugin = core_spec/none
> CpuFreqDef = Unknown
> CpuFreqGovernors = Performance,OnDemand,UserSpace
> CredType = cred/munge
> DebugFlags = NO_CONF_HASH
> DefMemPerNode = UNLIMITED
> DependencyParameters = (null)
> DisableRootJobs = No
> EioTimeout = 60
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = (null)
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobContainerType = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 30 sec
> LaunchParameters = (null)
> LaunchType = launch/slurm
> Licenses = (null)
> LogTimeFormat = iso8601_ms
> MailDomain = (null)
> MailProg = /bin/mail
> MaxArraySize = 1001
> MaxDBDMsgs = 20112
> MaxJobCount = 10000
> MaxJobId = 67043328
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 512
> MCSPlugin = mcs/none
> MCSParameters = (null)
> MessageTimeout = 60 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 31937596
> NodeFeaturesPlugins = (null)
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = (null)
> PowerParameters = (null)
> PowerPlugin =
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> PreemptExemptTime = 00:02:00
> PrEpParameters = (null)
> PrEpPlugins = prep/script
> PriorityParameters = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/cgroup
> Prolog = (null)
> PrologEpilogTimeout = 65534
> PrologSlurmctld = (null)
> PrologFlags = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram = (null)
> ReconfigFlags = (null)
> RequeueExit = (null)
> RequeueExitHold = (null)
> ResumeFailProgram = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 2
> RoutePlugin = route/default
> SbcastParameters = (null)
> SchedulerParameters = batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> ScronParameters = (null)
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = slurm(471)
> SlurmctldAddr = (null)
> SlurmctldDebug = info
> SlurmctldHost[0] = clctl1
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmctldPort = 6816-6817
> SlurmctldSyslogDebug = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg = (null)
> SlurmctldTimeout = 120 sec
> SlurmctldParameters = (null)
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdParameters = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurmd
> SlurmdSyslogDebug = unknown
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogFile = (null)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 20.11.9
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
> StateSaveLocation = /data/slurm/spool
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity,task/cgroup
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TCPTimeout = 2 sec
> TmpFS = /tmp
> TopologyParam = (null)
> TopologyPlugin = topology/none
> TrackWCKey = No
> TreeWidth = 255
> UsePam = No
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
> X11Parameters = (null)
>
> Cgroup Support Configuration:
> AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
> AllowedKmemSpace = (null)
> AllowedRAMSpace = 100.0%
> AllowedSwapSpace = 0.0%
> CgroupAutomount = yes
> CgroupMountpoint = /cgroup
> ConstrainCores = yes
> ConstrainDevices = no
> ConstrainKmemSpace = no
> ConstrainRAMSpace = yes
> ConstrainSwapSpace = yes
> MaxKmemPercent = 100.0%
> MaxRAMPercent = 100.0%
> MaxSwapPercent = 100.0%
> MemorySwappiness = (null)
> MinKmemSpace = 30 MB
> MinRAMSpace = 30 MB
> TaskAffinity = no
>
> Slurmctld(primary) at clctl1 is UP
>
> Please let me know if any other information is needed to understand this.
> Any help is appreciated.
>
> Thanks,
> -rob