Hi Doug,

Thanks much for the reply. I'm certain it's not an OOM: we've looked for those in all the relevant /var/log/messages, and we do have our share of OOMs in this environment. We've configured Slurm to kill jobs that go over their defined memory limits, so we're familiar with what that looks like.
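
For reference, the checks we ran look roughly like this (log path varies by distro; run on each compute node that hosted a failed job):

dmesg -T | grep -i oom
grep -i 'oom-killer\|killed process' /var/log/messages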

The engineer asserts not only that the process wasn't killed by him or by the calling process, but also that Slurm never ran the job at all. I believe he concluded that because he didn't see the output he was looking for; as the logs show, though, the job started and ran for a few seconds. My guess is that the job didn't live long enough to flush its stdout buffer.
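
If the buffering theory is right, forcing line-buffered output should let messages survive an early death. Next time we may wrap the payload in coreutils stdbuf, roughly like this (untested sketch; the_command stands in for the real compile step):

stdbuf -oL -eL the_command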

The job in question is launched by a process that is itself run from cron, so I believe that rules out the possibility of a remote session closing.

My best information tells me that the job started, ran for a few seconds until it discovered it was missing something it needed, and died, but I don't have enough insight to be sure. This srun message is the most perplexing part; I don't believe I've seen it before, and Googling turns up very little useful information:

srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before step completely launched.
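
Since Googling came up empty, my next step is to grep the Slurm source tree for the message text (the path below is just wherever the 20.11.9 source happens to be unpacked):

grep -rn 'aborted before step' slurm-20.11.9/src/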


Have you ever seen this before?

Thanks,
-rob


On Tue, Apr 4, 2023, 7:19 PM Doug Meyer <dameye...@gmail.com> wrote:

Hi,

I don't think I have ever seen a sig 9 (SIGKILL) that wasn't sent by a user. Is it possible you have people with the Slurm coordinator/administrator privilege who may be killing jobs or running a cleanup script? The only other thing I can think of is the user closing their remote session before the srun completes. I can't recall the details right now, but the OOM killer might also be at work; run dmesg -T | grep oom to see if the OS is killing jobs to recover memory.
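
If you want to rule out an admin or a script, you could also grep the ctld log for kill requests; the exact wording varies by Slurm version, but a cancel RPC normally lands there with the requesting UID. Something like:

grep -i 'kill_job' /var/log/slurm/slurmctld.log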

Doug


On Mon, Apr 3, 2023, 8:56 AM Robert Barton <r...@realintent.com> wrote:

Hello,

I'm looking for help understanding a problem where Slurm indicates that a job was killed, but not why. It's not clear what's actually killing the jobs. We've seen jobs killed for time limits and out-of-memory issues, but those reasons are obvious in the logs when they happen, and neither appears here.

Googling the error messages suggests the jobs are being killed from outside Slurm, but the engineer insists that this is not the case.

This happens sporadically, maybe once every one or two million jobs, and is not reliably reproducible. I'm looking for any way to gather more information about the cause of these failures.
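
One avenue we haven't tried yet is raising the daemon log levels around a reproduction window, along these lines (scontrol setdebug changes the level at runtime and reverts on daemon restart):

scontrol setdebug debug2
# ...wait for a recurrence, then restore...
scontrol setdebug info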

Slurm version: 20.11.9

The relevant messages:

slurmctld.log:

[2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources JobId=31360187 NodeList=(null) usec=5871
[2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4 #CPUs=1 Partition=build
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done

slurmd.log:

[2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from UID:255 GID:100 HOST:10.52.49.107 PORT:59370
[2023-03-27T20:54:23.979] task/affinity: lllp_distribution: JobId=31360187 implicit auto binding: cores,one_thread, dist 1
[2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: /slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: /slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***
[2023-03-27T20:54:27.099] [31360187.0] done with job

srun output:

srun: job 31360187 queued and waiting for resources
srun: job 31360187 has been allocated resources
srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before step completely launched.
srun: Complete StepId=31360187.0+0 received
slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=31360187.0 (status=0x0009).

accounting:

# sacct -o jobid,elapsed,reason,state,exit -j 31360187
        JobID    Elapsed                 Reason      State ExitCode
------------ ---------- ---------------------- ---------- --------
31360187       00:00:11                   None     FAILED      0:9
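
(For what it's worth, an ExitCode of 0:9 means the step was terminated by signal 9 rather than exiting with a nonzero code.) A wider query might add timing context; field names are from sacct(1):

sacct -j 31360187 --duplicates -o jobid,submit,start,end,elapsed,state,exitcode,derivedexitcode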


These are compile jobs run via srun. The srun command is of this form
(I've omitted the -I and -D parts as irrelevant and containing private
information):

( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \* 100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n) $SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; \
  printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror \
    -W -Wall -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized \
    -Wno-maybe-uninitialized -Wno-misleading-indentation \
    -Wno-implicit-fallthrough -std=c++20 -g -g2 ) \
  | srun -J rgrmake -p build -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb \
      --verbose bash \
  && fs_sync.sh objectfile.o


Slurm config:

Configuration data as of 2023-03-31T16:01:44
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = podarkes
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2023-02-21T10:02:56
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = ri_cluster_v20
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 20112
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 31937596
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:02:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SbcastParameters        = (null)
SchedulerParameters     = batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(471)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = clctl1
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6816-6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.11.9
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /data/slurm/spool
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 255
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = yes
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at clctl1 is UP
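
One more data point we could capture on a live job: with ConstrainSwapSpace=yes and AllowedSwapSpace=0%, the mem and memsw limits are identical (as the slurmd log above shows), so the cgroup fail counters would tell us whether the limit was ever hit. The paths below are reconstructed from CgroupMountpoint and the slurmd log entries for the failed job (cgroup v1):

cat /cgroup/memory/slurm/uid_255/job_31360187/memory.failcnt
cat /cgroup/memory/slurm/uid_255/job_31360187/memory.memsw.failcnt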


Please let me know if any other information is needed to understand this.
Any help is appreciated.

Thanks,
-rob

