https://twitter.com/systemdsucks Is that my coat? Why thank you.
On Tue, 27 Nov 2018 at 11:21, Kevin Buckley <kevin.buck...@pawsey.org.au> wrote:

> Here are our usual Slurm log-related config settings
>
> (where we've increased the default "Syslog" level as we're trying
> to get the Crays to funnel the slurmd log messages all the way up
> to the machine that syslog messages go to on Crays)
>
> SlurmctldDebug=info
> SlurmctldLogFile=/var/spool/slurm/slurmctld.log
> SlurmctldSyslogDebug=info
> SlurmdDebug=info
> SlurmdLogFile=/var/spool/slurmd/%h.log
> SlurmdSyslogDebug=info
>
> and here are the logged outputs, for a particular job, on both channels:
> the systemd journal and slurmd's dedicated log file:
>
> SYSTEMD JOURNAL
>
> Nov 27 16:57:32 nid00024 slurmd[18994]: task_p_slurmd_batch_request: 796
> Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU input mask for node: 0x00FF
> Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU final HW mask for node: 0x0F0F
> Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: run job script took usec=10517
> Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: prolog with lock for job 796 ran for 0 seconds
> Nov 27 16:57:33 nid00024 slurmd[18994]: Launching batch job 796 for UID 20480
> Nov 27 16:57:33 nid00024 slurmd[18994]: launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
> Nov 27 16:57:33 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
> Nov 27 16:57:34 nid00024 slurmd[18994]: launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
> Nov 27 16:57:34 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
>
> SLURMD LOG FILE
>
> [2018-11-27T16:57:32.850] task_p_slurmd_batch_request: 796
> [2018-11-27T16:57:32.850] task/affinity: job 796 CPU input mask for node: 0x00FF
> [2018-11-27T16:57:32.850] task/affinity: job 796 CPU final HW mask for node: 0x0F0F
> [2018-11-27T16:57:32.862] _run_prolog: run job script took usec=10517
> [2018-11-27T16:57:32.862] _run_prolog: prolog with lock for job 796 ran for 0 seconds
> [2018-11-27T16:57:32.874] [796.extern] core_spec/cray: init
> [2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.666] Launching batch job 796 for UID 20480
> [2018-11-27T16:57:33.680] [796.batch] core_spec/cray: init
> [2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.696] [796.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:33.871] launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
> [2018-11-27T16:57:33.871] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
> [2018-11-27T16:57:33.882] [796.0] core_spec/cray: init
> [2018-11-27T16:57:33.918] [796.0] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
> [2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:34.580] [796.0] done with job
> [2018-11-27T16:57:34.940] launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
> [2018-11-27T16:57:34.941] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
> [2018-11-27T16:57:34.951] [796.1] core_spec/cray: init
> [2018-11-27T16:57:34.994] [796.1] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
> [2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:35.013] [796.1] task_p_pre_launch: Using sched_affinity for tasks
> [2018-11-27T16:57:46.460] [796.1] done with job
> [2018-11-27T16:57:46.602] [796.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> [2018-11-27T16:57:46.608] [796.batch] done with job
> [2018-11-27T16:57:46.760] [796.extern] done with job
>
> where we see that SOME, but not ALL, of the log messages appear on both
> of the channels.
>
> Now let's comment out the use of a dedicated slurmd log file and run
> exactly the same job. FWIW:
>
> SlurmctldDebug=info
> SlurmctldLogFile=/var/spool/slurm/slurmctld.log
> SlurmctldSyslogDebug=info
> SlurmdDebug=info
> #lurmdLogFile=/var/spool/slurmd/%h.log
> SlurmdSyslogDebug=info
>
> SYSTEMD JOURNAL
>
> Nov 27 16:51:51 nid00024 slurmd[16415]: task_p_slurmd_batch_request: 794
> Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU input mask for node: 0x00FF
> Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU final HW mask for node: 0x0F0F
> Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: run job script took usec=10472
> Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: prolog with lock for job 794 ran for 0 seconds
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: core_spec/cray: init
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmd[16415]: Launching batch job 794 for UID 20480
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: core_spec/cray: init
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18419]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmd[16415]: launch task 794.0 request from 20480.20480@10.128.0.25 (port 13485)
> Nov 27 16:51:52 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: core_spec/cray: init
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:52 nid00024 slurmstepd[18527]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18529]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18528]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:52 nid00024 slurmstepd[18530]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18520]: done with job
> Nov 27 16:51:53 nid00024 slurmd[16415]: launch task 794.1 request from 20480.20480@10.128.0.25 (port 18093)
> Nov 27 16:51:53 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: core_spec/cray: init
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
> Nov 27 16:51:53 nid00024 slurmstepd[18558]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18559]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18560]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:51:53 nid00024 slurmstepd[18561]: task_p_pre_launch: Using sched_affinity for tasks
> Nov 27 16:52:04 nid00024 slurmstepd[18551]: done with job
> Nov 27 16:52:05 nid00024 slurmstepd[18411]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> Nov 27 16:52:05 nid00024 slurmstepd[18411]: done with job
> Nov 27 16:52:05 nid00024 slurmstepd[18143]: done with job
>
> SLURMD LOG FILE
>
> Well, nothing: obviously.
>
> The thing to note here is that log messages that ONLY appeared in the
> dedicated log file, when there was one, now show up in the systemd
> journal (and of course, have that nice "slurmstepd" tag that makes
> identifying them much easier) along with the log messages that ALWAYS
> appear on BOTH channels.
>
> So, whilst it seems to be the case that,
>
>   you will get SOME of the messages on ALL of the channels, ALL of the time
>
> and even though it seems that,
>
>   you can get ALL of the messages on SOME of the channels, SOME of the time
>
> sadly it would appear that,
>
>   you can't get ALL of the messages on ALL of the channels, ALL of the time
>
> Am I a fool for thinking that you should be able to?
>
> Kevin Buckley
> --
> Supercomputing Systems Administrator
> Pawsey Supercomputing Centre
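
For what it's worth: since the slurmstepd messages do land in the journal once
SlurmdLogFile is unset, and they carry their own syslog identifier, you can at
least pull both streams back out of the journal together. A minimal sketch,
assuming the identifiers match what your journal excerpts show (slurmd and
slurmstepd) and picking a time window around job 794:

  # query the node's journal for both daemons' messages in one go,
  # filtering on the syslog identifiers seen in the excerpts above
  journalctl -t slurmd -t slurmstepd --since "2018-11-27 16:51:00" --until "2018-11-27 16:53:00"

Adding "-o short-precise" gives timestamps with microsecond resolution, which
is closer to what the dedicated slurmd log file was recording, and "-o json"
keeps the _PID field if you want to tell the step daemons apart afterwards.
None of that gets you ALL of the messages on ALL of the channels, of course,
just all of them on this one.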