Here are our usual Slurm log-related config settings (we've raised the default "Syslog" level because we're trying to get the Crays to funnel the slurmd log messages all the way up to the machine that syslog messages get forwarded to on a Cray):
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info
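(As an aside, for anyone wanting to reproduce the journal captures below: here's a minimal sketch of one way to pull the slurmd/slurmstepd lines back out of the journal on a node. It's purely illustrative, not anything Slurm ships; it assumes journalctl is on the node's PATH, and the --since value is just an example.)

#!/usr/bin/env python3
# Illustrative sketch only: dump the slurmd and slurmstepd messages from
# the local systemd journal, so they can be eyeballed against whatever
# lands in a dedicated SlurmdLogFile.
import subprocess

def journal_lines(since="2018-11-27 16:57:00"):   # example value, adjust to taste
    # -t/--identifier filters on the syslog identifier and can be given
    # more than once, so we pick up the daemon and the stepds together.
    cmd = [
        "journalctl", "--no-pager", "-o", "short-iso",
        "-t", "slurmd", "-t", "slurmstepd",
        "--since", since,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

if __name__ == "__main__":
    for line in journal_lines():
        print(line)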
Here are the "logged" outputs on both channels, the systemd journal and slurmd's dedicated log file, for a particular job:

SYSTEMD JOURNAL

Nov 27 16:57:32 nid00024 slurmd[18994]: task_p_slurmd_batch_request: 796
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU input mask for node: 0x00FF
Nov 27 16:57:32 nid00024 slurmd[18994]: task/affinity: job 796 CPU final HW mask for node: 0x0F0F
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: run job script took usec=10517
Nov 27 16:57:32 nid00024 slurmd[18994]: _run_prolog: prolog with lock for job 796 ran for 0 seconds
Nov 27 16:57:33 nid00024 slurmd[18994]: Launching batch job 796 for UID 20480
Nov 27 16:57:33 nid00024 slurmd[18994]: launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
Nov 27 16:57:33 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
Nov 27 16:57:34 nid00024 slurmd[18994]: launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
Nov 27 16:57:34 nid00024 slurmd[18994]: lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread

SLURMD LOG FILE

[2018-11-27T16:57:32.850] task_p_slurmd_batch_request: 796
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU input mask for node: 0x00FF
[2018-11-27T16:57:32.850] task/affinity: job 796 CPU final HW mask for node: 0x0F0F
[2018-11-27T16:57:32.862] _run_prolog: run job script took usec=10517
[2018-11-27T16:57:32.862] _run_prolog: prolog with lock for job 796 ran for 0 seconds
[2018-11-27T16:57:32.874] [796.extern] core_spec/cray: init
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:32.898] [796.extern] task/cgroup: /slurm/uid_20480/job_796/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.666] Launching batch job 796 for UID 20480
[2018-11-27T16:57:33.680] [796.batch] core_spec/cray: init
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.690] [796.batch] task/cgroup: /slurm/uid_20480/job_796/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.696] [796.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.871] launch task 796.0 request from 20480.20480@10.128.0.25 (port 25773)
[2018-11-27T16:57:33.871] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:33.882] [796.0] core_spec/cray: init
[2018-11-27T16:57:33.918] [796.0] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.934] [796.0] task/cgroup: /slurm/uid_20480/job_796/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:33.940] [796.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:34.580] [796.0] done with job
[2018-11-27T16:57:34.940] launch task 796.1 request from 20480.20480@10.128.0.25 (port 30381)
[2018-11-27T16:57:34.941] lllp_distribution jobid [796] auto binding off: mask_cpu,one_thread
[2018-11-27T16:57:34.951] [796.1] core_spec/cray: init
[2018-11-27T16:57:34.994] [796.1] (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.006] [796.1] task/cgroup: /slurm/uid_20480/job_796/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.012] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:35.013] [796.1] task_p_pre_launch: Using sched_affinity for tasks
[2018-11-27T16:57:46.460] [796.1] done with job
[2018-11-27T16:57:46.602] [796.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-11-27T16:57:46.608] [796.batch] done with job
[2018-11-27T16:57:46.760] [796.extern] done with job

where we see that SOME, but not ALL, of the log messages appear on both channels.

Now let's comment out the use of a dedicated slurmd log file and run exactly the same job, FWIW:

SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmctldSyslogDebug=info
SlurmdDebug=info
#SlurmdLogFile=/var/spool/slurmd/%h.log
SlurmdSyslogDebug=info

SYSTEMD JOURNAL

Nov 27 16:51:51 nid00024 slurmd[16415]: task_p_slurmd_batch_request: 794
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU input mask for node: 0x00FF
Nov 27 16:51:51 nid00024 slurmd[16415]: task/affinity: job 794 CPU final HW mask for node: 0x0F0F
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: run job script took usec=10472
Nov 27 16:51:51 nid00024 slurmd[16415]: _run_prolog: prolog with lock for job 794 ran for 0 seconds
Nov 27 16:51:51 nid00024 slurmstepd[18143]: core_spec/cray: init
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:51 nid00024 slurmstepd[18143]: task/cgroup: /slurm/uid_20480/job_794/step_extern: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmd[16415]: Launching batch job 794 for UID 20480
Nov 27 16:51:52 nid00024 slurmstepd[18411]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18411]: task/cgroup: /slurm/uid_20480/job_794/step_batch: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18419]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmd[16415]: launch task 794.0 request from 20480.20480@10.128.0.25 (port 13485)
Nov 27 16:51:52 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:52 nid00024 slurmstepd[18520]: core_spec/cray: init
Nov 27 16:51:52 nid00024 slurmstepd[18520]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18520]: task/cgroup: /slurm/uid_20480/job_794/step_0: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:52 nid00024 slurmstepd[18527]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18529]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18528]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:52 nid00024 slurmstepd[18530]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18520]: done with job
Nov 27 16:51:53 nid00024 slurmd[16415]: launch task 794.1 request from 20480.20480@10.128.0.25 (port 18093)
Nov 27 16:51:53 nid00024 slurmd[16415]: lllp_distribution jobid [794] auto binding off: mask_cpu,one_thread
Nov 27 16:51:53 nid00024 slurmstepd[18551]: core_spec/cray: init
Nov 27 16:51:53 nid00024 slurmstepd[18551]: (switch_cray.c: 654: switch_p_job_init) gres_cnt: 0 1
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18551]: task/cgroup: /slurm/uid_20480/job_794/step_1: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
Nov 27 16:51:53 nid00024 slurmstepd[18558]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18559]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18560]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:51:53 nid00024 slurmstepd[18561]: task_p_pre_launch: Using sched_affinity for tasks
Nov 27 16:52:04 nid00024 slurmstepd[18551]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18411]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Nov 27 16:52:05 nid00024 slurmstepd[18411]: done with job
Nov 27 16:52:05 nid00024 slurmstepd[18143]: done with job

SLURMD LOG FILE

Well, nothing, obviously.
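(For what it's worth, the "who logged what, where" can be checked mechanically rather than by eye. Here's a rough sketch that strips the channel-specific prefixes from the two captures above and reports the messages that only made it to one side; the script itself, and the file names journal.txt and slurmd.log, are purely illustrative, not anything Slurm provides, and it compares message text only, ignoring timestamps and repeat counts.)

#!/usr/bin/env python3
# Illustrative sketch only: compare the journal capture with the dedicated
# slurmd log file by normalising away each channel's own prefix.
# Assumes journal.txt holds journalctl output and slurmd.log holds the
# SlurmdLogFile contents, pasted as above (hypothetical file names).
import re

def journal_messages(path):
    # "Nov 27 16:57:32 nid00024 slurmd[18994]: <message>"
    msgs = set()
    with open(path) as fh:
        for line in fh:
            m = re.match(r".*?\b(?:slurmd|slurmstepd)\[\d+\]:\s+(.*)", line)
            if m:
                msgs.add(m.group(1).strip())
    return msgs

def logfile_messages(path):
    # "[2018-11-27T16:57:32.850] [796.batch] <message>"  (step tag optional)
    msgs = set()
    with open(path) as fh:
        for line in fh:
            m = re.match(r"\[\d{4}-\d{2}-\d{2}T[^\]]+\]\s+(?:\[[^\]]+\]\s+)?(.*)", line)
            if m:
                msgs.add(m.group(1).strip())
    return msgs

if __name__ == "__main__":
    j, f = journal_messages("journal.txt"), logfile_messages("slurmd.log")
    print("only in the journal:")
    for msg in sorted(j - f):
        print("  ", msg)
    print("only in the slurmd log file:")
    for msg in sorted(f - j):
        print("  ", msg)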
The thing to note here is that the log messages that ONLY appeared in the dedicated log file, when there was one, now show up in the systemd journal (and, of course, carry that nice "slurmstepd" tag that makes identifying them much easier), alongside the log messages that ALWAYS appear on BOTH channels.

So, whilst it seems to be the case that

  you will get SOME of the messages, on ALL of the channels, ALL of the time,

and even though it seems that

  you can get ALL of the messages, on SOME of the channels, SOME of the time,

sadly it would appear that

  you can't get ALL of the messages, on ALL of the channels, ALL of the time.

Am I a fool for thinking that you should be able to?

Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre