To follow up: this did not fix the problem. On Fri, Oct 21, 2016 at 2:45 PM, Daniel Povey <dpo...@gmail.com> wrote:
> I just want to follow up on this that I believe I have found the reason > for this bug and I have a solution which I am testing out. > > In the output of `systemctl status systemd-journald` it says: > Active: failed (Result: start-limit) > and this is related to the systemd mechanism to stop cycles where services > keep restarting. It's related to the configs > StartLimitIntervalSec= [defaults to DefaultStartLimitIntervalSec=10 sec > by default] > StartLimitBurst= [defaults to DefaultStartLimitBurst=5 by default] > > so that if a service is restarted 5 times within 10 seconds, it's not > restarted any more to avoid wasting resources. > What seems to happen is that if systemd-journal-flush.service sends the > signal SIGUSR1 to systemd-journald and systemd-journald wasn't ready to > handle it for some reason (e.g. had just been started, or there was some > hangup), then systemd-journald is restarted, and > systemd-journal-flush.service immediately restarts also. Again > systemd-journald won't be ready to handle the signal as it hasn't had time > to run the signal handler, and the cycle repeats 5 times until the > systemd-journald service hits the StartLimitBurst and dies. I verified > using `lastcomm` (we have process accounting enabled) that this did in fact > happen. > > The way I am trying to fix it is to introduce a restart delay of 10 > seconds into the systemd-journal-flush service. See the config file below; > the only new line that differs from the defaults in > /lib/systemd/system/systemd-journal-flush.service is the line > RestartSec=10. > > Dan > > > > ====== > cat /etc/systemd/system/systemd-journal-flush.service > # This file is part of systemd. > # > # systemd is free software; you can redistribute it and/or modify it > # under the terms of the GNU Lesser General Public License as published by > # the Free Software Foundation; either version 2.1 of the License, or > # (at your option) any later version. > > [Unit] > Description=Trigger Flushing of Journal to Persistent Storage > Documentation=man:systemd-journald.service(8) man:journald.conf(5) > DefaultDependencies=no > Requires=systemd-journald.service > After=systemd-journald.service local-fs.target remote-fs.target > Before=systemd-user-sessions.service > # Don't re-start this service too often, or we can get into a cycle where > # if this service was run just after systemd-journald was started, it > # fails to handle the signal, causing that service to be restarted > # and this servie to be immediately restarted also, until we hit > # the limit of StartLimitBurst=5 restarts and systemd-journald > # dies. > RestartSec=10 > > [Service] > ExecStart=/bin/systemctl kill --kill-who=main --signal=SIGUSR1 > systemd-journald.service > Type=oneshot > > > On Sat, Oct 15, 2016 at 9:48 PM, Daniel Povey <dpo...@gmail.com> wrote: > >> Another observation about this bug, which might be helpful. >> >> If the signal is sent to systemd-journald via >> /bin/systemctl kill --kill-who=main --signal=SIGUSR1 >> systemd-journald.service >> then messages like the following show up in the kernel messages from >> `dmesg -T`, like: >> >> [Sat Oct 15 21:02:35 2016] systemd-journald[26517]: Received request to >> flush runtime journal from PID 1 >> >> but they don't show up in the output of 'journalctl -r'. >> In /etc/systemd/journald.conf, it says: >> >> #MaxLevelStore=debug >> #MaxLevelSyslog=debug >> >> so I would have thought the same messages would go to both places. >> I don't know if I'm misunderstanding something here.. >> >> Dan >> >> >> >> >> >> On Sat, Oct 15, 2016 at 9:07 PM, Daniel Povey <dpo...@gmail.com> wrote: >> >>> BTW, I attach the output from `systemd-analyze dump`, as dump.txt. >>> It would be great if the debian people could help us look into this. >>> Lennart has a policy that his team will only look into bug reports in >>> the latest two versions of systemd, and obviously we are well behind that. >>> >>> Dan >>> >>> >>> On Sat, Oct 15, 2016 at 8:53 PM, Daniel Povey <dpo...@gmail.com> wrote: >>> >>>> >>>> I just want to report that we are suffering from this bug, and it is >>>> quite frequent. >>>> This is with version 215-17+deb8u5 . >>>> >>>> root@a12:~# systemctl status systemd-journald >>>> >>>> *** systemd-journald.service - Journal Service >>>> >>>> Loaded: loaded (/lib/systemd/system/systemd-journald.service; >>>> static) >>>> >>>> Active: *failed* (Result: start-limit) since Sat 2016-10-15 >>>> 12:01:57 EDT; 8h ago >>>> >>>> Docs: man:systemd-journald.service(8) >>>> >>>> man:journald.conf(5) >>>> >>>> Process: 51561 ExecStart=/lib/systemd/systemd-journald *(code=killed, >>>> signal=USR1)* >>>> >>>> Main PID: 51561 (code=killed, signal=USR1) >>>> >>>> >>>> >>> >> >