Hi,
On Thu, May 16, 2013 at 06:23:15PM +0200, [email protected] wrote:
> I recently stumbled over the bug with the watchdog mechanism that has already
> been reported to free desktop bugzilla (56109).
>
> I analyzed the bug and came to a simple solution for solving it.
>
> First, what I think is going on:
> - watchdog timeout is detected in service_handle_watchdog(),
> service_enter_dead(…) is called
> - service_enter_dead() sets the service state to auto_restart
> - triggered by a timer, service_enter_restart is called
> - service_enter_restart schedules a restart job
> - systemd splits up the jobs into a stop and a start job and schedules
> both
> - the stop job lasts to a call of service_stop()
> - here it begins to get interesting:
> - based on the AUTO_RESTART state, this function decides to go
> directly into dead state, nothing of the normal stopping procedure is done.
> This is probably because in most cases that cause a restart to be scheduled
> the stop proceeding is done automatically (for instance in case of a killed
> or normally exiting service.). But this is not true for a watchdog timeout.
> Nothing of the stop proceeding is executed in case of such a timeout. So the
> process that missed to send the watchdog event is going on to life (in which
> state ever). No one is cleaning up. A second instance of the service is
> started.
I'm quite sure this worked correctly at some point, when I first
implemented it. But I can reproduce the Problem here.
> My suggestion to solve this:
>
> Changes are needed in service.c in service_stop(…).
>
> change:
> /* A restart will be scheduled or is in progress. */
> if (s->state == SERVICE_AUTO_RESTART) {
> service_set_state(s, SERVICE_DEAD);
> return 0;
> }
>
> to:
> /* A restart will be scheduled or is in progress.
> In all cases but the watchdog timeout, stop is already progressed
> by systemd automatically*/
> if (s->state == SERVICE_AUTO_RESTART && s->result !=
> SERVICE_FAILURE_WATCHDOG) {
> service_set_state(s, SERVICE_DEAD);
> return 0;
> }
>
> and change:
>
> assert(s->state == SERVICE_RUNNING ||
> s->state == SERVICE_EXITED);
>
>
> to:
> assert(s->state == SERVICE_RUNNING ||
> s->state == SERVICE_AUTO_RESTART ||
> s->state == SERVICE_EXITED);
>
> I tested the following:
> - the watchdog mechanism is now actually stopping / killing the
> service in case it is not sending the watchdog event right in time
> - a restart triggered by a killed service works like before
>
> Hopefully, I didn’t miss some side effects caused by my changes.
>
>
> Any opinions on my proposed changes?
No, this is wrong. The change should happen in service_handle_watchdog().
Calling service_enter_dead() is wrong. Calling service_enter_stop() there
would probably create more or less what your change does.
But I think this is still wrong. When this code is executed then the
service is in an undefined state. Calling the normal shutdown mechanism can
imho actually be harmful. I think just killing the process here is correct.
I've sent a patch, that does this.
Btw, I've tested both versions with a worst-case failure (a process that
cannot be killed): With your code it waits 4x TimeoutStopSec before the
service is started again. With my patch it's 1x TimeoutStopSec.
Regards,
Michael
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
_______________________________________________
systemd-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/systemd-devel