Hi,

On Thu, May 16, 2013 at 06:23:15PM +0200, [email protected] wrote:
> I recently stumbled over the bug with the watchdog mechanism that has already 
> been reported to free desktop bugzilla (56109).
>  
> I analyzed the bug and came to a simple solution for solving it.
>  
> First, what I think is going on:
> -        watchdog timeout is detected in service_handle_watchdog(), 
> service_enter_dead(…) is called
> -        service_enter_dead() sets the service state to auto_restart
> -        triggered by a timer, service_enter_restart is called
> -        service_enter_restart  schedules a restart job
> -        systemd splits up the jobs into a stop and a start job and schedules 
> both
> -        the stop job lasts to a call of service_stop()
> -        here it begins to get interesting:
> -        based on the AUTO_RESTART state, this function decides to go 
> directly into dead state, nothing of the normal stopping procedure is done. 
> This is probably because in most cases that cause a restart to be scheduled 
> the stop proceeding is done automatically (for instance in case of a killed 
> or normally exiting service.). But this is not true for a watchdog timeout. 
> Nothing of the stop proceeding is executed in case of such a timeout. So the 
> process that missed to send the watchdog event is going on to life (in which 
> state ever). No one is cleaning up. A second instance of the service is 
> started.

I'm quite sure this worked correctly at some point, when I first
implemented it. But I can reproduce the Problem here.

> My suggestion to solve this:
>  
> Changes are needed in service.c in service_stop(…).
>  
> change:
> /* A restart will be scheduled or is in progress. */
>         if (s->state == SERVICE_AUTO_RESTART) {
>                 service_set_state(s, SERVICE_DEAD);
>                 return 0;
>         }
>  
> to:
> /* A restart will be scheduled or is in progress. 
>            In all cases but the watchdog timeout, stop is already progressed 
> by systemd automatically*/
>         if (s->state == SERVICE_AUTO_RESTART && s->result != 
> SERVICE_FAILURE_WATCHDOG) {
>                 service_set_state(s, SERVICE_DEAD);
>                 return 0;
>         }
>  
> and change:
>  
> assert(s->state == SERVICE_RUNNING ||
>              s->state == SERVICE_EXITED);
>  
>  
> to:
> assert(s->state == SERVICE_RUNNING ||
>                s->state == SERVICE_AUTO_RESTART ||
>                s->state == SERVICE_EXITED);
>  
> I tested the following:
> -        the watchdog mechanism is now actually stopping / killing the 
> service in case it is not sending the watchdog event right in time
> -        a restart triggered by a killed service works like before
>  
> Hopefully, I didn’t miss some side effects caused by my changes.
>  
>  
> Any opinions on my proposed changes?

No, this is wrong. The change should happen in service_handle_watchdog().
Calling service_enter_dead() is wrong. Calling service_enter_stop() there
would probably create more or less what your change does.
But I think this is still wrong. When this code is executed then the
service is in an undefined state. Calling the normal shutdown mechanism can
imho actually be harmful. I think just killing the process here is correct.
I've sent a patch, that does this.

Btw, I've tested both versions with a worst-case failure (a process that
cannot be killed): With your code it waits 4x TimeoutStopSec before the
service is started again. With my patch it's 1x TimeoutStopSec.


Regards,
Michael


-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
_______________________________________________
systemd-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

Reply via email to