Attempting to summarize what was said on this topic in the thread so far, and at the last technical committee meeting:
It's perhaps important to note that we are not discussing ideal situations here: any time this conversation becomes relevant, something is already wrong. We're aiming to recommend the lesser evil, rather than something actually desirable.

One of the points of view here is Ian and Wouter's assertion that whenever a service fails to restart in a maintainer script, the most important thing is to make sure the sysadmin pays attention and fixes it before proceeding. Julien Cristau made another point in support of "failure to restart implies failure to configure" on IRC, namely that the only straightforward thing for an automated upgrade to do is to look at the successful or failed exit status of the package manager (whether that means dpkg, apt, unattended-upgrades or whatever), and assume that exiting 0 means everything is fine and exiting nonzero means attention is required.

At the opposite extreme, Marga's team manages thousands of desktops, and having to do *anything* manual to any significant number of them doesn't scale. We can think of inexperienced users' desktops as a bit like this scenario too, except that instead of having a professional sysadmin, they have to ask volunteers for help through channels like debian-user and #debian (and those volunteers' help doesn't really scale well either).

It's also undesirable if the mechanism we use to escalate the failure to the user is one that itself makes it harder to diagnose or fix the problem, and in particular there's a concern that when packages fail to configure, that can make it harder to use apt to install the necessary tools to diagnose what has gone wrong; Stuart points out that in his experience of helping people in #debian, this is a practical problem. Ian considers it to be a design flaw in apt that the actions the user can take while a package is unconfigured are so constrained; however, we work with the tools we have, not the tools we'd like to have.
We seem to have consensus among the technical committee that it is at least occasionally appropriate for failure to restart to cause failure to configure, although this might be the exception rather than the rule. The examples given where the error path is most important were packages that provide a system-level API to other packages, so their failures are likely to cause other packages to fail to configure (such as local DNS caches and authentication services like LDAP); and packages that provide remote access, so their failures need to be fixed before a potentially remote sysadmin logs out to prevent the sysadmin from being locked out longer-term (like sshd).

I'm not sure whether we have a concrete example yet of packages at the opposite extreme, that are the least important to be able to restart. I'd like to propose the game servers that I maintain, like openarena-server, as a concrete example here: I hope we can agree that inability to capture the flag does not justify getting the package management system into a problematic state? :-) (I think this is currently a bug in those packages, but I'm not going to fix it until we have consensus here.)

There's a general feeling among the technical committee that a package failing to configure is far from a user-friendly way to signal errors: Phil's memorable analogy was that it's like telling a car driver that they are low on fuel by having the wheels fall off. Historically, we had few other ways to manage service failures, and perhaps when all you have is a hammer, everything looks like the Failed-Config state; but in a default Debian installation we now have a service manager that monitors the state of all services at all times (not just when they happen to be upgraded) and collects their stderr at all times (not just writing it to the console during boot, and dpkg's stderr during upgrades). Even before we considered non-sysv init systems, monitoring systems like Nagios were available.
It's perhaps also worth noting that most services, if they fail during boot rather than during upgrade, don't cause a drastic reaction. Historically, initscripts would (attempt to) carry on regardless from just about any failure mode, including failure of services that ought to be considered critical-path. With systemd as default, our default init system does have a more dramatic response to certain failures (going to an emergency-mode shell), but it only does that for a very limited subset of services (fsck and mount on required filesystems, according to the man page).

As Anthony points out, we could benefit from there being a way for packages to report "something is wrong, but carry on anyway": continuing to get the system into the least-degraded state possible, but then arranging for dpkg/apt to exit with a nonzero status so that automated systems can detect that something is not right. However, this mechanism does not currently exist. One possible implementation for the default init system might be an apt Dpkg::Post-Invoke hook that runs `systemctl is-system-running` and, if the result is not success, `systemctl list-units --failed`. An init-system-agnostic implementation would require some other convention for maintainer scripts to signal partial success (or non-fatal failure, depending how you look at it) to apt/dpkg.

During the technical committee IRC meeting, we considered whether the recommendation to "set -e" in maintainer scripts was consistent with considering a maintainer script failing to be a Very Bad Thing. We concluded that even if we want to disregard most or all failed service restarts, it is still good to "set -e", because if something does go wrong (for instance a typo in the maintainer script, a system that is already seriously broken, or some other unforeseen circumstance), we want the maintainer script to fail safe: stop what it's doing, rather than carry on regardless.
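As a minimal sketch of that fail-safe behaviour (the script body and the name run_postinst are made up for illustration and are not taken from any real package; `false` stands in for whatever unforeseen failure occurs):

```shell
#!/bin/sh
# Hypothetical stand-in for a maintainer script, for illustration only.
run_postinst() {
    # Run in a child shell so that "set -e" aborting it does not abort the demo.
    sh -c '
        set -e
        echo "step 1: update configuration"
        false   # stands in for an unforeseen failure (typo, broken system, ...)
        echo "step 2: not reached"
    '
}

status=0
output=$(run_postinst) || status=$?
echo "$output"
echo "exit status: $status"
```

Running this prints the first step and "exit status: 1"; the second step never runs, which is the "stop what it's doing" behaviour we want by default.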
If a particular failure is something we can reasonably predict, reason about and tolerate (as we are arguing failure to restart a service is, at least sometimes) then someone should make a conscious decision to add "|| true" (or preferably "|| some-failure-reporting-mechanism") to that command.

Finally, here are the debhelper mechanisms that most packages use to manage their services, which I think represent the status quo:

* dh_installinit: defaults to "failure to (re)start is failure to configure", but can be overridden with --error-handler; some packages set the error handler to "true" (e.g. apache2, isc-dhcp) or to a custom shell function (e.g. krb5, samba). This is used for LSB init scripts, and for systemd units that have a corresponding LSB init script.

* dh_systemd_start: unconditionally uses "|| true". This is only used for systemd units that *do not* have a corresponding LSB init script. A dh_installinit-style --error-handler would probably be a reasonable feature request.

smcv