Hello Ritesh,

the system boot will hang for 90s because of systemd's default timeout
when devices are not available.

Actually, from what I know so far, systemd aggressively backgrounds any
process that is taking time. And only processes that depend on it are
put on hold, again in the background.

Well, yes, in principle, but the way dependencies are expressed (both by
default and in the current Debian packaging of systemd), you can still
have serialization of things. See below.

The reason behind this is that open-iscsi contains the following LSB
headers:
      Required-Start:    $network $remote_fs
      Required-Stop:     $network $remote_fs sendsigs
Here, $network maps to network-online.target in systemd, which is fine,
but $remote_fs maps to remote-fs.target, and that is the problem.
This is because

 a) systemd treats file systems that couldn't be mounted as hard
    failures,
and
 b) systemd's logic for mounting remote filesystems is to mount all
    filesystems in /etc/fstab that are marked _netdev (and not
    marked noauto).

Therefore, systemd waits for the iSCSI device to appear for 90s before
timing out and proceeding with boot. Only then is remote-fs.target
reached, and only then does systemd start the open-iscsi init script.
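
For illustration, an /etc/fstab entry like the following (device and
mount point are hypothetical) is exactly the kind of entry that makes
systemd wait for the backing iSCSI device:

     /dev/sdb1   /srv/data   ext4   _netdev   0   2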

I think you may be missing something here. I believe devices marked
_netdev are always backgrounded, at least in sysvinit. And it is highly
unlikely that systemd doesn't do the same.

No, in both cases that is not true.

First, let's look at sysvinit with LSB dependency-based boot (Squeeze,
Wheezy, Jessie w/ sysvinit-core). Debian does use startpar(8) to
parallelize some aspects of sysvinit boot, but there are a couple of
synchronization points. They are defined in /etc/insserv.conf and the
relevant ones are:

 $local_fs
 $remote_fs

If you look at the configuration, you will see that $remote_fs is
defined as $local_fs plus the mountnfs init script.
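
On a Jessie system the relevant entries look roughly like this (quoted
from memory, so treat it as an approximation - the exact facility lists
can differ between releases):

-----------------------------------------------------------
# excerpt from /etc/insserv.conf (approximate)
$local_fs       +mountall +mountall-bootclean +umountfs
$remote_fs      $local_fs +mountnfs +mountnfs-bootclean +umountnfs
-----------------------------------------------------------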

Also, there's the fact that all rcS scripts will have completed before
any rc[2-5] scripts are run (the way inittab + rc are set up), so that's
an additional synchronization point.

So if you have an init script with Required-Start: $local_fs, it will be
ordered after all scripts (primarily mountall) that appear for $local_fs
in /etc/insserv.conf, but (according to insserv logic) as early as
otherwise possible.

Same with Required-Start: $remote_fs: it will be ordered after $local_fs
(i.e. after mountall) and also after mountnfs.

So you have the following boot ordering:

 1. anything in rcS that doesn't require $local_fs
 2. $local_fs stuff (i.e. mainly mountall)
 3. anything else in rcS that doesn't require $remote_fs
 4. $remote_fs stuff (i.e. mainly mountnfs)
 5. anything else in rcS
 6. anything in rc[2-5]

So if you have Required-Start: $remote_fs in the open-iscsi init script,
you have the following situation:

 - early boot services (1) are started
 - local file systems are mounted (2)
 - some other services started (3)
 - mounting of remote file systems is attempted (4)
      /etc/init.d/mountnfs calls /etc/network/if-up.d/mountnfs
       (or waits until networking has called that dynamically once
        the network is up, depending on your configuration)
      /etc/network/if-up.d/mountnfs effectively does
           mount -a -O _netdev
      At this point, open-iscsi is NOT started, so mount will fail for
      all mount points on iSCSI devices. However, since mountnfs doesn't
      check the exit code of the mount command, it will happily continue
      on and pretend everything is fine.
 - services ordered after $remote_fs are started, including open-iscsi
      open-iscsi calls mount -a -O _netdev itself, which tries to mount
      the remaining filesystems again, this time succeeding

So nothing is really 'backgrounded'; you are just relying on the fact
that mountnfs doesn't really check any exit codes (and that sysvinit
doesn't care whether the init scripts your init script depends on were
successful), and you tape over that fact by running mount again.

This in turn means that with sysvinit you have kind of exempted
$remote_fs from being the true synchronization point. This doesn't
really matter that much for sysvinit, because there's a different
synchronization point directly after that (end of rcS execution, start
of rc[2-5] execution), but for systemd that's a different story (see
below). (But note that this COULD break for an early boot service
ordered after $remote_fs that needs the filesystems; it's just that
Jessie by default doesn't ship one.)


Now let's take systemd. systemd has so-called 'targets' which are also
used as synchronization points at boot. The two sysvinit sync points are
mapped as follows:

 $local_fs    -> local-fs.target
 $remote_fs   -> remote-fs.target

Additionally, systemd knows a couple more sync points, namely

 local-fs-pre.target
 remote-fs-pre.target

However, systemd doesn't really have a sync point for early-boot vs.
runlevel services.

The boot sequence with systemd is then as follows (only depicting a part
of it):

       early boot services (e.g. udev)
       ordered before local-fs-pre.target
                  |
                  v
         local-fs-pre.target
                  |
                  v
        mount local file systems
                  |
                  v
            local-fs.target
                  |
                  v
       early boot services ordered after local-fs.target
       but before remote-fs-pre.target
                  |
                  v
          remote-fs-pre.target
                  |
                  v
       mount remote file systems
                  |
                  v
           remote-fs.target
                  |
                  v
              the rest

Within each block, everything is of course parallel (barring other
ordering constraints) - even the filesystems are mounted in parallel.

And obviously, if something doesn't order itself against any of the
targets shown here, it will be started immediately (before or in
parallel to local-fs.target) and the targets in the middle won't wait
for its completion.
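
To tie this back to the _netdev case: for a network mount in /etc/fstab,
systemd's fstab generator produces a .mount unit that hooks into the
diagram above between remote-fs-pre.target and remote-fs.target. Roughly
(a simplified sketch, not verbatim generator output; device and mount
point are hypothetical):

-----------------------------------------------------------
# /run/systemd/generator/srv-data.mount (sketch)
[Unit]
After=remote-fs-pre.target network.target
Before=remote-fs.target

[Mount]
What=/dev/sdb1
Where=/srv/data
Type=ext4
Options=_netdev
-----------------------------------------------------------

So remote-fs.target is only reached once this mount unit has either
succeeded or given up waiting for its backing device.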

On shutdown, the whole thing is done in reverse, with one important
caveat: systemd tracks the state of the system and looks at the
dependencies of what's actually running, so if you start a service
manually without having it enabled at boot, its dependencies will still
work properly. (sysvinit/LSB tries to do that partially by always
creating stop links, even if the service is not enabled.)




Now you have two problems in this setup:

  - same thing as with sysvinit: open-iscsi is ordered after
    remote-fs.target, so it won't get started until remote-fs.target is
    reached

  - however, the crucial difference here is that systemd cares whether
    stuff has actually worked or not. It doesn't just call
    mount -a -O _netdev and hope for the best; it waits for the
    required devices to appear (because they might not appear
    synchronously)

       -> unfortunately, since open-iscsi won't start before
          remote-fs.target, those devices will never appear while
          systemd is waiting for them

       -> systemd has a default timeout of 90s for devices showing up,
          so it will wait 90s for these devices to show up and then
          fail

       -> only then will systemd consider remote-fs.target reached
          (btw. local-fs.target has a setting
          OnFailure=emergency.target, so that when it can't mount a
          local file system, the boot doesn't even continue, see
          Debian bug #743265 for a discussion on this; fortunately
          remote-fs.target doesn't have this setting, so boot does
          continue in this case)

       -> only then will systemd start open-iscsi

       -> that will then mount the filesystems again
          (which is actually unnecessary with systemd, because as soon
          as the devices appear, it will mount the stuff anyway)

       -> hence the 90s delay for waiting on devices that will only
          show up later

    You can actually try this easily (if you have an iSCSI target lying
    around ;-)): set up a Jessie box, install open-iscsi, configure it
    to automatically log in to your target, put an iSCSI filesystem as
    _netdev into /etc/fstab and reboot - voilà: 90s delay. It's very
    simple to reproduce, and it ALWAYS happens in that constellation.
    With rootfs on iSCSI it should also happen if you log in to
    additional targets. (Otherwise, rootfs on iSCSI is not affected.)
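
    If you want to confirm where the boot time goes on such a system,
    the following commands (both available in Jessie's systemd) should
    show the iSCSI device unit and remote-fs.target eating the 90s:

         systemd-analyze blame
         systemd-analyze critical-chain remote-fs.target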

  - on shutdown, things are also messy, since systemd tries to shut down
    stuff much more in parallel than sysvinit does

       - open-iscsi is an early-boot ("runlevel S") service, i.e. with
         sysvinit those always get stopped after all services of the
         current runlevel (e.g. 2) are stopped

       - with systemd, it just cares about explicit dependencies, so
         it will try to stop open-iscsi as early as possible (since
         by default nothing is ordered after it)

       -> this has the consequence that stuff that's using remote
          filesystems might still be running while open-iscsi is
          terminating and it can't unmount them

       -> the open-iscsi service will then (try to) log out of the
          sessions even though stuff is still active.

               -> very, very bad

As I said in the original report, on the test system I've used so far
for Jessie I haven't actually seen this race condition (i.e. shutdown
always worked anyway), since nothing was really using the remote
filesystems on my test box. It might be the case that it doesn't always
occur, but it will at least sometimes.

That in turn will then make the devices appear. The init script will
then call "mount -a -O _netdev" and "swapon -a -e" in its start()
routine, which will then cause the mount points to be activated.

So in the end, the boot is kind-of successful in the sense that
everything more or less works at the end of boot, with the following
caveats:

- there is this needless 90s delay (or whatever other delay the admin
   has configured) in waiting on the iSCSI targets
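
   (For reference, the per-device wait can be tuned with a mount option
   in /etc/fstab; something like the following - hypothetical device and
   mount point again - would reduce the wait to 10 seconds:

        /dev/sdb1  /srv/data  ext4  _netdev,x-systemd.device-timeout=10s  0  2

   But that only shortens the needless delay, it doesn't remove it.)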

Have you had luck root-causing why there is the 90 sec delay?

I hope this reply can make it a bit clearer as to where the problem lies
and why my diagnosis is correct.

Note that I have spent probably 10-12 hours on this problem, first
trying to figure out what the problem was and then trying to come up
with a solution that changes as little as possible (because of the
freeze) and testing that against a lot of different scenarios:

 - I only noticed that I needed to move #DEBHELPER# around because of
   testing partial upgrades

 - I don't use rootfs on iSCSI myself, so I set up a test system to
   check that nothing broke (which the first version I wanted to send
   did, so I fixed that before reporting this)

 - I rebooted test boxes quite a lot to see if there was any trouble.

Therefore, I suggest that you provide a unit file specifically for
systemd. In order to be as minimally invasive as possible (especially
this late in the freeze), the unit file should ideally call the original
init script.

I am willing to accept a systemd unit. But it is too late for Jessie
right now. If you have the unit ready and tested, for now, we can put it
into experimental.

I would not want to ship something for Jessie now. Ideally, systemd's
logic on handling init scripts should take care of it. It has worked for
other sysvinit scripts so far.

And introducing the systemd unit now in Jessie is late, because it
wouldn't have had enough test cycles.

systemd's logic for handling init scripts won't take care of it, because
it's already kind-of broken on sysvinit; it's just that a lot of
specific details in sysvinit, which systemd doesn't emulate in quite the
same way, mitigate that.

The changes required to make systemd support this in the same way as
sysvinit does would be far more invasive to the current systemd code
base than fixing a couple of dependencies here.



I'm going to explain how systemd currently handles init scripts, because
then it becomes clear why the unit file I have provided is not really
experimental at all.


systemd does not support init scripts directly from PID1 anymore (this
was different in very old versions). systemd's PID1 only understands
systemd unit files. Instead, systemd now has a concept called
'generators', which are small programs (sometimes even scripts) that are run

 - at boot
 - every time systemd re-reads its configuration

The job of a generator is to read some aspect of the system
configuration (init scripts, /etc/fstab, /etc/crypttab, ...) and
generate native systemd units from that.

If you boot a systemd Jessie system and look in /run/systemd/generator
and /run/systemd/generator.late, you will see the units that were
generated by these generators. Each line in /etc/fstab becomes a .mount
unit, each sysvinit script becomes a .service file.
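
For example, on a running Jessie system you can inspect this with
something like the following (the .mount unit name is hypothetical, it
depends on your fstab):

     ls /run/systemd/generator /run/systemd/generator.late
     systemctl cat srv-data.mount
     systemctl cat open-iscsi.service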

Of course, the generator responsible for init scripts doesn't magically
convert a sysvinit script completely into a service file (that's not
really possible to do automatically in the general case); the service
file it generates just contains the necessary metadata. Additionally,
it sets ExecStart=/etc/init.d/$SCRIPT start and
ExecStop=/etc/init.d/$SCRIPT stop in the service file, so that the
original init script is actually called.

For example, if I take /etc/init.d/kbd, the systemd-sysv-generator will
produce the following service file in
/run/systemd/generator.late/kbd.service:

-----------------------------------------------------------
# Automatically generated by systemd-sysv-generator

[Unit]
SourcePath=/etc/init.d/kbd
Description=LSB: Prepare console
DefaultDependencies=no
Before=sysinit.target
After=remote-fs.target

[Service]
Type=forking
Restart=no
TimeoutSec=0
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
SysVStartPriority=18
ExecStart=/etc/init.d/kbd start
ExecStop=/etc/init.d/kbd stop
-----------------------------------------------------------

So what did I do in order to produce the service file I've attached in
my original report?

 - I took the generated service file for the open-iscsi init script
 - I removed the comment about automatic generation
 - I removed SourcePath (that's mainly for documentation purposes if you
   run systemctl status)
 - I adjusted the After= and Before= dependencies
 - I added an [Install] section to make it possible to enable this unit

Here's a diff for comparison (old is generated, new is my modified version):

-----------------------------------------------------------
diff -u open-iscsi.service /lib/systemd/system/open-iscsi.service
--- open-iscsi.service  2015-01-18 21:12:16.325286854 +0100
+++ /lib/systemd/system/open-iscsi.service 2015-01-19 19:14:53.000000000 +0100
@@ -1,11 +1,8 @@
-# Automatically generated by systemd-sysv-generator
-
 [Unit]
-SourcePath=/etc/init.d/open-iscsi
-Description=LSB: Starts and stops the iSCSI initiator services and logs in to default targets
+Description=iSCSI initiator
 DefaultDependencies=no
-Before=sysinit.target shutdown.target
-After=network-online.target remote-fs.target
+Before=sysinit.target shutdown.target remote-fs-pre.target
+After=network-online.target
 Wants=network-online.target
 Conflicts=shutdown.target

@@ -20,3 +17,6 @@
 SysVStartPriority=20
 ExecStart=/etc/init.d/open-iscsi start
 ExecStop=/etc/init.d/open-iscsi stop
+
+[Install]
+WantedBy=multi-user.target
-----------------------------------------------------------

So it's not like this is really that untested: it's basically the way
systemd handles sysv init scripts, just with modified dependencies, to
make sure the unit is started before remote-fs-pre.target and not after
remote-fs.target.
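
Once the unit file is shipped in the usual location, enabling it is
nothing more than the standard (sketch, assuming the file is installed
as /lib/systemd/system/open-iscsi.service):

     systemctl daemon-reload
     systemctl enable open-iscsi.service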

 - irrespective of systemd, while looking at it I noticed that
   umountiscsi.sh's logic is incomplete: it doesn't try to umount
   filesystems on LVM on top of iSCSI unless they were marked with
   _netdev (it only detects direct devices).

Can you please elaborate more here? Or perhaps just file a separate bug
report. The current init scripts are designed to support LVM + iSCSI.

I'll file a separate bug report for this. I don't think it's very
critical, especially since it doesn't do anything wrong if everything is
in /etc/fstab (or you manually mounted with -o _netdev).

   OTOH, this has been the case since at least Squeeze, so it can't
   be that critical.

 - the current design of using umountiscsi.sh doesn't integrate well
   with systemd's dependency logic. I don't think this is a huge issue;
   as far as I can see, stuff works as well under systemd with my patch
   as under sysvinit (except for the /usr-NFS thing), but I do think
   that you could make the whole thing a lot more robust if this is
   redesigned a bit - but I don't think that is something that should
   go to Jessie.

I agree. We need to switch to systemd. But I haven't had the time to do
it, and right now, your patch is too late. :-(

I don't think it is: it doesn't change much, and I spent a LOT of time
making it as minimally invasive as possible. And while open-iscsi is not
completely unusable with systemd, there are enough problems with the way
the current package interacts with systemd, due to subtle differences in
the handling of dependencies and failures, that I think this should
really be fixed in Jessie.

As I said in the original report:

Btw. I selected severity 'important' because I don't think this bug is
'grave', but I do think that it could be categorized as 'serious', since
in my eyes it is unwritten policy that packages should properly support
the default init system unless there's a really good reason against it.
Unfortunately for me, current policy doesn't mention multiple init
systems at all, therefore the severity 'important', because I can't
point to a specific part of the text. Nevertheless, I think this bug
would qualify as RC.

Regards,
Christian

