Bug#791794: RAID device not active during boot

Philip Hands Sun, 12 Jul 2015 03:49:00 -0700

Peter Nagel <peter.na...@kit.edu> writes:

> On 11.07.2015 18:40, Philip Hands wrote:
>>
>> ... which is what suggests to me that it's been broken by other
>> means -- the fact that one can apparently start it by hand tells you
>> that it's basically working, so I'd think the described symptoms point
>> strongly towards duff mdadm.conf in the initramfs.
>>
>> N.B. I've not had very much to do with systemd, so am in no sense an
>> expert about that, but I've been using software raid and initrd's since
>> almost as soon as they were available, and the idea that this would be
>> down to systemd does not ring true.
>
> Thanks for pointing this out.
> Hopefully, someone is able to solve this problem.


Well, yes -- _you_ can hopefully.

 0) (just in case you've not already done so, check all the bits
     suggested in the warning that you quoted initially, about the
     contents of /proc/... etc.)

 1) on the system when booted up, check the current state of your
    /etc/mdadm/mdadm.conf
 
    Compare it with the output of:

      mdadm --examine --scan

    If there are significant differences (other than the missing disk),
    then fix them.
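
A rough sketch of that comparison, in case it helps. The helper name array_uuids and the temporary file names are my own inventions, not anything mdadm provides, and it only compares the UUID= fields of the ARRAY lines, so it will not notice other kinds of difference:

```shell
#!/bin/sh
# Print the sorted, de-duplicated UUID= fields of the ARRAY lines on stdin.
# (array_uuids is a hypothetical helper name, not part of mdadm.)
array_uuids() {
    sed -n 's/^ARRAY .*\(UUID=[0-9a-fA-F:]*\).*/\1/p' | sort -u
}

# Only run the real comparison where mdadm and the config file exist
# (mdadm --examine --scan normally needs root).
if command -v mdadm >/dev/null 2>&1 && [ -r /etc/mdadm/mdadm.conf ]; then
    array_uuids < /etc/mdadm/mdadm.conf > /tmp/mdadm-conf.uuids
    mdadm --examine --scan | array_uuids > /tmp/mdadm-scan.uuids
    # diff exits non-zero when the lists differ; that is the interesting case.
    diff -u /tmp/mdadm-conf.uuids /tmp/mdadm-scan.uuids || echo "UUID lists differ"
fi
```

Lines starting with - are UUIDs only in mdadm.conf, lines starting with + are UUIDs that only the scan found.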

  2) have a look at your initrd, thus:

    mkdir /tmp/initrd ; cd /tmp/initrd ; zcat /boot/initrd.img-* | cpio -iv --no-absolute-filenames

    (of course, being an ARM thing, you probably have some sort of
    uInitrd thing as well, so I guess it's possible to break things
    between the initrd.img and that, but someone who knows about such
    things would need to tell you about that).

    Anyway, you should have something like this:

      /tmp/initrd$ find . -name mdadm\*
      ./scripts/local-top/mdadm
      ./etc/mdadm
      ./etc/mdadm/mdadm.conf
      ./etc/modprobe.d/mdadm.conf
      ./conf/mdadm
      ./sbin/mdadm

    so, take a look at that lot to see if you can spot what's up.

    As an example, this is what I see on a little amd64 RAID box with
    Jessie, which I have to hand:

      root@linhost-th:/tmp/initrd# cat conf/mdadm 
      MD_HOMEHOST='linhost-th'
      MD_DEVS=all
      root@linhost-th:/tmp/initrd# cat etc/mdadm/mdadm.conf 
      HOMEHOST <system>
      ARRAY /dev/md/2  metadata=1.2 UUID=00e84ce1:d96de981:375caa64:dac234f9 name=grml:2
      ARRAY /dev/md/3  metadata=1.2 UUID=c9871cb8:46a3dd98:d9505965:5bd7dfe2 name=grml:3

      (I tend to number my md's to match the partitions they sit on,
       hence the 2 & 3)
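
The check against the file listing in step 2 can be sketched as below. The function name missing_mdadm_files is made up for illustration, and the list of paths is simply the one from my Jessie amd64 box, so treat it as an assumption that the same set applies to your release and architecture:

```shell
#!/bin/sh
# List which of the mdadm-related paths from the listing above are absent
# from an unpacked initrd tree (default: the current directory).
# missing_mdadm_files is a hypothetical helper; the path list is taken
# from one Jessie amd64 system and may differ elsewhere.
missing_mdadm_files() {
    root=${1:-.}
    for f in scripts/local-top/mdadm etc/mdadm/mdadm.conf \
             etc/modprobe.d/mdadm.conf conf/mdadm sbin/mdadm; do
        [ -e "$root/$f" ] || echo "$f"
    done
}
```

Run it as "missing_mdadm_files /tmp/initrd"; no output means all of those files are present, and you can go on to read conf/mdadm and etc/mdadm/mdadm.conf for their contents.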

  3) save a copy of your old initrd.img somewhere, then run: 

     update-initramfs -u

    and try a reboot -- if it works, unpack both initrd's in adjacent
    directories, and use diff -ur to spot what changed, and report back
    here.
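
A minimal sketch of the unpack-and-compare part of step 3, assuming the images are plain gzip + cpio as in the zcat | cpio command from step 2. The function names and the .orig file name in the usage comment are hypothetical:

```shell
#!/bin/sh
# Unpack a gzip-compressed initrd image ($1) into directory $2.
# Assumes gzip + cpio, as in the zcat | cpio command from step 2.
unpack_initrd() {
    img=$(readlink -f "$1") || return 1
    mkdir -p "$2" || return 1
    ( cd "$2" && zcat "$img" | cpio -id --quiet --no-absolute-filenames )
}

# Show what differs between two unpacked trees. diff exits 1 when it
# finds differences, which is the expected outcome here, so only a
# real error (exit status 2 or more) is passed on as failure.
compare_initrds() {
    diff -ur "$1" "$2" || [ $? -eq 1 ]
}

# Example (the .orig name is whatever you called your saved copy):
#   unpack_initrd "/boot/initrd.img-$(uname -r).orig" /tmp/initrd.old
#   unpack_initrd "/boot/initrd.img-$(uname -r)"      /tmp/initrd.new
#   compare_initrds /tmp/initrd.old /tmp/initrd.new
```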

  4) If it didn't work, once in the emergency shell, try running:

    sh -x /scripts/local-top/mdadm

   and see if you can see why it's not working when starting things by
   hand does.

  5) If that fails to be diagnostic, is there anything hiding in your
     u-boot configuration that might be causing this? (assuming this box
     has u-boot)

HTH

Cheers, Phil.

P.S. While you have the initrd unpacked, you might want to note that:

      root@linhost-th:/tmp/initrd# grep -r systemd .
      ./init:# Mount /usr only if init is systemd (after reading symlink)
      ./init:if [ "${checktarget##*/}" = systemd ] && read_fstab_entry /usr; then
      ./scripts/init-top/udev:/lib/systemd/systemd-udevd --daemon --resolve-names=never
      ./etc/lvm/lvm.conf:    # systemd's socket-based service activation or run as an initscripts service
      ./lib/udev/rules.d/63-md-raid-arrays.rules:# Tell systemd to run mdmon for our container, if we need it.
      Binary file ./lib/systemd/systemd-udevd matches
      Binary file ./lib/x86_64-linux-gnu/libselinux.so.1 matches
      Binary file ./bin/kmod matches
      Binary file ./bin/udevadm matches

    While the scripts on the initrd image are systemd-aware, its init
    is actually a shell script -- so you're running busybox as your init
    at this point.

    Also:

     root@linhost-th:/tmp/initrd# grep -r 'Gave up waiting for' .
     ./scripts/local:           echo "Gave up waiting for $2 device.  Common problems:"

    this is the script that's dropping you into the emergency shell.

    The thing that starts the shell is the panic() function from
    scripts/functions -- I can see that that will do a timed reboot if
    you've got panic=... on the kernel command line, but otherwise not.

    Would you have something like that on your command line?  (as
    mentioned in the warning you quoted, /proc/cmdline tells you)

    If not, do you perhaps have a hardware watchdog, or some such?
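
To check the command line for that, something along these lines should do. The helper name panic_param is made up here, and it only looks for a plain panic=... word on the command line:

```shell
#!/bin/sh
# Print the value of any panic= parameter found in a kernel command line
# read from stdin; prints nothing if there is none.
# (panic_param is a hypothetical helper name.)
panic_param() {
    tr ' ' '\n' | sed -n 's/^panic=//p'
}

if [ -r /proc/cmdline ]; then
    value=$(panic_param < /proc/cmdline)
    if [ -n "$value" ]; then
        echo "panic=$value is set on the kernel command line"
    else
        echo "no panic= parameter on the kernel command line"
    fi
fi
```
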
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,    GERMANY
