On 14/03/2015 22:45, Bob Proulx wrote:
Normally the idea in rescue mode is that you are presented with a
shell with the root in your main system. At that point you can mount
the rest of your system. It would normally be like this:
# mount -a
However seeing those large numbers 121 and 122 in your device paths
/dev/md121 above I worry that you had /dev/md0 and /dev/md1 types of
names and those have been swapped for /dev/md121 and /dev/md122 and
therefore the paths in your fstab won't work at the moment. Or they
might be using other paths, labels, uuids, and so forth.
But if the device numbers have changed
then they could only be mounted manually because they won't match what
is in the /etc/fstab.
[Bob, I snipped other sections of your helpful and wide-ranging reply,
including some relevant remarks on /dev/md[x] numbering which I'll
explain how I got round.]
In short, I have now restored the system, including testing that it
can boot from either disk.
At first, in the rescue shell, I was using the long mount
commands for each /dev/md[x] and the mount point, because I thought
the /dev/md[x] numbers wouldn't match fstab. Once I had the
filesystems mounted and started checking for damage (there seems to
have been none, fortunately), I saw in fstab that - apart from the /
partition - all the other md partitions were loaded by UUID. So,
since the rescue system had mounted /dev/md122 on /,
# mount -a
worked for all others. (Except for the nfs mount, but I never looked
further into why the rescue system has difficulty with mounting an nfs
share. Maybe another topic, under less stress.)
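A quick way to see in advance which fstab entries depend on the kernel's
device numbering is to list the ones that name a /dev/... path rather than
a UUID or label. The following is a minimal sketch, not something from the
original exchange; the function name fstab_device_mounts is made up for
illustration.

```shell
# List fstab entries that mount by kernel device name (/dev/...).
# These are the entries that stop matching when the md numbers change.
# Comment lines and blank lines never match the pattern, so they are skipped.
fstab_device_mounts() {
    awk '$1 ~ /^\/dev\// { print $1, $2 }' "$1"
}

# Example, run in the rescue shell with the real root mounted on /:
#   fstab_device_mounts /etc/fstab
```

Entries that use UUID= or LABEL= do not appear in the output, which is why
a plain "mount -a" can still work for them after a renumbering.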
I then made a mistake. I had seen that the installation of the kernel
package security update had changed some grub files in /boot. I
looked at those - lots of useful initrd and vmlinuz files, and a
complicated-looking grub.cfg. The rescue shell gives you 'man'
commands and 'info' commands, so I read the grub documentation
installed on my system. So I tried to re-install grub on the boot partition.
# grub-install /dev/md121
Grub cannot be installed on a partition-less disk
[and a couple of other related warnings]
That frightened me. There obviously were partitions, because mdadm
had found them, and the rescue shell was happy with 7 of them mounted.
Using gdisk I had another fright when it reported that sda was a gpt
disk, with a protected MBR (what?) and no other partitions. That
couldn't be right because the installer had found the partitions, so
had mdadm.
I then wondered if perhaps grub wasn't involved and I shouldn't be
looking at things from a grub and gdisk/gpt viewpoint. Though I
thought I had seen the kernel update actually alter grub, maybe I had
only seen the initrd and vmlinuz files get updated (and 'assumed' grub
was there). Plus, the basic symptom I get on boot is that the loader
says
LILO 23 LILO loadiEBDA: too large kernel
or something like that.
The boot message was saying it's LILO, not grub. Maybe it's right, I thought.
# man lilo
lilo not found
# lilo
lilo not found
Well, it's not right, because lilo isn't there, the machine says. But
some version of lilo is on the boot sectors. I read the lilo manpage
on the web and saw that it has a fairly simple config in
/etc/lilo.conf. Checking, I saw there was such a file on the
system, dating from 2010, referring to /dev/md0 (/boot, as it was before
the rescue shell renumbered them) and /dev/md1 (root fs before the
rescue shell renumbered them). So I needed to alter the lilo conf
file, and then execute lilo. Lilo wasn't on the machine, so
# apt-get install lilo
and changed /etc/lilo.conf to say (I'll list these to help others in
the future):
(a) the new initrd,
(b) the new vmlinuz,
(c) set boot to using /dev/md121 as the boot device, and
(d) set root to /dev/md122
saved the changed file, and
# lilo
No errors so, exiting the rescue shell, I rebooted.
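For readers following along, changes (a) to (d) correspond to lines of
roughly this shape in /etc/lilo.conf. This is only a sketch of the file at
this point in the story, not a copy of the real one: the kernel and initrd
paths and the label are placeholders, and only the two device names are
taken from the text above.

```
# /etc/lilo.conf (fragment; paths and label are placeholders)

# (c) boot device, as named in the rescue shell
boot=/dev/md121
# (d) root filesystem, as named in the rescue shell
root=/dev/md122

# (b) the new vmlinuz
image=/boot/vmlinuz-NEW-VERSION
    # (a) the new initrd
    initrd=/boot/initrd.img-NEW-VERSION
    label=Linux
    read-only
```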
LILO started, didn't complain about the large kernel, moved on to
assemble the /dev/md[x] (as md0, md1, md2, etc, perfect!) before dying
with
Aborted waiting for root fs
Dev [something]:122 not found
Well, that was progress - the machine could boot, there was nothing
wrong with the partitions, lilo was no longer corrupted, the md[x] all
assembled. But, /dev/md122 was no longer called that. In 'real life'
as opposed to the 'rescue shell life', the root filesystem is on
/dev/md1. The line
root=/dev/md122
in /etc/lilo.conf caused lilo not to find the real fs on /dev/md1.
There's an apparently simple solution: go back to the rescue shell and
change lilo.conf to say
root=/dev/md1
but it doesn't work. When you execute lilo after doing this change,
it objects:
# lilo
Invalid root filesystem: /dev/md1
Reason: /dev/md1 doesn't exist in the rescue shell, and lilo's config
checking, quite carefully, protects you against specifying a root
filesystem that it doesn't think exists. It's right
to do that, so I had to somehow have consistent names in both the
rescue shell, and in the real installation. I looked long at man
mdadm and inferred that the only way to alter the names was to
dis-assemble and re-assemble with a --name= parameter. But I didn't
have an example of a command line with the right order and right other
things that needed to be there, and I - really - didn't want to risk
compromising the md system that was running and was as-yet undamaged.
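The mismatch can be seen without running lilo by checking whether the
device named in root= exists in the environment you are currently booted
into. This is only a rough sketch of that idea, assuming root= holds a
plain device path; it is not how lilo itself performs the check, and both
function names are hypothetical.

```shell
# Print the value of the first root= line in a lilo.conf-style file.
lilo_root_value() {
    sed -n 's/^[[:space:]]*root[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Report whether that value names a device node that exists right now.
check_lilo_root() {
    root=$(lilo_root_value "$1")
    case $root in
        /dev/*)
            if [ -e "$root" ]; then
                echo "present: $root"
            else
                echo "missing: $root"
            fi
            ;;
        *)
            echo "not a device path: $root"
            ;;
    esac
}

# Example, run in the rescue shell against the real config:
#   check_lilo_root /etc/lilo.conf
```

In the situation described above, a config with root=/dev/md1 would be
reported as missing from inside the rescue shell, while root=/dev/md122
would be present there but absent on a normal boot.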
The web page for man lilo.conf
http://linux.die.net/man/5/lilo.conf
mentions that partitions can be named in lilo.conf by UUID. On this
machine, fstab uses UUID for all but the root filesystem, so I
couldn't get the UUID for /dev/md1 from there. But I did find a UUID
string in the /boot/grub.cfg file. Rebooting back into the rescue
shell, and editing /etc/lilo.conf to use that UUID string
root="UUID={some hex string mixed with some dashes}"
(the inverted commas are important to ensure the second '=' gets
passed to the kernel during the boot sequence)
and exiting the shell, then rebooting,
the system fully booted and everything sprang back into life. Phew.
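Picking the UUID out of a grub config by eye is error-prone, so here is a
small sketch that prints the root=UUID= values found on its kernel lines.
It assumes the config passes the root filesystem to the kernel as
root=UUID=..., which not every grub config does; the function name is made
up. Note that the UUID on a grub "search --fs-uuid" line identifies the
filesystem grub reads its files from, which may be /boot rather than '/',
so this sketch deliberately ignores those lines.

```shell
# Print the distinct root=UUID=... values found in a grub config file.
grub_root_uuids() {
    sed -n 's/.*root=UUID=\([0-9A-Fa-f-]*\).*/\1/p' "$1" | sort -u
}

# Example, using the path mentioned above (on many systems the file is
# /boot/grub/grub.cfg instead):
#   grub_root_uuids /boot/grub.cfg
```

If this prints more than one UUID, the config refers to more than one root
filesystem and the right one has to be confirmed some other way.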
For posterity, here are the key aspects of recovering from this type
of problem with a raid1 boot failure.
1. If the boot message says LILO, it *IS* lilo, and it may be
necessary to install the lilo package if it is not present, following
a distribution upgrade, for example.
2. The rescue shell uses mdadm's 'emergency' names. That will be
fine for setting the *boot* device in the lilo.conf, but will not be
fine for setting the *root* device in lilo.conf.
3. Using the rescue shell, get a root filesystem mounted on the
target machine's filesystem because you need to edit your real
/etc/lilo.conf
4. In lilo.conf change
boot=/dev/md0 (or whatever your file says here)
to
boot=/dev/md121 (or whatever the rescue shell has labelled the device
that you previously had in your 'boot=' line)
5. In lilo.conf change
root=/dev/md1 (or whatever your lilo.conf file says here)
to
root="UUID={the UUID of your md device that you normally mount as
'/'}" - and include the inverted commas and the extra '=' .
This UUID string could take some time to find. Don't panic if you
find the wrong string; try to find the right one another way. If fstab
uses a UUID to mount '/' then try that. Maybe some other
posters could improve this suggestion if they know the 'trick' to get
the correct UUID.
6. Reboot. The system should now boot to the normal start.
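On the question of how to find the UUID in step 5: blkid can usually print
the filesystem UUID of a device directly, for example
"blkid -s UUID -o value /dev/md122" in the rescue shell, and
"ls -l /dev/disk/by-uuid/" shows the same mapping where that directory
exists. Steps 4 and 5 can then be sketched as a small helper. This is an
illustration under those assumptions, not a tested recovery procedure; the
function name is hypothetical, and it rewrites every boot= and root= line
it finds, so check the result before using it.

```shell
# Rewrite the boot= and root= lines of a lilo.conf-style file.
#   $1 = config file
#   $2 = boot device as named in the rescue shell (step 4)
#   $3 = filesystem UUID of '/' (step 5)
# The result goes to stdout, so the original file is left untouched.
rewrite_lilo_conf() {
    sed -e "s|^\([[:space:]]*\)boot[[:space:]]*=.*|\1boot=$2|" \
        -e "s|^\([[:space:]]*\)root[[:space:]]*=.*|\1root=\"UUID=$3\"|" "$1"
}

# Example, run in the rescue shell (device names as in the text above):
#   uuid=$(blkid -s UUID -o value /dev/md122)
#   rewrite_lilo_conf /etc/lilo.conf /dev/md121 "$uuid" > /tmp/lilo.conf.new
#   Inspect /tmp/lilo.conf.new before copying it over /etc/lilo.conf.
```

lilo normally has to be re-run after lilo.conf changes so that the new
settings are written to the boot sectors before rebooting.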
Thanks to everyone who helped with suggestions.
Apologies, again, for the somewhat random chain of unlinked messages,
which must have irritated folk. This was due to this server failing,
with the result that without this (mail) server we had no access to
the emailed messages from the list - so I could not 'reply' properly -
instead I could only see web copies of posts. Hopefully, fixed now.
regards, Ron