tags 605357 patch
thanks

I can reproduce the segfault in a virtual machine by creating multiple
RAID array members with the same number. This shouldn't happen normally,
so I still wonder how it's possible to hit this bug, but the backtrace
is exactly the same:

#0  0x0000000000407dcb in grub_disk_adjust_range (disk=0x0, sector=0x7fffffffdea0, offset=0x7fffffffde98, size=4096) at ../../kern/disk.c:364
        part = 0x1010
#1  0x0000000000407f1f in grub_disk_read (disk=0x0, sector=0, offset=0, size=4096, buf=0x687660) at ../../kern/disk.c:397
        tmp_buf = 0x0
        real_offset = 0
#2  0x000000000042c9cd in grub_raid5_recover (array=0x66d4c0, disknr=0, buf=0x67e5d0 "", sector=0, size=4096) at ../../disk/raid5_recover.c:48
        err = 64
        buf2 = 0x687660 ""
        i = 1
#3  0x000000000042c014 in grub_raid_read (disk=0x66c460, sector=0, size=8, buf=0x67e5d0 "") at ../../disk/raid.c:400
        read_size = 8
        next_level = 0
        read_sector = 0
        e = 0
        b = 0
        p = 2
        n = 1
        disknr = 0
        array = 0x66d4c0
        err = GRUB_ERR_READ_ERROR
#4  0x000000000040809f in grub_disk_read (disk=0x66c460, sector=0, offset=0, size=512, buf=0x7fffffffe190) at ../../kern/disk.c:443
        data = 0x0
        start_sector = 0
        len = 512
        pos = 0
        tmp_buf = 0x67e5d0 ""
        real_offset = 0
#5  0x000000000042e16d in grub_lvm_scan_device (name=0x66d710 "md/0") at ../../disk/lvm.c:284
        err = GRUB_ERR_NONE
        disk = 0x66c460
        da_offset = 140737488348144
        da_size = 4202832
        mda_offset = 140737488348912
        mda_size = 0
        buf = '\000' <repeats 232 times>, "o\003C", '\000' <repeats 17 times>, "#\n\000\000\300\342\377\377\377\177\000\000\027\247B", '\000' <repeats 13 times>, "j\003C", '\000' <repeats 17 times>, "\b\000\000\000\360\342\377\377\377\177\000\000\273\251B", '\000' <repeats 13 times>, "j\003C", '\000' <repeats 21 times>"\360, \343\377\377\377\177\000\000;\...@\000\000\000\000\000\261\002c\000\000\000\000\000q\002c\000\000\000\000\000\000\000\000\000n\001\000\000v\002c", '\000' <repeats 69 times>"\260, \304f\000\000\000\000\000\020\000\000\000\000\000\000\000\377\377\377\377\000\000\000\000P\266\377\367\377\177\000\000\000\000\000\000\000\000\000\000\320\343\377\377\377\177\000"
        vg_id = "\f\344\377\377\377\177\000\000`\304f", '\000' <repeats 27 times>
        pv_id = "\340\343\377\377\377\177\000\000\177\215B", '\000' <repeats 13 times>"\377, \000\000\000\000\000\000\000\240\341\377\377\377\177"
        metadatabuf = 0x0
        p = 0x0
        q = 0x7ffff7bb8e40 ""
        vgname = 0x7fffffffe420 "@\344\377\377\377\177"
        lh = 0x7fffffffe190
        pvh = 0x7fffffffe6f0
        dlocn = 0x0
        mdah = 0x0
        rlocn = 0x7ffff78d284c
        i = 0
        j = 4223242
        vgname_len = 0
        vg = 0xffffe400ba490040
        pv = 0x66c440
#6  0x00000000004072d9 in iterate_disk (disk_name=0x66d710 "md/0") at ../../kern/device.c:96
        dev = 0x0
        hook = 0x42e0f0 <grub_lvm_scan_device>
        ents = 0x41007fff00e3ff49
#7  0x000000000042b5ba in grub_raid_iterate (hook=0x7fffffffe540) at ../../disk/raid.c:84
        array = 0x66d4c0
#8  0x00000000004079a8 in grub_disk_dev_iterate (hook=0x7fffffffe540) at ../../kern/disk.c:212
        p = 0x63afc0
#9  0x0000000000407462 in grub_device_iterate (hook=0x42e0f0 <grub_lvm_scan_device>) at ../../kern/device.c:168
        ents = 0x63b010
#10 0x000000000042ef82 in grub_mod_init (mod=0x0) at ../../disk/lvm.c:679
No locals.
#11 0x000000000042ef6a in grub_lvm_init () at ../../disk/lvm.c:677
No locals.
#12 0x000000000042f072 in grub_init_all () at grub_probe_init.c:58
No locals.
#13 0x0000000000402e10 in main (argc=2, argv=0x7fffffffe6f8) at ../../util/grub-probe.c:443
        dev_map = 0x0
        argument = 0x7fffffffe93c "/"

In insert_array() in disk/raid.c, we first check whether we already have
all the devices of the array. After that we check whether the specific
member that we want to add already exists. In both cases we just print a
debug message instead of returning an error.

What happens then is that array->device[new_array->index] gets
overwritten and nr_devs gets incremented, so nr_devs grows without a new
disk being added. When the RAID array is read later on, some disk
pointers are still NULL (frames #0 to #2 above show disk=0x0 being
passed down from grub_raid5_recover) and we segfault when one of them is
dereferenced.

The attached patch returns an error in both cases, so we at least don't
segfault (which I tested on the virtual machine). I talked with Julien
on IRC, but his segfaults disappeared, so we will probably never know
for sure what really happened.

Regards,

Jeroen Dekkers

--- grub2-1.98+20100804/disk/raid.c~	2010-12-15 18:36:32.000000000 +0100
+++ grub2-1.98+20100804/disk/raid.c	2010-12-15 19:58:53.000000000 +0100
@@ -496,17 +496,18 @@
              the same. */
 
           if (array->total_devs == array->nr_devs)
-            /* We found more members of the array than the array
-               actually has according to its superblock. This shouldn't
-               happen normally. */
-            grub_dprintf ("raid", "array->nr_devs > array->total_devs (%d)?!?",
-                          array->total_devs);
-
+            /* We found more members of the array than the array
+               actually has according to its superblock. This shouldn't
+               happen normally. */
+            return grub_error(GRUB_ERR_BAD_DEVICE,
+                              "Found more RAID array members than the superblock says there are");
+
           if (array->device[new_array->index] != NULL)
             /* We found multiple devices with the same number. Again,
                this shouldn't happen. */
-            grub_dprintf ("raid", "Found two disks with the number %d?!?",
-                          new_array->number);
+            return grub_error(GRUB_ERR_BAD_DEVICE,
+                              "Found two RAID array members with the same number %d",
+                              new_array->number);
 
           if (new_array->disk_size < array->disk_size)
             array->disk_size = new_array->disk_size;
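
For illustration only, the bookkeeping problem can be modelled in a few
lines of standalone C. This is a rough sketch, not GRUB code: struct
toy_array, TOY_MAX_DEVS and the toy_* functions are hypothetical names
made up for this example, and only the fields total_devs, nr_devs and
device mirror what insert_array() uses. toy_insert_unchecked() is meant
to behave like the old code (overwrite the slot, bump the counter),
toy_insert_checked() like the patched code (refuse and return an error).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound for this toy model; not a GRUB constant. */
#define TOY_MAX_DEVS 8

/* Simplified stand-in for the RAID array bookkeeping (assumption:
   only the three fields discussed in the analysis are modelled). */
struct toy_array
{
  unsigned total_devs;               /* member count per superblock */
  unsigned nr_devs;                  /* members registered so far */
  const char *device[TOY_MAX_DEVS];  /* one slot per member, NULL = missing */
};

static void
toy_init (struct toy_array *a, unsigned total_devs)
{
  unsigned i;

  assert (total_devs <= TOY_MAX_DEVS);
  a->total_devs = total_devs;
  a->nr_devs = 0;
  for (i = 0; i < TOY_MAX_DEVS; i++)
    a->device[i] = NULL;
}

/* Old behaviour: a duplicate index silently overwrites the slot and the
   counter is incremented anyway.  The bounds check only keeps the toy
   model memory-safe. */
static int
toy_insert_unchecked (struct toy_array *a, unsigned index, const char *disk)
{
  if (index >= a->total_devs)
    return -1;
  a->device[index] = disk;
  a->nr_devs++;
  return 0;
}

/* Patched behaviour: refuse surplus members and duplicate indices. */
static int
toy_insert_checked (struct toy_array *a, unsigned index, const char *disk)
{
  if (index >= a->total_devs)
    return -1;
  if (a->nr_devs == a->total_devs)
    return -1;                       /* more members than the superblock says */
  if (a->device[index] != NULL)
    return -1;                       /* two members with the same number */
  a->device[index] = disk;
  a->nr_devs++;
  return 0;
}

/* Number of slots that are still NULL, i.e. would crash a later read. */
static unsigned
toy_missing (const struct toy_array *a)
{
  unsigned i, missing = 0;

  for (i = 0; i < a->total_devs; i++)
    if (a->device[i] == NULL)
      missing++;
  return missing;
}
```

With a three-member array and members arriving with indices 0, 1, 1,
the unchecked variant ends up with nr_devs equal to total_devs while
device[2] is still NULL, so the array looks complete although one slot
is empty. The checked variant rejects the second member with index 1
and leaves nr_devs at 2.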