[sending to grub-devel@ as requested]

Robert Millan wrote:
> On Sun, May 04, 2008 at 05:01:32PM +0300, Török Edwin wrote:
>
>>>>    Device Boot      Start         End      Blocks   Id  System
>>>> /dev/sda1   *           1        1275    10241406    7  HPFS/NTFS
>>>> /dev/sda2            1276        2248     7815622+  a6  OpenBSD
>>>> /dev/sda3            2249        5289    24426832+   f  W95 Ext'd (LBA)
>>>> /dev/sda4            6080        7296     9775552+  bf  Solaris
>>>> /dev/sda5            2249        2371      987966   82  Linux swap / Solaris
>>>> /dev/sda6            2372        3587     9767488+  83  Linux
>>>> /dev/sda7            3588        3600      104391   83  Linux
>>>> /dev/sda8            3601        4863    10145016   8e  Linux LVM
>>>> /dev/sda9            4864        5228     2931831   a6  OpenBSD
>>>> /dev/sda10           5229        5289      489951   83  Linux
>>>>
>> [...]
>> grub> ls (hd0,10)
>> error: unknown device
>> grub> ls (hd0,11)
>> error: unknown device
>> grub>
>>
>
> I tried reproducing your setup, but I can't hit the same bug.  This starts to
> look really nasty.  Just spotted this:
>
>   /build/buildd/grub2-1.96+20080426/partmap/pc.c:141: partition 0: flag 0x80,
>   type 0x7, start 0x3f, len 0x1388afc
>   [...]
>   /build/buildd/grub2-1.96+20080426/partmap/pc.c:141: partition 0: flag 0x0,
>   type 0x82, start 0x2270f07, len 0x1e267c
>
> for which I can't find any explanation other than memory corruption.  Also,
> due to a missing fflush() call the output is somewhat scrambled, which makes
> it harder to track (I fixed this already in upstream).
>
> Could you:
>
>   - Apply the attached patch & run grub-probe again (this time output
>     will be a bit more readable)
>
There was no patch attached; however, I did a 'cvs diff -u -D2008-04-30'
and applied that patch.

I found what the problem is, and it also explains why you couldn't
reproduce it.

/dev/sda9 is not a valid OpenBSD partition, and at partmap/pc.c:176 the
iteration fails with the error "invalid disk label magic 0x%x".
If I replace that return with a continue, it works.

The problem is that grub2 stops looking for more partitions as soon as it
encounters the invalid partition. grub 0.97 was working perfectly, so I
never noticed that the partition had the wrong type!

Also, if I change the partition type to 83 (as it should be), an unpatched
grub-probe can find that /boot is on /dev/sda10:

# grub-probe -t device /boot
/dev/sda10

I think grub2 should handle errors more gracefully: mark the partition as
invalid and keep going.
grub-probe was looking for /dev/sda10, and it shouldn't be affected by
/dev/sda9 being corrupted/invalid.
Think of it this way: if a partition gets corrupted, that shouldn't prevent
the system from booting, assuming the boot and root partitions are still ok.

Compare what grub-emu says when sda9 has the wrong type:

grub> ls (hd0,10)
error: unknown device

And this is what it says when sda9 has the correct type:

grub> ls (hd0,10)
        Partition hd0,10: Filesystem type ext2, Label debian_BOOT

>   - Send it to [EMAIL PROTECTED]
>

Done

>   ?
>
> Maybe someone there has an idea, but if it's memory corruption and we can't
> reproduce it, tracing the problem remotely isn't going to work very well.
>

It wasn't memory corruption; however, I have run valgrind and it has shown
some leaks, plus a call to stat() with a NULL parameter.
The attached patch fixes some of the valgrind warnings. Some leaks still
remain; I attached the new valgrind logs.

P.S.: grub2 seems to work now, and I am able to boot with it using the
text-mode menu. The default graphics mode doesn't work; I will open a
separate bug about that.

Best regards,
--Edwin
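
To illustrate the "continue instead of return" idea described above, here is a
minimal, self-contained sketch. It is not GRUB code: struct part_entry,
iterate_partitions() and find_hook() are hypothetical names invented for this
example, and the label_ok flag stands in for the real disk label magic check
in partmap/pc.c. It only shows that skipping a bad entry lets later
partitions still be found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified partition entry; not GRUB's real structure.  */
struct part_entry
{
  int index;           /* partition number as listed by fdisk */
  unsigned char type;  /* MBR type byte, e.g. 0xa6 for OpenBSD */
  int label_ok;        /* nonzero if the nested disk label magic is valid */
};

/* Called for each usable partition; returns nonzero to stop the walk.  */
typedef int (*part_hook) (const struct part_entry *p, void *ctx);

/* Walk all entries.  An OpenBSD-typed entry with a bad disk label is
   reported and skipped ("continue") rather than aborting the whole walk
   ("return").  Returns the number of entries that were skipped.  */
int
iterate_partitions (const struct part_entry *entries, size_t n,
                    part_hook hook, void *ctx)
{
  int skipped = 0;

  assert (entries != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    {
      if (entries[i].type == 0xa6 && !entries[i].label_ok)
        {
          fprintf (stderr, "partition %d: invalid disk label, skipping\n",
                   entries[i].index);
          skipped++;
          continue;
        }

      if (hook (&entries[i], ctx))
        break;
    }

  return skipped;
}

/* Example hook state: which partition we want and whether we saw it.  */
struct find_ctx
{
  int wanted;
  int found;
};

/* Example hook: stop as soon as the wanted partition index is seen.  */
int
find_hook (const struct part_entry *p, void *ctx)
{
  struct find_ctx *f = ctx;

  if (p->index == f->wanted)
    {
      f->found = 1;
      return 1;
    }
  return 0;
}
```

With a table in which entry 9 has type 0xa6 and a bad label, a search for
entry 10 through find_hook() still succeeds and the walk reports one skipped
entry, which mirrors the behaviour asked for above.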
diff -ur grub2-1.96+20080429/kern/disk.c ../grub2-1.96+20080429/kern/disk.c
--- grub2-1.96+20080429/kern/disk.c	2008-02-08 14:22:51.000000000 +0200
+++ ../grub2-1.96+20080429/kern/disk.c	2008-05-11 13:58:02.270673755 +0300
@@ -317,7 +317,10 @@
   /* Reset the timer.  */
   grub_last_time = grub_get_rtc ();
 
-  grub_free (disk->partition);
+  if(disk->partition) {
+	  grub_free (disk->partition->data);
+	  grub_free (disk->partition);
+  }
   grub_free ((void *) disk->name);
   grub_free (disk);
 }
diff -ur grub2-1.96+20080429/util/grub-probe.c ../grub2-1.96+20080429/util/grub-probe.c
--- grub2-1.96+20080429/util/grub-probe.c	2008-05-11 13:59:14.934811935 +0300
+++ ../grub2-1.96+20080429/util/grub-probe.c	2008-05-11 13:46:21.729236855 +0300
@@ -190,9 +190,10 @@
   struct stat st;
   grub_fs_t fs;
 
-  stat (path, &st);
+  if(path)
+      stat (path, &st);
 
-  if (st.st_mode == S_IFREG)
+  if (path && st.st_mode == S_IFREG)
     {
       /* Regular file.  Verify that we can read it properly.  */
 
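
As a side note on the grub-probe.c hunk, the following standalone sketch shows
one way to guard a stat() call against a NULL path. is_regular_file() is a
hypothetical helper written for this example, not a function from grub-probe.
It goes slightly further than the patch in two ways, both of which are
assumptions about what may be wanted here: it checks stat()'s return value,
and it tests the file type with the POSIX S_ISREG() macro, since st_mode also
carries permission bits and so a direct comparison with S_IFREG would not
normally match a regular file.

```c
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of grub-probe.
   Returns 1 if PATH names a regular file, and 0 otherwise, including
   when PATH is NULL or stat() fails.  */
int
is_regular_file (const char *path)
{
  struct stat st;

  /* Never hand a NULL pointer to stat().  */
  if (path == NULL)
    return 0;

  /* If stat() fails, ST is not filled in, so do not look at it.  */
  if (stat (path, &st) != 0)
    return 0;

  /* S_ISREG() masks out the permission bits before comparing.  */
  return S_ISREG (st.st_mode) ? 1 : 0;
}
```

A caller would then write something like "if (is_regular_file (path))" before
running the read-verification code for regular files.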
==25071== Memcheck, a memory error detector.
==25071== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==25071== Using LibVEX rev 1804, a library for dynamic binary translation.
==25071== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==25071== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework.
==25071== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==25071== For more details, rerun with: -v
==25071==
==25071== My PID = 25071, parent PID = 5663.  Prog and args are:
==25071==    ./grub-probe
==25071==    -d
==25071==    /dev/sda10
==25071==
==25071== Warning: noted but unhandled ioctl 0x1261 with no size/direction hints
==25071==    This could cause spurious value errors to appear.
==25071==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==25071== Warning: noted but unhandled ioctl 0x1261 with no size/direction hints
==25071==    This could cause spurious value errors to appear.
==25071==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==25071== Warning: noted but unhandled ioctl 0x1261 with no size/direction hints
==25071==    This could cause spurious value errors to appear.
==25071==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==25071==
==25071== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 10 from 1)
==25071== malloc/free: in use at exit: 611,077 bytes in 176 blocks.
==25071== malloc/free: 901 allocs, 725 frees, 2,397,201 bytes allocated.
==25071== For counts of detected errors, rerun with: -v
==25071== searching for pointers to 176 not-freed blocks.
==25071== checked 662,256 bytes.
==25071==
==25071== 4,096 bytes in 1 blocks are possibly lost in loss record 3 of 5
==25071==    at 0x4006AB8: malloc (vg_replace_malloc.c:207)
==25071==    by 0x804AFE4: xmalloc (misc.c:81)
==25071==    by 0x804B41A: grub_malloc (misc.c:222)
==25071==    by 0x804C3EB: grub_disk_cache_store (disk.c:162)
==25071==    by 0x804CDC1: grub_disk_read (disk.c:461)
==25071==    by 0x8069A72: grub_lvm_scan_device (lvm.c:288)
==25071==    by 0x804C014: iterate_partition.2134 (device.c:132)
==25071==    by 0x8066C9C: pc_partition_map_iterate (pc.c:153)
==25071==    by 0x804F3AD: grub_partition_iterate (partition.c:126)
==25071==    by 0x804C09D: iterate_disk.2131 (device.c:101)
==25071==    by 0x80498FA: call_hook (biosdisk.c:132)
==25071==    by 0x804992B: grub_util_biosdisk_iterate (biosdisk.c:141)
==25071==
==25071==
==25071== 41,136 (41,132 direct, 4 indirect) bytes in 12 blocks are definitely lost in loss record 4 of 5
==25071==    at 0x4006AB8: malloc (vg_replace_malloc.c:207)
==25071==    by 0x804AFE4: xmalloc (misc.c:81)
==25071==    by 0x804B41A: grub_malloc (misc.c:222)
==25071==    by 0x804C3EB: grub_disk_cache_store (disk.c:162)
==25071==    by 0x804CDC1: grub_disk_read (disk.c:461)
==25071==    by 0x8066D4E: pc_partition_map_iterate (pc.c:165)
==25071==    by 0x804F3AD: grub_partition_iterate (partition.c:126)
==25071==    by 0x804C09D: iterate_disk.2131 (device.c:101)
==25071==    by 0x80498FA: call_hook (biosdisk.c:132)
==25071==    by 0x804992B: grub_util_biosdisk_iterate (biosdisk.c:141)
==25071==    by 0x804C4CC: grub_disk_dev_iterate (disk.c:205)
==25071==    by 0x804BF63: grub_device_iterate (device.c:138)
==25071==
==25071== LEAK SUMMARY:
==25071==    definitely lost: 41,132 bytes in 12 blocks.
==25071==    indirectly lost: 4 bytes in 1 blocks.
==25071==      possibly lost: 4,096 bytes in 1 blocks.
==25071==    still reachable: 565,845 bytes in 162 blocks.
==25071==         suppressed: 0 bytes in 0 blocks.
==25071== Reachable blocks (those to which a pointer was found) are not shown.
==25071== To see them, rerun with: --leak-check=full --show-reachable=yes