Thanks for your rapid answer.
> Sounds like your kernel is crashing when a certain part of
> the disk is being accessed. If a user-space program can
> cause a system crash, by definition that's a kernel bug,
> not a userspace bug. [..]
In theory, I agree with you completely.
> If it's not crashing when you boot a Knoppix CD, yes,
> it could be because you're using a much older version
> of e2fsprogs --- but remember, the Knoppix CD is also
> using a significantly older kernel, [..]
> 1) What kernel version are you running?
I detected the problem with kernel 2.6.25-2-686, installed from the Debian repository.
> 2) What kind of disk drive are you using for your
> filesystem? What kind of disk controller are you using?
> Is it SCSI, IDE, SATA, etc.?
It's an IDE controller; the disk is an IBM 3.5" IDE drive, 123.5 GB, 7200 rpm. The boot log reports:
[..snip..]
Uniform Multi-Platform E-IDE driver
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH2: IDE controller (0x8086:0x244b rev 0x02) at PCI slot 0000:00:1f.1
ICH2: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x2400-0x2407, BIOS settings: hda:DMA, hdb:PIO
ide1: BM-DMA at 0x2408-0x240f, BIOS settings: hdc:DMA, hdd:PIO
[..snip..]
hda: IC35L120AVV207-0, ATA DISK drive
[..snip..]
hda: UDMA/100 mode selected
[..snip..]
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
The program 'smartctl -a /dev/hda' reports:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: IBM/Hitachi Deskstar GXP-180 family
Device Model: IC35L120AVV207-0
Serial Number: VNVD02G4G6X6JG
Firmware Version: V24OA63A
User Capacity: 123,522,416,640 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 3a
Local Time is: Sat Aug 30 14:23:01 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (2855) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   060    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   140   140   050    Pre-fail  Offline      -       302
  3 Spin_Up_Time            0x0007   203   203   024    Pre-fail  Always       -       133 (Average 137)
  4 Start_Stop_Count        0x0012   099   099   000    Old_age   Always       -       4217
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   119   119   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5344
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       4182
192 Power-Off_Retract_Count 0x0032   097   097   050    Old_age   Always       -       4267
193 Load_Cycle_Count        0x0012   097   097   050    Old_age   Always       -       4267
194 Temperature_Celsius     0x0002   134   134   000    Old_age   Always       -       41 (Lifetime Min/Max 13/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
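For anyone who wants to repeat the check on the attribute table mechanically, here is a minimal Python sketch. It assumes the usual smartctl convention that an attribute is considered failed when its normalized VALUE is at or below a non-zero THRESH; the function name and the sample rows (copied from the output above) are only illustrative.

```python
def failing_attributes(table_text):
    """Return names of SMART attributes whose normalized VALUE is <= THRESH."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # Assumption: a threshold of 0 means the attribute can never fail.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


# Three rows copied from the smartctl output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   100   100   060    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
"""

print(failing_attributes(SAMPLE))  # prints []
```

An empty list only means that no attribute has crossed its threshold; it does not prove the drive is healthy.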
A short and long test performed with smartctl showed:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error     00%          5346        -
# 2  Short offline       Completed without error     00%          5345        -
# 3  Short offline       Completed without error     00%          5345        -
# 4  Short offline       Completed without error     00%           218        -
# 5  Short offline       Completed without error     00%            81        -
# 6  Short offline       Completed without error     00%            73        -
> 3) What happens if you run this command from maintenance mode:
> dd if=/dev/hda5 of=/dev/null bs=32k
> That's why I suggested the dd command above; it would
> demontrate the problem without any use of e2fsprogs.
As you supposed, if I run the 'dd' command above from boot maintenance mode (kernel 2.6.25-2-686), the system reboots without any message or warning, just as it does with e2fsck, although much later than with 'e2fsck -fnv -C1 /dev/hda5'.
Surprisingly, if the previous 'dd' command is run under the same conditions after the boot process is completed (for example, from a standard console), it always completes normally.
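Since the machine reboots without leaving any message, one way to narrow down where the read gets to would be to record the current offset before each block is read. The following is a minimal Python sketch of that idea, not something I have run on the affected system. The function name and the log path are hypothetical, and the log file must live on a writable filesystem other than the one being read.

```python
import os


def read_with_progress(device_path, log_path, block_size=32 * 1024):
    """Read device_path sequentially, saving each offset to log_path before reading it."""
    offset = 0
    with open(device_path, "rb", buffering=0) as dev, open(log_path, "w") as log:
        while True:
            # Persist the offset first, so that it survives a sudden reboot.
            # Offsets only grow, so overwriting from position 0 never leaves
            # stale digits behind.
            log.seek(0)
            log.write(f"{offset}\n")
            log.flush()
            os.fsync(log.fileno())
            chunk = dev.read(block_size)
            if not chunk:
                break
            offset += len(chunk)
    return offset
```

A call such as read_with_progress("/dev/hda5", "/mnt/usb/dd-progress.log") mirrors the 'dd' test with bs=32k (the /mnt/usb path is only an example). After an unexpected reboot, the log holds the offset of the last block that was about to be read. The fsync on every block makes this much slower than plain 'dd', which might itself change the timing of the failure.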
As further evidence:
1) I built a debian-live CD for the 'lenny' distribution (kernel 2.6.26-1) from the Debian repositories and booted from it: both 'dd' and 'e2fsck' (ver. 1.41.0 10-Jul-2008), running from the CD, REBOOTED the system while checking /dev/hda5;
2) I built a debian-live CD for the 'etch' distribution (kernel 2.6.18-6-486) from the Debian repositories and booted from it: both 'dd' and 'e2fsck' (ver. 1.40-WIP 14-Nov-2006), running from the CD, COMPLETED NORMALLY on /dev/hda5;
3) I booted /dev/hda5 with a previously installed 'etch' kernel (ver. 2.6.18-5-686): in maintenance mode, 'dd' COMPLETED NORMALLY, while 'e2fsck' (e2fsck 1.41.0 10-Jul-2008) REBOOTED THE SYSTEM.
So this is the report of the two tests (dd if=/dev/hda5 of=/dev/null bs=32k; fsck -fnv -C1 /dev/hda5) that I performed on /dev/hda5:
kernel             dd                  fsck-1.41           fsck-1.39+1.40-WIP
2.6.25-2-686       FAIL (maint-mode)   FAIL (maint-mode)   n.a.
(/dev/hda5 boot)   PASS (after boot)   FAIL (after boot)   n.a.
2.6.26-1-686       FAIL (after boot)   FAIL (after boot)   n.a.
(live cd boot)
2.6.18-6-486       PASS (after boot)   n.a.                PASS (after boot)
(live cd boot)
2.6.18-5-686       PASS (maint-mode)   FAIL (after boot)   n.a.
(/dev/hda5 boot)
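To make the table easier to cross-check, the same results can be written down as data and summarized per tool. This is only a sketch in Python: the tuples are transcribed from the table above, and the helper names are made up for illustration.

```python
from collections import defaultdict

# (kernel, boot mode, tool, outcome), transcribed from the table above;
# "n.a." cells are simply left out.
RESULTS = [
    ("2.6.25-2-686", "maint-mode", "dd", "FAIL"),
    ("2.6.25-2-686", "maint-mode", "fsck-1.41", "FAIL"),
    ("2.6.25-2-686", "after boot", "dd", "PASS"),
    ("2.6.25-2-686", "after boot", "fsck-1.41", "FAIL"),
    ("2.6.26-1-686", "after boot", "dd", "FAIL"),
    ("2.6.26-1-686", "after boot", "fsck-1.41", "FAIL"),
    ("2.6.18-6-486", "after boot", "dd", "PASS"),
    ("2.6.18-6-486", "after boot", "fsck-1.39+1.40-WIP", "PASS"),
    ("2.6.18-5-686", "maint-mode", "dd", "PASS"),
    ("2.6.18-5-686", "after boot", "fsck-1.41", "FAIL"),
]


def outcomes_by_tool(results):
    """Map each tool to the set of outcomes observed for it."""
    summary = defaultdict(set)
    for _kernel, _mode, tool, outcome in results:
        summary[tool].add(outcome)
    return dict(summary)


def kernels_failing(results, tool):
    """Return the sorted kernels on which the given tool failed at least once."""
    return sorted({k for k, _m, t, o in results if t == tool and o == "FAIL"})


print(outcomes_by_tool(RESULTS))
print(kernels_failing(RESULTS, "dd"))
```

With this data, fsck-1.41 has only FAIL outcomes, while 'dd' fails only on the two 2.6.2x kernels.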
So, on my system, it seems that:
1) fsck-1.41 always fails with kernels 2.6.25-2-686 and 2.6.26-1-686 (note: with 2.6.25-2-686 the 'dd' test succeeds from a standard console once the boot has completed);
2) fsck-1.41 fails with 2.6.18-5-686, while fsck-1.39+1.40-WIP succeeds with a very similar kernel (2.6.18-6-486).
There could be more than one cause:
1) a hard disk failure, but I haven't found any evidence of it;
2) a kernel problem (the reboots during 'dd' on kernels 2.6.25-2-686 and 2.6.26-1-686 suggest this);
3) a problem in fsck-1.41, because it always reboots the system when checking /dev/hda5, while fsck-1.39+1.40-WIP does not (where it is available).
What can I do? Do you think it would be useful to debug e2fsck to find out which kernel call (if any) makes the system reboot? If so, how should I do it? Should I file a kernel bug report? Any other suggestions would be much appreciated.
Thanks, Achille.