Thanks for your rapid answer.

> Sounds like your kernel is crashing when a certain part of
> the disk is being accessed.  If a user-space program can
> cause a system crash, by definition that's a kernel bug,
> not a userspace bug.  [..]

In principle, I fully agree with you.

> If it's not crashing when you boot a Knoppix CD, yes,
> it could be because you're using a much older version
> of e2fsprogs --- but remember, the Knoppix CD is also
> using a significantly older kernel, [..]
> 1)  What kernel version are you running?

I detected the problem with kernel 2.6.25-2-686, installed from the Debian
repository.

> 2)  What kind of disk drive are you using for your
> filesystem?   What kind of disk controller are you using? 
> Is it SCSI, IDE, SATA, etc.?

It's an IDE controller; the disk is an IBM IDE 3.5" drive, 123.5 GB, 7200 rpm.
The boot log reports:

[..snip..]
Uniform Multi-Platform E-IDE driver
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH2: IDE controller (0x8086:0x244b rev 0x02) at  PCI slot 0000:00:1f.1
ICH2: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x2400-0x2407, BIOS settings: hda:DMA, hdb:PIO
ide1: BM-DMA at 0x2408-0x240f, BIOS settings: hdc:DMA, hdd:PIO
[..snip..]
hda: IC35L120AVV207-0, ATA DISK drive
[..snip..]
hda: UDMA/100 mode selected
[..snip..]
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14


The program 'smartctl -a /dev/hda' reports:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     IBM/Hitachi Deskstar GXP-180 family
Device Model:     IC35L120AVV207-0
Serial Number:    VNVD02G4G6X6JG
Firmware Version: V24OA63A
User Capacity:    123,522,416,640 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
Local Time is:    Sat Aug 30 14:23:01 2008 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (2855) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.

SMART Attributes
Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   060    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   140   140   050    Pre-fail  Offline      -       302
  3 Spin_Up_Time            0x0007   203   203   024    Pre-fail  Always       -       133 (Average 137)
  4 Start_Stop_Count        0x0012   099   099   000    Old_age   Always       -       4217
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   119   119   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5344
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       4182
192 Power-Off_Retract_Count 0x0032   097   097   050    Old_age   Always       -       4267
193 Load_Cycle_Count        0x0012   097   097   050    Old_age   Always       -       4267
194 Temperature_Celsius     0x0002   134   134   000    Old_age   Always       -       41 (Lifetime Min/Max 13/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
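
The attribute table can also be checked mechanically. According to the
smartctl documentation, an attribute has failed when its normalized VALUE is
less than or equal to its THRESH. The following is only a sketch: the function
name check_smart_thresholds is made up for illustration, and it assumes each
attribute row arrives on a single line, as smartctl prints it.

```shell
# Print every SMART attribute whose normalized VALUE (column 4) is at or
# below its THRESH (column 6). No output means no attribute is failing.
# Rows with THRESH 000 are skipped because they can never fail.
# The function name is hypothetical, not part of smartmontools.
check_smart_thresholds() {
    awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        print $1, $2, "VALUE=" $4, "THRESH=" $6
    }'
}
```

It would be fed from the attribute listing, for example
'smartctl -A /dev/hda | check_smart_thresholds'.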


A short and long test performed with smartctl showed:

smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5346         -
# 2  Short offline       Completed without error       00%      5345         -
# 3  Short offline       Completed without error       00%      5345         -
# 4  Short offline       Completed without error       00%       218         -
# 5  Short offline       Completed without error       00%        81         -
# 6  Short offline       Completed without error       00%        73         -


> 3)  What happens if you run this command from maintenance mode:
>     dd if=/dev/hda5 of=/dev/null bs=32k
> That's why I suggested the dd command above; it would
> demonstrate the problem without any use of e2fsprogs.

As you supposed, if I run the 'dd' command above from boot maintenance mode
(kernel 2.6.25-2-686), the system reboots without any message or warning, just
as it does with e2fsck, although much later than with
'e2fsck -fnv -C1 /dev/hda5'.

Surprisingly, if the same 'dd' command is run after the boot process has
completed (for example, from a standard console), it always completes
normally.
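
Since the reboot leaves no message behind, one way to narrow down which part
of the partition triggers it might be to read the device in chunks and flush
a progress line to disk before each read. What follows is a minimal sketch,
not a tested diagnostic: the function name scan_chunks, the log path and the
64 MiB default chunk size are all assumptions, and the log file has to live
on a different device from the one being read.

```shell
# Read SOURCE chunk by chunk, recording progress durably so that the last
# logged chunk survives a sudden reboot.
# usage: scan_chunks SOURCE LOGFILE [CHUNK_MB]   (hypothetical helper)
scan_chunks() {
    src=$1
    log=$2
    chunk_mb=${3:-64}
    i=0
    : > "$log"
    while :; do
        # Log the chunk about to be read and flush it to disk first.
        echo "reading chunk $i (offset $((i * chunk_mb)) MiB)" >> "$log"
        sync
        # Read one chunk; dd prints its record counts on stderr.
        out=$(LC_ALL=C dd if="$src" of=/dev/null bs=1M \
                  skip=$((i * chunk_mb)) count="$chunk_mb" 2>&1) || {
            echo "dd failed at chunk $i" >> "$log"
            return 1
        }
        # A line starting with "0+0 records in" means end of the source.
        if printf '%s\n' "$out" | grep -q '^0+0 records in'; then
            break
        fi
        i=$((i + 1))
    done
    echo "completed after $i chunks" >> "$log"
}
```

For example, 'scan_chunks /dev/hda5 /mnt/usb/scan.log' (the mount point is
made up). After an unexpected reboot, the last "reading chunk" line in the
log gives the offset that was being read at the time.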

As further evidence:

1) I built a debian-live CD for the 'lenny' distribution (kernel 2.6.26-1)
from the Debian repos and booted from it: both 'dd' and 'e2fsck' (ver. 1.41.0,
10-Jul-2008), running from the CD, REBOOTED the system while checking
/dev/hda5;

2) I built a debian-live CD for the 'etch' distribution (kernel 2.6.18-6-486)
from the Debian repos and booted from it: both 'dd' and 'e2fsck' (ver.
1.40-WIP, 14-Nov-2006), running from the CD, COMPLETED NORMALLY on /dev/hda5;

3) I booted /dev/hda5 with a previously installed 'etch' kernel (ver.
2.6.18-5-686): in maintenance mode, 'dd' COMPLETED NORMALLY, while 'e2fsck'
(ver. 1.41.0, 10-Jul-2008) REBOOTED THE SYSTEM.

So this is the summary of the two tests (dd if=/dev/hda5 of=/dev/null bs=32k;
fsck -fnv -C1 /dev/hda5) I ran against /dev/hda5:

kernel (boot source)            dd                 fsck-1.41          fsck-1.39+1.40-WIP
2.6.25-2-686 (/dev/hda5 boot)   FAIL (maint-mode)  FAIL (maint-mode)  n.a.
                                PASS (after boot)  FAIL (after boot)  n.a.
2.6.26-1-686 (live cd boot)     FAIL (after boot)  FAIL (after boot)  n.a.
2.6.18-6-486 (live cd boot)     PASS (after boot)  n.a.               PASS (after boot)
2.6.18-5-686 (/dev/hda5 boot)   PASS (maint-mode)  FAIL (after boot)  n.a.


So, on my system, it seems that:

1) fsck-1.41 always fails with kernels 2.6.25-2-686 and 2.6.26-1-686 (note:
the 'dd' test succeeds from a standard console after boot with 2.6.25-2-686);

2) fsck-1.41 fails with 2.6.18-5-686, while fsck-1.39+1.40-WIP succeeds with a
very similar kernel (2.6.18-6-486).


There could be more than one cause:

1) a hard disk failure, but I haven't found any evidence of one;

2) a kernel problem (the reboots triggered by 'dd' on kernels 2.6.25-2-686
and 2.6.26-1-686 suggest this);

3) an fsck-1.41 problem, because it always reboots the system while checking
/dev/hda5, whereas fsck-1.39+1.40-WIP does not (where it is available).


So, what can I do? Do you think it would be useful to debug e2fsck to find out
which kernel call (if any) makes the system reboot? If so, how should I go
about it? Should I file a kernel bug?

Any suggestions would be much appreciated.

Thanks, Achille.




