On 4/25/22 07:18, Charles Curley wrote:
> On Sun, 24 Apr 2022 22:52:15 -0700
> David Christensen <dpchr...@holgerdanske.com> wrote:
>
>> So, RAID 5 HDD's are sda, sdc, and sdd, and optical is sdb?
>
> Optical is sr0.
Interesting. (Must be the SATA controller expansion card?)
> sda is an SSD with all the system, /home, etc.
Good.
I would run a long SMART test on it now.
> md0 is mounted at /crc.
Okay.
>> You do have a backup of your Debian system configuration files and
>> your data, right?
>
> Lots of them.
> https://charlescurley.com/blog/posts/2019/Nov/02/backups-on-linux/
Good.
On 4/25/22 08:33, Charles Curley wrote:
> On Sun, 24 Apr 2022 22:47:50 -0700
> David Christensen <dpchr...@holgerdanske.com> wrote:
>
>> I will assume all three HDD's are the same make and model, per
>> smartctl(8) report below.
>
> Nope. Two WD Reds, as the report indicates, but not bought at the same
> time so likely to be different in detail. One Seagate Ironwolf.
Okay.
>>>> How is the motherboard connected to the HDD's?
> Debian and Finnix see things differently. On both, sdc and sdd are part
> of the RAID array. On Debian, sda is the system drive: /, /home, /etc,
> swap, /boot, etc. sdb is part of the RAID array. Finnix swaps those
> two.
Okay.
Rather than a Live Linux distribution for troubleshooting, I install
Debian onto a USB flash drive (SanDisk Ultra Fit USB 3.0 16 GB). I
keep it updated/upgraded, and install whatever tools I want. You might
want to make one that matches your Debian instance -- that should
eliminate the device enumeration differences.
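Whichever system is booted, the symlinks under /dev/disk/by-id (created by udev and built from model and serial number) give a way to tell which physical drive is currently sda, sdb, and so on. A minimal Python sketch, assuming a udev-based system; the function name stable_names is made up for illustration:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent by-id name to the kernel device it points at now."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        # Entries in by-id are symlinks such as ata-<model>_<serial> -> ../../sdX
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for name, device in stable_names().items():
        print(f"{device}  {name}")
```

Running it under both Debian and Finnix and comparing the output should show which serial number ends up as sda and which as sdb on each.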
>> SATA cables -- Color? Locking or non-locking connectors? Came
>> with motherboard or aftermarket? If the latter, make and model?
>
> All black, two red. The black ones came with the computer. The red ones
> are aftermarket, Alchemy SATA3 30 cm. BFA-MSC-SATA330RK-RP. All lock.
https://www.frozencpu.com/products/14060/cab-572/Bitfenix_Alchemy_Multisleeve_SATA_30_Cable_-_30cm_-_Red_BFA-MSC-SATA330RK-RP.html
Those Alchemy SATA cables look good. Assuming you like them, I would
replace the factory cables with new Alchemy cables.
>> Please run long tests now on all three drives. Save all of the
>> reports. Post the report(s) for any drive(s) with SMART failures
>> and/or dmesg(1) errors.
>
> Long tests run about 10 hours. I did one overnight on sdd, and it
> reported no errors. No dmesg errors since the ones I reported in the
> original email.
I launch SMART tests on all of the drives at the same time. The
microcontroller in each drive runs the test for that drive
independently, so it is okay to run all the tests concurrently.
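As a rough sketch of starting the tests together: "smartctl -t long" hands the test to the drive and returns, so a simple loop is enough. The device names below are an assumption based on the Debian enumeration described in this thread, the helper names are made up, and the real run needs root.

```python
import subprocess


def long_test_commands(devices):
    """Build one 'smartctl -t long' command per device."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


def start_long_tests(devices, dry_run=True):
    """Start an extended self-test on every device and return the commands."""
    commands = long_test_commands(devices)
    for cmd in commands:
        if dry_run:
            # Only show what would be run.
            print(" ".join(cmd))
        else:
            # smartctl returns once the drive accepts the test; the test
            # itself runs inside the drive, so the drives work in parallel.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Assumed device names; adjust to the actual RAID members.
    start_long_tests(["/dev/sdb", "/dev/sdc", "/dev/sdd"])
```

Once the tests finish, "smartctl -a" on each device shows the result in the self-test log.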
> smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-2-amd64] (local build)
> Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Red
> Device Model: WDC WD40EFRX-68N32N0
> Serial Number: WD-<redacted>
> LU WWN Device Id: 5 0014ee 2bda92107
> Firmware Version: 82.00A82
> User Capacity: 4,000,787,030,016 bytes [4.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 5400 rpm
> Form Factor: 3.5 inches
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-3 T13/2161-D revision 5
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Mon Apr 25 15:02:26 2022 UTC
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (43680) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 463) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x303d) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   168   166   021    Pre-fail  Always       -       6583
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       41
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12670
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       41
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
> 193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7908
> 194 Temperature_Celsius     0x0022   119   108   000    Old_age   Always       -       31
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%     12661         -
> # 2  Short offline       Completed without error       00%     12651         -
> # 3  Extended offline    Completed without error       00%     10026         -
> # 4  Extended offline    Completed without error       00%         7         -
> # 5  Short offline       Completed without error       00%         0         -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
That drive looks good to me.
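For comparing several saved reports, a small Python sketch that pulls the raw values out of the attribute table and lists any non-zero counts among a few attributes. Which attributes to watch (5, 197, 198, 199) is an assumption based on common practice, not something smartctl itself prescribes, and the function names are made up. The sample rows are copied from the report above.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list: reallocated, pending, uncorrectable, and CRC errors.
WATCHED = {5, 197, 198, 199}


def parse_attributes(text):
    """Return {id: (name, raw_value)} for each attribute row in the text."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs


def nonzero_watched(text):
    """Return {name: raw} for watched attributes whose raw count is not 0."""
    bad = {}
    for attr_id, (name, raw) in parse_attributes(text).items():
        if attr_id in WATCHED:
            first = raw.split()[0]  # ignore any trailing annotation
            if first.isdigit() and int(first) != 0:
                bad[name] = int(first)
    return bad


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE) or "no non-zero watched attributes")
```

A non-zero UDMA_CRC_Error_Count is often read as a hint of cable or connector trouble, which is why it is on the list.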
I still don't see the source of the original dmesg(1) errors. I would:
1. Power down and vacuum the inside of the case.
2. Carefully wiggle and then disconnect all PSU cables and connectors.
Feel for any poor connections.
3. Carefully wiggle and then disconnect all SATA cables and connectors,
as above.
4. Test the PSU with a hardware tester.
5. Vacuum some more.
6. Reconnect the PSU and SATA cables. Wiggle each connection, as
above. Dress cables.
7. Boot memtest86+ and run overnight or longer.
If and when you have convinced yourself that all of the hardware is
good, then software is what remains.
Did you figure out what was preventing Debian from booting? Is the
Debian instance fixed?
David