Re: Schrödinger's hash

David Christensen Mon, 01 Jun 2026 15:40:41 -0700

On 6/1/26 14:15, Charles Curley wrote:

Some additional testing.


Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:

       "Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
       "  When the command that caused the error occurred, the device was active or idle.",
       "",
       "  After command completion occurred, registers were:",
       "  ER -- ST COUNT  LBA_48  LH LM LL DV DC",
       "  -- -- -- == -- == == == -- -- -- -- --",
       "  40 -- 51 00 01 00 00 00 00 00 00 40 00  Error: UNC 1 sectors at LBA = 0x00000000 = 0",
       "",
       "  Commands leading to the command that caused the error were:",
       "  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name",
       "  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------",
       "  25 00 00 00 01 00 00 00 00 00 00 40 00     00:08:36.585  READ DMA EXT",
       "  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.545  IDENTIFY DEVICE",
       "  b0 00 da 00 00 00 00 00 c2 4f 00 00 00     00:08:31.542  SMART RETURN STATUS",
       "  b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00     00:08:31.541  SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
       "  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.541  IDENTIFY DEVICE",
       "",
       "SMART Extended Self-test Log Version: 1 (1 sectors)",
       "Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error",
       "# 1  Extended offline    Completed without error       00%      6756         -",
       "# 2  Extended offline    Completed without error       00%      6573         -",
       "# 3  Extended offline    Completed without error       00%       102         -",
       "# 4  Short offline       Completed without error       00%        96         -",
       "",


So I did the obvious: I failed and removed the drive from the array. The
problem still showed up, but with fewer failures in the same data set.

I have since added the drive back to the array, and am testing the
array now.

mdadm --monitor --test --oneshot /dev/md0

I begin to wonder if I have a bad motherboard.

Up until 2019, I was using Debian GNU/Linux on desktop hardware as a
file server. When I upgraded to a server motherboard and ECC memory, I
started seeing DMA errors. During troubleshooting, I realized that I
had been collecting SATA parts since the days of SATA I 1.5 Gbps --
HBAs, cables, racks, and drawers. My file server had a mix of various
known and unknown parts, including red SATA cables (red dye can cause
copper conductors to oxidize into dust). So, I replaced all of the
unknown and obsolete parts with new parts clearly rated and marked for
SATA III 6 Gbps. The disk problems went away. When I wanted more
HDDs, I bought SAS 6 Gbps HBAs, cables, and HDDs.
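
One way to check whether cabling is a suspect: many SATA drives count
interface CRC errors in SMART attribute 199 (UDMA_CRC_Error_Count), and
a raw value that keeps rising commonly points at the cable or connector
rather than the platters. The following is a minimal sketch, assuming
smartmontools is installed and that the drive reports this attribute in
the usual "smartctl -A" table layout (raw value in the tenth column).
The function name crc_error_count and the device /dev/sda are
placeholders, not anything from the original report.

```shell
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count).
# Reads "smartctl -A" output on stdin; prints nothing if the drive
# does not report attribute 199.
# Assumption: standard attribute table, RAW_VALUE in column 10.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Example use (needs root and a real device, so it is commented out):
#   smartctl -A /dev/sda | crc_error_count
```

Running it before and after a heavy read test and comparing the two
numbers shows whether the count is still climbing. The kernel log also
records the negotiated link speed for each port, which can be listed
with: dmesg | grep -i 'SATA link up'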

Similarly, most of the memory problems I encountered were caused by
incompatibility between the motherboard and the memory module(s). I
suggest documenting your motherboard, documenting your memory modules,
and doing the homework. Memory manufacturers typically have a search
feature on their web site that will produce a list of compatible memory
modules given a computer or motherboard make and model. eBay sellers
often include the computer/motherboard make/model for pulled memory
modules. And, you can always STFW.
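
For the documenting step, dmidecode can report what the firmware knows
about the board and the installed modules. A minimal sketch, assuming
dmidecode is installed and that its memory records use the usual
"Field: value" lines; the helper name summarize_dimms is hypothetical,
and how complete the data is depends on the BIOS vendor.

```shell
# Reduce "dmidecode -t memory" output (read on stdin) to the fields
# needed for a compatibility lookup, one "Field=value" pair per line.
# Lines such as "Bank Locator:" are skipped because only fields that
# begin with the listed names are matched.
summarize_dimms() {
    awk -F': *' '
        /^[ \t]*(Locator|Size|Manufacturer|Part Number):/ {
            sub(/^[ \t]+/, "", $1)      # strip leading indentation
            printf "%s=%s\n", $1, $2
        }'
}

# Example use (needs root, so it is commented out):
#   dmidecode -t baseboard
#   dmidecode -t memory | summarize_dimms
```

The part numbers it prints are what the memory manufacturers' search
pages and eBay listings are keyed on.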

For a server, I prefer and recommend workstation/server motherboards,
ECC memory, ext4/UFS for the system disk, and ZFS RAID10 for data.
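
To illustrate the last item: ZFS has no layout literally named RAID10,
so this sketch assumes the term means a pool striped across two-way
mirrors. The helper zfs_raid10_cmd, the pool name "tank", and the disk
names are all hypothetical. It only prints the zpool command so the
layout can be reviewed first; nothing is created or erased.

```shell
# Print (do not run) a "zpool create" command that pairs the given
# disks into two-way mirror vdevs, which ZFS then stripes across.
# Usage: zfs_raid10_cmd POOL DISK1 DISK2 DISK3 DISK4 [more pairs...]
zfs_raid10_cmd() {
    if [ "$#" -lt 5 ] || [ $(( ($# - 1) % 2 )) -ne 0 ]; then
        echo "usage: zfs_raid10_cmd POOL DISK DISK DISK DISK [DISK DISK]..." >&2
        return 1
    fi
    pool=$1
    shift
    cmd="zpool create $pool"
    while [ "$#" -gt 0 ]; do
        cmd="$cmd mirror $1 $2"     # one mirror vdev per pair of disks
        shift 2
    done
    printf '%s\n' "$cmd"
}

# Example:
#   zfs_raid10_cmd tank sda sdb sdc sdd
```

On a real system, stable /dev/disk/by-id/ names are generally a safer
choice than sdX names, which can change between boots.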



David
