On 6/1/26 14:15, Charles Curley wrote:
Some additional testing.

Suspecting a bad hard drive, I ran more extended tests on all four
members of the RAID array. One showed problems:

       "Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)",
       "  When the command that caused the error occurred, the device was active or idle.",
       "",
       "  After command completion occurred, registers were:",
       "  ER -- ST COUNT  LBA_48  LH LM LL DV DC",
       "  -- -- -- == -- == == == -- -- -- -- --",
       "  40 -- 51 00 01 00 00 00 00 00 00 40 00  Error: UNC 1 sectors at LBA = 0x00000000 = 0",
       "",
       "  Commands leading to the command that caused the error were:",
       "  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name",
       "  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------",
       "  25 00 00 00 01 00 00 00 00 00 00 40 00     00:08:36.585  READ DMA EXT",
       "  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.545  IDENTIFY DEVICE",
       "  b0 00 da 00 00 00 00 00 c2 4f 00 00 00     00:08:31.542  SMART RETURN STATUS",
       "  b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00     00:08:31.541  SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
       "  ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.541  IDENTIFY DEVICE",
       "",
       "SMART Extended Self-test Log Version: 1 (1 sectors)",
       "Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error",
       "# 1  Extended offline    Completed without error       00%      6756         -",
       "# 2  Extended offline    Completed without error       00%      6573         -",
       "# 3  Extended offline    Completed without error       00%       102         -",
       "# 4  Short offline       Completed without error       00%        96         -",
       "",


So I did the obvious: I failed and removed the drive from the array. The
problem still showed up, but with fewer failures on the same data set.

I have since added the drive back to the array, and am testing the
array now.

mdadm --monitor --test --oneshot /dev/md0
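The fail, remove, and re-add cycle described above could be sketched roughly as follows. This is only an illustration: /dev/sdX1 is a placeholder for the real member device, md_member_status is a made-up helper name, and the helper assumes the usual /proc/mdstat layout in which each array's status line ends with something like [4/3] [UUU_].

```shell
# Print each md array found in /proc/mdstat-style text (read from stdin)
# together with its member status, e.g. "md0 [4/3] [UUU_]".
# An underscore marks a missing or failed member.
md_member_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        name != "" && match($0, /\[[0-9]+\/[0-9]+\] \[[U_]+\]/) {
            print name, substr($0, RSTART, RLENGTH)
            name = ""
        }
    '
}

# Hypothetical member name /dev/sdX1; substitute the real device (needs root).
# mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
# mdadm /dev/md0 --add /dev/sdX1
# md_member_status < /proc/mdstat
```

While the array resyncs after the re-add, the status typically shows one member fewer than the total until the rebuild finishes.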

I begin to wonder if I have a bad motherboard.


Up until 2019, I was using Debian GNU/Linux on desktop hardware as a file server. When I upgraded to a server motherboard and ECC memory, I started seeing DMA errors. During troubleshooting, I realized that I had been collecting SATA parts since the days of SATA I 1.5 Gbps -- HBAs, cables, racks, and drawers. My file server had a mix of various known and unknown parts, including red SATA cables (red dye can cause copper conductors to oxidize into dust). So, I replaced all of the unknown and obsolete parts with new parts clearly rated and marked for SATA III 6 Gbps. The disk problems went away. When I wanted more HDDs, I bought SAS 6 Gbps HBAs, cables, and HDDs.
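One way to check whether each port actually negotiates 6 Gbps, and whether a cable is producing interface CRC errors, is sketched below. Both function names are made up for illustration. The first assumes the usual kernel log wording ("ataN: SATA link up 6.0 Gbps ..."); the second assumes the smartctl attribute table, where the raw value is the last column and the attribute is named something like UDMA_CRC_Error_Count (the name varies by drive). Treat it as a rough sketch, not a definitive diagnostic.

```shell
# Summarise negotiated SATA link speeds from kernel log text on stdin.
# Prints one "ataN <speed> Gbps" line per link-up message.
sata_link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
}

# Print the raw interface CRC error count from "smartctl -A" text on stdin.
# A count that keeps rising usually points at cabling or connectors.
udma_crc_errors() {
    awk '$2 ~ /CRC_Error_Count/ { print $NF }'
}

# Typical use (both usually need root; /dev/sdX is a placeholder):
# dmesg | sata_link_speeds
# smartctl -A /dev/sdX | udma_crc_errors
```

A port that reports 1.5 or 3.0 Gbps with a 6 Gbps drive and controller is worth a second look at the cable or backplane.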


Similarly, most of the memory problems I encountered were caused by incompatibility between the motherboard and the memory module(s). I suggest documenting your motherboard, documenting your memory modules, and doing the homework. Memory manufacturers typically have a search feature on their web site that will produce a list of compatible memory modules given a computer or motherboard make and model. eBay sellers often include the computer/motherboard make/model for pulled memory modules. And, you can always STFW.
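For the documenting step, dmidecode can report what the firmware says about the board and the installed modules. The sketch below is an assumption-laden example: memory_inventory is a made-up helper, and it assumes the field names that dmidecode -t memory normally prints (Size, Locator, Manufacturer, Part Number), in that order within each Memory Device block. Firmware tables are sometimes incomplete or wrong, so compare against the labels on the modules.

```shell
# Condense "dmidecode -t memory" text (stdin) into one line per populated slot:
#   <locator> | <size> | <manufacturer> | <part number>
memory_inventory() {
    awk -F': *' '
        /^[ \t]*Locator:/      { loc = $2 }
        /^[ \t]*Size:/         { size = $2 }
        /^[ \t]*Manufacturer:/ { mfr = $2 }
        /^[ \t]*Part Number:/  {
            # Skip empty slots, which report "No Module Installed" as the size.
            if (size !~ /No Module/) print loc, "|", size, "|", mfr, "|", $2
        }
    '
}

# Typical use (needs root):
# dmidecode -t baseboard          # motherboard make and model
# dmidecode -t memory | memory_inventory
```

The part numbers it prints are what the manufacturer compatibility searches and eBay listings are usually keyed on.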


For a server, I prefer and recommend workstation/server motherboards, ECC memory, ext4/UFS for the system disk, and ZFS RAID10 for data.
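ZFS has no vdev type literally called RAID10; the term usually means a pool striped across two-way mirrors. A minimal sketch of building such a command follows. The helper name, the pool name tank, and the disk names in the usage line are all placeholders, and the function only prints the command so it can be reviewed before anything is run.

```shell
# Print (do not run) a zpool create command that stripes across two-way mirrors.
# Usage: zpool_raid10_cmd POOL DISK1 DISK2 [DISK3 DISK4 ...]
zpool_raid10_cmd() {
    # Need a pool name plus an even number of disks (at least two).
    if [ $# -lt 3 ] || [ $(( ($# - 1) % 2 )) -ne 0 ]; then
        echo "usage: zpool_raid10_cmd POOL DISK1 DISK2 [DISK3 DISK4 ...]" >&2
        return 1
    fi
    pool=$1
    shift
    cmd="zpool create $pool"
    # Pair the disks up in the order given; each pair becomes one mirror vdev.
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    printf '%s\n' "$cmd"
}

# Example with placeholder names:
# zpool_raid10_cmd tank sda sdb sdc sdd
```

In practice, stable names from /dev/disk/by-id are generally a safer choice than sdX names, which can change between boots.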


David
