John, Jason,

Many thanks for your brainstorming on this…

> On Dec 10, 2018, at 6:19 PM, John D Groenveld <[email protected]> wrote:
>
> In message <[email protected]>, Lou Picciano writes:
>> Is this evidence of erroneous attempts to read boot blocks/loader on disk0?
>>
>> Given the machine BIOS identification of drives, dunno that I can be
>> absolutely certain disk0 is referring to one disk - or is the entire
>> rpool seen as disk0 once the OS is loading?
>
> Does iostat(1M) -E report errors?

Absolutely none. In fact, having called precisely that command before, I was thrown by the ‘Errors: 0’ everywhere…

sd1       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: Hitachi HDS5C302 Revision: A180 Serial No: ML0221F302X0MD
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd2       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: Hitachi HDS5C302 Revision: A580 Serial No: ML0220F30HWBSD
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

> Have you tried interrogating the drives via smartctl?
> <URL:https://wiki.openindiana.org/oi/Adding+SMART+disk+monitoring+as+a+SMF+service>

I have now (finally) managed to get perhaps the key bit of reporting from smartctl. Does this seem adequately diagnostic? (I am fully satisfied to replace the drive; I just want to be sure I’ve run to ground any potential root causes.)

$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t0d0s0 | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000b   094   094   016    Pre-fail  Always       -       1376259

$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t1d0s0 | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

The non-zero raw count on c2t0d0, against zero on its mirror partner c2t1d0, seems consistent with all the read errors I see at boot.
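
For comparing the two mirror halves attribute by attribute rather than grepping one row at a time, a small awk filter may help. This is only a sketch: `smart_summary` is a hypothetical helper name, and it assumes the usual ten-column attribute table that smartctl prints (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw). It marks an attribute FAILING when the normalized value is at or below the threshold, which is how smartctl documents attribute failure. The raw column is vendor-specific, so a large raw number is a hint and not proof on its own.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): summarize SMART
# attribute rows read from stdin.
# Assumed column layout of each attribute row:
#   ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_summary() {
    awk '
        # Attribute rows start with a numeric ID and carry at least 10 fields;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by this test.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value  = $4 + 0    # normalized value, higher is healthier
            thresh = $6 + 0    # vendor failure threshold
            status = (value <= thresh) ? "FAILING" : "ok"
            printf "%s value=%d thresh=%d margin=%d raw=%s %s\n",
                   $2, value, thresh, value - thresh, $10, status
        }
    '
}

# Possible usage on the two rpool disks from this thread:
#   pfexec smartctl -A -d sat,12 /dev/rdsk/c2t0d0s0 | smart_summary
#   pfexec smartctl -A -d sat,12 /dev/rdsk/c2t1d0s0 | smart_summary
```

Fed the c2t0d0 row above, this reports a margin of 78 (094 against a threshold of 016), so by SMART's own criterion the attribute has not failed even though the raw count is non-zero. The difference from the mirror partner is the more telling signal.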

> Happy hunting,
> John
> [email protected]

> On Dec 10, 2018, at 6:36 PM, jason matthews <[email protected]> wrote:
>
> Have you tried look see if the drives are accumulating errors?

Other iostat fun I’ve tried before (not very helpful!):

$ iostat -ien
  ---- errors ---
  s/w h/w trn tot device
...
    0   0   0   0 rpool
    0   0   0   0 c2t0d0
    0   0   0   0 c2t1d0
...

> if so, pull the bad drive.
>
> What happens if you go into the boot manager and manually select a boot disk?
> If the problem is with a single drive, then the other drive should boot
> normally right? Try booting from both drives select each one manually.

That’s also interesting. With the hundreds of read errors at boot-up, the boot manager is never even (visibly) presented. I guess I could try this again after booting from a USB image...

> you can speed up the scrub with:
>
> echo zfs_scrub_delay/W0x0 |mdb -kw
>
> echo zfs_scan_min_time_ms/W0x0

Good commands for reference. I was unaware of these! But even with the scrub canceled for the moment, I am still seeing virtually continuous drive controller traffic.

You also wanted to see:

$ iostat -nMxC 5
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  962.3    0.0   11.3 15.7  0.2   16.3    0.2   5  23 c2
    0.0  398.4    0.0    4.3  7.1  0.1   17.9    0.2  83   6 c2t0d0
    0.0  415.2    0.0    4.2  8.6  0.1   20.6    0.2  87   9 c2t1d0
    0.0   40.2    0.0    0.7  0.0  0.0    0.0    0.4   0   2 c2t2d0
    0.0   40.4    0.0    0.7  0.0  0.0    0.0    1.1   0   4 c2t3d0
    0.0   34.4    0.0    0.7  0.0  0.0    0.0    0.3   0   1 c2t4d0
    0.0   33.6    0.0    0.7  0.0  0.0    0.0    0.3   0   1 c2t5d0

Again, I assume the symmetry in findings between t0 and t1 is due to their mirrored status, but that doesn’t help in differentiating the offending device. (For comparison, t2-t5 are the data pool.) There is essentially zero ‘user’ activity on either the data or root pools...

_______________________________________________
openindiana-discuss mailing list
[email protected]
https://openindiana.org/mailman/listinfo/openindiana-discuss
