Re: [OpenIndiana-discuss] Huge ZFS root pool slowdown - diagnose root cause?

Lou Picciano Tue, 11 Dec 2018 08:17:26 -0800

John, Jason,

Many thanks for your brainstorming on this…


> On Dec 10, 2018, at 6:19 PM, John D Groenveld <[email protected]> wrote:
> 
> In message <[email protected]>, Lou Picciano writes:
>> Is this evidence of erroneous attempts to read boot blocks/loader on disk0?
>> 
>> Given the machine BIOS identification of drives, dunno that I can be 
>> absolutely certain disk0 is referring to one disk - or is the entire rpool
>> seen as disk0 once the OS is loading?
> 
> Does iostat(1M) -E report errors?

Absolutely none. In fact, having called precisely that command before, I was 
thrown by the ‘Errors: 0’ everywhere…

sd1       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA      Product: Hitachi HDS5C302 Revision: A180 Serial No: ML0221F302X0MD 
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd2       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: ATA      Product: Hitachi HDS5C302 Revision: A580 Serial No: ML0220F30HWBSD 
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

> Have you tried interrogating the drives via smartctl?
> <URL:https://wiki.openindiana.org/oi/Adding+SMART+disk+monitoring+as+a+SMF+service>

I have now (finally) managed to get perhaps the key bit of reporting from 
smartctl. Does this seem adequately diagnostic?
(I am fully satisfied to replace the drive; I just want to be sure I’ve run to 
ground any potential root causes.)

$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t0d0s0 | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000b   094   094   016    Pre-fail  Always       -       1376259
$ pfexec smartctl -a -d sat,12 /dev/rdsk/c2t1d0s0 | grep Raw_Read
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0

Above seems consistent with all the read errors I see at boot.
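
For comparing the two drives side by side, here is a minimal sketch that pulls 
one attribute out of the smartctl attribute table and prints the normalized 
value next to its threshold. The function name smart_attr_summary is made up 
for illustration. It assumes the usual ten-column layout (ID#, ATTRIBUTE_NAME, 
FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) with each 
attribute on a single line. The raw value is vendor-specific, so the sketch 
only reports it and does not try to interpret it. If the default awk rejects 
-v, nawk or gawk should work.

```shell
# Hypothetical helper (not part of smartmontools): read a smartctl attribute
# table on stdin and summarise the attribute named in $1.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_attr_summary() {
    awk -v attr="$1" '
        $2 == attr {
            # margin = normalized VALUE minus THRESH; the attribute is only
            # flagged as failing once VALUE drops to THRESH or below
            printf "%s value=%d worst=%d thresh=%d margin=%d raw=%s\n",
                   $2, $4, $5, $6, $4 - $6, $10
        }'
}
```

Possible usage: 
pfexec smartctl -A -d sat,12 /dev/rdsk/c2t0d0s0 | smart_attr_summary Raw_Read_Error_Rate
With the figures quoted above, that works out to margin=78 for c2t0d0 and 
margin=84 for c2t1d0, so neither drive has crossed the threshold, although 
c2t0d0 is the only one with a non-zero raw count.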

> Happy hunting,
> John
> [email protected]
> 

> On Dec 10, 2018, at 6:36 PM, jason matthews <[email protected]> wrote:
> 
> Have you tried looking to see if the drives are accumulating errors?
> 
In other iostat fun I’ve tried before (not very helpful!): 
$ iostat -ien
  ---- errors --- 
  s/w h/w trn tot device
...
    0   0   0   0 rpool
    0   0   0   0 c2t0d0
    0   0   0   0 c2t1d0
...
> 
> if so, pull the bad drive.
> 
> What happens if you go into the boot manager and manually select a boot disk? 
> If the problem is with a single drive, then the other drive should boot 
> normally, right? Try booting from both drives, selecting each one manually.

That’s also interesting. With the hundreds of read errors at boot up, the boot 
manager is never even (visibly) presented. I guess I could try this again after 
booting from a USB image...
> 
> you can speed up the scrub with:
> 
> echo zfs_scrub_delay/W0x0 |mdb -kw
> 
> echo zfs_scan_min_time_ms/W0x0 | mdb -kw

Good commands for reference. I was unaware of these! But, even with scrub 
canceled for the moment, am still seeing virtually continuous drive controller 
traffic.

You also wanted to see:
$ iostat -nMxC 5
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  962.3    0.0   11.3 15.7  0.2   16.3    0.2   5  23 c2
    0.0  398.4    0.0    4.3  7.1  0.1   17.9    0.2  83   6 c2t0d0
    0.0  415.2    0.0    4.2  8.6  0.1   20.6    0.2  87   9 c2t1d0
    0.0   40.2    0.0    0.7  0.0  0.0    0.0    0.4   0   2 c2t2d0
    0.0   40.4    0.0    0.7  0.0  0.0    0.0    1.1   0   4 c2t3d0
    0.0   34.4    0.0    0.7  0.0  0.0    0.0    0.3   0   1 c2t4d0
    0.0   33.6    0.0    0.7  0.0  0.0    0.0    0.3   0   1 c2t5d0

Again, I assume the symmetry in findings between t0 and t1 is due to their 
mirrored status… But it doesn’t seem to help in differentiating the offending 
device. (For comparison, t2-t5 are the data pool.) There is essentially zero 
‘user’ activity on either data or root pools...
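
To put numbers on that write pattern, a minimal sketch that reads rows in the 
iostat -nMx layout shown above and prints, per device, the writes per second 
and the average size of each write. The function name iostat_write_profile is 
made up for illustration, and it assumes the -M figures are in units of 
1024 KB. Since Mw/s is rounded to one decimal, the per-write size is only 
approximate. It will not single out a failing disk by itself.

```shell
# Hypothetical helper: read `iostat -nMx` style rows on stdin and print, per
# device, writes/s, average KB per write, write service time and %w.
# Assumed columns: r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
iostat_write_profile() {
    awk '
        # data rows have 11 fields and start with a number; skip idle devices
        NF == 11 && $1 ~ /^[0-9.]+$/ && $2 > 0 {
            # Mw/s is MB/s under -M; assume 1 MB = 1024 KB
            printf "%s w/s=%.1f avg_write_kb=%.1f wsvc_t=%.1f pct_w=%d\n",
                   $11, $2, $4 * 1024 / $2, $7, $9
        }'
}
```

Possible usage: iostat -nMxC 5 | iostat_write_profile
On the sample above, the two rpool disks come out at roughly 400 writes/s of 
about 10 to 11 KB each, against 34 to 40 writes/s of about 18 to 21 KB on the 
data pool disks. With -C the controller aggregate row (c2) is printed as well.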