Have you tried looking with 'latencytop'?

    pkg install diagnostic/latencytop
    latencytop

Then arrow over to the zpool process for your data pool.
That may help you track down what specifically is slowing you down. For
example, if you see something about space map loading, you need

    echo 'metaslab_debug/W1' | mdb -kw

to keep the space maps in memory; a high 'ZFS ZIL writer' entry means more
SSD / write log devices would be good, etc.

If you are using any 'dedup', you may also have just hit whatever limits
there are for the in-memory de-dup tables (although I dunno what that looks
like in latencytop...).

Running 'iostat -xnz 1' and looking for specific drives with notably longer
asvc_t and/or %w than the others is another easy check for sudden
performance drops (it suggests the system is slowing down for a single
bad/dying disk).

-Lucas Van Tol

> Date: Fri, 5 Jul 2013 20:09:45 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: [OpenIndiana-discuss] Sudden ZFS performance issue
>
> On Fri, Jul 5, 2013 at 8:00 PM, Saso Kiselkov <[email protected]> wrote:
> > On 05/07/2013 17:08, [email protected] wrote:
> >> Good morning,
> >>
> >> I have a weird problem with two of the 15+ OpenSolaris storage servers
> >> in our environment. All the Nearline servers are essentially the same:
> >> Supermicro X9DR3-F based server, Dual E5-2609's, 64GB memory, Dual 10Gb
> >> SFP+ NICs, LSI 9200-8e HBA, Supermicro CSE-826E26-R1200LPB storage
> >> arrays and Seagate enterprise 2TB SATA or SAS drives (not mixed within
> >> a server). Root, L2ARC and ZIL are all on Intel SSD (SLC series 313 for
> >> ZIL, MLC 520 for L2ARC and MLC 330 for boot).
> >>
> >> The volumes are built out of 9 drive Z1 groups, ashift is set to 9
> >> (which is supposed to be appropriate for the enterprise Seagates). The
> >> pools are large (120-130TB) but are only between 27 and 32% full. Each
> >> server serves an iSCSI (Comstar) and a CIFS (in kernel server) volume
> >> of the same pool. I realize this is not optimal from a
> >> recovery/resilver/rebuild standpoint but the servers are replicated and
> >> the data is easily rebuildable.
> >>
> >> Initially these servers did great for several months, while certainly
> >> no speed demons, 300+ MB/sec for sequential read/writes was not a
> >> problem. Several weeks ago, literally overnight, replication times went
> >> through the roof for one server. Simple testing showed that reading
> >> from the pool would no longer go over 25MB/s. Even a scrub that used to
> >> run at 400+ MB/sec is now crawling along at below 40MB/s.
> >>
> >> Sometime yesterday the second server started to exhibit the exact same
> >> behaviour. This one is used even less (it's our D2D2T server) and data
> >> is written to it at night and read during the day to be written to
> >> tape.
> >>
> >> I've exhausted all I know and I'm at a loss. Does anyone have any ideas
> >> of what to look at, or do any obvious reasons for this behaviour jump
> >> out from the configuration above?
> >
> > Isn't iostat -Exn reporting some transport errors? Smells like a drive
> > gone bad and forcing retries, which would cause about a 10x decrease in
> > performance. Just a guess, though.
>
> Why should a retry require a 10x decrease in performance? A proper
> design would surely do retries in parallel to other operations
> (Reiser4 and btrfs do it) up to a certain amount of
> failures-in-flight.
>
> Irek
>
> _______________________________________________
> OpenIndiana-discuss mailing list
> [email protected]
> http://openindiana.org/mailman/listinfo/openindiana-discuss
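
P.S. If eyeballing 'iostat -xnz 1' across that many drives gets tedious, here is a rough Python sketch of the same comparison. It assumes the usual column layout (asvc_t and %w present, device name last). The function names, the 3x-median threshold and the sample numbers are all made up for illustration, not taken from the servers in this thread.

```python
from statistics import median

# Made-up sample in the shape of one `iostat -xnz` interval (not real data).
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   41.0    3.0 2600.0   12.0  0.0  0.4    0.0    9.1   0  21 c4t0d0
   40.0    3.0 2550.0   12.0  0.0  0.4    0.0    8.7   0  20 c4t1d0
   12.0    1.0  700.0    4.0  3.2  1.0  246.0  310.5  48 100 c4t2d0
   42.0    3.0 2610.0   12.0  0.0  0.4    0.0    9.4   0  22 c4t3d0
"""


def parse_iostat(text):
    """Return {device: (asvc_t, pct_w)} for one interval of iostat output."""
    stats = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[-1] == "device":
            # Header row: remember the column names for the rows that follow.
            columns = fields
            continue
        if columns is None or len(fields) != len(columns):
            # Banner lines or anything else that is not a data row.
            continue
        row = dict(zip(columns, fields))
        try:
            stats[row["device"]] = (float(row["asvc_t"]), float(row["%w"]))
        except (KeyError, ValueError):
            continue
    return stats


def slow_disks(stats, factor=3.0):
    """Devices whose asvc_t is more than `factor` times the median asvc_t."""
    if not stats:
        return []
    typical = median(asvc for asvc, _ in stats.values())
    if typical <= 0:
        return []
    return sorted(dev for dev, (asvc, _) in stats.items()
                  if asvc > factor * typical)


if __name__ == "__main__":
    stats = parse_iostat(SAMPLE)
    for dev in slow_disks(stats):
        asvc, pct_w = stats[dev]
        print(f"{dev}: asvc_t={asvc} ms, %w={pct_w}")
```

The median is used instead of the mean so that one very slow disk does not drag the baseline up with it. On a real system, skip the first interval iostat prints, since it reports averages since boot.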
