Hi all,
Bryan, Robert, and I have been investigating a pathological performance
issue related to the dbu_evict taskq that I think may be related to the
work done under illumos ticket 5056[0]. I'm looking for confirmation of
this analysis and our assessment of the underlying issues and any
workarounds, and I'm hoping that the issue can be addressed.
tl;dr: It seems that a number of operations, including at least
unmount and "zfs list", enqueue dataset evictions onto the dbu_evict taskq,
but there's no backpressure for this queue that would prevent it from
growing without bound. Moreover, the queue processing appears to be
O(n^2). Our workload does a lot of these operations and seems to have
reached a tipping point where work is increasing much faster than it can be
processed, and it will not recover.
Details: Our workload involves a very large number of reasonably frequent
ZFS rollback operations.[1] We have about 128 datasets on this server that
each get rolled back at least a few dozen times per day, though that varies
greatly based on customer workload. After a recent platform update from
December bits to relatively recent bits, we found one of our systems
running very sluggishly. Interactive use was noticeably slow, a "du" that
normally takes tens of minutes has taken over a day, and "zfs list"
operations are taking tens of minutes. Most of the wall time spent by slow
processes was sleeping. The system is under heavy memory pressure, nearly
all in kernel memory:
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   64357384            251396   96%
ZFS File Data               21560                84    0%
Anon                      1595082              6230    2%
Exec and libs                9315                36    0%
Page cache                   7429                29    0%
Free (cachelist)           858843              3354    1%
Free (freelist)            248552               970    0%

Total                    67098165            262102
Physical                 67098164            262102
We noticed that the dbu_evict taskq had grown to an enormous size:
> ::taskq -a
ADDR             NAME                      ACT/THDS      Q'ED      MAXQ INST
ffffff430d52da10 dbu_evict                     1/ 1  48332030  48332030    -
ffffff431e9ecc68 ipmi_kcs                      1/ 1         1         1    -
ffffff4344b79900 output_master                 1/ 1         1         1    -
ffffff42ed138370 STMF_SVC_TASKQ                1/ 1         1         1    0
ffffff42e2b7cb00 system_taskq                  1/ 64        0         -    0
The system is processing entries from the queue, but the queue length is
still growing quite quickly. In just under a day, it grew by about 22
million entries.
This taskq is used for several different operations. Here's a breakdown of
tqent_func for the first 20,000 entries in the queue:
6222 tqent_func = dnode_buf_pageout
1961 tqent_func = dsl_dataset_evict
33 tqent_func = dsl_dir_evict
1888 tqent_func = zap_evict
9890 tqent_func = zap_leaf_pageout
It's hard to say how representative this distribution is for the rest of
the 48M entries on the queue, but it looks like less than 10% of these
entries are for dsl_dataset_evict(). But that function is taking almost
all of the time. I profiled the taskq thread for 30 seconds:
# dtrace -n 'profile-97/curthread == (kthread_t *)0xffffff01e9866c40/{
@[stack()] = count(); }' -c 'sleep 30' > stacks.out
and got this flame graph:
https://us-east.manta.joyent.com/dap/public/stackvis/b6466c81-76af-488c-81e8-8375bc267eda/index.htm
Based on the number of samples, we can tell this thread has basically
pegged a CPU, and 95% of the overall time is spent in
dsl_prop_unregister(), called directly from dmu_objset_evict() from
dsl_dataset_evict(). Now, dsl_prop_unregister() does a linear scan of a
linked list associated with the dataset's dsl_dir. By spot-checking a
couple of the datasets, I found that this linked list is quite long:
> ::offsetof dsl_dir_t dd_prop_cbs
offsetof (dsl_dir_t, dd_prop_cbs) = 0x90, sizeof (...->dd_prop_cbs) = 0x20
> 0xffffff7469eb1940::print dsl_dataset_t ds_objset->os_dsl_dataset->ds_dir | ::eval ".+0x90::walk list" ! wc -l
1561915
In the first 10000 elements of the list, there seem to be 1000 different
dsl_datasets represented:
> 0xffffff7469eb1940::print dsl_dataset_t ds_objset->os_dsl_dataset->ds_dir | ::eval ".+0x90::walk list" ! head -10000 > dap3
> ::cat dap3 | ::print dsl_prop_cb_record_t cbr_ds ! sort -u | wc
    1001    3003   28028
The list seems to have 10 entries for each dsl_dataset: one for each of 10
dataset properties (checksum, compression, etc.) with callbacks like
checksum_changed_cb and the like. Note that this system should only have
about 440 datasets (including snapshots), so it's surprising that there
are 1000 distinct dsl_datasets (and quite possibly many, many more). That's
what leads me to suspect that our zfs rollback operations are causing
datasets to be logically destroyed, but the corresponding dsl_dataset_t is
kept around for asynchronous cleanup. The problem appears to be that we've
enqueued much more work than we can process. This is the first issue:
there doesn't seem to be any backpressure on this queue.
The second issue is that eviction is O(n^2): evicting each dataset scans
all callbacks registered on the dsl_dir, but that list includes the
callbacks for all of the other datasets being evicted, so processing N
evictions ends up being O(N^2).
The feedback loop appears to get worse because as memory pressure
increases, pageouts cause more ARC evictions, enqueueing still more work.
Finally, while inspecting a slow "zfs list", we noticed that the operation
causes more of these callbacks to be registered, and presumably
unregistered asynchronously later. That would mean list operations
contribute to this problem, too.
I'm not sure the best way to deal with this, but internally we've discussed
a few things that might help this situation:
- The most serious issue is the lack of backpressure on the task queue.
The easiest way to deal with this is to have the operations that enqueue
tasks wait for them to complete. Presumably the tasks were put into a
queue because it's not safe to wait in the context where the tasks are
enqueued, so this may be tricky. But without this, regardless of the other
suggestions below, there's no way to guarantee this whole mess of problems
can't happen again.
- If the linear search through a linked list were changed to an AVL tree
search, the cost of the eviction would go down considerably, and N
evictions would be O(N log N) instead of O(N^2). Not knowing a ton about
this code, I'd guess this is the easiest change to implement and the lowest
risk, though it's hard to say how long that solution would suffice.
- Since most of the taskq entries are not for the expensive
dsl_dataset_evict operation, moving that operation to a separate taskq
might reduce the memory impact, because we'd still be able to process all
the other entries on this taskq promptly. This may only be a marginal
improvement -- it's hard to know.
Thanks for making it this far. None of us has spent much time in this code
previously, so I hope we can get some confirmation of this analysis. If
there's any way to work around this in advance of a fix, that would be good
to know.
Thanks,
Dave
[0] https://www.illumos.org/issues/5056
[1] See my talk at http://open-zfs.org/wiki/OpenZFS_Developer_Summit_2014
for details.