Just for posterity, I was able to reproduce the problem on illumos as well
(running a kernel without Justin's fix):
> ::panicinfo
cpu 0
thread ffffff0004819c40
message assertion failed: avl_is_empty(&dn->dn_dbufs), file: .../../common/fs/zfs/dnode_sync.c, line: 495
> $C
ffffff0004819770 vpanic()
ffffff00048197a0 0xfffffffffbdf37d8()
ffffff00048197e0 dnode_sync_free+0x278(ffffff01cb3b6528, ffffff0166429a00)
ffffff0004819860 dnode_sync+0x68b(ffffff01cb3b6528, ffffff0166429a00)
ffffff00048198b0 dmu_objset_sync_dnodes+0x93(ffffff014af3d330, 0, ffffff0166429a00)
ffffff00048199b0 dmu_objset_sync+0x1bc(ffffff014af3d080, ffffff014b1ff560, ffffff0166429a00)
ffffff00048199f0 dsl_pool_sync_mos+0x42(ffffff01495d7e80, ffffff0166429a00)
ffffff0004819a80 dsl_pool_sync+0x2fe(ffffff01495d7e80, 233d6)
ffffff0004819b50 spa_sync+0x27e(ffffff014c2c0000, 233d6)
ffffff0004819c20 txg_sync_thread+0x260(ffffff01495d7e80)
ffffff0004819c30 thread_start+8()
On 28/09/2015 17:25, Andriy Gapon wrote:
>
> There are several reports from FreeBSD users about getting a panic because of
> the avl_is_empty(&dn->dn_dbufs) assertion in dnode_sync_free(). I was also
> able to reproduce the problem with ZFS on Linux 0.6.5. There do not seem to
> be any reports from illumos users.
>
> I think that the following stack traces demonstrate the problem rather well
> (they are a little unusual as they come from Linux's crash utility, but
> should be legible):
> crash> foreach UN bt
> PID: 703 TASK: ffff88003b8a4440 CPU: 0 COMMAND: "txg_sync"
> #0 [ffff880039fa3848] __schedule at ffffffff8160918d
> #1 [ffff880039fa38b0] schedule at ffffffff816096e9
> #2 [ffff880039fa38c0] spl_panic at ffffffffa0012645 [spl]
> #3 [ffff880039fa3a48] dnode_sync at ffffffffa062b7cf [zfs]
> #4 [ffff880039fa3b38] dmu_objset_sync_dnodes at ffffffffa0612dd7 [zfs]
> #5 [ffff880039fa3b78] dmu_objset_sync at ffffffffa06130d5 [zfs]
> #6 [ffff880039fa3c50] dsl_pool_sync at ffffffffa0641a8a [zfs]
> #7 [ffff880039fa3cd0] spa_sync at ffffffffa0664408 [zfs]
> #8 [ffff880039fa3da0] txg_sync_thread at ffffffffa067b970 [zfs]
> #9 [ffff880039fa3e98] thread_generic_wrapper at ffffffffa000e18a [spl]
> #10 [ffff880039fa3ec8] kthread at ffffffff8109726f
> #11 [ffff880039fa3f50] ret_from_fork at ffffffff81614198
>
> PID: 716 TASK: ffff88003b8a6660 CPU: 0 COMMAND: "trial"
> #0 [ffff88003c68f738] __schedule at ffffffff8160918d
> #1 [ffff88003c68f7a0] schedule at ffffffff816096e9
> #2 [ffff88003c68f7b0] cv_wait_common at ffffffffa0014d15 [spl]
> #3 [ffff88003c68f818] __cv_wait at ffffffffa0014e65 [spl]
> #4 [ffff88003c68f828] txg_wait_synced at ffffffffa067a70f [zfs]
> #5 [ffff88003c68f868] dsl_sync_task at ffffffffa064b017 [zfs]
> #6 [ffff88003c68f928] dsl_destroy_head at ffffffffa06eee62 [zfs]
> #7 [ffff88003c68f978] dmu_recv_cleanup_ds at ffffffffa06194ed [zfs]
> #8 [ffff88003c68fa98] dmu_recv_stream at ffffffffa061a992 [zfs]
> #9 [ffff88003c68fc20] zfs_ioc_recv at ffffffffa06b1bad [zfs]
> #10 [ffff88003c68fe50] zfsdev_ioctl at ffffffffa06b3c86 [zfs]
> #11 [ffff88003c68feb8] do_vfs_ioctl at ffffffff811d9ca5
> #12 [ffff88003c68ff30] sys_ioctl at ffffffff811d9f21
> #13 [ffff88003c68ff80] system_call_fastpath at ffffffff81614249
> RIP: 00007ff39d5c0257 RSP: 00007ff38e5c2008 RFLAGS: 00010206
> RAX: 0000000000000010 RBX: ffffffff81614249 RCX: 0000000000000024
> RDX: 00007ff38e5c21d0 RSI: 0000000000005a1b RDI: 0000000000000004
> RBP: 00007ff38e5c57b0 R8: 342d663438372d62 R9: 636430382d646335
> R10: 643266636131612d R11: 0000000000000246 R12: 0000000000000060
> R13: 00007ff38e5c3200 R14: 00007ff3880080a0 R15: 00007ff38e5c21d0
> ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
>
> PID: 31758 TASK: ffff88003b332d80 CPU: 0 COMMAND: "dbu_evict"
> #0 [ffff88003b723ca0] __schedule at ffffffff8160918d
> #1 [ffff88003b723d08] schedule_preempt_disabled at ffffffff8160a8d9
> #2 [ffff88003b723d18] __mutex_lock_slowpath at ffffffff81608625
> #3 [ffff88003b723d78] mutex_lock at ffffffff81607a8f
> #4 [ffff88003b723d90] dbuf_rele at ffffffffa05fd290 [zfs]
> #5 [ffff88003b723db0] dmu_buf_rele at ffffffffa05fe57e [zfs]
> #6 [ffff88003b723dc0] bpobj_close at ffffffffa05f78ed [zfs]
> #7 [ffff88003b723dd8] dsl_deadlist_close at ffffffffa0636e19 [zfs]
> #8 [ffff88003b723e10] dsl_dataset_evict at ffffffffa062d78b [zfs]
> #9 [ffff88003b723e28] taskq_thread at ffffffffa000f912 [spl]
> #10 [ffff88003b723ec8] kthread at ffffffff8109726f
> #11 [ffff88003b723f50] ret_from_fork at ffffffff81614198
>
> In 100% of the cases where I hit the assertion, it was with DMU_OT_BPOBJ
> dnodes. Justin thinks that the situation is harmless and that the assertion
> can be removed. I agree with him.
> But on the other hand, I wonder if something could be done in the DSL to
> avoid the described situation.
> I mean, it seems that bpo_cached_dbuf is a rare (the only?) case where a
> dbuf can be held beyond the lifetime of its dnode...
>
--
Andriy Gapon
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer