While trying to reproduce the problem of "xl save" of an HVM guest
sporadically returning EFAULT, I happened to catch another bug:

From time to time we have seen failures of

test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm 16 guest-localmigrate/x10

where there seemed to be problems with suspend handling in Xen. I have
now seen the very same problem while trying to do "xl save", but this
time I could look into the guest afterwards. The guest had the following
in its kernel log:

[ 2680.945450] Freezing user space processes ...
[ 2700.949012] Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 2700.949027] btrfs           D    0  1976   1971 0x00000004
[ 2700.949033] Call Trace:
[ 2700.949059]  ? __schedule+0x2bf/0x850
[ 2700.949066]  schedule+0x39/0x90
[ 2700.949073]  io_schedule+0x12/0x40
[ 2700.949081]  blk_mq_get_tag+0x12b/0x260
[ 2700.949090]  ? elv_bio_merge_ok+0x12/0x70
[ 2700.949097]  ? remove_wait_queue+0x60/0x60
[ 2700.949102]  blk_mq_get_request+0xe6/0x3d0
[ 2700.949108]  blk_mq_make_request+0x10b/0x640
[ 2700.949115]  generic_make_request+0xf8/0x2e0
[ 2700.949120]  submit_bio+0x6e/0x140
[ 2700.949185]  scrub_add_page_to_rd_bio+0xf5/0x280 [btrfs]
[ 2700.949195]  ? __alloc_pages_nodemask+0xd1/0x260
[ 2700.949241]  scrub_pages+0x205/0x420 [btrfs]
[ 2700.949285]  scrub_stripe+0x934/0x10e0 [btrfs]
[ 2700.949297]  ? _raw_spin_unlock+0xc/0x20
[ 2700.949328]  ? block_rsv_release_bytes+0x148/0x2a0 [btrfs]
[ 2700.949369]  scrub_chunk+0x10a/0x150 [btrfs]
[ 2700.949408]  scrub_enumerate_chunks+0x27c/0x610 [btrfs]
[ 2700.949417]  ? add_wait_queue+0x70/0x70
[ 2700.949453]  btrfs_scrub_dev+0x1f2/0x510 [btrfs]
[ 2700.949462]  ? _copy_from_user+0x2e/0x60
[ 2700.949503]  btrfs_ioctl+0x11ab/0x2070 [btrfs]
[ 2700.949513]  ? kmem_cache_alloc_node+0x1dc/0x210
[ 2700.949516]  ? create_task_io_context+0x1e/0xf0
[ 2700.949523]  do_vfs_ioctl+0x8f/0x5c0
[ 2700.949527]  ? get_task_io_context+0x42/0x70
[ 2700.949534]  ? __fget+0x6c/0xa0
[ 2700.949539]  SyS_ioctl+0x74/0x80
[ 2700.949544]  entry_SYSCALL_64_fastpath+0x24/0x87
[ 2700.949549] RIP: 0033:0x7f03452424b7
[ 2700.949552] RSP: 002b:00007f034515ed68 EFLAGS: 00000246
[ 2700.949558] OOM killer enabled.
[ 2700.949560] Restarting tasks ... done.

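This is a rather recent kernel (4.15). The backtrace shows rather
clearly that suspending failed due to block I/O: the btrfs task, in
the middle of a scrub ioctl, was in uninterruptible sleep (state D) in
blk_mq_get_tag() and therefore could not be frozen within the 20
second limit. (I should note here that Xenstore is being suspended
_after_ trying to freeze processes.)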
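
For anyone wanting to check other guest logs for the same signature,
here is a minimal sketch in Python. It is only an illustration: the
function names are made up, the regular expressions are derived from
nothing more than the log excerpt above, and it also tolerates log
lines that a mail client has wrapped.

```python
import re

# Patterns are assumptions based solely on the 4.15 log excerpt above.
FREEZE_FAIL = re.compile(
    r"Freezing of tasks failed after (?P<secs>\d+\.\d+) seconds\s+"
    r"\((?P<count>\d+) tasks\s+refusing to freeze"
)
# Task dump line: "[ ts] <comm> <state> <stack> <pid> <ppid> <flags>"
TASK_LINE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<comm>\S+)\s+(?P<state>[A-Z])\s+\d+\s+"
    r"(?P<pid>\d+)\s+\d+\s+0x[0-9a-fA-F]+"
)


def join_wrapped(log_text):
    """Re-attach continuation lines (no leading '[' timestamp) to the previous line."""
    lines = []
    for raw in log_text.splitlines():
        if raw.startswith("[") or not lines:
            lines.append(raw)
        else:
            lines[-1] += " " + raw.strip()
    return lines


def find_freeze_failures(log_text):
    """Return one dict per freeze failure: the timeout and the refusing tasks."""
    results = []
    lines = join_wrapped(log_text)
    for i, line in enumerate(lines):
        m = FREEZE_FAIL.search(line)
        if not m:
            continue
        wanted = int(m.group("count"))
        tasks = []
        for follow in lines[i + 1:]:
            if len(tasks) >= wanted:
                break
            t = TASK_LINE.match(follow)
            if t:
                tasks.append((t.group("comm"), int(t.group("pid"))))
        results.append({"seconds": float(m.group("secs")), "tasks": tasks})
    return results
```

Fed the excerpt above, this should report a single failure with the
btrfs task (pid 1976) as the one refusing to freeze.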

So I'm quite confident that this problem is in no way related to Xen,
but could happen on bare metal, too, e.g. when closing the lid of a
notebook.

Another note: A retry of suspending the guest worked like a charm, so
libxl could simply retry the suspend. Another idea would be to have a
way for the guest to tell Xen that suspend failed, so that xl doesn't
have to wait for the timeout to expire...
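
To make the retry idea concrete, here is a rough sketch of the control
flow in Python. This is not libxl code (libxl is written in C) and it
uses no libxl API: try_suspend and SuspendFailed are hypothetical
stand-ins for "ask the guest to suspend" and "the guest reported that
suspend failed", and the attempt count and delay are arbitrary.

```python
import time


class SuspendFailed(Exception):
    """Hypothetical: raised when the guest reports that suspend failed."""


def suspend_with_retry(try_suspend, attempts=3, delay=1.0, sleep=time.sleep):
    """Call try_suspend() until it succeeds or attempts are exhausted.

    Returns the number of the attempt that succeeded (1-based) and
    re-raises the last SuspendFailed if every attempt failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            try_suspend()
            return attempt
        except SuspendFailed as err:
            last_error = err
            # Give in-flight I/O a chance to finish before trying again.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

The point of a distinct failure signal is that the caller can retry
immediately instead of sitting out the full timeout first.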

And yes, this problem is completely different from the EFAULT problem,
which can't be caused by the guest.


Juergen

_______________________________________________
Xen-devel mailing list
[email protected]
https://lists.xenproject.org/mailman/listinfo/xen-devel
