Hi,
While trying to diagnose a bug I'm seeing, git bisect is pointing to
your patch 084140bd49:
exec: fix access to ram_list.dirty_memory when sync dirty bitmap
It looks like the dirty page count is wrong for some reason.
Alex Bennée spotted a problem where the postcopy test would occasionally
fail under very heavy load; after attaching a debugger, it looks like
the problem is that the migration_dirty_pages count is stuck at 2.
In the normal migration tests we don't spot this, because 2 pages is
smaller than the threshold to end migration, so the extra 2 pages
don't block it from finishing. However, with a very
small downtime setting (like we use in the postcopy test) and with
very low bandwidth (as when Alex ran the test on a very heavily loaded
machine) we end up never calling the bitmap sync again and never
completing the iteration.
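To make the threshold argument concrete, here is a rough sketch of the
comparison involved. It is a simplified model, not the actual migration
code: the names can_complete() and threshold_bytes(), the 4096-byte page
size and the strict "<" comparison are all assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: how many bytes could be sent within the allowed
 * downtime at the measured bandwidth.
 */
static uint64_t threshold_bytes(uint64_t bandwidth_bytes_per_ms,
                                uint64_t downtime_ms)
{
    return bandwidth_bytes_per_ms * downtime_ms;
}

/*
 * Hypothetical helper: the iteration can only finish when the data
 * reported as still dirty fits under the threshold.  A stale count of
 * 2 pages passes easily with a generous threshold, but blocks
 * completion forever once the threshold drops below 2 pages.
 */
static bool can_complete(uint64_t dirty_pages,
                         uint64_t bandwidth_bytes_per_ms,
                         uint64_t downtime_ms)
{
    uint64_t pending = dirty_pages * SKETCH_PAGE_SIZE;

    return pending < threshold_bytes(bandwidth_bytes_per_ms, downtime_ms);
}
```

With these made-up numbers, can_complete(2, 100000, 300) is true (the
normal case, where the 2 stray pages go unnoticed), while
can_complete(2, 10, 1) is false (tiny downtime and very low bandwidth,
so the stuck count never gets under the threshold).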
I'm using the following addition to spot the problem:
diff --git a/migration/ram.c b/migration/ram.c
index e75f1050e4..3ddf884952 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
     } while (!pages && again);
+    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
+    {
+        /* Should make this fail migration ? */
+        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
+                __func__, rs->migration_dirty_pages);
+    }
+
     rs->last_seen_block = pss.block;
     rs->last_page = pss.page;
(which I might add as a test to fail a migration)
That check triggers easily even on an unloaded machine:
tests/postcopy-test
/x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
OK
I'll try and debug where our extra two pages are coming from.
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK