Improved the test case with more detailed steps, to make it easier to script and validate the output.
** Description changed:

Description:
When performing a cold migration, if an exception is raised by the
driver during confirm_migration (this runs on the source node), the
migration record is stuck in the "confirming" state and the allocations
against the source node are not removed. The instance is fine at the
destination at this stage, but the source host has allocations that
cannot be cleaned up without going to the database or invoking the
Placement API via curl. After several migration attempts that fail at
the same spot, the source node fills up with these allocations, which
prevent new instances from being created on or migrated to this node.

When confirm_migration fails at this stage, the migrating instance can
be recovered through a hard reboot or a reset state to active.

Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem
inside driver.confirm_migration when running the libvirt driver.
However, the stale allocations and stuck migration status can easily be
reproduced by raising an exception in the libvirt driver's
confirm_migration method, and the problem would affect any driver.

Expected results:
Discussed this issue with efried and mriedem on #openstack-nova on
March 25th, 2019. They confirmed that allocations not being cleaned up
is a bug.

Actual results:
The instance is fine at the destination after a reset-state. The source
node has stale allocations that prevent new instances from being
created on or migrated to the source node. The migration record is
stuck in the "confirming" state.

Environment:
I verified this bug on the pike, queens and stein branches, running the
libvirt KVM driver.

=======================================================================

[Impact]

If users attempting to perform cold migrations hit any issue while the
virt driver is running the "confirm migration" step, the failure leaves
stale allocation records in the database and migration records stuck in
the "confirming" state. The stale allocations are not cleaned up by
nova and consume the user's quota indefinitely.

This bug was confirmed from the pike release through stein, and a fix
was implemented for queens, rocky and stein. It should be backported to
those releases to prevent the issue from recurring. The fix prevents
new stale allocations from being left behind by cleaning them up
immediately when the failure occurs. At the moment, users affected by
this bug have to clean up their previous stale allocations manually.

[Test Case]

+ 1. Reproducing the bug

+ 1a. Inject a failure

The root cause of this problem may vary for each driver and
environment, so to reproduce the bug it is first necessary to inject a
failure into the driver's confirm_migration method so that an exception
is raised. An example when using libvirt is to add the line

    raise Exception("TEST")

in
https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012
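As an illustration of that injection only (this is a sketch, not the
actual driver code: the exact signature and body of confirm_migration
differ between releases, so keep whatever the method looks like in your
tree and only add the raise at the top of it):

    # Sketch of the step 1a failure injection in nova/virt/libvirt/driver.py
    # (illustrative only; keep the real method signature and body unchanged).
    def confirm_migration(self, context, migration, instance, network_info):
        """Confirm a cold migration/resize on the source host."""
        # Temporary failure injection used only to reproduce this bug;
        # remove it once testing is finished.
        raise Exception("TEST")
        # The original method body follows here and is never reached
        # while the injection is in place.

After adding the line, continue with step 1b so that the running
nova-compute service picks up the change.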
- and then restart the nova-compute service.
- Then, invoke a cold migration through "openstack server migrate {id}",
- wait for VERIFY_RESIZE status, and then invoke "openstack server
- resize {id} --confirm". The confirmation will fail asynchronously and
- the instance will be in ERROR status, while the migration database
- record is in "confirming" state and the stale allocations for the
- source host are still present in the "allocations" database table.

+ 1b. Restart the nova-compute service: systemctl restart nova-compute

+ 1c. Create a VM

+ 1d. Invoke a cold migration: "openstack server migrate {id}"

+ 1e. Wait for instance status: VERIFY_RESIZE

+ 1f. Invoke "openstack server resize {id} --confirm"

+ 1g. Wait for instance status: ERROR

+ 1h. Check that the migration is stuck in "confirming" status: nova migration-list

+ 1i. Check the allocations; you should see 2 allocations, one with the
VM ID and the other with the migration uuid:

    export ENDPOINT=<placement_endpoint>
    export TOKEN=`openstack token issue | grep ' id ' | awk '{print $4}'`
    for id in $(curl -k -s -X GET $ENDPOINT/resource_providers \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17" \
        | jq -r .resource_providers[].uuid); do
        curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations \
            -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
            -H "OpenStack-API-Version: placement 1.17" \
            | jq [.allocations]
    done

+ 2. Cleanup

+ 2a. Delete the VM

+ 2b. Delete the stale allocation:

    export ID=<migration_uuid>
    curl -k -s -X DELETE $ENDPOINT/allocations/$ID \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17"

+ 3. Install the package that contains the fixed code

+ 4. Confirm the bug is fixed

+ 4a. Repeat steps 1a through 1g

+ 4b. Check that the migration is in "error" status: nova migration-list

+ 4c. Check the allocations; you should see only 1 allocation, with the
VM ID:

    for id in $(curl -k -s -X GET $ENDPOINT/resource_providers \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17" \
        | jq -r .resource_providers[].uuid); do
        curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations \
            -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
            -H "OpenStack-API-Version: placement 1.17" \
            | jq [.allocations]
    done

+ 5. Cleanup

+ 5a. Delete the VM

[Regression Potential]

A new functional test, https://review.opendev.org/#/c/657870/,
validates the fix and was backported all the way to queens. Backporting
the fix caused no existing functional test to fail.

[Other Info]

None

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1821594

Title:
  [SRU] Error in confirm_migration leaves stale allocations and
  'confirming' migration state

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821594/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs