I've been running this for more than 3 days now and cannot reproduce this
specific issue. Judging by the error, it appears to be a hardware-related
NMI issue, so perhaps we have some faulty H/W in this specific
case.

After running these tests for several days, both with and without the
kernel parameters, I have observed the following:

1. A reboot can take more than 10-15 minutes.
2. Our instances were being accidentally deleted by a Jenkins job, which could
explain why some of our original assumptions that the VM had died on reboot
were incorrect.
3. When a reboot is issued almost immediately after ssh access becomes
available, the reboot gets stuck with systemd issues:

sudo reboot
systemctl status reboot.target
Failed to get properties: Connection timed out

and the only way to reboot is using the following:
sudo systemctl --force reboot

This could also be a reason why the automated reboot testing got locked
up and we mistakenly believed that reboots were failing due to H/W
issues.
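
As a rough sketch of how the automated reboot test could avoid point 3,
the script below waits until systemd reports that boot has finished
before a reboot is requested. The function name wait_for_boot, the
STATE_CMD variable and the timeout values are hypothetical and chosen
only for illustration; the only real command it relies on is
"systemctl is-system-running". I have not verified that this avoids the
hang on i3.metal.

```shell
#!/bin/sh
# Sketch only: wait_for_boot and STATE_CMD are hypothetical names.
# STATE_CMD can be overridden so the logic can be tried without systemd.
STATE_CMD="${STATE_CMD:-systemctl is-system-running}"

# wait_for_boot TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 once systemd reports "running" or "degraded" (boot finished),
# or 1 if the timeout expires first.
wait_for_boot() {
    timeout_s="$1"
    interval_s="${2:-5}"
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        # Other states (e.g. "initializing", "starting") mean boot is
        # still in progress, so keep polling.
        state="$($STATE_CMD 2>/dev/null)"
        case "$state" in
            running|degraded) return 0 ;;
        esac
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 1
}
```

A test job might then run something like
"wait_for_boot 900 && sudo reboot || sudo systemctl --force reboot",
where the 900 second limit is an assumption based on the 10-15 minute
reboot times noted above.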

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1822175

Title:
  i3.metal flavour type fails to respond after a reboot

Status in linux-aws package in Ubuntu:
  In Progress

Bug description:
  Series: Cosmic
  Instance Size: I3.Metal
  Region: (Default) US-WEST-2
  Kernel: linux-aws

  During SRU testing the i3.metal instance flavor type will sometimes
  fail to respond after the instance is rebooted. Usually this has been
  seen at least 2 or 3 times during a test cycle.

  While rebooting an I3.Metal instance on the AWS Cloud, I observed the
  following crash, which resulted in tearing down the instance and
  starting over. The instance had only been restarted ~4 times at the
  time of this failure.

  
  [  OK  ] Reached target Shutdown.
  [  OK  ] Reached target Final Step.
           Starting Reboot...
           Stopping LVM2 metadata daemon...
  [  OK  ] Stopped LVM2 metadata daemon.
  [  447.340575] INFO: rcu_sched self-detected stall on CPU
  [  447.340577] INFO: rcu_sched self-detected stall on CPU
  [  447.340580] INFO: rcu_sched self-detected stall on CPU
  [  447.340587] INFO: rcu_sched self-detected stall on CPU
  [  447.340590] INFO: rcu_sched self-detected stall on CPU
  [  447.340592] INFO: rcu_sched self-detected stall on CPU
  [  447.340595] Uhhuh. NMI received for unknown reason 21 on CPU 0.
  [  447.340599] INFO: rcu_sched self-detected stall on CPU
  [  447.340602] INFO: rcu_sched self-detected stall on CPU
  [  447.340606] INFO: rcu_sched self-detected stall on CPU
  [  447.340614]        53-...!: (43 GPs behind) idle=7ce/1/0 softirq=392/392 fqs=0
  [  447.340617] INFO: rcu_sched self-detected stall on CPU
  [  447.340621] Do you have a strange power saving mode enabled?
  [  447.340628]        1-...!: (1 ticks this GP) idle=79e/1/0 softirq=881/881 fqs=0
  [  447.340632] INFO: rcu_sched self-detected stall on CPU
  [  447.340634] INFO: rcu_sched self-detected stall on CPU
  [  447.340636] INFO: rcu_sched self-detected stall on CPU
  [  447.340639] INFO: rcu_sched self-detected stall on CPU
  [  447.340641] INFO: rcu_sched self-detected stall on CPU
  [  447.340644] INFO: rcu_sched self-detected stall on CPU
  [  447.340647] INFO: rcu_sched self-detected stall on CPU

  The full log can be seen in the attached file.
