So after reading and experimenting a bit more: what the upstream change
does is set the defaults to

spec_store_bypass_disable=prctl
spectre_v2_user=prctl

instead of "seccomp".  This means that instead of the kernel applying
these mitigations automatically for every seccomp() user, it is up to
userspace to enable them explicitly via prctl().  The linked upstream
change goes into all the reasons why this is the right thing to do.
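
For anyone curious, something like the following (an untested sketch,
error handling just illustrative) is what an individual process now has
to do for itself via the speculation-control prctl() interface, rather
than getting it implicitly from seccomp():

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>   /* PR_SET_SPECULATION_CTRL, PR_SPEC_* */

    int main(void)
    {
        /* Ask the kernel to disable speculative store bypass for this
         * task -- the step the kernel used to do implicitly for every
         * seccomp() user when the default was "seccomp". */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                  PR_SPEC_DISABLE, 0, 0) != 0) {
            fprintf(stderr, "PR_SET_SPECULATION_CTRL: %s\n",
                    strerror(errno));
            return 1;
        }

        /* Read back the state; the return value is a bitmask of
         * PR_SPEC_* flags, e.g. PR_SPEC_PRCTL | PR_SPEC_DISABLE. */
        int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                          0, 0, 0);
        printf("SSB state: 0x%x\n", state);
        return 0;
    }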

From the cpuid output of the guest in the failing cloud provider we see

      SSBD: speculative store bypass disable   = true

suggesting that this has been explicitly disabled?  It's unclear to me
whether that's set by the cloud provider in qemu, and I'm not sure I can
tell from inside a guest without backend access.
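
If I'm reading things right, that flag corresponds to CPUID leaf 7,
subleaf 0, EDX bit 31 on Intel-style CPUs, so a throwaway snippet like
this (untested sketch) reads the same raw bit from inside the guest that
cpuid(1) is decoding above:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 7, subleaf 0: EDX bit 31 advertises SSBD, i.e.
         * that bit 2 of the IA32_SPEC_CTRL MSR should be writable. */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 7 not supported\n");
            return 1;
        }
        printf("SSBD advertised: %s\n",
               (edx & (1u << 31)) ? "yes" : "no");
        return 0;
    }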

OpenDev is a canary for this sort of thing because our clouds are
extremely heterogeneous: we have resources donated by about 7-8
different cloud providers, each with multiple regions (across x86_64 and
arm64), all used simultaneously for CI work (we use whatever people will
donate).  I've tested that booting with spec_store_bypass_disable=prctl
stops the traces in the affected cloud, so we'll probably implement this.

However, I think there's probably enough here to consider backporting
this commit for maximum compatibility of the generic images.  The system
works well enough (which is how it passed all our initial CI), but the
constant stream of traces quickly fills disks with bloated log files
(which is how we found it after running in production).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1973839

Title:
  5.15.0-30-generic :  unchecked MSR access error: WRMSR to 0x48 (tried
  to write 0x0000000000000004)

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  When booting this in one of our clouds, we see an error early in the
  kernel output

    kernel: unchecked MSR access error: WRMSR to 0x48 (tried to write
  0x0000000000000004) at rIP: 0xffffffffabc90af4
  (native_write_msr+0x4/0x20)

  and then an unending stream of "bare" tracebacks, which I think must
  be related:

  [    2.285717] kernel: Call Trace:
  [    2.285722] kernel:  <TASK>
  [    2.285723] kernel:  ? speculation_ctrl_update+0x95/0x200
  [    2.292001] kernel:  speculation_ctrl_update_current+0x1f/0x30
  [    2.292011] kernel:  ssb_prctl_set+0x92/0xe0
  [    2.292016] kernel:  arch_seccomp_spec_mitigate+0x62/0x70
  [    2.292019] kernel:  seccomp_set_mode_filter+0x4de/0x530
  [    2.292024] kernel:  do_seccomp+0x37/0x1f0
  [    2.292026] kernel:  __x64_sys_seccomp+0x18/0x20
  [    2.292028] kernel:  do_syscall_64+0x5c/0xc0
  [    2.292035] kernel:  ? handle_mm_fault+0xd8/0x2c0
  [    2.299617] kernel:  ? do_user_addr_fault+0x1e3/0x670
  [    2.312878] kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
  [    2.312894] kernel:  ? irqentry_exit_to_user_mode+0x9/0x20
  [    2.312905] kernel:  ? irqentry_exit+0x19/0x30
  [    2.312907] kernel:  ? exc_page_fault+0x89/0x160
  [    2.312909] kernel:  ? asm_exc_page_fault+0x8/0x30
  [    2.312914] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [    2.312919] kernel: RIP: 0033:0x7fcffd6eaa3d
  [    2.312924] kernel: Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
  [    2.312926] kernel: RSP: 002b:00007ffe352e2938 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
  [    2.312930] kernel: RAX: ffffffffffffffda RBX: 0000557d99d0c0c0 RCX: 00007fcffd6eaa3d
  [    2.319941] systemd[1]: Starting Load Kernel Module configfs...
  [    2.320103] kernel: RDX: 0000557d99c01290 RSI: 0000000000000000 RDI: 0000000000000001
  [    2.339938] kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000557d99c01290
  [    2.339941] kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
  [    2.339942] kernel: R13: 0000000000000001 R14: 0000557d99c01290 R15: 0000000000000001
  [    2.339947] kernel:  </TASK>

  We never see any more warnings or context for the tracebacks; they
  just keep coming over and over, filling up the logs.  This is with a
  Jammy x86_64 system running 5.15.0-30-generic.

  Unfortunately I don't know exactly what is behind it on the cloud
  side.

  There seem to be several bugs that are similar but not exactly the
  same:

  https://bugzilla.redhat.com/show_bug.cgi?id=1808996
  https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921880

  The gist seems to be that 0x4 refers to the SSBD mitigation bit, and
  something about the combination of certain qemu versions and a Jammy
  guest kernel makes the guest unhappy.  I will attach cpuid info for
  the guest.
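
  For reference, as I read the kernel's arch/x86/include/asm/msr-index.h,
  the relevant definitions are roughly (paraphrased, not verbatim):

      /* MSR 0x48 is IA32_SPEC_CTRL; bit 2 (0x4) is the SSBD control */
      #define MSR_IA32_SPEC_CTRL  0x00000048
      #define SPEC_CTRL_IBRS      (1UL << 0)
      #define SPEC_CTRL_STIBP     (1UL << 1)
      #define SPEC_CTRL_SSBD      (1UL << 2)   /* the 0x4 being written */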

  I installed the 5.17 kernel from the mainline repository, and the
  problem appears to go away.  I will attempt to bisect it to something
  more specific.


