Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
Package: libc6
Version: 2.31-13+deb11u4
Severity: critical

Dear Maintainer,

After an upgrade to version +deb11u4 on my system running a Haswell (4th gen Intel Core) CPU, most programs, including bash and dpkg, immediately crash with SIGILL. The problem seems to be related to AVX2 and to changes made to some functions that use this instruction set.

I don't know much about Debian bug reporting, so please forgive any mistakes I've made.

The issue occurs on the host as well as in LXC and Docker containers. I have described it in more detail at this link:
https://github.com/debuerreotype/docker-debian-artifacts/issues/175
where I also linked a coredump from an example program.

Direct coredump link, just in case:
https://github.com/debuerreotype/docker-debian-artifacts/files/9569748/core.bash.10.2663c40e671041e6b40c882a70b83c3f.1480736.166318582400.zip

Log lines from the kernel:

kernel: [834669.721253] traps: dpkg[1455373] trap invalid opcode ip:7fa39701951d sp:7ffc4ad26e58 error:0 in libc-2.31.so[7fa396edd000+15a000]
kernel: [834669.732958] traps: dpkg[1455374] trap invalid opcode ip:7f529ca9551d sp:7fffb6f0a238 error:0 in libc-2.31.so[7f529c959000+15a000]
kernel: [834669.840128] traps: dpkg[1455375] trap invalid opcode ip:7f1874cc951d sp:7fffc2c2f5d8 error:0 in libc-2.31.so[7f1874b8d000+15a000]
kernel: [834669.907918] traps: dpkg[1455378] trap invalid opcode ip:7f3b4f8d851d sp:7fff3ec970f8 error:0 in libc-2.31.so[7f3b4f79c000+15a000]
kernel: [834712.152139] traps: passwd[1455693] trap invalid opcode ip:7fefee4b52b7 sp:7cb506b8 error:0 in libc-2.31.so[7fefee37d000+15a000]

I'm not sure what exactly is causing the issue, but if these changes aren't pulled, potentially anyone with this or a similar CPU will upgrade and end up with a bricked system.

I will try the `clearcpuid=293` kernel flag myself, but consider how many distros depend on Debian, live CDs etc., with people unable to figure out why their system became useless, unable to trace the source, and blaming it on Linux...

I'm filing this bug report from my host system, which I downgraded to the previous libc6 version.

   * What led up to the situation?
apt upgrade...
   * What exactly did you do (or not do) that was effective (or ineffective)?
downgrade to +deb11u3
   * What was the outcome of this action?
everything works on the older version
   * What outcome did you expect instead?

-- System Information:
Debian Release: 11.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.15.39-1-pve (SMP w/4 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libc6 depends on:
ii  libcrypt1  1:4.4.18-4
ii  libgcc-s1  10.2.1-6

Versions of packages libc6 recommends:
ii  libidn2-0  2.3.0-5
pn  libnss-nis
pn  libnss-nisplus

Versions of packages libc6 suggests:
ii  debconf [debconf-2.0]  1.5.77
pn  glibc-doc
ii  libc-l10n              2.31-13+deb11u3
ii  locales                2.31-13+deb11u3

-- debconf information:
  glibc/disable-screensaver:
  glibc/restart-services:
  glibc/kernel-not-supported:
  glibc/kernel-too-old:
  libraries/restart-without-asking: false
  glibc/restart-failed:
  glibc/upgrade: true
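The kernel "traps:" lines in the report carry enough information to tell whether different processes fault at the same place: subtracting the bracketed mapping start from `ip` gives an offset that does not depend on where the library was loaded. The following is a minimal sketch of that arithmetic; the function name `trap_offsets` and the regular expression are illustrative, and the offset is relative to the start of the mapping named in the log line, which is not necessarily the same as an offset into the file on disk.

```python
import re

# Matches the "traps:" lines quoted in the report, for example:
# traps: dpkg[1455373] trap invalid opcode ip:7fa39701951d sp:... error:0
#   in libc-2.31.so[7fa396edd000+15a000]
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] trap invalid opcode "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:\d+ "
    r"in (?P<obj>[^\[\s]+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offsets(log_text):
    """Return (command, object, offset) for every invalid-opcode trap line.

    The offset is ip minus the start of the mapping printed in brackets.
    """
    results = []
    for m in TRAP_RE.finditer(log_text):
        offset = int(m["ip"], 16) - int(m["start"], 16)
        results.append((m["comm"], m["obj"], hex(offset)))
    return results


if __name__ == "__main__":
    import sys

    # Usage: journalctl -k | python3 trap_offsets.py
    for comm, obj, offset in trap_offsets(sys.stdin.read()):
        print(f"{comm}: {obj} + {offset}")
```

Applied to the five lines above, the four dpkg traps all resolve to offset 0x13c51d and the passwd trap to 0x1382b7, which suggests at least two distinct faulting instructions inside libc-2.31.so.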
Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
> The first thing would be to provide the output of /proc/cpuinfo

Pasting below (please **NOTE** that "avx2" would normally be there, but is currently missing due to this kernel option `clearcpuid=293` with which I booted the PC now -- I can **100%** confirm "avx2" was there before, but don't want to reboot for now to remove this kernel flag):

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
stepping : 3
microcode : 0x12
cpu MHz : 2394.664
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms invpcid xsaveopt dtherm arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4789.10
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
stepping : 3
microcode : 0x12
cpu MHz : 2400.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms invpcid xsaveopt dtherm arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4789.10
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
stepping : 3
microcode : 0x12
cpu MHz : 2400.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms invpcid xsaveopt dtherm arat pln pts
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4789.10
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
stepping : 3
microcode : 0x12
cpu MHz : 2400.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid lev
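For readers wondering why `clearcpuid=293` hides "avx2": the kernel numbers its x86 feature bits as word * 32 + bit, and 293 = 9 * 32 + 5. The sketch below only does that arithmetic. The small name table is an assumption taken from the word 9 entries (CPUID leaf 7, EBX) in the kernel's cpufeatures.h as I understand them; the kernel documentation warns that these numbers are not guaranteed to be stable across kernel versions, so treat the mapping as illustrative.

```python
# Assumption: a hand-copied subset of word 9 (CPUID leaf 7, subleaf 0, EBX)
# from the kernel's cpufeatures.h. Not an exhaustive or authoritative table.
WORD9_NAMES = {3: "bmi1", 5: "avx2", 8: "bmi2"}


def decode_clearcpuid(value):
    """Split a clearcpuid= feature number into (word, bit)."""
    return divmod(value, 32)


def describe(value):
    """Return (word, bit, name); name is None when the bit is not in the table."""
    word, bit = decode_clearcpuid(value)
    name = WORD9_NAMES.get(bit) if word == 9 else None
    return word, bit, name


if __name__ == "__main__":
    word, bit, name = describe(293)
    print(f"clearcpuid=293 -> word {word}, bit {bit} ({name})")
```

With this numbering, 293 lands on word 9, bit 5, which is consistent with "avx2" disappearing from the flags line above.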
Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
Hello, sorry for the delayed response. I've managed to collect and analyze a few coredump files with valid symbols (I installed libc6-dbg and dpkg-dev, pointed gdb at Debian's debuginfod server, and used apt-get source to get the sources for libc6).

It seems there are at least 3-4 distinct places where it crashes: two in memchr-avx2.S, one in strlen-avx2.S, and potentially one in syscall-template.S, although that last one may be just some kind of kill signal redirect. Pasting all below:

Core was generated by `apt'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:400
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
400     ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or directory.
(gdb)

###

Core was generated by `dpkg'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:514
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S.
514     ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb)

###

Core was generated by `/usr/bin/perl /usr/sbin/adduser'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135     ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or directory.
(gdb)

###

Core was generated by `useradd'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135     ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or directory.
(gdb)

###

Core was generated by `passwd'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:514
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S.
514     ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb)

###

Core was generated by `bash'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x7f2006faf087 in kill () at ../sysdeps/unix/syscall-template.S:120
Download failed: Invalid argument.  Continuing without source file ./signal/../sysdeps/unix/syscall-template.S.
120     ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb)

###

Core was generated by `su'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __memchr_avx2 () at ../sysdeps/x86_64/multiarch/memchr-avx2.S:135
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/memchr-avx2.S.
135     ../sysdeps/x86_64/multiarch/memchr-avx2.S: No such file or directory.
(gdb)

###

In the case of this SIGILL there seems to be no additional stack trace. The path containing ".." also seems to make the source code resolution fail, but the debug symbols still show the source file and line, so it should hopefully help to see what exactly fails.

I have yet to try rebooting with the microcode package installed. I'll check it soon and report whether it helps, but even if it does, someone without a bootable system won't get a chance to install it. I'm also a bit curious how these changes triggered this, given that it never occurred in all these years.
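When many coredumps are involved, it can help to tally the top frame of each backtrace instead of reading them one by one. The following is a minimal sketch that counts crashes per (function, source file, line) from pasted gdb output like the above; the name `crash_sites` and the regular expression are illustrative and only cover the `#0 ... at file:line` shape seen in this message.

```python
import re
from collections import Counter

# Top frame as printed by gdb, with or without a leading "0x... in ".
FRAME_RE = re.compile(
    r"#0\s+(?:0x[0-9a-f]+ in )?(?P<func>\w+) \(\) at (?P<path>\S+):(?P<line>\d+)"
)


def crash_sites(gdb_text):
    """Count crashes per (function, source file basename, line number)."""
    sites = Counter()
    for m in FRAME_RE.finditer(gdb_text):
        filename = m["path"].rsplit("/", 1)[-1]
        sites[(m["func"], filename, int(m["line"]))] += 1
    return sites


if __name__ == "__main__":
    import sys

    # Usage: python3 crash_sites.py < gdb-output.txt
    for (func, filename, line), count in crash_sites(sys.stdin.read()).most_common():
        print(f"{count}x {func} at {filename}:{line}")
```

On the seven backtraces above this gives memchr-avx2.S:135 three times (adduser, useradd, su), strlen-avx2.S:514 twice (dpkg, passwd), and memchr-avx2.S:400 (apt) and syscall-template.S:120 (bash) once each.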
Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
I can confirm that updating the microcode by installing the intel-microcode package and rebooting does indeed mitigate this issue. An LXC container that was previously bricked by the update now starts and seems to behave fully normally.

[0.00] microcode: microcode updated early to revision 0x28, date = 2019-11-12

But since the microcode update has to be loaded on every boot (unless, presumably, I updated the UEFI firmware), it technically solves the problem only on this installation. The concern probably remains that people with the same processor family and outdated microcode will run into this issue and have no idea why Linux no longer wants to boot... (Is there even an easy way to load updated microcode while installing Debian? I would bet its ISO does not include it due to non-free constraints.)
Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
> Now that we understood the bug, I actually find strange that the microcode update is fixing this, it looks like that the BMI2 instructions support has been added in a microcode update. Would it be possible to give the output of /proc/cpuinfo with and without the microcode update applied?

The /proc/cpuinfo without the microcode update is already attached above in the bug report; the new one after the update is as follows:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
stepping : 3
microcode : 0x28
cpu MHz : 2400.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 smep bmi2 erms invpcid xsaveopt dtherm arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4788.76
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Please note that "avx2" is once again missing because of the kernel masking flag from before, which I once again forgot to remove before rebooting. Sorry for any confusion this might cause -- that flag would normally be there.

Running a quick diff against the old /proc/cpuinfo output reveals that "flags" now has the following new entries:

tsc_deadline_timer ssbd ibrs ibpb stibp bmi1 bmi2 md_clear flush_l1d

> it looks like that the BMI2
> instructions support has been added in a microcode update

So that does indeed appear to be the case.
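The flag comparison described above is easy to reproduce on two saved copies of /proc/cpuinfo. The following is a minimal sketch, assuming the files are in the kernel's normal one-field-per-line layout (not the flattened form pasted into a mail); the function names are illustrative. The `avx2_without_bmi2` helper encodes the situation this exchange points at, a CPU that advertises "avx2" but not "bmi2", and is only a heuristic, not a confirmed diagnosis.

```python
def parse_flags(cpuinfo_text):
    """Return the feature flags of the first processor entry as a set.

    Only the line whose key is exactly "flags" is used, so the separate
    "vmx flags" line is ignored.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def added_flags(before_text, after_text):
    """Flags present after the change but not before, sorted by name."""
    return sorted(parse_flags(after_text) - parse_flags(before_text))


def avx2_without_bmi2(flags):
    """True when "avx2" is advertised but "bmi2" is not."""
    return "avx2" in flags and "bmi2" not in flags


if __name__ == "__main__":
    import sys

    # Usage: python3 flagdiff.py cpuinfo.before cpuinfo.after
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        before, after = f_before.read(), f_after.read()
    print("added:", " ".join(added_flags(before, after)))
    print("avx2 without bmi2 (before):", avx2_without_bmi2(parse_flags(before)))
```

Note that on a machine booted with `clearcpuid=293` the helper returns False simply because "avx2" is masked, so it should be run against an unmasked /proc/cpuinfo.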
Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system
> Is there an easy way to unbrick a system affected by the issue? such as a kernel-line option or a configuration file in /etc? I don't see how I can set a GLIBC_TUNABLES environment variable for the whole system.

During my testing I tried to set such an option globally, but failed, though maybe some method for this exists. As it stands I only see two ways of unbricking a system, both assuming you can access the partition externally from some bootable system:

1. Downgrade the affected libc6 package to a version before the one causing issues (either chroot and dpkg, or just extract and physically replace the files). After booting, run apt-mark hold libc6 to prevent the faulty update from being installed until the issue is fixed.
2. Or install the intel-microcode package, assuming the microcode update adds the missing instructions in this particular case, which basically fixes the issue by coincidence (the updated CPU microcode is loaded on every bootup).
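As a side note on GLIBC_TUNABLES: it can at least be set per process, which may be enough to test a single program from an environment that still works. The sketch below builds such an environment and runs one command with it. The tunable string `glibc.cpu.hwcaps=-AVX2_Usable` is an assumption about the name glibc 2.31 accepts (other glibc versions may spell it differently), the helper names are hypothetical, and none of this answers the system-wide question asked above.

```python
import os
import subprocess

# Assumption: "AVX2_Usable" is the hwcaps name understood by glibc 2.31.
# Check the glibc documentation for the version in use before relying on it.
AVX2_MASK = "glibc.cpu.hwcaps=-AVX2_Usable"


def env_without_avx2(base_env=None):
    """Return a copy of the environment with the AVX2 mask appended.

    Existing GLIBC_TUNABLES entries are kept; tunables are colon-separated.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("GLIBC_TUNABLES")
    env["GLIBC_TUNABLES"] = f"{existing}:{AVX2_MASK}" if existing else AVX2_MASK
    return env


def run_without_avx2(argv):
    """Run one command with the mask applied to that process only."""
    return subprocess.run(argv, env=env_without_avx2(), check=False)


if __name__ == "__main__":
    import sys

    # Usage: python3 run_without_avx2.py dpkg --version
    sys.exit(run_without_avx2(sys.argv[1:]).returncode)
```

This only helps where the interpreter itself can still start, for example inside a chroot entered from a rescue system; the two options listed above remain the practical ways to recover a machine.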