On Wed, Jan 27, 2021 at 07:11:49AM +0100, alf wrote:
> Hello,
>
> while trying to upgrade one of our machines to 6.8 we experienced a
> repeatable crash while booting (bsd.rd + install went fine).
>
> The machine in question is a:
> ...
> hw.vendor=HP
> hw.product=ProLiant DL360 G7
> hw.serialno=CZ3451KJW6
> hw.uuid=36333337-3738-435a-3334-35314b4a5736
> hw.physmem=8562860032
> hw.usermem=8562847744
> hw.ncpufound=12
> hw.allowpowerdown=1
> hw.perfpolicy=manual
> hw.smt=0
> hw.ncpuonline=6
> ...
>
> Since this is a production machine we downgraded to 6.7 (upgrade from
> 6.6 which it was running before went flawlessly).
>
> Find below the dmesg of the 6.8 kernel, 6.8-current and finally the
> 6.7 kernel. For the 6.8* I also provided 'trace' and 'show registers'
> output.
>
> I hope this is enough info to get an idea of what was going on.
> I'll happily provide additional info if needed.
>
> Alf
>
> cpu0: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, 2667.08 MHz, 06-2c-02
> cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AES,NXE,PAGE1GB,RDTSCP,LONG,LAHF,PERF,ITSC,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,MELTDOWN
> initializing kernel modesetting (RV100 0x1002:0x515E 0x103C:0x31FB 0x02).
> NMI ... going to debugger
> Stopped at tsc_delay+0x63: lfence
> ddb{0}> trace
> tsc_delay(1) at tsc_delay+0x63
> r100_ring_test(ffff8000001a4000,ffff8000001a5858) at r100_ring_test+0x277
> r100_cp_init(ffff8000001a4000,100000) at r100_cp_init+0x5a1
> r100_startup(ffff8000001a4000) at r100_startup+0x535
> r100_init(ffff8000001a4000) at r100_init+0x4ac
> radeon_device_init(ffff8000001a4000,ffff800000196800,ffff800000196840,840001) at radeon_device_init+0x944
> radeondrm_attachhook(ffff8000001a4000) at radeondrm_attachhook+0x36
> config_process_deferred_mountroot() at config_process_deferred_mountroot+0x6b
> main(0) at main+0x723
> end trace frame: 0x0, count: -9

I don't understand why an lfence would cause an NMI.

Does it still occur with the diff below, which changes lfence;rdtsc to
rdtscp? This requires RDTSCP, which your machine has but bluhm's
machine does not.

Perhaps it is related to some kind of watchdog timer? Can you check
whether the iLO event log has any relevant information?

Index: sys/arch/amd64/include/cpufunc.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
retrieving revision 1.36
diff -u -p -r1.36 cpufunc.h
--- sys/arch/amd64/include/cpufunc.h 13 Sep 2020 11:53:16 -0000 1.36
+++ sys/arch/amd64/include/cpufunc.h 28 Jan 2021 00:47:16 -0000
@@ -307,7 +307,8 @@ rdtsc_lfence(void)
 {
 	uint32_t hi, lo;
-	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
+//	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
+	__asm volatile("rdtscp" : "=d" (hi), "=a" (lo) :: "ecx");
 	return (((uint64_t)hi << 32) | (uint64_t) lo);
 }
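
For anyone who wants to compare the two TSC reads outside the kernel,
here is a minimal userland sketch. It is not part of the diff above and
is not how the kernel chooses between them. It assumes an x86-64
machine and a GCC or Clang toolchain that provides <cpuid.h>; the names
tsc_read_lfence, tsc_read_rdtscp, cpu_has_rdtscp and tsc_read are made
up for illustration. RDTSCP is advertised in CPUID leaf 0x80000001, EDX
bit 27, so the sketch only uses rdtscp when that bit is set.

```c
#include <stdint.h>
#include <cpuid.h>	/* GCC/Clang CPUID helper, x86 only (assumption) */

/* Fenced read: lfence keeps rdtsc from running ahead of earlier instructions. */
static inline uint64_t
tsc_read_lfence(void)
{
	uint32_t hi, lo;

	__asm__ volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

/* rdtscp also loads IA32_TSC_AUX into ecx, hence the clobber. */
static inline uint64_t
tsc_read_rdtscp(void)
{
	uint32_t hi, lo;

	__asm__ volatile("rdtscp" : "=d" (hi), "=a" (lo) : : "ecx");
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

/* Returns 1 if CPUID.80000001H:EDX bit 27 (RDTSCP) is set, else 0. */
static inline int
cpu_has_rdtscp(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
		return 0;
	return (edx >> 27) & 1;
}

/* Hypothetical wrapper: use rdtscp where available, else lfence; rdtsc. */
static inline uint64_t
tsc_read(void)
{
	return cpu_has_rdtscp() ? tsc_read_rdtscp() : tsc_read_lfence();
}
```

Calling tsc_read() twice around a short loop and printing the
difference gives a rough cycle count for either variant. Executing
rdtscp on a CPU without the feature raises an invalid opcode exception,
which is why the sketch checks CPUID first.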