On Thu, 07 Jul 2022 14:02:35 +0900 (JST)
YASUOKA Masahiko <yasu...@openbsd.org> wrote:
> Hello Scott,
> 
> With the patch, my machine on ESXi it doesn't show any extra message.
> 
> *Without* the patch, the machine shows
> 
>  % grep 'TSC.*skew' dmesg.current-tsc-debug 
>  cpu1: disabling user TSC (skew=-2603)
>  cpu2: disabling user TSC (skew=-2959)
>  cpu3: disabling user TSC (skew=-3784)
>  %
> 
> and monotonic time goes backward (the test in
> https://marc.info/?l=openbsd-tech&m=161699406119704&w=2 failed)

Oops, the link was wrong..

https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

is the correct one.  Also, with the patch, the test is passed.

> dmesg of with the patch:
> 
> OpenBSD 7.1-current (GENERIC.MP) #24: Thu Jul  7 10:09:32 JST 2022
>     
> yasu...@yasuoka-ob-c.tokyo.iiji.jp:/source/yasuoka/head/git/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 2113290240 (2015MB)
> avail mem = 2031927296 (1937MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xff7401f (161 entries)
> bios0: vendor VMware, Inc. version "VMW71.00V.16460286.B64.2006250725" date 
> 06/25/2020
> bios0: VMware, Inc. VMware7,1
> acpi0 at bios0: ACPI 4.0
> acpi0: sleep states S0 S1 S4 S5
> acpi0: tables DSDT SRAT FACP APIC MCFG HPET WAET WSMT
> acpi0: wakeup devices PCI0(S3) USB_(S1) P2P0(S3) S1F0(S3) S2F0(S3) S8F0(S3) 
> S16F(S3) S17F(S3) S18F(S3) S22F(S3) S23F(S3) S24F(S3) S25F(S3) PE40(S3) 
> S1F0(S3) PE41(S3) [...]
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 2393.81 MHz, 06-55-07
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 13MB 64b/line 11-way L3 cache
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
> cpu0: apic clock running at 65MHz
> cpu1 at mainbus0: apid 1 (application processor)
> cpu1: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 2393.61 MHz, 06-55-07
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu1: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 13MB 64b/line 11-way L3 cache
> cpu1: smt 0, core 1, package 0
> cpu2 at mainbus0: apid 2 (application processor)
> cpu2: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 2393.62 MHz, 06-55-07
> cpu2: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu2: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 13MB 64b/line 11-way L3 cache
> cpu2: smt 0, core 2, package 0
> cpu3 at mainbus0: apid 3 (application processor)
> cpu3: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz, 2393.62 MHz, 06-55-07
> cpu3: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,VMX,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu3: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 13MB 64b/line 11-way L3 cache
> cpu3: smt 0, core 3, package 0
> ioapic0 at mainbus0: apid 4 pa 0xfec00000, version 20, 24 pins, remapped
> acpimcfg0 at acpi0
> acpimcfg0: addr 0xe0000000, bus 0-127
> acpihpet0 at acpi0: 14318179 Hz
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpipci0 at acpi0 PCI0: 0x00000000 0x00000011 0x00000001
> acpicmos0 at acpi0
> "PNP0A05" at acpi0 not configured
> com0 at acpi0 COMA addr 0x3f8/0x8 irq 4: ns16550a, 16 byte fifo
> com0: console
> acpiac0 at acpi0: AC unit online
> acpicpu0 at acpi0: C1(@1 halt!)
> acpicpu1 at acpi0: C1(@1 halt!)
> acpicpu2 at acpi0: C1(@1 halt!)
> acpicpu3 at acpi0: C1(@1 halt!)
> pvbus0 at mainbus0: VMware
> vmt0 at pvbus0
> pci0 at mainbus0 bus 0
> 0:16:0: rom address conflict 0xffffc000/0x4000
> pchb0 at pci0 dev 0 function 0 "Intel 82443BX AGP" rev 0x01
> ppb0 at pci0 dev 1 function 0 "Intel 82443BX AGP" rev 0x01
> pci1 at ppb0 bus 1
> pcib0 at pci0 dev 7 function 0 "Intel 82371AB PIIX4 ISA" rev 0x08
> pciide0 at pci0 dev 7 function 1 "Intel 82371AB IDE" rev 0x01: DMA, channel 0 
> configured to compatibility, channel 1 configured to compatibility
> pciide0: channel 0 disabled (no drives)
> atapiscsi0 at pciide0 channel 1 drive 0
> scsibus1 at atapiscsi0: 2 targets
> cd0 at scsibus1 targ 0 lun 0: <NECVMWar, VMware IDE CDR10, 1.00> removable
> cd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2
> piixpm0 at pci0 dev 7 function 3 "Intel 82371AB Power" rev 0x08: SMBus 
> disabled
> "VMware VMCI" rev 0x10 at pci0 dev 7 function 7 not configured
> "VMware SVGA II" rev 0x00 at pci0 dev 15 function 0 not configured
> mpi0 at pci0 dev 16 function 0 "Symbios Logic 53c1030" rev 0x01: apic 4 int 17
> mpi0: 0, firmware 1.3.41.32
> scsibus2 at mpi0: 16 targets, initiator 7
> sd0 at scsibus2 targ 0 lun 0: <VMware, Virtual disk, 2.0>
> sd0: 16384MB, 512 bytes/sector, 33554432 sectors
> sd1 at scsibus2 targ 1 lun 0: <VMware, Virtual disk, 2.0>
> sd1: 81920MB, 512 bytes/sector, 167772160 sectors
> mpi0: target 0 Sync at 160MHz width 16bit offset 127 QAS 1 DT 1 IU 1
> mpi0: target 1 Sync at 160MHz width 16bit offset 127 QAS 1 DT 1 IU 1
> ppb1 at pci0 dev 17 function 0 "VMware PCI" rev 0x02
> pci2 at ppb1 bus 2
> ppb2 at pci0 dev 21 function 0 "VMware PCIE" rev 0x01: msi
> pci3 at ppb2 bus 3
> vmx0 at pci3 dev 0 function 0 "VMware VMXNET3" rev 0x01: msix, 4 queues, 
> address 00:0c:29:24:3d:0a
> ppb3 at pci0 dev 21 function 1 "VMware PCIE" rev 0x01: msi
> pci4 at ppb3 bus 4
> ppb4 at pci0 dev 21 function 2 "VMware PCIE" rev 0x01: msi
> pci5 at ppb4 bus 5
> ppb5 at pci0 dev 21 function 3 "VMware PCIE" rev 0x01: msi
> pci6 at ppb5 bus 6
> ppb6 at pci0 dev 21 function 4 "VMware PCIE" rev 0x01: msi
> pci7 at ppb6 bus 7
> ppb7 at pci0 dev 21 function 5 "VMware PCIE" rev 0x01: msi
> pci8 at ppb7 bus 8
> ppb8 at pci0 dev 21 function 6 "VMware PCIE" rev 0x01: msi
> pci9 at ppb8 bus 9
> ppb9 at pci0 dev 21 function 7 "VMware PCIE" rev 0x01: msi
> pci10 at ppb9 bus 10
> ppb10 at pci0 dev 22 function 0 "VMware PCIE" rev 0x01: msi
> pci11 at ppb10 bus 11
> ppb11 at pci0 dev 22 function 1 "VMware PCIE" rev 0x01: msi
> pci12 at ppb11 bus 12
> ppb12 at pci0 dev 22 function 2 "VMware PCIE" rev 0x01: msi
> pci13 at ppb12 bus 13
> ppb13 at pci0 dev 22 function 3 "VMware PCIE" rev 0x01: msi
> pci14 at ppb13 bus 14
> ppb14 at pci0 dev 22 function 4 "VMware PCIE" rev 0x01: msi
> pci15 at ppb14 bus 15
> ppb15 at pci0 dev 22 function 5 "VMware PCIE" rev 0x01: msi
> pci16 at ppb15 bus 16
> ppb16 at pci0 dev 22 function 6 "VMware PCIE" rev 0x01: msi
> pci17 at ppb16 bus 17
> ppb17 at pci0 dev 22 function 7 "VMware PCIE" rev 0x01: msi
> pci18 at ppb17 bus 18
> ppb18 at pci0 dev 23 function 0 "VMware PCIE" rev 0x01: msi
> pci19 at ppb18 bus 19
> ppb19 at pci0 dev 23 function 1 "VMware PCIE" rev 0x01: msi
> pci20 at ppb19 bus 20
> ppb20 at pci0 dev 23 function 2 "VMware PCIE" rev 0x01: msi
> pci21 at ppb20 bus 21
> ppb21 at pci0 dev 23 function 3 "VMware PCIE" rev 0x01: msi
> pci22 at ppb21 bus 22
> ppb22 at pci0 dev 23 function 4 "VMware PCIE" rev 0x01: msi
> pci23 at ppb22 bus 23
> ppb23 at pci0 dev 23 function 5 "VMware PCIE" rev 0x01: msi
> pci24 at ppb23 bus 24
> ppb24 at pci0 dev 23 function 6 "VMware PCIE" rev 0x01: msi
> pci25 at ppb24 bus 25
> ppb25 at pci0 dev 23 function 7 "VMware PCIE" rev 0x01: msi
> pci26 at ppb25 bus 26
> ppb26 at pci0 dev 24 function 0 "VMware PCIE" rev 0x01: msi
> pci27 at ppb26 bus 27
> ppb27 at pci0 dev 24 function 1 "VMware PCIE" rev 0x01: msi
> pci28 at ppb27 bus 28
> ppb28 at pci0 dev 24 function 2 "VMware PCIE" rev 0x01: msi
> pci29 at ppb28 bus 29
> ppb29 at pci0 dev 24 function 3 "VMware PCIE" rev 0x01: msi
> pci30 at ppb29 bus 30
> ppb30 at pci0 dev 24 function 4 "VMware PCIE" rev 0x01: msi
> pci31 at ppb30 bus 31
> ppb31 at pci0 dev 24 function 5 "VMware PCIE" rev 0x01: msi
> pci32 at ppb31 bus 32
> ppb32 at pci0 dev 24 function 6 "VMware PCIE" rev 0x01: msi
> pci33 at ppb32 bus 33
> ppb33 at pci0 dev 24 function 7 "VMware PCIE" rev 0x01: msi
> pci34 at ppb33 bus 34
> isa0 at pcib0
> isadma0 at isa0
> pckbc0 at isa0 port 0x60/5 irq 1 irq 12
> pckbd0 at pckbc0 (kbd slot)
> wskbd0 at pckbd0 mux 1
> pms0 at pckbc0 (aux slot)
> wsmouse0 at pms0 mux 0
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> vmm0 at mainbus0: VMX/EPT
> efifb0 at mainbus0: 1152x864, 32bpp
> wsdisplay0 at efifb0 mux 1
> wskbd0: connecting to wsdisplay0
> wsdisplay0: screen 0-5 added (std, vt100 emulation)
> vscsi0 at root
> scsibus3 at vscsi0: 256 targets
> softraid0 at root
> scsibus4 at softraid0: 256 targets
> root on sd0a (a5bd0d220920df21.a) swap on sd0b dump on sd0b
> 
> 
> On Mon, 4 Jul 2022 21:06:55 -0500
> Scott Cheloha <scottchel...@gmail.com> wrote:
>> Hi,
>> 
>> Once again, I am trying to change our approach to TSC sync testing to
>> eliminate false positive results.  Instead of trying to repair the TSC
>> by measuring skew, we just spin in a lockless loop looking for skew
>> and mark the TSC as broken if we detect any.
>> 
>> This is motivated in part by some multisocket machines that do not use
>> the TSC as a timecounter because the current sync test confuses NUMA
>> latency for TSC skew.
>> 
>> The only difference between this version and the prior version (v2) is
>> that we check whether we have the IA32_TSC_ADJUST register by hand in
>> tsc_reset_adjust().  If someone wants to help me rearrange cpu_hatch()
>> so we do CPU identification before TSC sync testing we can remove the
>> workaround later.
>> 
>> If you have the IA32_TSC_ADJUST register and it is non-zero going into
>> the test, you will see something on the console like this:
>> 
>>      tsc: cpu5: IA32_TSC_ADJUST: -150 -> 0
>> 
>> This does *not* mean you failed the test.  It just means you probably
>> have a bug in your BIOS (or some other firmware) and you should report
>> it to your vendor.
>> 
>> If you fail the test you will see something like this:
>> 
>>      tsc: cpu0/cpu2: sync test round 1/2 failed
>>      tsc: cpu0/cpu2: cpu2: 13043 lags 438 cycles
>> 
>> A printout like this would mean that the sync test for cpu2 failed.
>> In particular, cpu2's TSC trails cpu0's TSC by at least 438 cycles.
>> If this happens for *any* CPU we mark the TSC timecounter as
>> defective.
>> 
>> --
>> 
>> Please test!  Send your dmesg, pass or fail.
>> 
>> I am especially interested in:
>> 
>> 1. A test from dv@.  Your dual-socket machine has the IA32_TSC_ADJUST
>>    register but it failed the test running patch v2.  Maybe it will pass
>>    with this version?
>> 
>> 2. Other multisocket machines.
>> 
>> 3. There were reports of TSC issues with OpenBSD VMs running on ESXi.
>>    What happens when you run with this patch?
>> 
>> 4. OpenBSD VMs on other hypervisors.
>> 
>> 5. Non-Lenovo machines, non-Intel machines.
>> 
>> -Scott
>> 
>> Index: amd64/tsc.c
>> ===================================================================
>> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
>> retrieving revision 1.24
>> diff -u -p -r1.24 tsc.c
>> --- amd64/tsc.c      31 Aug 2021 15:11:54 -0000      1.24
>> +++ amd64/tsc.c      5 Jul 2022 01:52:10 -0000
>> @@ -36,13 +36,6 @@ int               tsc_recalibrate;
>>  uint64_t    tsc_frequency;
>>  int         tsc_is_invariant;
>>  
>> -#define     TSC_DRIFT_MAX                   250
>> -#define TSC_SKEW_MAX                        100
>> -int64_t     tsc_drift_observed;
>> -
>> -volatile int64_t    tsc_sync_val;
>> -volatile struct cpu_info    *tsc_sync_cpu;
>> -
>>  u_int               tsc_get_timecount(struct timecounter *tc);
>>  void                tsc_delay(int usecs);
>>  
>> @@ -236,22 +229,12 @@ cpu_recalibrate_tsc(struct timecounter *
>>  u_int
>>  tsc_get_timecount(struct timecounter *tc)
>>  {
>> -    return rdtsc_lfence() + curcpu()->ci_tsc_skew;
>> +    return rdtsc_lfence();
>>  }
>>  
>>  void
>>  tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
>>  {
>> -#ifdef TSC_DEBUG
>> -    printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
>> -        (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
>> -#endif
>> -    if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
>> -            printf("%s: disabling user TSC (skew=%lld)\n",
>> -                ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
>> -            tsc_timecounter.tc_user = 0;
>> -    }
>> -
>>      if (!(ci->ci_flags & CPUF_PRIMARY) ||
>>          !(ci->ci_flags & CPUF_CONST_TSC) ||
>>          !(ci->ci_flags & CPUF_INVAR_TSC))
>> @@ -268,111 +251,267 @@ tsc_timecounter_init(struct cpu_info *ci
>>              calibrate_tsc_freq();
>>      }
>>  
>> -    if (tsc_drift_observed > TSC_DRIFT_MAX) {
>> -            printf("ERROR: %lld cycle TSC drift observed\n",
>> -                (long long)tsc_drift_observed);
>> -            tsc_timecounter.tc_quality = -1000;
>> -            tsc_timecounter.tc_user = 0;
>> -            tsc_is_invariant = 0;
>> -    }
>> -
>>      tc_init(&tsc_timecounter);
>>  }
>>  
>> -/*
>> - * Record drift (in clock cycles).  Called during AP startup.
>> - */
>>  void
>> -tsc_sync_drift(int64_t drift)
>> +tsc_delay(int usecs)
>>  {
>> -    if (drift < 0)
>> -            drift = -drift;
>> -    if (drift > tsc_drift_observed)
>> -            tsc_drift_observed = drift;
>> +    uint64_t interval, start;
>> +
>> +    interval = (uint64_t)usecs * tsc_frequency / 1000000;
>> +    start = rdtsc_lfence();
>> +    while (rdtsc_lfence() - start < interval)
>> +            CPU_BUSY_CYCLE();
>>  }
>>  
>> +#ifdef MULTIPROCESSOR
>> +
>> +#define TSC_DEBUG 1
>> +
>>  /*
>> - * Called during startup of APs, by the boot processor.  Interrupts
>> - * are disabled on entry.
>> + * Protections for global variables in this code:
>> + *
>> + *  a       Modified atomically
>> + *  b       Protected by a barrier
>> + *  p       Only modified by the primary CPU
>>   */
>> +
>> +#define TSC_TEST_MS         1       /* Test round duration */
>> +#define TSC_TEST_ROUNDS             2       /* Number of test rounds */
>> +
>> +struct tsc_test_status {
>> +    volatile uint64_t val __aligned(64);    /* [b] CPU's latest TSC value */
>> +    uint64_t lag_count;                     /* [b] # of lags seen by CPU */
>> +    uint64_t lag_max;                       /* [b] Biggest lag seen */
>> +    int64_t adj_val;                        /* [b] initial TSC_ADJUST val */
>> +};
>> +struct tsc_test_status tsc_ap_status;       /* [b] Test results from AP */
>> +struct tsc_test_status tsc_bp_status;       /* [p] Test results from BP */
>> +uint64_t tsc_test_cycles;           /* [p] TSC cycles per test round */
>> +const char *tsc_ap_name;            /* [b] Name of AP */
>> +volatile u_int tsc_egress_barrier;  /* [a] Test end barrier */
>> +volatile u_int tsc_ingress_barrier; /* [a] Test start barrier */
>> +volatile u_int tsc_test_rounds;             /* [p] Remaining test rounds */
>> +int tsc_is_synchronized = 1;                /* [p] TSC sync'd across all 
>> CPUs? */
>> +
>> +void tsc_report_test_results(void);
>> +void tsc_reset_adjust(struct tsc_test_status *);
>> +void tsc_test_ap(void);
>> +void tsc_test_bp(void);
>> +
>>  void
>> -tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
>> +tsc_test_sync_bp(struct cpu_info *ci)
>>  {
>> -    uint64_t bptsc;
>> +    /* TSC must be constant and invariant to use it as a timecounter. */
>> +#ifndef TSC_DEBUG
>> +    if (!tsc_is_invariant)
>> +            return;
>> +#endif
>>  
>> -    if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
>> -            panic("tsc_sync_bp: 1");
>> +    /* Reset IA32_TSC_ADJUST if it exists. */
>> +    tsc_reset_adjust(&tsc_bp_status);
>>  
>> -    /* Flag it and read our TSC. */
>> -    atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
>> -    bptsc = (rdtsc_lfence() >> 1);
>> +    /* Reset the test cycle limit and round count. */
>> +    tsc_test_cycles = TSC_TEST_MS * tsc_frequency / 1000;
>> +    tsc_test_rounds = TSC_TEST_ROUNDS;
>> +
>> +    do {
>> +            /*
>> +             * Pass through the ingress barrier, run the test,
>> +             * then wait for the AP to reach the egress barrier.
>> +             */
>> +            atomic_inc_int(&tsc_ingress_barrier);
>> +            while (tsc_ingress_barrier != 2)
>> +                    CPU_BUSY_CYCLE();
>> +            tsc_test_bp();
>> +            while (tsc_egress_barrier != 1)
>> +                    CPU_BUSY_CYCLE();
>> +
>> +            /*
>> +             * Report what happened.  Adjust the timecounter quality
>> +             * if we failed the test.
>> +             *
>> +             * XXX We need a tc_detach() function to cleanly
>> +             * disable the timecounter.  Lowering the quality
>> +             * like this is a nasty hack.
>> +             */
>> +            tsc_report_test_results();
>> +            if (tsc_ap_status.lag_count || tsc_bp_status.lag_count) {
>> +                    tsc_timecounter.tc_quality = -1000;
>> +                    tsc_is_synchronized = 0;
>> +                    tsc_test_rounds = 0;
>> +            } else
>> +                    tsc_test_rounds--;
>> +
>> +            /*
>> +             * Clean up for the next round.
>> +             *
>> +             * It is safe to reset the ingress barrier because
>> +             * at this point we know the AP has reached the egress
>> +             * barrier.
>> +             */
>> +            memset(&tsc_ap_status, 0, sizeof tsc_ap_status);
>> +            memset(&tsc_bp_status, 0, sizeof tsc_bp_status);
>> +            tsc_ingress_barrier = 0;
>> +            if (tsc_test_rounds == 0)
>> +                    tsc_ap_name = NULL;
>> +
>> +            /*
>> +             * Pass through the egress barrier and release the AP.
>> +             * The AP is responsible for resetting the egress barrier.
>> +             */
>> +            if (atomic_inc_int_nv(&tsc_egress_barrier) != 2)
>> +                    panic("%s: unexpected egress count", __func__);
>> +    } while (tsc_test_rounds > 0);
>> +}
>> +
>> +void
>> +tsc_test_sync_ap(struct cpu_info *ci)
>> +{
>> +#ifndef TSC_DEBUG
>> +    if (!tsc_is_invariant)
>> +            return;
>> +#endif
>>  
>> -    /* Wait for remote to complete, and read ours again. */
>> -    while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
>> -            membar_consumer();
>> -    bptsc += (rdtsc_lfence() >> 1);
>> +    /* The BP needs our name in order to report any problems. */
>> +    tsc_ap_name = ci->ci_dev->dv_xname;
>>  
>> -    /* Wait for the results to come in. */
>> -    while (tsc_sync_cpu == ci)
>> -            CPU_BUSY_CYCLE();
>> -    if (tsc_sync_cpu != NULL)
>> -            panic("tsc_sync_bp: 2");
>> +    tsc_reset_adjust(&tsc_ap_status);
>>  
>> -    *bptscp = bptsc;
>> -    *aptscp = tsc_sync_val;
>> +    /*
>> +     * The AP is only responsible for running the test and
>> +     * resetting the egress barrier.  The BP sets up and tears
>> +     * down everything else.
>> +     */
>> +    do {
>> +            atomic_inc_int(&tsc_ingress_barrier);
>> +            while (tsc_ingress_barrier != 2)
>> +                    CPU_BUSY_CYCLE();
>> +            tsc_test_ap();
>> +            atomic_inc_int(&tsc_egress_barrier);
>> +            while (atomic_cas_uint(&tsc_egress_barrier, 2, 0) != 2)
>> +                    CPU_BUSY_CYCLE();
>> +    } while (tsc_test_rounds > 0);
>>  }
>>  
>>  void
>> -tsc_sync_bp(struct cpu_info *ci)
>> +tsc_report_test_results(void)
>>  {
>> -    uint64_t bptsc, aptsc;
>> +    u_int round = TSC_TEST_ROUNDS - tsc_test_rounds + 1;
>>  
>> -    tsc_read_bp(ci, &bptsc, &aptsc); /* discarded - cache effects */
>> -    tsc_read_bp(ci, &bptsc, &aptsc);
>> -
>> -    /* Compute final value to adjust for skew. */
>> -    ci->ci_tsc_skew = bptsc - aptsc;
>> +    if (tsc_bp_status.adj_val != 0) {
>> +            printf("tsc: cpu0: IA32_TSC_ADJUST: %lld -> 0\n",
>> +                tsc_bp_status.adj_val);
>> +    }
>> +    if (tsc_ap_status.adj_val != 0) {
>> +            printf("tsc: %s: IA32_TSC_ADJUST: %lld -> 0\n",
>> +                tsc_ap_name, tsc_ap_status.adj_val);
>> +    }
>> +    if (tsc_ap_status.lag_count > 0 || tsc_bp_status.lag_count > 0) {
>> +            printf("tsc: cpu0/%s: sync test round %u/%u failed\n",
>> +                tsc_ap_name, round, TSC_TEST_ROUNDS);
>> +    }
>> +    if (tsc_bp_status.lag_count > 0) {
>> +            printf("tsc: cpu0/%s: cpu0: %llu lags %llu cycles\n",
>> +                tsc_ap_name, tsc_bp_status.lag_count,
>> +                tsc_bp_status.lag_max);
>> +    }
>> +    if (tsc_ap_status.lag_count > 0) {
>> +            printf("tsc: cpu0/%s: %s: %llu lags %llu cycles\n",
>> +                tsc_ap_name, tsc_ap_name, tsc_ap_status.lag_count,
>> +                tsc_ap_status.lag_max);
>> +    }
>>  }
>>  
>>  /*
>> - * Called during startup of AP, by the AP itself.  Interrupts are
>> - * disabled on entry.
>> + * Reset IA32_TSC_ADJUST if we have it.
>> + *
>> + * XXX We should rearrange cpu_hatch() so that the feature flags
>> + * are already set before we get here.  Do CPUID() and CPUID_LEAF()
>> + * here as a workaround until then.
>>   */
>>  void
>> -tsc_post_ap(struct cpu_info *ci)
>> +tsc_reset_adjust(struct tsc_test_status *tts)
>>  {
>> -    uint64_t tsc;
>> -
>> -    /* Wait for go-ahead from primary. */
>> -    while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
>> -            membar_consumer();
>> -    tsc = (rdtsc_lfence() >> 1);
>> +    uint32_t eax, ebx, ecx, edx;
>>  
>> -    /* Instruct primary to read its counter. */
>> -    atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
>> -    tsc += (rdtsc_lfence() >> 1);
>> -
>> -    /* Post result.  Ensure the whole value goes out atomically. */
>> -    (void)atomic_swap_64(&tsc_sync_val, tsc);
>> -
>> -    if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
>> -            panic("tsc_sync_ap");
>> +    CPUID(0, eax, ebx, ecx, edx);
>> +    if (eax >= 7) {
>> +            CPUID_LEAF(7, 0, eax, ebx, ecx, edx);
>> +            if (ISSET(ebx, SEFF0EBX_TSC_ADJUST)) {
>> +                    tts->adj_val = rdmsr(MSR_TSC_ADJUST);
>> +                    if (tts->adj_val != 0)
>> +                            wrmsr(MSR_TSC_ADJUST, 0);
>> +            }
>> +    }
>>  }
>>  
>>  void
>> -tsc_sync_ap(struct cpu_info *ci)
>> +tsc_test_ap(void)
>>  {
>> -    tsc_post_ap(ci);
>> -    tsc_post_ap(ci);
>> +    uint64_t ap_val, bp_val, end, lag;
>> +    u_int i = 0;
>> +
>> +    ap_val = rdtsc_lfence();
>> +    end = ap_val + tsc_test_cycles;
>> +    while (ap_val < end) {
>> +            /*
>> +             * The LFENCE ensures bp_val predates ap_val.  If ap_val
>> +             * is smaller than bp_val then the AP's TSC must lag that
>> +             * of the BP, i.e. the two cannot be synchronized.
>> +             */
>> +            bp_val = tsc_bp_status.val;
>> +            ap_val = rdtsc_lfence();
>> +            tsc_ap_status.val = ap_val;
>> +
>> +            /*
>> +             * Ensure the other CPU has a chance to run.  This is
>> +             * crucial if the other CPU is our SMT sibling.  SMT
>> +             * starvation can prevent this test from detecting
>> +             * relatively large lags.  Eight is an arbitrary value,
>> +             * but it seems to work in practice without impacting
>> +             * the sensitivity of the test for non-sibling threads.
>> +             */
>> +            if (++i % 8 == 0)
>> +                    CPU_BUSY_CYCLE();
>> +
>> +            /*
>> +             * Record the magnitude of the problem if our TSC lags
>> +             * the other.
>> +             */
>> +            if (__predict_false(ap_val < bp_val)) {
>> +                    tsc_ap_status.lag_count++;
>> +                    lag = bp_val - ap_val;
>> +                    if (tsc_ap_status.lag_max < lag)
>> +                            tsc_ap_status.lag_max = lag;
>> +            }
>> +    }
>>  }
>>  
>>  void
>> -tsc_delay(int usecs)
>> +tsc_test_bp(void)
>>  {
>> -    uint64_t interval, start;
>> +    uint64_t ap_val, bp_val, end, lag;
>> +    u_int i = 0;
>>  
>> -    interval = (uint64_t)usecs * tsc_frequency / 1000000;
>> -    start = rdtsc_lfence();
>> -    while (rdtsc_lfence() - start < interval)
>> -            CPU_BUSY_CYCLE();
>> +    bp_val = rdtsc_lfence();
>> +    end = bp_val + tsc_test_cycles;
>> +    while (bp_val < end) {
>> +            ap_val = tsc_ap_status.val;
>> +            bp_val = rdtsc_lfence();
>> +            tsc_bp_status.val = bp_val;
>> +
>> +            if (++i % 8 == 0)
>> +                    CPU_BUSY_CYCLE();
>> +
>> +            if (__predict_false(bp_val < ap_val)) {
>> +                    tsc_bp_status.lag_count++;
>> +                    lag = ap_val - bp_val;
>> +                    if (tsc_bp_status.lag_max < lag)
>> +                            tsc_bp_status.lag_max = lag;
>> +            }
>> +    }
>>  }
>> +
>> +#endif /* MULTIPROCESSOR */
>> Index: amd64/cpu.c
>> ===================================================================
>> RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
>> retrieving revision 1.156
>> diff -u -p -r1.156 cpu.c
>> --- amd64/cpu.c      26 Apr 2022 08:35:30 -0000      1.156
>> +++ amd64/cpu.c      5 Jul 2022 01:52:11 -0000
>> @@ -772,9 +772,9 @@ cpu_init(struct cpu_info *ci)
>>      lcr4(cr4 & ~CR4_PGE);
>>      lcr4(cr4);
>>  
>> -    /* Synchronize TSC */
>> +    /* Check if TSC is synchronized. */
>>      if (cold && !CPU_IS_PRIMARY(ci))
>> -          tsc_sync_ap(ci);
>> +          tsc_test_sync_ap(ci);
>>  #endif
>>  }
>>  
>> @@ -854,18 +854,14 @@ cpu_start_secondary(struct cpu_info *ci)
>>  #endif
>>      } else {
>>              /*
>> -             * Synchronize time stamp counters. Invalidate cache and
>> -             * synchronize twice (in tsc_sync_bp) to minimize possible
>> -             * cache effects. Disable interrupts to try and rule out any
>> -             * external interference.
>> +             * Test if TSCs are synchronized.  Invalidate cache to
>> +             * minimize possible cache effects.  Disable interrupts to
>> +             * rule out external interference.
>>               */
>>              s = intr_disable();
>>              wbinvd();
>> -            tsc_sync_bp(ci);
>> +            tsc_test_sync_bp(ci);
>>              intr_restore(s);
>> -#ifdef TSC_DEBUG
>> -            printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
>> -#endif
>>      }
>>  
>>      if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
>> @@ -890,7 +886,6 @@ void
>>  cpu_boot_secondary(struct cpu_info *ci)
>>  {
>>      int i;
>> -    int64_t drift;
>>      u_long s;
>>  
>>      atomic_setbits_int(&ci->ci_flags, CPUF_GO);
>> @@ -905,18 +900,11 @@ cpu_boot_secondary(struct cpu_info *ci)
>>              db_enter();
>>  #endif
>>      } else if (cold) {
>> -            /* Synchronize TSC again, check for drift. */
>> -            drift = ci->ci_tsc_skew;
>> +            /* Test if TSCs are synchronized again. */
>>              s = intr_disable();
>>              wbinvd();
>> -            tsc_sync_bp(ci);
>> +            tsc_test_sync_bp(ci);
>>              intr_restore(s);
>> -            drift -= ci->ci_tsc_skew;
>> -#ifdef TSC_DEBUG
>> -            printf("TSC skew=%lld drift=%lld\n",
>> -                (long long)ci->ci_tsc_skew, (long long)drift);
>> -#endif
>> -            tsc_sync_drift(drift);
>>      }
>>  }
>>  
>> @@ -942,13 +930,12 @@ cpu_hatch(void *v)
>>  #endif
>>  
>>      /*
>> -     * Synchronize the TSC for the first time. Note that interrupts are
>> -     * off at this point.
>> +     * Test if our TSC is synchronized for the first time.
>> +     * Note that interrupts are off at this point.
>>       */
>>      wbinvd();
>>      ci->ci_flags |= CPUF_PRESENT;
>> -    ci->ci_tsc_skew = 0;    /* reset on resume */
>> -    tsc_sync_ap(ci);
>> +    tsc_test_sync_ap(ci);
>>  
>>      lapic_enable();
>>      lapic_startclock();
>> Index: include/cpu.h
>> ===================================================================
>> RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
>> retrieving revision 1.144
>> diff -u -p -r1.144 cpu.h
>> --- include/cpu.h    28 Jun 2022 12:11:41 -0000      1.144
>> +++ include/cpu.h    5 Jul 2022 01:52:11 -0000
>> @@ -209,8 +209,6 @@ struct cpu_info {
>>      paddr_t         ci_vmxon_region_pa;
>>      struct vmxon_region *ci_vmxon_region;
>>  
>> -    int64_t         ci_tsc_skew;            /* counter skew vs cpu0 */
>> -
>>      char            ci_panicbuf[512];
>>  
>>      paddr_t         ci_vmcs_pa;
>> @@ -230,7 +228,6 @@ struct cpu_info {
>>  #define CPUF_INVAR_TSC      0x0100          /* CPU has invariant TSC */
>>  #define CPUF_USERXSTATE     0x0200          /* CPU has curproc's xsave 
>> state */
>>  
>> -#define CPUF_SYNCTSC        0x0800          /* Synchronize TSC */
>>  #define CPUF_PRESENT        0x1000          /* CPU is present */
>>  #define CPUF_RUNNING        0x2000          /* CPU is running */
>>  #define CPUF_PAUSE  0x4000          /* CPU is paused in DDB */
>> Index: include/cpuvar.h
>> ===================================================================
>> RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
>> retrieving revision 1.11
>> diff -u -p -r1.11 cpuvar.h
>> --- include/cpuvar.h 16 May 2021 04:33:05 -0000      1.11
>> +++ include/cpuvar.h 5 Jul 2022 01:52:11 -0000
>> @@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
>>  void cpu_init(struct cpu_info *);
>>  void cpu_init_first(void);
>>  
>> -void tsc_sync_drift(int64_t);
>> -void tsc_sync_bp(struct cpu_info *);
>> -void tsc_sync_ap(struct cpu_info *);
>> +void tsc_test_sync_bp(struct cpu_info *);
>> +void tsc_test_sync_ap(struct cpu_info *);
>>  
>>  #endif

Reply via email to