On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
> On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
> > On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
> > > On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
> > > > On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
> > > > > > On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
> > > > > >
> > > > > > On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
> > > > > >> dv@ suggested coming to the list to request testing for the
> > > > > >> pvclock(4) driver.  Attached is a patch that corrects several
> > > > > >> bugs.  Most of these changes will only matter in the
> > > > > >> non-TSC_STABLE case on a multiprocessor VM.
> > > > > >>
> > > > > >> Ideally, nothing should break.
> > > > > >>
> > > > > >> - pvclock yields a 64-bit value.  The BSD timecounter layer can
> > > > > >>   only use the lower 32 bits, but internally we need to track
> > > > > >>   the full 64-bit value to allow comparisons with the full value
> > > > > >>   in the non-TSC_STABLE case.  So make pvclock_lastcount a
> > > > > >>   64-bit quantity.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), move rdtsc() up into the lockless
> > > > > >>   read loop to get a more accurate timestamp.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), check that our TSC value doesn't
> > > > > >>   predate ti->ti_tsc_timestamp, otherwise we will produce an
> > > > > >>   enormous value.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), update pvclock_lastcount in the
> > > > > >>   non-TSC_STABLE case with more care.  On amd64 we can do this
> > > > > >>   with an atomic_cas_ulong(9) loop because u_long is 64 bits.
> > > > > >>   On i386 we need to introduce a mutex to protect our comparison
> > > > > >>   and read/write.
> > > > > >
> > > > > > i386 has cmpxchg8b, no need to disable interrupts
> > > > > > the ifdefs seem excessive
> > > > >
> > > > > How do I make use of CMPXCHG8B on i386 in this context?
> > > > >
> > > > > atomic_cas_ulong(9) is a 32-bit CAS on i386.
> > > >
> > > > static inline uint64_t
> > > > atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
> > > > {
> > > > 	return __sync_val_compare_and_swap(p, o, n);
> > > > }
> > > >
> > > > Or md atomic.h files could have an equivalent.
> > > > Not possible on all 32-bit archs.
> > > >
> > > > > We can't use FP registers in the kernel, no?
> > > >
> > > > What do FP registers have to do with it?
> > > >
> > > > > Am I missing some other avenue?
> > > >
> > > > There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
> > >
> > > LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
> > > drop rdtsc_lfence() into cpufunc.h and proceed without causing some
> > > kind of fault on an older CPU.
> > >
> > > What are my options on a 586-class CPU for forcing RDTSC to complete
> > > before later instructions?
> >
> > "3.3.2. Serializing Operations
> > After executing certain instructions the Pentium processor serializes
> > instruction execution.  This means that any modifications to flags,
> > registers, and memory for previous instructions are completed before
> > the next instruction is fetched and executed.  The prefetch queue
> > is flushed as a result of serializing operations.
> >
> > The Pentium processor serializes instruction execution after executing
> > one of the following instructions: Move to Special Register (except
> > CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
> > CPUID, RSM and WRMSR."
> >
> > from:
> > Pentium Processor User's Manual
> > Volume 1: Pentium Processor Data Book
> > Order Number 241428
> >
> > http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
> >
> > So it could be rdtsc ; cpuid.
> > lfence; rdtsc should still be preferred.
> >
> > It could be tested during boot and set a function pointer.
> > Or the codepatch bits could be used.
> >
> > In the specific case of pvclock, can it be assumed that the host
> > has hardware virt and would then have lfence?
>
> I think this is a fair assumption.
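For the archives, the boot-time selection suggested above could look
roughly like this (untested sketch, not part of the diff; tsc_read,
tsc_read_init() and rdtsc_cpuid() are made-up names).  CPUID goes in
front of RDTSC so that it mirrors what "lfence; rdtsc" does:

	/* Serialized TSC read for pre-SSE2 CPUs: CPUID serializes. */
	static inline uint64_t
	rdtsc_cpuid(void)
	{
		uint32_t hi, lo;

		__asm volatile("cpuid; rdtsc"
		    : "=a" (lo), "=d" (hi)
		    : "0" (0)			/* CPUID leaf 0 */
		    : "ebx", "ecx", "memory");
		return ((uint64_t)hi << 32 | lo);
	}

	uint64_t (*tsc_read)(void) = rdtsc;

	/* Run once during boot, e.g. from cpu_attach(). */
	void
	tsc_read_init(void)
	{
		if (ISSET(curcpu()->ci_feature_flags, CPUID_SSE2))
			tsc_read = rdtsc_lfence;
		else
			tsc_read = rdtsc_cpuid;
	}

But if we can assume the host has LFENCE, the indirection seems
unnecessary for pvclock(4).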
Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
and, just to be safe, check for SSE2 during pvclock_match()?  If in
the future someone wants pvclock(4) support on guests without SSE2
support (if any exist) we can deal with it at that time.

Updated patch:

- Add atomic_cas_64() to amd64/include/atomic.h.  (NetBSD calls it
  atomic_cas_64(), so this is consistent with their atomic function
  naming scheme.)

  I am unsure whether it would be better to just do this:

	#define atomic_cas_64(_p, _e, _n) \
		_atomic_cas_ulong((unsigned long *)(_p), (_e), (_n))

  If I don't cast the pointer, the compiler complains, so I added
  _atomic_cas_ullong().  What is preferred here?

- Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
  __sync_val_compare_and_swap() builtin.

- Add rdtsc_lfence() to i386/include/cpufunc.h.

- In pvclock.c, make pvclock_lastcount a volatile 64-bit value to fix
  the non-PVCLOCK_FLAG_TSC_STABLE case.

- In pvclock_match(), check for SSE2 support in ci_feature_flags.  If
  we don't have it, we don't have LFENCE and we can't match.

- In pvclock_get_timecount(), do RDTSC as early as possible in the
  lockless read loop to get a better timestamp.

- In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().

- In pvclock_get_timecount(), check whether our TSC lags
  ti_tsc_timestamp so that we don't produce an enormous, invalid
  delta.  If we're lagging, set the delta to zero.

- In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
  case.  Because we could be in pvclock_get_timecount() with other
  vCPUs, we need to read, compare, and (maybe) replace
  pvclock_lastcount in an atomic CAS loop.

The only thing that I'm still unsure about is whether this:

	volatile uint64_t pvclock_lastcount;

	void
	foo(void)
	{
		uint64_t last;

		last = pvclock_lastcount;	/* atomic on i386? */
	}

is a safe, atomic 8-byte load on i386 on all the CPUs we support,
i.e. 586 and up.  Can someone confirm this?  I didn't know you could
do this with 32-bit registers.  (If it isn't safe, one possible
fallback is sketched just before the diff.)

--

Should I put this out in a second request-for-test email?  The patch
has drifted a bit.
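If the plain 8-byte load turns out not to be atomic on those CPUs, one
fallback would be to read the value with the atomic_cas_64() added
below, passing equal "expected" and "new" values so CMPXCHG8B returns
the current contents atomically and never stores anything different.
Just a sketch, not part of the diff, and pvclock_read_lastcount() is a
made-up name:

	/*
	 * Sketch only: atomic 64-bit read via compare-and-swap.
	 * If pvclock_lastcount happens to be 0 the CAS stores 0 over
	 * 0 (a no-op); otherwise the compare fails and we just get
	 * the current value back.
	 */
	static inline uint64_t
	pvclock_read_lastcount(void)
	{
		return (atomic_cas_64(&pvclock_lastcount, 0, 0));
	}

On amd64 this shouldn't be needed, an aligned 8-byte load is atomic
there.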
Index: arch/amd64/include/atomic.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
retrieving revision 1.22
diff -u -p -r1.22 atomic.h
--- arch/amd64/include/atomic.h	29 Aug 2022 02:01:18 -0000	1.22
+++ arch/amd64/include/atomic.h	8 Sep 2022 13:28:48 -0000
@@ -77,6 +77,18 @@ _atomic_cas_ulong(volatile unsigned long
 }
 #define atomic_cas_ulong(_p, _e, _n) _atomic_cas_ulong((_p), (_e), (_n))
 
+static inline unsigned long long
+_atomic_cas_ullong(volatile unsigned long long *p, unsigned long long e,
+    unsigned long long n)
+{
+	__asm volatile(_LOCK " cmpxchgq %2, %1"
+	    : "=a" (n), "=m" (*p)
+	    : "r" (n), "a" (e), "m" (*p));
+
+	return (n);
+}
+#define atomic_cas_64(_p, _e, _n) _atomic_cas_ullong((_p), (_e), (_n))
+
 static inline void *
 _atomic_cas_ptr(volatile void *p, void *e, void *n)
 {
Index: arch/i386/include/atomic.h
===================================================================
RCS file: /cvs/src/sys/arch/i386/include/atomic.h,v
retrieving revision 1.20
diff -u -p -r1.20 atomic.h
--- arch/i386/include/atomic.h	29 Aug 2022 02:01:18 -0000	1.20
+++ arch/i386/include/atomic.h	8 Sep 2022 13:28:49 -0000
@@ -83,6 +83,12 @@ _atomic_cas_ptr(volatile void *p, void *
 }
 #define atomic_cas_ptr(_p, _e, _n) _atomic_cas_ptr((_p), (_e), (_n))
 
+static inline uint64_t
+atomic_cas_64(volatile uint64_t *p, uint64_t e, uint64_t n)
+{
+	return __sync_val_compare_and_swap(p, e, n);
+}
+
 static inline unsigned int
 _atomic_swap_uint(volatile unsigned int *p, unsigned int n)
 {
Index: arch/i386/include/cpufunc.h
===================================================================
RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
retrieving revision 1.33
diff -u -p -r1.33 cpufunc.h
--- arch/i386/include/cpufunc.h	13 Sep 2020 11:53:16 -0000	1.33
+++ arch/i386/include/cpufunc.h	8 Sep 2022 13:28:49 -0000
@@ -229,6 +229,15 @@ rdtsc(void)
 	return (tsc);
 }
 
+static inline uint64_t
+rdtsc_lfence(void)
+{
+	uint64_t tsc;
+
+	__asm volatile("lfence; rdtsc" : "=A" (tsc));
+	return tsc;
+}
+
 static __inline void
 wrmsr(u_int msr, u_int64_t newval)
 {
Index: dev/pv/pvclock.c
===================================================================
RCS file: /cvs/src/sys/dev/pv/pvclock.c,v
retrieving revision 1.8
diff -u -p -r1.8 pvclock.c
--- dev/pv/pvclock.c	5 Nov 2021 11:38:29 -0000	1.8
+++ dev/pv/pvclock.c	8 Sep 2022 13:28:49 -0000
@@ -21,21 +21,23 @@
 #endif
 
 #include <sys/param.h>
+#include <sys/stdint.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/timetc.h>
 #include <sys/timeout.h>
 #include <sys/malloc.h>
-#include <sys/atomic.h>
 
 #include <machine/cpu.h>
 #include <machine/atomic.h>
+#include <machine/specialreg.h>
+
 #include <uvm/uvm_extern.h>
 
 #include <dev/pv/pvvar.h>
 #include <dev/pv/pvreg.h>
 
-uint pvclock_lastcount;
+volatile uint64_t pvclock_lastcount;
 
 struct pvclock_softc {
 	struct device		sc_dev;
@@ -116,6 +118,12 @@ pvclock_match(struct device *parent, voi
 	    (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) == 0)
 		return (0);
 
+	/*
+	 * LFENCE is also required, so check for SSE2 support.
+	 */
+	if (!ISSET(curcpu()->ci_feature_flags, CPUID_SSE2))
+		return (0);
+
 	return (1);
 }
 
@@ -213,6 +221,7 @@ pvclock_get_timecount(struct timecounter
 	struct pvclock_softc *sc = tc->tc_priv;
 	struct pvclock_time_info *ti;
 	uint64_t tsc_timestamp, system_time, delta, ctr;
+	uint64_t last, tsc;
 	uint32_t version, mul_frac;
 	int8_t shift;
 	uint8_t flags;
@@ -220,6 +229,7 @@ pvclock_get_timecount(struct timecounter
 	ti = sc->sc_time;
 	do {
 		version = pvclock_read_begin(ti);
+		tsc = rdtsc_lfence();
 		system_time = ti->ti_system_time;
 		tsc_timestamp = ti->ti_tsc_timestamp;
 		mul_frac = ti->ti_tsc_to_system_mul;
@@ -231,7 +241,10 @@ pvclock_get_timecount(struct timecounter
 	 * The algorithm is described in
 	 * linux/Documentation/virtual/kvm/msr.txt
 	 */
-	delta = rdtsc() - tsc_timestamp;
+	if (tsc > tsc_timestamp)
+		delta = tsc - tsc_timestamp;
+	else
+		delta = 0;
 	if (shift < 0)
 		delta >>= -shift;
 	else
@@ -241,10 +254,11 @@ pvclock_get_timecount(struct timecounter
 	if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
 		return (ctr);
 
-	if (ctr < pvclock_lastcount)
-		return (pvclock_lastcount);
-
-	atomic_swap_uint(&pvclock_lastcount, ctr);
+	do {
+		last = pvclock_lastcount;
+		if (ctr < last)
+			return (last);
+	} while (atomic_cas_64(&pvclock_lastcount, last, ctr) != last);
 
 	return (ctr);
 }