On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
> On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
> > On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
> > > On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
> > > > On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
> > > > > > On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
> > > > > >
> > > > > > On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
> > > > > >> dv@ suggested coming to the list to request testing for the
> > > > > >> pvclock(4) driver.  Attached is a patch that corrects several
> > > > > >> bugs.  Most of
> > > > > >> these changes will only matter in the non-TSC_STABLE case on a
> > > > > >> multiprocessor VM.
> > > > > >>
> > > > > >> Ideally, nothing should break.
> > > > > >>
> > > > > >> - pvclock yields a 64-bit value.  The BSD timecounter layer can only
> > > > > >>  use the lower 32 bits, but internally we need to track the full
> > > > > >>  64-bit value to allow comparisons with the full value in the
> > > > > >>  non-TSC_STABLE case.  So make pvclock_lastcount a 64-bit quantity.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), move rdtsc() up into the lockless read
> > > > > >>  loop to get a more accurate timestamp.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), check that our TSC value doesn't predate
> > > > > >>  ti->ti_tsc_timestamp, otherwise we will produce an enormous value.
> > > > > >>
> > > > > >> - In pvclock_get_timecount(), update pvclock_lastcount in the
> > > > > >>  non-TSC_STABLE case with more care.  On amd64 we can do this with an
> > > > > >>  atomic_cas_ulong(9) loop because u_long is 64 bits.  On i386 we need
> > > > > >>  to introduce a mutex to protect our comparison and read/write.
> > > > > >
> > > > > > i386 has cmpxchg8b, no need to disable interrupts
> > > > > > the ifdefs seem excessive
> > > > >
> > > > > How do I make use of CMPXCHG8B on i386
> > > > > in this context?
> > > > >
> > > > > atomic_cas_ulong(9) is a 32-bit CAS on
> > > > > i386.
> > > >
> > > > static inline uint64_t
> > > > atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
> > > > {
> > > >         return __sync_val_compare_and_swap(p, o, n);
> > > > }
> > > >
> > > > Or md atomic.h files could have an equivalent.
> > > > Not possible on all 32-bit archs.
> > > >
> > > > >
> > > > > We can't use FP registers in the kernel, no?
> > > >
> > > > What do FP registers have to do with it?
> > > >
> > > > >
> > > > > Am I missing some other avenue?
> > > >
> > > > There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
> > >
> > > LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
> > > drop rdtsc_lfence() into cpufunc.h and proceed without causing some
> > > kind of fault on an older CPU.
> > >
> > > What are my options on a 586-class CPU for forcing RDTSC to complete
> > > before later instructions?
> >
> > "3.3.2. Serializing Operations
> > After executing certain instructions the Pentium processor serializes
> > instruction execution. This means that any modifications to flags,
> > registers, and memory for previous instructions are completed before
> > the next instruction is fetched and executed. The prefetch queue
> > is flushed as a result of serializing operations.
> >
> > The Pentium processor serializes instruction execution after executing
> > one of the following instructions: Move to Special Register (except
> > CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
> > CPUID, RSM and WRMSR."
> >
> > from:
> > Pentium Processor User's Manual
> > Volume 1: Pentium Processor Data Book
> > Order Number 241428
> >
> > http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
> >
> > So it could be rdtsc ; cpuid.
> > lfence; rdtsc should still be preferred.
> >
> > It could be tested during boot and set a function pointer.
> > Or the codepatch bits could be used.
> >
> > In the specific case of pvclock, can it be assumed that the host
> > has hardware virt and would then have lfence?
> >
> 
> I think this is a fair assumption.

Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
and, just to be safe, check for SSE2 during pvclock_match().

If in the future someone wants pvclock(4) support on guests without
SSE2 support (if any exist) we can deal with it at that time.
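
A rough sketch of the pre-SSE2 fallback jsg mentioned, i.e. serializing
RDTSC with CPUID instead of LFENCE, might look something like the
function below.  This is untested, rdtsc_cpuid() is a made-up name, and
it is not part of the diff; if we ever needed it, the boot-time check
(or the codepatch bits) could select between it and rdtsc_lfence().

/*
 * Hypothetical fallback for CPUs without LFENCE.  CPUID is a
 * serializing instruction on every 586-class CPU, so it keeps
 * RDTSC from executing before earlier instructions complete.
 */
static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t eax = 0, ebx, ecx, edx;
	uint64_t tsc;

	__asm volatile("cpuid"
	    : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
	    : : "memory");
	__asm volatile("rdtsc" : "=A" (tsc));
	return tsc;
}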

Updated patch:

- Add atomic_cas_64() to amd64/include/atomic.h.

  (NetBSD calls it atomic_cas_64(), so this is consistent with their
  atomic function naming scheme.)

  I am unsure whether it would be better to just do this:

#define atomic_cas_64(_p, _e, _n) _atomic_cas_ulong((unsigned long *)(_p), (_e), (_n))

  If I don't cast the pointer, the compiler complains, so I added
  _atomic_cas_ullong().

  What is preferred here?

- Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
  __sync_val_compare_and_swap() builtin.

- Add rdtsc_lfence() to i386/include/cpufunc.h.

- In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
  fix the non-PVCLOCK_FLAG_TSC_STABLE case.

- In pvclock_match(), check for SSE2 support in ci_feature_flags.
  If we don't have it, we don't have LFENCE and we can't match.

- In pvclock_get_timecount(), do RDTSC as early as possible in the
  lockless read loop to get a better timestamp.

- In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().

- In pvclock_get_timecount(), check whether our TSC lags
  ti_tsc_timestamp so that we don't produce an enormous, invalid
  delta.  If we're lagging, set delta to zero.

- In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
  case.

  Because we could be in pvclock_get_timecount() with other vCPUs, we
  need to read, compare, and (maybe) replace pvclock_lastcount in an
  atomic CAS loop.

  The only thing that I'm still unsure about is whether this:

volatile uint64_t pvclock_lastcount;

void
foo(void)
{
        uint64_t last;

        last = pvclock_lastcount;       /* atomic on i386? */
}

  is a safe, atomic 8-byte load on i386 on all the CPUs we support,
  i.e. 586 and up.

  Can someone confirm this?  I didn't know you could do this with
  32-bit registers.
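
  If it turns out a plain load is not guaranteed atomic there, one
  fallback I can think of (just a sketch, not part of the diff below)
  would be to read the value through the CAS itself, which forces a
  locked 8-byte access:

volatile uint64_t pvclock_lastcount;

/*
 * Hypothetical helper, not in the diff: force an atomic 8-byte read
 * by doing a compare-and-swap with identical old and new values.
 * The CAS returns the current contents without changing them (if the
 * value happens to be exactly 0, it is rewritten with the same 0).
 */
static inline uint64_t
pvclock_lastcount_read(void)
{
	return atomic_cas_64(&pvclock_lastcount, 0, 0);
}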

--

Should I put this out in a second request-for-test email?  The patch
has drifted a bit.

Index: arch/amd64/include/atomic.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
retrieving revision 1.22
diff -u -p -r1.22 atomic.h
--- arch/amd64/include/atomic.h 29 Aug 2022 02:01:18 -0000      1.22
+++ arch/amd64/include/atomic.h 8 Sep 2022 13:28:48 -0000
@@ -77,6 +77,18 @@ _atomic_cas_ulong(volatile unsigned long
 }
 #define atomic_cas_ulong(_p, _e, _n) _atomic_cas_ulong((_p), (_e), (_n))
 
+static inline unsigned long long
+_atomic_cas_ullong(volatile unsigned long long *p, unsigned long long e,
+    unsigned long long n)
+{
+       __asm volatile(_LOCK " cmpxchgq %2, %1"
+           : "=a" (n), "=m" (*p)
+           : "r" (n), "a" (e), "m" (*p));
+
+       return (n);
+}
+#define atomic_cas_64(_p, _e, _n) _atomic_cas_ullong((_p), (_e), (_n))
+
 static inline void *
 _atomic_cas_ptr(volatile void *p, void *e, void *n)
 {
Index: arch/i386/include/atomic.h
===================================================================
RCS file: /cvs/src/sys/arch/i386/include/atomic.h,v
retrieving revision 1.20
diff -u -p -r1.20 atomic.h
--- arch/i386/include/atomic.h  29 Aug 2022 02:01:18 -0000      1.20
+++ arch/i386/include/atomic.h  8 Sep 2022 13:28:49 -0000
@@ -83,6 +83,12 @@ _atomic_cas_ptr(volatile void *p, void *
 }
 #define atomic_cas_ptr(_p, _e, _n) _atomic_cas_ptr((_p), (_e), (_n))
 
+static inline uint64_t
+atomic_cas_64(volatile uint64_t *p, uint64_t e, uint64_t n)
+{
+       return __sync_val_compare_and_swap(p, e, n);
+}
+
 static inline unsigned int
 _atomic_swap_uint(volatile unsigned int *p, unsigned int n)
 {
Index: arch/i386/include/cpufunc.h
===================================================================
RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
retrieving revision 1.33
diff -u -p -r1.33 cpufunc.h
--- arch/i386/include/cpufunc.h 13 Sep 2020 11:53:16 -0000      1.33
+++ arch/i386/include/cpufunc.h 8 Sep 2022 13:28:49 -0000
@@ -229,6 +229,15 @@ rdtsc(void)
        return (tsc);
 }
 
+static inline uint64_t
+rdtsc_lfence(void)
+{
+       uint64_t tsc;
+
+       __asm volatile("lfence; rdtsc" : "=A" (tsc));
+       return tsc;
+}
+
 static __inline void
 wrmsr(u_int msr, u_int64_t newval)
 {
Index: dev/pv/pvclock.c
===================================================================
RCS file: /cvs/src/sys/dev/pv/pvclock.c,v
retrieving revision 1.8
diff -u -p -r1.8 pvclock.c
--- dev/pv/pvclock.c    5 Nov 2021 11:38:29 -0000       1.8
+++ dev/pv/pvclock.c    8 Sep 2022 13:28:49 -0000
@@ -21,21 +21,23 @@
 #endif
 
 #include <sys/param.h>
+#include <sys/stdint.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
 #include <sys/timetc.h>
 #include <sys/timeout.h>
 #include <sys/malloc.h>
-#include <sys/atomic.h>
 
 #include <machine/cpu.h>
 #include <machine/atomic.h>
+#include <machine/specialreg.h>
+
 #include <uvm/uvm_extern.h>
 
 #include <dev/pv/pvvar.h>
 #include <dev/pv/pvreg.h>
 
-uint pvclock_lastcount;
+volatile uint64_t pvclock_lastcount;
 
 struct pvclock_softc {
        struct device            sc_dev;
@@ -116,6 +118,12 @@ pvclock_match(struct device *parent, voi
                    (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) == 0)
                        return (0);
 
+               /*
+                * LFENCE is also required, so check for SSE2 support.
+                */
+               if (!ISSET(curcpu()->ci_feature_flags, CPUID_SSE2))
+                       return (0);
+
                return (1);
        }
 
@@ -213,6 +221,7 @@ pvclock_get_timecount(struct timecounter
        struct pvclock_softc            *sc = tc->tc_priv;
        struct pvclock_time_info        *ti;
        uint64_t                         tsc_timestamp, system_time, delta, ctr;
+       uint64_t                         last, tsc;
        uint32_t                         version, mul_frac;
        int8_t                           shift;
        uint8_t                          flags;
@@ -220,6 +229,7 @@ pvclock_get_timecount(struct timecounter
        ti = sc->sc_time;
        do {
                version = pvclock_read_begin(ti);
+               tsc = rdtsc_lfence();
                system_time = ti->ti_system_time;
                tsc_timestamp = ti->ti_tsc_timestamp;
                mul_frac = ti->ti_tsc_to_system_mul;
@@ -231,7 +241,10 @@ pvclock_get_timecount(struct timecounter
         * The algorithm is described in
         * linux/Documentation/virtual/kvm/msr.txt
         */
-       delta = rdtsc() - tsc_timestamp;
+       if (tsc > tsc_timestamp)
+               delta = tsc - tsc_timestamp;
+       else
+               delta = 0;
        if (shift < 0)
                delta >>= -shift;
        else
@@ -241,10 +254,11 @@ pvclock_get_timecount(struct timecounter
        if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
                return (ctr);
 
-       if (ctr < pvclock_lastcount)
-               return (pvclock_lastcount);
-
-       atomic_swap_uint(&pvclock_lastcount, ctr);
+       do {
+               last = pvclock_lastcount;
+               if (ctr < last)
+                       return (last);
+       } while (atomic_cas_64(&pvclock_lastcount, last, ctr) != last);
 
        return (ctr);
 }
