> From: Mike Belopuhov <m...@belopuhov.com>
> Date: Thu, 7 Jan 2016 12:02:23 +0100
> 
> On 6 January 2016 at 17:58, Stefan Fritsch <s...@sfritsch.de> wrote:
> > On Wed, 6 Jan 2016, Mike Belopuhov wrote:
> >
> >> There's still stuff to do, but it receives and transmits reliably
> >> (at least on modern Xen) so I'd like to get it in.  Man page will
> >> follow.
> >
> > I only had a quick glance at the code, but I have one comment about your
> > use of memory barriers. The membar_* macros are pure compiler barriers
> > when the openbsd kernel is compiled for UP. But since the host machine and
> > xen may use SMP even in this case, I suspect that you need hardware
> > memory barriers even if MULTIPROCESSOR is not defined. This does not seem
> > relevant for x86 because you don't use membar_sync, but it may become
> > relevant for arm, which is also supported by xen.
> >
> 
> membar_{producer,consumer} are defined on arm to perform store and
> load memory barriers.  Our arm code currently does not distinguish
> between an MP case and non-MP case regarding the definition of these
> macros, so I'm not entirely certain what you are trying to say.

Not sure ARM is a good example to look at.

In principle I think the membar_xxx() interfaces could be simple
compiler barriers on all our architectures, as long as the CPU
observes its own stores in the order they were emitted.  I think all
sane CPU architectures make that guarantee, at least for "normal"
memory.  However, we treat that as an optimization, and we haven't
done it for all our architectures.
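
To illustrate the distinction (a sketch only; the real per-architecture
definitions live in <machine/atomic.h> and differ from this):

/*
 * A pure compiler barrier: no instruction is emitted, it only keeps
 * the compiler from reordering memory accesses across this point.
 * The CPU itself may still reorder them as seen by other CPUs.
 */
#define compiler_barrier()	__asm volatile("" ::: "memory")

/*
 * A hardware barrier (amd64 flavour shown): additionally orders the
 * accesses as observed by other CPUs, or by a hypervisor/backend
 * running on another CPU.
 */
#define hardware_barrier()	__asm volatile("mfence" ::: "memory")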

The problem with virtualization is of course that even a non-MP kernel
is actually running in an MP environment.  If data structures are
shared with the hypervisor or another domain running on a different
CPU, proper memory barriers must be used to guarantee the other side
sees our stores in the right order.  The typical case would be
populating a descriptor with some sort of validity bit.  There you
want to make sure the other side doesn't see the valid bit set until
all the other parts of the descriptor have been filled in and are
visible.  In that case a simple compiler barrier may not be enough.
This is why the virtio_membar_xxx() primitives were introduced.
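
As a minimal sketch of that pattern (the descriptor layout and names
below are made up for illustration; only virtio_membar_producer() is
the real primitive, declared with the virtio code):

/* Hypothetical descriptor shared with a backend on another CPU. */
struct ring_desc {
	uint64_t	rd_addr;	/* buffer address */
	uint32_t	rd_len;		/* buffer length */
	uint32_t	rd_flags;	/* RD_VALID is set last */
};
#define RD_VALID	0x00000001

void
ring_post(volatile struct ring_desc *rd, uint64_t addr, uint32_t len)
{
	/* Fill in everything except the valid bit. */
	rd->rd_addr = addr;
	rd->rd_len = len;

	/*
	 * Make sure those stores are visible to the other side before
	 * the valid bit is; a compiler barrier alone is not enough when
	 * the consumer runs on a different physical CPU.
	 */
	virtio_membar_producer();

	rd->rd_flags |= RD_VALID;
}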

This is actually not all that different from handling DMA to real
hardware devices.  There we must make sure that stores become visible
to the hardware device in the right order.  That matters on non-MP
kernels too and is handled by bus_dmamap_sync(9).

Since you have embraced bus_dma(9) for the xen stuff, it would make
sense to add a xen-specific bus_dmamap_sync() implementation that
issues the appropriate memory barrier.  I think it should be
virtio_membar_consumer() for BUS_DMASYNC_PREREAD and
virtio_membar_producer() for BUS_DMASYNC_POSTWRITE.  But you'd better
double-check, because I always get confused!
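
Something along these lines (a sketch of the suggestion only;
xen_bus_dmamap_sync() is a hypothetical name and the op-to-barrier
mapping is exactly the part to double-check):

void
xen_bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map,
    bus_addr_t offset, bus_size_t size, int ops)
{
	/* Consumer barrier for BUS_DMASYNC_PREREAD, as suggested above. */
	if (ops & BUS_DMASYNC_PREREAD)
		virtio_membar_consumer();

	/* Producer barrier for BUS_DMASYNC_POSTWRITE, as suggested above. */
	if (ops & BUS_DMASYNC_POSTWRITE)
		virtio_membar_producer();
}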

BTW, your xen bus_dma(9) implementation relies on the internals of the
MD bus_dma(9) implementation.  Don't expect it to work on other
architectures.  I'm not even sure I want to be held responsible if
changes in the MD code break it.

> However, I'm thankful you brought this up and I'll spend some time
> figuring out whether I need actual fences in my code.  For instance,
> the CAS loop in xen_grant_table_remove runs for more than 10000
> iterations in the normal case.  I've changed the code to perform
> bus_dma_unload after zeroing the descriptor out so that technically
> there won't be any dangling grant table references, but I didn't
> remeasure the CAS loop.  Possibly due to caching and CPU migration
> on the host we lose out, and perhaps we can get a performance boost
> by adding a memory barrier.

Not sure what memory fences have to do with this.  The hypervisor
should definitely issue any appropriate barriers as part of the
context switching.

> > I had the same problem in virtio and introduced the virtio_membar_* macros
> > for this purpose. Maybe they should be renamed to a more generic name and
> > you should use them, too?
> >
> 
> I'm not sure, because I don't think x86 needs any explicit membars,
> but I'll do some tests and report back on this.

It's a grey area.  The x86 memory model evolved over time and isn't
all that well specified.  On top of that, it seems the actual hardware
is a bit more strongly ordered than the specification.  It is fairly
strongly ordered, which means that memory barriers can be omitted in
most cases that deal with "normal" memory.  But our implementation
does issue memory barriers for membar_enter() and membar_sync().  I'm
not 100% certain it is correct.
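
Roughly, and only as a sketch consistent with that description (not
the literal amd64 <machine/atomic.h>):

/*
 * For write-back memory, x86 does not reorder stores with other
 * stores or loads with other loads, so these can be plain compiler
 * barriers.
 */
#define membar_producer()	__asm volatile("" ::: "memory")
#define membar_consumer()	__asm volatile("" ::: "memory")

/*
 * A store followed by a load from a different location may be
 * reordered, so a full fence is issued here.
 */
#define membar_sync()		__asm volatile("mfence" ::: "memory")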
