> From: Mike Belopuhov <m...@belopuhov.com>
> Date: Thu, 7 Jan 2016 12:02:23 +0100
>
> On 6 January 2016 at 17:58, Stefan Fritsch <s...@sfritsch.de> wrote:
> > On Wed, 6 Jan 2016, Mike Belopuhov wrote:
> >
> >> There's still stuff to do, but it receives and transmits reliably
> >> (at least on modern Xen) so I'd like to get it in.  Man page will
> >> follow.
> >
> > I only had a quick glance at the code, but I have one comment about
> > your use of memory barriers.  The membar_* macros are pure compiler
> > barriers when the openbsd kernel is compiled for UP.  But since the
> > host machine and xen may use SMP even in this case, I suspect that
> > you need hardware memory barriers even if MULTIPROCESSOR is not
> > defined.  This does not seem relevant for x86 because you don't use
> > membar_sync, but it may become relevant for arm, which is also
> > supported by xen.
>
> membar_{producer,consumer} are defined on arm to perform store and
> load memory barriers.  Our arm code currently does not distinguish
> between the MP and non-MP case in the definition of these macros,
> so I'm not entirely certain what you are trying to say.
Not sure ARM is a good example to look at.  In principle I think the
membar_xxx() interfaces could be simple compiler barriers on all our
architectures, at least as long as the CPU observes its own stores in
the same order as they were emitted.  I think all sane CPU
architectures make that guarantee, at least for "normal" memory.
However, we treat that as an optimization, and we haven't done it for
all our architectures.

The problem with virtualization is of course that even a non-MP
kernel is actually running in an MP environment.  If data structures
are shared with the hypervisor or with another domain running on a
different CPU, proper memory barriers must be used to guarantee that
the other side sees our stores in the right order.  The typical case
is populating a descriptor that carries some sort of validity bit:
you want to make sure the other side doesn't see the valid bit set
until all the other parts of the descriptor have been filled in and
are visible.  In that case a simple compiler barrier may not be
enough.  This is why the virtio_membar_xxx() primitives were
introduced.

This is actually not all that different from handling DMA to real
hardware devices.  There we must make sure that stores become visible
to the hardware device in the right order.  That matters on non-MP
kernels too and is handled by bus_dmamap_sync(9).  Since you have
embraced bus_dma(9) for the xen stuff, it would make sense to add a
xen-specific bus_dmamap_sync() implementation that issues the
appropriate memory barrier.  I think it should be
virtio_membar_consumer() for BUS_DMASYNC_PREREAD and
virtio_membar_producer() for BUS_DMASYNC_POSTWRITE.  But you'd better
double-check, because I always get confused!

BTW, your xen bus_dma(9) implementation relies on the internals of
the MD bus_dma(9) implementation.  Don't expect it to work on other
architectures.  I'm not even sure I want to be held responsible if
changes in the MD code break it.

> However I'm thankful for bringing this up and I'll spend some time
> figuring out if I need actual fences in my code.  For instance the
> cas loop in xen_grant_table_remove runs for more than 10000
> iterations in the normal case.  I've changed the code to perform
> bus_dma_unload after zeroing the descriptor out so that technically
> there won't be any dangling grant table references, but I didn't
> re-measure the cas loop.  Possibly due to caching and CPU migration
> on the host we lose out and perhaps can get a boost in performance
> by putting an implicit memory barrier.

Not sure what memory fences have to do with this.  The hypervisor
should definitely issue any appropriate barriers as part of the
context switching.

> > I had the same problem in virtio and introduced the virtio_membar_*
> > macros for this purpose.  Maybe they should be renamed to a more
> > generic name and you should use them, too?
>
> I'm not sure cause I don't think x86 needs any explicit membars, but
> I'll do some tests and report on this.

It's a grey area.  The x86 memory model evolved over time and isn't
all that well specified.  On top of that, the actual hardware seems
to be a bit more strongly ordered than the specification.  It is
fairly strongly ordered, which means that memory barriers can be
omitted in most cases that deal with "normal" memory.  But our
implementation does issue memory barriers for membar_enter() and
membar_sync().  I'm not 100% certain that is correct.
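To make the validity-bit pattern above concrete, here is a minimal
sketch.  The ring_desc layout, the DESC_VALID flag and the
xnf_post_desc() name are made up for illustration and are not taken
from the xnf(4) code; virtio_membar_producer() is the existing
primitive mentioned above.

	#include <sys/types.h>
	#include <dev/pv/virtiovar.h>	/* virtio_membar_producer(); header path may differ by release */

	/* Invented descriptor layout, for illustration only. */
	struct ring_desc {
		uint64_t		rd_addr;
		uint32_t		rd_len;
		volatile uint32_t	rd_flags;
	};

	#define DESC_VALID	0x00000001	/* descriptor now owned by the peer */

	void
	xnf_post_desc(struct ring_desc *rd, uint64_t addr, uint32_t len)
	{
		/* Fill in the payload fields first. */
		rd->rd_addr = addr;
		rd->rd_len = len;

		/*
		 * Make sure the other side cannot observe DESC_VALID
		 * before it can observe rd_addr and rd_len.  A pure
		 * compiler barrier is not enough once the consumer
		 * runs on another (virtual) CPU.
		 */
		virtio_membar_producer();

		/* Only now flip the bit the other side polls on. */
		rd->rd_flags |= DESC_VALID;
	}

The barrier sits between filling in the payload and publishing the
descriptor, which is the whole point of the producer side.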
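Along the same lines, a rough sketch of what a xen-specific
bus_dmamap_sync() hook could look like, assuming the PREREAD/POSTWRITE
mapping suggested above (which, as said, deserves double-checking).
The xen_bus_dmamap_sync() name and the chaining to the MD
_bus_dmamap_sync() are assumptions for the sketch, not the actual
implementation.

	#include <sys/param.h>
	#include <machine/bus.h>	/* bus_dma(9) types and BUS_DMASYNC_* */
	#include <dev/pv/virtiovar.h>	/* virtio_membar_{producer,consumer}(); path may differ */

	void
	xen_bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map,
	    bus_addr_t offset, bus_size_t size, int ops)
	{
		/*
		 * Order our loads/stores against the other side:
		 * before we read a buffer it may have written, and
		 * after we wrote a buffer it is about to read.
		 */
		if (ops & BUS_DMASYNC_PREREAD)
			virtio_membar_consumer();
		if (ops & BUS_DMASYNC_POSTWRITE)
			virtio_membar_producer();

		/* Let the MD implementation do whatever else it needs. */
		_bus_dmamap_sync(t, map, offset, size, ops);
	}

Such a function would be installed as the _dmamap_sync member of the
xen bus_dma tag, so the drivers keep calling plain bus_dmamap_sync(9).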