On Thu, 7 Jan 2016, Mike Belopuhov wrote:

> On 7 January 2016 at 13:17, Mark Kettenis <mark.kette...@xs4all.nl> wrote:
> >> From: Mike Belopuhov <m...@belopuhov.com>
> >> Date: Thu, 7 Jan 2016 12:02:23 +0100
> >>
> >> On 6 January 2016 at 17:58, Stefan Fritsch <s...@sfritsch.de> wrote:
> >> > On Wed, 6 Jan 2016, Mike Belopuhov wrote:
> >> >
> >> >> There's still stuff to do, but it receives and transmits reliably
> >> >> (at least on modern Xen) so I'd like to get it in.  Man page will
> >> >> follow.
> >> >
> >> > I only had a quick glance at the code, but I have one comment about your
> >> > use of memory barriers.  The membar_* macros are pure compiler barriers
> >> > when the OpenBSD kernel is compiled for UP.  But since the host machine
> >> > and Xen may use SMP even in this case, I suspect that you need hardware
> >> > memory barriers even if MULTIPROCESSOR is not defined.  This does not
> >> > seem relevant for x86 because you don't use membar_sync, but it may
> >> > become relevant for arm, which is also supported by Xen.
> >> >
> >>
> >> membar_{producer,consumer} are defined on arm to perform store and
> >> load memory barriers.  Our arm code currently does not distinguish
> >> between an MP case and a non-MP case regarding the definition of these
> >> macros, so I'm not entirely certain what you are trying to say.
I didn't check arm's implementation but knew that it had non-empty
membar_{producer,consumer}. So, if it does not distinguish between an MP
case and a non-MP case, then there is no problem there. But maybe you
should document somewhere which assumptions you make about the
architecture, so that they can be checked when adding a new architecture.
I guess arm64 will come sooner or later and I don't know if it has
exactly the same memory model as 32-bit arm.

> >
> > Not sure ARM is a good example to look at.
>
> The only architectures that Xen dom0 is implemented for are i386,
> amd64 and arm, so there's no real need to look at anything else.
>
> > In principle I think that the membar_xxx() interfaces could be simple
> > compiler barriers on all our architectures, at least as long as the
> > CPU will observe its own stores in the same order as they were
> > emitted.  But I think all sane CPU architectures make those
> > guarantees.  At least for "normal" memory.  However, we treat that as
> > an optimization.  And we haven't done that for all our architectures.
> >
> > The problem with virtualization is of course that even a non-MP kernel
> > is actually running in an MP environment.  If data structures are
> > shared with the hypervisor or another domain running on a different
> > CPU, proper memory barriers must be used to guarantee the other side
> > sees our stores in the right order.  The typical case would be
> > populating a descriptor with some sort of validity bit.  There you
> > want to make sure the other side doesn't see the valid bit set until
> > all the other parts of the descriptor have been filled in and are
> > visible.  In that case a simple compiler barrier may not be enough.

Yes. With Intel it's the "Reads may be reordered with older writes to
different locations but not with older writes to the same location" part
of the memory model that is causing problems. So you have to check
whether Xen hits this case. virtio does (and removing the memory barriers
causes observable hangs).

> That's what I was referring to in my example below.
>
> > This is why the virtio_membar_xxx() primitives were introduced.
> >
>
> Any idea why store and load barriers weren't implemented separately?

No idea. virtio_membar_xxx() was modeled after the existing membar_xxx().
But AIUI membar_consumer() plus membar_producer() is not equivalent to
membar_sync() (which also prevents read vs. write reordering).
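
To make the valid-bit case concrete, here is a minimal sketch of the
pattern, not taken from the xnf diff: the struct and function names
(fake_desc, FD_VALID, fake_desc_post, fake_desc_poll) are invented for
illustration, only the virtio_membar_*() calls are the existing
primitives from <sys/atomic.h>.

/*
 * Sketch only: the descriptor layout and the helpers are made-up names.
 * The point is the placement of the virtio_membar_*() calls, which stay
 * real hardware barriers even on a UP kernel.
 */
#include <sys/types.h>
#include <sys/atomic.h>

struct fake_desc {
        uint64_t                 fd_addr;
        uint32_t                 fd_len;
        volatile uint32_t        fd_flags;
#define FD_VALID                 0x00000001
};

void
fake_desc_post(struct fake_desc *d, uint64_t addr, uint32_t len)
{
        d->fd_addr = addr;
        d->fd_len = len;
        /*
         * The other side may run on a different physical CPU even if
         * our kernel is UP, so a plain compiler barrier is not enough:
         * make the payload visible before the valid bit.
         */
        virtio_membar_producer();
        d->fd_flags |= FD_VALID;
}

int
fake_desc_poll(struct fake_desc *d, uint64_t *addr, uint32_t *len)
{
        if ((d->fd_flags & FD_VALID) == 0)
                return (0);
        /* Don't read the payload until after the valid bit was seen. */
        virtio_membar_consumer();
        *addr = d->fd_addr;
        *len = d->fd_len;
        return (1);
}

And for the Intel case above, where a later read must not pass an
earlier write, neither of the two is sufficient; that ordering is only
given by virtio_membar_sync().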