GCC interpretation of C11 atomics (DR 459)
Hi,

I have read multiple bug reports (84522, 80878, 70490) and the past decision to redirect double-width (128-bit) atomics for x86-64 and arm64 to libatomic in GCC. Below I describe the major concerns, as well as the response from WG14 (C11) regarding DR 459, which most likely triggered this change in more recent GCC releases in the first place.

If I understand correctly, the redirection to libatomic was made for two reasons:

1. cmpxchg16b is not available on early amd64 processors. (However, the -mcx16 flag already states that the target CPUs have this instruction, so it should not be a concern when the flag is specified.)

2. atomic_load on read-only memory. DR 459 now requires a 'const' qualifier for atomic_load, which probably resulted in the interpretation that read-only memory must be supported. However, per the response from WG14 (see below), that does not seem to be the case at all. Therefore, the previously filed bug 70490 does not seem to be valid.

There are several concerns with the current GCC behavior:

1. It is not consistent with clang/llvm, which completely supports double-width atomics for arm32, arm64, x86 and x86-64, making it possible to write portable code (without specific extensions or assembly code) across all these architectures (which is finally possible with C11!). The behavior of clang: if -mcx16 is specified, cmpxchg16b is generated for x86-64 (without any calls to libatomic); otherwise, calls are redirected to libatomic. For arm64, ldaxp/stlxp are always generated. In my opinion, this is very logical and non-confusing.

2. Oftentimes you want a strict guarantee (by specifying the -mcx16 flag for x86-64) that the generated code is lock-free; otherwise it is useless. Double-width atomics are often used in lock-free algorithms that use tags (stamps) for pointers to resolve the ABA problem. So it is very useful to have corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. The older (and more limited, less portable, etc.) __sync builtins still use cmpxchg16b directly; the newer __atomic builtins and C11 atomics do not. Moreover, __sync builtins are probably less suitable for arm/arm64.

4. atomic_load can be implemented using read-modify-write, as it is the only option for x86-64 and arm64 (see below).

For these reasons, it may be a good idea for the GCC folks to reconsider the past decision. And just to clarify: if -mcx16 (x86-64) is not specified during compilation, it is totally OK to redirect to libatomic and make the final decision there on whether the target CPU supports a given instruction. But if it is specified, it makes sense for performance reasons and lock-freedom guarantees to always generate it directly.

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459:
(I asked for permission to publish this response here.)

Ruslan,

Thank you for your comments. There is no normative requirement that const objects be suitable for read-only memory. An example and a footnote refer to read-only memory as a way to illustrate a point, but examples and footnotes are not normative. The actual nature of read-only memory and how it can be used are outside the scope of the standard, so there is nothing to prevent atomic_load from being implemented as a read-modify-write operation.

David

My original email:

Dear David Keaton,

After reviewing the proposed change DR 459 for C11: http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459 , I identified that adding a const qualifier to atomic_load (C11 defines it without one) may actually be harmful in some cases. Particularly, for double-width (128-bit) atomics found in x86-64 (cmpxchg16b instruction) and arm64 (ldaxp/stlxp instructions), it is currently only possible to implement a 128-bit atomic_load using the corresponding read-modify-write instructions (i.e., potentially rewriting memory with the same value but, in essence, not changing it). But these implementations will not work on read-only memory. Similar concerns apply to some extent to x86 and arm32 for double-width (64-bit) atomics. Otherwise, there is no obstacle to implementing all C11 atomics for the corresponding types on these architectures. Moreover, the well-known clang/llvm compiler already implements all double-width operations for x86, x86-64, arm32 and arm64 (atomic_load is implemented using the corresponding read-modify-write instructions).

Double-width atomics are often used in data structures that need tagging for pointers to avoid the ABA problem (e.g., in lock-free stacks and queues). It is my understanding that C11 aimed to make atomics more or less portable across different microarchitectures while at the same time allowing a compiler to optimize code well and utilize the full potential of the corresponding microarchitecture.

If now it is required to support read-only memory (i.e., const qualifier) for atomic_load, 128-bit atomics are likely to be impossible to impleme
Fw: GCC interpretation of C11 atomics (DR 459)
Alexander,

Thank you for your comments. Please see my response below.

I definitely do not want to fight for or against this change in gcc, but there are legitimate concerns to consider. I think it would really be good to consider this change to make things more compatible (at least between clang/llvm and gcc, which can both be used within the same ecosystem). There are real practical benefits to having true lock-free double-width operations when implementing algorithms that rely on ABA tagging for pointers, and C11 at last gives an opportunity to do that without resorting to assembly or platform-specific implementations.

> Note that there's more issues to that than just behavior on readonly memory:
> you need to ensure that the whole program, including all static and shared
> libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level
> support to ensure that), or you'd need to be sure that it's safe to mix code
> compiled with different -mcx16 settings because it never happens to interop
> on wide atomic objects.

Well, if libatomic is already doing it when the corresponding CPU feature is available (i.e., effectively implementing operations using cmpxchg16b), I do not see any problem here. -mcx16 implies that you *have* cmpxchg16b; other code compiled without the -mcx16 flag will go to libatomic. Inside libatomic, it will detect that cmpxchg16b *is* available, thus making code compiled with and without the -mcx16 flag completely compatible on a given system. Or am I missing something here?

If you do not have cmpxchg16b but the program is compiled with the flag, it will simply not run (as expected). So, in other words, libatomic should still decide whether you have cmpxchg16b for cases when -mcx16 is not specified. But if it is specified, cmpxchg16b can be generated unconditionally. If you want better compatibility, you will not specify the flag. A mix of -mcx16 and -mno-cx16 code will thus be binary compatible.

> Note that there's no "load" function in the __sync family, so the original
> concern about operations on readonly memory does not apply.

Yes, but per the clarification from WG14, read-only memory should not be a concern at all, as this behavior is not specified anyway (regardless of the const qualifier). Read-modify-write is allowed for atomic_load as long as there is no 'visible' change to the value being loaded. In this sense, the bug that was filed previously regarding read-only memory accesses and the const qualifier does not seem to be valid. Additionally, it is really odd and counterintuitive to still provide support for (almost) deprecated builtins while not giving such an opportunity to newer and more advanced functions.

> You don't mention it directly, so just to make it clear for readers: on systems
> where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do
> exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to
> lock-free RMW implementations if so. (I don't like this solution)

Yes, but libatomic makes things slower due to indirection. Also, it is much harder to track what is going on, as there is no guarantee of lock-freedom in this case.

BTW, the fact that it currently uses cmpxchg16b if available may actually make it easier to switch to the suggested behavior without breaking binary compatibility (if I understand everything correctly).

-- Ruslan
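For readers who have not seen a read-modify-write load, here is a minimal sketch in C11. It uses a 64-bit object only so that it builds without libatomic; the 128-bit case discussed in this thread has the same shape with cmpxchg16b underneath. The helper name load_via_cas is hypothetical and is not part of any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: read *obj using only compare-and-swap.
 * The guess is arbitrary. If *obj equals the guess, the CAS writes the
 * same value back; otherwise it fails and copies the current value into
 * `expected`. Either way `expected` ends up holding the current value.
 * The object must be writable, so this would fault on read-only memory. */
static uint64_t load_via_cas(_Atomic uint64_t *obj)
{
    const uint64_t guess = 0;
    uint64_t expected = guess;
    atomic_compare_exchange_strong(obj, &expected, guess);
    return expected;
}
```

Note that the parameter is not const-qualified: the sketch makes the write access explicit, which is exactly the point of contention around DR 459.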
Re: Fw: GCC interpretation of C11 atomics (DR 459)
> I'd say the main issue is that libatomic is not guaranteed to work like that.
> Today it relies on IFUNC for redirection, so you may (and not "will") get the
> desired behavior on Glibc (implying Linux), not on other OSes, and neither on
> Linux with non-GNU libc (nor on bare metal, for that matter).

I think that if IFUNC is not available (i.e., outside glibc), redirection is still possible by introducing a regular function pointer there. Yes, it is an extra cost, but it is better than nothing (plus it gives consistent behavior on all platforms), and it probably will not add too much anyway because there is already a performance hit from going to libatomic.
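A rough sketch of what such function-pointer redirection could look like, assuming a one-time CPU probe on first use. This is not libatomic's actual code: every name here (wide_load, load_impl, cpu_has_cx16, and so on) is hypothetical, and 64-bit objects stand in for 128-bit ones so that the example builds without libatomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/clang header providing __get_cpuid */
#endif

typedef uint64_t (*load_fn)(_Atomic uint64_t *);

/* Fallback path: serialize readers with one global spinlock. */
static atomic_flag fallback_lock = ATOMIC_FLAG_INIT;

static uint64_t load_locked(_Atomic uint64_t *obj)
{
    while (atomic_flag_test_and_set_explicit(&fallback_lock, memory_order_acquire))
        ;                       /* spin until the lock is free */
    uint64_t v = atomic_load_explicit(obj, memory_order_relaxed);
    atomic_flag_clear_explicit(&fallback_lock, memory_order_release);
    return v;
}

/* Fast path: stand-in for a cmpxchg16b-based routine. */
static uint64_t load_cas(_Atomic uint64_t *obj)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(obj, &expected, 0);
    return expected;
}

/* Run-time probe: CPUID leaf 1, ECX bit 13 reports CMPXCHG16B. */
static bool cpu_has_cx16(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx >> 13) & 1u;
#else
    return false;               /* assumption: other targets take the fallback */
#endif
}

static uint64_t load_resolve(_Atomic uint64_t *obj);

/* Starts at the resolver and is replaced by the chosen routine on first call. */
static _Atomic(load_fn) load_impl = load_resolve;

static uint64_t load_resolve(_Atomic uint64_t *obj)
{
    load_fn chosen = cpu_has_cx16() ? load_cas : load_locked;
    atomic_store_explicit(&load_impl, chosen, memory_order_release);
    return chosen(obj);
}

/* Entry point: one extra indirect call compared with inline code. */
uint64_t wide_load(_Atomic uint64_t *obj)
{
    load_fn f = atomic_load_explicit(&load_impl, memory_order_acquire);
    return f(obj);
}
```

The choice is made once per process, so all callers agree on a single implementation; what IFUNC would add is doing the same resolution at dynamic-link time instead of on the first call.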
Re: GCC interpretation of C11 atomics (DR 459)
Thank you for more comments; my response is below.

On Mon, 26 Feb 2018, Szabolcs Nagy wrote:
> rmw load is only valid if the implementation can
> guarantee that atomic objects are never read-only.

But per the response from WG14 regarding DR 459, which I quoted, the standard does not seem to define behavior for read-only memory (and the const qualifier should not suggest that). RMW, according to them, is fine for atomic_load.

> current implementations on linux (including clang)
> don't do that, so an rmw load can observably break
> conforming c code: a static global const object is
> placed in .rodata section and thus rmw on it is a
> crash at runtime contrary to c standard requirements.

I have just tried to compile the code using clang. The latest stable version of clang seems to emit cmpxchg16b for the code you mentioned if I specify -mcx16. If I do not, it redirects to libatomic. (I have not tried the version from trunk, though.)

On Monday, February 26, 2018 8:57 AM, Alexander Monakov wrote:
> OK, but that sounds like a matter of not emitting atomic
> objects into .rodata, which shouldn't be a big problem,
> if not for backwards compatibility concern?

I agree, sounds like a good idea. Certainly for _Atomic objects > 8 bytes.

> and then with new enough libatomic on Glibc this segfaults
> with GCC on x86_64 too due to IFUNC redirection mentioned
> in the other subthread.

Seems like it is a problem anyway. Another reason to never emit _Atomic objects into .rodata.
Re: GCC interpretation of C11 atomics (DR 459)
On Monday, February 26, 2018 1:15 PM, Florian Weimer wrote:
> I think x86-64 should be able to do atomic load and store via SSE2
> registers, but perhaps if the memory is suitably aligned (which is the
> other problem -- the libatomic code will work irrespective of alignment, as
> far as I understand it).

IIRC, it is not always guaranteed to be atomic, so RMW is probably the only safe option for x86-64. And for ARM64, too, as far as I understand.

Just to summarize what can be done if the proposed change is accepted (from the discussion so far):

1. _Atomic objects larger than 8 bytes should not be placed in .rodata even if declared const. It can also be specified that atomic_load should not be used on read-only memory with double-width operations.

2. libatomic can be modified to redirect to functions that use cmpxchg16b (whenever available on the target CPU) through regular function pointers even if IFUNC is not available. This will provide consistent behavior everywhere, and binary compatibility between -mcx16 and -mno-cx16 code.

3. Never redirect to libatomic for arm64 (since ldaxp/stlxp are available); redirect for x86-64 only if -mcx16 is not specified. For ARM64, there is no -mcx16 option at all.

-- Ruslan
Re: GCC interpretation of C11 atomics (DR 459)
Torvald, thank you for your input. See my response below.

On Monday, February 26, 2018 1:35 PM, Torvald Riegel wrote:
> ... does not imply this latter statement. The statement you cited is
> about what the standard itself requires, not what makes sense for a
> particular implementation.

True, but it makes sense to provide true atomics when they are available. Since the standard seems to allow an atomic_load implementation using RMW, this does not seem to be a problem. In fact, the lock_free flag for this type can return true only if -mcx16 is specified; otherwise it returns false (since availability can only be determined at run time, the worst case is assumed).

> So, in such a case, using the wide CAS for
> atomic loads breaks a reasonable assumption. Moreover, it's also a
> special case, in that 32b atomics do work as intended.

But in this case a programmer already makes an assumption that atomic_load does not use RMW, which C11 does not seem to guarantee. Of course, for single-width operations, the programmer may in most practical cases assume it (even though there is no guarantee). Anyway, there is no good solution here for double-width operations, and the programmer should not assume it is possible when writing portable code. In fact, a lock-based solution is even more confusing and potentially error-prone (e.g., it cannot be safely used inside signal handlers since it is not lock-free).

> The behavior you favor would violate that, and
> there's no portable way to distinguish one from the other.

There is already a similar problem with IFUNC (when used with Linux and glibc). In fact, I do not see any difference here. Redirection to libatomic when -mcx16 is specified just adds extra cost and less predictable behavior. Moreover, it seems counterintuitive: I specify a flag saying that cmpxchg16b is supported, but gcc still does not use it (at least directly). It is possible to change libatomic to always use cmpxchg16b when available (even on systems without IFUNC); this way it is totally consistent and binary compatible for code compiled with and without -mcx16.

> I see your point in wanting to have a builtin or such for the 64b atomic
> CAS. However, IMO, this doesn't fit into the world of C11/C++11
> atomics, and thus rather should be accessible through a separate
> interface.

Why not? If atomic_load is not really an issue, then it may be good to use a standardized interface.
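As an illustration of the "lock_free flag" idea, the following sketch asks the compiler at compile time whether an object of a given size is always lock-free. It relies on the GCC/clang builtin __atomic_always_lock_free; the struct tagged_ptr and the helper names are hypothetical. What the 16-byte query returns depends on the compiler, the target and flags such as -mcx16, so no particular result is claimed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A pointer plus a version tag: 16 bytes on typical 64-bit targets. */
struct tagged_ptr {
    void     *ptr;
    uintptr_t tag;
};

/* Compile-time query by object size. A null second argument asks the
 * builtin to assume typical alignment for an object of that size. */
#define ALWAYS_LOCK_FREE(type) __atomic_always_lock_free(sizeof(type), 0)

/* Single-width case: expected to be true on mainstream targets. */
static bool word_is_lock_free(void)
{
    return ALWAYS_LOCK_FREE(uint32_t);
}

/* Double-width case: the answer this thread is arguing about. */
static bool wide_is_lock_free(void)
{
    return ALWAYS_LOCK_FREE(struct tagged_ptr);
}
```

The standard run-time query, atomic_is_lock_free(&obj), answers the same question per object, but for wide types it may itself turn into a library call and therefore may need libatomic at link time.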
Re: Fw: GCC interpretation of C11 atomics (DR 459)
Torvald, I definitely do not want to insist on this design choice, but it makes sense to at least seriously consider it given the concerns I described. And especially because IFUNC in libatomic already redirects to cmpxchg16b, so it just adds extra cost and indirection. Quite frankly, I do not even see any serious problem here with respect to binary compatibility. Even if cmpxchg16b was not used on some platforms outside Linux, old binaries will go to libatomic, which can now be updated to simply use cmpxchg16b. (Even statically linked binaries should not be an issue: they will not have any direct interaction with newer binaries.)

> Not getting the performance usually associated with atomic loads can be
> a big problem for code that tries to be portable.

I do not think it is a common use case anyway. How often is atomic_load used on double-width types? If a programmer needs some guarantees and does not care about lock-freedom, why not use a regular lock here? This way nothing magical happens. Otherwise, they may hit unexpected issues in places like signal handlers (which are hard to debug since the program will hang only once in a while). With cmpxchg16b, the failure is at least more or less reproducible: if you try to use it on read-only memory, you will immediately get a segfault.

> I think I now remember why we "didn't fix" libatomic: There might be
> compiled code out there that does use the wide CAS, so changing
> libatomic from the status quo to using its internal locks could break
> programs.

Well, it already happens for Linux and glibc. Nothing will break there. For other platforms, it would be good to implement the same, so that consistent behavior is observed everywhere.

> No, they only said that it doesn't need to be a concern for the
> standard. Implementations have to pay attention to more things, so it
> is a concern for implementation.

Yes, but the only problem I see is that the object is currently placed in .rodata when const is used. It is easy to resolve: just do not place _Atomic objects > 8 bytes there. Then also clarify that a programmer cannot safely cast some arbitrary object that may be placed in .rodata for use with atomic_load. It needs to be addressed anyway, as there is already a segfault for the provided example on x86-64 Linux even with redirection to libatomic.

> It's not "visible" in the abstract machine under some setting of the
> as-if rule. But it is definitely visible in an implementation in which
> the effects of read-only memory are visible (see my example of mapping
> memory from another process read-only so as to read data from that
> process).

True, but it is not defined for read-only memory anyway, and no assumptions can be made in portable code.

-- Ruslan
Re: Fw: GCC interpretation of C11 atomics (DR 459)
Thanks, everyone, for the input; it is very useful. I am just proposing to consider the change unless there are clear roadblocks. (Either design choice is probably OK with respect to the standard, formally speaking, but there are some clear advantages also.) I wrote a summary of pros and cons (which, of course, is slightly biased towards the change :) ). I also opened Bug 84563 with the rationale.

Pros of the proposed approach:

1. Ability to use guaranteed lock-free double-width atomics (when -mcx16 is specified for x86-64, and always for arm64) in a more or less portable manner across the supported architectures (without resorting to non-standard extensions or writing separate assembly code for each architecture). Hopefully, the behavior may also become more or less consistent across different compilers over time. It is already the case for clang/llvm. As mentioned, double-width lock-free atomics have real practical use (ABA tags for pointers).

2. A bug is more likely to be found immediately if a programmer tries to do something that is not guaranteed by the standard (i.e., getting a segfault on read-only memory when using double-width atomic_load). This is true even if -mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other hand, atomic_load implemented through locks may cause hard-to-find and hard-to-debug issues in signal handlers, interrupt contexts, etc. when a programmer erroneously assumes that atomic_load is non-blocking.

3. For arm64, the corresponding instructions are always available, so there is no need for an -mcx16 flag or redirection to libatomic at all (libatomic may still keep the old implementation for backward compatibility).

4. Faster and easier-to-analyze code when -mcx16 is specified.

5. Ability to tell for sure whether the implementation is lock-free by checking the corresponding C11 flag when -mcx16 is specified. When unspecified, the flag will be false to accommodate the worst-case scenario.

6. Consistent behavior on all platforms regardless of IFUNC, the -mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that do not support IFUNC will use function pointers for redirection). The only things the -mcx16 flag changes are removing the indirection through libatomic and giving a guaranteed lock_free flag for the corresponding types. (BTW, in practice, if you use the flag, you should already know what you are doing.)

7. Ability to finally deprecate the old __sync builtins and use the newer, more advanced __atomic builtins everywhere.

Cons of the proposed approach:

1. The compiler may place const atomic objects in .rodata. (Avoided by making sure _Atomic objects of size > 8 are not placed in .rodata, and by clarifying that casting arbitrary .rodata objects for use with double-width atomics is undefined and not allowed.)

2. Backward compatibility concerns if used outside glibc/IFUNC. Most likely this is not an issue even in this case, since all calls there are already redirected to libatomic anyway, and statically linked binaries will not interact with new binaries directly.

3. Read-only memory will not be supported for atomic_load on double-width types. But that is actually better than sweeping the problem under the carpet (the current behavior is actually even worse because it is inconsistent across platforms, i.e., different for x86-64 Linux and arm64). Anyway, it is better to use a lock-based approach explicitly if for whatever reason it is preferable (read-only memory, performance (?), etc.).

-- Ruslan
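To make the "ABA tags for pointers" use case concrete, here is a small sketch of a tagged lock-free stack. It is a scaled-down analogue and not code from this thread: instead of a 64-bit pointer plus a 64-bit tag (which needs a 128-bit CAS), it packs a 32-bit node index and a 32-bit tag into one 64-bit atomic so that it builds without libatomic. The pool, NIL, push and pop names are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NIL   UINT32_MAX        /* "no node" index */
#define NODES 8

/* Node pool: nodes are named by index so that index + tag fit in 64 bits. */
static struct {
    _Atomic uint32_t next;
    int              value;
} pool[NODES];

/* Stack head: low 32 bits = index of the top node, high 32 bits = tag. */
static _Atomic uint64_t head = NIL;

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)h; }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

static void push(uint32_t idx, int value)
{
    uint64_t old = atomic_load(&head);
    pool[idx].value = value;
    do {
        atomic_store_explicit(&pool[idx].next, idx_of(old), memory_order_relaxed);
    } while (!atomic_compare_exchange_weak(&head, &old, pack(idx, tag_of(old) + 1)));
}

/* Returns the popped node index, or NIL if the stack is empty. */
static uint32_t pop(void)
{
    uint64_t old = atomic_load(&head);
    while (idx_of(old) != NIL) {
        uint32_t next = atomic_load_explicit(&pool[idx_of(old)].next,
                                             memory_order_relaxed);
        /* The tag changes on every successful update, so a stale `old`
         * fails here even if the same index was popped and pushed back
         * in the meantime (the ABA case). */
        if (atomic_compare_exchange_weak(&head, &old, pack(next, tag_of(old) + 1)))
            return idx_of(old);
    }
    return NIL;
}
```

With real pointers instead of pool indices, the head becomes a 16-byte object, and this is the point where a guaranteed lock-free double-width compare-and-swap matters.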
Re: Fw: GCC interpretation of C11 atomics (DR 459)
Formally speaking, either implementation satisfies C11 because the standard allows much leeway in the interpretation here. But, of course, it is kind of annoying that double-width types (and that potentially includes 64-bit types on some 32-bit processors; e.g., i586 also has cmpxchg8b and no other official way to read 64 bits atomically) need special handling and compiler extensions. This basically means that in a number of cases I cannot write portable code; I need to add a bunch of architecture-dependent ifdefs even for, say, 64-bit atomics. (And all this mess is to accommodate the almost non-existent case when someone wants to use atomic_load on read-only memory for wide types, for which no good solution exists anyway.)

In particular, imagine someone writing lock-free code for different types (in templates, macros, etc.). It basically uses the same C11 atomic primitives but for various integer sizes. Now I need special handling for larger types, which I otherwise would not need, because whatever libatomic provides does not guarantee lock-freedom (i.e., it is useless). It is true that wider types may not be available on all architectures, but I would prefer to have generic and standard-conformant code at least for those that have them.

> That's a valid goal, but it does not imply that we should mess with how
> atomics are implemented by default, nor should we mess with the default
> use cases. This goal wants something special, and that is exposing the
> fact that *only* a CAS is available to synchronize atomically on a
> particular type. That is an extension of the existing atomics design.

See above.

> The standard doesn't specify read-only memory, so it also doesn't forbid
> the concept. The implementation takes it into account though, and thus
> it's defined in that context.

But my point is that a programmer cannot rely on this feature anyway unless she/he wants to write code that compiles only with gcc. It is unspecified by the standard, and implementations that use read-modify-write for atomic_load are perfectly valid. The whole point of having this standard in the first place is to allow code to be compiled by different compilers; otherwise people can just rely on gcc-specific extensions.

> The topic we're currently discussing does not significantly affect when
> we can remove __sync builtins, IMO.

They are the only builtins that directly expose double-width operations. Short of assembly fallbacks, they are the only option right now.

> They do care about whether atomic operations are natively supported on
> that particular type -- and that should include a load.

I think the whole point of having atomic operations is the ability to provide lock-free operations whenever possible. Even though the standard does not guarantee it, that is almost the only sane use case. Otherwise, there is no point: you can always use locks. If they do not care about lock-freedom, they should just use locks.

> Nobody is proposing to mark things as lock-free if they aren't. Thus, I
> don't see any change to what's usable in signal handlers.

It is not obvious to anyone that atomic_load will block. It will *not* for single-width types. So, again, we see differences between single- and double-width types. Even though you do not have problems with read-only memory, you have another problem for double-width types, which may be even more subtle and much harder to debug in a number of cases. Of course, no one can assume that it will not block, but the same can be said about read-only memory.

Anyway, I do not have a horse in the race... I just proposed to consider this change for a number of legitimate use cases, but it is eventually up to the gcc developers to decide.

-- Ruslan
Re: Fw: GCC interpretation of C11 atomics (DR 459)
> 1) your proposal would make gcc non-conforming to iso c unless it changes how
> static const objects are emitted.

I do not think ISO C requires const objects to be placed in .rodata. And this is easily solved by not placing _Atomic objects there if they cannot be safely loaded from read-only memory.

> 2) the two implementations are not abi compatible, the choice is already
> made, changing it is an abi break.

Since the current implementation redirects to libatomic anyway, almost nothing should break. The only case that will break is if somebody erroneously used atomic_load for a 128-bit type on read-only memory (which is, again, not guaranteed by the standard). In practice, this case is almost non-existent. The worst that may happen is that you will get a segfault right away.

> 3) Torvald pointed out further considerations such as users expecting
> lock-free atomic loads to be faster than stores.

Is it even true? Is it faster to use some global lock (implemented through RMW) than a single RMW operation? If you use this global lock, you will not get loads faster than stores.
Re: Fw: GCC interpretation of C11 atomics (DR 459)
Torvald, thank you for your input, but I think this discussion is getting a little pointless. There is nothing else I can add since the gcc folks are reluctant to make this change anyway. In my opinion, there is no compelling reason against such an implementation (it is perfectly fine with the standard; read-only memory is not guaranteed for atomic_load anyway). Even the binary compatibility that was mentioned is unlikely to be an issue if implemented as I described. And finally, this is something that can actually be useful in practice (at least as far as I can judge from my experience).

By the way, this issue has already been raised multiple times during the last couple of years by different people who actually use it in various real projects (the bugs were eventually closed as 'INVALID'). All the described challenges are purely technical and can easily be resolved. Moreover, clang/llvm chose this implementation, and it seems very logical and non-confusing to me. It certainly makes sense to expose hardware capabilities through standard interfaces whenever possible.

For my projects, I will simply fall back to my own implementation using inline assembly (at least for now) because, unfortunately, it is the only thing that is guaranteed to work outside of clang/llvm in the foreseeable future (__sync functions have some limitations and do not look like an attractive option either, by the way).

On Tuesday, February 27, 2018 11:21 AM, Torvald Riegel wrote:

On Tue, 2018-02-27 at 13:16 +0000, Ruslan Nikolaev via gcc wrote:
> > 3) Torvald pointed out further considerations such as users expecting
> > lock-free atomic loads to be faster than stores.
>
> Is it even true? Is it faster to use some global lock (implemented through
> RMW) than a single RMW operation? If you use this global lock, you will not
> get loads faster than stores.

If GCC declares a type as lock-free, atomic loads on this type will be natively supported through some sort of load instruction. That means they are faster than stores under concurrent accesses, in particular when there are concurrent atomic loads (for all major HW we care about). If there is no natively supported atomic load, GCC will not declare the type to be lock-free. Nobody made a statement about performance of locks vs. RMWs.
Re: GCC interpretation of C11 atomics (DR 459)
> Consider a producer-consumer relationship between two processes where
> the producer doesn't want to wait for the consumer. For example, the
> producer could be an application that's being traced, and the consumer
> is a trace aggregation tool. The producer can provide a read-only
> mapping to the consumer, and put a nonblocking ring buffer or something
> similar in there. That allows the consumer to read, but it still needs
> atomic access because the consumer is modifying the ring buffer
> concurrently.

Sorry for getting into someone else's conversation... And what good solution does gcc offer right now? It forces a lock-based approach (BTW: a global lock!) on *both* producer and consumer if we are talking about 128-bit types. Therefore, sometimes producers *will* wait (by effectively blocking). Basically, it becomes useless. In this case, I would rather use a lock-based approach that at least does not use a global lock. In contrast, the alternative implementation would at least be useful when both producers and consumers have full (RW) access.

Anyway, I already said that I personally will go with inline assembly for now. I just wanted to raise this concern since other people may find it useful in their projects.
Re: GCC interpretation of C11 atomics (DR 459)
> But we're not talking about that special case of 128b types here. The
> majority of synchronization doesn't need more than machine word size.

Then why do you worry about read-only access for 128b types? (It is a special case anyway.)

> No, such a program would have a bug anyway. It wouldn't even
> synchronize properly.

Therefore, it was not a valid example / use case (for 128-bit) in the first place. It was a *valid* example for smaller atomics, though. But that is exactly my point: your current solution for 128-bit types does not add any practical value except when you want a lock-based solution (but see my explanation below).

> The lock would need to be shared between processes in the example I
> gave. You have to build your own lock for that currently, because C/C++
> don't give you any process-shared locks.

At least in Linux, you can simply use eventfd(2) to do it reliably (without relying on an "array of locks"). Given that it is not a very common use case, it does not seem to need special support in the C standard. And whatever C11 provides cannot be relied upon anyway, since you do not have strict guarantees that read-only memory is supported for larger types. For example, clang (and possibly other compilers) will break this assumption. At least, I would prefer to use eventfd in my application (if ever needed at all) since it has reliable and well-defined behavior in Linux.
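For reference, a minimal Linux-only sketch of signalling through eventfd(2). The helper names notify and wait_events are hypothetical. For brevity both ends run in one process here; sharing the descriptor with another process (for example by inheriting it across fork or passing it over a Unix socket) is assumed and not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Add `n` to the eventfd counter; returns 0 on success, -1 on error. */
static int notify(int efd, uint64_t n)
{
    return write(efd, &n, sizeof n) == (ssize_t)sizeof n ? 0 : -1;
}

/* Block until the counter is nonzero, then return it and reset it to 0.
 * Returns 0 on error. */
static uint64_t wait_events(int efd)
{
    uint64_t n = 0;
    if (read(efd, &n, sizeof n) != (ssize_t)sizeof n)
        return 0;
    return n;
}
```

A caller would create the descriptor with eventfd(0, 0), call notify() on the producing side and wait_events() on the consuming side, and close() the descriptor when done. In this default (non-semaphore) mode the kernel accumulates the written values, so several notifications may be collapsed into one wake-up.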
Re: GCC interpretation of C11 atomics (DR 459)
Torvald, I think this discussion is indeed becoming pointless. Some of your responses clearly take my comments out of the larger picture and context of the discussion. One thing is clear: either implementation is fine with the standard (formally speaking), simply because the standard allows too much leeway in how atomics are implemented. In fact, as I mentioned, clang/llvm implements it differently. I actually find this a weakness of the standard, because for code that is portable across different compilers, the only thing you can more or less safely assume is single-width types.

Thank you for your input and discussion.