Re: MSP430 in gcc4.9 ... enable interrupts?
On 18/02/14 00:12, DJ Delorie wrote: >> I presume these will be part of the headers for the library >> distributed for msp430 gcc by TI/Redhat? > > I can't speak for TI's or Red Hat's plans. GNU's typical non-custom > embedded runtime is newlib/libgloss, which usually doesn't have that > much in the way of chip-specific headers or library functions. Fair enough. I don't know if Red Hat will be distributing anything themselves, but I'm confident that TI will distribute chip-specific headers with pre-built msp430 gcc packages (their aim, after all, is to get more people to buy msp430 chips - and making the tools as easy and powerful as possible is part of that). I was just wondering where these headers would fit in. > >> is that for the "critical" attribute that exists in the old msp430 >> port (which disables interrupts for the duration of the function)? > > Yes, for things like that. They're documented under "Function > Attributes" in the "Extensions to the C Language Family" chapter of > the current GCC manual. > Ah yes, I missed it when I first looked - the documentation makes "critical" look like an option to the "interrupt" attribute, rather than a stand-alone attribute. It seems to me a rather strange idea to have both "interrupt" and "critical" on the same function - an "interrupt" function is inherently "critical" (and "reentrant") in that interrupts are disabled before it enters, and restored or re-enabled on exit. It would make a difference on processors like ARMs with several interrupt levels, but not on an msp430 with its single level. David
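For reference, the two attributes David is discussing are typically spelled like this in C (a minimal sketch of mine, not from the thread; the vector number 9 and the variable names are illustrative placeholders):

    volatile unsigned int tick_count;

    /* Interrupt handler: interrupts are already disabled on entry and the
       previous state is restored on return, so no extra protection is
       needed inside the handler itself. */
    void __attribute__((interrupt(9)))
    timer_isr(void)
    {
        tick_count++;
    }

    /* Stand-alone "critical" function: the compiler wraps the body with
       code that disables interrupts on entry and restores them on exit. */
    unsigned int __attribute__((critical))
    read_tick_count(void)
    {
        return tick_count;
    }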
Re: TYPE_BINFO and canonical types at LTO
On Mon, 17 Feb 2014, Jan Hubicka wrote: > > > > Yeah, ok. But we treat those types (B and C) TBAA equivalent because > > structurally they are the same ;) Luckily C has a "proper" field > > for its base (proper means that offset and size are correct as well > > as the type). It indeed has DECL_ARTIFICIAL set and yes, we treat > > those as "real" fields when doing the structural comparison. > > Yep, the difference is that depending if C or D win, we will end up walking > the > BINFO or not. So we should not depend on the BINFo walk for correctness. > > > > More interesting is of course when we can re-use tail-padding in > > one but not the other (works as expected - not merged). > > Yep. > > > > struct A { A (); short x; bool a;}; > > struct C:A { bool b; }; > > struct B {struct A a; bool b;}; > > struct C *p2; > > struct B *p1; > > int > > t() > > { > > p1->a.a = 2; > > return p2->a; > > } > > > > > Yes, zero sized classes are those having no fields (but other stuff, > > > type decls, bases etc.) > > > > Yeah, but TBAA obviously doesn't care about type decls and bases. > > So I guess the conclussion is that the BINFO walk in alias.c is pointless? Yes. But as I said - I remember being there and proposing to remove it. Some N > 5 years ago or so and it was either rejected or it didn't work out ;) > Concerning the merging details and LTO aliasing, I think for 4.10 we > should make C++ to compute mangled names of types (i.e. call > DECL_ASSEMBLER_NAME on the associated type_decl + explicitly mark that > type is driven by ODR) and then we can do merging driven by ODR rule. > > Non-ODR types born from other frontends will then need to be made to > alias all the ODR variants that can be done by storing them into the > current canonical type hash. > (I wonder if we want to support cross language aliasing for non-POD?) Surely for accessing components of non-POD types, no? Like class Foo { Foo(); int *get_data (); int *data; } glob_foo; extern "C" int *get_foo_data() { return glob_foo.get_data(); } ? But you are talking about the "tree" merging part using ODR info to also merge types which differ in completeness of contained pointer types, right? (exactly equal cases should be already merged) The canonical type computation happens separately (only for prevailing types, of course), and there we already "merge" types which differ in completeness. Canonical type merging is conservative the other way aroud - if we merge _all_ types to a single canonical type then TBAA is still correct (we get a single alias set). > I also think we want explicit representation of types known to be local > to compilation unit - anonymous namespaces in C/C++, types defined > within function bodies in C and god knows what in Ada/Fortran/Java. But here you get into the idea of improving TBAA, thus having _more_ distinct canonical types? Just to make sure to not mix those two ;) And whatever "frontend knowledge" we want to excercise - please make sure we get a reliable way for the middle-end to see that "frontend knowledge" (no langhooks!). Thus, make it "middle-end knowledge". Oh - and the easiest way to improve things is to get less types into the merging process in the first place! Richard.
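As a small aside on the "types local to a compilation unit" point, a minimal C illustration (my own sketch, not from the thread): a struct type declared inside a function body cannot be named from any other translation unit, so in principle LTO could treat it as TU-local and skip cross-unit merging for it entirely.

    /* The struct type below is inherently local to this translation unit:
       no other unit can refer to it, so its canonical type never needs to
       participate in cross-unit merging.  Names are purely illustrative. */
    static int sum_pair(int a, int b)
    {
        struct local_pair { int x; int y; } p = { a, b };
        return p.x + p.y;
    }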
Re: TYPE_BINFO and canonical types at LTO
On Tue, 18 Feb 2014, Richard Biener wrote: > On Mon, 17 Feb 2014, Jan Hubicka wrote: > > > > > > > Yeah, ok. But we treat those types (B and C) TBAA equivalent because > > > structurally they are the same ;) Luckily C has a "proper" field > > > for its base (proper means that offset and size are correct as well > > > as the type). It indeed has DECL_ARTIFICIAL set and yes, we treat > > > those as "real" fields when doing the structural comparison. > > > > Yep, the difference is that depending if C or D win, we will end up walking > > the > > BINFO or not. So we should not depend on the BINFo walk for correctness. > > > > > > More interesting is of course when we can re-use tail-padding in > > > one but not the other (works as expected - not merged). > > > > Yep. > > > > > > struct A { A (); short x; bool a;}; > > > struct C:A { bool b; }; > > > struct B {struct A a; bool b;}; > > > struct C *p2; > > > struct B *p1; > > > int > > > t() > > > { > > > p1->a.a = 2; > > > return p2->a; > > > } > > > > > > > Yes, zero sized classes are those having no fields (but other stuff, > > > > type decls, bases etc.) > > > > > > Yeah, but TBAA obviously doesn't care about type decls and bases. > > > > So I guess the conclussion is that the BINFO walk in alias.c is pointless? > > Yes. But as I said - I remember being there and proposing to remove > it. Some N > 5 years ago or so and it was either rejected or it didn't > work out ;) Btw, a bootstrap & regtest worked fine with removing that loop (not that this proves anything). Richard.
Re: [RFC][PATCH 0/5] arch: atomic rework
Several of you have said that the standard and compiler should not permit speculative writes of atomics, or (effectively) that the compiler should preserve dependencies. In simple examples it's easy to see what that means, but in general it's not so clear what the language should guarantee, because dependencies may go via non-atomic code in other compilation units, and we have to consider the extent to which it's desirable to limit optimisation there. For example, suppose we have, in one compilation unit: void f(int ra, int*rb) { if (ra==42) *rb=42; else *rb=42; } and in another compilation unit the bodies of two threads: // Thread 0 r1 = x; f(r1,&r2); y = r2; // Thread 1 r3 = y; f(r3,&r4); x = r4; where accesses to x and y are annotated C11 atomic memory_order_relaxed or Linux ACCESS_ONCE(), accesses to r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. (Of course, this is an artificial example, to make the point below as simply as possible - in real code the branches of the conditional might not be syntactically identical, just equivalent after macro expansion and other optimisation.) In the source program there's a dependency from the read of x to the write of y in Thread 0, and from the read of y to the write of x on Thread 1. Dependency-respecting compilation would preserve those and the ARM and POWER architectures both respect them, so the reads of x and y could not give 42. But a compiler might well optimise the (non-atomic) body of f() to just *rb=42, making the threads effectively // Thread 0 r1 = x; y = 42; // Thread 1 r3 = y; x = 42; (GCC does this at O1, O2, and O3) and the ARM and POWER architectures permit those two reads to see 42. That is moreover actually observable on current ARM hardware. So as far as we can see, either: 1) if you can accept the latter behaviour (if the Linux codebase does not rely on its absence), the language definition should permit it, and current compiler optimisations can be used, or 2) otherwise, the language definition should prohibit it but the compiler would have to preserve dependencies even in compilation units that have no mention of atomics. It's unclear what the (runtime and compiler development) cost of that would be in practice - perhaps Torvald could comment? For more context, this example is taken from a summary of the thin-air problem by Mark Batty and myself, , and the problem with dependencies via other compilation units was AFAIK first pointed out by Hans Boehm. Peter
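For concreteness, here is one way the example could be written in C11 with <stdatomic.h> (a sketch of mine; in the scenario described above f() would live in a separate compilation unit, and the question is whether both relaxed loads may return 42):

    #include <stdatomic.h>

    atomic_int x, y;                 /* both initially 0 */

    /* In the scenario above this definition sits in another compilation
       unit, so the compiler optimising it cannot see the atomics at all. */
    void f(int ra, int *rb)
    {
        if (ra == 42)
            *rb = 42;
        else
            *rb = 42;
    }

    void thread0(void)
    {
        int r2;
        int r1 = atomic_load_explicit(&x, memory_order_relaxed);
        f(r1, &r2);
        atomic_store_explicit(&y, r2, memory_order_relaxed);
    }

    void thread1(void)
    {
        int r4;
        int r3 = atomic_load_explicit(&y, memory_order_relaxed);
        f(r3, &r4);
        atomic_store_explicit(&x, r4, memory_order_relaxed);
    }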
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: > Several of you have said that the standard and compiler should not > permit speculative writes of atomics, or (effectively) that the > compiler should preserve dependencies. The example below only deals with control dependencies; so I'll limit myself to that. > In simple examples it's easy > to see what that means, but in general it's not so clear what the > language should guarantee, because dependencies may go via non-atomic > code in other compilation units, and we have to consider the extent to > which it's desirable to limit optimisation there. > > For example, suppose we have, in one compilation unit: > > void f(int ra, int*rb) { > if (ra==42) > *rb=42; > else > *rb=42; > } > > and in another compilation unit the bodies of two threads: > > // Thread 0 > r1 = x; > f(r1,&r2); > y = r2; > > // Thread 1 > r3 = y; > f(r3,&r4); > x = r4; > > where accesses to x and y are annotated C11 atomic > memory_order_relaxed or Linux ACCESS_ONCE(), accesses to > r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. So I'm intuitively ok with this, however I would expect something like: void f(_Atomic int ra, _Atomic int *rb); To preserve dependencies and not make the conditional go away, simply because in that case the: if (ra == 42) the 'ra' usage can be seen as an atomic load. > So as far as we can see, either: > > 1) if you can accept the latter behaviour (if the Linux codebase does >not rely on its absence), the language definition should permit it, >and current compiler optimisations can be used, Currently there's exactly 1 site in the Linux kernel that relies on control dependencies as far as I know -- the one I put in. And its limited to a single function, so no cross translation unit funnies there. Of course, nobody is going to tell me when or where they'll put in the next one; since its now documented as accepted practise. However, PaulMck and our RCU usage very much do cross all sorts of TU boundaries; but those are data dependencies. ~ Peter
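A rendering of the variant Peter suggests (my sketch; whether the standard actually requires the conditional to survive in this form is the open question being discussed):

    /* With an _Atomic parameter and an _Atomic target, the plain read of
       ra and the plain stores through rb below are themselves atomic
       accesses (seq_cst by default), which is the "can be seen as an
       atomic load" reading of the suggestion above. */
    void f(_Atomic int ra, _Atomic int *rb)
    {
        if (ra == 42)
            *rb = 42;
        else
            *rb = 42;
    }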
[LM-32] Code generation for address loading
Hi everyone, I am developing on a custom design using the LatticeMico32 architecture and I use gcc 4.5.1 to compile C code for this arch. In this architecture, the loading of an address 0x always takes two assembly instructions to fetch the address because immediates are 16 bits wide: mvhi r1, 0x ori r1, r1, 0x ... lw r2, r1 In my situation, nearly all the symbols are located in the same 64kB region and their addresses share the same hi-part, so I am trying to minimize the overhead of always using two instructions when only one is needed. In the C code, grouping variables in a structure will generate code where memory is accessed through an offset from the structure's base address, which is only loaded once per function, hence a gain in code size. But the extent of this technique is limited by C code readability constraints, and there is still at least one penalty in each function to load the first address. I was imagining something with a fixed register (say r23, declared as -ffixed in compilation flags) that would be loaded with the addresses' hi-part at reset (in the same way that r0 is set to 0), so that each address load would consist of only one instruction: ori r1, r23, 0x I have tried to play around with the machine description file of the lm32 architecture to achieve my goal, but I need some kind of constraint or condition to check where my symbol is before using this shortened form of address loading (even if nearly all of the symbols are located within the same 64kB region, some are located in other memory regions and I don't want to touch them!) Because symbol placement happens at link time, I have little chance of knowing the final symbol address at code generation, but is there some way I could make this work? If I assign the symbol to a dedicated section (with the __attribute__ ((section())) directive), is there a way to know its section during code generation? I understand that I am asking for very 'dangerous' advice, but again, this will only be a custom optimization for a custom design. Thank you. F-X Morel
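For reference, a minimal sketch of the section-attribute idea from the last paragraph (the section name and variable names are placeholders I made up; whether the lm32 backend can be taught to act on the section at code-generation time is exactly the open question):

    /* Group the frequently used globals into one named section so the
       linker script can place them all in a single 64kB region whose
       hi-part would live in the reserved register (r23 in the proposal). */
    #define FAST_DATA __attribute__((section(".fastdata")))

    FAST_DATA int sensor_value;
    FAST_DATA int error_count;

    int read_both(void)
    {
        /* Ideally each address load here would become a single ori off
           r23 instead of an mvhi/ori pair per symbol. */
        return sensor_value + error_count;
    }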
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: > Several of you have said that the standard and compiler should not > permit speculative writes of atomics, or (effectively) that the > compiler should preserve dependencies. In simple examples it's easy > to see what that means, but in general it's not so clear what the > language should guarantee, because dependencies may go via non-atomic > code in other compilation units, and we have to consider the extent to > which it's desirable to limit optimisation there. > > For example, suppose we have, in one compilation unit: > > void f(int ra, int*rb) { > if (ra==42) > *rb=42; > else > *rb=42; > } Hello, Peter! Nice example! The relevant portion of Documentation/memory-barriers.txt in my -rcu tree says the following about the control dependency in the above construct: q = ACCESS_ONCE(a); if (q) { barrier(); ACCESS_ONCE(b) = p; do_something(); } else { barrier(); ACCESS_ONCE(b) = p; do_something_else(); } The initial ACCESS_ONCE() is required to prevent the compiler from proving the value of 'a', and the pair of barrier() invocations are required to prevent the compiler from pulling the two identical stores to 'b' out from the legs of the "if" statement. So yes, current compilers need significant help if it is necessary to maintain dependencies in that sort of code. Similar examples came up in the data-dependency discussions in the standards committee, which led to the [[carries_dependency]] attribute for C11 and C++11. Of course, current compilers don't have this attribute, and the current Linux kernel code doesn't have any other marking for data dependencies passing across function boundaries. (Maybe some time as an assist for detecting pointer leaks out of RCU read-side critical sections, but efforts along those lines are a bit stalled at the moment.) More on data dependencies below... > and in another compilation unit the bodies of two threads: > > // Thread 0 > r1 = x; > f(r1,&r2); > y = r2; > > // Thread 1 > r3 = y; > f(r3,&r4); > x = r4; > > where accesses to x and y are annotated C11 atomic > memory_order_relaxed or Linux ACCESS_ONCE(), accesses to > r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. > > (Of course, this is an artificial example, to make the point below as > simply as possible - in real code the branches of the conditional > might not be syntactically identical, just equivalent after macro > expansion and other optimisation.) > > In the source program there's a dependency from the read of x to the > write of y in Thread 0, and from the read of y to the write of x on > Thread 1. Dependency-respecting compilation would preserve those and > the ARM and POWER architectures both respect them, so the reads of x > and y could not give 42. > > But a compiler might well optimise the (non-atomic) body of f() to > just *rb=42, making the threads effectively > > // Thread 0 > r1 = x; > y = 42; > > // Thread 1 > r3 = y; > x = 42; > > (GCC does this at O1, O2, and O3) and the ARM and POWER architectures > permit those two reads to see 42. That is moreover actually observable > on current ARM hardware. I do agree that this could happen on current compilers and hardware. Agreed, but as Peter Zijlstra noted in this thread, this optimization is to a control dependency, not a data dependency. 
> So as far as we can see, either: > > 1) if you can accept the latter behaviour (if the Linux codebase does >not rely on its absence), the language definition should permit it, >and current compiler optimisations can be used, > > or > > 2) otherwise, the language definition should prohibit it but the >compiler would have to preserve dependencies even in compilation >units that have no mention of atomics. It's unclear what the >(runtime and compiler development) cost of that would be in >practice - perhaps Torvald could comment? For current compilers, we have to rely on coding conventions within the Linux kernel in combination with non-standard extensions to gcc and specified compiler flags to disable undesirable behavior. I have a start on specifying this in a document I am preparing for the standards committee, a very early draft of which may be found here: http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf Section 3 shows the results of a manual scan through the Linux kernel's dependency chains, and Section 4.1 lists a probably incomplete (and no doubt erroneous) list of coding standards required to make dependency chains work on current compilers. Any comments and suggestions are more than welcome.
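A self-contained rendering of the memory-barriers.txt pattern quoted earlier in this message (the ACCESS_ONCE() and barrier() definitions below are the usual kernel-style ones, reproduced only so the sketch stands on its own outside the kernel tree):

    #define ACCESS_ONCE(x)  (*(volatile __typeof__(x) *)&(x))
    #define barrier()       __asm__ __volatile__("" ::: "memory")

    int a, b, p;

    void consumer(void)
    {
        int q = ACCESS_ONCE(a);  /* keeps the compiler from proving a's value */
        if (q) {
            barrier();           /* keeps this store inside this leg ... */
            ACCESS_ONCE(b) = p;
            /* do_something(); */
        } else {
            barrier();           /* ... and this one inside the other leg */
            ACCESS_ONCE(b) = p;
            /* do_something_else(); */
        }
    }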
Re: [RFC][PATCH 0/5] arch: atomic rework
Hi Paul, Thanks for the document. I'm looking forward to reading the bits about dependency chains in Linux. > One point of confusion for me... Example 4 says "language must allow". > Shouldn't that be "language is permitted to allow"? When we say "allow", we mean that the optimised execution should be allowed by the specification, and, implicitly, the unoptimised execution should remain allowed too. We want to be concrete about what the language specification allows, and that's why we say "must". It is not to disallow the unoptimised execution. > Seems like an > implementation is always within its rights to avoid an optimization if > its implementation prevents it from safely detecting the oppportunity > for that optimization. That's right. - Mark > Or am I missing something here? > > Thanx, Paul >
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > And exactly because I know enough, I would *really* like atomics to be > well-defined, and have very clear - and *local* - rules about how they > can be combined and optimized. "Local"? > None of this "if you can prove that the read has value X" stuff. And > things like value speculation should simply not be allowed, because > that actually breaks the dependency chain that the CPU architects give > guarantees for. Instead, make the rules be very clear, and very > simple, like my suggestion. You can never remove a load because you > can "prove" it has some value, but you can combine two consecutive > atomic accesses/ Sorry, but the rules *are* very clear. I *really* suggest to look at the formalization by Batty et al. And in these rules, proving that a read will always return value X has a well-defined meaning, and you can use it. That simply follows from how the model is built. What you seem to want just isn't covered by the model as it is today -- you can't infer from that that the model itself would be wrong. The dependency chains aren't modeled in the way you envision it (except in what consume_mo tries, but that seems to be hard to implement); they are there on the level of the program logic as modeled by the abstract machine and the respective execution/output rules, but they are not built to represent those specific ordering guarantees the HW gives you. I would also be cautious claiming that the rules you suggested would be very clear and very simple. I haven't seen a memory model spec from you that would be viable as the standard model for C/C++, nor have I seen proof that this would actually be easier to understand for programmers in general. > For example, CPU people actually do tend to give guarantees for > certain things, like stores that are causally related being visible in > a particular order. Yes, but that's not part of the model so far. If you want to exploit this, please make a suggestion for how to extend the model to cover this. You certainly expect compilers to actually optimize code, which usually removes or reorders stuff. Now, we could say that we just don't want that for atomic accesses, but even this isn't as clear-cut: How do we actually detect where non-atomic code is intended by the programmer to express (1) a dependency that needs to be transformed into a dependency in the generated code from (2) a dependency that is just program logic? IOW, consider the consume_mo example "*(p + flag - flag)", where flag is coming from a consume_mo load: Can you give a complete set of rules (for a full program) for when the compiler is allowed to optimize out flag-flag and when not? (And it should be practically implementable in a compiler, and not prevent optimizations where people expect them.) > If the compiler starts doing value speculation on > atomic accesses, you are quite possibly breaking things like that. > It's just not a good idea. Don't do it. Write the standard so that it > clearly is disallowed. You never want value speculation for anything possibly originating from an atomic load? Or what are the concrete rules you have in mind? > Because you may think that a C standard is machine-independent, but > that isn't really the case. The people who write code still write code > for a particular machine. Our code works (in the general case) on > different byte orderings, different register sizes, different memory > ordering models. But in each *instance* we still end up actually > coding for each machine. 
That's how *you* do it. I'm sure you are aware that for lots of programmers, having a machine-dependent standard is not helpful at all. As the standard is written, code is portable. It can certainly be beneficial for programmers to optimize for different machines, but the core of the memory model will always remain portable, because that's what arguably most programmers need. That doesn't prevent machine-dependent extensions, but those then should be well integrated with the rest of the model. > So the rules for atomics should be simple and *specific* enough that > when you write code for a particular architecture, you can take the > architecture memory ordering *and* the C atomics orderings into > account, and do the right thing for that architecture. That would be an extension, but not viable as a *general* requirement as far as the standard is concerned. > And that very much means that doing things like value speculation MUST > NOT HAPPEN. See? Even if you can prove that your code is "equivalent", > it isn't. I'm kind of puzzled why you keep stating generalizations of assertions that clearly aren't well-founded (ie, which might be true for the kernel or your wishes for how the world should be, but aren't true in general or for the objectives of others). It's not helpful to just pretend other people's opinions or worlds didn't exist in discussions such as this one. It's clear that the standard
Re: [RFC][PATCH 0/5] arch: atomic rework
Hi Paul, On 18 February 2014 14:56, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: >> Several of you have said that the standard and compiler should not >> permit speculative writes of atomics, or (effectively) that the >> compiler should preserve dependencies. In simple examples it's easy >> to see what that means, but in general it's not so clear what the >> language should guarantee, because dependencies may go via non-atomic >> code in other compilation units, and we have to consider the extent to >> which it's desirable to limit optimisation there. >> >> For example, suppose we have, in one compilation unit: >> >> void f(int ra, int*rb) { >> if (ra==42) >> *rb=42; >> else >> *rb=42; >> } > > Hello, Peter! > > Nice example! > > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree > says the following about the control dependency in the above construct: > > > > q = ACCESS_ONCE(a); > if (q) { > barrier(); > ACCESS_ONCE(b) = p; > do_something(); > } else { > barrier(); > ACCESS_ONCE(b) = p; > do_something_else(); > } > > The initial ACCESS_ONCE() is required to prevent the compiler from > proving the value of 'a', and the pair of barrier() invocations are > required to prevent the compiler from pulling the two identical stores > to 'b' out from the legs of the "if" statement. thanks > > > So yes, current compilers need significant help if it is necessary to > maintain dependencies in that sort of code. > > Similar examples came up in the data-dependency discussions in the > standards committee, which led to the [[carries_dependency]] attribute for > C11 and C++11. Of course, current compilers don't have this attribute, > and the current Linux kernel code doesn't have any other marking for > data dependencies passing across function boundaries. (Maybe some time > as an assist for detecting pointer leaks out of RCU read-side critical > sections, but efforts along those lines are a bit stalled at the moment.) > > More on data dependencies below... > >> and in another compilation unit the bodies of two threads: >> >> // Thread 0 >> r1 = x; >> f(r1,&r2); >> y = r2; >> >> // Thread 1 >> r3 = y; >> f(r3,&r4); >> x = r4; >> >> where accesses to x and y are annotated C11 atomic >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. >> >> (Of course, this is an artificial example, to make the point below as >> simply as possible - in real code the branches of the conditional >> might not be syntactically identical, just equivalent after macro >> expansion and other optimisation.) >> >> In the source program there's a dependency from the read of x to the >> write of y in Thread 0, and from the read of y to the write of x on >> Thread 1. Dependency-respecting compilation would preserve those and >> the ARM and POWER architectures both respect them, so the reads of x >> and y could not give 42. >> >> But a compiler might well optimise the (non-atomic) body of f() to >> just *rb=42, making the threads effectively >> >> // Thread 0 >> r1 = x; >> y = 42; >> >> // Thread 1 >> r3 = y; >> x = 42; >> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures >> permit those two reads to see 42. That is moreover actually observable >> on current ARM hardware. > > I do agree that this could happen on current compilers and hardware. > > Agreed, but as Peter Zijlstra noted in this thread, this optimization > is to a control dependency, not a data dependency. Indeed. 
In principle (again as Hans has observed) a compiler might well convert between the two, e.g. if operating on single-bit values, or where value-range analysis has shown that a variable can only contain one of a small set of values. I don't know whether that happens in practice? Then there are also cases where a compiler is very likely to remove data/address dependencies, eg if some constant C is #define'd to be 0 then an array access indexed by x * C will have the dependency on x removed. The runtime and compiler development costs of preventing that are also unclear to me. Given that, whether it's reasonable to treat control and data dependencies differently seems to be an open question. >> So as far as we can see, either: >> >> 1) if you can accept the latter behaviour (if the Linux codebase does >>not rely on its absence), the language definition should permit it, >>and current compiler optimisations can be used, >> >> or >> >> 2) otherwise, the language definition should prohibit it but the >>compiler would have to preserve depen
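Picking up the dependency-removal case mentioned earlier in this message (an index multiplied by a constant that happens to be defined as zero), a tiny sketch of my own with invented names:

    #include <stdatomic.h>

    #define SCALE 0               /* happens to be zero in this configuration */

    _Atomic int idx_src;
    int table[16];

    int reader(void)
    {
        int x = atomic_load_explicit(&idx_src, memory_order_consume);
        /* Syntactically the access below is address-dependent on x, but
           once the compiler folds x * SCALE to 0 the emitted load no
           longer depends on x, and the hardware ordering guarantee that
           the dependency was meant to provide is gone. */
        return table[x * SCALE];
    }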
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel wrote: > > > > There's an underlying problem here that's independent from the actual > > instance that you're worried about here: "no sense" is a ultimately a > > matter of taste/objectives/priorities as long as the respective > > specification is logically consistent. > > Yes. But I don't think it's "independent". > > Exactly *because* some people will read standards without applying > "does the resulting code generation actually make sense for the > programmer that wrote the code", the standard has to be pretty clear. > > The standard often *isn't* pretty clear. It wasn't clear enough when > it came to "volatile", and yet that was a *much* simpler concept than > atomic accesses and memory ordering. > > And most of the time it's not a big deal. But because the C standard > generally tries to be very portable, and cover different machines, > there tends to be a mindset that anything inherently unportable is > "undefined" or "implementation defined", and then the compiler writer > is basically given free reign to do anything they want (with > "implementation defined" at least requiring that it is reliably the > same thing). Yes, that's how it works in general. And this makes sense, because all optimizations rely on that. Second, you can't keep something consistent (eg, between compilers) if it isn't specified. So if we want stricter rules, those need to be specified somewhere. > And when it comes to memory ordering, *everything* is basically > non-portable, because different CPU's very much have different rules. Well, the current set of memory orders (and the memory model as a whole) is portable, even though it might not allow to exploit all hardware properties, and thus might perform sub-optimally in some cases. > I worry that that means that the standard then takes the stance that > "well, compiler re-ordering is no worse than CPU re-ordering, so we > let the compiler do anything". And then we have to either add > "volatile" to make sure the compiler doesn't do that, or use an overly > strict memory model at the compiler level that makes it all pointless. Using "volatile" is not a good option, I think, because synchronization between threads should be orthogonal to observable output of the abstract machine. The current memory model might not allow to exploit all hardware properties, I agree. But then why don't we work on how to extend it to do so? We need to specify the behavior we want anyway, and this can't be independent of the language semantics, so it has to be conceptually integrated with the standard anyway. > So I really really hope that the standard doesn't give compiler > writers free hands to do anything that they can prove is "equivalent" > in the virtual C machine model. It does, but it also doesn't mean this can't be extended. So let's focus on whether we can find an extension. > That's not how you get reliable > results. In this general form, that's obviously a false claim.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel wrote: > > > > > > Which example do you have in mind here? Haven't we resolved all the > > > debated examples, or did I miss any? > > > > Well, Paul seems to still think that the standard possibly allows > > speculative writes or possibly value speculation in ways that break > > the hardware-guaranteed orderings. > > It is not that I know of any specific problems, but rather that I > know I haven't looked under all the rocks. Plus my impression from > my few years on the committee is that the standard will be pushed to > the limit when it comes time to add optimizations. > > One example that I learned about last week uses the branch-prediction > hardware to validate value speculation. And no, I am not at all a fan > of value speculation, in case you were curious. However, it is still > an educational example. > > This is where you start: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > do_something(p->a, p->b, p->c); > p->d = 1; I assume that's the source code. > Then you leverage branch-prediction hardware as follows: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > if (p == GUESS) { > do_something(GUESS->a, GUESS->b, GUESS->c); > GUESS->d = 1; > } else { > do_something(p->a, p->b, p->c); > p->d = 1; > } I assume that this is a potential transformation by a compiler. > The CPU's branch-prediction hardware squashes speculation in the case where > the guess was wrong, and this prevents the speculative store to ->d from > ever being visible. However, the then-clause breaks dependencies, which > means that the loads -could- be speculated, so that do_something() gets > passed pre-initialization values. > > Now, I hope and expect that the wording in the standard about dependency > ordering prohibits this sort of thing. But I do not yet know for certain. The transformation would be incorrect. p->a in the source code carries a dependency, and as you say, the transformed code wouldn't have that dependency any more. So the transformed code would loose ordering constraints that it has in the virtual machine, so in the absence of other proofs of correctness based on properties not shown in the example, the transformed code would not result in the same behavior as allowed by the abstract machine. If the transformation would actually be by a programmer, then this wouldn't do the same as the first example because mo_consume doesn't work through the if statement. Are there other specified concerns that you have regarding this example?
Deadline for DWARF Version 5 comments -- March 31, 2014
The DWARF Debugging Information Format Committee (http://dwarfstd.org) is nearing completion on the DWARF Version 5 standard. We anticipate a public comment draft to be released mid-2014. As in the past, we have attempted where possible to make DWARF Version 5 backward compatible with previous versions. We will be accepting public comments, suggestions, or proposals for changes to the DWARF Debugging Standard until March 31, 2014. Please submit your comments using our Public Comment web page: http://dwarfstd.org/Comment.php. The current Version 4 standard, as well as previous versions of the DWARF Standard can be found here: http://dwarfstd.org/Download.php Please feel free to forward this email to anyone or any list where it seems appropriate. -- Michael EagerChair, DWARF Standards Committee ea...@eagercon.com
Re: [RFC][PATCH 0/5] arch: atomic rework
On 18 February 2014 12:53, Peter Zijlstra wrote: > On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: >> Several of you have said that the standard and compiler should not >> permit speculative writes of atomics, or (effectively) that the >> compiler should preserve dependencies. > > The example below only deals with control dependencies; so I'll limit > myself to that. Data/address dependencies are, if anything, even less clear - see a paragraph on that in my reply to Paul. >> In simple examples it's easy >> to see what that means, but in general it's not so clear what the >> language should guarantee, because dependencies may go via non-atomic >> code in other compilation units, and we have to consider the extent to >> which it's desirable to limit optimisation there. >> >> For example, suppose we have, in one compilation unit: >> >> void f(int ra, int*rb) { >> if (ra==42) >> *rb=42; >> else >> *rb=42; >> } >> >> and in another compilation unit the bodies of two threads: >> >> // Thread 0 >> r1 = x; >> f(r1,&r2); >> y = r2; >> >> // Thread 1 >> r3 = y; >> f(r3,&r4); >> x = r4; >> >> where accesses to x and y are annotated C11 atomic >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. > > So I'm intuitively ok with this, however I would expect something like: > > void f(_Atomic int ra, _Atomic int *rb); > > To preserve dependencies and not make the conditional go away, simply > because in that case the: > > if (ra == 42) > > the 'ra' usage can be seen as an atomic load. > >> So as far as we can see, either: >> >> 1) if you can accept the latter behaviour (if the Linux codebase does >>not rely on its absence), the language definition should permit it, >>and current compiler optimisations can be used, > > Currently there's exactly 1 site in the Linux kernel that relies on > control dependencies as far as I know -- the one I put in. ok, thanks > And its > limited to a single function, so no cross translation unit funnies > there. One can imagine a language definition that treats code that lies entirely within a single compilation unit specially (e.g. if it's somehow annotated as relying on dependencies). But I imagine it would be pretty unappealing to work with. > Of course, nobody is going to tell me when or where they'll put in the > next one; since its now documented as accepted practise. Is that not fragile? > However, PaulMck and our RCU usage very much do cross all sorts of TU > boundaries; but those are data dependencies. yes - though again see my reply to Paul's note thanks, Peter > ~ Peter
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds > wrote: > > > > As far as I can tell, the intent is that you can't do value > > speculation (except perhaps for the "relaxed", which quite frankly > > sounds largely useless). > > Hmm. The language I see for "consume" is not obvious: > > "Consume operation: no reads in the current thread dependent on the > value currently loaded can be reordered before this load" I can't remember seeing that language in the standard (ie, C or C++). Where is this from? > and it could make a compiler writer say that value speculation is > still valid, if you do it like this (with "ptr" being the atomic > variable): > > value = ptr->val; I assume the load from ptr has mo_consume ordering? > into > > tmp = ptr; > value = speculated.value; > if (unlikely(tmp != &speculated)) > value = tmp->value; > > which is still bogus. The load of "ptr" does happen before the load of > "value = speculated->value" in the instruction stream, but it would > still result in the CPU possibly moving the value read before the > pointer read at least on ARM and power. And surprise, in the C/C++ model the load from ptr is sequenced-before the load from speculated, but there's no ordering constraint on the reads-from relation for the value load if you use mo_consume on the ptr load. Thus, the transformed code has less ordering constraints than the original code, and we arrive at the same outcome. > So if you're a compiler person, you think you followed the letter of > the spec - as far as *you* were concerned, no load dependent on the > value of the atomic load moved to before the atomic load. No. Because the wobbly sentence you cited(?) above is not what the standard says. Would you please stop making claims about what compiler writers would do or not if you seemingly aren't even familiar with the model that compiler writers would use to reason about transformations? Seriously? > You go home, > happy, knowing you've done your job. Never mind that you generated > code that doesn't actually work. > > I dread having to explain to the compiler person that he may be right > in some theoretical virtual machine, but the code is subtly broken and > nobody will ever understand why (and likely not be able to create a > test-case showing the breakage). > > But maybe the full standard makes it clear that "reordered before this > load" actually means on the real hardware, not just in the generated > instruction stream. Do you think everyone else is stupid? If there's an ordering constraint in the virtual machine, it better be present when executing in the real machine unless it provably cannot result in different output as specified by the language's semantics. > Reading it with understanding of the *intent* and > understanding all the different memory models that requirement should > be obvious (on alpha, you need an "rmb" instruction after the load), > but ... The standard is clear on what's required. I strongly suggest reading the formalization of the memory model by Batty et al.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 14:32 -0800, > > > >> Stop claiming it "can return 1".. It *never* returns 1 unless you do > >> the load and *verify* it, or unless the load itself can be made to go > >> away. And with the code sequence given, that just doesn't happen. END > >> OF STORY. > > > > void foo(); > > { > > atomic x = 1; > > if (atomic_load(&x, mo_relaxed) == 1) > > atomic_store(&y, 3, mo_relaxed)); > > } > > This is the very example I gave, where the real issue is not that "you > prove that load returns 1", you instead say "store followed by a load > can be combined". > > I (in another email I just wrote) tried to show why the "prove > something is true" is a very dangerous model. Seriously, it's pure > crap. It's broken. I don't see anything dangerous in the example above with the language semantics as specified: It's a well-defined situation, given the rules of the language. I replied to the other email you wrote with my viewpoint on why the above is useful, how it compares to what you seem to what, and where I think we need to start to bridge the gap. > If the C standard defines atomics in terms of "provable equivalence", > it's broken. Exactly because on a *virtual* machine you can prove > things that are not actually true in a *real* machine. For the control dependencies you have in mind, it's actually the other way around. You expect the real machine's properties in a program whose semantics only give you the virtual machine's properties. Anything you prove on the virtual machine will be true on the real machine (in a correct implementation) -- but you can't expect to have real-machine properties on language that's based on the virtual machine. > I have the > example of value speculation changing the memory ordering model of the > actual machine. This example is not true for the language as specified. It is true for a modified language that you have in mind, but for this one I've just seen pretty rough rules so far. Please see my other reply.
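For reference, the quoted example rendered as compilable C11 (a sketch of mine; the point under discussion is that a conforming compiler may reduce the whole body to the single relaxed store to y):

    #include <stdatomic.h>

    atomic_int y;

    void foo(void)
    {
        /* x is local and its address never escapes, so no other thread can
           observe or modify it; the relaxed load below therefore provably
           returns 1 in every execution of the abstract machine. */
        atomic_int x = ATOMIC_VAR_INIT(1);
        if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
            atomic_store_explicit(&y, 3, memory_order_relaxed);
    }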
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 03:33:35PM +, Peter Sewell wrote: > Hi Paul, > > On 18 February 2014 14:56, Paul E. McKenney > wrote: > > On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: > >> Several of you have said that the standard and compiler should not > >> permit speculative writes of atomics, or (effectively) that the > >> compiler should preserve dependencies. In simple examples it's easy > >> to see what that means, but in general it's not so clear what the > >> language should guarantee, because dependencies may go via non-atomic > >> code in other compilation units, and we have to consider the extent to > >> which it's desirable to limit optimisation there. > >> > >> For example, suppose we have, in one compilation unit: > >> > >> void f(int ra, int*rb) { > >> if (ra==42) > >> *rb=42; > >> else > >> *rb=42; > >> } > > > > Hello, Peter! > > > > Nice example! > > > > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree > > says the following about the control dependency in the above construct: > > > > > > > > q = ACCESS_ONCE(a); > > if (q) { > > barrier(); > > ACCESS_ONCE(b) = p; > > do_something(); > > } else { > > barrier(); > > ACCESS_ONCE(b) = p; > > do_something_else(); > > } > > > > The initial ACCESS_ONCE() is required to prevent the compiler from > > proving the value of 'a', and the pair of barrier() invocations are > > required to prevent the compiler from pulling the two identical stores > > to 'b' out from the legs of the "if" statement. > > thanks > > > > > > > So yes, current compilers need significant help if it is necessary to > > maintain dependencies in that sort of code. > > > > Similar examples came up in the data-dependency discussions in the > > standards committee, which led to the [[carries_dependency]] attribute for > > C11 and C++11. Of course, current compilers don't have this attribute, > > and the current Linux kernel code doesn't have any other marking for > > data dependencies passing across function boundaries. (Maybe some time > > as an assist for detecting pointer leaks out of RCU read-side critical > > sections, but efforts along those lines are a bit stalled at the moment.) > > > > More on data dependencies below... > > > >> and in another compilation unit the bodies of two threads: > >> > >> // Thread 0 > >> r1 = x; > >> f(r1,&r2); > >> y = r2; > >> > >> // Thread 1 > >> r3 = y; > >> f(r3,&r4); > >> x = r4; > >> > >> where accesses to x and y are annotated C11 atomic > >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to > >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. > >> > >> (Of course, this is an artificial example, to make the point below as > >> simply as possible - in real code the branches of the conditional > >> might not be syntactically identical, just equivalent after macro > >> expansion and other optimisation.) > >> > >> In the source program there's a dependency from the read of x to the > >> write of y in Thread 0, and from the read of y to the write of x on > >> Thread 1. Dependency-respecting compilation would preserve those and > >> the ARM and POWER architectures both respect them, so the reads of x > >> and y could not give 42. > >> > >> But a compiler might well optimise the (non-atomic) body of f() to > >> just *rb=42, making the threads effectively > >> > >> // Thread 0 > >> r1 = x; > >> y = 42; > >> > >> // Thread 1 > >> r3 = y; > >> x = 42; > >> > >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures > >> permit those two reads to see 42. 
That is moreover actually observable > >> on current ARM hardware. > > > > I do agree that this could happen on current compilers and hardware. > > > > Agreed, but as Peter Zijlstra noted in this thread, this optimization > > is to a control dependency, not a data dependency. > > Indeed. In principle (again as Hans has observed) a compiler might > well convert between the two, e.g. if operating on single-bit values, > or where value-range analysis has shown that a variable can only > contain one of a small set of values. I don't know whether that > happens in practice? Then there are also cases where a compiler is > very likely to remove data/address dependencies, eg if some constant C > is #define'd to be 0 then an array access indexed by x * C will have > the dependency on x removed. The runtime and compiler development > costs of preventing that are also unclear to me. > > Given that, whether it's reasonable to treat control and data > dependencies differently seems to be an open question. Here is another (admittedly fanciful and probably bugg
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: >> And exactly because I know enough, I would *really* like atomics to be >> well-defined, and have very clear - and *local* - rules about how they >> can be combined and optimized. > > "Local"? Yes. So I think that one of the big advantages of atomics over volatile is that they *can* be optimized, and as such I'm not at all against trying to generate much better code than for volatile accesses. But at the same time, that can go too far. For example, one of the things we'd want to use atomics for is page table accesses, where it is very important that we don't generate multiple accesses to the values, because parts of the values can be change *by*hardware* (ie accessed and dirty bits). So imagine that you have some clever global optimizer that sees that the program never ever actually sets the dirty bit at all in any thread, and then uses that kind of non-local knowledge to make optimization decisions. THAT WOULD BE BAD. Do you see what I'm aiming for? Any optimization that tries to prove anything from more than local state is by definition broken, because it assumes that everything is described by the program. But *local* optimizations are fine, as long as they follow the obvious rule of not actually making changes that are semantically visible. (In practice, I'd be impressed as hell for that particular example, and we actually do end up setting the dirty bit by hand in some situations, so the example is slightly made up, but there are other cases that might be more realistic in that sometimes we do things that are hidden from the compiler - in assembly etc - and the compiler might *think* it knows what is going on, but it doesn't actually see all accesses). > Sorry, but the rules *are* very clear. I *really* suggest to look at > the formalization by Batty et al. And in these rules, proving that a > read will always return value X has a well-defined meaning, and you can > use it. That simply follows from how the model is built. What "model"? That's the thing. I have tried to figure out whether the model is some abstract C model, or a model based on the actual hardware that the compiler is compiling for, and whether the model is one that assumes the compiler has complete knowledge of the system (see the example above). And it seems to be a mixture of it all. The definitions of the various orderings obviously very much imply that the compiler has to insert the proper barriers or sequence markers for that architecture, but then there is no apparent way to depend on any *other* architecture ordering guarantees. Like our knowledge that all architectures (with the exception of alpha, which really doesn't end up being something we worry about any more) end up having the load dependency ordering guarantee. > What you seem to want just isn't covered by the model as it is today -- > you can't infer from that that the model itself would be wrong. The > dependency chains aren't modeled in the way you envision it (except in > what consume_mo tries, but that seems to be hard to implement); they are > there on the level of the program logic as modeled by the abstract > machine and the respective execution/output rules, but they are not > built to represent those specific ordering guarantees the HW gives you. So this is a problem. 
It basically means that we cannot do the kinds of things we do now, which very much involve knowing what the memory ordering of a particular machine is, and combining that knowledge with our knowledge of code generation. Now, *most* of what we do is protected by locking and is all fine. But we do have a few rather subtle places in RCU and in the scheduler where we depend on the actual dependency chain. In *practice*, I seriously doubt any reasonable compiler can actually make a mess of it. The kinds of optimizations that would actually defeat the dependency chain are simply not realistic. And I suspect that will end up being what we rely on - there being no actual sane sequence that a compiler would ever do, even if we wouldn't have guarantees for some of it. And I suspect I can live with that. We _have_ lived with that for the longest time, after all. We very much do things that aren't covered by any existing C standard, and just basically use tricks to coax the compiler into never generating code that doesn't work (with our inline asm barriers etc being a prime example). > I would also be cautious claiming that the rules you suggested would be > very clear and very simple. I haven't seen a memory model spec from you > that would be viable as the standard model for C/C++, nor have I seen > proof that this would actually be easier to understand for programmers > in general. So personally, if I were to write the spec, I would have taken a completely different approach from what the standard obviously does. I'd have taken the approach of specifying the requir
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote: > On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: > > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: > > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel > > > wrote: > > > > > > > > Which example do you have in mind here? Haven't we resolved all the > > > > debated examples, or did I miss any? > > > > > > Well, Paul seems to still think that the standard possibly allows > > > speculative writes or possibly value speculation in ways that break > > > the hardware-guaranteed orderings. > > > > It is not that I know of any specific problems, but rather that I > > know I haven't looked under all the rocks. Plus my impression from > > my few years on the committee is that the standard will be pushed to > > the limit when it comes time to add optimizations. > > > > One example that I learned about last week uses the branch-prediction > > hardware to validate value speculation. And no, I am not at all a fan > > of value speculation, in case you were curious. However, it is still > > an educational example. > > > > This is where you start: > > > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > > do_something(p->a, p->b, p->c); > > p->d = 1; > > I assume that's the source code. Yep! > > Then you leverage branch-prediction hardware as follows: > > > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > > if (p == GUESS) { > > do_something(GUESS->a, GUESS->b, GUESS->c); > > GUESS->d = 1; > > } else { > > do_something(p->a, p->b, p->c); > > p->d = 1; > > } > > I assume that this is a potential transformation by a compiler. Again, yep! > > The CPU's branch-prediction hardware squashes speculation in the case where > > the guess was wrong, and this prevents the speculative store to ->d from > > ever being visible. However, the then-clause breaks dependencies, which > > means that the loads -could- be speculated, so that do_something() gets > > passed pre-initialization values. > > > > Now, I hope and expect that the wording in the standard about dependency > > ordering prohibits this sort of thing. But I do not yet know for certain. > > The transformation would be incorrect. p->a in the source code carries > a dependency, and as you say, the transformed code wouldn't have that > dependency any more. So the transformed code would loose ordering > constraints that it has in the virtual machine, so in the absence of > other proofs of correctness based on properties not shown in the > example, the transformed code would not result in the same behavior as > allowed by the abstract machine. Glad that you agree! ;-) > If the transformation would actually be by a programmer, then this > wouldn't do the same as the first example because mo_consume doesn't > work through the if statement. Agreed. > Are there other specified concerns that you have regarding this example? Nope. Just generalized paranoia. (But just because I am paranoid doesn't mean that there isn't a bug lurking somewhere in the standard, the compiler, the kernel, or my own head!) I will likely have more once I start mapping Linux kernel atomics to the C11 standard. One more paper past N3934 comes first, though. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel wrote: > > > > > > There's an underlying problem here that's independent from the actual > > > instance that you're worried about here: "no sense" is a ultimately a > > > matter of taste/objectives/priorities as long as the respective > > > specification is logically consistent. > > > > Yes. But I don't think it's "independent". > > > > Exactly *because* some people will read standards without applying > > "does the resulting code generation actually make sense for the > > programmer that wrote the code", the standard has to be pretty clear. > > > > The standard often *isn't* pretty clear. It wasn't clear enough when > > it came to "volatile", and yet that was a *much* simpler concept than > > atomic accesses and memory ordering. > > > > And most of the time it's not a big deal. But because the C standard > > generally tries to be very portable, and cover different machines, > > there tends to be a mindset that anything inherently unportable is > > "undefined" or "implementation defined", and then the compiler writer > > is basically given free reign to do anything they want (with > > "implementation defined" at least requiring that it is reliably the > > same thing). > > Yes, that's how it works in general. And this makes sense, because all > optimizations rely on that. Second, you can't keep something consistent > (eg, between compilers) if it isn't specified. So if we want stricter > rules, those need to be specified somewhere. > > > And when it comes to memory ordering, *everything* is basically > > non-portable, because different CPU's very much have different rules. > > Well, the current set of memory orders (and the memory model as a whole) > is portable, even though it might not allow to exploit all hardware > properties, and thus might perform sub-optimally in some cases. > > > I worry that that means that the standard then takes the stance that > > "well, compiler re-ordering is no worse than CPU re-ordering, so we > > let the compiler do anything". And then we have to either add > > "volatile" to make sure the compiler doesn't do that, or use an overly > > strict memory model at the compiler level that makes it all pointless. > > Using "volatile" is not a good option, I think, because synchronization > between threads should be orthogonal to observable output of the > abstract machine. Are you thinking of "volatile" -instead- of atomics? My belief is that given the current standard there will be times that we need to use "volatile" -in- -addition- to atomics. > The current memory model might not allow to exploit all hardware > properties, I agree. > > But then why don't we work on how to extend it to do so? We need to > specify the behavior we want anyway, and this can't be independent of > the language semantics, so it has to be conceptually integrated with the > standard anyway. > > > So I really really hope that the standard doesn't give compiler > > writers free hands to do anything that they can prove is "equivalent" > > in the virtual C machine model. > > It does, but it also doesn't mean this can't be extended. So let's > focus on whether we can find an extension. > > > That's not how you get reliable > > results. > > In this general form, that's obviously a false claim. These two sentences starkly illustrate the difference in perspective between you two. You are talking past each other. 
Not sure how to fix this at the moment, but what else is new? ;-) Thanx, Paul
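For concreteness, a minimal C11 sketch of the distinction being discussed (assuming <stdatomic.h>; the names are illustrative): a compiler may merge the two relaxed atomic loads, but must perform both volatile reads, and neither property gives you the other.

#include <stdatomic.h>

_Atomic int aflag;
volatile int vflag;

int poll_atomic(void)
{
    /* Nothing forbids folding these two relaxed loads into one load. */
    int a = atomic_load_explicit(&aflag, memory_order_relaxed);
    int b = atomic_load_explicit(&aflag, memory_order_relaxed);
    return a + b;
}

int poll_volatile(void)
{
    /* Both volatile reads must be performed, but volatile by itself
       gives no inter-thread ordering guarantees. */
    int a = vflag;
    int b = vflag;
    return a + b;
}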
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> And exactly because I know enough, I would *really* like atomics to be > >> well-defined, and have very clear - and *local* - rules about how they > >> can be combined and optimized. > > > > "Local"? > > Yes. > > So I think that one of the big advantages of atomics over volatile is > that they *can* be optimized, and as such I'm not at all against > trying to generate much better code than for volatile accesses. > > But at the same time, that can go too far. For example, one of the > things we'd want to use atomics for is page table accesses, where it > is very important that we don't generate multiple accesses to the > values, because parts of the values can be change *by*hardware* (ie > accessed and dirty bits). > > So imagine that you have some clever global optimizer that sees that > the program never ever actually sets the dirty bit at all in any > thread, and then uses that kind of non-local knowledge to make > optimization decisions. THAT WOULD BE BAD. Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag. Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable "volatile". Other programs have similar issues, e.g., via dlopen(). 4. Some drivers allow user-mode code to mmap() some of their state. Any changes undertaken by the user-mode code would be invisible to the compiler. 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ And probably other stuff as well. Thanx, Paul
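A bare-bones illustration of item 2 (file and symbol names made up): if a variable is also stored to from hand-written assembly that the compiler never parses, any whole-program value proof about it is simply wrong.

/* foo.c -- shared_state is also written directly from bar.S,
   which the C compiler never sees. */
int shared_state;

int read_state(void)
{
    /* A whole-program value analysis that only looks at the C code
       would wrongly conclude shared_state is always 0 here. */
    return shared_state;
}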
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 03:16:33PM +, Mark Batty wrote: > Hi Paul, > > Thanks for the document. I'm looking forward to reading the bits about > dependency chains in Linux. And I am looking forward to your thoughts on those bits! > > One point of confusion for me... Example 4 says "language must allow". > > Shouldn't that be "language is permitted to allow"? > > When we say "allow", we mean that the optimised execution should be > allowed by the specification, and Implicitly, the unoptimised > execution should remain allowed too. We want to be concrete about what > the language specification allows, and that's why we say "must". It is > not to disallow the unoptimised execution. OK, got it! Thanx, Paul > > Seems like an > > implementation is always within its rights to avoid an optimization if > > its implementation prevents it from safely detecting the oppportunity > > for that optimization. > > That's right. > > - Mark > > > > Or am I missing something here? > > > > Thanx, Paul > > >
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell wrote: > > For example, suppose we have, in one compilation unit: > > void f(int ra, int*rb) { > if (ra==42) > *rb=42; > else > *rb=42; > } So this is a great example, and in general I really like your page at: > For more context, this example is taken from a summary of the thin-air > problem by Mark Batty and myself, > , and the problem with > dependencies via other compilation units was AFAIK first pointed out > by Hans Boehm. and the reason I like your page is that it really talks about the problem by pointing to the "unoptimized" code, and what hardware would do. As mentioned, I think that's actually the *correct* way to think about the problem space, because it allows the programmer to take hardware characteristics into account, without having to try to "describe" them at a source level. As to your example of if (ra) atomic_write(rb, A); else atomic_write(rb, B); I really think that it is ok to combine that into atomic_write(rb, ra ? A:B); (by virtue of "exact same behavior on actual hardware"), and then the only remaining question is whether the "ra?A:B" can be optimized to remove the conditional if A==B as in your example where both are "42". Agreed? Now, I would argue that the "naive" translation of that is unambiguous, and since "ra" is not volatile or magic in any way, then "ra?42:42" can obviously be optimized into just 42 - by the exact same rule that says "the compiler can do any transformation that is equivalent in the hardware". The compiler can *locally* decide that that is the right thing to do, and any programmer that complains about that decision is just crazy. So my "local machine behavior equivalency" rule means that that function can be optimized into a single "store 42 atomically into rb". Now, if it's *not* compiled locally, and is instead implemented as a macro (or inline function), there are obviously situations where "ra ? A : B" ends up having to do other things. In particular, X may be volatile or an atomic read that has ordering semantics, and then that expression doesn't become just "42", but that's a separate issue. It's not all that dissimilar to "function calls are sequence points", though, and obviously if the source of "ra" has semantic meaning, you have to honor that semantic meaning. Agreed? Linus
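Rendered as compilable C11 (my sketch; atomic_write above is pseudo-code, and relaxed ordering is assumed just for concreteness), the transformation being argued for is:

#include <stdatomic.h>

void before(int ra, _Atomic int *rb)
{
    if (ra)
        atomic_store_explicit(rb, 42, memory_order_relaxed);
    else
        atomic_store_explicit(rb, 42, memory_order_relaxed);
}

void after(int ra, _Atomic int *rb)
{
    /* "ra ? 42 : 42" folds to 42, so the branch disappears entirely. */
    (void)ra;
    atomic_store_explicit(rb, 42, memory_order_relaxed);
}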
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel wrote: >> >> "Consume operation: no reads in the current thread dependent on the >> value currently loaded can be reordered before this load" > > I can't remember seeing that language in the standard (ie, C or C++). > Where is this from? That's just for googling for explanations. I do have some old standard draft, but that doesn't have any concise definitions anywhere that I could find. >> and it could make a compiler writer say that value speculation is >> still valid, if you do it like this (with "ptr" being the atomic >> variable): >> >> value = ptr->val; > > I assume the load from ptr has mo_consume ordering? Yes. >> into >> >> tmp = ptr; >> value = speculated.value; >> if (unlikely(tmp != &speculated)) >> value = tmp->value; >> >> which is still bogus. The load of "ptr" does happen before the load of >> "value = speculated->value" in the instruction stream, but it would >> still result in the CPU possibly moving the value read before the >> pointer read at least on ARM and power. > > And surprise, in the C/C++ model the load from ptr is sequenced-before > the load from speculated, but there's no ordering constraint on the > reads-from relation for the value load if you use mo_consume on the ptr > load. Thus, the transformed code has less ordering constraints than the > original code, and we arrive at the same outcome. Ok, good. > The standard is clear on what's required. I strongly suggest reading > the formalization of the memory model by Batty et al. Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. Linus
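Filling in the fragments as compilable C11 (a sketch; the struct and variable names are made up), the original code and the value-speculated form look roughly like this; whether the second is a legal transformation of the first is exactly what is in dispute.

#include <stdatomic.h>

struct node { int value; };

_Atomic(struct node *) ptr;
struct node speculated;

int original(void)
{
    struct node *p = atomic_load_explicit(&ptr, memory_order_consume);
    return p->value;            /* dependent load, ordered on ARM/Power */
}

int with_value_speculation(void)
{
    struct node *tmp = atomic_load_explicit(&ptr, memory_order_consume);
    int value = speculated.value;   /* the CPU may satisfy this read     */
    if (tmp != &speculated)         /* before the pointer read above ... */
        value = tmp->value;
    return value;
}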
Re: [RFC][PATCH 0/5] arch: atomic rework
On 18 February 2014 17:38, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell > wrote: >> >> For example, suppose we have, in one compilation unit: >> >> void f(int ra, int*rb) { >> if (ra==42) >> *rb=42; >> else >> *rb=42; >> } > > So this is a great example, and in general I really like your page at: > >> For more context, this example is taken from a summary of the thin-air >> problem by Mark Batty and myself, >> , and the problem with >> dependencies via other compilation units was AFAIK first pointed out >> by Hans Boehm. > > and the reason I like your page is that it really talks about the > problem by pointing to the "unoptimized" code, and what hardware would > do. Thanks. It's certainly necessary to separately understand what compiler optimisation and the hardware might do, to get anywhere here. But... > As mentioned, I think that's actually the *correct* way to think about > the problem space, because it allows the programmer to take hardware > characteristics into account, without having to try to "describe" them > at a source level. ...to be clear, I am ultimately after a decent source-level description of what programmers can depend on, and we (Mark and I) view that page as identifying constraints on what that description can say. There are too many compiler optimisations for people to reason directly in terms of the set of all transformations that they do, so we need some more concise and comprehensible envelope identifying what is allowed, as an interface between compiler writers and users. AIUI that's basically what Torvald is arguing. The C11 spec in its current form is not yet fully up to that task, for one thing because it doesn't attempt to cover all the h/w interactions that you and Paul list, but that is where we're trying to go with our formalisation work. > As to your example of > >if (ra) >atomic_write(rb, A); >else >atomic_write(rb, B); > > I really think that it is ok to combine that into > > atomic_write(rb, ra ? A:B); > > (by virtue of "exact same behavior on actual hardware"), and then the > only remaining question is whether the "ra?A:B" can be optimized to > remove the conditional if A==B as in your example where both are "42". > Agreed? y > Now, I would argue that the "naive" translation of that is > unambiguous, and since "ra" is not volatile or magic in any way, then > "ra?42:42" can obviously be optimized into just 42 - by the exact same > rule that says "the compiler can do any transformation that is > equivalent in the hardware". The compiler can *locally* decide that > that is the right thing to do, and any programmer that complains about > that decision is just crazy. > > So my "local machine behavior equivalency" rule means that that > function can be optimized into a single "store 42 atomically into rb". This is a bit more subtle, because (on ARM and POWER) removing the dependency and conditional branch is actually in general *not* equivalent in the hardware, in a concurrent context. That notwithstanding, I tend to agree that preventing that optimisation for non-atomics would be prohibitively costly (though I'd like real data). It's tempting then to permit more-or-less any optimisation for thread-local accesses but rule out value-range analysis and suchlike for shared-memory accesses. Whether that would be viable from a compiler point of view, I don't know. In C, one will at best only be able to get an approximate analysis of which is which, just for a start. 
> Now, if it's *not* compiled locally, and is instead implemented as a > macro (or inline function), there are obviously situations where "ra ? > A : B" ends up having to do other things. In particular, X may be > volatile or an atomic read that has ordering semantics, and then that > expression doesn't become just "42", but that's a separate issue. It's > not all that dissimilar to "function calls are sequence points", > though, and obviously if the source of "ra" has semantic meaning, you > have to honor that semantic meaning. separate point, indeed Peter
Re: [RFC][PATCH 0/5] arch: atomic rework
On 18 February 2014 17:16, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: >> >> And exactly because I know enough, I would *really* like atomics to be >> >> well-defined, and have very clear - and *local* - rules about how they >> >> can be combined and optimized. >> > >> > "Local"? >> >> Yes. >> >> So I think that one of the big advantages of atomics over volatile is >> that they *can* be optimized, and as such I'm not at all against >> trying to generate much better code than for volatile accesses. >> >> But at the same time, that can go too far. For example, one of the >> things we'd want to use atomics for is page table accesses, where it >> is very important that we don't generate multiple accesses to the >> values, because parts of the values can be change *by*hardware* (ie >> accessed and dirty bits). >> >> So imagine that you have some clever global optimizer that sees that >> the program never ever actually sets the dirty bit at all in any >> thread, and then uses that kind of non-local knowledge to make >> optimization decisions. THAT WOULD BE BAD. > > Might as well list other reasons why value proofs via whole-program > analysis are unreliable for the Linux kernel: > > 1. As Linus said, changes from hardware. > > 2. Assembly code that is not visible to the compiler. > Inline asms will -normally- let the compiler know what > memory they change, but some just use the "memory" tag. > Worse yet, I suspect that most compilers don't look all > that carefully at .S files. > > Any number of other programs contain assembly files. > > 3. Kernel modules that have not yet been written. Now, the > compiler could refrain from trying to prove anything about > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > is currently no way to communicate this information to the > compiler other than marking the variable "volatile". > > Other programs have similar issues, e.g., via dlopen(). > > 4. Some drivers allow user-mode code to mmap() some of their > state. Any changes undertaken by the user-mode code would > be invisible to the compiler. > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > And probably other stuff as well. interesting list. So are you saying that value-range-analysis and such-like (I say glibly, without really knowing what "such-like" refers to here) are fundamentally incompatible with the kernel code, or can you think of some way to tell the compiler a bound on the footprint of the "unseen" changes in each of those cases? Peter > Thanx, Paul > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell wrote: > > This is a bit more subtle, because (on ARM and POWER) removing the > dependency and conditional branch is actually in general *not* equivalent > in the hardware, in a concurrent context. So I agree, but I think that's a generic issue with non-local memory ordering, and is not at all specific to the optimization wrt that "x?42:42" expression. If you have a value that you loaded with a non-relaxed load, and you pass that value off to a non-local function that you don't know what it does, in my opinion that implies that the compiler had better add the necessary serialization to say "whatever that other function does, we guarantee the semantics of the load". So on ppc, if you do a load with "consume" or "acquire" and then call another function without having had something in the caller that serializes the load, you'd better add the lwsync or whatever before the call. Exactly because the function call itself otherwise basically breaks the visibility into ordering. You've basically turned a load-with-ordering-guarantees into just an integer that you passed off to something that doesn't know about the ordering guarantees - and you need that "lwsync" in order to still guarantee the ordering. Tough titties. That's what a CPU with weak memory ordering semantics gets in order to have sufficient memory ordering. And I don't think it's actually a problem in practice. If you are doing loads with ordered semantics, you're not going to pass the result off willy-nilly to random functions (or you really *do* require the ordering, because the load that did the "acquire" was actually for a lock! So I really think that the "local optimization" is correct regardless. Linus
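A sketch of the situation described here (helper() and the other names are made up; C11 syntax assumed):

#include <stdatomic.h>

extern void helper(int *p);    /* compiled separately; body unknown here */

_Atomic(int *) gp;

void caller(void)
{
    int *p = atomic_load_explicit(&gp, memory_order_consume);
    /* Once p is handed to separately compiled code, the dependency is no
       longer visible to this translation unit, so on Power the compiler
       would have to emit an lwsync (or similar) before the call to keep
       the ordering promise of the consume/acquire load. */
    helper(p);
}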
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:23 AM, Peter Sewell wrote: > > interesting list. So are you saying that value-range-analysis and > such-like (I say glibly, without really knowing what "such-like" > refers to here) are fundamentally incompatible with > the kernel code No, it's fine to do things like value-range-analysis, it's just that you have to do it on a local scope, and not think that you can see every single write to some variable. And as Paul points out, this is actually generally true even outside of kernels. We may be pretty unique in having some things that are literally done by hardware (eg the page table updates), but we're certainly not unique in having loadable modules or code that the compiler doesn't know about by virtue of being compiled separately (assembly files, JITted, whatever). And we are actually perfectly fine with compiler barriers. One of our most common barriers is simply this: #define barrier() __asm__ __volatile__("": : :"memory") which basically tells the compiler "something changes memory in ways you don't understand, so you cannot assume anything about memory contents". Obviously, a compiler is still perfectly fine optimizing things like local variables etc that haven't had their address taken across such a barrier (regardless of whether those local variables might be spilled to the stack frame etc), but equally obviously we'd require that this kind of thing makes sure that any atomic writes have been finalized by the time the barrier happens (it does *not* imply a memory ordering, just a compiler memory access barrier - we have separate things for ordering requirements, and those obviously often generate actual barrier instructions) And we're likely always going to have things like this, but if C11 atomics end up working well for us, we might have *fewer* of them. Linus
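A typical use of that barrier, sketched with a made-up flag:

#define barrier() __asm__ __volatile__("" : : : "memory")

extern int stop;    /* set by another thread or by an interrupt handler */

void wait_for_stop(void)
{
    /* Without the barrier the compiler could hoist the load of stop out
       of the loop after "proving" that nothing in the loop writes it. */
    while (!stop)
        barrier();
}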
Re: TYPE_BINFO and canonical types at LTO
> > Non-ODR types born from other frontends will then need to be made to > > alias all the ODR variants that can be done by storing them into the > > current canonical type hash. > > (I wonder if we want to support cross language aliasing for non-POD?) > > Surely for accessing components of non-POD types, no? Like > > class Foo { > Foo(); > int *get_data (); > int *data; > } glob_foo; > > extern "C" int *get_foo_data() { return glob_foo.get_data(); } OK, if we want to support this, then we want to merge. What about types with a vtbl pointer? :) > > ? But you are talking about the "tree" merging part using ODR info > to also merge types which differ in completeness of contained > pointer types, right? (exactly equal cases should be already merged) Actually I was speaking of canonical types here. I want to preserve more of TBAA by honoring ODR and local types. I want to change LTO to not merge canonical types for pairs of types of the same layout (i.e. equivalent in the current canonical type definition) but with different mangled names. I also want it to never merge when types are local. For inter-language TBAA we will need to ensure aliasing between a non-ODR type of a given layout and all unmerged variants of an ODR type with that layout. Can it be done by attaching chains of ODR types to the canonical type hash and, when a non-ODR type appears, just making it alias with all of them? It would make sense to ODR merge in tree merging, too, but I am not sure this fits the current design, since you would need to merge SCC components of different shape, and that seems hard, right? It may be easier to ODR merge after streaming (during DECL fixup), just to make WPA streaming cheaper and to reduce debug info size. If you use -fdump-ipa-devirt, it will dump the ODR types that did not get merged (only the ones with vtable pointers in them ATM), and there are quite long chains for Firefox. So hundreds of duplicated ODR types will end up in the ltrans partition streams, and they eventually hit the debug output machinery. Eric sent me a presentation about doing this in LLVM. http://llvm.org/devmtg/2013-11/slides/Christopher-DebugInfo.pdf > > The canonical type computation happens separately (only for prevailing > types, of course), and there we already "merge" types which differ > in completeness. Canonical type merging is conservative the other > way aroud - if we merge _all_ types to a single canonical type then > TBAA is still correct (we get a single alias set). Yes, I think I understand that. One equivalence is minimal, so we merge only if we are sure there is no information loss; the other is maximal, so we are sure that types that need to be equivalent under the underlying language's TBAA rules actually are equivalent. > > > I also think we want explicit representation of types known to be local > > to compilation unit - anonymous namespaces in C/C++, types defined > > within function bodies in C and god knows what in Ada/Fortran/Java. > > But here you get into the idea of improving TBAA, thus having > _more_ distinct canonical types? Yes. > > Just to make sure to not mix those two ;) > > And whatever "frontend knowledge" we want to excercise - please > make sure we get a reliable way for the middle-end to see > that "frontend knowledge" (no langhooks!). Thus, make it > "middle-end knowledge". Sure, that is what I am proposing - just have DECL_ASSEMBLER_NAME on the TYPE_DECL plus an ODR flag. When comparing types, the middle-end will test the ODR flag and, if it is set, compare via DECL_ASSEMBLER_NAME (TYPE_DECL (type)).
No langhooks needed here, and if another language has a similar inter-unit equivalency it can use the same mechanism - just turn the equivalency description into string identifiers. > > Oh - and the easiest way to improve things is to get less types into the merging process in the first place! Yep, my experiments with not streaming BINFO are aimed at exactly that. I will collect some numbers and send them. Honza > > Richard.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel wrote: [ . . . ] > > The standard is clear on what's required. I strongly suggest reading > > the formalization of the memory model by Batty et al. > > Can you point to it? Because I can find a draft standard, and it sure > as hell does *not* contain any clarity of the model. It has a *lot* of > verbiage, but it's pretty much impossible to actually understand, even > for somebody who really understands memory ordering. I suspect he is thinking of the following: "Mathematizing C++ Concurrency." Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf Even if you don't like the math, it contains some very good examples. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 06:23:47PM +, Peter Sewell wrote: > On 18 February 2014 17:16, Paul E. McKenney > wrote: > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> >> And exactly because I know enough, I would *really* like atomics to be > >> >> well-defined, and have very clear - and *local* - rules about how they > >> >> can be combined and optimized. > >> > > >> > "Local"? > >> > >> Yes. > >> > >> So I think that one of the big advantages of atomics over volatile is > >> that they *can* be optimized, and as such I'm not at all against > >> trying to generate much better code than for volatile accesses. > >> > >> But at the same time, that can go too far. For example, one of the > >> things we'd want to use atomics for is page table accesses, where it > >> is very important that we don't generate multiple accesses to the > >> values, because parts of the values can be change *by*hardware* (ie > >> accessed and dirty bits). > >> > >> So imagine that you have some clever global optimizer that sees that > >> the program never ever actually sets the dirty bit at all in any > >> thread, and then uses that kind of non-local knowledge to make > >> optimization decisions. THAT WOULD BE BAD. > > > > Might as well list other reasons why value proofs via whole-program > > analysis are unreliable for the Linux kernel: > > > > 1. As Linus said, changes from hardware. > > > > 2. Assembly code that is not visible to the compiler. > > Inline asms will -normally- let the compiler know what > > memory they change, but some just use the "memory" tag. > > Worse yet, I suspect that most compilers don't look all > > that carefully at .S files. > > > > Any number of other programs contain assembly files. > > > > 3. Kernel modules that have not yet been written. Now, the > > compiler could refrain from trying to prove anything about > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > is currently no way to communicate this information to the > > compiler other than marking the variable "volatile". > > > > Other programs have similar issues, e.g., via dlopen(). > > > > 4. Some drivers allow user-mode code to mmap() some of their > > state. Any changes undertaken by the user-mode code would > > be invisible to the compiler. > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > And probably other stuff as well. > > interesting list. So are you saying that value-range-analysis and > such-like (I say glibly, without really knowing what "such-like" > refers to here) are fundamentally incompatible with > the kernel code, or can you think of some way to tell the compiler a > bound on the footprint of the "unseen" changes in each of those cases? Other than the "volatile" keyword, no. Well, I suppose you could also create a function that changed the variables in question, then arrange to never call it, but in such a way that the compiler could not prove that it was never called. But ouch! Thanx, Paul
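One way such an arrangement could look (purely a sketch, not kernel code): the function is never called on purpose, but its address escapes through a volatile function pointer, so the compiler cannot prove it is never called.

int guarded_value;

static void clobber_guarded_value(void)
{
    guarded_value = -1;
}

/* The compiler must assume that code it cannot see may read this
   pointer and call through it, so it cannot assume guarded_value
   keeps any particular value. */
void (*volatile maybe_called)(void) = clobber_guarded_value;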
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell > wrote: > > > > This is a bit more subtle, because (on ARM and POWER) removing the > > dependency and conditional branch is actually in general *not* equivalent > > in the hardware, in a concurrent context. > > So I agree, but I think that's a generic issue with non-local memory > ordering, and is not at all specific to the optimization wrt that > "x?42:42" expression. > > If you have a value that you loaded with a non-relaxed load, and you > pass that value off to a non-local function that you don't know what > it does, in my opinion that implies that the compiler had better add > the necessary serialization to say "whatever that other function does, > we guarantee the semantics of the load". > > So on ppc, if you do a load with "consume" or "acquire" and then call > another function without having had something in the caller that > serializes the load, you'd better add the lwsync or whatever before > the call. Exactly because the function call itself otherwise basically > breaks the visibility into ordering. You've basically turned a > load-with-ordering-guarantees into just an integer that you passed off > to something that doesn't know about the ordering guarantees - and you > need that "lwsync" in order to still guarantee the ordering. > > Tough titties. That's what a CPU with weak memory ordering semantics > gets in order to have sufficient memory ordering. And that is in fact what C11 compilers are supposed to do if the function doesn't have the [[carries_dependency]] attribute on the corresponding argument or return of the non-local function. If the function is marked with [[carries_dependency]], then the compiler has the information needed in both compilations to make things work correctly. Thanx, Paul > And I don't think it's actually a problem in practice. If you are > doing loads with ordered semantics, you're not going to pass the > result off willy-nilly to random functions (or you really *do* require > the ordering, because the load that did the "acquire" was actually for > a lock! > > So I really think that the "local optimization" is correct regardless. > >Linus >
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel wrote: > > The standard is clear on what's required. I strongly suggest reading > > the formalization of the memory model by Batty et al. > > Can you point to it? Because I can find a draft standard, and it sure > as hell does *not* contain any clarity of the model. It has a *lot* of > verbiage, but it's pretty much impossible to actually understand, even > for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. The cppmem tool is based on this, so if you want to play around with a few code examples, it's pretty nice because it shows you all allowed executions, including graphs like in the paper (see elsewhere in this thread for CAS syntax): http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel > > > wrote: > > > > > > > > There's an underlying problem here that's independent from the actual > > > > instance that you're worried about here: "no sense" is a ultimately a > > > > matter of taste/objectives/priorities as long as the respective > > > > specification is logically consistent. > > > > > > Yes. But I don't think it's "independent". > > > > > > Exactly *because* some people will read standards without applying > > > "does the resulting code generation actually make sense for the > > > programmer that wrote the code", the standard has to be pretty clear. > > > > > > The standard often *isn't* pretty clear. It wasn't clear enough when > > > it came to "volatile", and yet that was a *much* simpler concept than > > > atomic accesses and memory ordering. > > > > > > And most of the time it's not a big deal. But because the C standard > > > generally tries to be very portable, and cover different machines, > > > there tends to be a mindset that anything inherently unportable is > > > "undefined" or "implementation defined", and then the compiler writer > > > is basically given free reign to do anything they want (with > > > "implementation defined" at least requiring that it is reliably the > > > same thing). > > > > Yes, that's how it works in general. And this makes sense, because all > > optimizations rely on that. Second, you can't keep something consistent > > (eg, between compilers) if it isn't specified. So if we want stricter > > rules, those need to be specified somewhere. > > > > > And when it comes to memory ordering, *everything* is basically > > > non-portable, because different CPU's very much have different rules. > > > > Well, the current set of memory orders (and the memory model as a whole) > > is portable, even though it might not allow to exploit all hardware > > properties, and thus might perform sub-optimally in some cases. > > > > > I worry that that means that the standard then takes the stance that > > > "well, compiler re-ordering is no worse than CPU re-ordering, so we > > > let the compiler do anything". And then we have to either add > > > "volatile" to make sure the compiler doesn't do that, or use an overly > > > strict memory model at the compiler level that makes it all pointless. > > > > Using "volatile" is not a good option, I think, because synchronization > > between threads should be orthogonal to observable output of the > > abstract machine. > > Are you thinking of "volatile" -instead- of atomics? My belief is that > given the current standard there will be times that we need to use > "volatile" -in- -addition- to atomics. No, I was talking about having to use "volatile" in addition to atomics to get synchronization properties for the atomics. ISTM that this would be unfortunate because the objective for "volatile" (ie, declaring what is output of the abstract machine) is orthogonal to synchronization between threads. So if we want to preserve control dependencies and get ordering guarantees through that, I wouldn't prefer it through hacks based on volatile. 
For example, if we have a macro or helper function that contains synchronization code, but the compiler can prove that the macro/function is used just on data only accessible to a single thread, then the "volatile" notation on it would prevent optimizations. > > The current memory model might not allow to exploit all hardware > > properties, I agree. > > > > But then why don't we work on how to extend it to do so? We need to > > specify the behavior we want anyway, and this can't be independent of > > the language semantics, so it has to be conceptually integrated with the > > standard anyway. > > > > > So I really really hope that the standard doesn't give compiler > > > writers free hands to do anything that they can prove is "equivalent" > > > in the virtual C machine model. > > > > It does, but it also doesn't mean this can't be extended. So let's > > focus on whether we can find an extension. > > > > > That's not how you get reliable > > > results. > > > > In this general form, that's obviously a false claim. > > These two sentences starkly illustrate the difference in perspective > between you two. You are talking past each other. Not sure how to fix > this at the moment, but what else is new? ;-) Well, we're still talking, and we are making little steps in clarifying things, so there is progress :)
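A small example of that concern (my sketch, C11): the helper below uses a plain atomic, so a compiler may still simplify the provably single-threaded use; marking the access volatile as well would forbid that even though no other thread can possibly be involved.

#include <stdatomic.h>

static inline int get_flag(_Atomic int *f)
{
    return atomic_load_explicit(f, memory_order_acquire);
}

int single_threaded_use(void)
{
    _Atomic int f = ATOMIC_VAR_INIT(1);  /* never escapes this function */
    return get_flag(&f);                 /* may legally be folded to 1  */
}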
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote: > Several of you have said that the standard and compiler should not > permit speculative writes of atomics, or (effectively) that the > compiler should preserve dependencies. In simple examples it's easy > to see what that means, but in general it's not so clear what the > language should guarantee, because dependencies may go via non-atomic > code in other compilation units, and we have to consider the extent to > which it's desirable to limit optimisation there. [...] > 2) otherwise, the language definition should prohibit it but the >compiler would have to preserve dependencies even in compilation >units that have no mention of atomics. It's unclear what the >(runtime and compiler development) cost of that would be in >practice - perhaps Torvald could comment? If I'm reading the standard correctly, it requires that data dependencies are preserved through loads and stores, including nonatomic ones. That sounds convenient because it allows programmers to use temporary storage. However, what happens if a dependency "arrives" at a store for which the alias set isn't completely known? Then we either have to add a barrier to enforce the ordering at this point, or we have to assume that all other potentially aliasing memory locations would also have to start carrying dependencies (which might be in other functions in other compilation units). Neither option is good. The first might introduce barriers in places in which they might not be required (or the programmer has to use kill_dependency() quite often to avoid all these). The second is bad because points-to analysis is hard, so in practice the points-to set will not be precisely known for a lot of pointers. So this might not just creep into other functions via calls of [[carries_dependency]] functions, but also through normal loads and stores, likely prohibiting many optimizations. Furthermore, the dependency tracking can currently only be "disabled/enabled" on a function granularity (via [[carries_dependency]]). Thus, if we have big functions, then dependency tracking may slow down a lot of code in the big function. If we have small functions, there's a lot of attributes to be added. If a function may only carry a dependency but doesn't necessarily (eg, depending on input parameters), then the programmer has to make a trade-off whether he/she want's to benefit from mo_consume but slow down other calls due to additional barriers (ie, when this function is called from non-[[carries_dependency]] functions), or vice versa. (IOW, because of the function granularity, other code's performance is affected.) If a compiler wants to implement dependency tracking just for a few constructs (e.g., operators -> + ...) and use barriers otherwise, then this decision must be compatible with how all this is handled in other compilation units. Thus, compiler optimizations effectively become part of the ABI, which doesn't seem right. I hope these examples illustrate my concerns about the implementability in practice of this. It's also why I've suggested to move from an opt-out approach as in the current standard (ie, with kill_dependency()) to an opt-in approach for conservative dependency tracking (e.g., with a preserve_dependencies(exp) call, where exp will not be optimized in a way that removes any dependencies). This wouldn't help with many optimizations being prevented, but it should at least help programmers contain the problem to smaller regions of code. 
I'm not aware of any implementation that tries to track dependencies, so I can't give any real performance numbers. This could perhaps be simulated, but I'm not sure whether a realistic case could be made without at least supporting [[carries_dependency]] properly in the compiler, which would be some work.
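For reference, the opt-out mechanism in the current standard, sketched in C11 (names made up):

#include <stdatomic.h>

_Atomic(int *) gp;
int table[16];

int reader(void)
{
    int *p = atomic_load_explicit(&gp, memory_order_consume);
    int idx = *p & 15;          /* the dependency flows through idx ... */
    /* ... until kill_dependency() explicitly ends the chain, so the
       table[] access need not be ordered against the consume load. */
    return table[kill_dependency(idx)];
}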
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 18:21 +, Peter Sewell wrote: > On 18 February 2014 17:38, Linus Torvalds > wrote: > > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell > > wrote: > >> > >> For example, suppose we have, in one compilation unit: > >> > >> void f(int ra, int*rb) { > >> if (ra==42) > >> *rb=42; > >> else > >> *rb=42; > >> } > > > > So this is a great example, and in general I really like your page at: > > > >> For more context, this example is taken from a summary of the thin-air > >> problem by Mark Batty and myself, > >> , and the problem with > >> dependencies via other compilation units was AFAIK first pointed out > >> by Hans Boehm. > > > > and the reason I like your page is that it really talks about the > > problem by pointing to the "unoptimized" code, and what hardware would > > do. > > Thanks. It's certainly necessary to separately understand what compiler > optimisation and the hardware might do, to get anywhere here. But... > > > As mentioned, I think that's actually the *correct* way to think about > > the problem space, because it allows the programmer to take hardware > > characteristics into account, without having to try to "describe" them > > at a source level. > > ...to be clear, I am ultimately after a decent source-level description of > what > programmers can depend on, and we (Mark and I) view that page as > identifying constraints on what that description can say. There are too > many compiler optimisations for people to reason directly in terms of > the set of all transformations that they do, so we need some more > concise and comprehensible envelope identifying what is allowed, > as an interface between compiler writers and users. AIUI that's basically > what Torvald is arguing. Yes, that's one reason. Another one is that if a programmer would actually want to use atomics in a machine-independent / portable way, he/she does also not want to reason about how all those transformations might interact with the machine's memory model.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> And exactly because I know enough, I would *really* like atomics to be > >> well-defined, and have very clear - and *local* - rules about how they > >> can be combined and optimized. > > > > "Local"? > > Yes. > > So I think that one of the big advantages of atomics over volatile is > that they *can* be optimized, and as such I'm not at all against > trying to generate much better code than for volatile accesses. > > But at the same time, that can go too far. For example, one of the > things we'd want to use atomics for is page table accesses, where it > is very important that we don't generate multiple accesses to the > values, because parts of the values can be change *by*hardware* (ie > accessed and dirty bits). > > So imagine that you have some clever global optimizer that sees that > the program never ever actually sets the dirty bit at all in any > thread, and then uses that kind of non-local knowledge to make > optimization decisions. THAT WOULD BE BAD. > > Do you see what I'm aiming for? Yes, I do. But that seems to be "volatile" territory. It crosses the boundaries of the abstract machine, and thus is input/output. Which fraction of your atomic accesses can read values produced by hardware? I would still suppose that lots of synchronization is not affected by this. Do you perhaps want a weaker form of volatile? That is, one that, for example, allows combining of two adjacent loads of the dirty bits, but will make sure that this is treated as if there is some imaginary external thread that it cannot analyze and that may write? I'm trying to be careful here in distinguishing between volatile and synchronization because I believe those are orthogonal, and this is a good thing because it allows for more-than-conservatively optimized code. > Any optimization that tries to prove > anything from more than local state is by definition broken, because > it assumes that everything is described by the program. Well, that's how atomics that aren't volatile are defined in the standard. I can see that you want something else too, but that doesn't mean that the other thing is broken. > But *local* optimizations are fine, as long as they follow the obvious > rule of not actually making changes that are semantically visible. If we assume that there is this imaginary thread called hardware that can write/read to/from such weak-volatile atomics, I believe this should restrict optimizations sufficiently even in the model as specified in the standard. For example, it would prevent a compiler from proving that there is no access by another thread to a variable, so it would prevent the cases in our discussion that you didn't want to get optimized. Yet, I believe the model itself could stay unmodified. Thus, with this "weak volatile", we could let programmers request precisely the semantics that they want, without using volatile too much and without preventing optimizations more than necessary. Thoughts? 
> (In practice, I'd be impressed as hell for that particular example, > and we actually do end up setting the dirty bit by hand in some > situations, so the example is slightly made up, but there are other > cases that might be more realistic in that sometimes we do things that > are hidden from the compiler - in assembly etc - and the compiler > might *think* it knows what is going on, but it doesn't actually see > all accesses). If something is visible to assembly, then the compiler should see this (or at least be aware that it can't prove anything about such data). Or am I missing anything? > > Sorry, but the rules *are* very clear. I *really* suggest to look at > > the formalization by Batty et al. And in these rules, proving that a > > read will always return value X has a well-defined meaning, and you can > > use it. That simply follows from how the model is built. > > What "model"? The C/C++ memory model was what I was referring to. > That's the thing. I have tried to figure out whether the model is some > abstract C model, or a model based on the actual hardware that the > compiler is compiling for, and whether the model is one that assumes > the compiler has complete knowledge of the system (see the example > above). It's a model as specified in the standard. It's not parametrized by the hardware the program will eventually run on (ignoring implementation-defined behavior, timing, etc.). The compiler has complete knowledge of the system unless for "volatiles" and things coming in from the external world or escaping to it (e.g., stuff escaping into asm statements). > And it seems to be a mixture of it all. The definitions of the various > orderings obviously very much imply that the compiler has to insert > the proper barriers or sequence markers for that architecture, but > then there is no apparent w
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote: > On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote: > > Several of you have said that the standard and compiler should not > > permit speculative writes of atomics, or (effectively) that the > > compiler should preserve dependencies. In simple examples it's easy > > to see what that means, but in general it's not so clear what the > > language should guarantee, because dependencies may go via non-atomic > > code in other compilation units, and we have to consider the extent to > > which it's desirable to limit optimisation there. > > [...] > > > 2) otherwise, the language definition should prohibit it but the > >compiler would have to preserve dependencies even in compilation > >units that have no mention of atomics. It's unclear what the > >(runtime and compiler development) cost of that would be in > >practice - perhaps Torvald could comment? > > If I'm reading the standard correctly, it requires that data > dependencies are preserved through loads and stores, including nonatomic > ones. That sounds convenient because it allows programmers to use > temporary storage. > > However, what happens if a dependency "arrives" at a store for which the > alias set isn't completely known? Then we either have to add a barrier > to enforce the ordering at this point, or we have to assume that all > other potentially aliasing memory locations would also have to start > carrying dependencies (which might be in other functions in other > compilation units). Neither option is good. The first might introduce > barriers in places in which they might not be required (or the > programmer has to use kill_dependency() quite often to avoid all these). > The second is bad because points-to analysis is hard, so in practice the > points-to set will not be precisely known for a lot of pointers. So > this might not just creep into other functions via calls of > [[carries_dependency]] functions, but also through normal loads and > stores, likely prohibiting many optimizations. I cannot immediately think of a situation where a store carrying a dependency into a non-trivially aliased object wouldn't be a usage error, so perhaps emitting a barrier and a diagnostic at that point is best. > Furthermore, the dependency tracking can currently only be > "disabled/enabled" on a function granularity (via > [[carries_dependency]]). Thus, if we have big functions, then > dependency tracking may slow down a lot of code in the big function. If > we have small functions, there's a lot of attributes to be added. > > If a function may only carry a dependency but doesn't necessarily (eg, > depending on input parameters), then the programmer has to make a > trade-off whether he/she want's to benefit from mo_consume but slow down > other calls due to additional barriers (ie, when this function is called > from non-[[carries_dependency]] functions), or vice versa. (IOW, > because of the function granularity, other code's performance is > affected.) > > If a compiler wants to implement dependency tracking just for a few > constructs (e.g., operators -> + ...) and use barriers otherwise, then > this decision must be compatible with how all this is handled in other > compilation units. Thus, compiler optimizations effectively become part > of the ABI, which doesn't seem right. 
> > I hope these examples illustrate my concerns about the implementability > in practice of this. It's also why I've suggested to move from an > opt-out approach as in the current standard (ie, with kill_dependency()) > to an opt-in approach for conservative dependency tracking (e.g., with a > preserve_dependencies(exp) call, where exp will not be optimized in a > way that removes any dependencies). This wouldn't help with many > optimizations being prevented, but it should at least help programmers > contain the problem to smaller regions of code. > > I'm not aware of any implementation that tries to track dependencies, so > I can't give any real performance numbers. This could perhaps be > simulated, but I'm not sure whether a realistic case would be made > without at least supporting [[carries_dependency]] properly in the > compiler, which would be some work. Another approach would be to use start-tracking/stop-tracking directives that could be buried into rcu_read_lock() and rcu_read_unlock(). There are issues with nesting and conditional use of rcu_read_lock() and rcu_read_unlock(), but it does give you nicer granularity properties. Thanx, Paul
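For context, the kernel pattern this would serve looks roughly as follows (simplified sketch assuming the kernel's RCU API from <linux/rcupdate.h>; do_something_with() is made up):

struct foo { int a; };
struct foo *gp;

extern void do_something_with(int a);

void reader(void)
{
    struct foo *p;

    rcu_read_lock();            /* could double as "start tracking"  */
    p = rcu_dereference(gp);    /* the kernel's consume-style load   */
    if (p)
        do_something_with(p->a);
    rcu_read_unlock();          /* ... and as "stop tracking"        */
}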
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote: > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > >> And exactly because I know enough, I would *really* like atomics to be > > >> well-defined, and have very clear - and *local* - rules about how they > > >> can be combined and optimized. > > > > > > "Local"? > > > > Yes. > > > > So I think that one of the big advantages of atomics over volatile is > > that they *can* be optimized, and as such I'm not at all against > > trying to generate much better code than for volatile accesses. > > > > But at the same time, that can go too far. For example, one of the > > things we'd want to use atomics for is page table accesses, where it > > is very important that we don't generate multiple accesses to the > > values, because parts of the values can be change *by*hardware* (ie > > accessed and dirty bits). > > > > So imagine that you have some clever global optimizer that sees that > > the program never ever actually sets the dirty bit at all in any > > thread, and then uses that kind of non-local knowledge to make > > optimization decisions. THAT WOULD BE BAD. > > Might as well list other reasons why value proofs via whole-program > analysis are unreliable for the Linux kernel: > > 1.As Linus said, changes from hardware. This is what's volatile is for, right? (Or the weak-volatile idea I mentioned). Compilers won't be able to prove something about the values of such variables, if marked (weak-)volatile. > 2.Assembly code that is not visible to the compiler. > Inline asms will -normally- let the compiler know what > memory they change, but some just use the "memory" tag. > Worse yet, I suspect that most compilers don't look all > that carefully at .S files. > > Any number of other programs contain assembly files. Are the annotations of changed memory really a problem? If the "memory" tag exists, isn't that supposed to mean all memory? To make a proof about a program for location X, the compiler has to analyze all uses of X. Thus, as soon as X escapes into an .S file, then the compiler will simply not be able to prove a thing (except maybe due to the data-race-free requirement for non-atomics). The attempt to prove something isn't unreliable, simply because a correct compiler won't claim to be able to "prove" something. One reason that could corrupt this is that if program addresses objects other than through the mechanisms defined in the language. For example, if one thread lays out a data structure at a constant fixed memory address, and another one then uses the fixed memory address to get access to the object with a cast (e.g., (void*)0x123). > 3.Kernel modules that have not yet been written. Now, the > compiler could refrain from trying to prove anything about > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > is currently no way to communicate this information to the > compiler other than marking the variable "volatile". Even if the variable is just externally accessible, then the compiler knows that it can't do whole-program analysis about it. It is true that whole-program analysis will not be applicable in this case, but it will not be unreliable. I think that's an important difference. > Other programs have similar issues, e.g., via dlopen(). > > 4.Some drivers allow user-mode code to mmap() some of their > state. 
Any changes undertaken by the user-mode code would > be invisible to the compiler. A good point, but a compiler that doesn't try to (incorrectly) assume something about the semantics of mmap will simply see that the mmap'ed data will escape to stuff it can't analyze, so it will not be able to make a proof. This is different from, for example, malloc(), which is guaranteed to return "fresh" nonaliasing memory. > 5.JITed code produced based on BPF: https://lwn.net/Articles/437981/ This might be special, or not, depending on how the JITed code gets access to data. If this is via fixed addresses (e.g., (void*)0x123), then see above. If this is through function calls that the compiler can't analyze, then this is like 4.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > Well, that's how atomics that aren't volatile are defined in the > standard. I can see that you want something else too, but that doesn't > mean that the other thing is broken. Well that other thing depends on being able to see the entire program at compile time. PaulMck already listed various ways in which this is not feasible even for normal userspace code. In particular; DSOs and JITs were mentioned.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > Yes, I do. But that seems to be "volatile" territory. It crosses the > boundaries of the abstract machine, and thus is input/output. Which > fraction of your atomic accesses can read values produced by hardware? > I would still suppose that lots of synchronization is not affected by > this. It's not only hardware; also the kernel/user boundary has this same problem. We cannot a priori say what userspace will do; in fact, because we're a general purpose OS, we must assume it will willfully try its bestest to wreck whatever assumptions we make about its behaviour. We also have loadable modules -- much like regular userspace DSOs -- so there too we cannot say what will or will not happen. We also have JITs that generate code on demand. And I'm absolutely sure (with the exception of the JITs, it's not an area I've worked on) that we have atomic usage across all those boundaries. I must agree with Linus, global state driven optimizations are crack-brained; esp. for atomics. We simply cannot know all state at compile time. The best we can hope for are local optimizations.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: > On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > > Well, that's how atomics that aren't volatile are defined in the > > standard. I can see that you want something else too, but that doesn't > > mean that the other thing is broken. > > Well that other thing depends on being able to see the entire program at > compile time. PaulMck already listed various ways in which this is > not feasible even for normal userspace code. > > In particular; DSOs and JITs were mentioned. No it doesn't depend on whole-program analysis being possible. Because if it isn't, then a correct compiler will just not do certain optimizations simply because it can't prove properties required for the optimization to hold. With the exception of access to objects via magic numbers (e.g., fixed and known addresses (see my reply to Paul), which are outside of the semantics specified in the standard), I don't see a correctness problem here.
Re: [RFC][PATCH 0/5] arch: atomic rework
> > 4. Some drivers allow user-mode code to mmap() some of their
> >    state. Any changes undertaken by the user-mode code would
> >    be invisible to the compiler.
>
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
>
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

The kernel side of this is different: it looks like 'normal' memory, we just happen to allow it to end up in userspace too.

But on that point: how do you tell the compiler the difference between malloc() and mmap()? Is that some function attribute?
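For reference, GCC does have such an attribute: a function can be declared with __attribute__((malloc)) to promise that the pointer it returns does not alias anything else the caller can see, which is the "fresh" property being discussed. A minimal sketch follows; the wrapper name is invented, and this is not kernel code.

  #include <stdlib.h>

  /* Tells the optimizer the returned pointer is fresh, i.e. it aliases
   * nothing else visible to the caller -- unlike a pointer into shared,
   * mmap()ed state, which other agents may modify behind our back. */
  __attribute__((malloc))
  void *alloc_scratch(size_t n)
  {
          return calloc(1, n);
  }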
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel wrote:
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>>
>> Do you see what I'm aiming for?
>
> Yes, I do. But that seems to be "volatile" territory. It crosses the
> boundaries of the abstract machine, and thus is input/output. Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

The "hardware can change things" case is indeed pretty rare. But quite frankly, even when it isn't hardware, as far as the compiler is concerned you have the exact same issue - you have TLB faults happening on other CPUs that do the same thing asynchronously using software TLB fault handlers. So *semantically*, it really doesn't make any difference whatsoever if it's a software TLB handler on another CPU, a microcoded TLB fault, or an actual hardware path.

So if the answer for all of the above is "use volatile", then I think that means that the C11 atomics are badly designed. The whole *point* of atomic accesses is that stuff like the above should "JustWork(tm)".

> Do you perhaps want a weaker form of volatile? That is, one that, for
> example, allows combining of two adjacent loads of the dirty bits, but
> will make sure that this is treated as if there is some imaginary
> external thread that it cannot analyze and that may write?

Yes, that's basically what I would want. And it is what I would expect an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a bit of a misnomer, because technically we really generally want "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use volatile, and it is a hell of a lot simpler to write ;^).

So we obviously use "volatile" for this currently, and generally the semantics we really want are:

 - the load or store is done as a single access ("atomic")

 - the compiler must not try to re-materialize the value by reloading it from memory (this is the "at most once" part)

and quite frankly, "volatile" is a big hammer for this. In practice it tends to work pretty well, though, because in _most_ cases there really is just the single access, so there isn't anything that it could be combined with, and the biggest issue is often just the correctness of not re-materializing the value.

And I agree - memory ordering is a totally separate issue, and in fact we largely tend to consider it entirely separate. For cases where we have ordering constraints, we either handle those with special accessors (ie "atomic-modify-and-test" helpers tend to have some serialization guarantees built in), or we add explicit fencing.

But semantically, C11 atomic accessors *should* generally have the correct behavior for our uses. If we have to add "volatile", that makes atomics basically useless. We already *have* the volatile semantics; if atomics need it, that just means that atomics have zero upside for us.

>> But *local* optimizations are fine, as long as they follow the obvious
>> rule of not actually making changes that are semantically visible.
>
> If we assume that there is this imaginary thread called hardware that
> can write/read to/from such weak-volatile atomics, I believe this should
> restrict optimizations sufficiently even in the model as specified in
> the standard.

Well, what about *real* threads that do this, but that aren't analyzable by the C compiler because they are written in another language entirely (inline asm, asm, perl, INTERCAL, microcode, PAL-code, whatever)?

I really don't think that "hardware" is necessary for this to happen. What is done by hardware on x86, for example, is done by PAL-code (loaded at boot-time) on alpha, and done by hand-tuned assembler fault handlers on Sparc. The *effect* is the same: it's not visible to the compiler. There is no way in hell that the compiler can understand the hand-tuned Sparc TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do? Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation. The thing is, you really have three cases:

 - architectures that have weak memory ordering and fairly stupid hardware (power, some day maybe ARM)

 - architectures that expose a fairly strong memory ordering, but reorders aggressively in hardware by tracking c
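For readers following along, the ACCESS_ONCE() idiom referred to above is, at the time of this thread, a one-line macro in the kernel's include/linux/compiler.h. A minimal sketch of it and of the kind of use being described (the pte variable here is invented for illustration):

  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

  static unsigned long pte_val;
  static unsigned long *pte = &pte_val;

  unsigned long read_pte_once(void)
  {
          /* The volatile-qualified lvalue forces exactly one load of *pte
           * and forbids the compiler from re-reading (re-materializing)
           * the value later; it says nothing about memory ordering. */
          return ACCESS_ONCE(*pte);
  }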
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel wrote:
> > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > >> And exactly because I know enough, I would *really* like atomics to be
> > > >> well-defined, and have very clear - and *local* - rules about how they
> > > >> can be combined and optimized.
> > > >
> > > > "Local"?
> > >
> > > Yes.
> > >
> > > So I think that one of the big advantages of atomics over volatile is
> > > that they *can* be optimized, and as such I'm not at all against
> > > trying to generate much better code than for volatile accesses.
> > >
> > > But at the same time, that can go too far. For example, one of the
> > > things we'd want to use atomics for is page table accesses, where it
> > > is very important that we don't generate multiple accesses to the
> > > values, because parts of the values can be changed *by hardware* (ie
> > > accessed and dirty bits).
> > >
> > > So imagine that you have some clever global optimizer that sees that
> > > the program never ever actually sets the dirty bit at all in any
> > > thread, and then uses that kind of non-local knowledge to make
> > > optimization decisions. THAT WOULD BE BAD.
> >
> > Might as well list other reasons why value proofs via whole-program
> > analysis are unreliable for the Linux kernel:
> >
> > 1. As Linus said, changes from hardware.
>
> This is what volatile is for, right? (Or the weak-volatile idea I
> mentioned.)
>
> Compilers won't be able to prove something about the values of such
> variables, if marked (weak-)volatile.

Yep.

> > 2. Assembly code that is not visible to the compiler.
> >    Inline asms will -normally- let the compiler know what
> >    memory they change, but some just use the "memory" tag.
> >    Worse yet, I suspect that most compilers don't look all
> >    that carefully at .S files.
> >
> >    Any number of other programs contain assembly files.
>
> Are the annotations of changed memory really a problem? If the "memory"
> tag exists, isn't that supposed to mean all memory?
>
> To make a proof about a program for location X, the compiler has to
> analyze all uses of X. Thus, as soon as X escapes into an .S file, the
> compiler will simply not be able to prove a thing (except maybe due
> to the data-race-free requirement for non-atomics). The attempt to
> prove something isn't unreliable, simply because a correct compiler
> won't claim to be able to "prove" something.

I am indeed less worried about inline assembler than I am about files full of assembly. Or files full of other languages.

> One thing that could undermine this is if the program addresses objects
> other than through the mechanisms defined in the language. For example,
> if one thread lays out a data structure at a constant fixed memory
> address, and another one then uses that fixed memory address to get
> access to the object with a cast (e.g., (void*)0x123).

Or if the program uses gcc linker scripts to get the same effect.

> > 3. Kernel modules that have not yet been written. Now, the
> >    compiler could refrain from trying to prove anything about
> >    an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> >    is currently no way to communicate this information to the
> >    compiler other than marking the variable "volatile".
>
> Even if the variable is just externally accessible, then the compiler
> knows that it can't do whole-program analysis about it.
>
> It is true that whole-program analysis will not be applicable in this
> case, but it will not be unreliable. I think that's an important
> difference.

Let me make sure that I understand what you are saying. If my program has "extern int foo;", the compiler will refrain from doing whole-program analysis involving "foo"? Or to ask it another way, when you say "whole-program analysis", are you restricting that analysis to the current translation unit? If so, I was probably not the only person thinking that you instead meant analysis across all translation units linked into the program. ;-)

> > Other programs have similar issues, e.g., via dlopen().
> >
> > 4. Some drivers allow user-mode code to mmap() some of their
> >    state. Any changes undertaken by the user-mode code would
> >    be invisible to the compiler.
>
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
>
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

As Peter
Re: [RFC][PATCH 0/5] arch: atomic rework
On 18 February 2014 20:43, Torvald Riegel wrote:
> On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies. In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>
> [...]
>
>> 2) otherwise, the language definition should prohibit it but the
>>    compiler would have to preserve dependencies even in compilation
>>    units that have no mention of atomics. It's unclear what the
>>    (runtime and compiler development) cost of that would be in
>>    practice - perhaps Torvald could comment?
>
> If I'm reading the standard correctly, it requires that data
> dependencies are preserved through loads and stores, including nonatomic
> ones. That sounds convenient because it allows programmers to use
> temporary storage.

The standard only needs this for consume chains, but if one wanted to get rid of thin-air values by requiring implementations to respect all (reads-from union dependency) cycles, AFAICS we'd need it pretty much everywhere. I don't myself think that's likely to be a realistic proposal, but it does keep coming up, and it'd be very interesting to know the actual cost on some credible workload.

> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known? Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units). Neither option is good. The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers. So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.
>
> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" at function granularity (via
> [[carries_dependency]]). Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function. If
> we have small functions, there's a lot of attributes to be added.
>
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> trade-off: whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa. (IOW,
> because of the function granularity, other code's performance is
> affected.)
>
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...) and use barriers otherwise, then
> this decision must be compatible with how all this is handled in other
> compilation units. Thus, compiler optimizations effectively become part
> of the ABI, which doesn't seem right.
>
> I hope these examples illustrate my concerns about the implementability
> in practice of this. It's also why I've suggested moving from an
> opt-out approach as in the current standard (ie, with kill_dependency())
> to an opt-in approach for conservative dependency tracking (e.g., with a
> preserve_dependencies(exp) call, where exp will not be optimized in a
> way that removes any dependencies). This wouldn't help with many
> optimizations being prevented, but it should at least help programmers
> contain the problem to smaller regions of code.
>
> I'm not aware of any implementation that tries to track dependencies, so
> I can't give any real performance numbers. This could perhaps be
> simulated, but I'm not sure whether a realistic case could be made
> without at least supporting [[carries_dependency]] properly in the
> compiler, which would be some work.
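To make the constructs under discussion concrete, here is a minimal sketch using the C11 spellings; the node type and functions are invented and are not code from the thread. A memory_order_consume load is meant to order later accesses that carry a data dependency on the loaded pointer, kill_dependency() is the standard's opt-out, and [[carries_dependency]] is the attribute (shown for C++ in the standard) for letting dependencies flow across function boundaries.

  #include <stdatomic.h>
  #include <stddef.h>

  struct node { int payload; };
  static _Atomic(struct node *) head;

  int read_payload(void)
  {
          struct node *n = atomic_load_explicit(&head, memory_order_consume);
          if (!n)
                  return -1;
          /* Dependency-ordered: this dereference carries a data dependency
           * on the consume load, so (assuming the publisher stored the node
           * with a release store) it sees the publisher's write to payload
           * without needing an extra barrier on most architectures. */
          return n->payload;
  }

  int read_payload_unordered(void)
  {
          struct node *n = atomic_load_explicit(&head, memory_order_consume);
          if (!n)
                  return -1;
          /* kill_dependency() ends the dependency chain: the implementation
           * may treat this dereference as unordered relative to the load. */
          return kill_dependency(n)->payload;
  }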
ARM inline assembly usage in Linux kernel
Hello.

I am sending this at the behest of Renato. I have been working on the ARM integrated assembler in LLVM and came across an interesting item in the Linux kernel.

I am wondering if this is an unstated covenant between the kernel and GCC or simply a clever use of an unintended/undefined behaviour.

The Linux kernel uses the *compiler* as a fancy preprocessor to generate a specially crafted assembly file. This file is then post-processed via sed to generate a header containing constants which is shared across assembly and C sources.

In order to clarify the question, I am selecting a particular example and pulling out the relevant bits of the source code below.

  #define DEFINE(sym, val) asm volatile("\n->" #sym " %0 " #val : : "i" (val))

  #define __NR_PAGEFLAGS 22

  void definitions(void) {
    DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
  }

This is then compiled to generate the following assembly:

  ->NR_PAGEFLAGS #22 __NR_PAGEFLAGS

This will later be post-processed to generate:

  #define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */

Using the inline assembler to evaluate (constant) expressions into constant values and then emit them with a special identifier (->) is a fairly clever trick. This leads to my question: is this just use of an unintentional "feature", or something that was worked out between the two projects?

Please explicitly CC me on any response as I am not subscribed to this mailing list.

Thanks.

--
Saleem Abdulrasool
compnerd (at) compnerd (dot) org
Re: ARM inline assembly usage in Linux kernel
On Tue, Feb 18, 2014 at 6:56 PM, Saleem Abdulrasool wrote:
> Hello.
>
> I am sending this at the behest of Renato. I have been working on the ARM
> integrated assembler in LLVM and came across an interesting item in the Linux
> kernel.
>
> I am wondering if this is an unstated covenant between the kernel and GCC or
> simply a clever use of an unintended/undefined behaviour.
>
> The Linux kernel uses the *compiler* as a fancy preprocessor to generate a
> specially crafted assembly file. This file is then post-processed via sed to
> generate a header containing constants which is shared across assembly and C
> sources.
>
> In order to clarify the question, I am selecting a particular example and
> pulling out the relevant bits of the source code below.
>
> #define DEFINE(sym, val) asm volatile("\n->" #sym " %0 " #val : : "i" (val))
>
> #define __NR_PAGEFLAGS 22
>
> void definitions(void) {
>   DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
> }
>
> This is then compiled to generate the following assembly:
>
> ->NR_PAGEFLAGS #22 __NR_PAGEFLAGS
>
> This will later be post-processed to generate:
>
> #define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */
>
> Using the inline assembler to evaluate (constant) expressions into constant
> values and then emit them with a special identifier (->) is a fairly clever
> trick. This leads to my question: is this just use of an unintentional
> "feature", or something that was worked out between the two projects?
>
> Please explicitly CC me on any response as I am not subscribed to this mailing
> list.

I don't see why this is a bad use of the inline-asm. GCC does not know, and is not supposed to know, what the string inside the inline-asm is going to be. In fact, if you have a newer assembler than the compiler, you could use instructions that GCC does not even know about. This is the purpose of inline-asm.

I think it was a bad design decision on LLVM/clang's part that it would check the assembly code up front.

Thanks,
Andrew Pinski

> Thanks.
>
> --
> Saleem Abdulrasool
> compnerd (at) compnerd (dot) org