Re: MSP430 in gcc4.9 ... enable interrupts?

2014-02-18 Thread David Brown
On 18/02/14 00:12, DJ Delorie wrote:
>> I presume these will be part of the headers for the library
>> distributed for msp430 gcc by TI/Redhat?
> 
> I can't speak for TI's or Red Hat's plans.  GNU's typical non-custom
> embedded runtime is newlib/libgloss, which usually doesn't have that
> much in the way of chip-specific headers or library functions.

Fair enough.  I don't know if Red Hat will be distributing anything
themselves, but I'm confident that TI will distribute chip-specific
headers with pre-built msp430 gcc packages (their aim, after all, is to
get more people to buy msp430 chips - and making the tools as easy and
powerful as possible is part of that).  I was just wondering where these
headers would fit in.

> 
>> is that for the "critical" attribute that exists in the old msp430
>> port (which disables interrupts for the duration of the function)?
> 
> Yes, for things like that.  They're documented under "Function
> Attributes" in the "Extensions to the C Language Family" chapter of
> the current GCC manual.
> 

Ah yes, I missed it when I first looked - the documentation makes
"critical" look like an option to the "interrupt" attribute, rather than
a stand-alone attribute.  It seems to me a rather strange idea to have
both "interrupt" and "critical" on the same function - an "interrupt"
function is inherently "critical" (and "reentrant") in that interrupts
are disabled before it enters, and restored or re-enabled on exit.  It
would make a difference on processors like ARMs with several interrupt
levels, but not on an msp430 with its single level.
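
For reference, the attribute syntax under discussion looks roughly like
this (a sketch based on the "Function Attributes" section of the GCC
manual; the function names are invented for illustration):

  /* Interrupt handler - interrupts are already disabled on entry by
     the hardware, so marking it "critical" as well adds nothing on a
     single-level msp430.  */
  void __attribute__((interrupt)) port1_isr (void);

  /* Ordinary function that must run with interrupts disabled - the
     "critical" attribute wraps the body in a disable/restore pair.  */
  void __attribute__((critical)) update_shared_state (void);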

David



Re: TYPE_BINFO and canonical types at LTO

2014-02-18 Thread Richard Biener
On Mon, 17 Feb 2014, Jan Hubicka wrote:

> > 
> > Yeah, ok.  But we treat those types (B and C) TBAA equivalent because
> > structurally they are the same ;)  Luckily C has a "proper" field
> > for its base (proper means that offset and size are correct as well
> > as the type).  It indeed has DECL_ARTIFICIAL set and yes, we treat
> > those as "real" fields when doing the structural comparison.
> 
> Yep, the difference is that depending on whether C or D wins, we will end
> up walking the BINFO or not.  So we should not depend on the BINFO walk
> for correctness.
> > 
> > More interesting is of course when we can re-use tail-padding in
> > one but not the other (works as expected - not merged).
> 
> Yep.
> > 
> > struct A { A (); short x; bool a;};
> > struct C:A { bool b; };
> > struct B {struct A a; bool b;};
> > struct C *p2;
> > struct B *p1;
> > int
> > t()
> > {
> >   p1->a.a = 2;
> >   return p2->a;
> > }
> > 
> > > Yes, zero sized classes are those having no fields (but other stuff, 
> > > type decls, bases etc.)
> > 
> > Yeah, but TBAA obviously doesn't care about type decls and bases.
> 
> So I guess the conclusion is that the BINFO walk in alias.c is pointless?

Yes.  But as I said - I remember being there and proposing to remove
it.  Some N > 5 years ago or so and it was either rejected or it didn't
work out ;)

> Concerning the merging details and LTO aliasing, I think for 4.10 we 
> should make C++ compute mangled names of types (i.e. call 
> DECL_ASSEMBLER_NAME on the associated type_decl + explicitly mark that 
> the type is driven by ODR) and then we can do merging driven by the ODR rule.
> 
> Non-ODR types born from other frontends will then need to be made to 
> alias all the ODR variants; that can be done by storing them into the 
> current canonical type hash.
> (I wonder if we want to support cross language aliasing for non-POD?)

Surely for accessing components of non-POD types, no?  Like

class Foo {
  Foo();
  int *get_data ();
  int *data;
} glob_foo;

extern "C" int *get_foo_data() { return glob_foo.get_data(); }

?  But you are talking about the "tree" merging part using ODR info
to also merge types which differ in completeness of contained
pointer types, right?  (exactly equal cases should be already merged)

The canonical type computation happens separately (only for prevailing
types, of course), and there we already "merge" types which differ
in completeness.  Canonical type merging is conservative the other
way around - if we merge _all_ types to a single canonical type then
TBAA is still correct (we get a single alias set).

> I also think we want explicit representation of types known to be local 
> to compilation unit - anonymous namespaces in C/C++, types defined 
> within function bodies in C and god knows what in Ada/Fortran/Java.

But here you get into the idea of improving TBAA, thus having
_more_ distinct canonical types?

Just to make sure to not mix those two ;)

And whatever "frontend knowledge" we want to exercise - please
make sure we get a reliable way for the middle-end to see
that "frontend knowledge" (no langhooks!).  Thus, make it
"middle-end knowledge".

Oh - and the easiest way to improve things is to get less types into
the merging process in the first place!

Richard.


Re: TYPE_BINFO and canonical types at LTO

2014-02-18 Thread Richard Biener
On Tue, 18 Feb 2014, Richard Biener wrote:

> On Mon, 17 Feb 2014, Jan Hubicka wrote:
> 
> > > 
> > > Yeah, ok.  But we treat those types (B and C) TBAA equivalent because
> > > structurally they are the same ;)  Luckily C has a "proper" field
> > > for its base (proper means that offset and size are correct as well
> > > as the type).  It indeed has DECL_ARTIFICIAL set and yes, we treat
> > > those as "real" fields when doing the structural comparison.
> > 
> > Yep, the difference is that depending on whether C or D wins, we will end
> > up walking the BINFO or not.  So we should not depend on the BINFO walk
> > for correctness.
> > > 
> > > More interesting is of course when we can re-use tail-padding in
> > > one but not the other (works as expected - not merged).
> > 
> > Yep.
> > > 
> > > struct A { A (); short x; bool a;};
> > > struct C:A { bool b; };
> > > struct B {struct A a; bool b;};
> > > struct C *p2;
> > > struct B *p1;
> > > int
> > > t()
> > > {
> > >   p1->a.a = 2;
> > >   return p2->a;
> > > }
> > > 
> > > > Yes, zero sized classes are those having no fields (but other stuff, 
> > > > type decls, bases etc.)
> > > 
> > > Yeah, but TBAA obviously doesn't care about type decls and bases.
> > 
> > So I guess the conclusion is that the BINFO walk in alias.c is pointless?
> 
> Yes.  But as I said - I remember being there and proposing to remove
> it.  Some N > 5 years ago or so and it was either rejected or it didn't
> work out ;)

Btw, a bootstrap & regtest worked fine with removing that loop
(not that this proves anything).

Richard.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
Several of you have said that the standard and compiler should not
permit speculative writes of atomics, or (effectively) that the
compiler should preserve dependencies.  In simple examples it's easy
to see what that means, but in general it's not so clear what the
language should guarantee, because dependencies may go via non-atomic
code in other compilation units, and we have to consider the extent to
which it's desirable to limit optimisation there.

For example, suppose we have, in one compilation unit:

void f(int ra, int*rb) {
  if (ra==42)
    *rb=42;
  else
    *rb=42;
}

and in another compilation unit the bodies of two threads:

// Thread 0
r1 = x;
f(r1,&r2);
y = r2;

// Thread 1
r3 = y;
f(r3,&r4);
x = r4;

where accesses to x and y are annotated C11 atomic
memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.

(Of course, this is an artificial example, to make the point below as
simply as possible - in real code the branches of the conditional
might not be syntactically identical, just equivalent after macro
expansion and other optimisation.)

In the source program there's a dependency from the read of x to the
write of y in Thread 0, and from the read of y to the write of x on
Thread 1.  Dependency-respecting compilation would preserve those and
the ARM and POWER architectures both respect them, so the reads of x
and y could not give 42.

But a compiler might well optimise the (non-atomic) body of f() to
just *rb=42, making the threads effectively

// Thread 0
r1 = x;
y = 42;

// Thread 1
r3 = y;
x = 42;

(GCC does this at O1, O2, and O3) and the ARM and POWER architectures
permit those two reads to see 42. That is moreover actually observable
on current ARM hardware.

So as far as we can see, either:

1) if you can accept the latter behaviour (if the Linux codebase does
   not rely on its absence), the language definition should permit it,
   and current compiler optimisations can be used,

or

2) otherwise, the language definition should prohibit it but the
   compiler would have to preserve dependencies even in compilation
   units that have no mention of atomics.  It's unclear what the
   (runtime and compiler development) cost of that would be in
   practice - perhaps Torvald could comment?


For more context, this example is taken from a summary of the thin-air
problem by Mark Batty and myself,
and the problem with
dependencies via other compilation units was AFAIK first pointed out
by Hans Boehm.

Peter


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Zijlstra
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.

The example below only deals with control dependencies; so I'll limit
myself to that.

> In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
> 
> For example, suppose we have, in one compilation unit:
> 
> void f(int ra, int*rb) {
>   if (ra==42)
>     *rb=42;
>   else
>     *rb=42;
> }
> 
> and in another compilation unit the bodies of two threads:
> 
> // Thread 0
> r1 = x;
> f(r1,&r2);
> y = r2;
> 
> // Thread 1
> r3 = y;
> f(r3,&r4);
> x = r4;
> 
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.

So I'm intuitively ok with this, however I would expect something like:

  void f(_Atomic int ra, _Atomic int *rb);

to preserve dependencies and not make the conditional go away, simply
because in that case, in:

  if (ra == 42)

the 'ra' usage can be seen as an atomic load.

> So as far as we can see, either:
> 
> 1) if you can accept the latter behaviour (if the Linux codebase does
>not rely on its absence), the language definition should permit it,
>and current compiler optimisations can be used,

Currently there's exactly 1 site in the Linux kernel that relies on
control dependencies as far as I know -- the one I put in.  And it's
limited to a single function, so no cross-translation-unit funnies
there.

Of course, nobody is going to tell me when or where they'll put in the
next one, since it's now documented as accepted practice.

However, PaulMck and our RCU usage very much do cross all sorts of TU
boundaries; but those are data dependencies.

~ Peter


[LM-32] Code generation for address loading

2014-02-18 Thread FX MOREL
Hi everyone,

I am developing on a custom design using the LatticeMico32
architecture and I use gcc 4.5.1 to compile C code for this arch.

In this architecture, loading a 32-bit address always takes two
assembly instructions, because immediates are only 16 bits wide:
mvhi r1, 0x
ori r1, r1, 0x
...
lw r2, r1

In my situation, nearly all the symbols are located in the same 64kB
region and their addresses share the same hi-part, so I am trying to
minimize the overhead of always using two instructions when only one
is needed.

In the C code, grouping variables into a structure generates code where
memory is accessed through an offset from the structure's base address,
which is only loaded once per function, hence a gain in code size.
But the extent of this technique is limited by C-code readability
constraints, and there is still at least one penalty per function to
load the first address.
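
A minimal sketch of what I mean by grouping (the names are invented,
just for illustration):

  struct fast_vars {
      int counter;
      int status;
  };
  struct fast_vars fv;     /* one symbol: its hi-part is loaded once   */

  void tick (void)
  {
      fv.counter++;        /* addressed as base of fv + small offset   */
      fv.status = 1;       /* reuses the base already held in a reg    */
  }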

I was imagining something with a fixed register (say r23, reserved with
-ffixed-r23 in the compilation flags) that would be loaded with the
common hi-part at reset (in the same way that r0 is hard-wired to 0),
so that each address load would then consist of only one instruction:

  ori r1, r23, 0x

I have tried to play around with the machine description file of the
lm32 architecture to achieve my goal, but I need some kind of constraint
or condition to check where my symbol is located before using this
shortened form of address loading.  (Even though nearly all of the
symbols are located within the same 64kB region, some are still located
in other memory regions and I don't want to touch them!)

Because symbol placement is done at link time, I have little chance of
knowing the final symbol address at code-generation time, but is there
some way I could make this work?
If I assign the symbol to a dedicated section (with the
__attribute__((section(...))) directive), is there a way to know its
section during code generation?
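
(Concretely, I mean something like the following - the section name is
just an example:

  int near_var __attribute__((section(".near_data")));

The question is whether the backend can see, at code-generation time,
that near_var lives in that section.)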

I understand that I am asking for very 'dangerous' advice, but again,
this will only be a custom optimization for a custom design.

Thank you.

F-X Morel


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
> 
> For example, suppose we have, in one compilation unit:
> 
> void f(int ra, int*rb) {
>   if (ra==42)
>     *rb=42;
>   else
>     *rb=42;
> }

Hello, Peter!

Nice example!

The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
says the following about the control dependency in the above construct:



q = ACCESS_ONCE(a);
if (q) {
        barrier();
        ACCESS_ONCE(b) = p;
        do_something();
} else {
        barrier();
        ACCESS_ONCE(b) = p;
        do_something_else();
}

The initial ACCESS_ONCE() is required to prevent the compiler from
proving the value of 'a', and the pair of barrier() invocations are
required to prevent the compiler from pulling the two identical stores
to 'b' out from the legs of the "if" statement.
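
For context, ACCESS_ONCE() and barrier() are (roughly) the classic Linux
definitions:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))       /* force exactly one access  */
#define barrier() __asm__ __volatile__("" : : : "memory")  /* compiler-only barrier     */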



So yes, current compilers need significant help if it is necessary to
maintain dependencies in that sort of code.

Similar examples came up in the data-dependency discussions in the
standards committee, which led to the [[carries_dependency]] attribute for
C11 and C++11.  Of course, current compilers don't have this attribute,
and the current Linux kernel code doesn't have any other marking for
data dependencies passing across function boundaries.  (Maybe some time
as an assist for detecting pointer leaks out of RCU read-side critical
sections, but efforts along those lines are a bit stalled at the moment.)
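
For anyone who has not seen it, [[carries_dependency]] is applied to
parameters and return values, along the lines of this sketch (the
function names are invented):

int *chase(int *p [[carries_dependency]]);   /* dependency carried into the callee    */
[[carries_dependency]] int *next_node();     /* dependency carried out via the return */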

More on data dependencies below...

> and in another compilation unit the bodies of two threads:
> 
> // Thread 0
> r1 = x;
> f(r1,&r2);
> y = r2;
> 
> // Thread 1
> r3 = y;
> f(r3,&r4);
> x = r4;
> 
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> 
> (Of course, this is an artificial example, to make the point below as
> simply as possible - in real code the branches of the conditional
> might not be syntactically identical, just equivalent after macro
> expansion and other optimisation.)
> 
> In the source program there's a dependency from the read of x to the
> write of y in Thread 0, and from the read of y to the write of x on
> Thread 1.  Dependency-respecting compilation would preserve those and
> the ARM and POWER architectures both respect them, so the reads of x
> and y could not give 42.
> 
> But a compiler might well optimise the (non-atomic) body of f() to
> just *rb=42, making the threads effectively
> 
> // Thread 0
> r1 = x;
> y = 42;
> 
> // Thread 1
> r3 = y;
> x = 42;
> 
> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> permit those two reads to see 42. That is moreover actually observable
> on current ARM hardware.

I do agree that this could happen on current compilers and hardware.

Agreed, but as Peter Zijlstra noted in this thread, this optimization
is to a control dependency, not a data dependency.

> So as far as we can see, either:
> 
> 1) if you can accept the latter behaviour (if the Linux codebase does
>not rely on its absence), the language definition should permit it,
>and current compiler optimisations can be used,
> 
> or
> 
> 2) otherwise, the language definition should prohibit it but the
>compiler would have to preserve dependencies even in compilation
>units that have no mention of atomics.  It's unclear what the
>(runtime and compiler development) cost of that would be in
>practice - perhaps Torvald could comment?

For current compilers, we have to rely on coding conventions within
the Linux kernel in combination with non-standard extensions to gcc
and specified compiler flags to disable undesirable behavior.  I have a
start on specifying this in a document I am preparing for the standards
committee, a very early draft of which may be found here:

http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf

Section 3 shows the results of a manual scan through the Linux kernel's
dependency chains, and Section 4.1 lists a probably incomplete (and no
doubt erroneous) list of coding standards required to make dependency
chains work on current compilers.  Any comments and suggestions are more
than welcome.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Mark Batty
Hi Paul,

Thanks for the document. I'm looking forward to reading the bits about
dependency chains in Linux.

> One point of confusion for me...  Example 4 says "language must allow".
> Shouldn't that be "language is permitted to allow"?

When we say "allow", we mean that the optimised execution should be
allowed by the specification, and, implicitly, the unoptimised
execution should remain allowed too. We want to be concrete about what
the language specification allows, and that's why we say "must". It is
not to disallow the unoptimised execution.

> Seems like an
> implementation is always within its rights to avoid an optimization if
> its implementation prevents it from safely detecting the oppportunity
> for that optimization.

That's right.

- Mark


> Or am I missing something here?
>
> Thanx, Paul
>


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> And exactly because I know enough, I would *really* like atomics to be
> well-defined, and have very clear - and *local* - rules about how they
> can be combined and optimized.

"Local"?

> None of this "if you can prove that the read has value X" stuff. And
> things like value speculation should simply not be allowed, because
> that actually breaks the dependency chain that the CPU architects give
> guarantees for. Instead, make the rules be very clear, and very
> simple, like my suggestion. You can never remove a load because you
> can "prove" it has some value, but you can combine two consecutive
> atomic accesses/

Sorry, but the rules *are* very clear.  I *really* suggest looking at
the formalization by Batty et al.  And in these rules, proving that a
read will always return value X has a well-defined meaning, and you can
use it.  That simply follows from how the model is built.

What you seem to want just isn't covered by the model as it is today --
you can't infer from that that the model itself would be wrong.  The
dependency chains aren't modeled in the way you envision it (except in
what consume_mo tries, but that seems to be hard to implement); they are
there on the level of the program logic as modeled by the abstract
machine and the respective execution/output rules, but they are not
built to represent those specific ordering guarantees the HW gives you.

I would also be cautious claiming that the rules you suggested would be
very clear and very simple.  I haven't seen a memory model spec from you
that would be viable as the standard model for C/C++, nor have I seen
proof that this would actually be easier to understand for programmers
in general.

> For example, CPU people actually do tend to give guarantees for
> certain things, like stores that are causally related being visible in
> a particular order.

Yes, but that's not part of the model so far.  If you want to exploit
this, please make a suggestion for how to extend the model to cover
this.

You certainly expect compilers to actually optimize code, which usually
removes or reorders stuff.  Now, we could say that we just don't want
that for atomic accesses, but even this isn't as clear-cut: How do we
actually detect where non-atomic code is intended by the programmer to
express (1) a dependency that needs to be transformed into a dependency
in the generated code from (2) a dependency that is just program logic?

IOW, consider the consume_mo example "*(p + flag - flag)", where flag is
coming from a consume_mo load: Can you give a complete set of rules (for
a full program) for when the compiler is allowed to optimize out
flag-flag and when not?  (And it should be practically implementable in
a compiler, and not prevent optimizations where people expect them.)
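
Spelled out, I mean code along these lines (a sketch only; the variable
names are invented):

  extern _Atomic int gate;     /* invented for this sketch */
  extern int *p;
  int flag = atomic_load_explicit(&gate, memory_order_consume);
  int v = *(p + flag - flag);  /* flag - flag is always 0, but it
                                  syntactically carries the dependency
                                  from the consume load -- may the
                                  compiler fold the access to *p?  */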

> If the compiler starts doing value speculation on
> atomic accesses, you are quite possibly breaking things like that.
> It's just not a good idea. Don't do it. Write the standard so that it
> clearly is disallowed.

You never want value speculation for anything possibly originating from
an atomic load?  Or what are the concrete rules you have in mind?

> Because you may think that a C standard is machine-independent, but
> that isn't really the case. The people who write code still write code
> for a particular machine. Our code works (in the general case) on
> different byte orderings, different register sizes, different memory
> ordering models. But in each *instance* we still end up actually
> coding for each machine.

That's how *you* do it.  I'm sure you are aware that for lots of
programmers, having a machine-dependent standard is not helpful at all.
As the standard is written, code is portable.  It can certainly be
beneficial for programmers to optimize for different machines, but the
core of the memory model will always remain portable, because that's
what arguably most programmers need.  That doesn't prevent
machine-dependent extensions, but those then should be well integrated
with the rest of the model.

> So the rules for atomics should be simple and *specific* enough that
> when you write code for a particular architecture, you can take the
> architecture memory ordering *and* the C atomics orderings into
> account, and do the right thing for that architecture.

That would be an extension, but not viable as a *general* requirement as
far as the standard is concerned.

> And that very much means that doing things like value speculation MUST
> NOT HAPPEN. See? Even if you can prove that your code is "equivalent",
> it isn't.

I'm kind of puzzled why you keep stating generalizations of assertions
that clearly aren't well-founded (ie, which might be true for the kernel
or your wishes for how the world should be, but aren't true in general
or for the objectives of others).

It's not helpful to just pretend other people's opinions or worlds
didn't exist in discussions such as this one.  It's clear that the
standard

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
Hi Paul,

On 18 February 2014 14:56, Paul E. McKenney  wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.  In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>> void f(int ra, int*rb) {
>>   if (ra==42)
>>     *rb=42;
>>   else
>>     *rb=42;
>> }
>
> Hello, Peter!
>
> Nice example!
>
> The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> says the following about the control dependency in the above construct:
>
> 
>
> q = ACCESS_ONCE(a);
> if (q) {
>         barrier();
>         ACCESS_ONCE(b) = p;
>         do_something();
> } else {
>         barrier();
>         ACCESS_ONCE(b) = p;
>         do_something_else();
> }
>
> The initial ACCESS_ONCE() is required to prevent the compiler from
> proving the value of 'a', and the pair of barrier() invocations are
> required to prevent the compiler from pulling the two identical stores
> to 'b' out from the legs of the "if" statement.

thanks

> 
>
> So yes, current compilers need significant help if it is necessary to
> maintain dependencies in that sort of code.
>
> Similar examples came up in the data-dependency discussions in the
> standards committee, which led to the [[carries_dependency]] attribute for
> C11 and C++11.  Of course, current compilers don't have this attribute,
> and the current Linux kernel code doesn't have any other marking for
> data dependencies passing across function boundaries.  (Maybe some time
> as an assist for detecting pointer leaks out of RCU read-side critical
> sections, but efforts along those lines are a bit stalled at the moment.)
>
> More on data dependencies below...
>
>> and in another compilation unit the bodies of two threads:
>>
>> // Thread 0
>> r1 = x;
>> f(r1,&r2);
>> y = r2;
>>
>> // Thread 1
>> r3 = y;
>> f(r3,&r4);
>> x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
>>
>> (Of course, this is an artificial example, to make the point below as
>> simply as possible - in real code the branches of the conditional
>> might not be syntactically identical, just equivalent after macro
>> expansion and other optimisation.)
>>
>> In the source program there's a dependency from the read of x to the
>> write of y in Thread 0, and from the read of y to the write of x on
>> Thread 1.  Dependency-respecting compilation would preserve those and
>> the ARM and POWER architectures both respect them, so the reads of x
>> and y could not give 42.
>>
>> But a compiler might well optimise the (non-atomic) body of f() to
>> just *rb=42, making the threads effectively
>>
>> // Thread 0
>> r1 = x;
>> y = 42;
>>
>> // Thread 1
>> r3 = y;
>> x = 42;
>>
>> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
>> permit those two reads to see 42. That is moreover actually observable
>> on current ARM hardware.
>
> I do agree that this could happen on current compilers and hardware.
>
> Agreed, but as Peter Zijlstra noted in this thread, this optimization
> is to a control dependency, not a data dependency.

Indeed.  In principle (again as Hans has observed) a compiler might
well convert between the two, e.g. if operating on single-bit values,
or where value-range analysis has shown that a variable can only
contain one of a small set of values.   I don't know whether that
happens in practice?  Then there are also cases where a compiler is
very likely to remove data/address dependencies, eg if some constant C
is #define'd to be 0 then an array access indexed by x * C will have
the dependency on x removed.  The runtime and compiler development
costs of preventing that are also unclear to me.

Given that, whether it's reasonable to treat control and data
dependencies differently seems to be an open question.
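
Concretely, the #define case I have in mind is something like this
sketch (names invented):

  #define STRIDE 0                /* happens to be 0 in this configuration */
  extern _Atomic int idx;         /* invented names, for the sketch only   */
  extern int table[];
  int r = atomic_load_explicit(&idx, memory_order_consume);
  int v = table[r * STRIDE];      /* folds to table[0]: the address
                                     dependency on r disappears entirely   */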

>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>not rely on its absence), the language definition should permit it,
>>and current compiler optimisations can be used,
>>
>> or
>>
>> 2) otherwise, the language definition should prohibit it but the
>>compiler would have to preserve depen

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  wrote:
> >
> > There's an underlying problem here that's independent from the actual
> > instance that you're worried about here: "no sense" is a ultimately a
> > matter of taste/objectives/priorities as long as the respective
> > specification is logically consistent.
> 
> Yes. But I don't think it's "independent".
> 
> Exactly *because* some people will read standards without applying
> "does the resulting code generation actually make sense for the
> programmer that wrote the code", the standard has to be pretty clear.
> 
> The standard often *isn't* pretty clear. It wasn't clear enough when
> it came to "volatile", and yet that was a *much* simpler concept than
> atomic accesses and memory ordering.
> 
> And most of the time it's not a big deal. But because the C standard
> generally tries to be very portable, and cover different machines,
> there tends to be a mindset that anything inherently unportable is
> "undefined" or "implementation defined", and then the compiler writer
> is basically given free reign to do anything they want (with
> "implementation defined" at least requiring that it is reliably the
> same thing).

Yes, that's how it works in general.  And this makes sense, because all
optimizations rely on that.  Second, you can't keep something consistent
(eg, between compilers) if it isn't specified.  So if we want stricter
rules, those need to be specified somewhere.

> And when it comes to memory ordering, *everything* is basically
> non-portable, because different CPU's very much have different rules.

Well, the current set of memory orders (and the memory model as a whole)
is portable, even though it might not allow one to exploit all hardware
properties, and thus might perform sub-optimally in some cases.

> I worry that that means that the standard then takes the stance that
> "well, compiler re-ordering is no worse than CPU re-ordering, so we
> let the compiler do anything". And then we have to either add
> "volatile" to make sure the compiler doesn't do that, or use an overly
> strict memory model at the compiler level that makes it all pointless.

Using "volatile" is not a good option, I think, because synchronization
between threads should be orthogonal to observable output of the
abstract machine.

The current memory model might not allow one to exploit all hardware
properties, I agree.

But then why don't we work on how to extend it to do so?  We need to
specify the behavior we want anyway, and this can't be independent of
the language semantics, so it has to be conceptually integrated with the
standard anyway.

> So I really really hope that the standard doesn't give compiler
> writers free hands to do anything that they can prove is "equivalent"
> in the virtual C machine model.

It does, but it also doesn't mean this can't be extended.  So let's
focus on whether we can find an extension.

> That's not how you get reliable
> results.

In this general form, that's obviously a false claim.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel  wrote:
> > >
> > > Which example do you have in mind here?  Haven't we resolved all the
> > > debated examples, or did I miss any?
> > 
> > Well, Paul seems to still think that the standard possibly allows
> > speculative writes or possibly value speculation in ways that break
> > the hardware-guaranteed orderings.
> 
> It is not that I know of any specific problems, but rather that I
> know I haven't looked under all the rocks.  Plus my impression from
> my few years on the committee is that the standard will be pushed to
> the limit when it comes time to add optimizations.
> 
> One example that I learned about last week uses the branch-prediction
> hardware to validate value speculation.  And no, I am not at all a fan
> of value speculation, in case you were curious.  However, it is still
> an educational example.
> 
> This is where you start:
> 
>   p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
>   do_something(p->a, p->b, p->c);
>   p->d = 1;

I assume that's the source code.

> Then you leverage branch-prediction hardware as follows:
> 
>   p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
>   if (p == GUESS) {
>           do_something(GUESS->a, GUESS->b, GUESS->c);
>           GUESS->d = 1;
>   } else {
>           do_something(p->a, p->b, p->c);
>           p->d = 1;
>   }

I assume that this is a potential transformation by a compiler.

> The CPU's branch-prediction hardware squashes speculation in the case where
> the guess was wrong, and this prevents the speculative store to ->d from
> ever being visible.  However, the then-clause breaks dependencies, which
> means that the loads -could- be speculated, so that do_something() gets
> passed pre-initialization values.
> 
> Now, I hope and expect that the wording in the standard about dependency
> ordering prohibits this sort of thing.  But I do not yet know for certain.

The transformation would be incorrect.  p->a in the source code carries
a dependency, and as you say, the transformed code wouldn't have that
dependency any more.  So the transformed code would lose ordering
constraints that it has in the virtual machine, so in the absence of
other proofs of correctness based on properties not shown in the
example, the transformed code would not result in the same behavior as
allowed by the abstract machine.

If the transformation were actually made by a programmer, then this
wouldn't do the same as the first example because mo_consume doesn't
work through the if statement.

Are there other specific concerns that you have regarding this example?



Deadline for DWARF Version 5 comments -- March 31, 2014

2014-02-18 Thread Michael Eager

The DWARF Debugging Information Format Committee (http://dwarfstd.org)
is nearing completion on the DWARF Version 5 standard.  We anticipate
a public comment draft to be released mid-2014.  As in the past, we
have attempted where possible to make DWARF Version 5 backward compatible
with previous versions.

We will be accepting public comments, suggestions, or proposals
for changes to the DWARF Debugging Standard until March 31, 2014.
Please submit your comments using our Public Comment web page:
http://dwarfstd.org/Comment.php.

The current Version 4 standard, as well as previous versions of
the DWARF Standard can be found here: http://dwarfstd.org/Download.php

Please feel free to forward this email to anyone or any list
where it seems appropriate.

--
Michael Eager            Chair, DWARF Standards Committee
ea...@eagercon.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
On 18 February 2014 12:53, Peter Zijlstra  wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.
>
> The example below only deals with control dependencies; so I'll limit
> myself to that.

Data/address dependencies are, if anything, even less clear - see a
paragraph on that in my reply to Paul.

>> In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>> void f(int ra, int*rb) {
>>   if (ra==42)
>>     *rb=42;
>>   else
>>     *rb=42;
>> }
>>
>> and in another compilation unit the bodies of two threads:
>>
>> // Thread 0
>> r1 = x;
>> f(r1,&r2);
>> y = r2;
>>
>> // Thread 1
>> r3 = y;
>> f(r3,&r4);
>> x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
>
> So I'm intuitively ok with this, however I would expect something like:
>
>   void f(_Atomic int ra, _Atomic int *rb);
>
> To preserve dependencies and not make the conditional go away, simply
> because in that case the:
>
>   if (ra == 42)
>
> the 'ra' usage can be seen as an atomic load.
>
>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>not rely on its absence), the language definition should permit it,
>>and current compiler optimisations can be used,
>
> Currently there's exactly 1 site in the Linux kernel that relies on
> control dependencies as far as I know -- the one I put in.

ok, thanks

> And its
> limited to a single function, so no cross translation unit funnies
> there.

One can imagine a language definition that treats code that lies
entirely within a single compilation unit specially (e.g. if it's
somehow annotated as relying on dependencies).  But I imagine it would
be pretty unappealing to work with.

> Of course, nobody is going to tell me when or where they'll put in the
> next one; since its now documented as accepted practise.

Is that not fragile?

> However, PaulMck and our RCU usage very much do cross all sorts of TU
> boundaries; but those are data dependencies.

yes - though again see my reply to Paul's note

thanks,
Peter

> ~ Peter


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
>  wrote:
> >
> > As far as I can tell, the intent is that you can't do value
> > speculation (except perhaps for the "relaxed", which quite frankly
> > sounds largely useless).
> 
> Hmm. The language I see for "consume" is not obvious:
> 
>   "Consume operation: no reads in the current thread dependent on the
> value currently loaded can be reordered before this load"

I can't remember seeing that language in the standard (ie, C or C++).
Where is this from?

> and it could make a compiler writer say that value speculation is
> still valid, if you do it like this (with "ptr" being the atomic
> variable):
> 
>   value = ptr->val;

I assume the load from ptr has mo_consume ordering?

> into
> 
>   tmp = ptr;
>   value = speculated.value;
>   if (unlikely(tmp != &speculated))
>     value = tmp->value;
> 
> which is still bogus. The load of "ptr" does happen before the load of
> "value = speculated->value" in the instruction stream, but it would
> still result in the CPU possibly moving the value read before the
> pointer read at least on ARM and power.

And surprise, in the C/C++ model the load from ptr is sequenced-before
the load from speculated, but there's no ordering constraint on the
reads-from relation for the value load if you use mo_consume on the ptr
load.  Thus, the transformed code has fewer ordering constraints than the
original code, and we arrive at the same outcome.

> So if you're a compiler person, you think you followed the letter of
> the spec - as far as *you* were concerned, no load dependent on the
> value of the atomic load moved to before the atomic load.

No.  Because the wobbly sentence you cited(?) above is not what the
standard says.

Would you please stop making claims about what compiler writers would do
or not if you seemingly aren't even familiar with the model that
compiler writers would use to reason about transformations?  Seriously?

> You go home,
> happy, knowing you've done your job. Never mind that you generated
> code that doesn't actually work.
> 
> I dread having to explain to the compiler person that he may be right
> in some theoretical virtual machine, but the code is subtly broken and
> nobody will ever understand why (and likely not be able to create a
> test-case showing the breakage).
> 
> But maybe the full standard makes it clear that "reordered before this
> load" actually means on the real hardware, not just in the generated
> instruction stream.

Do you think everyone else is stupid?  If there's an ordering constraint
in the virtual machine, it better be present when executing in the real
machine unless it provably cannot result in different output as
specified by the language's semantics.

> Reading it with understanding of the *intent* and
> understanding all the different memory models that requirement should
> be obvious (on alpha, you need an "rmb" instruction after the load),
> but ...

The standard is clear on what's required.  I strongly suggest reading
the formalization of the memory model by Batty et al.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 14:32 -0800,
> >
> >> Stop claiming it "can return 1".. It *never* returns 1 unless you do
> >> the load and *verify* it, or unless the load itself can be made to go
> >> away. And with the code sequence given, that just doesn't happen. END
> >> OF STORY.
> >
> > void foo()
> > {
> >   atomic x = 1;
> >   if (atomic_load(&x, mo_relaxed) == 1)
> >     atomic_store(&y, 3, mo_relaxed);
> > }
> 
> This is the very example I gave, where the real issue is not that "you
> prove that load returns 1", you instead say "store followed by a load
> can be combined".
> 
> I (in another email I just wrote) tried to show why the "prove
> something is true" is a very dangerous model.  Seriously, it's pure
> crap. It's broken.

I don't see anything dangerous in the example above with the language
semantics as specified: It's a well-defined situation, given the rules
of the language.  I replied to the other email you wrote with my
viewpoint on why the above is useful, how it compares to what you seem
to want, and where I think we need to start to bridge the gap.
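
(For concreteness, the combining Linus describes would turn foo() above
into roughly the following - x is a local atomic just initialised to 1,
so the relaxed load is folded away and the store to y becomes
unconditional:

  void foo()
  {
    atomic_store(&y, 3, mo_relaxed);
  }
)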

> If the C standard defines atomics in terms of "provable equivalence",
> it's broken. Exactly because on a *virtual* machine you can prove
> things that are not actually true in a *real* machine.

For the control dependencies you have in mind, it's actually the other
way around.  You expect the real machine's properties in a program whose
semantics only give you the virtual machine's properties.  Anything you
prove on the virtual machine will be true on the real machine (in a
correct implementation) -- but you can't expect to have real-machine
properties in a language that's based on the virtual machine.

> I have the
> example of value speculation changing the memory ordering model of the
> actual machine.

This example is not true for the language as specified.  It is true for
a modified language that you have in mind, but for this one I've just
seen pretty rough rules so far.  Please see my other reply.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 03:33:35PM +, Peter Sewell wrote:
> Hi Paul,
> 
> On 18 February 2014 14:56, Paul E. McKenney  
> wrote:
> > On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >> void f(int ra, int*rb) {
> >>   if (ra==42)
> >>     *rb=42;
> >>   else
> >>     *rb=42;
> >> }
> >
> > Hello, Peter!
> >
> > Nice example!
> >
> > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> > says the following about the control dependency in the above construct:
> >
> > 
> >
> > q = ACCESS_ONCE(a);
> > if (q) {
> >         barrier();
> >         ACCESS_ONCE(b) = p;
> >         do_something();
> > } else {
> >         barrier();
> >         ACCESS_ONCE(b) = p;
> >         do_something_else();
> > }
> >
> > The initial ACCESS_ONCE() is required to prevent the compiler from
> > proving the value of 'a', and the pair of barrier() invocations are
> > required to prevent the compiler from pulling the two identical stores
> > to 'b' out from the legs of the "if" statement.
> 
> thanks
> 
> > 
> >
> > So yes, current compilers need significant help if it is necessary to
> > maintain dependencies in that sort of code.
> >
> > Similar examples came up in the data-dependency discussions in the
> > standards committee, which led to the [[carries_dependency]] attribute for
> > C11 and C++11.  Of course, current compilers don't have this attribute,
> > and the current Linux kernel code doesn't have any other marking for
> > data dependencies passing across function boundaries.  (Maybe some time
> > as an assist for detecting pointer leaks out of RCU read-side critical
> > sections, but efforts along those lines are a bit stalled at the moment.)
> >
> > More on data dependencies below...
> >
> >> and in another compilation unit the bodies of two threads:
> >>
> >> // Thread 0
> >> r1 = x;
> >> f(r1,&r2);
> >> y = r2;
> >>
> >> // Thread 1
> >> r3 = y;
> >> f(r3,&r4);
> >> x = r4;
> >>
> >> where accesses to x and y are annotated C11 atomic
> >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> >>
> >> (Of course, this is an artificial example, to make the point below as
> >> simply as possible - in real code the branches of the conditional
> >> might not be syntactically identical, just equivalent after macro
> >> expansion and other optimisation.)
> >>
> >> In the source program there's a dependency from the read of x to the
> >> write of y in Thread 0, and from the read of y to the write of x on
> >> Thread 1.  Dependency-respecting compilation would preserve those and
> >> the ARM and POWER architectures both respect them, so the reads of x
> >> and y could not give 42.
> >>
> >> But a compiler might well optimise the (non-atomic) body of f() to
> >> just *rb=42, making the threads effectively
> >>
> >> // Thread 0
> >> r1 = x;
> >> y = 42;
> >>
> >> // Thread 1
> >> r3 = y;
> >> x = 42;
> >>
> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> >> permit those two reads to see 42. That is moreover actually observable
> >> on current ARM hardware.
> >
> > I do agree that this could happen on current compilers and hardware.
> >
> > Agreed, but as Peter Zijlstra noted in this thread, this optimization
> > is to a control dependency, not a data dependency.
> 
> Indeed.  In principle (again as Hans has observed) a compiler might
> well convert between the two, e.g. if operating on single-bit values,
> or where value-range analysis has shown that a variable can only
> contain one of a small set of values.   I don't know whether that
> happens in practice?  Then there are also cases where a compiler is
> very likely to remove data/address dependencies, eg if some constant C
> is #define'd to be 0 then an array access indexed by x * C will have
> the dependency on x removed.  The runtime and compiler development
> costs of preventing that are also unclear to me.
> 
> Given that, whether it's reasonable to treat control and data
> dependencies differently seems to be an open question.

Here is another (admittedly fanciful and probably bugg

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
>> And exactly because I know enough, I would *really* like atomics to be
>> well-defined, and have very clear - and *local* - rules about how they
>> can be combined and optimized.
>
> "Local"?

Yes.

So I think that one of the big advantages of atomics over volatile is
that they *can* be optimized, and as such I'm not at all against
trying to generate much better code than for volatile accesses.

But at the same time, that can go too far. For example, one of the
things we'd want to use atomics for is page table accesses, where it
is very important that we don't generate multiple accesses to the
values, because parts of the values can be changed *by hardware* (ie the
accessed and dirty bits).

So imagine that you have some clever global optimizer that sees that
the program never ever actually sets the dirty bit at all in any
thread, and then uses that kind of non-local knowledge to make
optimization decisions. THAT WOULD BE BAD.

Do you see what I'm aiming for? Any optimization that tries to prove
anything from more than local state is by definition broken, because
it assumes that everything is described by the program.

But *local* optimizations are fine, as long as they follow the obvious
rule of not actually making changes that are semantically visible.

(In practice, I'd be impressed as hell for that particular example,
and we actually do end up setting the dirty bit by hand in some
situations, so the example is slightly made up, but there are other
cases that might be more realistic in that sometimes we do things that
are hidden from the compiler - in assembly etc - and the compiler
might *think* it knows what is going on, but it doesn't actually see
all accesses).

> Sorry, but the rules *are* very clear.  I *really* suggest to look at
> the formalization by Batty et al.  And in these rules, proving that a
> read will always return value X has a well-defined meaning, and you can
> use it.  That simply follows from how the model is built.

What "model"?

That's the thing. I have tried to figure out whether the model is some
abstract C model, or a model based on the actual hardware that the
compiler is compiling for, and whether the model is one that assumes
the compiler has complete knowledge of the system (see the example
above).

And it seems to be a mixture of it all. The definitions of the various
orderings obviously very much imply that the compiler has to insert
the proper barriers or sequence markers for that architecture, but
then there is no apparent way to depend on any *other* architecture
ordering guarantees. Like our knowledge that all architectures (with
the exception of alpha, which really doesn't end up being something we
worry about any more) end up having the load dependency ordering
guarantee.

> What you seem to want just isn't covered by the model as it is today --
> you can't infer from that that the model itself would be wrong.  The
> dependency chains aren't modeled in the way you envision it (except in
> what consume_mo tries, but that seems to be hard to implement); they are
> there on the level of the program logic as modeled by the abstract
> machine and the respective execution/output rules, but they are not
> built to represent those specific ordering guarantees the HW gives you.

So this is a problem. It basically means that we cannot do the kinds
of things we do now, which very much involve knowing what the memory
ordering of a particular machine is, and combining that knowledge with
our knowledge of code generation.

Now, *most* of what we do is protected by locking and is all fine. But
we do have a few rather subtle places in RCU and in the scheduler
where we depend on the actual dependency chain.

In *practice*, I seriously doubt any reasonable compiler can actually
make a mess of it. The kinds of optimizations that would actually
defeat the dependency chain are simply not realistic. And I suspect
that will end up being what we rely on - there being no actual sane
sequence that a compiler would ever do, even if we wouldn't have
guarantees for some of it.

And I suspect  I can live with that. We _have_ lived with that for the
longest time, after all. We very much do things that aren't covered by
any existing C standard, and just basically use tricks to coax the
compiler into never generating code that doesn't work (with our inline
asm barriers etc being a prime example).

> I would also be cautious claiming that the rules you suggested would be
> very clear and very simple.  I haven't seen a memory model spec from you
> that would be viable as the standard model for C/C++, nor have I seen
> proof that this would actually be easier to understand for programmers
> in general.

So personally, if I were to write the spec, I would have taken a
completely different approach from what the standard obviously does.

I'd have taken the approach of specifying the requir

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote:
> On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel  
> > > wrote:
> > > >
> > > > Which example do you have in mind here?  Haven't we resolved all the
> > > > debated examples, or did I miss any?
> > > 
> > > Well, Paul seems to still think that the standard possibly allows
> > > speculative writes or possibly value speculation in ways that break
> > > the hardware-guaranteed orderings.
> > 
> > It is not that I know of any specific problems, but rather that I
> > know I haven't looked under all the rocks.  Plus my impression from
> > my few years on the committee is that the standard will be pushed to
> > the limit when it comes time to add optimizations.
> > 
> > One example that I learned about last week uses the branch-prediction
> > hardware to validate value speculation.  And no, I am not at all a fan
> > of value speculation, in case you were curious.  However, it is still
> > an educational example.
> > 
> > This is where you start:
> > 
> > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> > do_something(p->a, p->b, p->c);
> > p->d = 1;
> 
> I assume that's the source code.

Yep!

> > Then you leverage branch-prediction hardware as follows:
> > 
> > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
> > if (p == GUESS) {
> >         do_something(GUESS->a, GUESS->b, GUESS->c);
> >         GUESS->d = 1;
> > } else {
> >         do_something(p->a, p->b, p->c);
> >         p->d = 1;
> > }
> 
> I assume that this is a potential transformation by a compiler.

Again, yep!

> > The CPU's branch-prediction hardware squashes speculation in the case where
> > the guess was wrong, and this prevents the speculative store to ->d from
> > ever being visible.  However, the then-clause breaks dependencies, which
> > means that the loads -could- be speculated, so that do_something() gets
> > passed pre-initialization values.
> > 
> > Now, I hope and expect that the wording in the standard about dependency
> > ordering prohibits this sort of thing.  But I do not yet know for certain.
> 
> The transformation would be incorrect.  p->a in the source code carries
> a dependency, and as you say, the transformed code wouldn't have that
> dependency any more.  So the transformed code would lose ordering
> constraints that it has in the virtual machine, so in the absence of
> other proofs of correctness based on properties not shown in the
> example, the transformed code would not result in the same behavior as
> allowed by the abstract machine.

Glad that you agree!  ;-)

> If the transformation would actually be by a programmer, then this
> wouldn't do the same as the first example because mo_consume doesn't
> work through the if statement.

Agreed.

> Are there other specified concerns that you have regarding this example?

Nope.  Just generalized paranoia.  (But just because I am paranoid
doesn't mean that there isn't a bug lurking somewhere in the standard,
the compiler, the kernel, or my own head!)

I will likely have more once I start mapping Linux kernel atomics to the
C11 standard.  One more paper past N3934 comes first, though.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
> On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  wrote:
> > >
> > > There's an underlying problem here that's independent from the actual
> > > instance that you're worried about here: "no sense" is ultimately a
> > > matter of taste/objectives/priorities as long as the respective
> > > specification is logically consistent.
> > 
> > Yes. But I don't think it's "independent".
> > 
> > Exactly *because* some people will read standards without applying
> > "does the resulting code generation actually make sense for the
> > programmer that wrote the code", the standard has to be pretty clear.
> > 
> > The standard often *isn't* pretty clear. It wasn't clear enough when
> > it came to "volatile", and yet that was a *much* simpler concept than
> > atomic accesses and memory ordering.
> > 
> > And most of the time it's not a big deal. But because the C standard
> > generally tries to be very portable, and cover different machines,
> > there tends to be a mindset that anything inherently unportable is
> > "undefined" or "implementation defined", and then the compiler writer
> > is basically given free rein to do anything they want (with
> > "implementation defined" at least requiring that it is reliably the
> > same thing).
> 
> Yes, that's how it works in general.  And this makes sense, because all
> optimizations rely on that.  Second, you can't keep something consistent
> (eg, between compilers) if it isn't specified.  So if we want stricter
> rules, those need to be specified somewhere.
> 
> > And when it comes to memory ordering, *everything* is basically
> > non-portable, because different CPU's very much have different rules.
> 
> Well, the current set of memory orders (and the memory model as a whole)
> is portable, even though it might not allow to exploit all hardware
> properties, and thus might perform sub-optimally in some cases.
> 
> > I worry that that means that the standard then takes the stance that
> > "well, compiler re-ordering is no worse than CPU re-ordering, so we
> > let the compiler do anything". And then we have to either add
> > "volatile" to make sure the compiler doesn't do that, or use an overly
> > strict memory model at the compiler level that makes it all pointless.
> 
> Using "volatile" is not a good option, I think, because synchronization
> between threads should be orthogonal to observable output of the
> abstract machine.

Are you thinking of "volatile" -instead- of atomics?  My belief is that
given the current standard there will be times that we need to use
"volatile" -in- -addition- to atomics.

> The current memory model might not allow to exploit all hardware
> properties, I agree.
> 
> But then why don't we work on how to extend it to do so?  We need to
> specify the behavior we want anyway, and this can't be independent of
> the language semantics, so it has to be conceptually integrated with the
> standard anyway.
> 
> > So I really really hope that the standard doesn't give compiler
> > writers free hands to do anything that they can prove is "equivalent"
> > in the virtual C machine model.
> 
> It does, but it also doesn't mean this can't be extended.  So let's
> focus on whether we can find an extension.
> 
> > That's not how you get reliable
> > results.
> 
> In this general form, that's obviously a false claim.

These two sentences starkly illustrate the difference in perspective
between you two.  You are talking past each other.  Not sure how to fix
this at the moment, but what else is new?  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> And exactly because I know enough, I would *really* like atomics to be
> >> well-defined, and have very clear - and *local* - rules about how they
> >> can be combined and optimized.
> >
> > "Local"?
> 
> Yes.
> 
> So I think that one of the big advantages of atomics over volatile is
> that they *can* be optimized, and as such I'm not at all against
> trying to generate much better code than for volatile accesses.
> 
> But at the same time, that can go too far. For example, one of the
> things we'd want to use atomics for is page table accesses, where it
> is very important that we don't generate multiple accesses to the
> > values, because parts of the values can be changed *by hardware* (ie
> accessed and dirty bits).
> 
> So imagine that you have some clever global optimizer that sees that
> the program never ever actually sets the dirty bit at all in any
> thread, and then uses that kind of non-local knowledge to make
> optimization decisions. THAT WOULD BE BAD.

Might as well list other reasons why value proofs via whole-program
analysis are unreliable for the Linux kernel:

1.  As Linus said, changes from hardware.

2.  Assembly code that is not visible to the compiler.
Inline asms will -normally- let the compiler know what
memory they change, but some just use the "memory" tag.
Worse yet, I suspect that most compilers don't look all
that carefully at .S files.

Any number of other programs contain assembly files.

3.  Kernel modules that have not yet been written.  Now, the
compiler could refrain from trying to prove anything about
an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
is currently no way to communicate this information to the
compiler other than marking the variable "volatile".

Other programs have similar issues, e.g., via dlopen().

4.  Some drivers allow user-mode code to mmap() some of their
state.  Any changes undertaken by the user-mode code would
be invisible to the compiler.

5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/

And probably other stuff as well.
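
A tiny sketch of the hazard behind items 1 and 2, with made-up names: no C
code visible to the compiler ever sets PTE_DIRTY, but the MMU (or an assembly
file the compiler never parses) does.  If a whole-program optimizer "proved"
the bit is never set, it could delete the test and the writeback with it; per
item 3 above, marking the variable volatile is currently the only way to
forbid that proof.

    #define PTE_DIRTY 0x40UL                /* made-up bit layout */

    extern unsigned long pte_val;           /* also written by hardware / .S code */
    extern void writeback_page(void);

    void flush_if_dirty(void)
    {
            /* A whole-program "no one sets PTE_DIRTY" proof would wrongly
             * allow removing this test, because the store that sets the bit
             * is invisible to the compiler. */
            if (pte_val & PTE_DIRTY)
                    writeback_page();
    }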

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 03:16:33PM +, Mark Batty wrote:
> Hi Paul,
> 
> Thanks for the document. I'm looking forward to reading the bits about
> dependency chains in Linux.

And I am looking forward to your thoughts on those bits!

> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?
> 
> When we say "allow", we mean that the optimised execution should be
> allowed by the specification, and, implicitly, the unoptimised
> execution should remain allowed too. We want to be concrete about what
> the language specification allows, and that's why we say "must". It is
> not to disallow the unoptimised execution.

OK, got it!

Thanx, Paul

> > Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the oppportunity
> > for that optimization.
> 
> That's right.
> 
> - Mark
> 
> 
> > Or am I missing something here?
> >
> > Thanx, Paul
> >
> 



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell  wrote:
>
> For example, suppose we have, in one compilation unit:
>
> void f(int ra, int*rb) {
>   if (ra==42)
>     *rb=42;
>   else
>     *rb=42;
> }

So this is a great example, and in general I really like your page at:

> For more context, this example is taken from a summary of the thin-air
> problem by Mark Batty and myself,
> , and the problem with
> dependencies via other compilation units was AFAIK first pointed out
> by Hans Boehm.

and the reason I like your page is that it really talks about the
problem by pointing to the "unoptimized" code, and what hardware would
do.

As mentioned, I think that's actually the *correct* way to think about
the problem space, because it allows the programmer to take hardware
characteristics into account, without having to try to "describe" them
at a source level.

As to your example of

   if (ra)
           atomic_write(rb, A);
   else
           atomic_write(rb, B);

I really think that it is ok to combine that into

atomic_write(rb, ra ? A:B);

(by virtue of "exact same behavior on actual hardware"), and then the
only remaining question is whether the "ra?A:B" can be optimized to
remove the conditional if A==B as in your example where both are "42".
Agreed?

Now, I would argue that the "naive" translation of that is
unambiguous, and since "ra" is not volatile or magic in any way, then
"ra?42:42" can obviously be optimized into just 42 - by the exact same
rule that says "the compiler can do any transformation that is
equivalent in the hardware". The compiler can *locally* decide that
that is the right thing to do, and any programmer that complains about
that decision is just crazy.

So my "local machine behavior equivalency" rule means that that
function can be optimized into a single "store 42 atomically into rb".
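
Spelled out as a sketch (atomic_write() here is just shorthand for an atomic
store, defined locally for illustration; the debate above is about whether
folding the conditional may also drop a dependency carried by ra):

    #include <stdatomic.h>

    static void atomic_write(_Atomic int *p, int v)
    {
            atomic_store_explicit(p, v, memory_order_relaxed);
    }

    void f_original(int ra, _Atomic int *rb)
    {
            if (ra)
                    atomic_write(rb, 42);   /* A == 42 */
            else
                    atomic_write(rb, 42);   /* B == 42 */
    }

    void f_combined(int ra, _Atomic int *rb)
    {
            atomic_write(rb, ra ? 42 : 42); /* step 1: merge the two stores */
    }

    void f_folded(int ra, _Atomic int *rb)
    {
            (void)ra;
            atomic_write(rb, 42);           /* step 2: fold "ra ? 42 : 42" */
    }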

Now, if it's *not* compiled locally, and is instead implemented as a
macro (or inline function), there are obviously situations where "ra ?
A : B" ends up having to do other things. In particular, X may be
volatile or an atomic read that has ordering semantics, and then that
expression doesn't become just "42", but that's a separate issue. It's
not all that dissimilar to "function calls are sequence points",
though, and obviously if the source of "ra" has semantic meaning, you
have to honor that semantic meaning.

Agreed?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel  wrote:
>>
>>   "Consume operation: no reads in the current thread dependent on the
>> value currently loaded can be reordered before this load"
>
> I can't remember seeing that language in the standard (ie, C or C++).
> Where is this from?

That's just for googling for explanations. I do have some old standard
draft, but that doesn't have any concise definitions anywhere that I
could find.

>> and it could make a compiler writer say that value speculation is
>> still valid, if you do it like this (with "ptr" being the atomic
>> variable):
>>
>>   value = ptr->val;
>
> I assume the load from ptr has mo_consume ordering?

Yes.

>> into
>>
>>   tmp = ptr;
>>   value = speculated.value;
>>   if (unlikely(tmp != &speculated))
>> value = tmp->value;
>>
>> which is still bogus. The load of "ptr" does happen before the load of
>> "value = speculated->value" in the instruction stream, but it would
>> still result in the CPU possibly moving the value read before the
>> pointer read at least on ARM and power.
>
> And surprise, in the C/C++ model the load from ptr is sequenced-before
> the load from speculated, but there's no ordering constraint on the
> reads-from relation for the value load if you use mo_consume on the ptr
> load.  Thus, the transformed code has less ordering constraints than the
> original code, and we arrive at the same outcome.

Ok, good.

> The standard is clear on what's required.  I strongly suggest reading
> the formalization of the memory model by Batty et al.

Can you point to it? Because I can find a draft standard, and it sure
as hell does *not* contain any clarity of the model. It has a *lot* of
verbiage, but it's pretty much impossible to actually understand, even
for somebody who really understands memory ordering.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
On 18 February 2014 17:38, Linus Torvalds  wrote:
> On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell  
> wrote:
>>
>> For example, suppose we have, in one compilation unit:
>>
>> void f(int ra, int*rb) {
>>   if (ra==42)
>> *rb=42;
>>   else
>> *rb=42;
>> }
>
> So this is a great example, and in general I really like your page at:
>
>> For more context, this example is taken from a summary of the thin-air
>> problem by Mark Batty and myself,
>> , and the problem with
>> dependencies via other compilation units was AFAIK first pointed out
>> by Hans Boehm.
>
> and the reason I like your page is that it really talks about the
> problem by pointing to the "unoptimized" code, and what hardware would
> do.

Thanks.  It's certainly necessary to separately understand what compiler
optimisation and the hardware might do, to get anywhere here.   But...

> As mentioned, I think that's actually the *correct* way to think about
> the problem space, because it allows the programmer to take hardware
> characteristics into account, without having to try to "describe" them
> at a source level.

...to be clear, I am ultimately after a decent source-level description of what
programmers can depend on, and we (Mark and I) view that page as
identifying constraints on what that description can say.  There are too
many compiler optimisations for people to reason directly in terms of
the set of all transformations that they do, so we need some more
concise and comprehensible envelope identifying what is allowed,
as an interface between compiler writers and users.  AIUI that's basically
what Torvald is arguing.

The C11 spec in its current form is not yet fully up to that task, for one thing
because it doesn't attempt to cover all the h/w interactions that you and
Paul list, but that is where we're trying to go with our formalisation work.

> As to your example of
>
>    if (ra)
>            atomic_write(rb, A);
>    else
>            atomic_write(rb, B);
>
> I really think that it is ok to combine that into
>
> atomic_write(rb, ra ? A:B);
>
> (by virtue of "exact same behavior on actual hardware"), and then the
> only remaining question is whether the "ra?A:B" can be optimized to
> remove the conditional if A==B as in your example where both are "42".
> Agreed?

y

> Now, I would argue that the "naive" translation of that is
> unambiguous, and since "ra" is not volatile or magic in any way, then
> "ra?42:42" can obviously be optimized into just 42 - by the exact same
> rule that says "the compiler can do any transformation that is
> equivalent in the hardware". The compiler can *locally* decide that
> that is the right thing to do, and any programmer that complains about
> that decision is just crazy.
>
> So my "local machine behavior equivalency" rule means that that
> function can be optimized into a single "store 42 atomically into rb".

This is a bit more subtle, because (on ARM and POWER) removing the
dependency and conditional branch is actually in general *not* equivalent
in the hardware, in a concurrent context.

That notwithstanding, I tend to agree that preventing that optimisation for
non-atomics would be prohibitively costly (though I'd like real data).

It's tempting then to permit more-or-less any optimisation for thread-local
accesses but rule out value-range analysis and suchlike for shared-memory
accesses.  Whether that would be viable from a compiler point of view, I don't
know.   In C, one will at best only be able to get an approximate analysis
of which is which, just for a start.


> Now, if it's *not* compiled locally, and is instead implemented as a
> macro (or inline function), there are obviously situations where "ra ?
> A : B" ends up having to do other things. In particular, X may be
> volatile or an atomic read that has ordering semantics, and then that
> expression doesn't become just "42", but that's a separate issue. It's
> not all that dissimilar to "function calls are sequence points",
> though, and obviously if the source of "ra" has semantic meaning, you
> have to honor that semantic meaning.

separate point, indeed

Peter


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
On 18 February 2014 17:16, Paul E. McKenney  wrote:
> On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
>> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
>> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
>> >> And exactly because I know enough, I would *really* like atomics to be
>> >> well-defined, and have very clear - and *local* - rules about how they
>> >> can be combined and optimized.
>> >
>> > "Local"?
>>
>> Yes.
>>
>> So I think that one of the big advantages of atomics over volatile is
>> that they *can* be optimized, and as such I'm not at all against
>> trying to generate much better code than for volatile accesses.
>>
>> But at the same time, that can go too far. For example, one of the
>> things we'd want to use atomics for is page table accesses, where it
>> is very important that we don't generate multiple accesses to the
>> values, because parts of the values can be changed *by hardware* (ie
>> accessed and dirty bits).
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>
> Might as well list other reasons why value proofs via whole-program
> analysis are unreliable for the Linux kernel:
>
> 1.  As Linus said, changes from hardware.
>
> 2.  Assembly code that is not visible to the compiler.
> Inline asms will -normally- let the compiler know what
> memory they change, but some just use the "memory" tag.
> Worse yet, I suspect that most compilers don't look all
> that carefully at .S files.
>
> Any number of other programs contain assembly files.
>
> 3.  Kernel modules that have not yet been written.  Now, the
> compiler could refrain from trying to prove anything about
> an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> is currently no way to communicate this information to the
> compiler other than marking the variable "volatile".
>
> Other programs have similar issues, e.g., via dlopen().
>
> 4.  Some drivers allow user-mode code to mmap() some of their
> state.  Any changes undertaken by the user-mode code would
> be invisible to the compiler.
>
> 5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/
>
> And probably other stuff as well.

interesting list.  So are you saying that value-range-analysis and
such-like (I say glibly, without really knowing what "such-like"
refers to here) are fundamentally incompatible with
the kernel code, or can you think of some way to tell the compiler a
bound on the footprint of the "unseen" changes in each of those cases?

Peter


> Thanx, Paul
>


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
 wrote:
>
> This is a bit more subtle, because (on ARM and POWER) removing the
> dependency and conditional branch is actually in general *not* equivalent
> in the hardware, in a concurrent context.

So I agree, but I think that's a generic issue with non-local memory
ordering, and is not at all specific to the optimization wrt that
"x?42:42" expression.

If you have a value that you loaded with a non-relaxed load, and you
pass that value off to a non-local function that you don't know what
it does, in my opinion that implies that the compiler had better add
the necessary serialization to say "whatever that other function does,
we guarantee the semantics of the load".

So on ppc, if you do a load with "consume" or "acquire" and then call
another function without having had something in the caller that
serializes the load, you'd better add the lwsync or whatever before
the call. Exactly because the function call itself otherwise basically
breaks the visibility into ordering. You've basically turned a
load-with-ordering-guarantees into just an integer that you passed off
to something that doesn't know about the ordering guarantees - and you
need that "lwsync" in order to still guarantee the ordering.

Tough titties. That's what a CPU with weak memory ordering semantics
gets in order to have sufficient memory ordering.

And I don't think it's actually a problem in practice. If you are
doing loads with ordered semantics, you're not going to pass the
result off willy-nilly to random functions (or you really *do* require
the ordering, because the load that did the "acquire" was actually for
a lock!)

So I really think that the "local optimization" is correct regardless.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 10:23 AM, Peter Sewell
 wrote:
>
> interesting list.  So are you saying that value-range-analysis and
> such-like (I say glibly, without really knowing what "such-like"
> refers to here) are fundamentally incompatible with
> the kernel code

No, it's fine to do things like value-range-analysis, it's just that
you have to do it on a local scope, and not think that you can see
every single write to some variable.

And as Paul points out, this is actually generally true even outside
of kernels. We may be pretty unique in having some things that are
literally done by hardware (eg the page table updates), but we're
certainly not unique in having loadable modules or code that the
compiler doesn't know about by virtue of being compiled separately
(assembly files, JITted, whatever).

And we are actually perfectly fine with compiler barriers. One of our
most common barriers is simply this:

   #define barrier() __asm__ __volatile__("": : :"memory")

which basically tells the compiler "something changes memory in ways
you don't understand, so you cannot assume anything about memory
contents".

Obviously, a compiler is still perfectly fine optimizing things like
local variables etc that haven't had their address taken across such a
barrier (regardless of whether those local variables might be spilled
to the stack frame etc), but equally obviously we'd require that this
kind of thing makes sure that any atomic writes have been finalized by
the time the barrier happens (it does *not* imply a memory ordering,
just a compiler memory access barrier - we have separate things for
ordering requirements, and those obviously often generate actual
barrier instructions)
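
A minimal usage sketch of that barrier (the flag name is made up): without
the "memory" clobber the compiler could hoist the load out of the loop, since
no code it can see ever writes the flag.

    #define barrier() __asm__ __volatile__("": : :"memory")

    extern int stop_flag;                   /* written by an IRQ handler, .S
                                               code, or another thread */

    void wait_for_stop(void)
    {
            while (!stop_flag)
                    barrier();              /* forces the flag to be re-read
                                               on every iteration */
    }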

And we're likely always going to have things like this, but if C11
atomics end up working well for us, we might have *fewer* of them.

  Linus


Re: TYPE_BINFO and canonical types at LTO

2014-02-18 Thread Jan Hubicka
> > Non-ODR types born from other frontends will then need to be made to 
> > alias all the ODR variants that can be done by storing them into the 
> > current canonical type hash.
> > (I wonder if we want to support cross language aliasing for non-POD?)
> 
> Surely for accessing components of non-POD types, no?  Like
> 
> class Foo {
> Foo();
> int *get_data ();
> int *data;
> } glob_foo;
> 
> extern "C" int *get_foo_data() { return glob_foo.get_data(); }

OK, if we want to support this, then we want to merge.
What about types with vtbl pointer? :)
> 
> ?  But you are talking about the "tree" merging part using ODR info
> to also merge types which differ in completeness of contained
> pointer types, right?  (exactly equal cases should be already merged)

Actually I was speaking of canonical types here. I want to preserve more of TBAA
via honoring ODR and local types.
I want to change lto to not merge canonical types for pairs of types of the same
layout (i.e. equivalent in the current canonical type definition) but with
different mangled names. I also want it to never merge when types are local.
For inter-language TBAA we will need to ensure aliasing between a non-ODR type
of the same layout and all unmerged variants of an ODR type.  Can it be done by
attaching chains of ODR types to the canonical type hash and, when a non-ODR
type appears, just making it alias with all of them?

It would make sense to ODR merge in tree merging, too, but I am not sure if
this fits the current design, since you would need to merge SCC components of
different shape, and that seems hard, right?

It may be easier to ODR merge after streaming (during DECL fixup), just to make
WPA streaming cheaper and to reduce debug info size.  If you use
-fdump-ipa-devirt, it will dump the ODR types that did not get merged (only
ones with vtable pointers in them ATM), and there are quite long chains for
firefox.  Hundreds of duplicated ODR types will then end up in the ltrans
partition streams, and they eventually hit the debug output machinery.
Eric sent me a presentation about doing this in LLVM:
http://llvm.org/devmtg/2013-11/slides/Christopher-DebugInfo.pdf
> 
> The canonical type computation happens separately (only for prevailing
> types, of course), and there we already "merge" types which differ
> in completeness.  Canonical type merging is conservative the other
> way around - if we merge _all_ types to a single canonical type then
> TBAA is still correct (we get a single alias set).

Yes, I think I understand that.  One equivalence is kind of minimal, so we merge
only if we are sure there is no information loss; the other is maximal, so we
are sure that types that need to be equivalent under the underlying language's
TBAA rules are actually equivalent.
> 
> > I also think we want explicit representation of types known to be local 
> > to compilation unit - anonymous namespaces in C/C++, types defined 
> > within function bodies in C and god knows what in Ada/Fortran/Java.
> 
> But here you get into the idea of improving TBAA, thus having
> _more_ distinct canonical types?

Yes.
> 
> Just to make sure to not mix those two ;)
> 
> And whatever "frontend knowledge" we want to excercise - please
> make sure we get a reliable way for the middle-end to see
> that "frontend knowledge" (no langhooks!).  Thus, make it
> "middle-end knowledge".

Sure, that is what I am proposing - just have DECL_ASSEMBLER_NAME on the TYPE_DECL
plus an ODR flag.  The middle-end, when comparing types, will test the ODR flag
and, if the flag is set, compare via DECL_ASSEMBLER_NAME (TYPE_DECL (type)).
No langhooks needed here + if another language has a similar inter-unit
equivalency it can use the same mechanism; just turn the equivalency description
into string identifiers.
> 
> Oh - and the easiest way to improve things is to get less types into
> the merging process in the first place!

Yep, my experiments with not streaming BINFO are aimed at exactly that.  I will
collect some numbers and send them.

Honza
> 
> Richard.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel  wrote:

[ . . . ]

> > The standard is clear on what's required.  I strongly suggest reading
> > the formalization of the memory model by Batty et al.
> 
> Can you point to it? Because I can find a draft standard, and it sure
> as hell does *not* contain any clarity of the model. It has a *lot* of
> verbiage, but it's pretty much impossible to actually understand, even
> for somebody who really understands memory ordering.

I suspect he is thinking of the following:

"Mathematizing C++ Concurrency." Mark Batty, Scott Owens, Susmit Sarkar,
Peter Sewell, and Tjark Weber.
https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf

Even if you don't like the math, it contains some very good examples.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 06:23:47PM +, Peter Sewell wrote:
> On 18 February 2014 17:16, Paul E. McKenney  
> wrote:
> > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> >> And exactly because I know enough, I would *really* like atomics to be
> >> >> well-defined, and have very clear - and *local* - rules about how they
> >> >> can be combined and optimized.
> >> >
> >> > "Local"?
> >>
> >> Yes.
> >>
> >> So I think that one of the big advantages of atomics over volatile is
> >> that they *can* be optimized, and as such I'm not at all against
> >> trying to generate much better code than for volatile accesses.
> >>
> >> But at the same time, that can go too far. For example, one of the
> >> things we'd want to use atomics for is page table accesses, where it
> >> is very important that we don't generate multiple accesses to the
> >> values, because parts of the values can be changed *by hardware* (ie
> >> accessed and dirty bits).
> >>
> >> So imagine that you have some clever global optimizer that sees that
> >> the program never ever actually sets the dirty bit at all in any
> >> thread, and then uses that kind of non-local knowledge to make
> >> optimization decisions. THAT WOULD BE BAD.
> >
> > Might as well list other reasons why value proofs via whole-program
> > analysis are unreliable for the Linux kernel:
> >
> > 1.  As Linus said, changes from hardware.
> >
> > 2.  Assembly code that is not visible to the compiler.
> > Inline asms will -normally- let the compiler know what
> > memory they change, but some just use the "memory" tag.
> > Worse yet, I suspect that most compilers don't look all
> > that carefully at .S files.
> >
> > Any number of other programs contain assembly files.
> >
> > 3.  Kernel modules that have not yet been written.  Now, the
> > compiler could refrain from trying to prove anything about
> > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > is currently no way to communicate this information to the
> > compiler other than marking the variable "volatile".
> >
> > Other programs have similar issues, e.g., via dlopen().
> >
> > 4.  Some drivers allow user-mode code to mmap() some of their
> > state.  Any changes undertaken by the user-mode code would
> > be invisible to the compiler.
> >
> > 5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/
> >
> > And probably other stuff as well.
> 
> interesting list.  So are you saying that value-range-analysis and
> such-like (I say glibly, without really knowing what "such-like"
> refers to here) are fundamentally incompatible with
> the kernel code, or can you think of some way to tell the compiler a
> bound on the footprint of the "unseen" changes in each of those cases?

Other than the "volatile" keyword, no.

Well, I suppose you could also create a function that changed the
variables in question, then arrange to never call it, but in such a way
that the compiler could not prove that it was never called.  But ouch!

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
>  wrote:
> >
> > This is a bit more subtle, because (on ARM and POWER) removing the
> > dependency and conditional branch is actually in general *not* equivalent
> > in the hardware, in a concurrent context.
> 
> So I agree, but I think that's a generic issue with non-local memory
> ordering, and is not at all specific to the optimization wrt that
> "x?42:42" expression.
> 
> If you have a value that you loaded with a non-relaxed load, and you
> pass that value off to a non-local function that you don't know what
> it does, in my opinion that implies that the compiler had better add
> the necessary serialization to say "whatever that other function does,
> we guarantee the semantics of the load".
> 
> So on ppc, if you do a load with "consume" or "acquire" and then call
> another function without having had something in the caller that
> serializes the load, you'd better add the lwsync or whatever before
> the call. Exactly because the function call itself otherwise basically
> breaks the visibility into ordering. You've basically turned a
> load-with-ordering-guarantees into just an integer that you passed off
> to something that doesn't know about the ordering guarantees - and you
> need that "lwsync" in order to still guarantee the ordering.
> 
> Tough titties. That's what a CPU with weak memory ordering semantics
> gets in order to have sufficient memory ordering.

And that is in fact what C11 compilers are supposed to do if the function
doesn't have the [[carries_dependency]] attribute on the corresponding
argument or return of the non-local function.  If the function is marked
with [[carries_dependency]], then the compiler has the information needed
in both compilations to make things work correctly.

Thanx, Paul

> And I don't think it's actually a problem in practice. If you are
> doing loads with ordered semantics, you're not going to pass the
> result off willy-nilly to random functions (or you really *do* require
> the ordering, because the load that did the "acquire" was actually for
> a lock!)
> 
> So I really think that the "local optimization" is correct regardless.
> 
>Linus
> 



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel  wrote:
> > The standard is clear on what's required.  I strongly suggest reading
> > the formalization of the memory model by Batty et al.
> 
> Can you point to it? Because I can find a draft standard, and it sure
> as hell does *not* contain any clarity of the model. It has a *lot* of
> verbiage, but it's pretty much impossible to actually understand, even
> for somebody who really understands memory ordering.

http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
This has an explanation of the model up front, and then the detailed
formulae in Section 6.  This is from 2010, and there might have been
smaller changes since then, but I'm not aware of any bigger ones.

The cppmem tool is based on this, so if you want to play around with a
few code examples, it's pretty nice because it shows you all allowed
executions, including graphs like in the paper (see elsewhere in this
thread for CAS syntax):
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
> > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  
> > > wrote:
> > > >
> > > > There's an underlying problem here that's independent from the actual
> > > > instance that you're worried about here: "no sense" is ultimately a
> > > > matter of taste/objectives/priorities as long as the respective
> > > > specification is logically consistent.
> > > 
> > > Yes. But I don't think it's "independent".
> > > 
> > > Exactly *because* some people will read standards without applying
> > > "does the resulting code generation actually make sense for the
> > > programmer that wrote the code", the standard has to be pretty clear.
> > > 
> > > The standard often *isn't* pretty clear. It wasn't clear enough when
> > > it came to "volatile", and yet that was a *much* simpler concept than
> > > atomic accesses and memory ordering.
> > > 
> > > And most of the time it's not a big deal. But because the C standard
> > > generally tries to be very portable, and cover different machines,
> > > there tends to be a mindset that anything inherently unportable is
> > > "undefined" or "implementation defined", and then the compiler writer
> > > is basically given free rein to do anything they want (with
> > > "implementation defined" at least requiring that it is reliably the
> > > same thing).
> > 
> > Yes, that's how it works in general.  And this makes sense, because all
> > optimizations rely on that.  Second, you can't keep something consistent
> > (eg, between compilers) if it isn't specified.  So if we want stricter
> > rules, those need to be specified somewhere.
> > 
> > > And when it comes to memory ordering, *everything* is basically
> > > non-portable, because different CPU's very much have different rules.
> > 
> > Well, the current set of memory orders (and the memory model as a whole)
> > is portable, even though it might not allow to exploit all hardware
> > properties, and thus might perform sub-optimally in some cases.
> > 
> > > I worry that that means that the standard then takes the stance that
> > > "well, compiler re-ordering is no worse than CPU re-ordering, so we
> > > let the compiler do anything". And then we have to either add
> > > "volatile" to make sure the compiler doesn't do that, or use an overly
> > > strict memory model at the compiler level that makes it all pointless.
> > 
> > Using "volatile" is not a good option, I think, because synchronization
> > between threads should be orthogonal to observable output of the
> > abstract machine.
> 
> Are you thinking of "volatile" -instead- of atomics?  My belief is that
> given the current standard there will be times that we need to use
> "volatile" -in- -addition- to atomics.

No, I was talking about having to use "volatile" in addition to atomics
to get synchronization properties for the atomics.  ISTM that this would
be unfortunate because the objective for "volatile" (ie, declaring what
is output of the abstract machine) is orthogonal to synchronization
between threads.  So if we want to preserve control dependencies and get
ordering guarantees through them, I would prefer not to do it through hacks
based on volatile.

For example, if we have a macro or helper function that contains
synchronization code, but the compiler can prove that the macro/function
is used just on data only accessible to a single thread, then the
"volatile" notation on it would prevent optimizations.

> > The current memory model might not allow to exploit all hardware
> > properties, I agree.
> > 
> > But then why don't we work on how to extend it to do so?  We need to
> > specify the behavior we want anyway, and this can't be independent of
> > the language semantics, so it has to be conceptually integrated with the
> > standard anyway.
> > 
> > > So I really really hope that the standard doesn't give compiler
> > > writers free hands to do anything that they can prove is "equivalent"
> > > in the virtual C machine model.
> > 
> > It does, but it also doesn't mean this can't be extended.  So let's
> > focus on whether we can find an extension.
> > 
> > > That's not how you get reliable
> > > results.
> > 
> > In this general form, that's obviously a false claim.
> 
> These two sentences starkly illustrate the difference in perspective
> between you two.  You are talking past each other.  Not sure how to fix
> this at the moment, but what else is new?  ;-)

Well, we're still talking, and we are making little steps in clarifying
things, so there is progress :)



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.

[...]

> 2) otherwise, the language definition should prohibit it but the
>compiler would have to preserve dependencies even in compilation
>units that have no mention of atomics.  It's unclear what the
>(runtime and compiler development) cost of that would be in
>practice - perhaps Torvald could comment?

If I'm reading the standard correctly, it requires that data
dependencies are preserved through loads and stores, including nonatomic
ones.  That sounds convenient because it allows programmers to use
temporary storage.
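
For illustration, a sketch of what "through loads and stores" means here
(names invented): the dependency from the consume load is supposed to survive
a round trip through the plain variable slot.

    #include <stdatomic.h>

    struct node { int a; };

    extern _Atomic(struct node *) gp;
    static struct node *slot;               /* plain, nonatomic storage */

    int reader(void)
    {
            struct node *p =
                    atomic_load_explicit(&gp, memory_order_consume);
            slot = p;                       /* dependency stored ...   */
            struct node *q = slot;          /* ... and read back ...   */
            return q->a;                    /* ... and, on this reading of
                                               the standard, still ordered
                                               after the consume load */
    }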

However, what happens if a dependency "arrives" at a store for which the
alias set isn't completely known?  Then we either have to add a barrier
to enforce the ordering at this point, or we have to assume that all
other potentially aliasing memory locations would also have to start
carrying dependencies (which might be in other functions in other
compilation units).  Neither option is good.  The first might introduce
barriers in places in which they might not be required (or the
programmer has to use kill_dependency() quite often to avoid all these).
The second is bad because points-to analysis is hard, so in practice the
points-to set will not be precisely known for a lot of pointers.  So
this might not just creep into other functions via calls of
[[carries_dependency]] functions, but also through normal loads and
stores, likely prohibiting many optimizations.

Furthermore, the dependency tracking can currently only be
"disabled/enabled" on a function granularity (via
[[carries_dependency]]).  Thus, if we have big functions, then
dependency tracking may slow down a lot of code in the big function.  If
we have small functions, there's a lot of attributes to be added.

If a function may only carry a dependency but doesn't necessarily (eg,
depending on input parameters), then the programmer has to make a
trade-off whether he/she wants to benefit from mo_consume but slow down
other calls due to additional barriers (ie, when this function is called
from non-[[carries_dependency]] functions), or vice versa.  (IOW,
because of the function granularity, other code's performance is
affected.)

If a compiler wants to implement dependency tracking just for a few
constructs (e.g., operators -> + ...) and use barriers otherwise, then
this decision must be compatible with how all this is handled in other
compilation units.  Thus, compiler optimizations effectively become part
of the ABI, which doesn't seem right.

I hope these examples illustrate my concerns about the implementability
in practice of this.  It's also why I've suggested to move from an
opt-out approach as in the current standard (ie, with kill_dependency())
to an opt-in approach for conservative dependency tracking (e.g., with a
preserve_dependencies(exp) call, where exp will not be optimized in a
way that removes any dependencies).  This wouldn't help with many
optimizations being prevented, but it should at least help programmers
contain the problem to smaller regions of code.

I'm not aware of any implementation that tries to track dependencies, so
I can't give any real performance numbers.  This could perhaps be
simulated, but I'm not sure whether a realistic case would be made
without at least supporting [[carries_dependency]] properly in the
compiler, which would be some work.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 18:21 +, Peter Sewell wrote:
> On 18 February 2014 17:38, Linus Torvalds  
> wrote:
> > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell  
> > wrote:
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >> void f(int ra, int*rb) {
> >>   if (ra==42)
> >> *rb=42;
> >>   else
> >> *rb=42;
> >> }
> >
> > So this is a great example, and in general I really like your page at:
> >
> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> , and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > and the reason I like your page is that it really talks about the
> > problem by pointing to the "unoptimized" code, and what hardware would
> > do.
> 
> Thanks.  It's certainly necessary to separately understand what compiler
> optimisation and the hardware might do, to get anywhere here.   But...
> 
> > As mentioned, I think that's actually the *correct* way to think about
> > the problem space, because it allows the programmer to take hardware
> > characteristics into account, without having to try to "describe" them
> > at a source level.
> 
> ...to be clear, I am ultimately after a decent source-level description of 
> what
> programmers can depend on, and we (Mark and I) view that page as
> identifying constraints on what that description can say.  There are too
> many compiler optimisations for people to reason directly in terms of
> the set of all transformations that they do, so we need some more
> concise and comprehensible envelope identifying what is allowed,
> as an interface between compiler writers and users.  AIUI that's basically
> what Torvald is arguing.

Yes, that's one reason.  Another one is that if a programmer would
actually want to use atomics in a machine-independent / portable way,
he/she also does not want to reason about how all those transformations
might interact with the machine's memory model.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> >> And exactly because I know enough, I would *really* like atomics to be
> >> well-defined, and have very clear - and *local* - rules about how they
> >> can be combined and optimized.
> >
> > "Local"?
> 
> Yes.
> 
> So I think that one of the big advantages of atomics over volatile is
> that they *can* be optimized, and as such I'm not at all against
> trying to generate much better code than for volatile accesses.
> 
> But at the same time, that can go too far. For example, one of the
> things we'd want to use atomics for is page table accesses, where it
> is very important that we don't generate multiple accesses to the
> values, because parts of the values can be changed *by hardware* (ie
> accessed and dirty bits).
> 
> So imagine that you have some clever global optimizer that sees that
> the program never ever actually sets the dirty bit at all in any
> thread, and then uses that kind of non-local knowledge to make
> optimization decisions. THAT WOULD BE BAD.
> 
> Do you see what I'm aiming for?

Yes, I do.  But that seems to be "volatile" territory.  It crosses the
boundaries of the abstract machine, and thus is input/output.  Which
fraction of your atomic accesses can read values produced by hardware?
I would still suppose that lots of synchronization is not affected by
this.

Do you perhaps want a weaker form of volatile?  That is, one that, for
example, allows combining of two adjacent loads of the dirty bits, but
will make sure that this is treated as if there is some imaginary
external thread that it cannot analyze and that may write?

I'm trying to be careful here in distinguishing between volatile and
synchronization because I believe those are orthogonal, and this is a
good thing because it allows for more-than-conservatively optimized
code.

> Any optimization that tries to prove
> anything from more than local state is by definition broken, because
> it assumes that everything is described by the program.

Well, that's how atomics that aren't volatile are defined in the
standard.  I can see that you want something else too, but that doesn't
mean that the other thing is broken.

> But *local* optimizations are fine, as long as they follow the obvious
> rule of not actually making changes that are semantically visible.

If we assume that there is this imaginary thread called hardware that
can write/read to/from such weak-volatile atomics, I believe this should
restrict optimizations sufficiently even in the model as specified in
the standard.  For example, it would prevent a compiler from proving
that there is no access by another thread to a variable, so it would
prevent the cases in our discussion that you didn't want to get
optimized.  Yet, I believe the model itself could stay unmodified.

Thus, with this "weak volatile", we could let programmers request
precisely the semantics that they want, without using volatile too much
and without preventing optimizations more than necessary.  Thoughts?

> (In practice, I'd be impressed as hell for that particular example,
> and we actually do end up setting the dirty bit by hand in some
> situations, so the example is slightly made up, but there are other
> cases that might be more realistic in that sometimes we do things that
> are hidden from the compiler - in assembly etc - and the compiler
> might *think* it knows what is going on, but it doesn't actually see
> all accesses).

If something is visible to assembly, then the compiler should see this
(or at least be aware that it can't prove anything about such data).  Or
am I missing anything?

> > Sorry, but the rules *are* very clear.  I *really* suggest to look at
> > the formalization by Batty et al.  And in these rules, proving that a
> > read will always return value X has a well-defined meaning, and you can
> > use it.  That simply follows from how the model is built.
> 
> What "model"?

The C/C++ memory model was what I was referring to.

> That's the thing. I have tried to figure out whether the model is some
> abstract C model, or a model based on the actual hardware that the
> compiler is compiling for, and whether the model is one that assumes
> the compiler has complete knowledge of the system (see the example
> above).

It's a model as specified in the standard.  It's not parametrized by the
hardware the program will eventually run on (ignoring
implementation-defined behavior, timing, etc.).  The compiler has
complete knowledge of the system except for "volatiles" and things
coming in from the external world or escaping to it (e.g., stuff
escaping into asm statements).

> And it seems to be a mixture of it all. The definitions of the various
> orderings obviously very much imply that the compiler has to insert
> the proper barriers or sequence markers for that architecture, but
> then there is no apparent w

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote:
> 
> On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
> > Several of you have said that the standard and compiler should not
> > permit speculative writes of atomics, or (effectively) that the
> > compiler should preserve dependencies.  In simple examples it's easy
> > to see what that means, but in general it's not so clear what the
> > language should guarantee, because dependencies may go via non-atomic
> > code in other compilation units, and we have to consider the extent to
> > which it's desirable to limit optimisation there.
> 
> [...]
> 
> > 2) otherwise, the language definition should prohibit it but the
> >compiler would have to preserve dependencies even in compilation
> >units that have no mention of atomics.  It's unclear what the
> >(runtime and compiler development) cost of that would be in
> >practice - perhaps Torvald could comment?
> 
> If I'm reading the standard correctly, it requires that data
> dependencies are preserved through loads and stores, including nonatomic
> ones.  That sounds convenient because it allows programmers to use
> temporary storage.
> 
> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known?  Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units).  Neither option is good.  The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers.  So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.

I cannot immediately think of a situation where a store carrying a
dependency into a non-trivially aliased object wouldn't be a usage
error, so perhaps emitting a barrier and a diagnostic at that point
is best.

> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" on a function granularity (via
> [[carries_dependency]]).  Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function.  If
> we have small functions, there's a lot of attributes to be added.
> 
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> trade-off whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa.  (IOW,
> because of the function granularity, other code's performance is
> affected.)
> 
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...) and use barriers otherwise, then
> this decision must be compatible with how all this is handled in other
> compilation units.  Thus, compiler optimizations effectively become part
> of the ABI, which doesn't seem right.
> 
> I hope these examples illustrate my concerns about the implementability
> in practice of this.  It's also why I've suggested to move from an
> opt-out approach as in the current standard (ie, with kill_dependency())
> to an opt-in approach for conservative dependency tracking (e.g., with a
> preserve_dependencies(exp) call, where exp will not be optimized in a
> way that removes any dependencies).  This wouldn't help with many
> optimizations being prevented, but it should at least help programmers
> contain the problem to smaller regions of code.
> 
> I'm not aware of any implementation that tries to track dependencies, so
> I can't give any real performance numbers.  This could perhaps be
> simulated, but I'm not sure whether a realistic case would be made
> without at least supporting [[carries_dependency]] properly in the
> compiler, which would be some work.

Another approach would be to use start-tracking/stop-tracking directives
that could be buried into rcu_read_lock() and rcu_read_unlock().  There
are issues with nesting and conditional use of rcu_read_lock() and
rcu_read_unlock(), but it does give you nicer granularity properties.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > >> And exactly because I know enough, I would *really* like atomics to be
> > >> well-defined, and have very clear - and *local* - rules about how they
> > >> can be combined and optimized.
> > >
> > > "Local"?
> > 
> > Yes.
> > 
> > So I think that one of the big advantages of atomics over volatile is
> > that they *can* be optimized, and as such I'm not at all against
> > trying to generate much better code than for volatile accesses.
> > 
> > But at the same time, that can go too far. For example, one of the
> > things we'd want to use atomics for is page table accesses, where it
> > is very important that we don't generate multiple accesses to the
> > values, because parts of the values can be changed *by hardware* (ie
> > accessed and dirty bits).
> > 
> > So imagine that you have some clever global optimizer that sees that
> > the program never ever actually sets the dirty bit at all in any
> > thread, and then uses that kind of non-local knowledge to make
> > optimization decisions. THAT WOULD BE BAD.
> 
> Might as well list other reasons why value proofs via whole-program
> analysis are unreliable for the Linux kernel:
> 
> 1.  As Linus said, changes from hardware.

This is what volatile is for, right?  (Or the weak-volatile idea I
mentioned).

Compilers won't be able to prove something about the values of such
variables, if marked (weak-)volatile.

> 2.  Assembly code that is not visible to the compiler.
>   Inline asms will -normally- let the compiler know what
>   memory they change, but some just use the "memory" tag.
>   Worse yet, I suspect that most compilers don't look all
>   that carefully at .S files.
> 
>   Any number of other programs contain assembly files.

Are the annotations of changed memory really a problem?  If the "memory"
tag exists, isn't that supposed to mean all memory?
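
For reference, the kernel's compiler barrier is essentially just that: an
empty asm whose only effect is the "memory" clobber, telling the compiler
that any memory may have changed across it:

/* Roughly the kernel's barrier(): no instructions emitted, but cached
 * values of memory must not be carried across it. */
#define barrier() __asm__ __volatile__("" : : : "memory")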

To make a proof about a program for location X, the compiler has to
analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
the compiler will simply not be able to prove a thing (except maybe due
to the data-race-free requirement for non-atomics).  The attempt to
prove something isn't unreliable, simply because a correct compiler
won't claim to be able to "prove" something.

One thing that could corrupt this is if the program addresses objects
other than through the mechanisms defined in the language.  For example,
if one thread lays out a data structure at a constant fixed memory
address, and another one then uses the fixed memory address to get
access to the object with a cast (e.g., (void*)0x123).
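
A deliberately non-portable sketch of what that looks like (the struct and
the address are made up for illustration):

/* Nothing in the language's object model connects this access to whatever
 * another thread placed at 0x123000, so escape analysis based purely on
 * language semantics cannot see it. */
struct shared_state { int flag; };
#define SHARED ((volatile struct shared_state *)0x123000)

void poke(void)
{
        SHARED->flag = 1;
}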

> 3.  Kernel modules that have not yet been written.  Now, the
>   compiler could refrain from trying to prove anything about
>   an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
>   is currently no way to communicate this information to the
>   compiler other than marking the variable "volatile".

Even if the variable is just externally accessible, then the compiler
knows that it can't do whole-program analysis about it.

It is true that whole-program analysis will not be applicable in this
case, but it will not be unreliable.  I think that's an important
difference.

>   Other programs have similar issues, e.g., via dlopen().
> 
> 4.  Some drivers allow user-mode code to mmap() some of their
>   state.  Any changes undertaken by the user-mode code would
>   be invisible to the compiler.

A good point, but a compiler that doesn't try to (incorrectly) assume
something about the semantics of mmap will simply see that the mmap'ed
data will escape to stuff it can't analyze, so it will not be able to
make a proof.

This is different from, for example, malloc(), which is guaranteed to
return "fresh" nonaliasing memory.

> 5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/

This might be special, or not, depending on how the JITed code gets
access to data.  If this is via fixed addresses (e.g., (void*)0x123),
then see above.  If this is through function calls that the compiler
can't analyze, then this is like 4.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Zijlstra
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Well, that's how atomics that aren't volatile are defined in the
> standard.  I can see that you want something else too, but that doesn't
> mean that the other thing is broken.

Well that other thing depends on being able to see the entire program at
compile time. PaulMck already listed various ways in which this is
not feasible even for normal userspace code.

In particular; DSOs and JITs were mentioned.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Zijlstra
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> boundaries of the abstract machine, and thus is input/output.  Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

It's not only hardware; the kernel/user boundary has this same
problem. We cannot a priori say what userspace will do; in fact, because
we're a general purpose OS, we must assume it will willfully try its
bestest to wreck whatever assumptions we make about its behaviour.

We also have loadable modules -- much like regular userspace DSOs -- so
there too we cannot say what will or will not happen.

We also have JITs that generate code on demand.

And I'm absolutely sure (with the exception of the JITs, it's not an area
I've worked on) that we have atomic usage across all those boundaries.

I must agree with Linus: global-state-driven optimizations are crack-brained,
esp. for atomics. We simply cannot know all state at compile time. The best
we can hope for are local optimizations.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Well, that's how atomics that aren't volatile are defined in the
> > standard.  I can see that you want something else too, but that doesn't
> > mean that the other thing is broken.
> 
> Well that other thing depends on being able to see the entire program at
> compile time. PaulMck already listed various ways in which this is
> not feasible even for normal userspace code.
> 
> In particular; DSOs and JITs were mentioned.

No it doesn't depend on whole-program analysis being possible.  Because
if it isn't, then a correct compiler will just not do certain
optimizations simply because it can't prove properties required for the
optimization to hold.  With the exception of access to objects via magic
numbers (e.g., fixed and known addresses (see my reply to Paul), which
are outside of the semantics specified in the standard), I don't see a
correctness problem here.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Zijlstra
> > 4.  Some drivers allow user-mode code to mmap() some of their
> > state.  Any changes undertaken by the user-mode code would
> > be invisible to the compiler.
> 
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
> 
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

The kernel side of this is different; it looks like 'normal' memory that we
just happen to allow to end up in userspace too.

But on that point; how do you tell the compiler the difference between
malloc() and mmap()? Is that some function attribute?
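
(For the malloc() half of that question there is at least a real GCC function
attribute; a minimal sketch, with my_alloc being a made-up name:)

#include <stddef.h>

/* __attribute__((malloc)) tells GCC that the returned pointer does not
 * alias any other live object -- the "fresh memory" guarantee.  There does
 * not appear to be a comparable attribute describing mmap()-style sharing
 * with another party. */
void *my_alloc(size_t size) __attribute__((malloc));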


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel  wrote:
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>>
>> Do you see what I'm aiming for?
>
> Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> boundaries of the abstract machine, and thus is input/output.  Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

The "hardware can change things" case is indeed pretty rare.

But quite frankly, even when it isn't hardware, as far as the compiler
is concerned you have the exact same issue - you have TLB faults
happening on other CPU's that do the same thing asynchronously using
software TLB fault handlers. So *semantically*, it really doesn't make
any difference what-so-ever if it's a software TLB handler on another
CPU, a microcoded TLB fault, or an actual hardware path.

So if the answer for all of the above is "use volatile", then I think
that means that the C11 atomics are badly designed.

The whole *point* of atomic accesses is that stuff like above should
"JustWork(tm)"

> Do you perhaps want a weaker form of volatile?  That is, one that, for
> example, allows combining of two adjacent loads of the dirty bits, but
> will make sure that this is treated as if there is some imaginary
> external thread that it cannot analyze and that may write?

Yes, that's basically what I would want. And it is what I would expect
an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
bit of a misnomer, because technically we really generally want
"ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
volatile, and is a hell of a lot simpler to write ;^).
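
(For reference, ACCESS_ONCE() at this point is nothing more than a volatile
cast; roughly:)

/* Force a single volatile-qualified load or store of x, with no other type
 * change and no memory-ordering implications. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))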

So we obviously use "volatile" for this currently, and generally the
semantics we really want are:

 - the load or store is done as a single access ("atomic")

 - the compiler must not try to re-materialize the value by reloading
it from memory (this is the "at most once" part)

and quite frankly, "volatile" is a big hammer for this. In practice it
tends to work pretty well, though, because in _most_ cases, there
really is just the single access, so there isn't anything that it
could be combined with, and the biggest issue is often just the
correctness of not re-materializing the value.

And I agree - memory ordering is a totally separate issue, and in fact
we largely tend to consider it entirely separate. For cases where we
have ordering constraints, we either handle those with special
accessors (ie "atomic-modify-and-test" helpers tend to have some
serialization guarantees built in), or we add explicit fencing.
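
In C11 terms that split looks roughly like relaxed atomic accesses plus
explicit fences -- a sketch of the idea, not a claim about how the kernel's
primitives would actually be mapped:

#include <stdatomic.h>

atomic_int flag;

void publish(void)
{
        /* ordering supplied separately from the access itself */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
}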

But semantically, C11 atomic accessors *should* generally have the
correct behavior for our uses.

If we have to add "volatile", that makes atomics basically useless. We
already *have* the volatile semantics, if atomics need it, that just
means that atomics have zero upside for us.

>> But *local* optimizations are fine, as long as they follow the obvious
>> rule of not actually making changes that are semantically visible.
>
> If we assume that there is this imaginary thread called hardware that
> can write/read to/from such weak-volatile atomics, I believe this should
> restrict optimizations sufficiently even in the model as specified in
> the standard.

Well, what about *real* threads that do this, but that aren't
analyzable by the C compiler because they are written in another
language entirely (inline asm, asm, perl, INTERCAL, microcode,
PAL-code, whatever?)

I really don't think that "hardware" is necessary for this to happen.
What is done by hardware on x86, for example, is done by PAL-code
(loaded at boot-time) on alpha, and done by hand-tuned assembler fault
handlers on Sparc. The *effect* is the same: it's not visible to the
compiler. There is no way in hell that the compiler can understand the
hand-tuned Sparc TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do?  Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation.

The thing is, you really have three cases:

 - architectures that have weak memory ordering and fairly stupid
hardware (power, some day maybe ARM)

 - architectures that expose a fairly strong memory ordering, but
reorders aggressively in hardware by tracking c

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  
> > > wrote:
> > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > >> And exactly because I know enough, I would *really* like atomics to be
> > > >> well-defined, and have very clear - and *local* - rules about how they
> > > >> can be combined and optimized.
> > > >
> > > > "Local"?
> > > 
> > > Yes.
> > > 
> > > So I think that one of the big advantages of atomics over volatile is
> > > that they *can* be optimized, and as such I'm not at all against
> > > trying to generate much better code than for volatile accesses.
> > > 
> > > But at the same time, that can go too far. For example, one of the
> > > things we'd want to use atomics for is page table accesses, where it
> > > is very important that we don't generate multiple accesses to the
> > > values, because parts of the values can be changed *by hardware* (ie
> > > accessed and dirty bits).
> > > 
> > > So imagine that you have some clever global optimizer that sees that
> > > the program never ever actually sets the dirty bit at all in any
> > > thread, and then uses that kind of non-local knowledge to make
> > > optimization decisions. THAT WOULD BE BAD.
> > 
> > Might as well list other reasons why value proofs via whole-program
> > analysis are unreliable for the Linux kernel:
> > 
> > 1.  As Linus said, changes from hardware.
> 
> This is what volatile is for, right?  (Or the weak-volatile idea I
> mentioned).
> 
> Compilers won't be able to prove something about the values of such
> variables, if marked (weak-)volatile.

Yep.

> > 2.  Assembly code that is not visible to the compiler.
> > Inline asms will -normally- let the compiler know what
> > memory they change, but some just use the "memory" tag.
> > Worse yet, I suspect that most compilers don't look all
> > that carefully at .S files.
> > 
> > Any number of other programs contain assembly files.
> 
> Are the annotations of changed memory really a problem?  If the "memory"
> tag exists, isn't that supposed to mean all memory?
> 
> To make a proof about a program for location X, the compiler has to
> analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> the compiler will simply not be able to prove a thing (except maybe due
> to the data-race-free requirement for non-atomics).  The attempt to
> prove something isn't unreliable, simply because a correct compiler
> won't claim to be able to "prove" something.

I am indeed less worried about inline assembler than I am about files
full of assembly.  Or files full of other languages.

> One thing that could corrupt this is if the program addresses objects
> other than through the mechanisms defined in the language.  For example,
> if one thread lays out a data structure at a constant fixed memory
> address, and another one then uses the fixed memory address to get
> access to the object with a cast (e.g., (void*)0x123).

Or if the program uses gcc linker scripts to get the same effect.

> > 3.  Kernel modules that have not yet been written.  Now, the
> > compiler could refrain from trying to prove anything about
> > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > is currently no way to communicate this information to the
> > compiler other than marking the variable "volatile".
> 
> Even if the variable is just externally accessible, then the compiler
> knows that it can't do whole-program analysis about it.
> 
> It is true that whole-program analysis will not be applicable in this
> case, but it will not be unreliable.  I think that's an important
> difference.

Let me make sure that I understand what you are saying.  If my program has
"extern int foo;", the compiler will refrain from doing whole-program
analysis involving "foo"?  Or to ask it another way, when you say
"whole-program analysis", are you restricting that analysis to the
current translation unit?

If so, I was probably not the only person thinking that you instead meant
analysis across all translation units linked into the program.  ;-)

> > Other programs have similar issues, e.g., via dlopen().
> > 
> > 4.  Some drivers allow user-mode code to mmap() some of their
> > state.  Any changes undertaken by the user-mode code would
> > be invisible to the compiler.
> 
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> > data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
> 
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

As Peter 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Peter Sewell
On 18 February 2014 20:43, Torvald Riegel  wrote:
> On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.  In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>
> [...]
>
>> 2) otherwise, the language definition should prohibit it but the
>>compiler would have to preserve dependencies even in compilation
>>units that have no mention of atomics.  It's unclear what the
>>(runtime and compiler development) cost of that would be in
>>practice - perhaps Torvald could comment?
>
> If I'm reading the standard correctly, it requires that data
> dependencies are preserved through loads and stores, including nonatomic
> ones.  That sounds convenient because it allows programmers to use
> temporary storage.

The standard only needs this for consume chains, but if one wanted to
get rid of thin-air values by requiring implementations to respect all
(reads-from union dependency) cycles, AFAICS we'd need it pretty much
everywhere.   I don't myself think that's likely to be a realistic
proposal, but it does keep coming up, and it'd be very interesting to
know the actual cost on some credible workload.

> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known?  Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units).  Neither option is good.  The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers.  So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.
>
> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" on a function granularity (via
> [[carries_dependency]]).  Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function.  If
> we have small functions, there's a lot of attributes to be added.
>
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> trade-off whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa.  (IOW,
> because of the function granularity, other code's performance is
> affected.)
>
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...) and use barriers otherwise, then
> this decision must be compatible with how all this is handled in other
> compilation units.  Thus, compiler optimizations effectively become part
> of the ABI, which doesn't seem right.
>
> I hope these examples illustrate my concerns about the implementability
> in practice of this.  It's also why I've suggested to move from an
> opt-out approach as in the current standard (ie, with kill_dependency())
> to an opt-in approach for conservative dependency tracking (e.g., with a
> preserve_dependencies(exp) call, where exp will not be optimized in a
> way that removes any dependencies).  This wouldn't help with many
> optimizations being prevented, but it should at least help programmers
> contain the problem to smaller regions of code.
>
> I'm not aware of any implementation that tries to track dependencies, so
> I can't give any real performance numbers.  This could perhaps be
> simulated, but I'm not sure whether a realistic case would be made
> without at least supporting [[carries_dependency]] properly in the
> compiler, which would be some work.
>


ARM inline assembly usage in Linux kernel

2014-02-18 Thread Saleem Abdulrasool
Hello.

I am sending this at the behest of Renato.  I have been working on the ARM
integrated assembler in LLVM and came across an interesting item in the Linux
kernel.

I am wondering if this is an unstated covenant between the kernel and GCC or
simply a clever use of an unintended/undefined behaviour.

The Linux kernel uses the *compiler* as a fancy preprocessor to generate a
specially crafted assembly file.  This file is then post-processed via sed to
generate a header containing constants which is shared across assembly and C
sources.

In order to clarify the question, I am selecting a particular example and
pulling out the relevant bits of the source code below.

#define DEFINE(sym, val) asm volatile("\n->" #sym " %0 " #val : : "i" (val))

#define __NR_PAGEFLAGS 22

void definitions(void) {
  DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
}

This is then assembled to generate the following:

->NR_PAGEFLAGS #22 __NR_PAGEFLAGS

This will later be post-processed to generate:

#define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */

Using the inline assembler to evaluate (constant) expressions into constant
values and then emitting them with a special identifier (->) is a fairly clever
trick.  This leads to my question: is this just use of an unintentional
"feature", or something that was worked out between the two projects?

Please explicitly CC me on any response as I am not subscribed to this mailing
list.

Thanks.

-- 
Saleem Abdulrasool
compnerd (at) compnerd (dot) org



Re: ARM inline assembly usage in Linux kernel

2014-02-18 Thread Andrew Pinski
On Tue, Feb 18, 2014 at 6:56 PM, Saleem Abdulrasool
 wrote:
> Hello.
>
> I am sending this at the behest of Renato.  I have been working on the ARM
> integrated assembler in LLVM and came across an interesting item in the Linux
> kernel.
>
> I am wondering if this is an unstated covenant between the kernel and GCC or
> simply a clever use of an unintended/undefined behaviour.
>
> The Linux kernel uses the *compiler* as a fancy preprocessor to generate a
> specially crafted assembly file.  This file is then post-processed via sed to
> generate a header containing constants which is shared across assembly and C
> sources.
>
> In order to clarify the question, I am selecting a particular example and
> pulling out the relevant bits of the source code below.
>
> #define DEFINE(sym, val) asm volatile("\n->" #sym " %0 " #val : : "i" (val))
>
> #define __NR_PAGEFLAGS 22
>
> void definitions(void) {
>   DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
> }
>
> This is then assembled to generate the following:
>
> ->NR_PAGEFLAGS #22 __NR_PAGEFLAGS
>
> This will later be post-processed to generate:
>
> #define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */
>
> Using the inline assembler to evaluate (constant) expressions into constant
> values and then emitting them with a special identifier (->) is a fairly clever
> trick.  This leads to my question: is this just use of an unintentional
> "feature", or something that was worked out between the two projects?
>
> Please explicitly CC me on any response as I am not subscribed to this mailing
> list.

I don't see why this is a bad use of the inline-asm.  GCC does not
know and is not supposed to know what the string inside the inline-asm
is going to be.  In fact if you have a newer assembler than the
compiler, you could use instructions that GCC does not even know
about.  This is the purpose of inline-asm.  I think it was a bad
design decision on LLVM/clang's part that it would check the assembly
code up front.

Thanks,
Andrew Pinski


>
> Thanks.
>
> --
> Saleem Abdulrasool
> compnerd (at) compnerd (dot) org
>