[RFC] Introducing MIPS O32 ABI Extension for FR0 and FR1 Interlinking

2014-02-21 Thread Matthew Fortune
All,

Imagination Technologies would like to introduce the design of an O32 ABI 
extension for MIPS to allow it to be used in conjunction with MIPS FPUs having 
64-bit floating-point registers. This is a wide-reaching design that involves 
changes to all components of the MIPS toolchain; it is being posted to GCC first 
and will progress on to other tools. This ABI extension is compatible with the 
existing O32 ABI definition and will not require the introduction of new build 
variants (multilibs).

The design document is relatively large and has been placed on the MIPS 
Compiler Team wiki to facilitate review:

http://dmz-portal.mips.com/wiki/MIPS_O32_ABI_-_FR0_and_FR1_Interlinking

The introductory paragraph is copied below:

---
MIPS ABIs have been adjusted many times as the architecture has evolved, 
resulting in the introduction of incompatible ABI variants. Current 
architectural changes lead us to review the state of the O32 ABI and evaluate 
whether the existing ABI can be made more compatible with current and future 
extensions.
The three primary reasons for extending the current O32 ABI are the 
introduction of the MSA ASE, the desire to exploit the FR=1 mode of current 
FPUs and the potential for future architectures to demand that floating point 
units run in the 'FR=1' mode. 
For the avoidance of doubt: 

* The FR=0 mode describes an FPU where we consider registers to be constructed 
of 32-bit parts and (depending on architecture revision) there are either 16 or 
32 single-precision registers and 16 double-precision registers. The 
double-precision registers exist at even indices and their upper half exists in 
the odd indices. 
* The FR=1 mode describes an FPU with 32 64-bit registers. All registers can be 
used for either single or double-precision data. 
* The MSA ASE requires the FR=1 mode 
---

All aspects of this design are open for discussion but, in particular, feedback 
and suggestions on the following areas are welcome:

* The mechanism by which we mark the mode requirements of binaries (ELF flags 
vs 'other')
* The mechanism by which mode requirements are conveyed from a program loader 
to a running program/dynamic linker.

Regards,
Matthew



GNU Tools Cauldron 2014 - We have reached capacity

2014-02-21 Thread Diego Novillo

An update to this year's Cauldron. We have almost reached
capacity. There are only a few slots left for registration.

If you still have not registered, please do it quickly. As soon
as we fill up, I will start a waiting list. Priority will be
given to those proposing a presentation or BoF.

If you have already registered and decide to change plans, please
let us know so we can give your slot to someone else.


Thanks. Diego.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Michael Matz
Hi,

On Thu, 20 Feb 2014, Linus Torvalds wrote:

> But I'm pretty sure that any compiler guy must *hate* that current odd
> dependency-generation part, and if I was a gcc person, seeing that
> bugzilla entry Torvald pointed at, I would personally want to
> dismember somebody with a rusty spoon..

Yes.  Defining dependency chains in the way the standard currently seems 
to do must come from people not writing compilers.  There's simply no 
sensible way to implement it without being really conservative, because 
the depchains can contain arbitrary constructs including stores, 
loads and function calls but must still be observed.  

And with conservative I mean "everything is a source of a dependency, and 
hence can't be removed, reordered or otherwise fiddled with", and that 
includes code sequences where no atomic objects are anywhere in sight [1].
In the light of that the only realistic way (meaning to not have to 
disable optimization everywhere) to implement consume as currently 
specified is to map it to acquire.  At which point it becomes pointless.

> So I suspect there are a number of people who would be *more* than
> happy with a change to those odd dependency rules.

I can't say much about your actual discussion related to semantics of 
atomics, not my turf.  But the "carries a dependency" relation is not 
usefully implementable.


Ciao,
Michael.
[1] Simple example of what type of transformations would be disallowed:

int getzero (int i) { return i - i; }

Should be optimizable to "return 0;", right?  Not with carries a 
dependency in place:

int jeez (int idx) {
  int i = atomic_load(idx, memory_order_consume); // A
  int j = getzero (i);// B
  return array[j];// C
}

As I read "carries a dependency" there's a dependency from A to C. 
Now suppose we would optimize getzero in the obvious way, then inline, and 
boom, dependency gone.  So we wouldn't be able to optimize any function 
when we don't control all its users, for fear that it _might_ be used in 
some dependency chain where it then matters that we possibly removed some 
chain elements due to the transformation.  We would have to retain 'i-i' 
before inlining, and if the function then is inlined into a context where 
depchains don't matter, could _then_ optimize it to zero.  But that's 
insane, especially considering that it's hard to detect if a given context 
doesn't care for depchains, after all the depchain relation is constructed 
exactly so that it bleeds into nearly everywhere.  So we would most of 
the time have to assume that the ultimate context will be depchain-aware 
and therefore disable many transformations.

There'd be one solution to the above, we would have to invent some special 
operands and markers that explicitly model "carries-a-dep", ala this:

int getzero (int i) {
  #RETURN.dep = i.dep
  return 0;
}

int jeez (int idx) {
  # i.dep = idx.dep
  int i = atomic_load(idx, memory_order_consume); // A
  # j.dep = i.dep
  int j = getzero (i);// B
  # RETURN.dep = j.dep + array.dep
  return array[j];// C
}

Then inlining getzero would merely add another "# j.dep = i.dep" relation, 
so depchains are still there but the value optimization can happen before 
inlining.  Having to do something like that I'd find disgusting, and 
rather rewrite consume into acquire :)  Or make the depchain relation 
somehow realistically implementable.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
> Hi,
> 
> On Thu, 20 Feb 2014, Linus Torvalds wrote:
> 
> > But I'm pretty sure that any compiler guy must *hate* that current odd
> > dependency-generation part, and if I was a gcc person, seeing that
> > bugzilla entry Torvald pointed at, I would personally want to
> > dismember somebody with a rusty spoon..
> 
> Yes.  Defining dependency chains in the way the standard currently seems 
> to do must come from people not writing compilers.  There's simply no 
> sensible way to implement it without being really conservative, because 
> the depchains can contain arbitrary constructs including stores, 
> loads and function calls but must still be observed.  
> 
> And with conservative I mean "everything is a source of a dependency, and 
> hence can't be removed, reordered or otherwise fiddled with", and that 
> includes code sequences where no atomic objects are anywhere in sight [1].
> In the light of that the only realistic way (meaning to not have to 
> disable optimization everywhere) to implement consume as currently 
> specified is to map it to acquire.  At which point it becomes pointless.

No, only memory_order_consume loads and [[carries_dependency]]
function arguments are sources of dependency chains.

> > So I suspect there are a number of people who would be *more* than
> > happy with a change to those odd dependency rules.
> 
> I can't say much about your actual discussion related to semantics of 
> atomics, not my turf.  But the "carries a dependency" relation is not 
> usefully implementable.
> 
> 
> Ciao,
> Michael.
> [1] Simple example of what type of transformations would be disallowed:
> 
> int getzero (int i) { return i - i; }

This needs to be as follows:

[[carries_dependency]] int getzero(int i [[carries_dependency]])
{
return i - i;
}

Otherwise dependencies won't get carried through it.

> Should be optimizable to "return 0;", right?  Not with carries a 
> dependency in place:
> 
> int jeez (int idx) {
>   int i = atomic_load(idx, memory_order_consume); // A
>   int j = getzero (i);// B
>   return array[j];// C
> }
> 
> As I read "carries a dependency" there's a dependency from A to C. 
> Now suppose we would optimize getzero in the obvious way, then inline, and 
> boom, dependency gone.  So we wouldn't be able to optimize any function 
> when we don't control all its users, for fear that it _might_ be used in 
> some dependency chain where it then matters that we possibly removed some 
> chain elements due to the transformation.  We would have to retain 'i-i' 
> before inlining, and if the function then is inlined into a context where 
> depchains don't matter, could _then_ optimize it to zero.  But that's 
> insane, especially considering that it's hard to detect if a given context 
> doesn't care for depchains, after all the depchain relation is constructed 
> exactly so that it bleeds into nearly everywhere.  So we would most of 
> the time have to assume that the ultimate context will be depchain-aware 
> and therefore disable many transformations.

Any function that does not contain a memory_order_consume load and that
doesn't have any arguments marked [[carries_dependency]] can be optimized
just as before.

> There'd be one solution to the above, we would have to invent some special 
> operands and markers that explicitly model "carries-a-dep", ala this:
> 
> int getzero (int i) {
>   #RETURN.dep = i.dep
>   return 0;
> }

The above is already handled by the [[carries_dependency]] attribute,
see above.

> int jeez (int idx) {
>   # i.dep = idx.dep
>   int i = atomic_load(idx, memory_order_consume); // A
>   # j.dep = i.dep
>   int j = getzero (i);// B
>   # RETURN.dep = j.dep + array.dep
>   return array[j];// C
> }
> 
> Then inlining getzero would merely add another "# j.dep = i.dep" relation, 
> so depchains are still there but the value optimization can happen before 
> inlining.  Having to do something like that I'd find disgusting, and 
> rather rewrite consume into acquire :)  Or make the depchain relation 
> somehow realistically implementable.

I was actually OK with arithmetic cancellation breaking the dependency
chains.  Others on the committee felt otherwise, and I figured that
(1) I wouldn't be writing that kind of function anyway and (2) they
knew more about writing compilers than I.  I would still be OK saying
that things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just
break the dependency chain.
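
For comparison, C11's <stdatomic.h> already lets the programmer end a
dependency chain explicitly via the kill_dependency() macro.  A minimal
sketch, reusing the array/idx names from the jeez() example above (the
extern declarations are just for illustration):

#include <stdatomic.h>

extern int array[];
extern _Atomic int idx;

int read_entry(void)
{
    /* The consume load heads a dependency chain... */
    int i = atomic_load_explicit(&idx, memory_order_consume);
    /* ...and kill_dependency() explicitly ends it, so ordering from the
     * load no longer has to be preserved through j. */
    int j = kill_dependency(i);
    return array[j];
}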

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 10:25 AM, Peter Sewell
 wrote:
>
> If one thinks this is too fragile, then simply using memory_order_acquire
> and paying the resulting barrier cost (and perhaps hoping that compilers
> will eventually be able to optimise some cases of those barriers to
> hardware-level dependencies) is the obvious alternative.

No, the obvious alternative is to do what we do already, and just do
it by hand. Using acquire is *worse* than what we have now.

Maybe for some other users, the thing falls out differently.

> Many concurrent things will "accidentally" work on x86 - consume is not
> special in that respect.

No. But if you have something that is mis-designed, easy to get wrong,
and untestable, any sane programmer will go "that's bad".

> There are two difficulties with this, if I understand correctly what you're
> proposing.
>
> The first is knowing where to stop.

No.

Read my suggestion. Knowing where to stop is *trivial*.

Either the dependency is immediate and obvious, or you treat it like an acquire.

Seriously. Any compiler that doesn't turn the dependency chain into
SSA or something pretty much equivalent is pretty much a joke. Agreed?

So we can pretty much assume that the compiler will have some
intermediate representation as part of optimization that is basically
SSA.

So what you do is,

 - build the SSA by doing all the normal parsing and possible
tree-level optimizations you already do even before getting to the SSA
stage

 - do all the normal optimizations/simplifications/cse/etc that you do
normally on SSA

 - add *one* new rule to your SSA simplification that goes something like this:

   * when you see a load op that is marked with a "consume" barrier,
just follow the usage chain that comes from that.
   * if you hit a normal arithmetic op, just follow the result chain of that
   * if you hit a memory operation address use, stop and say "looks good"
   * if you hit anything else (including a copy/phi/whatever), abort
   * if nothing aborted as part of the walk, you can now just remove
the "consume" barrier.

You can fancy it up and try to follow more cases, but realistically
the only case that really matters is the "consume" being fed directly
into one or more loads, with possibly an offset calculation in
between. There are certainly more cases you could *try* to remove the
barrier, but the thing is, it's never incorrect to not remove it, so
any time you get bored or hit any complication at all, just do the
"abort" part.

I *guarantee* that if you describe this to a compiler writer, he will
tell you that my scheme is about a billion times simpler than the
current standard wording. Especially after you've pointed him to that
gcc bugzilla entry and explained to him about how the current standard
cares about those kinds of made-up syntactic chains that he likely
removed quite early, possibly even as he was generating the semantic
tree.

Try it. I dare you. So if you want to talk about "difficulties", the
current C standard loses.

> The second is the proposal in later mails to use some notion of  "semantic"
> dependency instead of this syntactic one.

Bah.

The C standard does that all over. It's called "as-if". The C standard
talks about how the compiler can do pretty much whatever it likes, as
long as the end result acts the same in the virtual C machine.

So claiming that "semantics" being meaningful is somehow complex is
bogus. People do that all the time. If you make it clear that the
dependency chain is through the *value*, not syntax, and that the
value can be optimized all the usual ways, it's quite clear what the
end result is. Any operation that actually meaningfully uses the value
is serialized with the load, and if there is no meaningful use that
would affect the end result in the virtual machine, then there is no
barrier.

Why would this be any different, especially since it's easy to
understand both for a human and a compiler?

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 06:28:05PM +, Peter Sewell wrote:
> On 20 February 2014 17:01, Linus Torvalds wrote:

[ . . . ]

> > > So, if you make one of two changes to your example, then I will agree
> > > with you.
> >
> > No. We're not playing games here. I'm fed up with complex examples
> > that make no sense.
> >
> > Nobody sane writes code that does that pointer comparison, and it is
> > entirely immaterial what the compiler can do behind our backs. The C
> > standard semantics need to make sense to the *user* (ie programmer),
> > not to a CPU and not to a compiler. The CPU and compiler are "tools".
> > They don't matter. Their only job is to make the code *work*, dammit.
> >
> > So no idiotic made-up examples that involve code that nobody will ever
> > write and that have subtle issues.
> >
> > So the starting point is that (same example as before, but with even
> > clearer naming):
> >
> > Initialization state:
> >   initialized = 0;
> >   value = 0;
> >
> > Consumer:
> >
> > return atomic_read(&initialized, consume) ? value : -1;
> >
> > Writer:
> > value = 42;
> > atomic_write(&initialized, 1, release);
> >
> > and because the C memory ordering standard is written in such a way
> > that this is subtly buggy (and can return 0, which is *not* logically
> > a valid value), then I think the C memory ordering standard is broken.
> >
> > That "consumer" memory ordering is dangerous as hell, and it is
> > dangerous FOR NO GOOD REASON.
> >
> > The trivial "fix" to the standard would be to get rid of all the
> > "carries a dependency" crap, and just say that *anything* that depends
> > on it is ordered wrt it.
> >
> 
> There are two difficulties with this, if I understand correctly what
> you're proposing.
> 
> The first is knowing where to stop.  If one includes all data and
> control dependencies, pretty much all the continuation of execution
> would depend on the consume read, so the compiler would eventually
> have to give up and insert a gratuitous barrier.  One might imagine
> somehow annotating the return_expensive_system_value() you have below
> to say "stop dependency tracking at the return" (thereby perhaps
> enabling the compiler to optimise the barrier that you'd need in h/w
> to order the Linus-consume-read of initialised and the non-atomic read
> of calculated, replacing it by a compiler-introduced artificial
> dependency), and indeed that's roughly what the standard's
> kill_dependency does for consumes.

One way to tell the compiler where to stop would be to place markers
in the source code saying where dependencies stop.  These markers
could be provided by the definitions of the current rcu_read_unlock()
tags in the Linux kernel (and elsewhere, for that matter).  These would
be overridden by [[carries_dependency]] tags on function arguments and
return values, which is needed to handle the possibility of nested RCU
read-side critical sections.
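
For concreteness, a kernel-flavored sketch of the pattern being discussed;
rcu_read_lock(), rcu_read_unlock(), rcu_dereference() and
rcu_assign_pointer() are the real Linux primitives, while struct foo, gp
and reader() are made-up names:

#include <linux/rcupdate.h>

struct foo { int value; };
static struct foo __rcu *gp;	/* writers publish with rcu_assign_pointer() */

static int reader(void)
{
	struct foo *p;
	int v = -1;

	rcu_read_lock();
	/* rcu_dereference() is where the dependency chain starts; the address
	 * dependency from p to p->value is what orders the two loads. */
	p = rcu_dereference(gp);
	if (p)
		v = p->value;
	rcu_read_unlock();	/* the "dependencies stop here" marker discussed above */
	return v;
}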

> The second is the proposal in later mails to use some notion of
> "semantic" dependency instead of this syntactic one.  That's maybe
> attractive at first sight, but rather hard to define in a decent way
> in general.  To define whether the consume load can "really" affect
> some subsequent value, you need to know about all the set of possible
> executions of the program - which is exactly what we have to define.
> 
> For syntactic dependencies, in contrast, you can at least tell whether
> they exist by examining the source code you have in front of you.  The
> fact that artificial dependencies like (&x + y-y) are significant is
> (I guess) basically incidental at the C level - sometimes things like
> this are the best idiom to enforce ordering at the assembly level, but
> in C I imagine they won't normally arise.  If they do, it might be
> nicer to have a more informative syntax, eg (&x + dependency(y)).

This was in fact one of the arguments put forward in favor of carrying
dependencies through things like "y-y" back in the 2007-8 timeframe.

Can't say that I am much of a fan of manually tagging all dependencies:
Machines are much better at that sort of thing than are humans.
But just out of curiosity, did you instead mean (&x + dependency(y-y))
or some such?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
 wrote:
>
> Why would this be any different, especially since it's easy to
> understand both for a human and a compiler?

Btw, the actual data path may actually be semantically meaningful even
at a processor level.

For example, let's look at that gcc bugzilla that got mentioned
earlier, and let's assume that gcc is fixed to follow the "arithmetic
is always meaningful, even if it is only syntactic" to the letter.
So we have that gcc bugzilla use-case:

   flag ? *(q + flag - flag) : 0;

and let's say that the fixed compiler now generates the code with the
data dependency that is actually suggested in that bugzilla entry:

and w2, w2, #0
ldr w0, [x1, w2]

ie the CPU actually sees that address data dependency. Now everything
is fine, right?

Wrong.

It is actually quite possible that the CPU sees the "and with zero"
and *breaks the dependencies on the incoming value*.

Modern CPU's literally do things like that. Seriously. Maybe not that
particular one, but you'll sometimes find that the CPU - in the 
instruction decoding phase (ie very early in the pipeline) notices
certain patterns that generate constants, and actually drop the data
dependency on the "incoming" registers.

On x86, generating zero using "xor" on the register with itself is one
such known sequence.

Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
Or what if the compiler generated the much more obvious

sub w2,w2,w2

for that "+flag-flag"? Are you really 100% sure that the CPU won't
notice that that is just a way to generate a zero, and doesn't depend
on the incoming values?

Because I'm not. I know CPU designers that do exactly this.

So I would actually and seriously argue that the whole C standard
attempt to use a syntactic data dependency as a determination of
whether two things are serialized is wrong, and that you actually
*want* to have the compiler optimize away false data dependencies.

Because people playing tricks with "+flag-flag" and thinking that that
somehow generates a data dependency - that's *wrong*. It's not just
the compiler that decides "that's obviously nonsense, I'll optimize it
away". The CPU itself can do it.

So my "actual semantic dependency" model is seriously more likely to
be *correct*. Not just at a compiler level.

Btw, any tricks like that, I would also take a second look at the
assembler and the linker. Many assemblers do some trivial
optimizations too. Are you sure that "and w2, w2, #0" really ends
up being encoded as an "and"? Maybe the assembler says "I can do that
as a "mov w2,#0" instead? Who knows? Even power and ARM have their
variable-sized encodings (there are some "compressed executable"
embedded power processors, and there is obviously Thumb2, and many
assemblers end up trying to use equivalent "small" instructions).

So the whole "fake data dependency" thing is just dangerous on so many levels.

MUCH more dangerous than my "actual real dependency" model.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Peter Sewell
On 21 February 2014 19:41, Linus Torvalds  wrote:
> On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
>  wrote:
>>
>> Why would this be any different, especially since it's easy to
>> understand both for a human and a compiler?
>
> Btw, the actual data path may actually be semantically meaningful even
> at a processor level.
>
> For example, let's look at that gcc bugzilla that got mentioned
> earlier, and let's assume that gcc is fixed to follow the "arithmetic
> is always meaningful, even if it is only syntactic" to the letter.
> So we have that gcc bugzilla use-case:
>
>flag ? *(q + flag - flag) : 0;
>
> and let's say that the fixed compiler now generates the code with the
> data dependency that is actually suggested in that bugzilla entry:
>
> and w2, w2, #0
> ldr w0, [x1, w2]
>
> ie the CPU actually sees that address data dependency. Now everything
> is fine, right?
>
> Wrong.
>
> It is actually quite possible that the CPU sees the "and with zero"
> and *breaks the dependencies on the incoming value*.

For reference: the Power and ARM architectures explicitly guarantee
not to do this, the architects are quite clear about it, and we've
tested (some cases) rather thoroughly.
I can't speak about other architectures.

> Modern CPU's literally do things like that. Seriously. Maybe not that
> particular one, but you'll sometimes find that the CPU - in the
> instruction decoding phase (ie very early in the pipeline) notices
> certain patterns that generate constants, and actually drop the data
> dependency on the "incoming" registers.
>
> On x86, generating zero using "xor" on the register with itself is one
> such known sequence.
>
> Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
> Or what if the compiler generated the much more obvious
>
> sub w2,w2,w2
>
> for that "+flag-flag"? Are you really 100% sure that the CPU won't
> notice that that is just a way to generate a zero, and doesn't depend
> on the incoming values?
>
> Because I'm not. I know CPU designers that do exactly this.
>
> So I would actually and seriously argue that the whole C standard
> attempt to use a syntactic data dependency as a determination of
> whether two things are serialized is wrong, and that you actually
> *want* to have the compiler optimize away false data dependencies.
>
> Because people playing tricks with "+flag-flag" and thinking that that
> somehow generates a data dependency - that's *wrong*. It's not just
> the compiler that decides "that's obviously nonsense, I'll optimize it
> away". The CPU itself can do it.
>
> So my "actual semantic dependency" model is seriously more likely to
> be *correct*. Not just at a compiler level.
>
> Btw, any tricks like that, I would also take a second look at the
> assembler and the linker. Many assemblers do some trivial
> optimizations too.

That's certainly something worth checking.

> Are you sure that "and w2, w2, #0" really ends
> up being encoded as an "and"? Maybe the assembler says "I can do that
> as a "mov w2,#0" instead? Who knows? Even power and ARM have their
> variable-sized encodings (there are some "compressed executable"
> embedded power processors, and there is obviously Thumb2, and many
> assemblers end up trying to use equivalent "small" instructions).
>
> So the whole "fake data dependency" thing is just dangerous on so many levels.
>
> MUCH more dangerous than my "actual real dependency" model.
>
>Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 11:43 AM, Peter Sewell
 wrote:
>
> You have to track dependencies through other assignments, e.g. simple x=y

That is all visible in the SSA form. Variable assignment has been
converted to some use of the SSA node that generated the value. The
use might be a phi node or a cast op, or maybe it's just a memory
store op, but the whole point of SSA is that there is one single node
that creates the data (in this case that would be the "load" op with
the associated consume barrier - that barrier might be part of the
load op itself, or it might be implemented as a separate SSA node that
consumes the result of the load that generates a new pseudo), and the
uses of the result are all visible from that.

And yes, there might be a lot of users. But any complex case you just
punt on - and the difference here is that since "punt" means "leave
the barrier in place", it's never a correctness issue.

So yeah, it could be somewhat expensive, although you can always bound
that expense by just punting. But the dependencies in SSA form are no
more complex than the dependencies the C standard talks about now, and
in SSA form they are at least really easy to follow.  So if they are
complex and expensive in SSA form, I'd expect them to be *worse* in
the current "depends-on" syntax form.

 Linus


Accelerator BoF at GNU Tools Cauldron 2014 (was: [gomp4] gomp-4_0-branch)

2014-02-21 Thread Thomas Schwinge
Hi!

On Fri, 24 Jan 2014 18:24:56 +0100, I wrote:
> First, pardon the long CC list.  You are, in my understanding, the people
> who are interested in collaborating on the topics that are being prepared
> on gomp-4_0-branch: "LTO" streaming, acceleration device offloading,
> OpenMP target, OpenACC, nvptx backend -- and more?

For the very same reason writing to you again: yesterday, when I
registered for the GNU Tools Cauldron 2014,
, I proposed to Diego (and he
confirmed) that we should have another Accelerator BoF, following up to
the one we had last year, suggested and run by Torvald Riegel,
,
.

Unless someone has convincing arguments to do otherwise, I propose we
again do this in the format of a BoF: interaction, discussion instead of
presentation.  For a preliminary agenda, Torvald's posting from last year
still seems usable.  ;-)


Grüße,
 Thomas




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Joseph S. Myers
On Fri, 21 Feb 2014, Paul E. McKenney wrote:

> This needs to be as follows:
> 
> [[carries_dependency]] int getzero(int i [[carries_dependency]])
> {
>   return i - i;
> }
> 
> Otherwise dependencies won't get carried through it.

C11 doesn't have attributes at all (and no specification regarding calls 
and dependencies that I can see).  And the way I read the C++11 
specification of carries_dependency is that specifying carries_dependency 
is purely about increasing optimization of the caller: that if it isn't 
specified, then the caller doesn't know what dependencies might be 
carried.  "Note: The carries_dependency attribute does not change the 
meaning of the program, but may result in generation of more efficient 
code. - end note".

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 10:10:54PM +, Joseph S. Myers wrote:
> On Fri, 21 Feb 2014, Paul E. McKenney wrote:
> 
> > This needs to be as follows:
> > 
> > [[carries_dependency]] int getzero(int i [[carries_dependency]])
> > {
> > return i - i;
> > }
> > 
> > Otherwise dependencies won't get carried through it.
> 
> C11 doesn't have attributes at all (and no specification regarding calls 
> and dependencies that I can see).  And the way I read the C++11 
> specification of carries_dependency is that specifying carries_dependency 
> is purely about increasing optimization of the caller: that if it isn't 
> specified, then the caller doesn't know what dependencies might be 
> carried.  "Note: The carries_dependency attribute does not change the 
> meaning of the program, but may result in generation of more efficient 
> code. - end note".

Good point -- I am so used to them being in gcc that I missed that.

In which case, it seems to me that straight C11 is within its rights
to emit a memory barrier just before control passes into a function
that either it can't see or that it chose to apply dependency-breaking
optimizations to.

Thanx, Paul



gcc generated long read out of bounds segfault

2014-02-21 Thread David Fries
I was using valgrind and found an out of bounds error reading 8
bytes off an array of 3 byte data structures where the extra 5 bytes
being read were out of the array bounds.  I attached a program that
ends the array at the end of a page so reading beyond the end of the
array would cause a crash, and it does: x86-64 crashes, x86-32
doesn't.

Before I file a bug report I wanted to check to see if my expectations
are wrong or if this is a compiler bug.  Is there anything that allows
the compiler to generate instructions that would read beyond the end
of an array potentially causing a crash if the page isn't accessible?
Or is this program somehow invalid?

tested gcc and g++ 4.7.2
and from svn, gcc (GCC) 4.9.0 20140221 (experimental)

While both lines read an array entry, only the second crashes.
dup = c[i];
fun(c[i]);

The attached program sets up and reads through the array with extra
padding at the end of the array from 8 bytes to 0 bytes.  Padding from 4
to 0 crashes.  There are some #defines to make it easy to use malloc
vs mmap, use munmap or mprotect to make the next page not accessible.

-- 
David Fries PGP pub CB1EE8F0
http://fries.net/~david/
//#define __USE_MISC
//#define __USE_GNU
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

typedef struct
{
	char r, g, b;
	/*
	char a;
	char w;
	*/
} rgb;

static char rtn;

char fun(rgb r)
{
	rtn = r.r/2 + r.g/3 + r.b/3;
	return rtn;
}

//#define USE_MALLOC
#define SPECIFY_ADDRESS
//#define MPROTECT

/* g++ 4.7.2 generates an 8 byte read for rgb in fun(c[i]); which for the
 * last 2 pixels (3 bytes each), is out of the array bounds.
 * valgrind will flag the problem in all cases.  
 * just mmap runs without crashing, /proc/pid/maps shows there happens to
 * be an adjacent mapping and additional mmaps expands down
 * with SPECIFY_ADDRESS, it maps, unmaps, and maps one page down to leave
 * an whole, which results in a crash
 * with MPROTECT out of bounds is protected and it crashes
 * Linux 3.2.0 x86-64
 * libc6 2.13
 * g++ 4.7.2 and 4.9.0 20140221 (experimental)
 *
 * movq(%rax), %rdi
 * call_Z3fun3rgb
 */
char test(int buffer)
{
	const int page = sysconf(_SC_PAGE_SIZE);
	const int count = 3216360;
	printf("buffer %d\n", buffer);
	#ifdef USE_MALLOC
	rgb *c = (rgb*)malloc(sizeof(rgb) * count + buffer);
	#else
	// 3 extra pages worth, 1 rounding, 1 mprotect, 1 align
	size_t size = (sizeof(rgb)*count + buffer + 3*page)/page * page;
	char *p = (char*)mmap(NULL, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	#ifdef SPECIFY_ADDRESS
	char *p1 = p;
	munmap(p1, size);
	p = (char*)mmap(p1 - page, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	#endif

	char *none = NULL;
	#ifdef MPROTECT
	none = p + size - page;
	buffer += page;
	if(mprotect(none, page, PROT_NONE) == -1)
		perror("mprotect failed");
	#endif
	rgb *c = (rgb*)(p + size - sizeof(rgb)*count - buffer);

	printf("mmap size %llu, start %p end %p rgb * %p protect at %p\n",
		(unsigned long long)size, p, p+size, c, none);
	#endif
	memset(c, 0, sizeof(rgb)*count);
	rgb dup;
	int i;
	for(i=0; i<count; ++i)
	{
		dup = c[i];
		fun(c[i]);
	}
	return rtn;
}

int main()
{
	int i;
	for(i=8; i>=0; --i)
		test(i);

	return rtn;
}


Re: gcc generated long read out of bounds segfault

2014-02-21 Thread Andreas Schwab
David Fries  writes:

> The attached program sets up and reads through the array with extra
> padding at the of the array from 8 bytes to 0 bytes.  Padding from 4
> to 0 crashes.

This program has undefined behaviour because you are using unaligned
pointers.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."