Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 23:48 +, Peter Sewell wrote:
> On 18 February 2014 20:43, Torvald Riegel  wrote:
> > On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies.  In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >
> > [...]
> >
> >> 2) otherwise, the language definition should prohibit it but the
> >>compiler would have to preserve dependencies even in compilation
> >>units that have no mention of atomics.  It's unclear what the
> >>(runtime and compiler development) cost of that would be in
> >>practice - perhaps Torvald could comment?
> >
> > If I'm reading the standard correctly, it requires that data
> > dependencies are preserved through loads and stores, including nonatomic
> > ones.  That sounds convenient because it allows programmers to use
> > temporary storage.
> 
> The standard only needs this for consume chains,

That's right, and the runtime cost / implementation problems of
mo_consume was what I was making statements about.  Sorry if that wasn't
clear.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:52 +0100, Peter Zijlstra wrote:
> > > 4. Some drivers allow user-mode code to mmap() some of their
> > >   state.  Any changes undertaken by the user-mode code would
> > >   be invisible to the compiler.
> > 
> > A good point, but a compiler that doesn't try to (incorrectly) assume
> > something about the semantics of mmap will simply see that the mmap'ed
> > data will escape to stuff it can't analyze, so it will not be able to
> > make a proof.
> > 
> > This is different from, for example, malloc(), which is guaranteed to
> > return "fresh" nonaliasing memory.
> 
> The kernel side of this is different.. it looks like 'normal' memory, we
> just happen to allow it to end up in userspace too.
> 
> But on that point; how do you tell the compiler the difference between
> malloc() and mmap()? Is that some function attribute?

Yes:

malloc
The malloc attribute is used to tell the compiler that a
function may be treated as if any non-NULL pointer it returns
cannot alias any other pointer valid when the function returns
and that the memory has undefined content. This often improves
optimization. Standard functions with this property include
malloc and calloc. realloc-like functions do not have this
property as the memory pointed to does not have undefined
content.

I'm not quite sure whether GCC assumes malloc() to be indeed C's malloc
even if the function attribute isn't used, and/or whether that is
different for freestanding environments.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > 
> > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  
> > > > wrote:
> > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > >> And exactly because I know enough, I would *really* like atomics to 
> > > > >> be
> > > > >> well-defined, and have very clear - and *local* - rules about how 
> > > > >> they
> > > > >> can be combined and optimized.
> > > > >
> > > > > "Local"?
> > > > 
> > > > Yes.
> > > > 
> > > > So I think that one of the big advantages of atomics over volatile is
> > > > that they *can* be optimized, and as such I'm not at all against
> > > > trying to generate much better code than for volatile accesses.
> > > > 
> > > > But at the same time, that can go too far. For example, one of the
> > > > things we'd want to use atomics for is page table accesses, where it
> > > > is very important that we don't generate multiple accesses to the
> > > > values, because parts of the values can be changed *by hardware* (ie
> > > > accessed and dirty bits).
> > > > 
> > > > So imagine that you have some clever global optimizer that sees that
> > > > the program never ever actually sets the dirty bit at all in any
> > > > thread, and then uses that kind of non-local knowledge to make
> > > > optimization decisions. THAT WOULD BE BAD.
> > > 
> > > Might as well list other reasons why value proofs via whole-program
> > > analysis are unreliable for the Linux kernel:
> > > 
> > > 1. As Linus said, changes from hardware.
> > 
> > This is what volatile is for, right?  (Or the weak-volatile idea I
> > mentioned).
> > 
> > Compilers won't be able to prove something about the values of such
> > variables, if marked (weak-)volatile.
> 
> Yep.
> 
> > > 2. Assembly code that is not visible to the compiler.
> > >   Inline asms will -normally- let the compiler know what
> > >   memory they change, but some just use the "memory" tag.
> > >   Worse yet, I suspect that most compilers don't look all
> > >   that carefully at .S files.
> > > 
> > >   Any number of other programs contain assembly files.
> > 
> > Are the annotations of changed memory really a problem?  If the "memory"
> > tag exists, isn't that supposed to mean all memory?
> > 
> > To make a proof about a program for location X, the compiler has to
> > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > the compiler will simply not be able to prove a thing (except maybe due
> > to the data-race-free requirement for non-atomics).  The attempt to
> > prove something isn't unreliable, simply because a correct compiler
> > won't claim to be able to "prove" something.
> 
> I am indeed less worried about inline assembler than I am about files
> full of assembly.  Or files full of other languages.
> 
> > One thing that could corrupt this is if a program addresses objects
> > other than through the mechanisms defined in the language.  For example,
> > if one thread lays out a data structure at a constant fixed memory
> > address, and another one then uses the fixed memory address to get
> > access to the object with a cast (e.g., (void*)0x123).
> 
> Or if the program uses gcc linker scripts to get the same effect.
> 
> > > 3. Kernel modules that have not yet been written.  Now, the
> > >   compiler could refrain from trying to prove anything about
> > >   an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > >   is currently no way to communicate this information to the
> > >   compiler other than marking the variable "volatile".
> > 
> > Even if the variable is just externally accessible, then the compiler
> > knows that it can't do whole-program analysis about it.
> > 
> > It is true that whole-program analysis will not be applicable in this
> > case, but it will not be unreliable.  I think that's an important
> > difference.
> 
> Let me make sure that I understand what you are saying.  If my program has
> "extern int foo;", the compiler will refrain from doing whole-program
> analysis involving "foo"?

Yes.  If it can't be sure to actually have the whole program available,
it can't do whole-program analysis, right?  Things like the linker
scripts you mention or other stuff outside of the language semantics
complicate this somewhat, and maybe some compilers assume too much.
There's also the point that data-race-freedom is required for
non-atomics even if those are shared with non-C-code.

But except for those corner cases, a compiler sees whether something
escapes and becomes visible/accessible to other entities.

> Or to ask it another way, when you say
> "whole-program analysis", are you restricting that

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 22:47 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> Its not only hardware; also the kernel/user boundary has this same
> problem. We cannot say a priori what userspace will do; in fact, because
> we're a general purpose OS, we must assume it will willfully try its
> bestest to wreck whatever assumptions we make about its behaviour.

That's a good note, and I think a distinct case from those below,
because here you're saying that you can't assume that the userspace code
follows the C11 semantics ...

> We also have loadable modules -- much like regular userspace DSOs -- so
> there too we cannot say what will or will not happen.
> 
> We also have JITs that generate code on demand.

... whereas for those, you might assume that the other code follows C11
semantics and the same ABI, which makes this just a normal case that is
already handled (see my other replies nearby in this thread).

> And I'm absolutely sure (with the exception of the JITs, its not an area
> I've worked on) that we have atomic usage across all those boundaries.

That would be fine as long as all involved parties use the same memory
model and ABI to implement it.

(Of course, I'm assuming here that the compiler is aware of sharing with
other entities, which is always the case except in corner cases like
accesses to (void*)0x123 magically aliasing with something else).



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Peter Zijlstra
On Wed, Feb 19, 2014 at 12:07:02PM +0100, Torvald Riegel wrote:
> > Its not only hardware; also the kernel/user boundary has this same
> > problem. We cannot say a priori what userspace will do; in fact, because
> > we're a general purpose OS, we must assume it will willfully try its
> > bestest to wreck whatever assumptions we make about its behaviour.
> 
> That's a good note, and I think a distinct case from those below,
> because here you're saying that you can't assume that the userspace code
> follows the C11 semantics ...

Right; we can malfunction in those cases though; as long as the
malfunctioning happens on the userspace side. That is, whatever
userspace does should not cause the kernel to crash, but userspace
crashing itself, or getting crap data or whatever is its own damn fault
for not following expected behaviour.

To stay on topic; if the kernel/user interface requires memory ordering
and userspace explicitly omits the barriers, all malfunctioning should be
on the user. For instance it might lose a forward-progress guarantee or
data-integrity guarantees.

In specific, given a kernel/user lockless producer/consumer buffer, if
the user side allows the tail write to happen before its data reads are
complete, the kernel might overwrite the data it's still reading.

Or in the case of futexes, if the user side doesn't use the appropriate
operations, its lock state gets corrupted, but only userspace should suffer.

But yes, this does require some care and consideration from our side.


Stack layout change during lra

2014-02-19 Thread Joey Ye
Vlad,

When fixing PR60169, I found that reload fails the assertion on
verify_initial_elim_offsets ():
  if (insns_need_reload != 0 || something_needs_elimination
  || something_needs_operands_changed)
{
  HOST_WIDE_INT old_frame_size = get_frame_size ();

  reload_as_needed (global);

  gcc_assert (old_frame_size == get_frame_size ());

  gcc_assert (verify_initial_elim_offsets ());
}

The reason is that stack layout changes during reload_as_needed as a result
of a thumb1 backend heuristic.

I have a patch to make sure the heuristic doesn't change stack layout during
and after reload, and the assertion failure disappeared. However, I'm not sure if it
will also be a problem in lra. Here is the question more specific:

Is there any chance during lra_in_progress that the stack layout can no
longer be altered, but insns can still be added or changed?

Thanks,
Joey







Re: ARM inline assembly usage in Linux kernel

2014-02-19 Thread Richard Sandiford
Andrew Pinski  writes:
> On Tue, Feb 18, 2014 at 6:56 PM, Saleem Abdulrasool
>  wrote:
>> Hello.
>>
>> I am sending this at the behest of Renato.  I have been working on the ARM
>> integrated assembler in LLVM and came across an interesting item in the Linux
>> kernel.
>>
>> I am wondering if this is an unstated covenant between the kernel and GCC or
>> simply a clever use of an unintended/undefined behaviour.
>>
>> The Linux kernel uses the *compiler* as a fancy preprocessor to generate a
>> specially crafted assembly file.  This file is then post-processed via sed to
>> generate a header containing constants which is shared across assembly and C
>> sources.
>>
>> In order to clarify the question, I am selecting a particular example and
>> pulling out the relevant bits of the source code below.
>>
>> #define DEFINE(sym, val) asm volatile("\n->" #sym " %0 " #val : : "i" (val))
>>
>> #define __NR_PAGEFLAGS 22
>>
>> void definitions(void) {
>>   DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
>> }
>>
>> This is then assembled to generate the following:
>>
>> ->NR_PAGEFLAGS #22 __NR_PAGEFLAGS
>>
>> This will later be post-processed to generate:
>>
>> #define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */
>>
>> Using the inline assembler to evaluate (constant) expressions into constant
>> values and then emitting them with a special identifier (->) is a fairly
>> clever trick.  This leads to my question: is this just use of an
>> unintentional "feature", or something that was worked out between the two
>> projects?

If the output is being post-processed by sed then maybe you could put
a comment character at the beginning of the line and sed it out?
But I tend to agree with Andrew that for -S output the compiler should
be prepared to accept asm strings that it can't parse, even if the integrated
assembler thinks it understands every instruction.
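For context, the post-processing step being discussed is roughly this (a simplified stand-in for the kernel's actual Kbuild sed script, not the real thing):

```shell
# Turn each "->NAME #VALUE COMMENT" line from the compiler's -S output
# into a C #define usable from both C and assembly sources.
printf '%s\n' '->NR_PAGEFLAGS #22 __NR_PAGEFLAGS' |
sed -n 's/^->\([A-Za-z0-9_]*\) #\([0-9]*\) \(.*\)/#define \1 \2 \/* \3 *\//p'
# -> #define NR_PAGEFLAGS 22 /* __NR_PAGEFLAGS */
```

The `->` prefix exists purely so sed can find the lines; anything the compiler emits that doesn't start with `->` is discarded, which is why Richard's suggestion of prefixing a comment character would also work.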

> I don't see why this is a bad use of the inline-asm.  GCC does not
> know and is not supposed to know what the string inside the inline-asm
> is going to be.  In fact if you have a newer assembler than the
> compiler, you could use instructions that GCC does not even know
> about.

Yeah, FWIW, I agree this is a valid use of inline asm.  The use of volatile
in a reachable part of definitions() means that the asm (and thus the
asm string) must be kept if definitions() is kept.

I doubt the idea was agreed with GCC developers because no GCC changes were
needed to use inline asm this way.

> This is the purpose of inline-asm.  I think it was a bad
> design decision on LLVM/clang's part that it would check the assembly
> code up front.

Being able to parse it is a useful feature.  E.g. it means you can get an
accurate byte length for the asm, which is something that we otherwise
have to guess by multiplying the number of lines by a constant factor.
(And that's wrong for MIPS assembly macros, unless you use a very
conservative constant factor.)

I agree that having an unrecognised asm shouldn't be a hard error until
assembly time though.  Saleem, is the problem that this is being rejected
earlier?

Thanks,
Richard



Re: TYPE_BINFO and canonical types at LTO

2014-02-19 Thread Richard Biener
On Tue, 18 Feb 2014, Jan Hubicka wrote:

> > > Non-ODR types born from other frontends will then need to be made to 
> > > alias all the ODR variants that can be done by storing them into the 
> > > current canonical type hash.
> > > (I wonder if we want to support cross language aliasing for non-POD?)
> > 
> > Surely for accessing components of non-POD types, no?  Like
> > 
> > class Foo {
> > Foo();
> > int *get_data ();
> > int *data;
> > } glob_foo;
> > 
> > extern "C" int *get_foo_data() { return glob_foo.get_data(); }
> 
> OK, if we want to support this, then we want to merge.
> What about types with vtbl pointer? :)

I can easily create a C struct variant covering that.  Basically
in _practice_ I can inter-operate with any language from C if I
know its ABI.  Do we really want to make this undefined?  See
the (even standard) Fortran - C interoperability spec.  I'm sure
something exists for Ada interoperating with C (or even C++).

> > ?  But you are talking about the "tree" merging part using ODR info
> > to also merge types which differ in completeness of contained
> > pointer types, right?  (exactly equal cases should be already merged)
> 
> Actually I was speaking of canonical types here. I want to preserve more 
> of TBAA via honoring ODR and local types.

So, are you positive there will be a net gain in optimization when
doing that?  Please factor in the surprises you'll get when code
gets "miscompiled" because of "slight" ODR violations or interoperability
that no longer works.

> I want to change lto to not 
> merge canonical types for pairs of types of same layout (i.e. equivalent 
> in the current canonical type definition) but with different mangled 
> names.

Names are nothing ;)  In C I very often see different _names_ used
in headers vs. implementation (when the implementation uses a different
internal header).  You have struct Foo; in public headers vs.
struct Foo_impl; in the implementation.

> I also want it to never merge when types are local. For 
> inter-language TBAA we will need to ensure aliasing in between non-ODR 
> type of same layout and all unmerged variants of ODR type.
>  Can it be 
> done by attaching chains of ODR types into the canonical type hash and 
> when non-ODR type appears, just make it alias with all of them?

No, how would that work?

> It would make sense to ODR merge in tree merging, too, but I am not sure if
> this fits the current design, since you would need to merge SCC components of
> different shape then that seems hard, right?

Right.  You'd lose the nice incremental SCC merging (where we haven't even
yet implemented the nicest way - avoid re-materializing the SCC until
we know it prevails).

> It may be easier to ODR merge after streaming (during DECL fixup) just to make
> WPA streaming cheaper and to reduce debug info size.  If you use
> -fdump-ipa-devirt, it will dump you ODR types that did not get merged (only
> ones with vtable pointers in them ATM) and there are quite long chains for
> firefox. Surely then hundreds of duplicated ODR types will end up in the 
> ltrans
> partition streams and they eventually hit debug output machinery.
> Eric sent me presentation about doing this in LLVM.
> http://llvm.org/devmtg/2013-11/slides/Christopher-DebugInfo.pdf

Debuginfo is sth completely separate and should be done separately
(early debug), avoiding to stream the types in the first place.

> > 
> > The canonical type computation happens separately (only for prevailing
> > types, of course), and there we already "merge" types which differ
> > in completeness.  Canonical type merging is conservative the other
> > way aroud - if we merge _all_ types to a single canonical type then
> > TBAA is still correct (we get a single alias set).
> 
> Yes, I think I understand that. One equivalence is kind of minimal so we merge
> only if we are sure there is no information loss, the other is maximal so we
> are sure that types that need to be equivalent by whatever underlying language
> TBAA rules are actually equivalent.

The former is just not correct - it would mean that not merging at all
would be valid, which it is not (you'd create wrong-code all over the 
place).

We still don't merge enough (because of latent bugs that I didn't manage
to fix in time) - thus we do not merge all structurally equivalent types
right now.

> > > I also think we want explicit representation of types known to be local 
> > > to compilation unit - anonymous namespaces in C/C++, types defined 
> > > within function bodies in C and god knows what in Ada/Fortran/Java.
> > 
> > But here you get into the idea of improving TBAA, thus having
> > _more_ distinct canonical types?
> 
> Yes.
> > 
> > Just to make sure to not mix those two ;)
> > 
> > And whatever "frontend knowledge" we want to excercise - please
> > make sure we get a reliable way for the middle-end to see
> > that "frontend knowledge" (no langhooks!).  Thus, make it
> > "middle-end knowledge".
> 
> Sure that is what I am proposin

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Tue, 2014-02-18 at 14:14 -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel  wrote:
> >>
> >> So imagine that you have some clever global optimizer that sees that
> >> the program never ever actually sets the dirty bit at all in any
> >> thread, and then uses that kind of non-local knowledge to make
> >> optimization decisions. THAT WOULD BE BAD.
> >>
> >> Do you see what I'm aiming for?
> >
> > Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> > boundaries of the abstract machine, and thus is input/output.  Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
> 
> The "hardware can change things" case is indeed pretty rare.
> 
> But quite frankly, even when it isn't hardware, as far as the compiler
> is concerned you have the exact same issue - you have TLB faults
> happening on other CPU's that do the same thing asynchronously using
> software TLB fault handlers. So *semantically*, it really doesn't make
> any difference what-so-ever if it's a software TLB handler on another
> CPU, a microcoded TLB fault, or an actual hardware path.

I think there are a few semantic differences:

* If a SW handler uses the C11 memory model, it will synchronize like
any other thread.  HW might do something else entirely, including
synchronizing differently, not using atomic accesses, etc.  (At least
that's the constraints I had in mind).

* If we can treat any interrupt handler like Just Another Thread, then
the next question is whether the compiler will be aware that there is
another thread.  I think that in practice it will be: You'll set up the
handler in some way by calling a function the compiler can't analyze, so
the compiler will know that stuff accessible to the handler (e.g.,
global variables) will potentially be accessed by other threads. 

* Similarly, if the C code is called from some external thing, it also
has to assume the presence of other threads.  (Perhaps this is what the
compiler has to assume in a freestanding implementation anyway...)

However, accessibility will be different for, say, stack variables that
haven't been shared with other functions yet; those are arguably not
reachable by other things, at least not through mechanisms defined by
the C standard.  So optimizing these should be possible with the
assumption that there is no other thread (at least as default -- I'm not
saying that this is the only reasonable semantics).

> So if the answer for all of the above is "use volatile", then I think
> that means that the C11 atomics are badly designed.
> 
> The whole *point* of atomic accesses is that stuff like above should
> "JustWork(tm)"

I think that it should in the majority of cases.  If the other thing
potentially accessing can do as much as a valid C11 thread can do, the
synchronization itself will work just fine.  In most cases except the
(void*)0x123 example (or linker scripts etc.), the compiler is aware when
data is made visible to other threads or other non-analyzable functions
that may spawn other threads (or when it is simply a plain global variable
accessible to other, potentially .S, translation units).

> > Do you perhaps want a weaker form of volatile?  That is, one that, for
> > example, allows combining of two adjacent loads of the dirty bits, but
> > will make sure that this is treated as if there is some imaginary
> > external thread that it cannot analyze and that may write?
> 
> Yes, that's basically what I would want. And it is what I would expect
> an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
> bit of a misnomer, because technically we really generally want
> "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
> volatile, and is a hell of a lot simpler to write ;^).
> 
> So we obviously use "volatile" for this currently, and generally the
> semantics we really want are:
> 
>  - the load or store is done as a single access ("atomic")
> 
>  - the compiler must not try to re-materialize the value by reloading
> it from memory (this is the "at most once" part)

In the presence of other threads performing operations unknown to the
compiler, that's what you should get even if the compiler is trying to
optimize C11 atomics.  The first requirement is clear, and the "at most
once" follows from another thread potentially writing to the variable.

The only difference I can see right now is that a compiler may be able
to *prove* that it doesn't matter whether it reloaded the value or not.
But this seems very hard to prove for me, and likely to require
whole-program analysis (which won't be possible because we don't know
what other threads are doing).  I would guess that this isn't a problem
in practice.  I just wanted to note it because it theoretically does
have a different semantics than plain volatiles.

> and quite frankly, "volatile" is a big hammer for this. In practice it
> tends 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread David Lang

On Tue, 18 Feb 2014, Torvald Riegel wrote:


On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:

On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:

Well, that's how atomics that aren't volatile are defined in the
standard.  I can see that you want something else too, but that doesn't
mean that the other thing is broken.


Well that other thing depends on being able to see the entire program at
compile time. PaulMck already listed various ways in which this is
not feasible even for normal userspace code.

In particular; DSOs and JITs were mentioned.


No it doesn't depend on whole-program analysis being possible.  Because
if it isn't, then a correct compiler will just not do certain
optimizations simply because it can't prove properties required for the
optimization to hold.  With the exception of access to objects via magic
numbers (e.g., fixed and known addresses (see my reply to Paul), which
are outside of the semantics specified in the standard), I don't see a
correctness problem here.


Are you really sure that the compiler can figure out every possible thing that a 
loadable module or JITed code can access? That seems like a pretty strong claim.


David Lang


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
> On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > > 
> > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  
> > > > > wrote:
> > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > > >> And exactly because I know enough, I would *really* like atomics 
> > > > > >> to be
> > > > > >> well-defined, and have very clear - and *local* - rules about how 
> > > > > >> they
> > > > > >> can be combined and optimized.
> > > > > >
> > > > > > "Local"?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > So I think that one of the big advantages of atomics over volatile is
> > > > > that they *can* be optimized, and as such I'm not at all against
> > > > > trying to generate much better code than for volatile accesses.
> > > > > 
> > > > > But at the same time, that can go too far. For example, one of the
> > > > > things we'd want to use atomics for is page table accesses, where it
> > > > > is very important that we don't generate multiple accesses to the
> > > > > values, because parts of the values can be changed *by hardware* (ie
> > > > > accessed and dirty bits).
> > > > > 
> > > > > So imagine that you have some clever global optimizer that sees that
> > > > > the program never ever actually sets the dirty bit at all in any
> > > > > thread, and then uses that kind of non-local knowledge to make
> > > > > optimization decisions. THAT WOULD BE BAD.
> > > > 
> > > > Might as well list other reasons why value proofs via whole-program
> > > > analysis are unreliable for the Linux kernel:
> > > > 
> > > > 1.  As Linus said, changes from hardware.
> > > 
> > > This is what volatile is for, right?  (Or the weak-volatile idea I
> > > mentioned).
> > > 
> > > Compilers won't be able to prove something about the values of such
> > > variables, if marked (weak-)volatile.
> > 
> > Yep.
> > 
> > > > 2.  Assembly code that is not visible to the compiler.
> > > > Inline asms will -normally- let the compiler know what
> > > > memory they change, but some just use the "memory" tag.
> > > > Worse yet, I suspect that most compilers don't look all
> > > > that carefully at .S files.
> > > > 
> > > > Any number of other programs contain assembly files.
> > > 
> > > Are the annotations of changed memory really a problem?  If the "memory"
> > > tag exists, isn't that supposed to mean all memory?
> > > 
> > > To make a proof about a program for location X, the compiler has to
> > > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > > the compiler will simply not be able to prove a thing (except maybe due
> > > to the data-race-free requirement for non-atomics).  The attempt to
> > > prove something isn't unreliable, simply because a correct compiler
> > > won't claim to be able to "prove" something.
> > 
> > I am indeed less worried about inline assembler than I am about files
> > full of assembly.  Or files full of other languages.
> > 
> > > One thing that could corrupt this is if a program addresses objects
> > > other than through the mechanisms defined in the language.  For example,
> > > if one thread lays out a data structure at a constant fixed memory
> > > address, and another one then uses the fixed memory address to get
> > > access to the object with a cast (e.g., (void*)0x123).
> > 
> > Or if the program uses gcc linker scripts to get the same effect.
> > 
> > > > 3.  Kernel modules that have not yet been written.  Now, the
> > > > compiler could refrain from trying to prove anything about
> > > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > > > is currently no way to communicate this information to the
> > > > compiler other than marking the variable "volatile".
> > > 
> > > Even if the variable is just externally accessible, then the compiler
> > > knows that it can't do whole-program analysis about it.
> > > 
> > > It is true that whole-program analysis will not be applicable in this
> > > case, but it will not be unreliable.  I think that's an important
> > > difference.
> > 
> > Let me make sure that I understand what you are saying.  If my program has
> > "extern int foo;", the compiler will refrain from doing whole-program
> > analysis involving "foo"?
> 
> Yes.  If it can't be sure to actually have the whole program available,
> it can't do whole-program analysis, right?  Things like the linker
> scripts you mention or other stuff outside of the language semantics
> complicates this somewhat, and maybe some compilers assume too much.
> There's also the point t

Re: Stack layout change during lra

2014-02-19 Thread Vladimir Makarov

On 2/19/2014, 6:54 AM, Joey Ye wrote:

Vlad,

When fixing PR60169, I found that reload fails this assertion on
verify_initial_elim_offsets ():

  if (insns_need_reload != 0 || something_needs_elimination
      || something_needs_operands_changed)
    {
      HOST_WIDE_INT old_frame_size = get_frame_size ();

      reload_as_needed (global);

      gcc_assert (old_frame_size == get_frame_size ());

      gcc_assert (verify_initial_elim_offsets ());
    }

The reason is that stack layout changes during reload_as_needed as a result
of a thumb1 backend heuristic.

I have a patch to make sure the heuristic doesn't change the stack layout
during and after reload, and the assertion failure disappeared. However, I'm
not sure whether it will also be a problem in LRA. Here is the question more
specifically:

Is there any chance during lra_in_progress that the stack layout can no longer
be altered, but insns can still be added or changed?



I believe LRA is less prone to the above reload problem because of its 
design.  You can change the stack in the backend and everything will work 
as long as you provide the right sfp-to-hfp/sp offsets.  In fact, LRA 
itself can allocate stack slots several times and add and change insns 
between these allocations.






Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
> > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
> > > > 
> > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
> > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
> > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel 
> > > > > >  wrote:
> > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
> > > > > > >> And exactly because I know enough, I would *really* like atomics 
> > > > > > >> to be
> > > > > > >> well-defined, and have very clear - and *local* - rules about 
> > > > > > >> how they
> > > > > > >> can be combined and optimized.
> > > > > > >
> > > > > > > "Local"?
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > So I think that one of the big advantages of atomics over volatile 
> > > > > > is
> > > > > > that they *can* be optimized, and as such I'm not at all against
> > > > > > trying to generate much better code than for volatile accesses.
> > > > > > 
> > > > > > But at the same time, that can go too far. For example, one of the
> > > > > > things we'd want to use atomics for is page table accesses, where it
> > > > > > is very important that we don't generate multiple accesses to the
> > > > > > values, because parts of the values can be change *by*hardware* (ie
> > > > > > accessed and dirty bits).
> > > > > > 
> > > > > > So imagine that you have some clever global optimizer that sees that
> > > > > > the program never ever actually sets the dirty bit at all in any
> > > > > > thread, and then uses that kind of non-local knowledge to make
> > > > > > optimization decisions. THAT WOULD BE BAD.
> > > > > 
> > > > > Might as well list other reasons why value proofs via whole-program
> > > > > analysis are unreliable for the Linux kernel:
> > > > > 
> > > > > 1.  As Linus said, changes from hardware.
> > > > 
> > > > This is what volatile is for, right?  (Or the weak-volatile idea I
> > > > mentioned).
> > > > 
> > > > Compilers won't be able to prove something about the values of such
> > > > variables, if marked (weak-)volatile.
> > > 
> > > Yep.
> > > 
> > > > > 2.  Assembly code that is not visible to the compiler.
> > > > >   Inline asms will -normally- let the compiler know what
> > > > >   memory they change, but some just use the "memory" tag.
> > > > >   Worse yet, I suspect that most compilers don't look all
> > > > >   that carefully at .S files.
> > > > > 
> > > > >   Any number of other programs contain assembly files.
> > > > 
> > > > Are the annotations of changed memory really a problem?  If the "memory"
> > > > tag exists, isn't that supposed to mean all memory?
> > > > 
> > > > To make a proof about a program for location X, the compiler has to
> > > > analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
> > > > the compiler will simply not be able to prove a thing (except maybe due
> > > > to the data-race-free requirement for non-atomics).  The attempt to
> > > > prove something isn't unreliable, simply because a correct compiler
> > > > won't claim to be able to "prove" something.
> > > 
> > > I am indeed less worried about inline assembler than I am about files
> > > full of assembly.  Or files full of other languages.
> > > 
> > > > One reason that could corrupt this is that if program addresses objects
> > > > other than through the mechanisms defined in the language.  For example,
> > > > if one thread lays out a data structure at a constant fixed memory
> > > > address, and another one then uses the fixed memory address to get
> > > > access to the object with a cast (e.g., (void*)0x123).
> > > 
> > > Or if the program uses gcc linker scripts to get the same effect.
> > > 
> > > > > 3.  Kernel modules that have not yet been written.  Now, the
> > > > >   compiler could refrain from trying to prove anything about
> > > > >   an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
> > > > >   is currently no way to communicate this information to the
> > > > >   compiler other than marking the variable "volatile".
> > > > 
> > > > Even if the variable is just externally accessible, then the compiler
> > > > knows that it can't do whole-program analysis about it.
> > > > 
> > > > It is true that whole-program analysis will not be applicable in this
> > > > case, but it will not be unreliable.  I think that's an important
> > > > difference.
> > > 
> > > Let me make sure that I understand what you are saying.  If my program has
> > > "extern int foo;", the compiler will refrain from doing whole-program
> > > analysis involving "foo"?
> > 
> > Yes.  If it can't be sure to actually have the whole program availabl

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Torvald Riegel
On Wed, 2014-02-19 at 07:23 -0800, David Lang wrote:
> On Tue, 18 Feb 2014, Torvald Riegel wrote:
> 
> > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote:
> >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> >>> Well, that's how atomics that aren't volatile are defined in the
> >>> standard.  I can see that you want something else too, but that doesn't
> >>> mean that the other thing is broken.
> >>
> >> Well that other thing depends on being able to see the entire program at
> >> compile time. PaulMck already listed various ways in which this is
> >> not feasible even for normal userspace code.
> >>
> >> In particular; DSOs and JITs were mentioned.
> >
> > No it doesn't depend on whole-program analysis being possible.  Because
> > if it isn't, then a correct compiler will just not do certain
> > optimizations simply because it can't prove properties required for the
> > optimization to hold.  With the exception of access to objects via magic
> > numbers (e.g., fixed and known addresses (see my reply to Paul), which
> > are outside of the semantics specified in the standard), I don't see a
> > correctness problem here.
> 
> Are you really sure that the compiler can figure out every possible thing 
> that a 
> loadable module or JITed code can access? That seems like a pretty strong 
> claim.

If the other code can be produced by a C translation unit that is valid
to be linked with the rest of the program, then I'm pretty sure the
compiler has a well-defined notion of whether it does or does not see
all other potential accesses.  IOW, if the C compiler is dealing with C
semantics and mechanisms only (including the C mechanisms for sharing
with non-C code!), then it will know what to do.

If you're playing tricks behind the C compiler's back using
implementation-defined stuff outside of the C specification, then
there's nothing the compiler really can do.  For example, if you're
trying to access a variable on a function's stack from some other
function, you better know how the register allocator of the compiler
operates.  In contrast, if you let this function simply export the
address of the variable to some external place, all will be fine.
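The distinction can be sketched in a few lines of C (all names here are hypothetical, made up for illustration): once a variable's address escapes to a place the compiler can see other code reaching, accesses through that address are well-defined and the compiler must account for them.

```c
#include <assert.h>

static int *exported;            /* a hypothetical "external place" */

static int peek(void)            /* another function reading through it */
{
    return *exported;
}

static int compute(void)
{
    int local = 41;
    exported = &local;           /* the address escapes: the compiler must
                                    now assume peek() may access `local` */
    (void) peek();               /* fine: `local` is still live here */
    return local + 1;
}
```

Trying instead to reach `local` from `peek()` without the escape — by guessing its stack slot — is exactly the behind-the-compiler's-back trick the text warns about.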

The documentation of GCC's -fwhole-program and -flto might also be
interesting for you.  GCC wouldn't need to have -fwhole-program if it
weren't conservative by default (correctly so).
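A minimal illustration of the conservatism being described (a sketch; `foo` is defined locally here only so the example links — in the scenario above, its definition and other readers/writers would live in another translation unit, a DSO, or a .S file):

```c
#include <assert.h>

/* `foo` has external linkage, so without -fwhole-program/-flto evidence the
   compiler cannot "prove" a value for it; it must assume unseen code may
   read or write it. */
int foo = 3;

int twice_foo(void)
{
    return foo + foo;   /* folding the two reads is licensed by the
                           data-race-free rules; assuming foo == 3
                           forever is not, absent whole-program analysis */
}
```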



Re: TYPE_BINFO and canonical types at LTO

2014-02-19 Thread Jan Hubicka
> On Tue, 18 Feb 2014, Jan Hubicka wrote:
> 
> > > > Non-ODR types born from other frontends will then need to be made to 
> > > > alias all the ODR variants; that can be done by storing them into the 
> > > > current canonical type hash.
> > > > (I wonder if we want to support cross language aliasing for non-POD?)
> > > 
> > > Surely for accessing components of non-POD types, no?  Like
> > > 
> > > class Foo {
> > > Foo();
> > > int *get_data ();
> > > int *data;
> > > } glob_foo;
> > > 
> > > extern "C" int *get_foo_data() { return glob_foo.get_data(); }
> > 
> > OK, if we want to support this, then we want to merge.
> > What about types with vtbl pointer? :)
> 
> I can easily create a C struct variant covering that.  Basically
> in _practice_ I can inter-operate with any language from C if I
> know its ABI.  Do we really want to make this undefined?  See
> the (even standard) Fortran - C interoperability spec.  I'm sure
> something exists for Ada interoperating with C (or even C++).

Well, if you know the ABI, you can interoperate C
struct {int a,b;};
with
struct {struct {int a;} a; struct {int b;}};

So we don't interoperate everything.  We may eventually want to define what
kind of inter-operation we support.  non-POD class layout is not always a
natural extension of C struct layout as one would expect
http://mentorembedded.github.io/cxx-abi/abi.html#layout
so it may make sense to declare it uninteroperable if it helps something in the
real world (which I am not quite sure about).
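For instance, the two layouts from the example above can be checked to coincide on a typical ABI even though they are distinct types for TBAA purposes (a sketch; the member names `x` and `y` are added here so the nested form is portable C):

```c
#include <stddef.h>
#include <assert.h>

struct flat   { int a, b; };

/* Same bytes, same offsets on common ABIs -- but a distinct type,
   so type-based alias analysis may treat accesses as non-conflicting. */
struct nested { struct { int a; } x; struct { int b; } y; };
```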

I am not really shooting for this.  I just want to get something that would
improve TBAA in predominantly C++ programs (with some C code in them) such as
firefox, chromium, openoffice or GCC, because I see us giving up on TBAA very
often when playing with cases that should be devirtualized by GVN/aggregate
propagation in ipa-prop but aren't.  Also, given that the C++ standard actually
has a notion of inter-module type equivalence, it may be a good idea to honor it.
> 
> > > ?  But you are talking about the "tree" merging part using ODR info
> > > to also merge types which differ in completeness of contained
> > > pointer types, right?  (exactly equal cases should be already merged)
> > 
> > Actually I was speaking of canonical types here. I want to preserve more 
> > of TBAA via honoring ODR and local types.
> 
> So, are you positive there will be a net gain in optimization when
> doing that?  Please factor in the surprises you'll get when code
> gets "miscompiled" because of "slight" ODR violations or interoperability
> that no longer works.
> 
> > I want to change lto to not 
> > merge canonical types for pairs of types of same layout (i.e. equivalent 
> > in the current canonical type definition) but with different mangled 
> > names.
> 
> Names are nothing ;)  In C I very often see different _names_ used
> in headers vs. implementation (when the implementation uses a different
> internal header).  You have struct Foo; in public headers vs.
> struct Foo_impl; in the implementation.

Yes, I am aware of that - I had to go through all those issues with --combine
code in the mid-2000s.  C++ is different.
What I want is to make TBAA between C++ types stronger and have TBAA between
C++ and other languages do pretty much what we do now, if possible.
> 
> > I also want it to never merge when types are local.  For 
> > inter-language TBAA we will need to ensure aliasing between a non-ODR 
> > type of the same layout and all unmerged variants of an ODR type.  Can it 
> > be done by attaching chains of ODR types into the canonical type hash and, 
> > when a non-ODR type appears, just making it alias with all of them?
> 
> No, how would that work?

This is what I am asking you about.
At the end of the streaming process, we will have the canonical type hash with
one leader and a list of ODR variants that were not merged based on ODR rules.
If the leader happens to be non-ODR, then it must be made to alias all of them.
I think either we can rewrite all ODR cases to have the non-ODR type as the
canonical type, or make something more fine-grained so we can force loads/stores
through non-ODR types to alias with all of them, but loads/stores through ODR
types to alias as within a C++ compilation unit.
> 
> > It would make sense to ODR merge in tree merging, too, but I am not sure if
> > this fits the current design, since you would need to merge SCC components 
> > of
> > different shape then that seems hard, right?
> 
> Right.  You'd lose the nice incremental SCC merging (where we haven't even
> yet implemented the nicest way - avoid re-materializing the SCC until
> we know it prevails).

Yes, SCC merging w/o re-materialization is a priority, indeed.
> 
> > It may be easier to ODR merge after streaming (during DECL fixup) just to 
> > make
> > WPA streaming cheaper and to reduce debug info size.  If you use
> > -fdump-ipa-devirt, it will dump you ODR types that did not get merged (only
> > ones with vtable pointers in them ATM) and there are quite long chains for
> > firefox. Surely then h

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Wed, Feb 19, 2014 at 6:40 AM, Torvald Riegel  wrote:
>
> If all those other threads written in whichever way use the same memory
> model and ABI for synchronization (e.g., choice of HW barriers for a
> certain memory_order), it doesn't matter whether it's a hardware thread,
> microcode, whatever.  In this case, C11 atomics should be fine.
> (We have this in userspace already, because correct compilers will have
> to assume that the code generated by them has to properly synchronize
> with other code generated by different compilers.)
>
> If the other threads use a different model, access memory entirely
> differently, etc, then we might be back to "volatile" because we don't
> know anything, and the very strict rules about execution steps of the
> abstract machine (ie, no as-if rule) are probably the safest thing to
> do.

Oh, I don't even care about architectures that don't have real hardware atomics.

So if there's a software protocol for atomics, all bets are off. The
compiler almost certainly has to do atomics with function calls
anyway, and we'll just plug in out own. And frankly, nobody will ever
care, because those architectures aren't relevant, and never will be.

Sure, there are some ancient Sparc platforms that only support a
single-byte "ldstub" and there are some embedded chips that don't
really do SMP, but have some pseudo-smp with special separate locking.
Really, nobody cares. The C standard has those crazy lock-free atomic
tests, and talks about address-free, but generally we require both
lock-free and address-free in the kernel, because otherwise it's just
too painful to do interrupt-safe locking, or do atomics in user-space
(for futexes).
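The lock-free queries being referred to look like this in C11 (a sketch; `ATOMIC_INT_LOCK_FREE` and `atomic_is_lock_free` are the standard's names — note that address-freedom is only a non-normative "should" in the standard, even though kernel and futex-style code depend on it):

```c
#include <stdatomic.h>
#include <assert.h>

/* Returns 1 if int atomics are lock-free on this target, 0 otherwise.
   ATOMIC_INT_LOCK_FREE == 2 means "always lock-free" at compile time;
   atomic_is_lock_free() answers for a particular object at run time. */
int int_atomics_lock_free(void)
{
    _Atomic int v = 0;
    return atomic_is_lock_free(&v) ? 1 : 0;
}
```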

So if your worry is just about software protocols for CPU's that
aren't actually designed for modern SMP, that's pretty much a complete
non-issue.

Linus


Re: Update x86-64 PLT for MPX

2014-02-19 Thread H.J. Lu
On Mon, Jan 27, 2014 at 1:50 PM, H.J. Lu  wrote:
> On Mon, Jan 27, 2014 at 1:42 PM, H.J. Lu  wrote:
>> On Sat, Jan 18, 2014 at 8:11 AM, H.J. Lu  wrote:
>>> Hi,
>>>
>>> Here is the proposal to update x86-64 PLT for MPX.  The linker change
>>> is implemented on hjl/mpx/pltext8 branch.  Any comments/feedbacks?
>>>
>>> Thanks.
>>>
>>> --
>>> H.J.
>>> ---
>>> Intel MPX:
>>>
>>> http://software.intel.com/en-us/file/319433-017pdf
>>>
>>> introduces 4 bound registers, which will be used for parameter passing
>>> in x86-64.  Bound registers are cleared by branch instructions.  Branch
>>> instructions with BND prefix will keep bound register contents. This leads
>>> to 2 requirements to 64-bit MPX run-time:
>>>
>>> 1. Dynamic linker (ld.so) should save and restore bound registers during
>>> symbol lookup.
>>> 2. Change the current 16-byte PLT0:
>>>
>>>   ff 35 08 00 00 00     pushq  GOT+8(%rip)
>>>   ff 25 00 10 00        jmpq   *GOT+16(%rip)
>>>   0f 1f 40 00           nopl   0x0(%rax)
>>>
>>> and 16-byte PLT1:
>>>
>>>   ff 25 00 00 00 00     jmpq   *name@GOTPCREL(%rip)
>>>   68 00 00 00 00   pushq  $index
>>>   e9 00 00 00 00   jmpq   PLT0
>>>
>>> which clear bound registers, to preserve bound registers.
>>>
>>> We use 2 new relocations:
>>>
>>> #define R_X86_64_PC32_BND  39 /* PC relative 32 bit signed with BND prefix 
>>> */
>>> #define R_X86_64_PLT32_BND 40 /* 32 bit PLT address with BND prefix */
>>>
>>> to mark branch instructions with BND prefix.
>>>
>>> When linker sees any R_X86_64_PC32_BND or R_X86_64_PLT32_BND relocations,
>>> it switches to a different PLT0:
>>>
>>>   ff 35 08 00 00 00     pushq  GOT+8(%rip)
>>>   f2 ff 25 00 10 00     bnd jmpq *GOT+16(%rip)
>>>   0f 1f 00              nopl   (%rax)
>>>
>>> to preserve bound registers for symbol lookup and it also creates an
>>> external PLT section, .plt.bnd.  Linker will create a BND PLT1 entry
>>> in .plt:
>>>
>>>   68 00 00 00 00   pushq  $index
>>>   f2 e9 00 00 00 00 bnd jmpq PLT0
>>>   0f 1f 44 00 00        nopl   0(%rax,%rax,1)
>>>
>>> and a 8-byte BND PLT entry in .plt.bnd:
>>>
>>>   f2 ff 25 00 00 00 00  bnd jmpq *name@GOTPCREL(%rip)
>>>   90                    nop
>>>
>>> Otherwise, linker will create a legacy PLT entry in .plt:
>>>
>>>   68 00 00 00 00   pushq  $index
>>>   e9 00 00 00 00        jmpq   PLT0
>>>   66 0f 1f 44 00 00 nopw 0(%rax,%rax,1)
>>>
>>> and a 8-byte legacy PLT in .plt.bnd:
>>>
>>>   ff 25 00 00 00 00 jmpq  *name@GOTPCREL(%rip)
>>>   66 90 xchg  %ax,%ax
>>>
>>> The initial value of the GOT entry for "name" will be set to the
>>> "pushq" instruction in the corresponding entry in .plt.  Linker will
>>> resolve reference of symbol "name" to the entry in the second PLT,
>>> .plt.bnd.
>>>
>>> Prelink stores the offset of pushq of PLT1 (plt_base + 0x10) in GOT[1]
>>> and GOT[1] is stored in GOT[3].  We can undo prelink in GOT by computing
>>> the corresponding the pushq offset with
>>>
>>> GOT[1] + (GOT offset - &GOT[3]) * 2
>>>
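The quoted computation could be sketched as follows (names hypothetical; it relies on GOT entries being 8 bytes and .plt entries 16 bytes, so a slot's byte distance from &GOT[3] doubles into its pushq's distance from the first PLT1 pushq, whose offset prelink keeps in GOT[1]):

```c
#include <stdint.h>
#include <assert.h>

/* Undo-prelink sketch: given the GOT base and the address of a GOT slot
   (slot >= &got[3]), recover the offset of the corresponding PLT1 pushq. */
static uint64_t pushq_offset(const uint64_t *got, const uint64_t *slot)
{
    /* got[1] holds the offset of the first PLT1 pushq (plt_base + 0x10). */
    return got[1] + (uint64_t)((const char *) slot
                               - (const char *) &got[3]) * 2;
}
```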
>>> Since for each entry in .plt except for PLT0 we create a 8-byte entry in
>>> .plt.bnd, there is extra 8-byte per PLT symbol.
>>>
>>> We also investigated the 16-byte entry for .plt.bnd.  We compared the
>>> 8-byte entry vs the 16-byte entry for .plt.bnd on Sandy Bridge.
>>> There are no performance differences in SPEC CPU 2000/2006 as well as
>>> micro benchmarks.
>>>
>>> Pros:
>>> No change to undo prelink in dynamic linker.
>>> Only 8-byte memory overhead for each PLT symbol.
>>> Cons:
>>> Extra .plt.bnd section is needed.
>>> Extra 8 byte for legacy branches to PLT.
>>> GDB is unaware of the new layout of .plt and .plt.bnd.
>>
>> Hi,
>>
>> I am enclosing the updated x86-64 psABI with PLT change.
>> I checked it onto hjl/mpx/master branch at
>>
>> https://github.com/hjl-tools/x86-64-psABI
>>
>> I will check in the binutils changes if there are no disagreements
>> in 2 weeks.
>>
>> Thanks.
>
>
> My email is rejected due to PDF attachment.   I am resubmitting it
> without the PDF file.

I pushed the MPX binutils change into master:

https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=0ff2b86e7c14177ec7f9e1257f8e697814794017


-- 
H.J.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 06:55:51PM +0100, Torvald Riegel wrote:
> On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
> > On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:

[ . . . ]

> > > On both sides, the compiler will see that mmap() (or similar) is called,
> > > so that means the data escapes to something unknown, which could create
> > > threads and so on.  So first, it can't do whole-program analysis for
> > > this state anymore, and has to assume that other C11 threads are
> > > accessing this memory.  Next, lock-free atomics are specified to be
> > > "address-free", meaning that they must work independent of where in
> > > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
> > > "should" and non-normative, but essential IMO).  Thus, this then boils
> > > down to just a simple case of synchronization.  (Of course, the rest of
> > > the ABI has to match too for the data exchange to work.)
> > 
> > The compiler will see mmap() on the user side, but not on the kernel
> > side.  On the kernel side, something special is required.
> 
> Maybe -- you'll certainly know better :)
> 
> But maybe it's not that hard: For example, if the memory is in current
> code made available to userspace via calling some function with an asm
> implementation that the compiler can't analyze, then this should be
> sufficient.

The kernel code would need to explicitly tell the compiler what portions
of the kernel address space were covered by this.  I would not want the
compiler to have to work it out based on observing interactions with the
page tables.  ;-)

> > Agree that "address-free" would be nice as "shall" rather than "should".
> > 
> > > > I echo Peter's question about how one tags functions like mmap().
> > > > 
> > > > I will also remember this for the next time someone on the committee
> > > > discounts "volatile".  ;-)
> > > > 
> > > > > > 5.  JITed code produced based on BPF: 
> > > > > > https://lwn.net/Articles/437981/
> > > > > 
> > > > > This might be special, or not, depending on how the JITed code gets
> > > > > access to data.  If this is via fixed addresses (e.g., (void*)0x123),
> > > > > then see above.  If this is through function calls that the compiler
> > > > > can't analyze, then this is like 4.
> > > > 
> > > > It could well be via the kernel reading its own symbol table, sort of
> > > > a poor-person's reflection facility.  I guess that would be for all
> > > > intents and purposes equivalent to your (void*)0x123.
> > > 
> > > If it is replacing code generated by the compiler, then yes.  If the JIT
> > > is just filling in functions that had been undefined yet declared
> > > before, then the compiler will have seen the data escape through the
> > > function interfaces, and should be aware that there is other stuff.
> > 
> > So one other concern would then be things like ftrace, kprobes,
> > ksplice, and so on.  These rewrite the kernel binary at runtime, though
> > in very limited ways.
> 
> Yes.  Nonetheless, I wouldn't see a problem if they, say, rewrite with
> C11-compatible code (and same ABI) on a function granularity (and when
> the function itself isn't executing concurrently) -- this seems to be
> similar to just having another compiler compile this particular
> function.

Well, they aren't using C11-compatible code yet.  They do patch within
functions.  And in some cases, they make staged sequences of changes to
allow the patching to happen concurrently with other CPUs executing the
code being patched.  Not sure that any of the latter is actually in the
kernel at the moment, but it has at least been prototyped and discussed.

Thanx, Paul



Re: ARM inline assembly usage in Linux kernel

2014-02-19 Thread Renato Golin
On 19 February 2014 11:58, Richard Sandiford
 wrote:
> I agree that having an unrecognised asm shouldn't be a hard error until
> assembly time though.  Saleem, is the problem that this is being rejected
> earlier?

Hi Andrew, Richard,

Thanks for your reviews! We agree that we should actually just ignore
the contents until object emission.

Just for context, one of the reasons why we enabled inline assembly
checks is for some obscure cases where the snippet changes the
instruction set (arm -> thumb) and the rest of the function becomes
garbage. Our initial implementation was to always emit .arm/.thumb
after *any* inline assembly, which would become a nop in the worst
case. But since we had easy access to the assembler, we thought: "why
not?".

The idea is now to try to parse the snippet for cases like .arm/.thumb,
but only emit a warning iff -Wbad-inline-asm (or whatever) is set (and
not make it on by default); otherwise, ignore it. We're hoping our
assembler will be able to cope with the multiple levels of indirection
automagically. ;)

Thanks again!
--renato


Re: ARM inline assembly usage in Linux kernel

2014-02-19 Thread Andrew Pinski
On Wed, Feb 19, 2014 at 3:17 PM, Renato Golin  wrote:
> On 19 February 2014 11:58, Richard Sandiford
>  wrote:
>> I agree that having an unrecognised asm shouldn't be a hard error until
>> assembly time though.  Saleem, is the problem that this is being rejected
>> earlier?
>
> Hi Andrew, Richard,
>
> Thanks for your reviews! We agree that we should actually just ignore
> the contents until object emission.
>
> Just for context, one of the reasons why we enabled inline assembly
> checks is for some obscure cases when the snippet changes the
> instructions set (arm -> thumb) and the rest of the function becomes
> garbage. Our initial implementation was to always emit .arm/.thumb
> after *any* inline assembly, which would become a nop in the worst
> case. But since we had easy access to the assembler, we thought: "why
> not?".

With the unified assembly format, you should not need those
.arm/.thumb and in fact emitting them can make things even worse.

Thanks,
Andrew Pinski


>
> The idea is now to try to parse the snippet for cases like .arm/.thumb
> but only emit a warning IFF -Wbad-inline-asm (or whatever) is set (and
> not to make it on by default), otherwise, ignore. We're hoping our
> assembler will be able to cope with the multiple levels of indirection
> automagically. ;)
>
> Thanks again!
> --renato


Re: ARM inline assembly usage in Linux kernel

2014-02-19 Thread Renato Golin
On 19 February 2014 23:19, Andrew Pinski  wrote:
> With the unified assembly format, you should not need those
> .arm/.thumb and in fact emitting them can make things even worse.

If only we could get rid of all pre-UAL inline assembly on the planet... :)

That has been the only reason why we added support for those in our
assembler: GAS supports them, and people still use them (or have
legacy code they won't change).

If the binutils folks (and you guys) are happy to start seriously
phasing out pre-UAL support, I'd be more than happy to do so on our
end. Do you think I should start that conversation on the binutils
list?

Maybe a new serious compulsory warning, to start?

cheers,
--renato


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  wrote:
> On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
>>
>> Can you point to it? Because I can find a draft standard, and it sure
>> as hell does *not* contain any clarity of the model. It has a *lot* of
>> verbiage, but it's pretty much impossible to actually understand, even
>> for somebody who really understands memory ordering.
>
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> This has an explanation of the model up front, and then the detailed
> formulae in Section 6.  This is from 2010, and there might have been
> smaller changes since then, but I'm not aware of any bigger ones.

Ahh, this is different from what others pointed at. Same people,
similar name, but not the same paper.

I will read this version too, but from reading the other one and the
standard in parallel and trying to make sense of it, it seems that I
may have originally misunderstood part of the whole control dependency
chain.

The fact that the left side of "? :", "&&" and "||" breaks data
dependencies made me originally think that the standard tried very
hard to break any control dependencies. Which I felt was insane, when
some of the examples literally were about the testing of the
value of an atomic read. The data dependency matters quite a bit. The
fact that the other "Mathematical" paper then very much talked about
consume only in the sense of following a pointer made me think so even
more.

But reading it some more, I now think that the whole "data dependency"
logic (which is where the special left-hand side rule of the ternary
and logical operators comes in) is basically an exception to the rule
that sequence points end up being also meaningful for ordering (ok, so
C11 seems to have renamed "sequence points" to "sequenced before").

So while an expression like

atomic_read(p, consume) ? a : b;

doesn't have a data dependency from the atomic read that forces
serialization, writing

   if (atomic_read(p, consume))
  a;
   else
  b;

the standard *does* imply that the atomic read is "happens-before" wrt
"a", and I'm hoping that there is no question that the control
dependency still acts as an ordering point.

THAT was one of my big confusions, the discussion about control
dependencies and the fact that the logical ops broke the data
dependency made me believe that the standard tried to actively avoid
the whole issue with "control dependencies can break ordering
dependencies on some CPU's due to branch prediction and memory
re-ordering by the CPU".

But after all the reading, I'm starting to think that that was never
actually the implication at all, and the "logical ops break the data
dependency" rule is simply an exception to the sequence point rule.
All other sequence points still do exist, and do imply an ordering
that matters for "consume".

Am I now reading it right?

So the clarification is basically to the statement that the "if
(consume(p)) a" version *would* have an ordering guarantee between the
read of "p" and "a", but the "consume(p) ? a : b" would *not* have
such an ordering guarantee. Yes?
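The two shapes in question can be written out as a C11 sketch (hypothetical names; whether the `if` form really retains ordering on weakly-ordered hardware is exactly the open question being asked, so the comments below only state what the dependency machinery itself says):

```c
#include <stdatomic.h>
#include <assert.h>

struct data { int payload; };
static _Atomic(struct data *) p;

/* Data dependency: the loaded pointer value feeds the dereference, so the
   read of payload is dependency-ordered after the consume load. */
int reader_deref(void)
{
    struct data *d = atomic_load_explicit(&p, memory_order_consume);
    return d ? d->payload : -1;
}

/* The loaded value feeds only the selection, not a subsequent load, so no
   dependency-ordered-before edge exists: consume by itself promises nothing
   extra over relaxed for ordering `a` and `b` here. */
int reader_branch(int a, int b)
{
    return atomic_load_explicit(&p, memory_order_consume) ? a : b;
}
```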

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  wrote:
> > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> >>
> >> Can you point to it? Because I can find a draft standard, and it sure
> >> as hell does *not* contain any clarity of the model. It has a *lot* of
> >> verbiage, but it's pretty much impossible to actually understand, even
> >> for somebody who really understands memory ordering.
> >
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > This has an explanation of the model up front, and then the detailed
> > formulae in Section 6.  This is from 2010, and there might have been
> > smaller changes since then, but I'm not aware of any bigger ones.
> 
> Ahh, this is different from what others pointed at. Same people,
> similar name, but not the same paper.
> 
> I will read this version too, but from reading the other one and the
> standard in parallel and trying to make sense of it, it seems that I
> may have originally misunderstood part of the whole control dependency
> chain.
> 
> The fact that the left side of "? :", "&&" and "||" breaks data
> dependencies made me originally think that the standard tried very
> hard to break any control dependencies. Which I felt was insane, when
> then some of the examples literally were about the testing of the
> value of an atomic read. The data dependency matters quite a bit. The
> fact that the other "Mathematical" paper then very much talked about
> consume only in the sense of following a pointer made me think so even
> more.
> 
> But reading it some more, I now think that the whole "data dependency"
> logic (which is where the special left-hand side rule of the ternary
> and logical operators come in) are basically an exception to the rule
> that sequence points end up being also meaningful for ordering (ok, so
> C11 seems to have renamed "sequence points" to "sequenced before").
> 
> So while an expression like
> 
> atomic_read(p, consume) ? a : b;
> 
> doesn't have a data dependency from the atomic read that forces
> serialization, writing
> 
>if (atomic_read(p, consume))
>   a;
>else
>   b;
> 
> the standard *does* imply that the atomic read is "happens-before" wrt
> "a", and I'm hoping that there is no question that the control
> dependency still acts as an ordering point.

The control dependency should order subsequent stores, at least assuming
that "a" and "b" don't start off with identical stores that the compiler
could pull out of the "if" and merge.  The same might also be true for ?:
for all I know.  (But see below)

That said, in this case, you could substitute relaxed for consume and get
the same effect.  The return value from atomic_read() gets absorbed into
the "if" condition, so there is no dependency-ordered-before relationship,
so nothing for consume to do.

One caution...  The happens-before relationship requires you to trace a
full path between the two operations of interest.  This is illustrated
by the following example, with both x and y initially zero:

T1: atomic_store_explicit(&x, 1, memory_order_relaxed);
r1 = atomic_load_explicit(&y, memory_order_relaxed);

T2: atomic_store_explicit(&y, 1, memory_order_relaxed);
r2 = atomic_load_explicit(&x, memory_order_relaxed);

There is a happens-before relationship between T1's load and store,
and another happens-before relationship between T2's load and store,
but there is no happens-before relationship from T1 to T2, and none
in the other direction, either.  And you don't get to assume any
ordering based on reasoning about these two disjoint happens-before
relationships.

So it is quite possible for r1==0&&r2==0 after both threads complete.

Which should be no surprise: This misordering can happen even on x86,
which would need a full smp_mb() to prevent it.

> THAT was one of my big confusions, the discussion about control
> dependencies and the fact that the logical ops broke the data
> dependency made me believe that the standard tried to actively avoid
> the whole issue with "control dependencies can break ordering
> dependencies on some CPU's due to branch prediction and memory
> re-ordering by the CPU".
> 
> But after all the reading, I'm starting to think that that was never
> actually the implication at all, and the "logical ops breaks the data
> dependency rule" is simply an exception to the sequence point rule.
> All other sequence points still do exist, and do imply an ordering
> that matters for "consume"
> 
> Am I now reading it right?

As long as there is an unbroken chain of -data- dependencies from the
consume to the later access in question, and as long as that chain
doesn't go through the excluded operations, yes.

> So the clarification is basically to the statement that the "if
> (consume(p)) a" version *would* have an ordering guarantee between the
> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> such an ordering guarantee. Yes?

Neither has a data-dependency guarantee, because there is no data
dependency from the load to either "a" or "b".  After all, the value
loaded got absorbed into the "if" condition.  However, according to
discussions earlier in this thread, the "if" variant would have a
control-dependency ordering guarantee for any stores in "a" and "b"
(but not loads!).

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
 wrote:
>
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge.  The same might also be true for ?:
> for all I know.  (But see below)

Stores I don't worry about so much because

 (a) you can't sanely move stores up in a compiler anyway
 (b) no sane CPU moves stores up, since they aren't on the critical path

so a read->cmp->store is actually really hard to make anything sane
re-order. I'm sure it can be done, and I'm sure it's stupid as hell.

But that "it's hard to screw up" is *not* true for a load->cmp->load.

So lets make this really simple: if you have a consume->cmp->read, is
the ordering of the two reads guaranteed?

> As long as there is an unbroken chain of -data- dependencies from the
> consume to the later access in question, and as long as that chain
> doesn't go through the excluded operations, yes.

So let's make it *really* specific, and make it real code doing a real
operation, that is actually realistic and reasonable in a threaded
environment, and may even be in some critical code.

The issue is the read-side ordering guarantee for 'a' and 'b', for this case:

 - Initial state:

   a = b = 0;

 - Thread 1 ("consumer"):

if (atomic_read(&a, consume))
 return b;
/* not yet initialized */
return -1;

 - Thread 2 ("initializer"):

 b = some_value_lets_say_42;
 /* We are now ready to party */
 atomic_write(&a, 1, release);

and quite frankly, if there is no ordering guarantee between the read
of "a" and the read of "b" in the consumer thread, then the C atomics
standard is broken.

Put another way: I claim that if "thread 1" ever sees a return value
other than -1 or 42, then the whole definition of atomics is broken.

Question 2: and what changes if the atomic_read() is turned into an
acquire, and why? Does it start working?

> Neither has a data-dependency guarantee, because there is no data
> dependency from the load to either "a" or "b".  After all, the value
> loaded got absorbed into the "if" condition.  However, according to
> discussions earlier in this thread, the "if" variant would have a
> control-dependency ordering guarantee for any stores in "a" and "b"
> (but not loads!).

So exactly what part of the standard allows the loads to be
re-ordered, and why? Quite frankly, I'd think that any sane person
will agree that the above code snippet is realistic, and that my
requirement that thread 1 sees either -1 or 42 is valid.

And if the C standards body has said that control dependencies break
the read ordering, then I really think that the C standards committee
has screwed up.

If the consumer of an atomic load isn't a pointer chasing operation,
then the consume should be defined to be the same as acquire. None of
this "conditionals break consumers". No, conditionals on the
dependency path should turn consumers into acquire, because otherwise
the "consume" load is dangerous as hell.

And if the definition of acquire doesn't include the control
dependency either, then the C atomic memory model is just completely
and utterly broken, since the above *trivial* and clearly useful
example is broken.

I really think the above example is pretty damn black-and-white.
Either it works, or the standard isn't worth wiping your ass with.

  Linus


Re: Building GCC with -Wmissing-declarations and addressing its warnings

2014-02-19 Thread Jonathan Wakely
On 13 February 2014 20:47, Patrick Palka wrote:
> On a related note, would a patch to officially enable
> -Wmissing-declarations in the build process be well regarded?

What would be the advantage?

>  Since
> -Wmissing-prototypes is currently enabled, I assume it is the
> intention of the GCC devs to address these warnings, and that during
> the transition from a C to C++ bootstrap compiler a small oversight
> was made (that -Wmissing-prototypes is a no-op against C++ source
> files).

The additional safety provided by -Wmissing-prototypes is already
guaranteed for C++.

In C a missing prototype causes the compiler to guess, probably
incorrectly, how to call the function.

In C++ a function cannot be called without a previous declaration and
the linker will notice if you declare a function with one signature
and define it differently.