Re: proposal to turn on some warnings by default

2014-02-27 Thread David Brown
On 27/02/14 07:50, Mingjie Xing wrote:
> Hello,
> 
> I'm wondering if it's a good idea to turn on some warnings by default
> (or even promote them to error), such as -Wreturn-type on C.  This
> would help programmers to avoid some mistakes.
> 
> Regards,
> Mingjie
> 

Personally, I think gcc should issue a warning if it is run without at
least "-Wall" (or "-Wno-all"), telling the user that they have forgotten
to enable warnings.  /That/ would help people avoid mistakes.  It should
also warn if there is no optimisation level, to stop people accidentally
generating big and slow code.

It's not going to happen, of course - compatibility with existing
Makefiles, the users' right to make mistakes if they want, etc.


But it might be reasonable to have warnings when the compiler can see
without doubt that there is undefined behaviour going on, such as a
missing return.  The trouble is, it is only reasonable to have it by
default if there is /no/ doubt that it will be run - the compiler can't
really insist on issuing warnings on code that is compiled but not used.
And since it often can't tell what will be used, it won't often be able
to issue such warnings - and gcc would then have to figure out what code
is definitely reachable from main() and apply "-Wreturn-type", etc., to
that code, while leaving the warnings off for other code.


David


Re: proposal to turn on some warnings by default

2014-02-27 Thread Sylvestre Ledru
On 27/02/2014 07:50, Mingjie Xing wrote:
> Hello,
>
> I'm wondering if it's a good idea to turn on some warnings by default
> (or even promote them to error), such as -Wreturn-type on C.  This
> would help programmers to avoid some mistakes.
>
I am writing a patch for this specific change, but it is a huge amount
of work.  Many unit tests have to be updated...  I guess it would be the
same for other warnings...
See http://gcc.gnu.org/ml/gcc-patches/2014-01/msg00820.html
or
http://gcc.gnu.org/ml/gcc/2014-01/msg00194.html

Sylvestre



Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Andrew Haley
Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw has done a fairly thorough reworking of
the documentation.  I've helped a bit.

Section 6.41 of the GCC manual has been rewritten.  It has become:

6.41 How to Use Inline Assembly Language in C Code
6.41.1 Basic Asm - Assembler Instructions with No Operands
6.41.2 Extended Asm - Assembler Instructions with C Expression Operands

We could simply post the patch to GCC-patches and have at it, but I
think it's better to discuss the document here first.  You can read it
at

http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
and affected html pages)

All comments are very welcome.

Andrew.


Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Richard Sandiford
Andrew Haley writes:
> Over the years there has been a great deal of traffic on these lists
> caused by misunderstandings of GCC's inline assembler.  That's partly
> because it's inherently tricky, but the existing documentation needs
> to be improved.
>
> dw  has done a fairly thorough reworking of
> the documentation.  I've helped a bit.
>
> Section 6.41 of the GCC manual has been rewritten.  It has become:
>
> 6.41 How to Use Inline Assembly Language in C Code
> 6.41.1 Basic Asm - Assembler Instructions with No Operands
> 6.41.2 Extended Asm - Assembler Instructions with C Expression Operands
>
> We could simply post the patch to GCC-patches and have at it, but I
> think it's better to discuss the document here first.  You can read it
> at
>
> http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
> http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
> http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
> and affected html pages)
>
> All comments are very welcome.

Thanks for doing this, looks like a big improvement.  A couple of comments:

The section on basic asms says:

  Do not expect a sequence of asm statements to remain perfectly
  consecutive after compilation. To ensure that assembler instructions
  maintain their order, use a single asm statement containing multiple
  instructions. Note that GCC's optimizer can move asm statements
  relative to other code, including across jumps.

The "maintain their order" might be a bit misleading, since volatile asms
(including basic asms) must always be executed in the original order.
Maybe this was meaning placement/address order instead?  It might also be
worth mentioning that the number of instances of an asm in the output
may be different from the input.  (Can it increase as well as decrease?
I'm not sure off-hand, but probably yes.)

In the extended section:

  Unless an output operand has the '&' constraint modifier (see
  Modifiers), GCC may allocate it in the same register as an unrelated
  input operand, [...]

It could also use it for addresses in other (memory) outputs.

For:

  When using asmSymbolicNames for the output operands, you may use these
  names instead of digits.

it might be worth mentioning that you need the enclosing [...].

Thanks,
Richard



Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Kyrill Tkachov

On 27/02/14 11:07, Andrew Haley wrote:

Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw  has done a fairly thorough reworking of
the documentation.  I've helped a bit.

Section 6.41 of the GCC manual has been rewritten.  It has become:

6.41 How to Use Inline Assembly Language in C Code
6.41.1 Basic Asm - Assembler Instructions with No Operands
6.41.2 Extended Asm - Assembler Instructions with C Expression Operands

We could simply post the patch to GCC-patches and have at it, but I
think it's better to discuss the document here first.  You can read it
at

http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
and affected html pages)

All comments are very welcome.

Hi Andrew, dw,

Thanks for doing this!

In the Extended Asm documentation: Other format strings section:
"'%=' outputs a number that is unique to each instruction in the entire 
compilation."


I find the term 'instruction' confusing here.  From what I understand,
the number is unique to each asm statement, which may contain multiple
assembly instructions.  IMHO it would be clearer to say "unique to each
asm statement".


Kyrill










Re: proposal to turn on some warnings by default

2014-02-27 Thread Basile Starynkevitch
On Thu, 2014-02-27 at 10:14 +0100, David Brown wrote:
> On 27/02/14 07:50, Mingjie Xing wrote:
> > Hello,
> > 
> > I'm wondering if it's a good idea to turn on some warnings by default
> > (or even promote them to error), such as -Wreturn-type on C.  This
> > would help programmers to avoid some mistakes.
> > 
> > Regards,
> > Mingjie
> > 
> 
> Personally, I think gcc should issue a warning if it is run without at
> least "-Wall" (or "-Wno-all"), telling the user that they have forgotten
> to enable warnings.  /That/ would help people avoid mistakes.  It should
> also warn if there is no optimisation level, to stop people accidentally
> generating big and slow code.


I totally agree with you.  Perhaps in the next release (after 4.9) we
might add something to the spec file to let e.g. Linux distribution
makers (or users configuring and compiling GCC from its source tarball)
enable that feature.

Maybe a way to state in the spec file that -Wall (or some other global,
configurable option) is passed by default, unless some other option is
given to override it.

So people who want -Wall without asking for it explicitly could
configure their spec file to get it, while people who want maximal
compatibility with the behavior of previous GCC versions would simply
leave the spec file alone.

Look e.g. at the large number of questions asked on stackoverflow.com
by newcomers to GCC (or beginners in C programming) which could have
been avoided (or at least flagged) by -Wall.

BTW, when I teach courses and have students use GCC, I require them to
pass -Wall and to do whatever is necessary to eliminate all warnings.

Regards.
-- 
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basilestarynkevitchnet mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mine, sont seulement les miennes} ***




Asm volatile causing performance regressions on ARM

2014-02-27 Thread Yury Gribov

Hi all,

We have recently run into a performance/code size regression on ARM
targets after the transition from GCC 4.7 to GCC 4.8 (the regression is
also present in 4.9).


The following code snippet uses Linux-style compiler barriers to protect 
memory writes:


  #define barrier() __asm__ __volatile__ ("": : :"memory")
  #define write(v,a) { barrier(); *(volatile unsigned *)(a) = (v); }

  #define v1 0x0010
  #define v2 0xaabbccdd

  void test(unsigned base) {
write(v1, base + 0x100);
write(v2, base + 0x200);
write(v1, base + 0x300);
write(v2, base + 0x400);
  }

Code generated by GCC 4.7 under -Os (all good):

   mov r2, #7340032
   str r2, [r0, #3604]
   ldr r3, .L2
   str r3, [r0, #3612]
   str r2, [r0, #3632]
   str r3, [r0, #3640]

(note that compiler decided to load v2 from constant pool).

Now code generated by GCC 4.8/4.9 under -Os is much larger because v1 
and v2 are reloaded before every store:


   mov r3, #7340032
   str r3, [r0, #3604]
   ldr r3, .L2
   str r3, [r0, #3612]
   mov r3, #7340032
   str r3, [r0, #3632]
   ldr r3, .L2
   str r3, [r0, #3640]

v1 and v2 are constant literals and can't really be changed by the
user, so I would expect the compiler to combine the loads.


After some investigation, we discovered that this behavior is caused by
a big hammer in gcc/cse.c:

   /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
   if (NONJUMP_INSN_P (insn)
   && volatile_insn_p (PATTERN (insn)))
 flush_hash_table ();

This code (introduced in
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) aborts CSE
after seeing a volatile inline asm.


Is this compiler behavior reasonable?  AFAIK the GCC documentation only
says that __volatile__ prevents the compiler from removing the asm; it
does not mention that it suppresses optimization of all surrounding
expressions.


If this behavior is not intended, what would be the best way to fix the
performance?  I could teach GCC not to remove constant RTXs in
flush_hash_table(), but that is probably very naive and won't cover some
corner cases.


-Y


[RFC] Meta-description for tree and gimple folding

2014-02-27 Thread Richard Biener

I've been hacking on a prototype that generates matching and 
simplification code from a meta-description.  The goal is
to provide a single source of transforms currently spread
over the compiler, mostly fold-const.c, gimple-fold.c and
tree-ssa-forwprop.c.  Another goal is to make these transforms
(which are most of the form of generating a simpler form of
a pattern-matched IL piece) more readily available to passes
like value-numbering so they can be used on-the-fly, using
information provided by the pass lattice.  The ultimate
goal is to generate (most of) fold-const.c and gimple-fold.c
and tree-ssa-forwprop.c from a single meta description.

Currently the prototype can generate code to match and simplify
on the GIMPLE IL and it uses a very simple description right now
(following the lispy style we have for machine descriptions).
For example

(define_match_and_simplify foo
  (PLUS_EXPR (MINUS_EXPR integral_op_p@0 @1) @1)
  @0)

Matches (A - B) + B and transforms it to A.  More complex
replacements involving modifying of matches operands can be
done with inlined C code:

(define_match_and_simplify bar
  (PLUS_EXPR INTEGER_CST_P@0 (PLUS_EXPR @1 INTEGER_CST_P@2))
  (PLUS_EXPR { int_const_binop (PLUS_EXPR, captures[0], captures[2]); } @1))

which matches CST1 + (X + CST2) and transforms it to (CST1 + CST2) + X
(thus it reassociates but it also simplifies the constant part).

Writing patterns will require a few new predicates like
INTEGER_CST_P or integral_op_p.

At this point I'll try integrating the result into a few
GIMPLE passes (forwprop and SCCVN) to see if the interface
works well enough.  Currently the GIMPLE interface is

tree
gimple_match_and_simplify (tree name, gimple_seq *seq,
   tree (*valueize)(tree));

where the simplification happens on the defining statement
of the SSA name 'name' and an is_gimple_val result is returned.
Any intermediate stmts are appended to 'seq' (or NULL_TREE
is returned if that would be necessary and 'seq' is NULL),
and all SSA names matched and generated are valueized using
the valueize callback (if not NULL).  Thus for the first
example above we'd return A and not touch seq, while
for the second example we'd return a new temporary SSA
name and append name = CST' + X to seq (we might want
to allow in-place modification of the def stmt of name
as well; I'm not sure yet - that's the forwprop way of operating).

Patch below for reference.

Comments or suggestions?

Thanks,
Richard.

Index: gcc/Makefile.in
===
*** gcc/Makefile.in.orig	2014-02-15 10:52:03.466934196 +0100
--- gcc/Makefile.in 2014-02-27 14:30:37.426648887 +0100
*** OBJS = \
*** 1236,1241 
--- 1236,1242 
gimple-iterator.o \
gimple-fold.o \
gimple-low.o \
+   gimple-match.o \
gimple-pretty-print.o \
gimple-ssa-isolate-paths.o \
gimple-ssa-strength-reduction.o \
*** MOSTLYCLEANFILES = insn-flags.h insn-con
*** 1504,1510 
   insn-output.c insn-recog.c insn-emit.c insn-extract.c insn-peep.c \
   insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \
   insn-latencytab.c insn-opinit.c insn-opinit.h insn-preds.c insn-constants.h \
!  tm-preds.h tm-constrs.h checksum-options \
   tree-check.h min-insn-modes.c insn-modes.c insn-modes.h \
   genrtl.h gt-*.h gtype-*.h gtype-desc.c gtyp-input.list \
   xgcc$(exeext) cpp$(exeext) \
--- 1505,1511 
   insn-output.c insn-recog.c insn-emit.c insn-extract.c insn-peep.c \
   insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \
   insn-latencytab.c insn-opinit.c insn-opinit.h insn-preds.c insn-constants.h \
!  tm-preds.h tm-constrs.h checksum-options gimple-match.c \
   tree-check.h min-insn-modes.c insn-modes.c insn-modes.h \
   genrtl.h gt-*.h gtype-*.h gtype-desc.c gtyp-input.list \
   xgcc$(exeext) cpp$(exeext) \
*** $(common_out_object_file): $(common_out_
*** 2018,2024 
  .PRECIOUS: insn-config.h insn-flags.h insn-codes.h insn-constants.h \
insn-emit.c insn-recog.c insn-extract.c insn-output.c insn-peep.c \
insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \
!   insn-latencytab.c insn-preds.c
  
  # Dependencies for the md file.  The first time through, we just assume
  # the md file itself and the generated dependency file (in order to get
--- 2019,2025 
  .PRECIOUS: insn-config.h insn-flags.h insn-codes.h insn-constants.h \
insn-emit.c insn-recog.c insn-extract.c insn-output.c insn-peep.c \
insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \
!   insn-latencytab.c insn-preds.c gimple-match.c
  
  # Dependencies for the md file.  The first time through, we just assume
  # the md file itself and the generated dependency file (in order to get
*** s-tm-texi: build/genhooks$(build_exeext)
*** 2227,2232 
--- 2228,2242 
  false; \
fi
  
+ gimple-match.c: s-match; @true
+ 
+ s-match: build

Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Eric Botcazou
> After some investigation, we discovered that this behavior is caused by
> big hammer in gcc/cse.c:
> /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
> if (NONJUMP_INSN_P (insn)
> && volatile_insn_p (PATTERN (insn)))
>   flush_hash_table ();
> This code (introduced in
> http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) aborts CSE
> after seeing a volatile inline asm.

Note that "introduced" is not really correct here; the code had been
there for a long time, but it used to treat some volatile asms as
barriers and others not.  Now it treats them all as barriers.

> Is this compiler behavior reasonable? AFAIK GCC documentation only says
> that __volatile__ prevents compiler from removing the asm but it does
> not mention that it supresses optimization of all surrounding expressions.

This is not crystal clear, but the conservative interpretation is that you can 
use volatile asms to do really nasty things behind the back of the compiler:

/* Nonzero if X contains any volatile instructions.  These are instructions
   which may cause unpredictable machine state instructions, and thus no
   instructions or register uses should be moved or combined across them.
   This includes only volatile asms and UNSPEC_VOLATILE instructions.  */

int
volatile_insn_p (const_rtx x)

> If this behavior is not intended, what would be the best way to fix
> performance? I could teach GCC to not remove constant RTXs in
> flush_hash_table() but this is probably very naive and won't cover some
> corner-cases.

That could be a good starting point though.

-- 
Eric Botcazou


Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Richard Biener
On Thu, Feb 27, 2014 at 4:02 PM, Eric Botcazou wrote:
>> After some investigation, we discovered that this behavior is caused by
>> big hammer in gcc/cse.c:
>> /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
>> if (NONJUMP_INSN_P (insn)
>> && volatile_insn_p (PATTERN (insn)))
>>   flush_hash_table ();
>> This code (introduced in
>> http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) aborts CSE
>> after seeing a volatile inline asm.
>
> Note that "introduced" is not really correct here, the code had been there for
> a long time but it was treating some volatile asms as barriers and some others
> as not.  Now it treats them all as barriers.
>
>> Is this compiler behavior reasonable? AFAIK GCC documentation only says
>> that __volatile__ prevents compiler from removing the asm but it does
>> not mention that it supresses optimization of all surrounding expressions.
>
> This is not crystal clear, but the conservative interpretation is that you can
> use volatile asms to do really nasty things behind the back of the compiler:
>
> /* Nonzero if X contains any volatile instructions.  These are instructions
>which may cause unpredictable machine state instructions, and thus no
>instructions or register uses should be moved or combined across them.
>This includes only volatile asms and UNSPEC_VOLATILE instructions.  */
>
> int
> volatile_insn_p (const_rtx x)
>
>> If this behavior is not intended, what would be the best way to fix
>> performance? I could teach GCC to not remove constant RTXs in
>> flush_hash_table() but this is probably very naive and won't cover some
>> corner-cases.
>
> That could be a good starting point though.

Though with modifying "machine state" you can modify constants as well, no?

Richard.

> --
> Eric Botcazou


Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Yury Gribov

Richard Biener wrote:

If this behavior is not intended, what would be the best way to fix
performance? I could teach GCC to not remove constant RTXs in
flush_hash_table() but this is probably very naive and won't cover some
corner-cases.


That could be a good starting point though.


Though with modifying "machine state" you can modify constants as well, no?


Valid point, but this would mean relying on the compiler to always load
all constants from memory (instead of, say, generating them via
movhi/movlo) for a piece of code that looks extremely fragile.


What is the general attitude towards volatile asm? Are people interested 
in making it more defined/performant or should we just leave this can of 
worms as is? I can try to improve generated code but my patches will be 
doomed if there is no consensus on what volatile asm actually means...


-Y


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Torvald Riegel
On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> wrote:
> >
> > Good points.  How about the following replacements?
> >
> > 3.  Adding or subtracting an integer to/from a chained pointer
> > results in another chained pointer in that same pointer chain.
> > The results of addition and subtraction operations that cancel
> > the chained pointer's value (for example, "p-(long)p" where "p"
> > is a pointer to char) are implementation defined.
> >
> > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > applied to a chained pointer and an integer for the purposes
> > of alignment and pointer translation results in another
> > chained pointer in that same pointer chain.  Other uses
> > of bitwise operators on chained pointers (for example,
> > "p|~0") are implementation defined.
> 
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
> 
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
> 
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> depdendencies is just fundamentally wrong.

[...]

> The *only* thing that matters for all of them is whether they are
> "value-preserving", or whether they drop so much information that the
> compiler might decide to use a control dependency instead. That's true
> for every single one of them.
> 
> Similarly, actual true control dependencies that limit the problem
> space sufficiently that the actual pointer value no longer has
> significant information in it (see the above example) are also things
> that remove information to the point that only a control dependency
> remains. Even when the value itself is not modified in any way at all.

I agree that just considering syntactic properties of the program seems
to be insufficient.  Making it instead depend on whether there is a
"semantic" dependency due to a value being "necessary" to compute a
result seems better.  However, whether a value is "necessary" might not
be obvious, and I understand Paul's argument that he does not want to
have to reason about all potential compiler optimizations.  Thus, I
believe we need to specify when a value is "necessary".

I have a suggestion for a somewhat different formulation of the feature
that you seem to have in mind, which I'll discuss below.  Excuse the
verbosity of the following, but I'd rather like to avoid
misunderstandings than save a few words.


What we'd like to capture is that a value originating from a mo_consume
load is "necessary" for a computation (e.g., it "cannot" be replaced
with value predictions and/or control dependencies); if that's the case
in the program, we can reasonably assume that a compiler implementation
will transform this into a data dependency, which will then lead to
ordering guarantees by the HW.

However, we need to specify when a value is "necessary".  We could say
that this is implementation-defined, and use a set of litmus tests
(e.g., like those discussed in the thread) to roughly carve out what a
programmer could expect.  This may even be practical for a project like
the Linux kernel that follows strict project-internal rules and pays a
lot of attention to what the particular implementations of compilers
expected to compile the kernel are doing.  However, I think this
approach would be too vague for the standard and for many other
programs/projects.


One way to understand "necessary" would be to say that if a mo_consume
load can result in more than V different values, then the actual value
is "unknown", and thus "necessary" to compute anything based on it.
(But this is flawed, as discussed below.)

However, how big should V be?  If it's larger than 1, atomic bool cannot
be used with mo_consume, which seems weird.
If V is 1, then Linus' litmus tests work (but Paul's doesn't; see
below), but the compiler must not try to predict more than one value.
This holds for any choice of V, so there always is an *additional*
constraint on code generation for operations that are meant to take part
in such "value dependencies".  The bigger V might be, the less likely it
should be for this to actually constrain a particular compiler's
optimizations (e.g., while it might be meaningful to use value
prediction for two or three values, it's probably not for 1000s).
Nonetheless, if we don't want to predict the future, we need to specify
V.  Given that we always have some constraint for code generation
anyway, and given that V > 1 might be an arbitrary-looking constraint
and disallows use on atomic bool, I believe V should be 1.

Furthermore, there is a problem in saying "a load can 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel wrote:
>
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".

I suspect it's hard to really strictly define, but at the same time I
actually think that compiler writers (and users, for that matter) have
little problem understanding the concept and intent.

I do think that listing operations might be useful to give good
examples of what is a "necessary" value, and - perhaps more
importantly - what can break the value from being "necessary".
Especially the gotchas.

> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Ok, I'm going to cut most of the verbiage since it's long and I'm not
commenting on most of it.

But

> Based on these thoughts, we could specify the new mo_consume guarantees
> roughly as follows:
>
> An evaluation E (in an execution) has a value dependency to an
> atomic and mo_consume load L (in an execution) iff:
> * L's type holds more than one value (ruling out constants
> etc.),
> * L is sequenced-before E,
> * L's result is used by the abstract machine to compute E,
> * E is value-dependency-preserving code (defined below), and
> * at the time of execution of E, L can possibly have returned at
> least two different values under the assumption that L itself
> could have returned any value allowed by L's type.
>
> If a memory access A's targeted memory location has a value
> dependency on a mo_consume load L, and an action X
> inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

> Regarding the latter, we make a fresh start at each mo_consume load (ie,
> we assume we know nothing -- L could have returned any possible value);
> I believe this is easier to reason about than other scopes like function
> granularities (what happens on inlining?), or translation units.  It
> should also be simple to implement for compilers, and would hopefully
> not constrain optimization too much.
>
> [...]
>
> Paul's litmus test would work, because we guarantee to the programmer
> that it can assume that the mo_consume load would return any value
> allowed by the type; effectively, this forbids the compiler analysis
> Paul thought about:

So realistically, since with the new wording we can ignore the silly
cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
cases ("if (p == &variable) .. use p"), and you would forbid the
"global value range optimization case" that Paul bright up, what
remains would seem to be just really subtle compiler transformations
of data dependencies to control dependencies.

And the only such thing I can think of is basically compiler-initiated
value-prediction, presumably directed by PGO (since now if the value
prediction is in the source code, it's considered to break the value
chain).

The good thing is that, afaik, value prediction is largely not used in
real life.  There are lots of papers on it, but I don't think anybody
actually does it (although I can easily see some specint-specific
optimization pattern being built up around it).

And even value prediction is actually fine, as long as the compiler
can see the memory *source* of the value prediction (and it isn't a
mo_consume). So it really ends up limiting your value prediction in
very simple ways: you cannot do it to function arguments if they are
registers. But you can still do value prediction on values you loaded
from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPU's, even that "register
argument" limitation goes away.

So I agree that there is basically no real optimization constraint.
Value-prediction is of dubious value to begin with, and the actual
constraint on its use if some compiler writer really wants to is not
onerous.

> What I have in mind is roughly the following (totally made-up syntax --
> suggestions for how to do this properly are very welcome):
> * Have a type modifier (eg, like restrict), that specifies that
> operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great.
Note that my earlier suggestion to use "restrict" wasn't because I
believed the annotation itself would be visible, but basically just as
a legalistic promise to the compiler that *if* it found an 

Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Richard Sandiford
Yury Gribov writes:
> Richard Biener wrote:
 If this behavior is not intended, what would be the best way to fix
 performance? I could teach GCC to not remove constant RTXs in
 flush_hash_table() but this is probably very naive and won't cover some
 corner-cases.
>>>
>>> That could be a good starting point though.
>>
>> Though with modifying "machine state" you can modify constants as well, no?
>
> Valid point but this would mean relying on compiler to always load all 
> constants from memory (instead of, say, generating them via movhi/movlo) 
> for a piece of code which looks extremely unstable.

Right.  And constant rtx codes have mode-independent semantics.
(const_int 1) is always 1, whatever a volatile asm does.  Same for
const_double, symbol_ref, label_ref, etc.  If a constant load is implemented
using some mode-dependent operation then it would need to be represented
as something like an unspec instead.  But even then, the result would
usually be annotated with a REG_EQUAL note giving the value of the final
register result.  It should be perfectly OK to reuse that register after
a volatile asm if the value in the REG_EQUAL note is needed again.

> What is the general attitude towards volatile asm? Are people interested 
> in making it more defined/performant or should we just leave this can of 
> worms as is? I can try to improve generated code but my patches will be 
> doomed if there is no consensus on what volatile asm actually means...

I think part of the problem is that some parts of GCC (like the one you
noted) are far more conservative than others.  E.g. take:

  void foo (int x, int *y)
  {
y[0] = x + 1;
asm volatile ("# asm");
y[1] = x + 1;
  }

The extra-paranoid check you pointed out means that we assume that
x + 1 is no longer available after the asm for rtx-level CSE, but take
the opposite view for tree-level CSE, which happily optimises away the
second +.

Some places were (maybe still are) worried that volatile asms could
clobber any register they like.  But the register allocator assumes that
registers are preserved across volatile asms unless explicitly clobbered.
And AFAIK it always has.  So in the above example we get:

        addl    $1, %edi
        movl    %edi, (%rsi)
#APP
# 4 "/tmp/foo.c" 1
        # asm
# 0 "" 2
#NO_APP
        movl    %edi, 4(%rsi)
        ret

with %edi being live across the asm.

We do nothing this draconian for a normal function call, which could
easily use a volatile asm internally.  IMO anything that isn't flushed
for a call shouldn't be flushed for a volatile asm either.

One of the big grey areas is what should happen for floating-point ops
that depend on the current rounding mode.  That isn't really modelled
properly yet though.  Again, it affects calls as well as volatile asms.

Thanks,
Richard



Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread David Brown
On 27/02/14 16:36, Yury Gribov wrote:
> Richard Biener wrote:
>>>> If this behavior is not intended, what would be the best way to fix
>>>> performance? I could teach GCC to not remove constant RTXs in
>>>> flush_hash_table() but this is probably very naive and won't cover some
>>>> corner-cases.
>>>
>>> That could be a good starting point though.
>>
>> Though with modifying "machine state" you can modify constants as
>> well, no?
> 
> Valid point but this would mean relying on compiler to always load all
> constants from memory (instead of, say, generating them via movhi/movlo)
> for a piece of code which looks extremely unstable.
> 
> What is the general attitude towards volatile asm? Are people interested
> in making it more defined/performant or should we just leave this can of
> worms as is? I can try to improve generated code but my patches will be
> doomed if there is no consensus on what volatile asm actually means...
> 
> -Y
> 

In embedded development, volatile asm statements are unavoidable at
times.  In particular, "asm volatile ("" ::: "memory")" is the most
common memory barrier used, and it can turn up quite often.  I would
definitely consider the regression you found to be an issue.  And if it
is now the case that "asm volatile" causes a complete optimisation
barrier regardless of the clobber, this will definitely make code bigger
and slower in every case where you /don't/ "do really nasty things
behind the back of the compiler".



Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Michael Matz
Hi,

On Thu, 27 Feb 2014, Richard Sandiford wrote:

> [... many cases where 'volatile' in asm doesn't inhibit optimizations 
> ...]
> 
> We do nothing this draconian for a normal function call, which could
> easily use a volatile asm internally.  IMO anything that isn't flushed
> for a call shouldn't be flushed for a volatile asm either.

My view was always that the only semantic meaning of 'volatile' in a new 
style asm is that it's not going to be optimized away, even if all 
obvious outputs from the constraints are dead.  Or IOW, the fact that a 
volatile asm can have more side-effects than described by the constraints 
doesn't give the asm's author leeway to wreak havoc on the machine (and in 
particular to change the runtime environment in which other instructions 
are executed).

I think that also matches the current inline asm documentation, which 
always mentions 'volatile asm' in connection with not removing it.  It also 
mentions "important side-effects", but I think changing machine state in 
such a way as to change the interpretation of register contents doesn't 
count as a side-effect.  That's like changing the stack pointer in a 
volatile asm.  If machine-state-changing side-effects were allowed in 
volatile asms, then I'd claim that changing the stack pointer should be 
allowed as well (it's just a register, right?).  And that's obvious bollocks.

(machine state changes are of course okay, if the author expects them and 
takes proper precautions)

> One of the big grey areas is what should happen for floating-point ops 
> that depend on the current rounding mode.  That isn't really modelled 
> properly yet though.  Again, it affects calls as well as volatile asms.

Right, which is why we should not talk about floating point control with 
respect to volatile asms.  _Nothing_ related to FPC is there ;)


Ciao,
Michael.


Re: About gsoc 2014 OpenMP 4.0 Projects

2014-02-27 Thread guray ozen
Hi Evgeny,

As I said, I'm working on source-to-source generation for my master's
thesis, but my compiler currently transforms C to CUDA, not PTX :)
For further information, I have uploaded my documents and code samples
regarding my master's thesis to https://github.com/grypp/macc-omp4.
I also added my benchmark results, which cover comparisons between CAPS
OpenACC and MACC. For now my compiler MACC gets better results than
CAPS OpenACC for the Jacobi application and the CG application from the
NAS parallel benchmarks.

Actually I had never thought about intermediate-language translation,
but it is a great idea for generating optimized code. As far as I know,
though, no NVIDIA architecture supports a SPIR backend yet, right?

What I understood is that GCC is currently working on SPIR code
generation to support OpenMP 4.0. So do you have any future plans to
generate PTX? The SPIR backend is very new; it was announced barely a
month ago, and I think your team has more experience with SPIR than I
do. Therefore I'm asking whether there is any project planned for a PTX
implementation.

By the way, I couldn't see any specific project regarding OpenMP 4.0 at
http://gcc.gnu.org/wiki/openmp. Actually I am wondering, for GSoC, which
area I am supposed to focus on?

Regards.
Güray Özen
~grypp



2014-02-26 8:20 GMT+01:00 Evgeny Gavrin :
> Hi Guray,
>
> There were two announcements: PTX-backend and OpenCL code generation.
> Initial PTX-patches can be found in mailing list and OpenCL experiments in
> openacc_1-0_branch.
>
> Regarding GSoC it would be nice, if you'll apply with your proposal on code
> generation.
> I think that projects aimed to improve generation of OpenCL or
> implementation of SPIR-backend are going to be useful for GCC.
>
> -
> Thanks,
> Evgeny.
>
>
> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> guray ozen
> Sent: Tuesday, February 25, 2014 3:27 PM
> To: gcc@gcc.gnu.org
> Subject: About gsoc 2014 OpenMP 4.0 Projects
>
> Hello,
>
> I'm a master's student in high-performance computing at the Barcelona
> Supercomputing Center, and I'm working on my thesis regarding the OpenMP
> accelerator model implementation in our compiler (OmpSs). I have almost
> finished implementing all the new directives to generate CUDA code, and
> the same implementation for OpenCL won't take much more work according to
> my design. But I haven't yet tried Intel MIC, APUs or other hardware
> accelerators :) Right now I'm benchmarking the kernel codes generated by
> my compiler. Although the output kernels are generally naive, the speedup
> is not bad at all. When I compare results with the HMPP OpenACC 3.2.x
> compiler, the speedups are almost the same, or in some cases my results
> are slightly better. That's why, this term, I am going to work on
> compiler-level and runtime-level optimizations for GPUs.
>
> When I looked at the GCC OpenMP 4.0 project, I couldn't see anything
> about code generation. Are you going to announce that later? Or should I
> apply for GSoC with my own ideas about code generation and device code
> optimizations?
>
> Güray Özen
> ~grypp
>


Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Andi Kleen
Andrew Haley  writes:

> Over the years there has been a great deal of traffic on these lists
> caused by misunderstandings of GCC's inline assembler.  That's partly
> because it's inherently tricky, but the existing documentation needs
> to be improved.
>
> dw  has done a fairly thorough reworking of
> the documentation.  I've helped a bit.


It would be nice if you could include some discussion of the LTO
reference problems.

Something like:

It is not legal to reference a static variable or function symbol from
assembler code, as the compiler may optimize unused symbols away.  For
inline asm in functions, such objects should instead be referred to
through "m" input (or output) operands.  For top-level asm, the
referenced symbol should be made global and marked with
__attribute__((externally_visible)).

And another common problem:

For top level asm there is no guarantee the compiler outputs the
statements in order.

[unless -fno-toplevel-reorder is specified, but I'm not sure we should mention that]

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> 
> On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> >  wrote:
> > >
> > > Good points.  How about the following replacements?
> > >
> > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > results in another chained pointer in that same pointer chain.
> > > The results of addition and subtraction operations that cancel
> > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > is a pointer to char) are implementation defined.
> > >
> > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > applied to a chained pointer and an integer for the purposes
> > > of alignment and pointer translation results in another
> > > chained pointer in that same pointer chain.  Other uses
> > > of bitwise operators on chained pointers (for example,
> > > "p|~0") are implementation defined.
> > 
> > Quite frankly, I think all of this language that is about the actual
> > operations is irrelevant and wrong.
> > 
> > It's not going to help compiler writers, and it sure isn't going to
> > help users that read this.
> > 
> > Why not just talk about "value chains" and that any operations that
> > restrict the value range severely end up breaking the chain. There is
> > no point in listing the operations individually, because every single
> > operation *can* restrict things. Listing individual operations and
> > depdendencies is just fundamentally wrong.
> 
> [...]
> 
> > The *only* thing that matters for all of them is whether they are
> > "value-preserving", or whether they drop so much information that the
> > compiler might decide to use a control dependency instead. That's true
> > for every single one of them.
> > 
> > Similarly, actual true control dependencies that limit the problem
> > space sufficiently that the actual pointer value no longer has
> > significant information in it (see the above example) are also things
> > that remove information to the point that only a control dependency
> > remains. Even when the value itself is not modified in any way at all.
> 
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".
> 
> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Thank you very much for putting this forward!  I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
is quite clearly way bogus.

One possible saving grace:  From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the current
bogosity.

> What we'd like to capture is that a value originating from a mo_consume
> load is "necessary" for a computation (e.g., it "cannot" be replaced
> with value predictions and/or control dependencies); if that's the case
> in the program, we can reasonably assume that a compiler implementation
> will transform this into a data dependency, which will then lead to
> ordering guarantees by the HW.
> 
> However, we need to specify when a value is "necessary".  We could say
> that this is implementation-defined, and use a set of litmus tests
> (e.g., like those discussed in the thread) to roughly carve out what a
> programmer could expect.  This may even be practical for a project like
> the Linux kernel that follows strict project-internal rules and pays a
> lot of attention to what the particular implementations of compilers
> expected to compile the kernel are doing.  However, I think this
> approach would be too vague for the standard and for many other
> programs/projects.

I agree that a number of other projects would have more need for this than
might the kernel.  Please understand that this is in no way denigrating
the intelligence of other projects' members.  It is just that many of
them have only recently started seriously thinking about concurrency.
In contrast, the Linux kernel community has been doing concurrency since
the mid-1990s.  Projects with less experience with concurrency will
probably need more help, from the compiler and from elsewhere as well.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
> >
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> 
> I suspect it's hard to really strictly define, but at the same time I
> actually think that compiler writers (and users, for that matter) have
> little problem understanding the concept and intent.
> 
> I do think that listing operations might be useful to give good
> examples of what is a "necessary" value, and - perhaps more
> importantly - what can break the value from being "necessary".
> Especially the gotchas.
> 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Ok, I'm going to cut most of the verbiage since it's long and I'm not
> commenting on most of it.
> 
> But
> 
> > Based on these thoughts, we could specify the new mo_consume guarantees
> > roughly as follows:
> >
> > An evaluation E (in an execution) has a value dependency to an
> > atomic and mo_consume load L (in an execution) iff:
> > * L's type holds more than one value (ruling out constants
> > etc.),
> > * L is sequenced-before E,
> > * L's result is used by the abstract machine to compute E,
> > * E is value-dependency-preserving code (defined below), and
> > * at the time of execution of E, L can possibly have returned at
> > least two different values under the assumption that L itself
> > could have returned any value allowed by L's type.
> >
> > If a memory access A's targeted memory location has a value
> > dependency on a mo_consume load L, and an action X
> > inter-thread-happens-before L, then X happens-before A.
> 
> I think this mostly works.
> 
> > Regarding the latter, we make a fresh start at each mo_consume load (ie,
> > we assume we know nothing -- L could have returned any possible value);
> > I believe this is easier to reason about than other scopes like function
> > granularities (what happens on inlining?), or translation units.  It
> > should also be simple to implement for compilers, and would hopefully
> > not constrain optimization too much.
> >
> > [...]
> >
> > Paul's litmus test would work, because we guarantee to the programmer
> > that it can assume that the mo_consume load would return any value
> > allowed by the type; effectively, this forbids the compiler analysis
> > Paul thought about:
> 
> So realistically, since with the new wording we can ignore the silly
> cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
> cases ("if (p == &variable) .. use p"), and you would forbid the
> "global value range optimization case" that Paul brought up, what
> remains would seem to be just really subtle compiler transformations
> of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first
"if (p == &variable) .. use p" litmus test.  All the ones I have found
thus far are OK for one of the following reasons:

1.  The comparison was against NULL, so you don't get to dereference
the pointer anyway.  About 80% are in this category.

2.  The comparison was against another pointer, but there were no
dereferences afterwards.  Here is an example of what these
can look like:

list_for_each_entry_rcu(p, &head, next)
if (p == &variable)
return; /* "p" goes out of scope. */

3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.

I am a bit nervous about #3.  Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4.  The pointed-to data is constant: (a) It was initialized at
boot time, (b) the update-side lock is held, (c) we are
running in a kthread and the data was initialized before the
kthread was created, (d) we are running in a module, and
the data was initialized during or before module-init time
for that module.  And many more besides, invo

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > 
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > > results in another chained pointer in that same pointer chain.
> > > > The results of addition and subtraction operations that cancel
> > > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > > is a pointer to char) are implementation defined.
> > > >
> > > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > > applied to a chained pointer and an integer for the purposes
> > > > of alignment and pointer translation results in another
> > > > chained pointer in that same pointer chain.  Other uses
> > > > of bitwise operators on chained pointers (for example,
> > > > "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > depdendencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started s

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 wrote:
>
> 3.  The comparison was against another RCU-protected pointer,
> where that other pointer was properly fetched using one
> of the RCU primitives.  Here it doesn't matter which pointer
> you use.  At least as long as the rcu_assign_pointer() for
> that other pointer happened after the last update to the
> pointed-to structure.
>
> I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
        data = p->val;

then the access of "p->val" is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though. If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway. And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant. No?

 Linus


Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Georg-Johann Lay

Yury Gribov schrieb:

> Richard Biener wrote:
>>>> If this behavior is not intended, what would be the best way to fix
>>>> performance? I could teach GCC to not remove constant RTXs in
>>>> flush_hash_table() but this is probably very naive and won't cover some
>>>> corner-cases.
>>>
>>> That could be a good starting point though.
>>
>> Though with modifying "machine state" you can modify constants as
>> well, no?
>
> Valid point but this would mean relying on compiler to always load all
> constants from memory (instead of, say, generating them via movhi/movlo)
> for a piece of code which looks extremely unstable.
>
> What is the general attitude towards volatile asm? Are people interested
> in making it more defined/performant or should we just leave this can of
> worms as is? I can try to improve generated code but my patches will be
> doomed if there is no consensus on what volatile asm actually means...


It's definitely a can of worms, in my opinion.

asm volatile + memory clobber should be the last-resort barrier; if you 
skip it in the compiler or change its semantics (pinned by the current 
documentation) at will, it's not unlikely you'll break existing code in 
favour of saving a few poor instructions.


For example, I had a case where a costly computation (a division on 
hardware that cannot divide) was moved into a section enclosed in asms 
which disabled / re-enabled interrupts.  This totally wrecked interrupt 
response times on the machine.


Notice that such a division had no side effects from the C side or from 
the compiler's point of view, but execution time and interrupt response 
times cannot be ignored by any software that does system programming.


Johann



Re: About gsoc 2014 OpenMP 4.0 Projects

2014-02-27 Thread Thomas Schwinge
Hi Güray!

Giving some pointers here (but this is not a complete list), to
announcements and a few discussion threads, that should already answer
some of your questions, give an idea who's currently working on what:
,
,
,
,
,
,
,
,
.


Grüße,
 Thomas




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>  wrote:
> >
> > 3.  The comparison was against another RCU-protected pointer,
> > where that other pointer was properly fetched using one
> > of the RCU primitives.  Here it doesn't matter which pointer
> > you use.  At least as long as the rcu_assign_pointer() for
> > that other pointer happened after the last update to the
> > pointed-to structure.
> >
> > I am a bit nervous about #3.  Any thoughts on it?
> 
> I think that it might be worth pointing out as an example, and saying
> that code like
> 
>    p = atomic_read(consume);
>    X;
>    q = atomic_read(consume);
>    Y;
>    if (p == q)
>         data = p->val;
> 
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.
> 
> I cannot for the life of me come up with a situation where this would
> matter, though. If "X" contains a fence, then that fence will be a
> stronger ordering than anything the consume through "p" would
> guarantee anyway. And if "X" does *not* contain a fence, then the
> atomic reads of p and q are unordered *anyway*, so then whether the
> ordering to the access through "p" is through p or q is kind of
> irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

T1: p = kmalloc(...);
    if (p == NULL)
        deal_with_it();
    p->a = 42;  /* Each field in its own cache line. */
    p->b = 43;
    p->c = 44;
    atomic_store_explicit(&gp1, p, memory_order_release);
    p->b = 143;
    p->c = 144;
    atomic_store_explicit(&gp2, p, memory_order_release);

T2: p = atomic_load_explicit(&gp2, memory_order_consume);
    r1 = p->b;  /* Guaranteed to get 143. */
    q = atomic_load_explicit(&gp1, memory_order_consume);
    if (p == q) {
        /* The compiler decides that q->c is the same as p->c. */
        r2 = p->c; /* Could get 44 on a weakly ordered system. */
    }

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

Thanx, Paul



gcc-4.8-20140227 is now available

2014-02-27 Thread gccadmin
Snapshot gcc-4.8-20140227 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20140227/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch 
revision 208207

You'll find:

 gcc-4.8-20140227.tar.bz2 Complete GCC

  MD5=b5890aae1adf2832872ea393421d4fea
  SHA1=a9076d3135a9e198a6db0ae06d8fb761a3c1973d

Diffs from 4.8-20140220 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
> > xagsmtp2.20140227154925.3...@vmsdvm9.vnet.ibm.com
> > 
> > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
> > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Good points.  How about the following replacements?
> > > >
> > > > 3.  Adding or subtracting an integer to/from a chained pointer
> > > > results in another chained pointer in that same pointer chain.
> > > > The results of addition and subtraction operations that cancel
> > > > the chained pointer's value (for example, "p-(long)p" where "p"
> > > > is a pointer to char) are implementation defined.
> > > >
> > > > 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> > > > applied to a chained pointer and an integer for the purposes
> > > > of alignment and pointer translation results in another
> > > > chained pointer in that same pointer chain.  Other uses
> > > > of bitwise operators on chained pointers (for example,
> > > > "p|~0") are implementation defined.
> > > 
> > > Quite frankly, I think all of this language that is about the actual
> > > operations is irrelevant and wrong.
> > > 
> > > It's not going to help compiler writers, and it sure isn't going to
> > > help users that read this.
> > > 
> > > Why not just talk about "value chains" and that any operations that
> > > restrict the value range severely end up breaking the chain. There is
> > > no point in listing the operations individually, because every single
> > > operation *can* restrict things. Listing individual operations and
> > > dependencies is just fundamentally wrong.
> > 
> > [...]
> > 
> > > The *only* thing that matters for all of them is whether they are
> > > "value-preserving", or whether they drop so much information that the
> > > compiler might decide to use a control dependency instead. That's true
> > > for every single one of them.
> > > 
> > > Similarly, actual true control dependencies that limit the problem
> > > space sufficiently that the actual pointer value no longer has
> > > significant information in it (see the above example) are also things
> > > that remove information to the point that only a control dependency
> > > remains. Even when the value itself is not modified in any way at all.
> > 
> > I agree that just considering syntactic properties of the program seems
> > to be insufficient.  Making it instead depend on whether there is a
> > "semantic" dependency due to a value being "necessary" to compute a
> > result seems better.  However, whether a value is "necessary" might not
> > be obvious, and I understand Paul's argument that he does not want to
> > have to reason about all potential compiler optimizations.  Thus, I
> > believe we need to specify when a value is "necessary".
> > 
> > I have a suggestion for a somewhat different formulation of the feature
> > that you seem to have in mind, which I'll discuss below.  Excuse the
> > verbosity of the following, but I'd rather like to avoid
> > misunderstandings than save a few words.
> 
> Thank you very much for putting this forward!  I must confess that I was
> stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
> is quite clearly way bogus.
> 
> One possible saving grace:  From discussions at the standards committee
> meeting a few weeks ago, there is some chance that the committee will
> be willing to do a rip-and-replace on the current memory_order_consume
> wording, without provisions for backwards compatibility with the current
> bogosity.
> 
> > What we'd like to capture is that a value originating from a mo_consume
> > load is "necessary" for a computation (e.g., it "cannot" be replaced
> > with value predictions and/or control dependencies); if that's the case
> > in the program, we can reasonably assume that a compiler implementation
> > will transform this into a data dependency, which will then lead to
> > ordering guarantees by the HW.
> > 
> > However, we need to specify when a value is "necessary".  We could say
> > that this is implementation-defined, and use a set of litmus tests
> > (e.g., like those discussed in the thread) to roughly carve out what a
> > programmer could expect.  This may even be practical for a project like
> > the Linux kernel that follows strict project-internal rules and pays a
> > lot of attention to what the particular implementations of compilers
> > expected to compile the kernel are doing.  However, I think this
> > approach would be too vague for the standard and for many other
> > programs/projects.
> 
> I agree that a number of other projects would have more need for this than
> might the kernel.  Please understand that this is in no way denigrating
> the intelligence of other projects' members.  It is just that many of
> them have only recently started s

Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread dw


On 2/27/2014 4:11 AM, Richard Sandiford wrote:

Andrew Haley  writes:

Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw  has done a fairly thorough reworking of
the documentation.  I've helped a bit.

Section 6.41 of the GCC manual has been rewritten.  It has become:

6.41 How to Use Inline Assembly Language in C Code
6.41.1 Basic Asm - Assembler Instructions with No Operands
6.41.2 Extended Asm - Assembler Instructions with C Expression Operands

We could simply post the patch to GCC-patches and have at it, but I
think it's better to discuss the document here first.  You can read it
at

http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
and affected html pages)

All comments are very welcome.

Thanks for doing this, looks like a big improvement.


Thanks, I did my best.  I appreciate you taking the time to review them.


A couple of comments:

The section on basic asms says:

   Do not expect a sequence of asm statements to remain perfectly
   consecutive after compilation. To ensure that assembler instructions
   maintain their order, use a single asm statement containing multiple
   instructions. Note that GCC's optimizer can move asm statements
   relative to other code, including across jumps.

The "maintain their order" might be a bit misleading, since volatile asms
(including basic asms) must always be executed in the original order.
Maybe this was meaning placement/address order instead?


This statement is based on this text from the existing docs:

"Similarly, you can't expect a sequence of volatile |asm| instructions 
to remain perfectly consecutive. If you want consecutive output, use a 
single |asm|."


I do not dispute what you are saying.  I just want to confirm that the 
existing docs are incorrect before making a change.  Also, see Andi's 
response re -fno-toplevel-reorder.


It seems to me that recommending "single statement" is both the 
clearest, and the safest approach here.  But I'm prepared to change my 
mind if there is consensus I should.



It might also be
worth mentioning that the number of instances of an asm in the output
may be different from the input.  (Can it increase as well as decrease?
I'm not sure off-hand, but probably yes.)


So, in the volatile section, how about something like this for decrease:

"GCC does not delete a volatile |asm| if it is reachable, but may delete 
it if it can prove that control flow never reaches the location of the 
instruction."


For increase (not quite sure where to put this yet):

"Under certain circumstances, GCC may duplicate your asm code as part of 
optimization.  This can lead to unexpected duplicate symbol errors 
during compilation if symbols or labels are being used. Using %=  (see 
Assembler Template) may help resolve this problem."



In the extended section:

   Unless an output operand has the '&' constraint modifier (see
   Modifiers), GCC may allocate it in the same register as an unrelated
   input operand, [...]

It could also use it for addresses in other (memory) outputs.


Ok.  But I'm not sure this really adds anything.  Having warned people 
that the register may be re-used unless '&' is used seems sufficient.



For:

   When using asmSymbolicNames for the output operands, you may use these
   names instead of digits.

it might be worth mentioning that you need the enclosing [...].


Done.



Thanks,
Richard






Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread dw


On 2/27/2014 9:35 AM, Andi Kleen wrote:

Andrew Haley  writes:


Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw  has done a fairly thorough reworking of
the documentation.  I've helped a bit.


It would be nice if you could include some discussion of the LTO
reference problems.

Something like:

It is not legal to reference a static variable or function symbol from
the assembler code, as the compiler may optimize unused symbols
away. For inline asm in functions, these should be referenced as "m" input
arguments. For top-level asm, the referenced symbol should be made global
and marked with __attribute__((externally_visible)).


What would you say to something like this:

"Since GCC does not parse the asm, it has no visibility of any static 
variables or functions it references.  This may result in those symbols 
getting discarded by GCC as unused.  To avoid this problem, list the 
symbols as inputs or outputs."



And another common problem:

For top level asm there is no guarantee the compiler outputs the
statements in order.


Well, basic asm (which is the only thing you can use at top level) 
already says:


"Do not expect a sequence of |asm| statements to remain perfectly 
consecutive after compilation. To ensure that assembler instructions 
maintain their order, use a single |asm| statement containing multiple 
instructions. Note that GCC's optimizer can move |asm| statements 
relative to other code, including across jumps. "


Is something more needed?


[unless -fno-toplevel-reorder is specified, but I'm not sure we should mention that]

-Andi




Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread dw


On 2/27/2014 4:22 AM, Kyrill Tkachov wrote:

On 27/02/14 11:07, Andrew Haley wrote:

Over the years there has been a great deal of traffic on these lists
caused by misunderstandings of GCC's inline assembler.  That's partly
because it's inherently tricky, but the existing documentation needs
to be improved.

dw  has done a fairly thorough reworking of
the documentation.  I've helped a bit.

Section 6.41 of the GCC manual has been rewritten.  It has become:

6.41 How to Use Inline Assembly Language in C Code
6.41.1 Basic Asm - Assembler Instructions with No Operands
6.41.2 Extended Asm - Assembler Instructions with C Expression Operands

We could simply post the patch to GCC-patches and have at it, but I
think it's better to discuss the document here first.  You can read it
at

http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
and affected html pages)

All comments are very welcome.

Hi Andrew, dw,

Thanks for doing this!


Thanks for taking the time to review it.



In the Extended Asm documentation: Other format strings section:
"'%=' outputs a number that is unique to each instruction in the 
entire compilation."


I find the term 'instruction' to be confusing here. From what I 
understand the number is unique to each asm statement, which may 
contain multiple assembly instructions. IMHO it would be clearer to 
say "unique to each asm statement"


I'm not sure your text quite gets us there either.  If (as Richard 
suggests), the asm can get duplicated, I'd expect each to get a unique 
value for %=.  And I'd want to be clear about what happens if you do 
#define DO_SOMETHING asm("%=":).


How would you feel about:

"'%=' outputs a number that is unique to each instance of the asm 
statement in the entire compilation."




Kyrill




Andrew.









Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Andi Kleen
dw  writes:
>
> What would you say to something like this:
>
> "Since GCC does not parse the asm, it has no visibility of any static
> variables or functions it references.  This may result in those
> symbols getting discarded by GCC as unused.  To avoid this problem,
> list the symbols as inputs or outputs."

output makes no sense I think, only input.

You still need the part about the top-level asm, where input
doesn't work.

>
>> And another common problem:
>>
>> For top level asm there is no guarantee the compiler outputs the
>> statements in order.
>
> Well, basic asm (which is the only thing you can use at top level)
> already says:
>
> "Do not expect a sequence of |asm| statements to remain perfectly
> consecutive after compilation. To ensure that assembler instructions
> maintain their order, use a single |asm| statement containing multiple
> instructions. Note that GCC's optimizer can move |asm| statements
> relative to other code, including across jumps. "
>
> Is something more needed?

Yes, it should be made clear that this applies to top-level asm
too.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Yuri Gribov
> asm volatile + memory clobber should be the last resort barrier, if you skip
> this out of the compiler or change its semantics (pinned by the current
> documentation) at will, it's not unlikely you break existing code in favour
> of saving some poor instructions.

The problem is that there are no current semantics. As Richard pointed out,
RTL CSE just happens (probably for historical reasons) to stop at a
volatile asm, but this is not documented anywhere, and people are
certainly not (and probably never were) recommended to rely on this
behavior. Other GCC optimizations may not behave the same way.

> For example, I had the case that a costly computation (division on
> hardware that cannot divide) was moved into a section enclosed in asms which
> disabled / re-enabled interrupts.  This totally wrecked interrupt response
> times on the machine.

Looks like a good example of volatile asm not being an optimization barrier.

> Notice that such a division had no side effects from the C side or from the
> compiler's point of view, but execution time and interrupt response times
> cannot be ignored by any software that does system programming.

I agree that there's probably a need for a construct that would
prevent code motion. But I'm not sure whether the current volatile asm
is intended for (or even capable of) this.

-Y


RE: Asm volatile causing performance regressions on ARM

2014-02-27 Thread Pavel Fedin
 Hello!

> > This code (introduced in
> > http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) aborts
> > CSE after seeing a volatile inline asm.
> 
> Note that "introduced" is not really correct here, the code had been
> there for a long time but it was treating some volatile asms as
> barriers and some others as not.  Now it treats them all as barriers.

 Yes, actually you are right. This behavior really was there for a while;
just the triggering condition has changed. On older gcc versions we could
also reproduce this behavior by changing 'asm("":::"memory")' to
'asm("":::)' in our test example. It looks like a volatile asm with an empty
clobber list was previously considered even more strict than one with an
explicit "memory" clobber.
 So, the main question is not about the triggering condition, but about the
behavior itself. Is it correct to flush and reload all constants? They are
constants after all; they are not even stored in the .data section but
inlined in the code, and thus cannot be modified.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: Request for discussion: Rewrite of inline assembler docs

2014-02-27 Thread Richard Sandiford
dw  writes:
> On 2/27/2014 4:11 AM, Richard Sandiford wrote:
>> Andrew Haley  writes:
>>> Over the years there has been a great deal of traffic on these lists
>>> caused by misunderstandings of GCC's inline assembler.  That's partly
>>> because it's inherently tricky, but the existing documentation needs
>>> to be improved.
>>>
>>> dw  has done a fairly thorough reworking of
>>> the documentation.  I've helped a bit.
>>>
>>> Section 6.41 of the GCC manual has been rewritten.  It has become:
>>>
>>> 6.41 How to Use Inline Assembly Language in C Code
>>> 6.41.1 Basic Asm - Assembler Instructions with No Operands
>>> 6.41.2 Extended Asm - Assembler Instructions with C Expression Operands
>>>
>>> We could simply post the patch to GCC-patches and have at it, but I
>>> think it's better to discuss the document here first.  You can read it
>>> at
>>>
>>> http://www.LimeGreenSocks.com/gcc/Basic-Asm.html
>>> http://www.LimeGreenSocks.com/gcc/Extended-Asm.html
>>> http://www.LimeGreenSocks.com/gcc/extend04.zip (contains .texi, .patch,
>>> and affected html pages)
>>>
>>> All comments are very welcome.
>> Thanks for doing this, looks like a big improvement.
>
> Thanks, I did my best.  I appreciate you taking the time to review them.
>
>> A couple of comments:
>>
>> The section on basic asms says:
>>
>>Do not expect a sequence of asm statements to remain perfectly
>>consecutive after compilation. To ensure that assembler instructions
>>maintain their order, use a single asm statement containing multiple
>>instructions. Note that GCC's optimizer can move asm statements
>>relative to other code, including across jumps.
>>
>> The "maintain their order" might be a bit misleading, since volatile asms
>> (including basic asms) must always be executed in the original order.
>> Maybe this was meaning placement/address order instead?
>
> This statement is based on this text from the existing docs:
>
> "Similarly, you can't expect a sequence of volatile |asm| instructions 
> to remain perfectly consecutive. If you want consecutive output, use a 
> single |asm|."
>
> I do not dispute what you are saying.  I just want to confirm that the 
> existing docs are incorrect before making a change.  Also, see Andi's 
> response re -fno-toplevel-reorder.
>
> It seems to me that recommending "single statement" is both the 
> clearest, and the safest approach here.  But I'm prepared to change my 
> mind if there is consensus I should.

Right.  I agree with that part.  I just thought that the "maintain their
order" could be misunderstood as meaning execution order, whereas I think
both sentences of the original docs were talking about being "perfectly
consecutive" (which to me means "there are no other instructions inbetween").
Maybe a wordsmithed version of:

Do not expect a sequence of asm statements to remain perfectly
consecutive after compilation. If you want to stop the compiler
inserting anything into a sequence of assembler instructions,
you should put those instructions in a single asm statement. [...]
  
>> It might also be
>> worth mentioning that the number of instances of an asm in the output
>> may be different from the input.  (Can it increase as well as decrease?
>> I'm not sure off-hand, but probably yes.)
>
> So, in the volatile section, how about something like this for decrease:
>
> "GCC does not delete a volatile |asm| if it is reachable, but may delete 
> it if it can prove that control flow never reaches the location of the 
> instruction."

It's not just that though.  AIUI it would be OK for:

  if (foo)
{
  ...
  asm ("x");
}
  else
{
  ...
  asm ("x");
}

to become:

  if (foo)
...
  else
...
  asm ("x");

> For increase (not quite sure where to put this yet):
>
> "Under certain circumstances, GCC may duplicate your asm code as part of 
> optimization.  This can lead to unexpected duplicate symbol errors 
> during compilation if symbols or labels are being used. Using %=  (see 
> Assembler Template) may help resolve this problem."

Sounds good.

>> In the extended section:
>>
>>Unless an output operand has the '&' constraint modifier (see
>>Modifiers), GCC may allocate it in the same register as an unrelated
>>input operand, [...]
>>
>> It could also use it for addresses in other (memory) outputs.
>
> Ok.  But I'm not sure this really adds anything.  Having warned people 
> that the register may be re-used unless '&' is used seems sufficient.

It matters where it can be reused though.  If you talk about input
operands only, people might think it is OK to write asms of the form:

   foo tmp,[input0]
   bar [output0],tmp
   frob [output1],tmp

where output0 is a register and output1 is a memory.  This safely avoids
using the input operand after assigning to output0, but the address in
output1 is still live and could be changed by bar.

Thanks,
Richard