SoC Project: Propagating array data dependencies from Tree-SSA to RTL
Melnik: http://gcc.gnu.org/ml/gcc-patches/2005-11/msg01518.html -- Alexander Monakov
Re: SoC Project: Propagating array data dependencies from Tree-SSA to RTL
On Sun, 25 Mar 2007, Daniel Berlin wrote:
> Ayal has not signed up to be a mentor (as of yet). If he doesn't, i'd be
> happy to mentor you here, since i wrote part of tree-data-ref.c

Thanks, I'll be very glad to work with you.

On Mon, 26 Mar 2007, Ayal Zaks wrote:
> Sorry, I fear I may have too little time to devote to this; plus, it would
> be very useful to start with a good understanding of tree-data-ref from
> which to start propagating the dep info. Vladimir Yanovsky and I will be
> able to help when it comes to what/how to feed the modulo scheduler.

Thank you for your attention. I hope I will have a chance to ask you for help in the frame of the GSoC project.

-- Alexander Monakov
Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:
> I've not seen any follow-up to this version. Should we go ahead and adopt
> this?

Can we please go with 'committed' (lowercase) rather than all-caps COMMITTED? Spelling this with all-caps seems like a recent thing on gcc-patches; before, everyone used the lowercase version, which makes more sense (no need to shout about the thing that didn't need any discussion before applying the patch).

Also, while tools like 'git format-patch' will automatically put [PATCH] in the subject, for '[COMMITTED]' it will be the human typing that out, and it makes little sense to require people to meticulously type that out in caps, especially when the previous practice was the opposite.

Thanks.
Alexander
Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:
> Upper case is what glibc has, though it appears that it's a rule that is not
> strictly followed. If we change it, then it becomes another friction point
> between developer groups. Personally, I'd leave it as is, then turn a blind
> eye to such minor non-conformance.

In that case can we simply say that both 'committed' and 'COMMITTED' are okay, if we know glibc doesn't follow that rule and anticipate we will not follow it either?

Thanks.
Alexander
Re: Missed optimization with endian and alignment independent memory access on x64
On Thu, 6 Feb 2020, Moritz Strübe wrote:
> Why is this so hard optimize? As it's quite a common pattern I'd expect that
> there would be at least some hand-coded special case optimizer. (This isn't
> criticism - I'm honestly curious.) Or is there a reason gcc shouldn't optimize
> this / Why it doesn't matter that this is missed?

The compiler would need to exploit the fact that signed overflow is undefined, or deduce that it cannot happen. Imagine what happens in a more general case if i is INT_MAX (so without undefined overflow i+1 would be INT_MIN):

int f(unsigned char *ptr, int i)
{
  return ptr[i] | ptr[i+1] << 8;
}

With a 64-bit address space this might access two bytes 4GB apart. But you're right that it's a missed optimization in GCC, so you can file it in the GCC Bugzilla.

> Is there a way to write such code that gcc optimizes?

Simply write a function that accepts one pointer:

int load_16be(unsigned char *ptr)
{
  return ptr[0] << 8 | ptr[1];
}

and use it as load_16be(data+i) or load_16be(&data[i]).

> From a performance point of view: If I actually need two consecutive bytes,
> wouldn't it be better to load them as word and split them at the register
> level?

The question is not entirely clear to me, but usually the answer is that it depends on the microarchitecture and the details of the computations that need to be done with the loaded values. Often you'd need more than one instruction to "split" the wide load, so it wouldn't be profitable.

Alexander
Re: Branch instructions that depend on target distance
On Mon, 24 Feb 2020, Andreas Schwab wrote:
> On Feb 24 2020, Petr Tesarik wrote:
> > On Mon, 24 Feb 2020 12:29:40 +0100
> > Andreas Schwab wrote:
> >
> >> On Feb 24 2020, Petr Tesarik wrote:
> >>
> >> > This works great ... until there's some inline asm() statement, for
> >> > which gcc cannot keep track of the length attribute, so it is probably
> >> > taken as zero.
> >>
> >> GCC computes it by counting the number of asm insns. You can use
> >> ADJUST_INSN_LENGTH to adjust this as needed.
> >
> > Hmm, that's interesting, but does it work for inline asm() statements?
>
> Yes, for a suitable definition of work.
>
> > The argument is essentially a free-form string (with some
> > substitution), and the compiler cannot know how many bytes they occupy.
>
> That's why ADJUST_INSN_LENGTH can adjust it.

I think Petr might be unaware of the fact that GCC counts the number of instructions in an inline asm statement by counting separators in the asm string. This may overcount when a separator appears in a string literal, for example, but triggering under-counting is trickier.

Petr, please see https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html for some more discussion.

Alexander
GSoC topic: precise lifetimes in GIMPLE
Hi,

following the conversation in PR 90348, I wonder if it would make sense to suggest the idea presented there as a potential GSoC topic? Like this:

Enhance GIMPLE IR to represent lifetimes explicitly

At the moment, GCC's internal representation GIMPLE lacks precise lifetime information for addressable variables: GIMPLE marks the end of a variable's lifetime with the so-called "GIMPLE clobber" statement, corresponding to the point where the variable goes out of scope in the original program. However, the event of the "birth" of a variable (where it comes into scope) is lost, making the IR ambiguous and opening the door to invalid optimizations, as shown in bug 90348. The project would entail inventing a way to represent "lifetime start" in GIMPLE, adjusting front ends to emit it, implementing a verifier to check that optimizations do not move references outside of a variable's lifetime, and potentially enhancing optimizations to move lifetime markers, expanding the lifetime where necessary.

I know we already have good project ideas, and I suspect this idea may be too complicated for GSoC, but on the other hand it sounds useful and gives an "experimental" topic that may be interesting for students. What do you think?

Thanks.
Alexander
Re: GSoC topic: precise lifetimes in GIMPLE
On Mon, 2 Mar 2020, Richard Biener wrote:
> PR90348 is certainly entertaining. But I guess for a GSoC project
> we need a more elaborate implementation plan. The above suggesting
> of a "lifetime start" is IMHO a no-go btw. Instead I think the
> only feasible way is to make all references indirect and thus
> make both "allocation" and "deallocation" points explicit. Then
> there's a data dependence on the "allocation" statement which
> provides upward safety and the "deallocation" statement would
> need to act as a barrier in some to be determined way. That is,
> how to make optimizers preserve the lifetime end is still unsolved.

I think a verifier that ensures that all references are dominated by "lifetime start" and post-dominated by clobbers/lifetime-end would be a substantial step towards that. Agreed that a data dependence on the allocation would automatically ensure part of that verification, but then the problem with deallocation remains, as you say.

> IMHO whatever we do should combine with the CLOBBERs we have now,
> not be yet another mechanism.

This seems contradictory with the ideas in your previous paragraph. I agree though: CLOBBER-as-lifetime-end makes sense and does not call for a replacement.

Thanks.
Alexander
Clarifying attribute-const
Hello,

I'd like to ask for community input regarding __attribute__((const)) (and "pure", where applicable). My main goal is to clarify unclear cases and improve existing documentation, if possible.

First, a belated follow-up to https://gcc.gnu.org/PR66512 . The bug is asking why attribute-const appears to have a weaker effect in C++, compared to C. The answer in that bug is that GCC assumes that an attribute-const function can terminate by throwing an exception. That doesn't actually seem reasonable. Consider that the C counterpart to throwing is longjmp; it seems to me that GCC should behave consistently: either assume that attribute-const may both longjmp and throw (I guess nobody wants that), or that it may neither longjmp nor throw. Intuitively, if "const" means "free of side effects so that calls can be moved speculatively or duplicated", then non-local control flow transfer via throwing should be disallowed as well. In any case, it would be nice if the intended compiler behavior could be explicitly stated in the manual.

Second, there is an interesting mismatch between documentation and existing usage. Among the most prominent users of the attribute there are two glibc functions: __errno_location(void) and pthread_self(void). Both return a pointer to thread-local storage, so the functions are not "const" globally in a multi-threaded process. A sufficiently advanced compiler can cause the following testcase to abort:

#include <pthread.h>
#include <errno.h>

static void *errno_pointer;

static void *thr(void *unused)
{
  errno_pointer = &errno;
  return 0;
}

int main()
{
  errno_pointer = &errno;
  pthread_t t;
  pthread_create(&t, 0, thr, 0);
  pthread_join(t, 0);
  if (errno_pointer == &errno)
    abort();
}

(errno_pointer is static, so the compiler can observe that it does not escape the translation unit, and all stores in the TU assign the same "const" value)

Does GCC need to be concerned about eventually "miscompiling" such cases? If not, can we document an explicit promise that attribute-const may include pointers-to-TLS?

Thanks.
Alexander
Re: Clarifying attribute-const
On Fri, 25 Sep 2015, Eric Botcazou wrote:
> > First, a belated follow-up to https://gcc.gnu.org/PR66512 . The bug is
> > asking why attribute-const appears to have a weaker effect in C++, compared
> > to C. The answer in that bug is that GCC assumes that attribute-const
> > function can terminate by throwing an exception.
>
> FWIW there is an equivalent semantics in Ada: the "const" functions can throw
> and the language explicitly allows them to be CSEd in this case, etc.

Can you expand on the "etc." a bit, i.e., may the compiler ...

- move a call to a "const" function above a conditional branch, causing a
  conditional throw to happen unconditionally?

- move a call to a "const" function below a conditional branch, causing an
  unconditional throw to happen only conditionally?

- reorder calls to "const" functions w.r.t. code with side effects, or other
  throwing functions?

(all of the above in the context of Ada)

Thanks.
Alexander
Re: nonnull, -Wnonnull, and do/while
On Tue, 16 Feb 2016, Marek Polacek wrote:
> Well, it's just that "s" has the nonnull attribute so the compiler thinks it
> should never be null in which case comparing it to null should be redundant.
> Doesn't seem like a false positive to me, but maybe someone else feels
> otherwise.

Please look at the posted code again:

static void f(const char *s)
{
  do {
    printf("%s\n", s);
    s = NULL;
  } while (s != NULL);
}

Since 's' is assigned to, the constraint from 'printf' is no longer useful for warning at the point of comparison. It clearly looks like a false positive to me.

Alexander
Re: who owns stack args?
On Wed, 24 Feb 2016, DJ Delorie wrote:
> The real question is: are stack arguments call-clobbered or
> call-preserved? Does the answer depend on the "pure" attribute?

The stack area holding stack arguments should belong to the callee for tail calls to work (the callee will trash that area when laying out arguments for the tail call; thanks to Rich Felker for pointing that out to me). Thus it cannot depend on attribute-pure.

Alexander
GCC Bugzilla whines broken?
Hello,

Can anyone quickly confirm whether the "whining" feature in the GCC Bugzilla is supposed to be functioning at the moment? The latest thread I could find indicates that it is actually supposed to be working: https://gcc.gnu.org/ml/gcc/2010-09/msg00569.html . However, I tried to set up a whine for myself a week ago, and it never produced the emails.

Actually, I want a different feature than whining: notifications for bugs matching a certain predicate, e.g. for a specific target; ideally being automatically Cc'ed to such bugs, with an option to un-cc myself if needed. I can somewhat emulate that with whine searches restricted to "last N days". Is anybody doing something like that?

Thanks.
Alexander
Re: out of bounds access in insn-automata.c
Hi,

On Thu, 24 Mar 2016, Bernd Schmidt wrote:
> On 03/24/2016 11:17 AM, Aldy Hernandez wrote:
> > On 03/23/2016 10:25 AM, Bernd Schmidt wrote:
> > > It looks like this block of code is written by a helper function that is
> > > really intended for other purposes than for maximal_insn_latency. Might
> > > be worth changing to
> > >   int insn_code = dfa_insn_code (as_a <rtx_insn *> (insn));
> > >   gcc_assert (insn_code <= DFA__ADVANCE_CYCLE);
> >
> > dfa_insn_code_* and friends can return > DFA__ADVANCE_CYCLE so I can't
> > put that assert on the helper function.
>
> So don't use the helper function? Just emit the block above directly.

Let me chime in :)

The function under scrutiny, maximal_insn_latency, was added as part of the selective scheduling merge; at the same time, output_default_latencies was factored out of output_internal_insn_latency_func, and the pair of new functions output_internal_maximal_insn_latency_func/output_maximal_insn_latency_func tried to mirror the existing pair output_internal_insn_latency_func/output_insn_latency_func.

In particular, output_insn_latency_func also invokes output_internal_insn_code_evaluation (twice, once for each argument). This means that the generated 'insn_latency' can also call 'internal_insn_latency' with DFA__ADVANCE_CYCLE in its arguments. However, 'internal_insn_latency' then has a specially emitted 'if' statement that checks whether either of the arguments is '>= DFA__ADVANCE_CYCLE', and returns 0 in that case. So ultimately the pre-existing code was checking '> DFA__ADVANCE_CYCLE' first and '>= DFA__ADVANCE_CYCLE' second (for no good reason as far as I can see), and when the new '_maximal_' functions were introduced, the second check was not duplicated in the new copy.

So, as long as we are not looking at hacking it up further, I'd like to clean up both functions at the same time. If calling the 'internal_' variants with DFA__ADVANCE_CYCLE is rare, extending 'default_insn_latencies' by one zero element corresponding to DFA__ADVANCE_CYCLE is a simple suitable fix. If either DFA__ADVANCE_CYCLE is not guaranteed to be rare, or extending the table in that style is undesired, I suggest creating a variant of 'output_internal_insn_code_evaluation' that performs a '>=' rather than '>' test in the first place, and using it in both output_insn_latency_func and output_maximal_insn_latency_func.

If acknowledged, I volunteer to regstrap on x86_64 and submit that in stage1. Thoughts?

Thanks.
Alexander
[PATCH] clean up insn-automata.c (was: Re: out of bounds access in insn-automata.c)
On Wed, 30 Mar 2016, Bernd Schmidt wrote:
> On 03/25/2016 04:43 AM, Aldy Hernandez wrote:
> > If Bernd is fine with this, I'm happy to retract my patch and any
> > possible followups. I'm just interested in having no path causing a
> > possible out of bounds access. If your patch will do that, I'm cool.
>
> I'll need to see that patch first to comment :-)

Here's the proposed patch. I've found that there's only one user of the current fancy logic in output_internal_insn_code_evaluation: handling of NULL_RTX and const0_rtx is only useful for 'state_transition' (generated by output_trans_func), so it's possible to inline the extended handling there, and handle only plain non-null rtx_insns in output_internal_insn_code_evaluation.

This change allows removing the extra checks and casting in output_internal_insn_latency_func, as done by the patch. As a nice bonus, it constrains the prototypes of three automata functions to accept rtx_insn * rather than just rtx.

Bootstrapped and regtested on x86_64, OK?

Thanks.
Alexander

	* genattr.c (main): Change 'rtx' to 'rtx_insn *' in prototypes of
	'insn_latency', 'maximal_insn_latency', 'min_insn_conflict_delay'.
	* genautomata.c (output_internal_insn_code_evaluation): Simplify.
	Move handling of non-insn arguments inline into the sole user:
	(output_trans_func): ...here.
	(output_min_insn_conflict_delay_func): Change 'rtx' to 'rtx_insn *'
	in emitted function prototype.
	(output_internal_insn_latency_func): Ditto.  Simplify.
	(output_internal_maximal_insn_latency_func): Ditto.  Delete
	always-unused argument.
	(output_insn_latency_func): Ditto.
	(output_maximal_insn_latency_func): Ditto.

diff --git a/gcc/genattr.c b/gcc/genattr.c
index 656a9a7..77380e7 100644
--- a/gcc/genattr.c
+++ b/gcc/genattr.c
@@ -240,11 +240,11 @@ main (int argc, const char **argv)
   printf ("/* Insn latency time on data consumed by the 2nd insn.\n");
   printf ("   Use the function if bypass_p returns nonzero for\n");
   printf ("   the 1st insn. */\n");
-  printf ("extern int insn_latency (rtx, rtx);\n\n");
+  printf ("extern int insn_latency (rtx_insn *, rtx_insn *);\n\n");
   printf ("/* Maximal insn latency time possible of all bypasses for this insn.\n");
   printf ("   Use the function if bypass_p returns nonzero for\n");
   printf ("   the 1st insn. */\n");
-  printf ("extern int maximal_insn_latency (rtx);\n\n");
+  printf ("extern int maximal_insn_latency (rtx_insn *);\n\n");
   printf ("\n#if AUTOMATON_ALTS\n");
   printf ("/* The following function returns number of alternative\n");
   printf ("   reservations of given insn. It may be used for better\n");
@@ -290,8 +290,8 @@ main (int argc, const char **argv)
   printf ("state_transition should return negative value for\n");
   printf ("the insn and the state). Data dependencies between\n");
   printf ("the insns are ignored by the function. */\n");
-  printf
-    ("extern int min_insn_conflict_delay (state_t, rtx, rtx);\n");
+  printf ("extern int "
+	  "min_insn_conflict_delay (state_t, rtx_insn *, rtx_insn *);\n");
   printf ("/* The following function outputs reservations for given\n");
   printf ("   insn as they are described in the corresponding\n");
   printf ("   define_insn_reservation. */\n");
diff --git a/gcc/genautomata.c b/gcc/genautomata.c
index dcde604..92c8b5c 100644
--- a/gcc/genautomata.c
+++ b/gcc/genautomata.c
@@ -8113,14 +8113,10 @@ output_internal_trans_func (void)
 /* Output code

-     if (insn != 0)
-       {
-         insn_code = dfa_insn_code (insn);
-         if (insn_code > DFA__ADVANCE_CYCLE)
-           return code;
-       }
-     else
-       insn_code = DFA__ADVANCE_CYCLE;
+     gcc_checking_assert (insn != 0);
+     insn_code = dfa_insn_code (insn);
+     if (insn_code >= DFA__ADVANCE_CYCLE)
+       return code;

    where insn denotes INSN_NAME, insn_code denotes INSN_CODE_NAME, and
    code denotes CODE.  */
@@ -8129,21 +8125,12 @@ output_internal_insn_code_evaluation (const char *insn_name,
 				      const char *insn_code_name,
 				      int code)
 {
-  fprintf (output_file, "\n  if (%s == 0)\n", insn_name);
-  fprintf (output_file, "    %s = %s;\n\n",
-	   insn_code_name, ADVANCE_CYCLE_VALUE_NAME);
-  if (collapse_flag)
-    {
-      fprintf (output_file, "\n  else if (%s == const0_rtx)\n", insn_name);
-      fprintf (output_file, "    %s = %s;\n\n",
-	       insn_code_name, COLLAPSE_NDFA_VALUE_NAME);
-    }
-  fprintf (output_file, "\n  else\n    {\n");
-  fprintf (output_file,
-	   "      %s = %s (as_a <rtx_insn *> (%s));\n",
-	   insn_code_name, DFA_INSN_CODE_FUNC_NAME, insn_name);
-  fprintf (output_file, "      if (%s > %s)\n        return %d;\n    }\n",
-	   insn_code_name, ADVANCE_CYCLE_VALUE_NAME, code);
+  fprintf (output_file, "  gcc_checking_assert
Re: [RFD] Extremely large alignment of read-only strings
On Wed, 27 Jul 2016, Thorsten Glaser wrote:

First of all, I think the option -malign-data=abi (new in GCC 5) addresses your need: it can be used to reduce the default (excessive) alignment to just the psABI-dictated value (you can play with this at https://gcc.godbolt.org even if you can't install GCC 5 locally). Note that, as with other ABI-affecting options, you need to consider the implications for linking with code you're not building yourself: if the other code expects bigger alignment, you'll have a bug.

One comment on your email below.

> After some (well, lots) more debugging, I eventually
> discovered -fdump-translation-unit (which, in the version
> I was using, also worked for C, not just C++), which showed
> me that the alignment was 256 even (only later reduced to
> 32 as that’s the maximum alignment for i386).

Most likely the quoted figures from GCC dumps are in bits, not bytes.

HTH
Alexander
Re: [libgomp] No references to env.c -> no libgomp construction
Hello,

On Tue, 29 Nov 2016, Sebastian Huber wrote:
> * env.c: Split out ICV definitions into...
> * icv.c: ...here (new file) and...
> * icv-device.c: ...here. New file.
>
> the env.c contains now only local symbols (at least for target *-rtems*-*):
> [...]
>
> Thus the libgomp constructor is not linked in into executables.

Thanks for the report. This issue affects only the static libgomp.a (and not on NVPTX, where env.c is deliberately empty).

I think the minimal solution here is to #include <env.c> from icv.c instead of compiling it separately (using <> inclusion rather than "" so that in the case of NVPTX we pick up the empty config/nvptx/env.c from the toplevel icv.c).

A slightly more involved but perhaps preferable approach is to remove config/nvptx/env.c, introduce a LIBGOMP_OFFLOADED_ONLY macro, and use it to guard the inclusion of env.c from icv.c (which then can use the #include "env.c" form).

Thanks.
Alexander
Re: How to avoid constant propagation into functions?
[adding gcc@ for the compiler-testsuite-related discussion, please drop either gcc@ or gcc-help@ from Cc: as appropriate in replies]

On Wed, 7 Dec 2016, Segher Boessenkool wrote:
> > For example, this might have impact on writing test for GCC:
> >
> > When I am writing a test with noinline + noclone then my
> > expectation is that no such propagation happens, because
> > otherwise a test might turn trivial...
>
> The usual ways to prevent that are to add some volatile, or an
> asm("" : "+g"(some_var)); etc.

No, that doesn't sound right. As far as I can tell from looking at the GCC testsuite, the prevailing way is actually the noinline+noclone combo, not the per-argument asms or volatiles.

This behavior is new in gcc-7 due to the new IPA-VRP functionality, so -fno-ipa-vrp restores the old behavior. I think from the testsuite perspective the situation got a bit worse due to this, as now in existing testcases stuff can get propagated where the testcase used noinline+noclone to suppress propagation. This means that some testcases may get weaker and no longer test what they were supposed to. And writing new testcases gets less convenient too.

However, this actually demonstrates how noinline+noclone was not future-proof, and in a way backfired now. Should there be, ideally, a single 'noipa' attribute encompassing noinline, noclone, -fno-ipa-vrp, -fno-ipa-ra and all future transforms using inter-procedural knowledge?

Alexander
Re: How to avoid constant propagation into functions?
On Wed, 7 Dec 2016, Richard Biener wrote:
> >Agreed, that's what I've been using in the past for glibc test cases.
> >
> >If that doesn't work, we'll need something else. Separate compilation
> >of test cases just to thwart compiler optimizations is a significant
> >burden, and will stop working once we have LTO anyway.
> >
> >What about making the function definitions weak? Would that be more
> >reliable?
>
> Adding attribute((used)) should do the trick. It introduces unknown callers
> and thus without cloning disables IPA.

Hm, depending on the case I think this may not be enough: it thwarts IPA on the callee side, but still allows the compiler to optimize the caller: for example, deduce that the callee is pure/const (in which case optimizations in the caller may cause it to be called fewer times than intended, or never at all), apply IPA-RA, or (perhaps in the future) deduce that the callee always returns non-NULL and optimize the caller accordingly.

I think attribute-weak works to suppress IPA on the caller side, but that is not a good solution because it also has an effect on linking semantics, may not be available on non-ELF platforms, etc.

Alexander
Re: How to avoid constant propagation into functions?
On Fri, 9 Dec 2016, Richard Biener wrote:
> Right, 'used' thwarts IPA on the callee side only. noclone and noinline are
> attributes affecting the caller side but we indeed miss attributes for the
> properties you mention above. I suppose adding a catch-all attribute for
> caller side effects (like we have 'used' for the callee side) would be a good
> idea.

For general uses, i.e. for testcases that ought to be portable across different compilers, I believe making a call through a volatile pointer already places a sufficient compiler barrier to prevent both caller- and callee-side analysis. That is, where you have

  int foo(int);
  foo(arg);

you could transform it to

  int foo(int);
  int (* volatile vpfoo)(int) = foo;
  vpfoo(arg);

While this also has the effect of forcing the call to be indirect, I think usually that should be acceptable. But for uses in the gcc testsuite, I believe an attribute is still needed.

Alexander
Re: [RFC] noipa attribute (was Re: How to avoid constant propagation into functions?)
On Thu, 15 Dec 2016, Jakub Jelinek wrote:
> So here is a proof of concept of an attribute that disables inlining,
> cloning, ICF, IPA VRP, IPA bit CCP, IPA RA, pure/const/throw discovery.
> Does it look reasonable? Anything still missing?

I'd like to suggest some additions to the extend.texi entry:

> --- gcc/doc/extend.texi.jj	2016-12-15 11:26:07.0 +0100
> +++ gcc/doc/extend.texi	2016-12-15 12:19:32.738996533 +0100
> @@ -2955,6 +2955,15 @@
>  asm ("");
>  (@pxref{Extended Asm}) in the called function, to serve as a special
>  side-effect.
>
> +@item noipa
> +@cindex @code{noipa} function attribute
> +Disable interprocedural optimizations between the function with this
> +attribute and its callers, as if the body of the function is not available
> +when optimizing callers and the callers are unavailable when optimizing
> +the body.  This attribute implies @code{noinline}, @code{noclone},
> +@code{no_icf} and @code{used} attributes and in the future might
> +imply further newly added attributes.

1. I believe the last sentence should call out that the effect of this attribute is not reducible to just the existing attributes, because suppression of IPA-RA and pure/const discovery is not expressible that way, and that is actually intended. Can this be added to clarify the intent:

  However, this attribute is not equivalent to a combination of other
  attributes, because its purpose is to suppress existing and future
  optimizations employing interprocedural analysis, including those that
  do not have an attribute suitable for disabling them individually.

(and perhaps remove '... and in the future might imply ...' from the quoted snippet, because the clarification makes it partially redundant)

2. Can we gently suggest to readers of the documentation that this was invented for use in the GCC testsuite, and encourage them to seek proper alternatives, e.g.:

  This attribute is exposed for the purpose of testing the compiler.
  In general it should be preferable to properly constrain code
  generation using the language facilities: for example, using separate
  compilation or calling through a volatile pointer achieves a similar
  effect in a portable way [ except in case of a sufficiently advanced
  compiler indistinguishable from an adversary ;) ]

Thanks.
Alexander
LTO remapping/deduction of machine modes of types/decls
Hello, Richard, Jakub, community,

May I join/restart the old discussion about machine mode remapping at LTO stream-in time.

To recap: when offloading to NVPTX was introduced, there was a problem due to differences in the set of supported modes (e.g. there was no 'XFmode' on NVPTX that would correspond to the 'long double' tree type node in GIMPLE LTO streams produced by an x86 host compiler). The current solution in GCC is to additionally stream a 'mode table' and use it to remap numeric mode identifiers during LTO stream-in in all trees that have modes. This is the solution initially outlined by Jakub in the message https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00226.html . In response to that, Richard said,

> I think (also communicated that on IRC) we should instead try not streaming
> machine-modes at all but generating them at stream-in time via layout_type
> or layout_decl.

and later in the thread also:

> I'm just looking for a way to make this less of a hack (and the LTO IL
> less target dependent). Not for GCC 5 for which something like your
> patch is probably ok, but for the future.

Now that we're in the future, I've asked Vlad Ivanishin (Cc'ed) to try and implement Richard's approach. The motivation is enhancing LTO for offloaded code, in particular to expose library code for inlining. In that scenario, the current scheme has a problem that WPA can arbitrarily mix LTO sections coming from libraries (where the modes don't need remapping) and LTO sections produced by the host compiler. Thus, mode_table would need to be only selectively applied during stream-in, based on the origin of the section. And we'd need to ensure that WPA duplicates mode tables across all ltrans units. In light of that, I felt that trying Richard's approach would be proper.

Actually, I don't know why the gimple/tree representation carries machine modes in the first place; it seems to be redundant information deducible from type information.
Vlad's current patch adds mode deduction for types and decls, matches the deduced mode against the streamed-in mode, and ICEs in case of a mismatch. To be clear, he's checking this for native LTO via lto-bootstrap, but nevertheless it's a nice way of gaining confidence that mode inference works as intended.

This seems to be fine for C, but in C++ we are seeing some hard-to-explain cases where the deduced BLKmode for a 7-byte-sized/4-byte-aligned base-class decl mismatches against the deduced DImode. The DImode would be obviously correct for an 8-byte-sized decl of the corresponding type, but the base-class decl does not have the 1 byte of padding in the tail. What's worse, the issue is just for the mode of the decl: the mode of the type is BLKmode, as we'd expect. Unfortunately, just adjusting the C++ frontend to place BLKmode on the decl too doesn't lead to immediate success, because decl modes have implications for debug info generation, and the compiler starts ICE'ing there instead. So we're hitting under-documented places in the compiler here, and personally I don't have the confidence to judge how they're intended to work.

Basically, for now my questions are:

1. Is there an intended invariant that decl modes should match type modes? It appears that if there was, the above situation with C++ base objects would be a violation.

2. Do you think we should continue digging in this direction?

I'm not sure how much it'd help the discussion, but for completeness Vlad's current patchset is provided as attachments. Patch 1/3 adds mode inference for types (only), patch 2 just reverts Jakub's additions of mode_table handling, and finally patch 3 adds mode inference for decls, adds checking against streamed-in modes, and shows where the attempted adjustments in the C++ frontend and debug info generation were. There are a few coding style violations; sorry; I hope they are not too distracting.

Thanks.
Alexander

From 58ad9d4d75cbc057c003c701ff3f0e6b8fa35e39 Mon Sep 17 00:00:00 2001
From: Vladislav Ivanishin
Date: Tue, 13 Dec 2016 14:58:26 +0300
Subject: [PATCH 1/3] Infer modes from types after LTO streaming-in

	* gcc/lto/lto.c: New function lto_infer_mode () which calls ...
	* gcc/stor-layout.c: ... the new function set_mode_for_type ().
	* gcc/stor-layout.h: Declare set_mode_for_type ().
---
 gcc/lto/lto.c     |  20 +
 gcc/stor-layout.c | 127 ++
 gcc/stor-layout.h |   2 +
 3 files changed, 149 insertions(+)

diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index 6718fbbe..cec54e3 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -1656,6 +1656,25 @@ unify_scc (struct data_in *data_in, unsigned from,
   return unified_p;
 }

+static void
+lto_infer_mode (tree type)
+{
+  if (!TYPE_P (type))
+    return;
+
+  if (!COMPLETE_TYPE_P (type) && TYPE_MODE (type) == VOIDmode)
+    return;
+
+  /* C++ FE has complex logic for laying out classes.  We don't have
+     the information here to reproduce the decision process (nor do we w
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> In my view mode is essential part of the type system.  It (sadly, but still)
> participates in many ABI decisions, but more importantly especially for
> floating point types it is the main source of information of what the type
> actually is, as just size and precision are nowhere near enough.
> The precision/size isn't able to carry information like whether the type is
> decimal or binary floating, what padding it has and where, what NaN etc.
> conventions it uses.  So trying to throw away modes and reconstruct them
> looks conceptually wrong to me.

I wonder if it's possible to have a small tag in tree types to distinguish between binary/decimal/fixed-point types. For prototyping, I was thinking about just looking at the type name, because non-IEEE-binary types have an easily recognizable prefix.

For padding and NaN variability, can you point me to the targets where the modes affect that? The "Machine Modes" chapter in the documentation doesn't give a hint (IFmode/KFmode are not documented there either).

Alternatively, is reconstructing all modes necessary in the first place? On the tree level GCC has explicit trees for all fundamental types like float_type_node. Is it possible to remap just those trees? Modes of composite types should be deducible, and modes of decls may be deducible from their types (not sure; why do decls have modes separately from their types, anyway?)

> One can also just use
> float __attribute__((mode (XFmode))) or float __attribute__((mode (TFmode)))
> or float __attribute__((mode (KFmode))) or IFmode etc., how do you want to
> differentiate between those?  And I don't see how this can help with the
> long double stuff for NVPTX offloading.  If user uses 80-bit long double
> (or mode(XFmode) floats/doubles) in his source, then as PTX only has SFmode
> and DFmode (perhaps also HFmode?), the only way to get it working is through
> emulation (whether soft-fp, or writing some emulation using double,
> whatever).
> Pretending long double on the host is DFmode on the PTX side
> just won't work, they have different representation.

(yes, the PTX spec has half floats, but GCC doesn't implement those on PTX today, and thus doesn't have HFmode now)

For attribute-mode, I wasn't aware of the KFmode/IFmode stuff; wherever the modes affect semantics without leaving any other trace in the type, leaving out the mode loses information. So either one keeps the modes, or adds sufficient tagging in the type tree.

For long double, I think offloading to PTX should have the following semantics: size/alignment of long double matches those on the host; otherwise, storage layout of composite types won't match, and that's really bad. But otherwise long double is the same as double on PTX (so for offloading from x86-64 it has 64 bits of padding). This means that long double data is not transferable between accelerator and host, but otherwise this gives the most sane semantics I can imagine. I think this implies that XFmode/TFmode don't need to exist on NVPTX to back long_double_type_node.

Thanks.
Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote: > If the host has long double the same as double, sure, PTX can use its native > DFmode even for long double. But otherwise, the storage must be > transferable between accelerator and host. Hm, sorry, the 'must' is not obvious to me: is it known that the OpenMP ARB would find only this implementation behavior acceptable? Apart from floating-point types, are there other situations where modes carry information not deducible from the rest of the tree node? Thanks. Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> On Mon, Jan 02, 2017 at 09:49:55PM +0300, Alexander Monakov wrote:
> > On Mon, 2 Jan 2017, Jakub Jelinek wrote:
> > > If the host has long double the same as double, sure, PTX can use its
> > > native DFmode even for long double.  But otherwise, the storage must be
> > > transferable between accelerator and host.
> >
> > Hm, sorry, the 'must' is not obvious to me: is it known that the OpenMP ARB
> > would find only this implementation behavior acceptable?
>
> long double is not non-mappable type in the spec, so it is supposed to work.
> The implementation may choose not to offload whenever it sees long
> double/__float128/_Float128/_Float128x etc.

But this is not something the implementation can properly enforce; consider

  long double v;
  char buf[sizeof v];
  #pragma omp target map(from:buf)
  sscanf ("1.0", "%Lf", buf);
  memcpy (&v, buf, sizeof v);

The offloading compiler wouldn't see a 'long double' anywhere: it gets brought in at the linking stage. So the implementation would have to tag individual translation units and see only at the end of linking whether the offloaded image touches a long double datum anywhere. And as the example shows, it would prevent using printf-like functions.

Alexander
Re: TARGET_MACRO_FUSION_PAIR for something besides compare-and-branch ?
On Wed, 28 May 2014, Kyrill Tkachov wrote:

> Hi all,
>
> The documentation for TARGET_MACRO_FUSION_PAIR says that it can be used to
> tell the scheduler that two insns should not be scheduled apart. It doesn't
> specify what kinds of insns those can be.
>
> Yet from what I can see in sched-deps.c it can only be used on compares and
> conditional branches, as implemented in i386.

Please note that it's not only restricted to compares and conditional branches; it is also limited to keeping the instructions together only if they were consecutive in the first place (i.e. it does not try to move a compare insn closer to the branch). Doing it that way allowed the issue at hand to be solved at the time without a separate scan of the whole RTL instruction stream.

> Say I want to specify two other types of instruction that I want to force
> together, would it be worth generalising the TARGET_MACRO_FUSION_PAIR usage
> to achieve that?

I'd say yes, but that would be the least of the problems; the more important question is how to trigger the hook (you probably want to integrate it into the existing scheduler dependence evaluation loop rather than adding a new loop just to discover macro-fusable pairs). You'll also have to invent something new if you want to move non-consecutive fusable insns together when they are apart.

HTH.
Alexander
Re: Branch taken rate of Linux kernel compiled with GCC 4.9
On Tue, 13 Jan 2015, Pengfei Yuan wrote: > I use perf with rbf88:k,rff88:k events (Haswell specific) to profile > the taken rate of conditional branches in the kernel. Here are the > results: [...] > > The results are very strange because all the taken rates are greater > than 50%. Why not reverse the basic block reordering heuristics to > make them under 50%? Is there anything wrong with GCC? Your measurement includes the conditional branches at the end of loop bodies. When loops iterate, those branches are taken, and it doesn't make sense to reverse them. HTH Alexander
Re: Failure to dlopen libgomp due to static TLS data
There's a pending patch for glibc that addresses this issue among others: https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html ([BZ#17090/17620/17621]: fix DTV race, assert, and DTV_SURPLUS Static TLS limit) Alexander
Re: Android native build of GCC
> Given that info... and in spite of my aforementioned limited knowledge I
> went back to take another look at the source and found this in
> libfakechroot.c
>
> /bld/fakechrt/fakechroot-2.16 $ grep -C 4 dlsym src/libfakechroot.c
> /* Lazily load function */
> LOCAL fakechroot_wrapperfn_t fakechroot_loadfunc (struct fakechroot_wrapper * w)
> {
>     char *msg;
>     if (!(w->nextfunc = dlsym(RTLD_NEXT, w->name))) {;
>         msg = dlerror();
>         fprintf(stderr, "%s: %s: %s\n", PACKAGE, w->name, msg != NULL ? msg : "unresolved symbol");
>         exit(EXIT_FAILURE);
>     }
>
> I'm fairly certain I remember reading something about Android and lazy
> function loading... how it doesn't handle it well or does so differently
> from standard Linux builds. At any rate, I believe the above code is
> responsible for those annoying 'fakechroot: undefined reference to dlopen'
> errors, so I'll see if I can fix that.

In Android's Bionic libc, the implementation of dlopen() resides in the dynamic loader and is not present in libdl.so. So to obtain a pointer to dlopen, code like the above can use dlsym(RTLD_DEFAULT, "dlopen"), but not RTLD_NEXT (the loader precedes the fakeroot library in the lookup chain).

The preceding discussion seems to have libc and libdl switched. Normally the implementation of dlopen is found in libdl.so, but not in libc.so.

Hope that helps,
Alexander
Re: Android native build of GCC
On Sun, 15 Feb 2015, Cyd Haselton wrote:

> On Sun, Feb 15, 2015 at 12:41 PM, Cyd Haselton wrote:
> > *snip*
> >
> >> So to obtain the pointer to
> >> dlopen the code like above can use dlsym(RTLD_DEFAULT, "dlopen"), but not
> >> RTLD_NEXT (the loader precedes the fakeroot library in the lookup chain).
> >>
> *snip*
>
> Just a quick update: RTLD_DEFAULT is definitely not the solution here
> as it results in a segfault when libfakechroot loads.  Perhaps a
> different RTLD_FLAG was meant?

I think you need to use RTLD_DEFAULT only when resolving "dlopen", and keep RTLD_NEXT for all other symbol names (unless later on you run into other symbols with similar behavior). Something like

  ... = dlsym(strcmp(name, "dlopen") ? RTLD_NEXT : RTLD_DEFAULT, name);

Alexander
Re: Android native build of GCC
On Sun, 15 Feb 2015, Cyd Haselton wrote:

> On Sun, Feb 15, 2015 at 11:53 AM, Alexander Monakov wrote:
> >> Given that info... and in spite of my aforementioned limited knowledge I
> >> went back to take another look at the source and found this in
> >> libfakechroot.c
> >>
> >> /bld/fakechrt/fakechroot-2.16 $ grep -C 4 dlsym src/libfakechroot.c
> >> /* Lazily load function */
> >> LOCAL fakechroot_wrapperfn_t fakechroot_loadfunc (struct fakechroot_wrapper * w)
> >> {
> >>     char *msg;
> >>     if (!(w->nextfunc = dlsym(RTLD_NEXT, w->name))) {;
> >>         msg = dlerror();
> >>         fprintf(stderr, "%s: %s: %s\n", PACKAGE, w->name, msg != NULL ? msg : "unresolved symbol");
> >>         exit(EXIT_FAILURE);
> >>     }
> >>
> >> I'm fairly certain I remember reading something about Android and lazy
> >> function loading... how it doesn't handle it well or does so differently
> >> from standard Linux builds. At any rate, I believe the above code is
> >> responsible for those annoying 'fakechroot: undefined reference to dlopen'
> >> errors, so I'll see if I can fix that.
> >
> > In Android's Bionic libc, the implementation of dlopen() resides in the
> > dynamic loader, and is not present in libdl.so.
>
> Yet in Android's NDK documentation, they state that in order to use
> dlopen() functionality in native code you must link against libdl and
> include dlfcn.h.  Why would this be the case if the dlopen()
> implementation is not in libdl?
> (documentation link: http://www.kandroid.org/ndk/docs/STABLE-APIS.html)

That's the standard way of using dlopen, i.e. the same as you would do it on Linux with glibc, for example. So the link merely says that you can get dlopen the same way as usual. The difference is that Android's libdl only contains stub symbols for dlopen & co., and the real symbols can be looked up in the dynamic loader.

That RTLD_NEXT does not work for obtaining a pointer to dlopen, as it does on glibc, is quite unfortunate, and probably a bug in Bionic.

Alexander
Broken test gcc.target/i386/sibcall-2.c
Hello,

Last year's x86 sibcall improvements added a currently xfailed test:

/* { dg-do compile { target ia32 } } */
/* { dg-options "-O2" } */

extern int doo1 (int);
extern int doo2 (int);
extern void bar (char *);

int foo (int a)
{
  char s[256];
  bar (s);
  return (a < 0 ? doo1 : doo2) (a);
}

/* { dg-final { scan-assembler-not "call\[ \t\]*.%eax" { xfail *-*-* } } } */

It was xfailed by https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00016.html

Can you tell me what the test is supposed to test? A tail call is impossible here, because 'bar' might save the address of 's' in a global variable, and therefore 's' must be live when 'doo1' or 'doo2' is invoked.

Should we remove or unbreak this test?

Thanks.
Alexander
Re: Broken test gcc.target/i386/sibcall-2.c
Ah, I realize it's most likely for testing the sibcall_[value]_pop_memory peepholes, right? In which case the testcase might look like this:

/* { dg-do compile } */
/* { dg-options "-O2" } */

void foo (int a, void (**doo1) (void), void (**doo2) (void))
{
  char s[16] = {0};
  do
    s[a] = 1;
  while (a &= a-1);
  (*(s[8] ? doo1 : doo2)) ();
}

/* { dg-final { scan-assembler-not "call" } } */

However, on the above testcase a memory-indirect jump is currently generated only for 64-bit x86. With -mx32 it's impossible, but with -m32 the peephole doesn't match. Is that expected?

Can you also tell me why the ..._pop call and sibcall instructions are predicated on !TARGET_64BIT?

Thanks.
Alexander
Re: May 2015 Toolchain Update
Hello,

A couple of comments below.

On Mon, 18 May 2015, Nick Clifton wrote:

>   val |= ~0 << loaded;            // Generates warning
>   val |= (unsigned) ~0 << loaded; // Does not warn

To reduce verbosity, '~0u' can be used here instead of a cast.

> * GCC supports a new option: -fno-plt
>
>   When compiling position independent code this tells the compiler
>   not to use PLT for external function calls.  Instead the address
>   is loaded from the GOT and then branched to directly.  This
>   leads to more efficient code by eliminating PLT stubs and
>   exposing GOT load to optimizations.
>
>   Not all architectures support this option, and some other
>   optimization features, such as lazy binding, may disable it.

The last paragraph looks confusing to me on both points. '-fno-plt' is implemented as a transformation during Tree-SSA-to-RTL expansion, so it works in a machine-independent manner; it's a no-op only if the target has no way to turn on '-fPIC'. Is that what you meant?

Second, lazy binding is not an optimization feature of GCC (it's implemented as part of the dynamic linker, e.g. glibc's), so it's not quite right to say that -fno-plt would be disabled by it. The text I've added to the documentation says:

  Lazy binding requires PLT: with -fno-plt all external symbols are
  resolved at load time.

Thus, for code compiled with -fno-plt the dynamic linker would not be able to perform lazy binding (even if it was otherwise possible, e.g. -z now/-z relro weren't in effect, and profitable, i.e. the library was not already prelinked).

Alexander
Re: RFC: Creating a more efficient sincos interface
On Thu, 13 Sep 2018, Wilco Dijkstra wrote: > What do people think? Ideally I'd like to support this in a generic way so > all targets can > benefit, but it's also feasible to enable it on a per-target basis. Also > since not all libraries > will support the new interface, there would have to be a flag or configure > option to switch > the new interface off if not supported (maybe automatically based on the > math.h header). GCC already has __builtin_cexpi for this, so I think you can introduce cexpi implementation in libc, and then adjust expand_builtin_cexpi appropriately. I wonder if it would be possible to add a fallback cexpi implementation to libgcc.a that would be picked by the linker if there's no such symbol in libm? Alexander
libgcov as shared library and other issues
Hello,

Here's the promised "libgcov summary"; sorry about the delay.

So libgcov has a somewhat unusual design where:

- on the one hand, the library is static-only, has PIC code and may be linked into shared libraries;
- almost all gcov symbols have "hidden" visibility, so they don't participate in dynamic linking;
- on the other hand, the __gcov_master symbol deliberately has default visibility, presumably with the intention that a running program has exactly one instance of this symbol, although the exact motivation is unclear to me.

This latter point does not reliably work as intended, though: there are scenarios where a dynamically linked program will have multiple __gcov_masters anyway:

- via repeated dlopen(RTLD_LOCAL) with the main executable not linked against libgcov or not exporting libgcov symbols (as in PR 83879);
- when shared libraries have version scripts that hide their __gcov_master;
- when -Bsymbolic is in effect.

Additionally, indirect call profiling symbols are not hidden either, and that leads to extra complications. Since there are multiple symbols, during dynamic linking they may be partially interposed. PR 84107 demonstrates how this leads to libgcov segfaulting in a fairly simple and legitimate program.

Bottom line: statically linking code with default-visibility symbols into shared libraries is problematic.

So one strategy is to ensure all gcov symbols have hidden visibility. That would isolate gcov instances in each shared library loaded in the program, and each library would have the responsibility to write out its counters when unloaded. Also, __gcov_dump would dump only the counters specific to the current library. I may be missing something here, so it might be nice to unearth why exactly __gcov_master is intended to be global.

Another strategy is to introduce libgcov.so and have it host either all libgcov symbols or just those that by design are required to exist once in the program.
When talking to Richi at the Cauldron, I got the impression he'd question whether a shared libgcov is worth the cost, e.g. whether it would make it any easier for users to mix two libraries, one linked against an older libgcov and another against a newer one (something that doesn't work at all now, but would be nice to support, if I understood Richard correctly).

Alexander
Re: Backporting gcc_qsort
On Mon, 1 Oct 2018, Jeff Law wrote: > To add a bit more context for Cory. > > Generally backports are limited to fixing regressions and serious code > generation bugs. While we do make some exceptions, those are good > general guidelines. > > I don't think the qsort changes warrant an exception. Personally I think in this case there isn't a strong reason to backport, the patch is fairly isolated, so individuals or companies that need it should have no problem backporting it on their own. Previously, Franz Sirl reported back in June they've used the patch to achieve matching output on their Linux-hosted vs Cygwin-hosted cross-compilers based on GCC 8: https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00751.html Alexander
Re: movmem pattern and missed alignment
On Mon, 8 Oct 2018, Michael Matz wrote:

> > Ok, but why is that not a bug?  The whole point of passing alignment to
> > the movmem pattern is to let it generate code that takes advantage of
> > the alignment.  So we get a missed optimization.
>
> Only if you somewhere visibly add accesses to *i and *j.  Without them you
> only have the "accesses" via memcpy, and as Richi says, those don't imply
> any alignment requirements.  The i and j pointers might validly be char*
> pointers in disguise and hence be in fact only 1-aligned.  I.e. there's
> nothing in your small example program from which GCC can infer that those
> two global pointers are in fact 2-aligned.

Well, it's not that simple. C11 6.3.2.3 p7 makes it undefined to form an 'int *' value that is not suitably aligned:

  A pointer to an object type may be converted to a pointer to a different
  object type. If the resulting pointer is not correctly aligned for the
  referenced type, the behavior is undefined.

So in addition to what you said, we should probably say that GCC decides not to exploit this UB in order to allow code to round-trip pointer values via arbitrary pointer types.

To put Michael's explanation in different words: this is not obviously a bug, because the static pointer type does not imply the dynamic pointed-to type. The caller of 'f1' could look like

void call_f1(void)
{
  short ibuf[20] = {0}, jbuf[20] = {0};
  i = (void *) ibuf;
  j = (void *) jbuf;
  f1();
}

and it's valid to memcpy from jbuf to ibuf: memcpy does not "see" the static pointer type, and works as if by dereferencing 'char *' pointers (although, as mentioned above, it's more subtly invalid when assigning to i and j).

If 'f1' dereferences 'i', GCC may deduce that the dynamic type of '*i' is 'int' and therefore 'i' must be suitably aligned. But in the absence of dereferences GCC does not make assumptions about dynamic type and alignment.

Alexander
Re: movmem pattern and missed alignment
On Tue, 9 Oct 2018, Richard Biener wrote: > >This had worked as Paul expects until GCC 4.4 IIRC and this was perfectly OK > >for every language on strict-alignment platforms. This was changed only > >because of SSE on x86. > > And because we ended up ignoring all pointer casts. It's not quite obvious what SSE has to do with this - any hint please? (according to my quick check this changed between gcc-4.5 and gcc-4.6) Alexander
Re: movmem pattern and missed alignment
On Tue, 9 Oct 2018, Richard Biener wrote:

> then we cannot set the alignment of i_1 at/after k = *i_1 because doing so
> would affect the alignment test which we'd then optimize away.  We'd need
> to introduce a SSA copy to get a new SSA name but that would be optimized
> away quickly.

We preserve __builtin_assume_aligned up to pass-fold-all-builtins, so would it work to emit it just before the memcpy,

  i_2 = __builtin_assume_aligned (i_1, 4);
  __builtin_memcpy (j, i_2, 32);

in theory?

Alexander
Re: avoidance of lea after 5 operations?
On Thu, 11 Oct 2018, Jason A. Donenfeld wrote: > > I realize this is probably a fairly trivial matter, but I am very > curious if somebody knows which heuristic gcc is applying here, and > why exactly. It's not something done by any other compiler I could > find, and it only started happening with gcc 6. It's a change in register allocation, gcc selects eax instead of esi for the shifts. Doesn't appear to be obviously intentional, could be a bug or bad luck. Alexander
Re: Is it a bug allowing to copy GIMPLE_ASM with labels?
On Sat, 29 Dec 2018, Bin.Cheng wrote: > tracer-1.c: Assembler messages: > tracer-1.c:16: Error: symbol `foo_label' is already defined > > Root cause is in tracer.c which duplicates basic block without > checking if any GIMPLE_ASM defines labels. > Is this a bug or invalid code? This is invalid code, GCC documentation is clear that the compiler may duplicate inline asm statements (passes other than tracer can do that too, loop unswitching just to give one example). We don't provide a way to write an asm that wouldn't be duplicated. Alexander
Re: Idea: extend gcc to save C from the hell of intel vector instructions
On Wed, 20 Feb 2019, Warren D Smith wrote: > but if I try to replace that with the nicer (since more portable) >c = __builtin_shuffle(a, b); > then > error: use of unknown builtin '__builtin_shuffle' > [-Wimplicit-function-declaration] Most likely you're on OS X and the 'gcc' command actually invokes Clang/LLVM. Clang does not implement this builtin (there's __builtin_shufflevector with a different interface — see Clang documentation for details). Alexander
Re: GCC turns &~ into | due to undefined bit-shift without warning
On Thu, 21 Mar 2019, Richard Biener wrote:

> > Maybe an example would help.
> >
> > Consider this code:
> >
> > for (int i = start; i < limit; i++) {
> >   foo(i * 5);
> > }
> >
> > Should GCC be entitled to turn it into
> >
> > int limit_tmp = i * 5;
> > for (int i = start * 5; i < limit_tmp; i += 5) {
> >   foo(i);
> > }
> >
> > If you answered "Yes, GCC should be allowed to do this", would you
> > want a warning? And how many such warnings might there be in a typical
> > program?
>
> I assume i is signed int.  Even then GCC may not do this unless it knows
> the loop is entered (start < limit).

Additionally, the compiler needs to prove that 'foo' always returns normally (i.e. cannot invoke exit/longjmp or such).

Alexander
Re: [RFC] Disabling ICF for interrupt functions
On Fri, 19 Jul 2019, Jozef Lawrynowicz wrote: > For MSP430, the folding of identical functions marked with the "interrupt" > attribute by -fipa-icf-functions results in wrong code being generated. > Interrupts have different calling conventions than regular functions, so > inserting a CALL from one identical interrupt to another is not correct and > will result in stack corruption. But ICF by creating an alias would be fine, correct? As I understand, the real issue here is that gcc does not know how to correctly emit a call to "interrupt" functions (because they have unusual ABI and exist basically to have their address stored somewhere). So I think the solution shouldn't be in disabling ICF altogether, but rather in adding a way to recognize that a function has quasi-unknown ABI and thus not directly callable (so any other optimization can see that it may not emit a call to this function), then teaching ICF to check that when deciding to fold by creating a wrapper. (would it be possible to tell ICF that addresses of interrupt functions are not significant so it can fold them by creating aliases?) Alexander
Re: [RFC] Disabling ICF for interrupt functions
On Mon, 22 Jul 2019, Jozef Lawrynowicz wrote: > This would have to be caught at the point that an optimization pass > first considers inserting a CALL to the interrupt, i.e., if the machine > description tries to prevent the generation of a call to an interrupt function > once the RTL has been generated (e.g. by blanking on the define_expand for > "call"), we are going to have ICEs/wrong code generated a lot of the time. > Particularly in the case originally mentioned here - there would be an empty > interrupt function. Yeah, I imagine it would need to be a new target hook direct_call_allowed_p receiving a function decl, or something like that. > > (would it be possible to tell ICF that addresses of interrupt functions are > > not significant so it can fold them by creating aliases?) > > I'll take a look. Sorry, I didn't say explicitly, but that was meant more as a remark to IPA maintainers: currently in GCC "address taken" implies "address significant", so "address not significant" would have to be a new attribute, or a new decl bit (maybe preferable for languages where function addresses are not significant by default). Alexander
Re: asking for __attribute__((aligned()) clarification
On Mon, 19 Aug 2019, Richard Earnshaw (lists) wrote:

> Correct, but note that you can only pack structs and unions, not basic types.
> There is no way of under-aligning a basic type except by wrapping it in a
> struct.

I don't think that's true. In GCC 9 the documentation for the 'aligned' attribute has been significantly revised, and now ends with:

  When used as part of a typedef, the aligned attribute can both increase and
  decrease alignment, and specifying the packed attribute generates a warning.

(but I'm sure the de facto behavior of accepting and honoring reduced alignment on a typedef'ed scalar type goes back way earlier than gcc-9)

Alexander
Re: Aw: Re: asking for __attribute__((aligned()) clarification
On Tue, 20 Aug 2019, Markus Fröschle wrote:

> Thank you (and others) for your answers. Now I'm just as smart as before,
> however.
>
> Is it a supported, documented, 'long term' feature we can rely on or not?
>
> If yes, I would expect it to be properly documented. If not, never mind.

I think it's properly documented in gcc-9:
https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Common-Type-Attributes.html

(the "old" behavior, where the compiler would neither honor reduced alignment nor issue a warning, seems questionable; the new documentation promises a more sensible approach)

In portable code one can also use memcpy to move unaligned data; the compiler should translate it to an unaligned load/store when the size is a suitable constant:

  int val;
  memcpy(&val, ptr, sizeof val);

(or __builtin_memcpy when -ffreestanding is in effect)

Alexander
Re: asking for __attribute__((aligned()) clarification
On Wed, 21 Aug 2019, Paul Koning wrote:

> I agree, but if the new approach generates a warning for code that was
> written to the old rules, that would be unfortunate.

FWIW, I don't know which GCC versions accepted 'packed' on a scalar type. Already in 2006 GCC 3.4 would issue a warning:

$ echo 'typedef int ui __attribute__((packed));' | gcc34 -xc - -S -o-
        .file   ""
:1: warning: `packed' attribute ignored
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.6 20060404 (Red Hat 3.4.6-4)"

> Yes.  But last I tried, optimizing that for > 1 alignment is problematic
> because that information often doesn't make it down to the target code even
> though it is documented to do so.

Thanks; indeed, this memcpy solution is not so well suited for that.

Alexander
Re: -fpatchable-function-entry: leverage multi-byte NOP on x86
On Mon, 6 Jan 2020, Martin Liška wrote: > You are right, we do not leverage multi-byte NOPs. Note that the support > depends > on a CPU model (-march) and the similar code is quite complex in binutils: > https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/config/tc-i386.c;h=d0b8f2624a1885d83d2595474bfd78ae844f48f2;hb=HEAD#l1441 > > I'm not sure how worthy would it be to implement that? Huh? Surely the right move would be to ask Binutils to expose that via a new pseudo-op, like .balign but requesting a specific space rather than aligning up to a boundary. Alexander
Re: [RFC] builtin functions and `-ffreestanding -nostartfies` with static binaries
On Fri, 10 Jan 2020, Siddhesh Poyarekar wrote:

> I spent some time thinking about this and while it's trivial to fix by
> disabling ifuncs for static glibc, I wanted a solution that wasn't such
> a big hammer.  The other alternative I could think of is to have an
> exported alias (called __builtin_strlen for example instead of strlen)
> of a default implementation of the builtin function in glibc that gcc
> generates a call to if freestanding && nostartfiles && static.

In the Linaro bug report you mention:

> Basically, IFUNCs and freestanding don't mix.

but really any libc (glibc included) and -nostartfiles don't mix: stdio won't be initialized, TLS won't be set up, and pretty much all other libc-internal data structures won't be properly set up. Almost no libc functions are callable, because, for example, if they try to access 'errno', they crash.

Looking at the opening comment of the failing kselftest source:

 * This program tries to be as small as possible itself, to
 * avoid perturbing the system memory utilization with its
 * own execution.  It also attempts to have as few dependencies
 * on kernel features as possible.
 *
 * It should be statically linked, with startup libs avoided.
 * It uses no library calls, and only the following 3 syscalls:
 *   sysinfo(), write(), and _exit()

so in fact allowing it to link with libc's strlen would be contrary to its intent. The fix is simple: add -nodefaultlibs next to -nostartfiles in its Makefile, and write a trivial loop in place of __builtin_strlen.

Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Wed, 4 Jan 2017, Richard Biener wrote: > My suggestion at that time isn't likely working in practice due to the > limitations Jakub outlines below. The situation is a bit unfortunate > but expect to run into more host(!) dependences in the LTO bytecode. > Yes, while it would be nice to LTO x86_64->arm and ppc64le->arm > LTO bytecode it very likely isn't going to work. Yes, I think it's not really practical to seek wide portability of LTO bytecode. After all, platform specifics get into constant expressions (e.g. 'int p = sizeof (void *);') and are also observable on the preprocessor level (e.g. via '#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__'). However the accelerator platform must be compatible with the host platform in almost all ABI (storage layout?) features such as type sizes and alignments, endianness, default signedness of char, bitfield layout, and possibly others (but yet in the other subthread I was arguing that compromising and making 'long double' only partially compatible makes sense). Thus, portability issue is much smaller in scope here. I think it's a bit unfortunate that the discussion really focused on the trouble with floating-point types. I'd really appreciate any insight on the other questions that were raised, such as whether the decl mode should match that decl's type mode. For floating types, I believe in the long run it should be possible to tag tree type nodes with the floating-point type 'kind' such as IEEE binary, IEEE decimal, accum/fract/sat, or IBM double-double. For our original goal, I think we'll switch to the other solution I've outlined in the opening mail, i.e. propagating mode tables at WPA stage and keeping enough information to know if the section comes from the host or native compiler. Thanks. Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Tue, 10 Jan 2017, Richard Biener wrote: > In general I think they should match. But without seeing concrete > examples of where they do not I can't comment on whether such exceptions > make sense. For example if you adjust a DECLs alignment and then > re-layout it I'd expect you might get a non-BLKmode mode for an > aggregate in some circumstances -- but then decl and type are not 1:1 > compatible (due to different alignment), but this case is clearly desired > as requiring type copies for the sake of alignment would be wasteful. Thanks; Vlad will follow up with (I believe) a different kind of mismatch originating in the C++ front-end. > > For our original goal, I think we'll switch to the other solution I've > > outlined in the opening mail, i.e. propagating mode tables at WPA stage > > and keeping enough information to know if the section comes from the > > host or native compiler. > > So add a hack ontop of the hack? Ugh. So why exactly doesn't it > already work? It looks like decls and types have their modes > "fixed" with the per-file mode table at WPA time. So what is missing > is to "fix" modes in the per-function sections that are not touched > by WPA? WPA re-streams packed function bodies as-is, so anything referred to from within just the body won't be subject to mode remapping; I think only modes of toplevel declarations and functions' arguments will be remapped. And I believe it wouldn't be acceptable to unpack/remap/repack function bodies at WPA stage (it's contrary to the LTO scalability goal). Alexander
Re: LTO remapping/deduction of machine modes of types/decls
On Wed, 11 Jan 2017, Richard Biener wrote: > > WPA re-streams packed function bodies as-is, so anything referred to > > from within just the body won't be subject to mode remapping; I think > > only modes of toplevel declarations and functions' arguments will be > > remapped. And I believe it wouldn't be acceptable to unpack/remap/repack > > function bodies at WPA stage (it's contrary to LTO scalability goal). > > Yes indeed. But this means the mode-maps have to be per function > section (with possibly a way to "share" them?). Or we need a way > to annotate function sections with "no need to re-map" as the > native nvptx sections don't need remapping and the others all use > the same map? Right, the latter: we know that sections coming from the native compiler already have the right modes and thus need no remapping, and the sections coming from the host compiler all need remapping (and will use the same mapping). Prefixes of per-function section names already carry the distinction (".gnu.lto_foo" vs. ".gnu.offload_lto_foo"). Alexander
Re: Improving code generation in the nvptx back end
On Fri, 17 Feb 2017, Thomas Schwinge wrote: > On Fri, 17 Feb 2017 14:00:09 +0100, I wrote: > > [...] for "normal" functions there is no reason to use the > > ".param" space for passing arguments in and out of functions. We can > > then get rid of the boilerplate code to move ".param %in_ar*" into ".reg > > %ar*", and the other way round for "%value_out"/"%value". This will then > > also simplify the call sites, where all that code "evaporates". That's > > actually something I started to look into, many months ago, and I now > > just dug out those changes, and will post them later. > > > > (Very likely, the PTX "JIT" compiler will do the very same thing without > > difficulty, but why not directly generate code that is less verbose to > > read?) In general you cannot use this shorthand notation because PTX interop guidelines explicitly prescribe using the .param space for argument passing. See https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/ , section 3. So at best GCC can use it for calls where interop concerns are guaranteed not to arise: when the callee is not externally visible, and does not have its address taken. And there's a question of how well it will keep working as time passes, since other compilers always use the verbose form (and thus the .reg calling style is not frequently exercised). Alexander
Re: GPLv3 clarification - what constitutes IR
On Mon, 6 Mar 2017, Richard Biener wrote: > >I am not a lawyer and this is not legal advice. > > > >Generating SPIR-V output would not cause that output to become GPLv3 > >licensed. However, linking the result against the GCC support > >libraries, as is normally required for any program generated by GCC, > >and then distributing the resulting executable to other people, would > >require you to use an eligible compilation process (as defined by the > >GCC Runtime Library Exception license that you cite). What this means > >in practice is that you can not take SPIR-V, do further processing it > >using a proprietary compiler, link the result with the GCC runtime > >libraries, and then distribute the resulting program to anybody else. > > > >I don't think it is necessary to determine whether SPIR-V is "target > >code" or "intermediate representation" to draw that conclusion. > > Note we already have the HSAIL and PTX backends which have the very same > (non-)problem. Both invoke a proprietary compiler for final compilation. Sorry, to me this statement sounds a bit ambiguous, so allow me to clarify: there is no mandatory invocation of proprietary code generators taking place as part of GCC invocation (I think there's none at all in case of HSAIL, and in case of PTX it's done for the purpose of syntax checking and can be omitted). Translation of HSAIL/PTX assembly to GPU binary code takes place when the host executable runs, on user's machine, by invoking corresponding libraries (in case of PTX it's NVIDIA's CUDA driver library). There is no support for translating HSAIL/PTX on the developer's machine and linking the resulting GPU binary code into GCC-produced executable. Hope that helps. Alexander
Re: [RFA] update ggc_min_heapsize_heuristic()
On Sun, 9 Apr 2017, Markus Trippelsdorf wrote: > The minimum size heuristic for the garbage collector's heap, before it > starts collecting, was last updated over ten years ago. > It currently has a hard upper limit of 128MB. > This is too low for current machines where 8GB of RAM is normal. > So, it seems to me, a new upper bound of 1GB would be appropriate. While the amount of available RAM has grown, so has the number of available CPU cores (counteracting RAM growth for parallel builds). Building under a virtualized environment with less-than-host RAM has also become more common, I think. Bumping it all the way up to 1GB seems excessive; how did you arrive at that figure? E.g. my recollection from watching a Firefox build is that most compiler instances need under 0.5GB (RSS). > Compile times of large C++ projects improve by over 10% due to this > change. Can you explain a bit more - what projects have you tested? 10+% looks surprisingly high to me. > What do you think? I wonder if it's possible to reap most of the compile time benefit with a more modest gc threshold increase? Thanks. Alexander
Re: Linux and Windows generate different binaries
On Fri, 14 Jul 2017, Yuri Gribov wrote: > I've also detect transitiveness violation compare_assert_loc > (tree-vrp.c), will send fix once tests are done. There are more issues still, see the thread starting at https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00899.html Alexander
Re: Linux and Windows generate different binaries
On Sat, 15 Jul 2017, Segher Boessenkool wrote: > Would it hurt us to use stable sorts *everywhere*? Stability (using the usual definition: keeping the original order of elements that compare equal) is not required to achieve reproducibility [*]. GCC could import or write its own non-randomized implementation of a sorting routine (potentially not stable, e.g. qsort-like), and that would be sufficient to eliminate this source of codegen differences. Alexander [*] nor would it be sufficient, given our current practice of passing invalid comparators to libc sort, at which point anything can happen due to undefined behavior
Re: Linux and Windows generate different binaries
On Sun, 16 Jul 2017, Segher Boessenkool wrote: > I am well aware, and that is not what I asked. If we would use stable > sorts everywhere How? There's no stable sort in libc and switching over to std::stable_sort would be problematic. The obvious approach is adding an implementation of a stable sort in GCC, but at that point it doesn't matter if it's stable, the fact it's deterministic is sufficient for reproducibility. > we would not have to think about whether some sorting routine has to be > stable or if we can get away with a (slightly slower) non-stable sort. I think you mean '(slightly faster)'. > That is just a plain bug, undefined behaviour even (C11 7.22.5/4). > Of course it needs to be fixed. I've posted patches towards this goal. Alexander
Re: Linux and Windows generate different binaries
On Sun, 16 Jul 2017, Segher Boessenkool wrote: > > How? There's no stable sort in libc and switching over to std::stable_sort > > would be problematic. > > Why? - you'd need to decide if the build time cost of extra 8000+ lines brought in by <algorithm> (per each TU) is acceptable; - you'd need to decide if the code size cost of multiple instantiations of template stable_sort is acceptable (or take measures to unify them); - you'd need to adapt comparators, as STL uses a different interface than C qsort; - you'd need to ensure it doesn't lead to a noticeable slowdown. (unrelated, but calls to std::stable_sort cannot be intercepted by Yuri's sortcheck, and of course my recent sortcheck-like patch entirely missed it too) > Sure. Some of our sorts in fact require stable sort though At the moment only bb-reorder appears to use std::stable_sort, is that what you meant, or are there more places? Alexander
Re: RFC: C extension to support variable-length vector types
On Thu, 3 Aug 2017, Richard Biener wrote: > Btw., I did this once to represent constrained expressions on > multi-dimensional arrays in SSA form. There control (aka loop) structure was > also implicit. Google for 'middle-end array expressions'. The C interface > was builtins and VLAs. The description at https://gcc.gnu.org/wiki/MiddleEndArrays refers to http://www.suse.de/~rguenther/guenther.pdf but this redirects to users.suse.de, for which DNS resolution fails. The Web Archive doesn't seem to have a copy. Any chance this might be available elsewhere? Thanks. Alexander
Re: RFC: Improving GCC8 default option settings
On Tue, 12 Sep 2017, Wilco Dijkstra wrote: > * Make -fno-math-errno the default - this mostly affects the code generated > for > sqrt, which should be treated just like floating point division and not set > errno by default (unless you explicitly select C89 mode). (note that this can be selectively enabled by targets where libm never sets errno in the first place, docs call out Darwin as one such target, but musl-libc targets have this property too) > * Make -fno-trapping-math the default - another obvious one. From the docs: > "Compile code assuming that floating-point operations cannot generate >user-visible traps." > There isn't a lot of code that actually uses user-visible traps (if any - > many CPUs don't even support user traps as it's an optional IEEE feature). > So assuming trapping math by default is way too conservative since there is > no obvious benefit to users. OTOH -O options are understood to _never_ sacrifice standards compliance, with the exception of -Ofast. I believe that's an important property to keep. Maybe it's possible to treat -fno-trapping-math similar to -ffp-contract=fast, i.e. implicitly enable it in the default C-with-GNU-extensions mode, keeping strict-compliance mode (-std=c11 as opposed to gnu11) untouched? In any case it shouldn't be hard to issue a warning if fenv.h functions are used when -fno-trapping-math/-fno-rounding-math is enabled. If the above doesn't fly, I believe adopting and promoting a single option for non-value-changing math optimizations (-fno-math-errno -fno-trapping-math, plus -fno-rounding-math -fno-signaling-nans when they're no longer default) would be nice. > * Make -fno-common the default - this was originally needed for pre-ANSI C, > but > is optional in C (not sure whether it is still in C99/C11). This can > significantly improve code generation on targets that use anchors for > globals > (note the linker could report a more helpful message when ancient code that > requires -fcommon fails to link). 
I think in ISO C, situations where -fcommon allows the link to succeed fall under undefined behavior, which in the GNU toolchain is defined to match the historical behavior. I assume the main issue with this is the amount of legacy code that would cause a link failure if -fno-common is made the default - thus, is there anybody in a position to trigger a full-distro rebuild with gcc patched to enable -fno-common, and compare before/after build failure stats? Thanks. Alexander
Re: atomic_thread_fence() semantics
On Thu, 19 Oct 2017, Andrew Haley wrote: > On 19/10/17 12:58, Mattias Rönnblom wrote: > > Did I misunderstand the semantics of > > atomic_thread_fence+memory_order_release? > > No, you did not. This looks like a bug. Please report it. This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640). Alexander
Re: atomic_thread_fence() semantics
On Fri, 20 Oct 2017, Torvald Riegel wrote: > On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote: > > On Thu, 19 Oct 2017, Andrew Haley wrote: > > > No, you did not. This looks like a bug. Please report it. > > > > This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640). > > The test case is invalid (I added some more detail as a comment on this > bug). Sorry, I was imprecise. To be clear, the issue I referred to above as the "bug [that was] fixed on trunk" is the issue Andrew Haley pointed out: when GCC transitioned from GIMPLE to RTL IR, empty RTL was emitted for the fence statement, losing its compile-time effect as a compiler memory barrier entirely. I agree that the testcase in the opening message of this thread is not valid in the sense that this reordering could not have changed the behavior of a conforming program, but the optimization that GCC performed here was entirely unintentional, not something the compiler is presently designed to do. Alexander
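The compile-time effect being discussed can be sketched in a small publish/consume idiom (a sketch only, not the testcase from the thread): the release fence must act as a compiler memory barrier, so the plain store may not sink below it; with the bug, empty RTL was emitted for the fence and that constraint vanished.

```c
#include <stdatomic.h>

int data;
atomic_int ready;

void publish (int v)
{
  data = v;  /* plain store */
  /* Must act as a compiler barrier: the store to 'data' may not be
     moved past this point.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&ready, 1, memory_order_relaxed);
}
```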
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Jakub Jelinek wrote: > The qsort checking failures are tracked in http://gcc.gnu.org/PR82407 > meta bug, 8 bugs in there are fixed, 2 known ones remain. Note that qsort_chk only catches really bad issues where the compiler invokes undefined behavior by passing an invalid comparator to qsort; differences between Glibc and musl-hosted compilers may remain because qsort is not guaranteed to be a stable sort: when sorting an array {A, B} where A and B are not bitwise-identical but cmp(A, B) returns 0, the implementation of qsort may yield either {B, A} or {A, B}, and that may cause codegen differences. (in other words, bootstrapping on a libc with randomized qsort has a good chance to run into bootstrap miscompares even if qsort_chk-clean)
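A small sketch of the distinction (names and struct are made up): a comparator that returns 0 for elements that are not bitwise identical leaves their relative order up to the qsort implementation, while adding a total tie-break pins down one answer on any conforming qsort, stable or not:

```c
#include <stdlib.h>

struct item { int key; int id; };

/* Returns 0 for items equal by 'key' but not bitwise identical:
   different qsort implementations may order such items differently.  */
static int cmp_by_key (const void *a, const void *b)
{
  const struct item *x = a, *y = b;
  return (x->key > y->key) - (x->key < y->key);
}

/* A total tie-break ('id' here) makes the result deterministic across
   libcs even though qsort itself gives no stability guarantee.  */
static int cmp_deterministic (const void *a, const void *b)
{
  const struct item *x = a, *y = b;
  if (x->key != y->key)
    return (x->key > y->key) - (x->key < y->key);
  return (x->id > y->id) - (x->id < y->id);
}
```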
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Jeff Law wrote: > THe key here is the results can differ if the comparison function is not > stable. That's inherent in the qsort algorithms. I'm afraid 'stable' is unclear/ambiguous in this context. I'd rather say 'if the comparator returns 0 if and only if the items being compared are bitwise identical'. Otherwise qsort, not being a guaranteed-stable sort, has a choice as to how reorder non-identical items that compare equal. > But, if the comparison functions are fixed, then the implementation > differences between the qsorts won't matter. > > Alexander Monokov has led an effort to identify cases where the > comparison functions do not provide a stable ordering and to fix them. No. The qsort_chk effort was limited to catching instances where comparators are invalid, i.e. lack anti-commutativity (may indicate A < B < A) or transitivity property (may indicate A < B < C < A). Fixing them doesn't imply making corresponding qsort invocations stable. > Some remain, but the majority have been addressed over the last year. > His work also includes a qsort checking implementation to try and spot > these problems as part of GCC's internal consistency checking mechanisms. Well, currently qsort_chk only checks for validity. It could in principle check for stability, but for each unstable sort it's rather hard to analyze whether it may ultimately cause codegen changes, and as a result patches may be impossible to justify. Alexander
Re: Unstable build/host qsorts causing differing generated target code
On Fri, 12 Jan 2018, Joseph Myers wrote: > On Fri, 12 Jan 2018, Alexander Monakov wrote: > > > No. The qsort_chk effort was limited to catching instances where comparators > > are invalid, i.e. lack anti-commutativity (may indicate A < B < A) or > > transitivity property (may indicate A < B < C < A). Fixing them doesn't > > imply making corresponding qsort invocations stable. > > Incidentally, does it detect being invalid because of comparing A != A? If A appears twice at different positions in the array yes, otherwise no: qsort_chk never passes two equal pointers to the comparator. This is intentional: an earlier effort by Yuri Gribov tried to enforce reflexivity, but that caught instances where input arrays could not contain identical items. So under the assumption that qsort would never compare an element to itself, catching and fixing that wouldn't make a difference in practice - apart from complicating the comparator. Alexander
Re: GCC interpretation of C11 atomics (DR 459)
Hello, Although I wouldn't like to fight defending GCC's design change here, let me offer a couple of corrections/additions so everyone is on the same page: On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote: > > 1. Not consistent with clang/llvm which completely supports double-width > atomics for arm32, arm64, x86 and x86-64 making it possible to write portable > code (w/o specific extensions or assembly code) across all these architectures > (which is finally possible with C11!).The behavior of clang: if mxc16 is > specified, cmpxchg16b is generated for x86-64 (without any calls to > libatomic), otherwise -- redirection to libatomic. For arm64, ldaxp/staxp are > always generated. In my opinion, this is very logical and non-confusing. Note that there are more issues here than just behavior on readonly memory: you need to ensure that the whole program, including all static and shared libraries, is compiled with -mcx16 (and currently there's no ld.so/ld-level support to ensure that), or you'd need to be sure that it's safe to mix code compiled with different -mcx16 settings because it never happens to interop on wide atomic objects. (if you mix -mcx16 and -mno-cx16 code operating on the same 128-bit object, you get wrong code that will appear to work >99% of the time) > 3. The behavior is inconsistent even within GCC. Older (and more limited, less > portable, etc) __sync builtins still use cmpxchg16b directly, newer __atomic > and C11 -- do not. Moreover, __sync builtins are probably less suitable for > arm/arm64. Note that there's no "load" function in the __sync family, so the original concern about operations on readonly memory does not apply. > For these reasons, it may be a good idea if GCC folks reconsider past > decision. And just to clarify: if mcx16 (x86-64) is not specified during > compilation, it is totally OK to redirect to libatomic, and there make the > final decision if target CPU supports a given instruction or not. 
> But if it is specified, it makes sense for performance reasons and lock-freedom guarantees > to always generate it directly. You don't mention it directly, so just to make it clear for readers: on systems where GNU IFUNC extension is available (i.e. on Glibc), libatomic tries to do exactly that: test for cmpxchg16b availability and redirect 128-bit atomics to lock-free RMW implementations if so. (I don't like this solution) Thanks. Alexander
Re: Fw: GCC interpretation of C11 atomics (DR 459)
On Mon, 26 Feb 2018, Ruslan Nikolaev via gcc wrote: > Well, if libatomic is already doing it when corresponding CPU feature is > available (i.e., effectively implementing operations using cmpxchg16b), I do > not see any problem here. mcx16 implies that you *have* cmpxchg16b, therefore > other code compiled without -mcx16 flag will go to libatomic. Inside > libatomic, it will detect that cmpxchg16b *is* available, thus making code > compiled with and without -mcx16 flag completely compatible on a given system. > Or do I miss something here? I'd say the main issue is that libatomic is not guaranteed to work like that. Today it relies on IFUNC for redirection, so you may (and not "will") get the desired behavior on Glibc (implying Linux), not on other OSes, and neither on Linux with non-GNU libc (nor on bare metal, for that matter). Alexander
Re: GCC interpretation of C11 atomics (DR 459)
On Mon, 26 Feb 2018, Szabolcs Nagy wrote: > > rmw load is only valid if the implementation can > guarantee that atomic objects are never read-only. OK, but that sounds like a matter of not emitting atomic objects into .rodata, which shouldn't be a big problem, if not for the backwards compatibility concern? > current implementations on linux (including clang) > don't do that, so an rmw load can observably break > conforming c code: a static global const object is > placed in .rodata section and thus rmw on it is a > crash at runtime contrary to c standard requirements. Note that in your example GCC emits 'x' as a common symbol; you need '... x = { 0 };' for it to appear in .rodata, > on an aarch64 machine clang miscompiles this code: [...] and then with new enough libatomic on Glibc this segfaults with GCC on x86_64 too due to IFUNC redirection mentioned in the other subthread. Alexander
Re: Why does IRA force all pseudos live across a setjmp call to be spilled?
On Fri, 2 Mar 2018, Peter Bergner wrote: > But currently ira-lives.c:process_bb_node_lives() has: > > /* Don't allocate allocnos that cross setjmps or any > call, if this function receives a nonlocal > goto. */ > if (cfun->has_nonlocal_label > || find_reg_note (insn, REG_SETJMP, > NULL_RTX) != NULL_RTX) > { > SET_HARD_REG_SET (OBJECT_CONFLICT_HARD_REGS (obj)); > SET_HARD_REG_SET (OBJECT_TOTAL_CONFLICT_HARD_REGS (obj)); > } > > ...which forces us to spill everything live across the setjmp by forcing > the pseudos to interfere all hardregs. That can't be good for performance. > What am I missing? FWIW there's a similar issue with exceptions where IRA chooses memory for pseudos inside the loop even though the throwing call is outside: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242#c3 Alexander
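The correctness constraint behind that conservative spilling can be sketched at the source level: a local modified between setjmp and longjmp only reliably survives the longjmp if it is not cached in a call-clobbered register, which 'volatile' achieves in user code and spilling achieves inside the compiler (a sketch; the function name is made up):

```c
#include <setjmp.h>

static jmp_buf env;

int demo (void)
{
  /* Without 'volatile', 'acc' could legally be held in a register and
     have an indeterminate value after longjmp (C11 7.13.2.1); forcing
     it to memory is what IRA's all-hardregs conflict set does for
     every pseudo live across the setjmp.  */
  volatile int acc = 0;
  if (setjmp (env) == 0)
    {
      acc = 42;
      longjmp (env, 1);
    }
  return acc;
}
```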
Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
On Fri, 13 Apr 2018, Vivek Kinhekar wrote: > The mfence instruction with memory clobber asm instruction should create a > barrier between division and printf instructions. No, floating-point division does not touch memory, so the asm does not (and need not) restrict its motion. Alexander
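To see why: the "memory" clobber only constrains memory accesses, so a value living purely in registers may still move across the asm; storing it to memory (or passing it to the asm as an operand) is what makes the barrier apply to it. A portable sketch, using an empty asm as the compiler barrier in place of mfence:

```c
double slot;  /* memory that the clobber can actually order */

double divide_then_fence (double a, double b)
{
  double r = a / b;  /* lives in a register: not ordered by the clobber */
  slot = r;          /* a memory access: this store cannot sink below... */
  __asm__ __volatile__ ("" ::: "memory");  /* ...this compiler barrier */
  return slot;
}
```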
Re: Sched1 stability issue
On Wed, 4 Jul 2018, Kugan Vivekanandarajah wrote: > We noticed a difference in the code generated for aarch64 gcc 7.2 > hosted in Linux vs mingw. AFIK, we are supposed to produce the same > output. > > For the testacse we have (quite large and I am trying to reduce), the > difference comes from sched1 pass. If I disable sched1 the difference > is going away. > > Is this a known issue? Attached is the sched1 dump snippet where there > is the difference. The rank_for_schedule comparator used for qsort in the scheduler is known to be invalid; some issues have been fixed in gcc-8, but some remain (you can search the bugzilla for qsort_chk issues). Since the comparator is invalid, different qsort implementations reorder the ready list differently. In gcc-9 qsort calls use gcc_qsort instead and thus wouldn't diverge. Alexander
Re: ChangeLog's: do we have to?
On Thu, 5 Jul 2018, Richard Kenner wrote: > > After 20 years of hacking on GCC I feel like I have literally wasted > > days of my life typing out ChangeLog entries that could have easily been > > generated programmatically. > > > > Can someone refresh my memory here, what are the remaining arguments for > > requiring ChangeLog entries? > > I take the position that any ChangeLog entry that could have been generated > automatically is not a good one. Yes, the list of functions changed could > be generated automatically. (Isn't there actually a way to do that in > emacs? I could have sworn I saw it, though I never used it much.) But the > *purpose* of the change to that function can't be generated automatically > because the ChangeLog entry is the only place that should have that > information. The comments should say what a function currently does, but > isn't the place for a history lesson. That data belongs only in ChangeLog. GCC ChangeLogs don't record the purpose of the change. They say what changed, but not why. As far as I know, this ChangeLog style helped in pre-Subversion times when source control tools tracked changes per-file, not per-tree. Alexander
Re: ChangeLog's: do we have to?
On Mon, 23 Jul 2018, Segher Boessenkool wrote: > For example for .md files you can use > > [diff "md"] > xfuncname = "^\\(define.*$" > > in your local clone's .git/config > > and > > *.md diff=md > > in .gitattributes (somewhere in the source tree). Not necessarily in the source tree: individual users can put that into their $XDG_CONFIG_HOME/git/attributes or $HOME/.config/git/attributes. Likewise the previous snippet can go into $HOME/.gitconfig rather than each individual cloned tree. (the point is, there's no need to split this quality-of-life change between the repository and the user's setup - it can be done once by a user and will work for all future checkouts) Alexander
Re: Question about GCC benchmarks and uninitialized variables
On Tue, 24 Jul 2018, Fredrik Hederstierna wrote: > So my question is how to approach this problems when doing benchmarking, > ofcourse we want the benchmark to mirror as near as 'real life' code as > possible. But if code contains real bugs, and issues that could cause > unpredictable code generation, should such code be banned from benchmarking, > since results might be misleading? Well, all benchmarks are going to be imperfect reflections of real-life workloads in the first place, so their bugs just increase the degree to which they are misleading. When a new compiler version starts to treat some undefined piece of code differently, it can cause a range of effects from code size perturbations as in your case, to completely invalidating the benchmark as in spec2k6 x264 benchmark's case (where GCC exploited undefined behavior in a loop, turning it into an infinite loop that eventually segfaulted). Perhaps even though results on individual benchmarks can vary wildly, aggregated results across a wide range of non-toy benchmarks may be indicative of ... something, because they are unlikely to all exhibit the same "bugs". > On the other hand, the compiler should > generate best code for any input? Engineering effort is limited, so it's probably better to go for generating good code for inputs that are likely to resemble actively used code (and in actively used&maintained code, bugs can be reported and fixed) :) > What do you think, should benchmarking code not being allowed to have eg > warnings like -Wuninitialized and maybe -Wmaybe-uninitialized? Are there more > warnings that indicate unpredictable code generations due to bad code, or are > the root cause that these are 'bugs', and we should not allow real bugs at all > in benchmarking code? A blanket ban on warnings won't work: they have false positives (especially the -Wmaybe- one), and there exists code that validly uses uninitialized data. 
I don't have such a striking example for scalar variables, but for uninitialized arrays there's this sparse set algorithm (which GCC itself also uses): https://research.swtch.com/sparse I think good benchmark sets should be able to evolve to account for newly discovered bugs, rather than remain frozen (which sounds like a reason to become obsolete sooner rather than later). Alexander
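The sparse-set trick referenced above, in outline (a sketch following the Briggs/Torczon scheme described at that link; both arrays start uninitialized, and membership is decided by a two-way cross-check, so stale garbage in 'sparse' never produces a false positive - though a checker like MemorySanitizer will rightly flag the reads of indeterminate values):

```c
#include <stdlib.h>

struct sparse_set { unsigned *dense, *sparse, n; };

static struct sparse_set *ss_new (unsigned universe)
{
  struct sparse_set *s = malloc (sizeof *s);
  s->dense = malloc (universe * sizeof *s->dense);   /* uninitialized */
  s->sparse = malloc (universe * sizeof *s->sparse); /* uninitialized */
  s->n = 0;
  return s;
}

static int ss_contains (const struct sparse_set *s, unsigned v)
{
  unsigned i = s->sparse[v];            /* may read garbage...          */
  return i < s->n && s->dense[i] == v;  /* ...which this check rejects  */
}

static void ss_insert (struct sparse_set *s, unsigned v)
{
  if (!ss_contains (s, v))
    {
      s->dense[s->n] = v;
      s->sparse[v] = s->n++;
    }
}
```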
Re: Question related to GCC structure variable assignment optimization
On Fri, 27 Jul 2018, keshav tilak wrote: > This leads to GCC compiler issuing a call to `memcpy@PLT()' in function bar1. > > I want to create a position independent executable from this source > and run this on > a secure environment which implements ASLR and the loader disallows any binary > which has PLT/GOT based relocations. The linker should be able to relax those nominally-PLT calls to direct calls since it emits a PIE and a local definition is available. Therefore the loader (the dynamic linker) should not get a GOT relocation for this call. Are you sure the linker does not perform this relaxation in your case? If so, that's an issue (missed optimization) in the linker. That said, GCC itself should be able to emit direct calls, as in some cases, most notably the 32-bit x86 ABI, going through the PLT causes a size/speed penalty that the linker would not be able to clean up. What should work is (re)declaring memcpy with hidden visibility: __attribute__((visibility("hidden"))) void *memcpy(void *, const void *, size_t); or via the pragma, but today this doesn't work. I've opened a GCC bug report for this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86695 Alexander
Re: Questions related to creation of libgcov.so
On Fri, 3 Aug 2018, Martin Liška wrote: > I'm attaching current patch, so any comment is welcomed. Please consider passing -Wl,-z,now when linking the new shared library: gcov has a few thread-local variables that may be accessed in async-signal context, and Glibc has bugs related to lazy binding and/or allocation (not sure which exactly) of TLS symbols. Lazy binding doesn't buy anything for gcov, and that would be a cheap way to avoid a regression. On the other hand, it's not gcov's responsibility to work around long-standing bugs in Glibc, so I don't insist. Alexander
Questions/topics for UB BoF
Hello, For the upcoming Cauldron, I've registered a BoF on treatment of undefined behavior in GCC. To "promote" the BoF, collect some input prior to the event, and indirectly indicate my interests, I am offering the following light-hearted questionnaire. I would ask that you mail responses directly to my address, not the list. If you prefer Google Forms, please fill https://goo.gl/forms/1sLhMtLLhorvzDm42 Development policies touched upon in items 2-6 are what I'd invite for discussion at the BoF. Thank you. Alexander You are wearing: [*] Distribution maintainer's hat [*] GCC developer's hat [*] GCC user's hat # Undefined behavior and compiler diagnostics 1. Some undefined behavior is relevant only at translation time, not execution time (for example: an unmatched ' or " character is encountered on a logical source line during tokenization). GCC typically issues a diagnostic when encountering such UB. Should GCC rather make use of the undefinedness in order to make the compiler simpler and faster? [*] Yes [*] No Details: _ 2. In some instances GCC is able to produce a warning for code that is certain to invoke undefined behavior at run time. Should GCC strive to diagnose that as much as practical, even at the cost of the compiler getting more complex? [*] Yes [*] No Details: _ 3. Sometimes GCC is also able to issue a warning for code that is likely (but not certain) to invoke undefined behavior. As such warnings may have false positives, should GCC nevertheless try to issue them too, where practical? [*] Yes [*] No Details: _ 4. When GCC assumes absence of undefined behavior in optimization, it leads to "surprising" transformations. This is generally not reportable to the user; however, -Waggressive-loop-optimizations was added for one such case. If you are familiar with that flag, would you say overall it was worth it, and more warnings in that spirit would be nice to have? [*] Yes [*] No Details: _ # Undefined behavior and compiler optimizations 5. 
When GCC optimizations encounter code that is certain to invoke undefined behavior, they do not react consistently: at the moment we have a wide range of behaviors, like not transforming the code at all, or replacing it with a trap instruction. Should GCC apply one tactic consistently, and if so, which? [*] Yes [*] No Details: _ 6. Some GCC optimization/analysis functionality assumes absence of UB, and cannot detect that code invokes undefined behavior. Should GCC prefer to assume presence of UB in analysis, which would allow diagnosing it or transforming it to a runtime trap? [*] Yes [*] No Details: _
Re: ChangeLog files: 8 spaces vs. a tab
On Mon, 27 Aug 2018, Martin Liška wrote: > Hi. > > Recently I've noticed that I have wrongly set up my editor and > I installed quite a few changes where my changelog entries > have 8 spaces instead of a tab. > > I grepped all ChangeLog files for that (ignoring ChangeLog-{year} files) > and I see: Note that some files in your list appear only because they have 12 spaces indenting the second author in a multi-author change, which is intentional and probably does not need changing. Alexander
Re: Scheduling automaton question
On Fri, 11 Feb 2011, Bernd Schmidt wrote: > Suppose I have two insns, one reserving (A|B|C), and the other reserving > A. I'm observing that when the first one is scheduled in an otherwise > empty state, it reserves the A unit and blocks the second one from being > scheduled in the same cycle. This is a problem when there's an > anti-dependence of cost 0 between the two instructions. > > Vlad - two questions. Is this behaviour what you would expect to happen, > and how much work do you think would be involved to fix it (i.e. make > the first one transition to a state where we can still reserve any two > out of the three units)? Could you please clarify a bit: would the modified behavior match what your target CPU does? The current behavior matches CPUs without lookahead in instruction dispatch: the first insn goes to the first matching execution unit (A), the second has to wait. Alexander
Re: RTL Conditional and Call
On Sat, 31 Dec 2011, Matt Davis wrote: > Hi, > I am having an RTL problem trying to make a function call from a > COND_EXEC rtx. The reload pass has been called, and very simply I > want to compare on an 64bit x86 %rdx with a specific integer value, > and if that value is true, my function call executes. I can call the > function fine outside of the conditional, but when I set it in the > conditional expression, I get the following error: > > test.c:6:1: error: unrecognizable insn: Indeed, x86 does not have a "conditional call" instruction. You would have to generate the call in a separate basic block and add a conditional branch instruction around it. You can reference the following code, which attempts to convert any COND_EXECs to explicit control flow: http://gcc.gnu.org/ml/gcc-patches/2011-10/msg02383.html (but you will probably need to additionally generate comparison instructions). Hope that helps, Alexander
Re: Problems with selective scheduling
Hi, On Tue, 27 Oct 2009, Markus L wrote: Hi, I recently read the articles about the selective scheduling implementation and found it quite interesting, I especially liked the idea of how neatly software pipelining is integrated. The target I am working on is a VLIW DSP so obviously these things are very important for good code generation. However when compiling the following C function with -fselective-scheduling2 and -fsel-sched-pipelining I face a few problems. Increase verbosity of scheduler dumps to obtain more useful information by passing the following flags: -fdump-rtl-sched1-details -fdump-rtl-sched2-details -fsched-verbose=6 It may also be useful to compare the scheduler behaviour on your target to ia64. Note that building full-fledged cross-compiler wouldn't be necessary, just 'configure --target=ia64-linux && make all-gcc' and invoke gcc/cc1 (to produce dumps, change 'sched2' to 'mach' in the line above, since sel-sched is invoked from machine-reorg pass on ia64). More comments below. long dotproduct2(int *a, int *b) { int i; long s=0; for (i = 0; i < 256; i++) s += (long)*a++**b++; return s; } The output I get from sched2 pass is: ... Scheduling region 0 Scheduling on fences: (uid:32;seqno:6;) scanning new insn with uid = 80. deleting insn with uid = 80. Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted Scheduling region 1 Scheduling on fences: (uid:72;seqno:1;) scanning new insn with uid = 81. deleting insn with uid = 81. Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted Scheduling region 2 Scheduling on fences: (uid:65;seqno:1;) scanning new insn with uid = 82. deleting insn with uid = 82. 
Scheduled 0 bookkeeping copies, 0 insns needed bookkeeping, 0 insns renamed, 0 insns substituted (note 26 27 65 2 NOTE_INSN_FUNCTION_BEG) (insn:TI 65 26 30 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16]) (reg/f:QI 32 sp)) 12 {pushqi1} (nil)) (insn 30 65 62 2 dotprod2.c:2 (set (reg/v:HI 16 a0l [orig:62 s ] [62]) (const_int 0 [0x0])) 6 {*zero_load_hi} (expr_list:REG_EQUAL (const_int 0 [0x0]) (nil))) (insn 62 30 66 2 dotprod2.c:2 (set (reg:QI 2 r2 [70]) (const_int 256 [0x100])) 5 {*constant_load_qi} (expr_list:REG_EQUAL (const_int 256 [0x100]) (nil))) (insn:TI 66 62 67 2 dotprod2.c:2 (set (mem:QI (pre_dec (reg/f:QI 32 sp)) [0 S1 A16]) (reg/f:QI 33 dp)) 12 {pushqi1} (nil)) (insn:TI 67 66 69 2 dotprod2.c:2 (set (reg/f:QI 33 dp) (reg/f:QI 32 sp)) 10 {*move_regs_qi} (nil)) (note 69 67 39 2 NOTE_INSN_PROLOGUE_END) (code_label 39 69 31 3 2 "" [1 uses]) (note 31 39 34 3 [bb 3] NOTE_INSN_BASIC_BLOCK) (note 34 31 32 3 NOTE_INSN_DELETED) (insn:TI 32 34 33 3 dotprod2.c:10 (set (reg:QI 19 a1h [67]) (mem:QI (post_inc:QI (reg/v/f:QI 1 r1 [orig:65 b ] [65])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 1 r1 [orig:65 b ] [65]) (nil))) (insn 33 32 35 3 dotprod2.c:10 (set (reg:QI 18 a1l [68]) (mem:QI (post_inc:QI (reg/v/f:QI 0 r0 [orig:64 a ] [64])) [2 S1 A16])) 3 {*load_word_qi_with_post_inc} (expr_list:REG_INC (reg/v/f:QI 0 r0 [orig:64 a ] [64]) (nil))) (insn 35 33 61 3 dotprod2.c:10 (set (reg/v:HI 16 a0l [orig:62 s ] [62]) (plus:HI (mult:HI (sign_extend:HI (reg:QI 19 a1h [67])) (sign_extend:HI (reg:QI 18 a1l [68]))) (reg/v:HI 16 a0l [orig:62 s ] [62]))) 23 {multacc} (expr_list:REG_DEAD (reg:QI 19 a1h [67]) (expr_list:REG_DEAD (reg:QI 18 a1l [68]) (nil (jump_insn:TI 61 35 75 3 dotprod2.c:8 (parallel [ (set (pc) (if_then_else (ne (reg:QI 2 r2 [70]) (const_int 1 [0x1])) (label_ref:QI 39) (pc))) (set (reg:QI 2 r2 [70]) (plus:QI (reg:QI 2 r2 [70]) (const_int -1 [0x]))) (use (const_int 255 [0xff])) (use (const_int 255 [0xff])) (use 
(const_int 1 [0x1])) ]) 43 {doloop_end_internal} (expr_list:REG_BR_PROB (const_int 9899 [0x26ab]) (nil))) (note 75 61 70 4 [bb 4] NOTE_INSN_BASIC_BLOCK) (note 70 75 72 4 NOTE_INSN_EPILOGUE_BEG) ... The loop body is not correctly scheduled, the TImode flags indicate that the entire loop-body will be executed in a single cycle as a VLIW packet and this will not work since no loop-prologue code has been emitted. I suspect your machine description says that dependency between loads and multiply-add has zero latency, thus allowing the scheduler to place them into one instruction group. Grep for various comments about tick_check_p function. In verbose scheduler dumps, there should be something like Expr 35 is not ready yet until cycle 2 No best expr found! Finished a cycle. Current cycle = 2 My (probably quite limited) understanding of what should happen is that: 1. the fence is placed at (bef
Re: libgomp forces -Werror even when top level configure --disable-werror
In the broader scope, there are two separate problems here. One is that libgomp indeed does not honor --disable-werror. However, --disable-werror, even if it worked for libgomp, would be too big a hammer to work around the second (real) problem, and not very useful for development builds anyway. The second issue is how libgomp's configure detects availability of pthread{_attr,}_{get,set}affinity_np. It uses a link test and does not pass -Wall -Werror to the compiler; thus, the libgomp build fails if those functions are declared in /usr/include/nptl/pthread.h but not in /usr/include/pthread.h (libgomp only includes the latter). This can be fixed by making configure use -Werror (I believe adding a --with-pthread= option is out of the question). Bootstraps on affected systems should probably use make CFLAGS_FOR_TARGET='-g -O2 -I/usr/include/nptl' as a clean workaround. -- Alexander Monakov
Re: Modulo Scheduling
On Tue, 2 Feb 2010, Cameron Lowell Palmer wrote: Does Modulo Scheduling work on x86 platforms? I have tried adding in various versions of the -fmodulo-sched option and get the exact same output with or without. The application is a very simplistic matrix multiply without dependencies. No, at present SMS is not able to schedule any loops on x86 at all. This is due to an implementation detail: SMS operates on loops that end with a decrement-and-branch instruction, and GCC does not generate such instructions on x86. Sorry. Alexander Monakov
Re: Understanding Scheduling
On Fri, 19 Mar 2010, Ian Bolton wrote: > Let's start with sched1 ... > > For our architecture at least, it seems like Richard Earnshaw is > right that sched1 is generally bad when you are using -Os, because > it can increase register pressure and cause extra spill/fill code when > you move independent instructions in between dependent instructions. Please note that Vladimir Makarov implemented register pressure-aware sched1 for GCC 4.5, activated with -fsched-pressure. I thought I should mention this because your e-mail omits it completely, so it's hard to tell whether you tested it. Best regards, Alexander Monakov
Re: bug linear loop transforms
[gcc-bugs@ removed from Cc:] On Mon, 29 Mar 2010, Alex Turjan wrote: > I'm writing to you regarding a possible bug in the linear loop transform. > The bug can be reproduced by compiling the attached C file with gcc 4.5.0 > (20100204, 20100325) on an x86 machine. > > The compiler flags that reproduce the error are: > -O2 -fno-inline -fno-tree-ch -ftree-loop-linear > > If the compiler is run with: > -O2 -fno-inline -fno-tree-ch -fno-tree-loop-linear > then the produced code is correct. Instead of writing to a mailing list, please file a bug in GCC Bugzilla, as described in http://gcc.gnu.org/bugs/ . Posting bug reports to gcc-bugs@ does not register them in Bugzilla, and thus is not recommended. Thanks. Alexander Monakov
Re: (known?) Issue with bitmap iterators
On Thu, 25 Jun 2009, Jeff Law wrote: I wasn't suggesting we make them "safe" in the sense that one could modify the bitmap and everything would just work. Instead I was suggesting we make the bitmap readonly for the duration of the iterator and catch attempts to modify the bitmap -- under the control of ENABLE_CHECKING of course. If that turns out to still be too expensive, it could be conditional on ENABLE_BITMAP_ITERATOR_CHECKING or whatever, which would normally be off. My biggest concern would be catching all the exit paths from the gazillion iterators we use and making sure they all reset the readonly flag. Ick... Unless I'm missing something, one can avoid that by checking whether the bitmap has been modified in the iterator increment function. To be precise, what I have in mind is: 1. Add bool field `modified_p' in bitmap structure. 2. Make iterator setup functions (e.g. bmp_iter_set_init) reset it to false. 3. Make functions that modify the bitmap set it to true. 4. Make iterator increment function (e.g. bmp_iter_next) assert !modified_p. This approach would also allow modifying the bitmap on an iteration ending with break, which IMHO is fine. -- Alexander Monakov
Re: (known?) Issue with bitmap iterators
On Fri, 26 Jun 2009, Joe Buck wrote: On Fri, Jun 26, 2009 at 03:38:31AM -0700, Alexander Monakov wrote: 1. Add bool field `modified_p' in bitmap structure. 2. Make iterator setup functions (e.g. bmp_iter_set_init) reset it to false. 3. Make functions that modify the bitmap set it to true. 4. Make iterator increment function (e.g. bmp_iter_next) assert !modified_p. Sorry, it doesn't work. Function foo has a loop that iterates over a bitmap. During the iteration, it calls a function bar. bar modifies the bitmap, then iterates over the bitmap. It then returns to foo, which is in the middle of an iteration, which it continues. The bitmap has been modified (by bar), but modified_p was reset to false by the iteration that happened at the end of bar. Good catch, thanks! Let me patch that up a bit: 1. Add int field `generation' in bitmap structure, initialized arbitrarily at bitmap creation time. 2. Make iterator setup functions save initial generation count in the iterator (or generation counts for two bitmaps, if needed). 3. Modify generation count when bitmap is modified (e.g. increment it). 4. Verify that saved and current generations match when incrementing the iterator.
Re: A question regarding bundling and NOPs insertion for VLIW architecture
On Tue, 11 May 2010, Revital1 Eres wrote: > > Hello, > > I have a question regarding the process of bundling and NOPs insertion for > VLIW architecture > and I appreciate your answer: > > I am calling the second scheduler from the machine reorg pass; similar to > what is done for IA64. > I now want to handle the bundling and NOPs insertion for VLIW architecture > with issue rate of 4 > and I want to make sure I understand the process: > > IIUC I can use the insns with TImode that the scheduler marked to indicate > a new cycle, so > the question is how many nops to insert after that cycle, if any. > I noticed the following approach that was used in c6x which is mentioned > in: > http://archiv.tu-chemnitz.de/pub/2004/0176/data/index.html > > "NOP Insertion and Parallel Scheduling > If the scheduler is run, it checks dependencies and tries to schedule the > instructions so as to > minimize the processing cycles. The hooks TARGET_SCHED_REORDER(2) are > considered > to reorder the instructions in the ready queue in case the back end wants to > override the > default rules. I used the hooks to memorize the program cycle the > instruction is scheduled. > This value is stored in a hash table I created for that purpose. From the > cycle information > I can later determine how many NOPs have to be inserted between two > instructions. This > value then overrides the attribute value." > > IA64 seems to have a much more complicated approach for the bundling and NOPs > insertion and I wonder > whether the reason is due to IA64-specific issues, or there is something I'm > missing in the approach > mentioned above? From skimming the paper I understand that the target processor is a 4-wide VLIW with little or no instruction issue constraints (which insn type may go in which bundle slot) and uses a non-interlocked pipeline, thus requiring NOP insertion to avoid dependencies. IA64 is different in both regards. 
Bundling on ia64 is complicated because not all combinations of insn types are possible in a bundle (a bundle contains three insns), and instruction issue boundaries can appear mid-bundle (the ia64 architecture uses stop bits to indicate parallel issue boundaries, and some bundle kinds have a stop bit in the middle). Incidentally, ia64 does not need NOP insertion to avoid data dependency violations, because it uses scoreboarding to track register dependencies. Thus, NOP insertion is only needed to satisfy bundling constraints. I think the ia64 port in GCC uses dynamic programming to perform bundling because it would be much harder to extract the instruction placement from the automaton (which I think tracks all of the mentioned constraints internally). Alexander
Re: Massive performance regression from switching to gcc 4.5
Hi, On Fri, 25 Jun 2010, Jan Hubicka wrote: > I would also be very interested to know how profile feedback works in this > case > (and why it does not work in previous releases). Profiling multi-threaded programs needs -fprofile-correction, which appeared only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the initial message only speaks about 4.3 and 4.5). Mozilla code also triggered a bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified their code to never leave non-default alignment at the end of the TU (I have posted a patch for the libgcov bug [1], but it was not reviewed and does not apply anymore due to build_constructor changes). [1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html Cheers, Alexander
Re: CFG traversal
On Tue, 6 Jul 2010, Alex Turjan wrote: > Hi, > Is there functionality in gcc based on which the CFG can be traversed in > such a way that a node gets visited once all of its predecessors have been > visited? (Assuming you mean when there are no loops) Yes, see post_order_compute in cfganal.c. It computes topological sort order (which is what you need) in reverse: nodes that must be visited last come first in the array. Hope that helps. Alexander
Re: A question about doloop
On Mon, 26 Jul 2010, Revital1 Eres wrote: > > Hello, > > Doloop optimization fails to be applied on the following kernel from > testcase sms-4.c with mainline (-r 162294) due to a 'Possible infinite > iteration > case' message; taken from the loop2_doloop dump. (please see below). > With an older version of gcc (-r 146278) doloop was applied successfully, > and I would appreciate an explanation of the change in behavior. This may be due to changes in induction variable selection and the ability of loop2_doloop to discover the number of iterations for different variants of IV selection (which is not trivial when the 'i' variable is eliminated and the loop boundary is expressed with a pointer comparison). See the PR32283 [1] audit trail for an example of a related problem that was discussed before. It is possible that something is missing in simplify-rtx.c so that the 'infinite if' condition cannot be simplified and proven to be always false. Zdenek once had to improve simplify-rtx.c for this reason, as the audit trail shows. [1] http://gcc.gnu.org/PR32283 Hope that helps, Alexander
Re: Restrict qualifier still not working?
On Mon, 2 Aug 2010, Bingfeng Mei wrote: > Hi, > I ran a small test to see how the trunk/4.5 works > with the rewritten restrict qualified pointer code. But it doesn't > seem to work on either x86-64 or our port. > > tst.c: > void foo (int * restrict a, int * restrict b, > int * restrict c, int * restrict d) > { > *c = *a + 1; > *d = *b + 1; > } [snip] > foo: > .LFB0: > .cfi_startproc > movl (%rdi), %eax > addl $1, %eax > movl %eax, (%rdx) > movl (%rsi), %eax > addl $1, %eax > movl %eax, (%rcx) > ret > > In the final generated code, the second load should have > been moved before the first store if restrict qualifiers > are handled correctly. > > Am I missing something here? Thanks. The second load is moved for me with -fschedule-insns, -frename-registers or -fselective-scheduling2 (all of which are disabled by default on x86-64 -O2). Without those flags, the second scheduler alone cannot lift the load due to the dependency on %eax. Hope that helps. Alexander
RE: Restrict qualifier still not working?
On Tue, 3 Aug 2010, Bingfeng Mei wrote: > Thanks, I can reproduce it with the trunk compiler but not 4.5.0. > Do you know how alias sets are represented and used now? I'm not aware of any changes regarding alias sets. > It used to > be that each alias set was assigned a unique number and there wouldn't > be a dependence edge drawn between different alias sets. Are you implying that restricted pointers would get different alias set numbers before but don't anymore? I don't think this has changed, but I may be mistaken (hopefully Richard can clarify this). > It seems not > to be the case anymore. [2 *a_1(D)+0 S4 A32] The second field > must play a role in disambiguating the memory access. Yes, this is the MEM_EXPR, which is used to invoke the tree alias oracle from RTL. See mem_refs_may_alias_p and its invocations in {true,anti,write}_dependence. It should be much more precise than alias set numbers (but they are still used nevertheless). > BTW, why are these two intermediate variables both assigned to eax > without these non-default options? This example has no register pressure. > It looks like an issue with IRA. Well, RA is quite complicated even without considering issues like this. Thanks to Vladimir's pressure-sensitive scheduling patches, pre-RA scheduling should solve this. Alexander
Re: pipeline description
On Thu, 11 Nov 2010, Ian Lance Taylor wrote: > roy rosen writes: > > > If I have two insns: > > r2 = r3 > > r3 = r4 > > It seems to me that the dependency analysis creates a dependency > > between the two and prevents parallelization. Although there is a > > dependency (because of r3), I want GCC to parallelize them together, > > since, if the insns are processed together, the old value of r3 is used > > for the first insn before it is written by the second insn. > > How do I let GCC know about these things (when exactly each operand is > > used and when it is written)? > > Is it in these hooks? > > In which port can I see a good example of that? > > I was under the impression that an anti-dependence in the same cycle was > permitted to execute. But perhaps I am mistaken. No, you are right. That is the norm when compiling for ia64, for example. I don't think the backend should specifically care about it: the anti-dependency gets zero latency, and the scheduler is able to issue the second instruction on the same cycle after issuing the first. Alexander
Re: Concerns regarding the -ffp-contract=fast default
On Thu, 14 Sep 2023, Florian Weimer via Gcc wrote: > While rebuilding CentOS Stream with -march=x86-64-v3, I rediscovered > several packages had test suite failures because x86-64 suddenly gained > FMA support. I say “rediscovered” because these issues were already > visible on other architectures with FMA. > > So far, our package/architecture maintainers had just disabled test > suites or had built the package with -ffp-contract=off because the > failures did not reproduce on x86-64. I'm not sure if this is the right > course of action. > > GCC contraction behavior is rather inconsistent. It does not contract x > + x - x without -ffast-math, for example, although I believe it would be > permissible under the rules that enable FMA contraction. This whole > thing looks suspiciously like a quick hack to get a performance > improvement from FMA instructions (sorry). > > I know that GCC 14 has -ffp-contract=standard. Would it make sense to > switch the default to that? If it fixes those package test suites, it > probably has an observable performance impact. 8-/ Note that with =standard FMA contraction is still allowed within an expression: the compiler will transform 'x * y + z' to 'fma(x, y, z)'. The difference between =fast and =standard is contraction across statement boundaries. So I'd expect some of the test suite failures you speak of to remain with =standard as opposed to =off. I think it's better to switch both C and C++ defaults to =standard, matching Clang, but it needs a bit of legwork to avoid regressing our own testsuite for targets that have FMA in the base ISA. (personally I'd be on board with switching to =off even) See https://gcc.gnu.org/PR106902 for a worked example where -ffp-contract=fast caused a correctness issue in a widely used FOSS image processing application that was quite hard to debug. Obviously -Ofast and -ffast-math will still imply -ffp-contract=fast if we make the change, so SPEC scores won't be affected. Thanks. Alexander
Re: Concerns regarding the -ffp-contract=fast default
Hi Florian, On Thu, 14 Sep 2023, Alexander Monakov wrote: > > On Thu, 14 Sep 2023, Florian Weimer via Gcc wrote: > > > While rebuilding CentOS Stream with -march=x86-64-v3, I rediscovered > > several packages had test suite failures because x86-64 suddenly gained > > FMA support. I say “rediscovered” because these issues were already > > visible on other architectures with FMA. > > > > So far, our package/architecture maintainers had just disabled test > > suites or had built the package with -fp-contract=off because the > > failures did not reproduce on x86-64. I'm not sure if this is the right > > course of action. > > > > GCC contraction behavior is rather inconsistent. It does not contract x > > + x - x without -ffast-math, for example, although I believe it would be > > permissible under the rules that enable FMA contraction. This whole > > thing looks suspiciously like a quick hack to get a performance > > improvement from FMA instructions (sorry). > > > > I know that GCC 14 has -fp-contract=standard. Would it make sense to > > switch the default to that? If it fixes those package test suites, it > > probably has an observable performance impact. 8-/ > > Note that with =standard FMA contraction is still allowed within an > expression: the compiler will transform 'x * y + z' to 'fma(x, y, z)'. > The difference between =fast and =standard is contraction across > statement boundaries. So I'd expect some test suite failures you speak of > to remain with =standard as opposed to =off. > > I think it's better to switch both C and C++ defaults to =standard, > matching Clang, but it needs a bit of leg work to avoid regressing > our own testsuite for targets that have FMA in the base ISA. > > (personally I'd be on board with switching to =off even) > > See https://gcc.gnu.org/PR106902 for a worked example where -ffp-contract=fast > caused a correctness issue in a widely used FOSS image processing application > that was quite hard to debug. 
> > Obviously -Ofast and -ffast-math will still imply -ffp-contract=fast if we > make the change, so SPEC scores won't be affected. Is this the sort of information you were looking for? If you're joining the Cauldron and could poll people about changing the default, I feel that could be helpful. One of the tricky aspects is what to do under -std=cNN, which implies -ffp-contract=off; "upgrading" it to =standard would introduce FMAs. Also, I'm a bit unsure what you were implying here: > I know that GCC 14 has -ffp-contract=standard. Would it make sense to > switch the default to that? If it fixes those package test suites, it > probably has an observable performance impact. 8-/ The "correctness trumps performance" principle still applies, and -ffp-contract=fast (the current default outside of -std=cNN) is known to cause correctness issues and violates the C language standard. And -ffast[-and-loose]-math is not going away. Thanks. Alexander
Re: Concerns regarding the -ffp-contract=fast default
On Mon, 18 Sep 2023, Florian Weimer via Gcc wrote: > x - x is different because replacing it with 0 doesn't seem to be a > valid contraction because it's incorrect for NaNs. x + x - x seems to > be different in this regard, but in our implementation, there might be a > quirk about sNaNs and qNaNs. Sorry, do you mean contracting 'x + x - x' to 'x'? That is not allowed due to a different result and the lack of an FP exception for infinities. Contracting 'x + x - x' to fma(x, 2, -x) would be fine. Alexander