[Bug other/99288] xgettext does not get HOST_WIDE_INT_PRINT_UNSIGNED

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99288

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Richard Biener  ---
Fixed.

[Bug translation/40883] [meta-bug] Translation breakage with trivial fixes

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40883
Bug 40883 depends on bug 99288, which changed state.

Bug 99288 Summary: xgettext does not get HOST_WIDE_INT_PRINT_UNSIGNED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99288

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug testsuite/99292] FAIL: gcc.c-torture/compile/pr98096.c -O0 (test for excess errors)

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99292

--- Comment #1 from Richard Biener  ---
IIRC it requires LRA; maybe add a dg target selector for LRA (or for reload,
whose set of targets is likely smaller now)?
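
For illustration, such a guard could look like the following sketch ('lra'
being a hypothetical effective-target name here, assuming one gets added to
target-supports.exp):

/* { dg-do compile { target lra } } */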

[Bug c/99295] [11 Regression] documentation on __attribute__((malloc)) is wrong

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99295

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug middle-end/99299] Need a recoverable version of __builtin_trap()

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99299

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
'enhancement' Importance is the marker we use; in the end it's a missed
optimization since you refer to sub-optimal code generation.

I'm not sure what your proposed non-noreturn trap() would do in terms of
IL semantics compared to a general call without any special annotation?

"recoverable" likely means resuming after the trap, not on an exception
path (so it'll not be a throw())?

The only thing that might be useful to the middle-end would be marking
the function as not altering the memory state.  But I suppose it should
still serve as a barrier for code motion of both loads and stores, even
if those loads/stores are known to not trap.  The only magic we'd have
for this would be __attribute__((const,returns_twice)), which likely
will be more detrimental to general optimization.

So - what's the "sub-optimal code generation" you refer to from the
(presumably) volatile asm() you use for the trap?

[yeah, asm() on GIMPLE is less optimized than a call]
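
For reference, a minimal sketch of the kind of volatile-asm trap presumably
being compared against (x86 only; the exact instruction used by the reporter
is an assumption):

static inline void
recoverable_trap (void)
{
  /* Breakpoint trap; unlike __builtin_trap () execution can resume after
     it, and the "memory" clobber makes the asm a code-motion barrier for
     loads and stores, matching the discussion above.  */
  __asm__ volatile ("int3" ::: "memory");
}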

[Bug rtl-optimization/99305] [11 Regression] range condition simplification after inlining

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99305

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization,
   ||needs-bisection
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Target Milestone|--- |11.0
   Last reconfirmed||2021-03-01

--- Comment #1 from Richard Biener  ---
Confirmed.  Some forwprop/match.pd change prevents phiopt from triggering:

GCC 10 (forwprop->phiopt):

[local count: 1073741824]:
   _7 = (unsigned char) c_2(D);
   _8 = _7 + 208;
-  if (_8 <= 9)
-goto ; [50.00%]
-  else
-goto ; [50.00%]
-
-   [local count: 536870913]:
-
-   [local count: 1073741824]:
-  # iftmp.1_1 = PHI <1(3), 0(2)>
-  return iftmp.1_1;
+  _9 = _8 <= 9;
+  return _9;

forwprop difference GCC 10/11:

-  Replaced '_9 != 0' with '_8 <= 9'
-bar (char c)
+bool bar (char c)
 {
   bool iftmp.1_1;
-  unsigned char _7;
-  unsigned char _8;
+  unsigned char c.0_4;
+  unsigned char _5;
+  bool _6;
+  bool _7;

[local count: 1073741824]:
-  _7 = (unsigned char) c_2(D);
-  _8 = _7 + 208;
-  if (_8 <= 9)
+  if (c_2(D) != 0)
 goto ; [50.00%]
   else
 goto ; [50.00%]

[local count: 536870913]:
+  c.0_4 = (unsigned char) c_2(D);
+  _5 = c.0_4 + 208;
+  _6 = _5 <= 9;
+  _7 = -_6;

[local count: 1073741824]:
-  # iftmp.1_1 = PHI <1(3), 0(2)>
+  # iftmp.1_1 = PHI <_7(3), 0(2)>
   return iftmp.1_1;
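
For reference, a compilable sketch of the kind of source involved, as
reconstructed from the dumps above (not the PR's exact testcase):

static int is_digit (char c) { return c >= '0' && c <= '9'; }

bool bar (char c)
{
  return c && is_digit (c);
}

After inlining, GCC 10's forwprop presumably folds the c != 0 guard into the
single range check _8 <= 9 (which implies c != 0), letting phiopt turn the
PHI into a plain return; GCC 11 keeps the guard, so the PHI survives.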

[Bug libstdc++/99306] cross compiler bootstrap failure on msdosdjgpp: error: alignment of 'm' is greater than maximum object file alignment 16

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99306

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-01
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #2 from Richard Biener  ---
  __gnu_cxx::__mutex&
  get_mutex(unsigned char i)
  {
// increase alignment to put each lock on a separate cache line
struct alignas(64) M : __gnu_cxx::__mutex { };
static M m[mask + 1];
return m[i];

there's __BIGGEST_ALIGNMENT__ one could use as a bound, but that will usually
be lower than the max object file alignment and on most targets likely less
than 64.  That value (64) looks like it should be target dependent anyway
(configury?).
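
For illustration, a sketch of using __BIGGEST_ALIGNMENT__ as the bound (an
assumption, not the libstdc++ fix, and subject to the caveat above that the
bound is usually below 64):

    // hypothetical cap on the per-lock alignment request
    constexpr std::size_t __lock_align
      = 64 < __BIGGEST_ALIGNMENT__ ? 64 : __BIGGEST_ALIGNMENT__;
    struct alignas(__lock_align) M : __gnu_cxx::__mutex { };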

[Bug c++/99309] [10/11 Regression] Segmentation fault with __builtin_constant_p usage at -O2

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309

Richard Biener  changed:

   What|Removed |Added

  Known to fail||11.0
   Target Milestone|--- |10.3
Summary|Segmentation fault with |[10/11 Regression]
   |__builtin_constant_p usage  |Segmentation fault with
   |at -O2  |__builtin_constant_p usage
   ||at -O2
  Known to work||9.3.1
   Priority|P3  |P2
   Last reconfirmed||2021-03-01
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Keywords||wrong-code

--- Comment #1 from Richard Biener  ---
confirmed.

[Bug c++/99310] [11 Regression] ICE: canonical types differ for identical types 'void (A::)(void*)' and 'void (A::)(void*)'

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99310

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4
   Target Milestone|--- |11.0
   Keywords||error-recovery,
   ||ice-checking

[Bug preprocessor/99313] ICE while changing global target options via pragma

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99313

--- Comment #3 from Richard Biener  ---
But this results in unexpected behavior when there are functions with arch=z13
vs. arch=z9, and depending on "luck" we then inherit the wrong params where
we should not?

That said, when unifying target/optimize options these should be handled
and stored once, right?

[Bug c++/99318] [10/11 Regression] -Wdeprecated-declarations where none should be?

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99318

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |10.3
   Keywords||diagnostic, rejects-valid

[Bug c/99323] [9/10/11 Regression] ICE in add_hint, at diagnostic-show-locus.c:2234

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99323

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |9.4
 CC||dmalcolm at gcc dot gnu.org

[Bug c/99324] ICE in mark_addressable, at gimple-expr.c:918

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99324

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-02
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Confirmed.

914   /* Also mark the artificial SSA_NAME that points to the partition of
X.  */
915   if (TREE_CODE (x) == VAR_DECL
916   && !DECL_EXTERNAL (x)
917   && !TREE_STATIC (x)
918   && cfun->gimple_df != NULL
919   && cfun->gimple_df->decls_to_pointers != NULL)
920 {
(gdb) p cfun
$1 = (function *) 0x0

I suppose this could be made more robust by checking for cfun being non-NULL
or checking currently_expanding_to_rtl.
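
A sketch of the first suggestion (illustration only, not the committed fix):

  if (TREE_CODE (x) == VAR_DECL
      && !DECL_EXTERNAL (x)
      && !TREE_STATIC (x)
      && cfun != NULL                       /* new: gdb shows cfun == 0 */
      && cfun->gimple_df != NULL
      && cfun->gimple_df->decls_to_pointers != NULL)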

[Bug c/99325] [11 Regression] ICE in maybe_print_line_1, at c-family/c-ppoutput.c:454

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99325

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2021-03-02
 Status|UNCONFIRMED |NEW
   Target Milestone|--- |11.0

--- Comment #1 from Richard Biener  ---
Confirmed.

[Bug fortran/99326] [9/10/11 Regression] ICE in gfc_build_dummy_array_decl, at fortran/trans-decl.c:1299

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99326

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4
   Target Milestone|--- |9.4

[Bug debug/99334] Generated DWARF unwind table issue while on instructions where rbp is pointing to callers stack frame

2021-03-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99334

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2021-03-02
 Status|UNCONFIRMED |WAITING
 Target||x86_64-linux

[Bug c/99324] [8/9/10/11 Regression] ICE in mark_addressable, at gimple-expr.c:918 since r6-314

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99324

--- Comment #4 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #3)
> Wouldn't it be better to remove the mark_addressable call from build_va_arg
> and call {c,cxx}_mark_addressable in the callers instead.

Sure, or make it a langhook so c-common code can call the "correct"
mark_addressable (there's also c_common_mark_addressable_vec, which might
suggest that splitting out a common c_common_mark_addressable from
{c,cxx}_mark_addressable should be viable, and then use that).

> That way we'd also e.g. diagnose invalid (on i686-linux):
> register __builtin_va_list ap __asm ("%ebx");
> 
> void
> foo (int a, ...)
> {
>   __builtin_va_arg (ap, int);
> }

[Bug c/99340] -Werror=maybe-uninitialized warning with -fPIE, but not -fPIC

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99340

--- Comment #2 from Richard Biener  ---
PIC allows interposing ags_midi_buffer_util_get_varlength and thus possibly
initializing the argument.  PIE does not allow this so we see it is not
initialized.

I suppose the change on the branch is for some unreduced testcase where
different optimization might trigger the new warning (correctly I think).
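
For illustration, a reduced sketch of the interposition difference (function
names are hypothetical, not the reported testcase):

void get_val (int *p) { /* may leave *p untouched */ }

int use (void)
{
  int x;
  get_val (&x);
  return x;   /* -fPIC: get_val may be interposed by a definition that
                 initializes x, so no warning; -fPIE: get_val binds
                 locally, so the uninitialized use is visible.  */
}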

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Keywords||missed-optimization
 Target||x86_64-*-*
  Component|c   |middle-end
 CC||matz at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-03-02

--- Comment #1 from Richard Biener  ---
The stack space is not eliminated because we lower __builtin_va_start only
after RTL expansion and that reserves stack space necessary for accessing
some of the meta (including the passed value itself) as memory.

So it's unavoidable until somebody designs something smarter around varargs
and GIMPLE.

Arguably the not-lowered variant would be easier to expand optimally:

int test_va (int x)
{
  struct  va[1];
  int i;
  int _7;

   [local count: 1073741824]:
  __builtin_va_start (&va, 0);
  i_4 = .VA_ARG (&va, 0B, 0B);
  __builtin_va_end (&va);
  _7 = i_4 + x_6(D);
  va ={v} {CLOBBER};
  return _7;

I'm not fully sure why we lower at all.  Part of the lowering determines
whether any FP arguments are referenced and optimizes based on that,
but IIRC that's all.

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #2 from Richard Biener  ---
Btw, clang manages to produce the following, which shows the situation could be
worse ;)

test_va:# @test_va
.cfi_startproc
# %bb.0:
subq$88, %rsp
.cfi_def_cfa_offset 96
movl%eax, %r10d
movl%edi, %eax
testb   %r10b, %r10b
je  .LBB0_2
# %bb.1:
movaps  %xmm0, -48(%rsp)
movaps  %xmm1, -32(%rsp)
movaps  %xmm2, -16(%rsp)
movaps  %xmm3, (%rsp)
movaps  %xmm4, 16(%rsp)
movaps  %xmm5, 32(%rsp)
movaps  %xmm6, 48(%rsp)
movaps  %xmm7, 64(%rsp)
.LBB0_2:
movq%rsi, -88(%rsp)
movq%rdx, -80(%rsp)
movq%rcx, -72(%rsp)
movq%r8, -64(%rsp)
movq%r9, -56(%rsp)
leaq-96(%rsp), %rcx
movq%rcx, -112(%rsp)
leaq96(%rsp), %rcx
movq%rcx, -120(%rsp)
movabsq $206158430216, %rcx # imm = 0x38
movq%rcx, -128(%rsp)
movl$8, %edx
cmpq$40, %rdx
ja  .LBB0_4
# %bb.3:
movl$8, %ecx
addq-112(%rsp), %rcx
addl$8, %edx
movl%edx, -128(%rsp)
jmp .LBB0_5
.LBB0_4:
movq-120(%rsp), %rcx
leaq8(%rcx), %rdx
movq%rdx, -120(%rsp)
.LBB0_5:
addl(%rcx), %eax
addq$88, %rsp
.cfi_def_cfa_offset 8
retq

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #3 from Richard Biener  ---
So we could try to lower even va_start/end to expose the va_list meta fully
to the middle-end early which should eventually allow eliding it.  That
would require introducing other builtins/internal fns to allow referencing
the frame or the incoming arg registers by number.

[Bug c/99340] -Werror=maybe-uninitialized warning with -fPIE, but not -fPIC

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99340

--- Comment #6 from Richard Biener  ---
GCC 9 warns as well.  I think this was a false negative which is now fixed.

Note GCC 10.1.0 and GCC 10.2.0 warn for me as well, so something must have
regressed this between 10.2.0 and g:eddcb627ccfbd97e025cf366

I'm inclined to mark as INVALID.

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener  changed:

   What|Removed |Added

 CC||jamborm at gcc dot gnu.org

--- Comment #7 from Richard Biener  ---
For simple cases some IPA pass (IPA-CP or IPA-SRA?) could also 'clone' varargs
functions based on callers, eliding varargs and thus also allowing inlining
(or, like the early IPA-SRA did, modify a function in place if all callers are
simple).

Directly supporting inlining might also be possible.

What's required for all this is some local analysis of the varargs function
on whether it's possible to replace the .VA_ARG calls with direct parameter
references (no .VA_ARG in loops for example, no passing of the va_list to
other functions, etc.).
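
For illustration, a sketch of a 'simple' candidate such local analysis could
accept (hypothetical example: a fixed number of va_arg uses, none in a loop,
and the va_list never escaping):

#include <stdarg.h>

static int
sum2 (int n, ...)
{
  va_list ap;
  va_start (ap, n);
  int a = va_arg (ap, int);   /* straight-line va_arg uses ...  */
  int b = va_arg (ap, int);
  va_end (ap);                /* ... and ap is never passed elsewhere */
  return a + b;
}

int caller (void) { return sum2 (2, 1, 2); }   /* all callers are direct */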

[Bug inline-asm/99342] Clobbered register used for input operand (aarch64)

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID
 Target||aarch64

--- Comment #5 from Richard Biener  ---
(In reply to Stewart Hildebrand from comment #4)
> Created attachment 50287 [details]
> Simplified test case
> 
> I simplified the test case - hopefully this should make it clearer. This:
> 
> asm volatile("\n"
>  "ldr x0,  %0  \n"
>  "ldr x1,  %1  \n"
>  "ldr x2,  %2  \n"
>  : // No output operands
>  : // Inputs:
>"Q"(s_current->_state.fp), "Ump"(s_current->_state.sp),
>"Ump"(this->_state.fp)
>  : // Clobbers:
>// Registers we use here
>"x0", "x1", "x2",
>// Callee-saved registers (general purpose)
>"x19", "x20", "x21", "x22", "x23", "x24",
>"x25", "x26", "x27", "x28",
>// Memory access
>"memory");
> 
> Results in:
> 
>  118:   f9400080ldr x0, [x4]
>  11c:   f9401461ldr x1, [x3, #40]
>  120:   f9400c02ldr x2, [x0, #24]

You are clobbering x{0,1,2} before the asm has finished using its input
operands, so you have to use earlyclobbers.

[Bug preprocessor/99343] Suggest: -H option support output to file

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99343

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-03
 Ever confirmed|0   |1
   Severity|normal  |enhancement
 Status|UNCONFIRMED |NEW

--- Comment #1 from Richard Biener  ---
Sounds reasonable.  Patches should be sent to gcc-patc...@gcc.gnu.org; see also
https://gcc.gnu.org/contribute.html

[Bug fortran/99345] [11 Regression] ICE in doloop_contained_procedure_code, at fortran/frontend-passes.c:2464

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99345

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4
   Target Milestone|--- |11.0

[Bug rtl-optimization/99347] [9/10/11 Regression] ICE in create_block_for_bookkeeping, at sel-sched.c:4549 since r9-6859-g25eafae67f186cfa

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99347

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |9.4
   Priority|P3  |P2
 CC||amonakov at gcc dot gnu.org

[Bug fortran/99350] [9/10/11 Regression] ICE in gfc_get_symbol_decl, at fortran/trans-decl.c:1869

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99350

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4
   Target Milestone|--- |9.4

[Bug fortran/99355] -freal-X-real-Y -freal-Z-real-X promotes Z to Y

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99355

Richard Biener  changed:

   What|Removed |Added

Version|unknown |10.2.0

--- Comment #2 from Richard Biener  ---
So you say -freal-8-real-16 -freal-4-real-8 promotes real(4) to real(16),
which indeed sounds less than useful but could be a valid reading of the
intended semantics.

[Bug ipa/99357] Missed Dead Code Elimination Opportunity

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99357

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-03
 Status|UNCONFIRMED |NEW
  Component|tree-optimization   |ipa
 Ever confirmed|0   |1
 CC||hubicka at gcc dot gnu.org,
   ||marxin at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
We have no "flow sensitive" analysis of global variable values.  When you
remove the 'a = 0' assignment we figure 'a' is never written to and promote it
to a constant, which then allows constant folding of the read.

Now we could eventually enhance that analysis to ignore writes that store
the same value as the initializer (and also make sure to remove those
later).

But consider

static int a = 0;
extern void bar(void);
int main() {
if (a)
bar();
a = 1;
return 0;
}

which would be still valid to be optimized to just

int main()
{
  return 0;
}

eliding the call and the variable 'a' completely (since it's unused).

Thus it's also a missed dead store elimination (for which we'd need to
know if there are any finalizers referencing 'a' for example).

[Bug tree-optimization/97897] ICE tree check: expected ssa_name, have integer_cst in compute_optimized_partition_bases, at tree-ssa-coalesce.c:1638

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97897

Richard Biener  changed:

   What|Removed |Added

  Known to work||10.2.1
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Biener  ---
Fixed.  Not planning to backport further.

[Bug tree-optimization/98526] [10 Regression] Double-counting of reduction cost

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98526

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Priority|P3  |P2
  Known to fail||10.2.0
 Resolution|--- |FIXED
  Known to work||10.2.1

--- Comment #7 from Richard Biener  ---
Fixed.

[Bug tree-optimization/98640] [10 Regression] GCC produces incorrect code with -O1 and higher since r10-2711-g3ed01d5408045d80

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98640

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Richard Biener  ---
Fixed.

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #21 from Richard Biener  ---
So I'm somewhat lost in pointing to the actual error.  And I'm not sure there
is any error, just unfortunate optimization behavior in the face of the
testcase being undefined with -ffinite-loops (or in C++).

That said, a more "sensible" optimization exploiting the undefined behavior
would have been to exit the loop, not to preserve the if (xx) test.  With it
preserved we either end up with infinitely many puts() calls or none, both of
which have the "wrong" number of invocations of the side-effect in the loop.

There's still the intuitively missing control dependence on the if (at_eof)
check (which is also missing without -ffinite-loops but doesn't cause any
wrong DCE there).  But as said my gut feeling is that control dependence
doesn't capture the number of invocations but only whether something is
invoked.  That's likely why we manually add control dependences of the
latch of loops for possibly infinite loops.

CCing Honza who added the control-dependence stuff and who may remember
some extra details.

[Bug tree-optimization/99101] optimization bug with -ffinite-loops

2021-03-03 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #23 from Richard Biener  ---
Just for the record we had the idea to apply the "bolt" of marking the latch
control dependence (as done for possibly infinite loops) for loops containing
stmts with side-effects.

diff --git a/gcc/tree-ssa-dce.c b/gcc/tree-ssa-dce.c
index c027230acdc..c07b60bf25c 100644
--- a/gcc/tree-ssa-dce.c
+++ b/gcc/tree-ssa-dce.c
@@ -695,6 +695,12 @@ propagate_necessity (bool aggressive)
 	  if (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
 	      && !bitmap_bit_p (visited_control_parents, bb->index))
 	    mark_control_dependent_edges_necessary (bb, false);
+	  /* If the stmt has side-effects the number of invocations matters.
+	     In this case mark the containing loop control.  */
+	  if (gimple_has_side_effects (stmt)
+	      && bb->loop_father->num != 0)
+	    mark_control_dependent_edges_necessary (bb->loop_father->latch,
+						    false);
 	}

   if (gimple_code (stmt) == GIMPLE_PHI

But while that works for CDDCE1, CDDCE2 is presented with a slightly altered
CFG that somehow prevents it from working.  This also means that both loops
need to be considered infinite for the present bolting to work.

[Bug c/99363] [11 regression] gcc.dg/attr-flatten-1.c fails starting with r11-7469

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99363

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-04
 Ever confirmed|0   |1
   Keywords||diagnostic, needs-bisection
 Target|powerpc64*-linux-gnu,   |powerpc64*-linux-gnu,
   |cris-elf|cris-elf, x86_64-*-*
   Target Milestone|--- |11.0
 Status|UNCONFIRMED |NEW
  Component|other   |c

--- Comment #2 from Richard Biener  ---
Likely fails everywhere.  Possibly a testsuite issue.

[Bug fortran/99369] [10/11 Regression] ICE in gfc_resolve_expr, at fortran/resolve.c:7167

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99369

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4
   Target Milestone|--- |10.3

[Bug target/99372] gimplefe-28.c ICEs when sqrt insn is not available

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99372

Richard Biener  changed:

   What|Removed |Added

Version|unknown |11.0
  Component|tree-optimization   |target
 Target||powerpc

--- Comment #1 from Richard Biener  ---
It does check for that:

/* { dg-do compile { target sqrt_insn } } */

The error is with the powerpc target not implementing the dg sqrt_insn target
selector properly.

[Bug ipa/99373] unused static function not being removed in some cases after optimization

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99373

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-03-04
 CC||hubicka at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
The issue is that only IPA reference promotes 'd' to a constant, and thus only
late optimization elides the call to 'j'.  That's too late to eliminate the
function.
Note we process 'j' first during late opts (to make the late local IPA
pure-const useful).

We'd need another IPA phase before RTL expansion to collect unreachable
functions again (IIRC the original parallel compilation GSoC project added
one).

I'm also quite sure we have a duplicate of this PR.

[Bug tree-optimization/99383] No tree-switch-conversion under PIC

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

Richard Biener  changed:

   What|Removed |Added

 CC||marxin at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
   Last reconfirmed||2021-03-04
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
Same for -fPIE.  The reason is:

Bailing out - value from a case would need runtime relocations.

  reloc = initializer_constant_valid_p (val, TREE_TYPE (val));
  if ((flag_pic && reloc != null_pointer_node)
  || (!flag_pic && reloc == NULL_TREE))
{
  if (reloc)
reason
  = "value from a case would need runtime relocations";

reloc is a STRING_CST here.  Not sure why it says 'runtime relocation' or what
that should be.  It's a reloc in .rodata to something in .string.
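
For illustration, a minimal sketch of the kind of switch affected (an
assumption, not the PR's testcase):

const char *
name (int i)
{
  /* The CSWTCH table would hold the addresses of the string literals,
     each needing a relocation, hence the bail-out under -fPIC/-fPIE.  */
  switch (i)
    {
    case 0: return "zero";
    case 1: return "one";
    case 2: return "two";
    default: return "many";
    }
}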

[Bug tree-optimization/99383] No tree-switch-conversion under PIC

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

Richard Biener  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
-fPIC/-fPIE refers to _code_ so I'm not sure why we restrict _data_ in any way
here?  Using those flags, at least?

Jakub added this code for PR36881 in g:f6e6e9904cd32cc78873a33f0a3839812b0d0f57

[Bug tree-optimization/99383] No tree-switch-conversion under PIC

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99383

--- Comment #3 from Richard Biener  ---
For the specific case of strings, switch-conversion could also generate
a combined string (with intermediate '\0's) and use a table of offsets
into said string, thus doing a single relocation to the combined string
in .text (or the GOT), plus offsetting that with the offset from the
table (at the cost of less string merging and thus a larger .string).

I guess relocs to .string aren't any better than relocs to .{,ro}data.
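
A sketch of that scheme (illustrative layout only):

static const char combined[] = "zero\0one\0two\0many";
static const unsigned char offs[] = { 0, 5, 9, 13 };
/* result = combined + offs[i]: one relocation to 'combined' instead of
   one per case label; offs[] itself needs no relocations.  */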

[Bug middle-end/97855] [11 regression] Bogus warning locations during lto-bootstrap

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97855

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Richard Biener  ---
Fixed on trunk (but surely latent elsewhere as well).

[Bug gcov-profile/99385] [11 regression] gcc.dg/tree-prof/indir-call-prof-malloc.c etc. FAIL

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99385

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

Richard Biener  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org
   Keywords||ra

--- Comment #17 from Richard Biener  ---
So coming back here.  We're presenting RA with a quite hard problem given we
have

(insn 7 4 8 2 (set (reg:TI 84 [ _9 ])
(mem:TI (reg:DI 101) [0 MEM <__int128 unsigned> [(char *
{ref-all})in_8(D)]+0 S16 A8])) 73 {*movti_internal}
 (expr_list:REG_DEAD (reg:DI 101)
(nil)))
(insn 8 7 9 2 (parallel [
(set (reg:DI 95)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 8)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":7:26 703 {*lshrdi3_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
..
(insn 10 9 11 2 (parallel [
(set (reg:DI 97)
(lshiftrt:DI (subreg:DI (reg:TI 84 [ _9 ]) 0)
(const_int 63 [0x3f])))
(clobber (reg:CC 17 flags))
]) "t.c":8:30 703 {*lshrdi3_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
..
(insn 12 11 13 2 (set (reg:V2DI 98 [ vect__5.3 ])
(ashift:V2DI (subreg:V2DI (reg:TI 84 [ _9 ]) 0)
(const_int 1 [0x1]))) "t.c":9:16 3611 {ashlv2di3}
 (expr_list:REG_DEAD (reg:TI 84 [ _9 ])
(nil)))

where I wonder why we keep the (subreg:DI (reg:TI 84 ...) 8) around
for so long.  Probably the subreg pass gives up because of the V2DImode
subreg of that reg.

That said, RA chooses xmm for reg:84 but then spills it immediately
to fulfil the subregs, even though there are mov and pextrd instructions
that could be used, or the reload could use the original mem.  That we
reload even the xmm use is another odd thing.

Vlad, I'm not sure about the possibilities LRA has here but maybe
you can have a look at the testcase in comment#6 (use -O3 -march=znver2
or -march=core-avx2).  For one I expected

vmovdqu (%rsi), %xmm2
vmovdqa %xmm2, -24(%rsp)
movq-16(%rsp), %rax   (2a)
vmovdqa -24(%rsp), %xmm4  (1)
...
movq-24(%rsp), %rdx   (2b)

(1) not to be there (not sure how that even survives postreload
optimizations...)
(2a/b) to be 'inherited' by instead loading from (%rsi) and 8(%rsi), which
is maybe asking too much because it requires aliasing considerations

That is, even if we don't consider using

   movq %xmm2, %rax (2a)
   pextrd %xmm2, %rdx, 1 (2b)

I expected us to not spill.

[Bug c++/99386] std::variant overhead much larger compared to clang

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99386

--- Comment #1 from Richard Biener  ---
Is that clang++ using libstdc++ from GCC or libc++?  In the end the difference
might boil down to inlining decision differences.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #18 from Richard Biener  ---
There's another thing - we end up with

vmovq   %rax, %xmm3
vpinsrq $1, %rdx, %xmm3, %xmm0

but that has way worse latency than the alternative you'd get w/o SSE 4.1:

vmovq   %rax, %xmm3
vmovq   %rdx, %xmm7
punpcklqdq  %xmm7, %xmm3

for example on Zen3 vmovq and vpinsrq have latencies of 3 while punpck
has a latency of only one.  So the second variant should have 2 cycles
less latency.

Testcase:

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3.  Not
sure if we should do this late somehow (peephole or splitter) since
it requires one more %xmm register.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #19 from Richard Biener  ---
So to recover performance we need both, avoiding the latency on the vector plus
avoiding the spilling.  This variant is fast:

.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
movq8(%rsi), %rdx
shrq$63, %rdx
imulq   $135, %rdx, %rdi
movq(%rsi), %rdx
vmovq   %rdi, %xmm0
vpsllq  $1, %xmm4, %xmm1
shrq$63, %rdx
vmovq   %rdx, %xmm5
vpunpcklqdq %xmm5, %xmm0, %xmm0
vpxor   %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53

compared to the original:

.L56:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq$63, %rdx
imulq   $135, %rdx, %rdi
movq16(%rsp), %rdx
vmovq   %rdi, %xmm0
vpsllq  $1, %xmm5, %xmm1
shrq$63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor   %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rax)
jmp .L53

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #22 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3.  Not
> > > sure if we should somehow do this late somehow (peephole or splitter) 
> > > since
> > > it requires one more %xmm register.
> > What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
> 
> Please try this:
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..edf7b1a3074 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -16043,7 +16043,12 @@
>   (const_string "maybe_evex")
>]
>(const_string "orig")))
> -   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
> +   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
> +   (set (attr "preferred_for_speed")
> + (cond [(eq_attr "alternative" "0,1,2,3")
> + (symbol_ref "false")
> +  ]
> +  (symbol_ref "true")))])
>  
>  (define_insn "*vec_concatv2di_0"

That works to avoid the vpinsrq.  I guess the case of a mem operand
behaves similarly to a gpr (plus the load uop); at least I don't have any
contrary evidence (but I didn't do any microbenchmarks either).

I'm not sure IRA/LRA will optimally handle the situation with register
pressure causing spilling in case it needs to reload both gpr operands.
At least for

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
-ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9 -ffixed-xmm10
-ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14 -ffixed-xmm15 I get
with the
patch

foo:
.LFB0:
.cfi_startproc
movq%rsi, -16(%rsp)
movq%rdi, %xmm0
pinsrq  $1, -16(%rsp), %xmm0
ret

while without it's

movq%rdi, %xmm0
pinsrq  $1, %rsi, %xmm0

As far as I understand the LRA dumps, the new attribute is a hard one,
applying even when other alternatives are worse.  In this case we choose
alt 7.  Covering also alts 7 and 8 with the optimize-for-speed attribute
causes reload failures - which is expected if there's no way for LRA to
choose alt 1.  The following seems to work for the small testcase above
but not for the important case in the benchmark (meh).

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..e393a0d823b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15992,7 +15992,7 @@
  (match_operand:DI 1 "register_operand"
  "  0, 0,x ,Yv,0,Yv,0,0,v")
  (match_operand:DI 2 "nonimmediate_operand"
- " rm,rm,rm,rm,x,Yv,x,m,m")))]
+ " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
   "TARGET_SSE"
   "@
pinsrq\t{$1, %2, %0|%0, %2, 1}

I guess the idea of this insn setup was exactly to get IRA/LRA to choose
the optimal instruction sequence - otherwise exposing the reload this
late is probably suboptimal.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #23 from Richard Biener  ---
Created attachment 50300
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50300&action=edit
preprocessed source of the important Botan TU

This is the full preprocessed source of the TU.  When compiled with -Ofast
-march=znver2 look for poly_double_n_le in the assembly, in the prologue the
function jumps based on kernel size - size 16 is the important one:

cmpq$16, %rdx
je  .L54
...
.L54:
.cfi_restore_state
vmovdqu (%rsi), %xmm4
vmovdqa %xmm4, 16(%rsp)
movq24(%rsp), %rdx
vmovdqa 16(%rsp), %xmm5
shrq$63, %rdx
imulq   $135, %rdx, %rcx
movq16(%rsp), %rdx
vmovq   %rcx, %xmm0
vpsllq  $1, %xmm5, %xmm1
shrq$63, %rdx
vpinsrq $1, %rdx, %xmm0, %xmm0
vpxor   %xmm1, %xmm0, %xmm0
vmovdqu %xmm0, (%rdi)
leaq-16(%rbp), %rsp
popq%r12
popq%r13
popq%rbp
.cfi_remember_state
.cfi_def_cfa 7, 8
ret

[Bug tree-optimization/95401] [10 Regression] GCC produces incorrect instruction with -O3 for AVX2 since r10-2257-g868363d4f52df19d

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95401

--- Comment #8 from Richard Biener  ---
(In reply to Alexandre Oliva from comment #7)
> How important is it that the test added for this PR be split into two
> separate source files?
> 
> I ask because, on targets that support vectors, but the vector unit is not
> enabled in the default configuration, vect.exp makes compile the default
> action, instead of run, and with additional sources, compile fails because
> one can't compile multiple sources into a single asm output.

Hmm, but that sounds like a mistake in the dg setup?  Anyway, if you can make
the testcase fail when combined (with some noipa attributes sprinkled around)
it's certainly fine to merge it into a single TU.

[Bug middle-end/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #2 from Richard Biener  ---
This is a loop-carried data dependence which we can't handle (we avoid creating
those from PRE but here it appears in the source itself).  I wonder how
LLVM handles this (pre/post vectorization IL).

Specifically 'carry around variable' is something we don't handle.

Can you somehow extract a compilable testcase (with just this kernel)?
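
For reference, a compilable sketch of the kernel as described above (a
reconstruction, so treat the details as assumptions):

typedef float real_t;
#define LEN_1D 32000
extern real_t a[LEN_1D], b[LEN_1D];

void s254 (void)
{
  real_t x = b[LEN_1D-1];              /* the carry-around variable */
  for (int i = 0; i < LEN_1D; i++) {
      a[i] = (b[i] + x) * (real_t).5;
      x = b[i];
  }
}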

Looking at the source, peeling a single iteration (to get rid of the initial
value) and then undoing the PRE, i.e. vectorizing

for (int i = 1; i < LEN_1D; i++) {
a[i] = (b[i] + b[i-1]) * (real_t).5;
}

would likely result in optimal code.  The assembly from clang doesn't look
optimal to me - llvm likely materializes 'x' as temporary array, vectorizing

  x[0] = b[LEN_1D-1];
for (int i = 0; i < LEN_1D; i++) {
a[i] = (b[i] + x[i]) * (real_t).5;
x[i+1] = b[i];
}

and then somehow (like we handle OMP simd lane arrays?) uses two vectors
as a sliding window over x[].  At least the standard strategy for
this kind of dependence is to get "rid" of it by turning it into a data
dependence and then hope for the best.

[Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
 Ever confirmed|0   |1
   Last reconfirmed||2021-03-05
 CC||rguenth at gcc dot gnu.org
  Component|middle-end  |tree-optimization

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-05
 CC||rguenth at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org
   Keywords||missed-optimization
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
  Component|middle-end  |tree-optimization

--- Comment #2 from Richard Biener  ---
please provide compilable testcases ...

Reduced testcase:

double a[1024];
void foo ()
{
  for (int i = 0; i < 1022; i += 2)
{
  a[i] = a[i+1] * a[i];
  a[i+1] = a[i+2] * a[i+1];
}
}

[Bug tree-optimization/99397] s152 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99397

Richard Biener  changed:

   What|Removed |Added

  Component|middle-end  |tree-optimization
   Keywords||missed-optimization
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-03-05

--- Comment #1 from Richard Biener  ---
That's the long-standing issue of dependence analysis not handling mixed
array and pointer access forms, which means we miss the distance-zero
computation and handling here.

There's a duplicate for this.

The mitigation is to "try again" with the array access demoted to a
pointer-based access (thus, analyze some alternative DR and see if
dependence analysis can handle that).
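
For illustration, a sketch of what mixed access forms look like here (not
the actual TSVC kernel, which passes the global arrays through pointer
parameters):

extern float a[1024];

void f (float *x)   /* callers pass 'a' */
{
  for (int i = 0; i < 1024; i++)
    *(x + i) = a[i] + 1.0f;   /* pointer-form store vs. array-form load:
                                 even for x == a the distance is zero, but
                                 the mixed forms defeat the analysis */
}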

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #26 from Richard Biener  ---
(In reply to rguent...@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> > 
> > --- Comment #24 from Uroš Bizjak  ---
> > (In reply to Richard Biener from comment #22)
> > > I guess the idea of this insn setup was exactly to get IRA/LRA choose
> > > the optimal instruction sequence - otherwise exposing the reload so
> > > late is probably suboptimal.
> > 
> > THere is one more tool in the toolbox. A peephole2 pattern can be
> > conditionalized on availabe XMM register. So, if XMM reg is available, the
> > GPR->XMM move can be emitted in front of the insn. So, if there is XMM 
> > register
> > pressure, pinsrd will be used, but if an XMM register is availabe, it will 
> > be
> > reused to emit punpcklqdq.
> > 
> > The peephole2 pattern can also be conditionalized for targets where GPR->XMM
> > moves are fast.
> 
> Note the trick is esp. important when GPR->XMM moves are _slow_.  But only
> in the case we originally combine two GPR operands.  Doing two
> GPR->XMM moves and then one puncklqdq hides half of the latency of the
> slow moves since they have no data dependence on each other.  So for the
> peephole we should try to match this - a reloaded operand and a GPR
> operand.  When the %xmm operand results from a SSE computation there's
> no point in splitting out a GPR->XMM move.
> 
> So in the end a peephole2 sounds like it could better match the condition
> the transform is profitable on.

I tried

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..8d0d3077cf8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1419,6 +1419,23 @@
   DONE;
 })

+(define_peephole2
+  [(set (match_operand:DI 0 "sse_reg_operand")
+(match_operand:DI 1 "general_gr_operand"))
+   (match_scratch:DI 2 "sse_reg_operand")
+   (set (match_operand:V2DI 2 "sse_reg_operand")
+   (vec_concat:V2DI (match_dup:DI 0)
+(match_operand:DI 3 "general_gr_operand")))]
+  "reload_completed"
+  [(set (match_dup 0)
+(match_dup 1))
+   (set (match_dup 2)
+(match_dup 3))
+   (set (match_dup 2)
+   (vec_concat:V2DI (match_dup 0)
+(match_dup 2)))]
+  "")
+
 ;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
 (define_peephole2
   [(set (match_operand:V2DF 0 "sse_reg_operand")

but that doesn't seem to match for some unknown reason.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #29 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #27)
> (In reply to Richard Biener from comment #26)
> > but that doesn't seem to match for some unknown reason.
> 
> Try this:
> 
> (define_peephole2
>   [(match_scratch:DI 5 "Yv")
>(set (match_operand:DI 0 "sse_reg_operand")
> (match_operand:DI 1 "general_reg_operand"))
>(set (match_operand:V2DI 2 "sse_reg_operand")
> (vec_concat:V2DI (match_operand:DI 3 "sse_reg_operand")
>  (match_operand:DI 4 "nonimmediate_gr_operand")))]
>   ""
>   [(set (match_dup 0)
> (match_dup 1))
>(set (match_dup 5)
> (match_dup 4))
>(set (match_dup 2)
>(vec_concat:V2DI (match_dup 3)
> (match_dup 5)))])

Ah, I messed up operands.  The following works (the above position of
match_scratch happily chooses an operand matching operand 0):

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transitions.
(define_peephole2
  [(set (match_operand:DI 0 "sse_reg_operand")
(match_operand:DI 1 "general_reg_operand"))
   (match_scratch:DI 2 "Yv")
   (set (match_operand:V2DI 3 "sse_reg_operand")
(vec_concat:V2DI (match_dup 0)
 (match_operand:DI 4 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 0)
(match_dup 1))
   (set (match_dup 2)
(match_dup 4))
   (set (match_dup 3)
(vec_concat:V2DI (match_dup 0)
 (match_dup 2)))])

but for some reason it again doesn't work for the important loop.  There
we have

  389: xmm0:DI=cx:DI
  REG_DEAD cx:DI
  390: dx:DI=[sp:DI+0x10]
   56: {dx:DI=dx:DI 0>>0x3f;clobber flags:CC;}
  REG_UNUSED flags:CC
   57: xmm0:V2DI=vec_concat(xmm0:DI,dx:DI)

I suppose the reason is that there are two unrelated insns between the
xmm0 = cx:DI move and the vec_concat.  That would hint that we need to not
match this GPR->XMM move in the peephole pattern but instead check for it
in the condition (can we use DF there?).

The simplified variant below works but IMHO matches cases we do not
want to transform.  I can't find any example on how to achieve that
though.

;; Further split pinsrq variants of vec_concatv2di with two GPR sources,
;; one already reloaded, to hide the latency of one GPR->XMM transitions.
(define_peephole2
  [(match_scratch:DI 3 "Yv")
   (set (match_operand:V2DI 0 "sse_reg_operand")
(vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
 (match_operand:DI 2 "nonimmediate_gr_operand")))]
  "reload_completed && optimize_insn_for_speed_p ()"
  [(set (match_dup 3)
(match_dup 2))
   (set (match_dup 0)
(vec_concat:V2DI (match_dup 1)
 (match_dup 3)))])

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #33 from Richard Biener  ---
Created attachment 50308
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50308&action=edit
patch

I am testing the following.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #35 from Richard Biener  ---
(In reply to Richard Biener from comment #33)
> Created attachment 50308 [details]
> patch
> 
> I am testing the following.

It FAILs

FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\\
\\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19
FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
vpinsrq[^\\n\\r]*\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17
FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

I'll see how to update those next week.

[Bug tree-optimization/99407] s243 benchmark of TSVC is vectorized by clang and not by gcc, missed DSE

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99407

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2021-03-08
 Ever confirmed|0   |1
   Keywords||missed-optimization
 Blocks||53947
Summary|s243 benchmark of TSVC is   |s243 benchmark of TSVC is
   |vectorized by clang and not |vectorized by clang and not
   |by gcc  |by gcc, missed DSE
  Component|middle-end  |tree-optimization
 Status|UNCONFIRMED |NEW

--- Comment #2 from Richard Biener  ---
Hmm, wonder why DSE didn't remove the first a[i] store.  Ah, because DSE
doesn't use data-ref analysis and thus cannot disambiguate the variable offset.

Manually applying DSE produces

.L4:
vmovaps c(%rax), %ymm1
vaddps  e(%rax), %ymm1, %ymm0
addq$32, %rax
vmovups a-28(%rax), %ymm1
vmulps  d-32(%rax), %ymm1, %ymm1
vmulps  d-32(%rax), %ymm0, %ymm0
vaddps  b-32(%rax), %ymm0, %ymm0
vmovaps %ymm0, b-32(%rax)
vaddps  %ymm0, %ymm1, %ymm0
vmovaps %ymm0, a-32(%rax)
cmpq$127968, %rax
jne .L4


manually DSEd loop:

for (int nl = 0; nl < iterations; nl++) {
for (int i = 0; i < LEN_1D-1; i++) {
real_t tem = b[i] + c[i  ] * d[i];
b[i] = tem + d[i  ] * e[i];
a[i] = b[i] + a[i+1] * d[i];
}
}
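
For reference, the original kernel presumably looks like this (reconstructed
from the DSEd form above, so treat it as an assumption):

for (int nl = 0; nl < iterations; nl++) {
    for (int i = 0; i < LEN_1D-1; i++) {
        a[i] = b[i] + c[i] * d[i];   /* dead: overwritten two lines below,
                                        but DSE can't disambiguate the
                                        intervening a[i+1] read */
        b[i] = a[i] + d[i] * e[i];
        a[i] = b[i] + a[i+1] * d[i];
    }
}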


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug middle-end/99408] s3251 benchmark of TSVC vectorized by clang runs about 7 times faster compared to gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99408

Richard Biener  changed:

   What|Removed |Added

 Blocks||53947
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
Hum, GCC's code _looks_ faster.  Maybe it's our tendency to duplicate memory
accesses in vector instructions (there's a PR about this somewhere).  A
load uop on every stmt is likely the bottleneck here.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/99409] s252 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Blocks||53947
  Component|middle-end  |tree-optimization

--- Comment #1 from Richard Biener  ---
Yes, we can't do 'scalar expansion'.  We'd need some pre-pass to turn PHIs
into data accesses.  Here we want

t[0] = (real_t) 0.;
for (int i = 0; i < LEN_1D; i++) {
s = b[i] * c[i];
a[i] = s + t[i];
t[i+1] = s;
}

and then of course the trick is to elide the actual array and instead do
clever shuffling of vector registers.

IIRC one of the other TSVC examples was similar.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/99411] s311, s312, s31111, s31111, s3110, vsumr benchmark of TSVC is vectorized by clang better than by gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Richard Biener  changed:

   What|Removed |Added

 Blocks||53947
   Keywords||missed-optimization
  Component|middle-end  |tree-optimization

--- Comment #6 from Richard Biener  ---
So clang uses a larger VF (unroll of the vectorized loop) here.  I think we
have another PR about this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

Richard Biener  changed:

   What|Removed |Added

  Component|middle-end  |tree-optimization
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Blocks||53947
   Keywords||missed-optimization
 Ever confirmed|0   |1
 Depends on||97832
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2021-03-08

--- Comment #1 from Richard Biener  ---
With -fno-tree-reassoc we detect the reduction chain and produce

.L3:
vmovaps b(%rax), %ymm5
vmovaps b+32(%rax), %ymm6
addq$160, %rax
vfmadd231ps a-160(%rax), %ymm5, %ymm1
vmovaps b-96(%rax), %ymm7
vfmadd231ps a-128(%rax), %ymm6, %ymm0
vmovaps b-64(%rax), %ymm5
vmovaps b-32(%rax), %ymm6
vfmadd231ps a-96(%rax), %ymm7, %ymm2
vfmadd231ps a-64(%rax), %ymm5, %ymm3
vfmadd231ps a-32(%rax), %ymm6, %ymm4
cmpq$128000, %rax
jne .L3
vaddps  %ymm1, %ymm0, %ymm0
vaddps  %ymm2, %ymm0, %ymm0
vaddps  %ymm3, %ymm0, %ymm0
vaddps  %ymm4, %ymm0, %ymm0
vextractf128$0x1, %ymm0, %xmm1
vaddps  %xmm0, %xmm1, %xmm1
vmovhlps%xmm1, %xmm1, %xmm0
vaddps  %xmm1, %xmm0, %xmm0
vshufps $85, %xmm0, %xmm0, %xmm1
vaddps  %xmm0, %xmm1, %xmm0
decl%edx
jne .L2

we're not re-rolling and thus are forced to use a VF of 4 here.

Note that LLVM doesn't seem to vectorize the loop but instead vectorizes
the basic block, which isn't what TSVC looks for (but that would work for
non-fast-math).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
[Bug 97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than
-O3

[Bug tree-optimization/99414] s235 benchmark of TSVC is vectorized better by icc than gcc (loop interchange)

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414

Richard Biener  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Blocks||53947
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-03-08
 Ever confirmed|0   |1
  Component|middle-end  |tree-optimization
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
linterchange says:

Consider loop interchange for loop_nest<1 - 3>
Access Strides for DRs:
  a[i_33]:  <0, 4,  0>
  b[i_33]:  <0, 4,  0>
  c[i_33]:  <0, 4,  0>
  a[i_33]:  <0, 4,  0>
  aa[_6][i_33]: <0, 4,  1024>
  bb[j_34][i_33]:   <0, 4,  1024>
  aa[j_34][i_33]:   <0, 4,  1024>

Loop(3) carried vars:
  Induction:  j_34 = {1, 1}_3
  Induction:  ivtmp_53 = {255, 4294967295}_3

Loop(2) carried vars:
  Induction:  i_33 = {0, 1}_2
  Induction:  ivtmp_51 = {256, 4294967295}_2

and then doesn't do anything.

I suppose the best thing to do here is to first distribute the loop nest,
but our cost modeling fuses the two obvious candidates:

Fuse partitions because they have shared memory refs:
  Part 1: 0, 1, 2, 3, 4, 5, 6, 7, 19, 20, 21
  Part 2: 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21

so this is a case that asks for better cost modeling there.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/99415] s115 benchmark of TSVC is vectorized by icc and not by gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99415

Richard Biener  changed:

   What|Removed |Added

  Component|middle-end  |tree-optimization
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 Blocks||53947
   Last reconfirmed||2021-03-08
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
The benchmark is written in a way that confuses our loop header copying, it
seems.  Writing

for (int j = 0; j < LEN_2D-1; j++) {
for (int i = j+1; i < LEN_2D; i++) {
a[i] -= aa[j][i] * a[j];
}
}

fixes the vectorizing.

Possibly a mistake users make, so probably worth investigating further.  Not
sure how to most easily address this - we'd like to peel the last iteration
of the outer loop, noting it does nothing.  Maybe loop-splitting can figure
this out?  Alternatively loop header copying should just do its job...

Hmm, actually loop-header copying does do its job but then there's jump
threading messing this up again (the loop header check is redundant for
all but the last iteration of the outer loop).  So -fno-tree-dominator-opts
fixes this as well.  And for some reason ch_vect thinks the loops are
all do-while loops.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug tree-optimization/99416] s211 benchmark of TSVC is vectorized by icc and not by gcc

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99416

Richard Biener  changed:

   What|Removed |Added

 Blocks||53947
   Last reconfirmed||2021-03-08
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
  Component|middle-end  |tree-optimization
   Keywords||missed-optimization
 CC||amker at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
Confirmed.  ICC applies loop distribution but again our cost-modeling doesn't
want that to happen.

I suspect we want to detect extra incentives there (make dependences "good",
allow interchange, etc.)
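A sketch of what distribution would look like here, assuming the usual TSVC
s211 kernel (an assumption, not taken from the PR):

#define LEN_1D 32000

float a[LEN_1D], b[LEN_1D], c[LEN_1D], d[LEN_1D], e[LEN_1D];

/* Assumed s211 shape: the b[i-1] read makes the single loop carry a
   backward dependence.  */
void s211 (void)
{
  for (int i = 1; i < LEN_1D - 1; i++)
    {
      a[i] = b[i - 1] + c[i] * d[i];
      b[i] = b[i + 1] - e[i] * d[i];
    }
}

/* After distribution (the b loop must run first because a[i] reads the
   updated b[i-1]); both loops then vectorize independently.  */
void s211_distributed (void)
{
  for (int i = 1; i < LEN_1D - 1; i++)
    b[i] = b[i + 1] - e[i] * d[i];
  for (int i = 1; i < LEN_1D - 1; i++)
    a[i] = b[i - 1] + c[i] * d[i];
}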


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug ipa/99419] possible missed optimization for dead code elimination

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99419

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
Version|unknown |11.0
 Depends on||80603
   Last reconfirmed||2021-03-08
   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW

--- Comment #1 from Richard Biener  ---
dup or at least depends on PR80603


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80603
[Bug 80603] Optimize loads from constant arrays or aggregates with arrays

[Bug ipa/99428] possible missed optimization for dead code elimination

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99428

Richard Biener  changed:

   What|Removed |Added

  Component|tree-optimization   |ipa
   Keywords||missed-optimization
   Last reconfirmed||2021-03-08
 CC||hubicka at gcc dot gnu.org,
   ||marxin at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
Version|unknown |11.0

--- Comment #1 from Richard Biener  ---
An IPA/GIMPLE "phase ordering" issue.  Alternatively, when 'b' is discovered
read-only, its analysis would need to consider the initializer propagated
(and 'b' thus eventually not address-taken) to make 'a' read-only as well ...
(or apply modref to tell that 'a' is not written to?)

[Bug c++/99445] [11 Regression] ICE in hashtab_chk_error, at hash-table.c:137 since r11-7011-g6e0a231a4aa2407b

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99445

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1
   Target Milestone|--- |11.0

[Bug preprocessor/99446] [11 Regression] ICE in linemap_position_for_loc_and_offset, at libcpp/line-map.c:1005

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99446

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.0

[Bug lto/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.0

--- Comment #4 from Richard Biener  ---
I also wonder when the GC was triggered, thus whether it's another case of a
live stmt / SSA name where we now forcefully free the CFG.

[Bug c++/99451] [plugin] cannot enable specific dump for plugin passes

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99451

--- Comment #1 from Richard Biener  ---
Yeah.

[Bug c++/99456] [11 regression] ABI breakage with some static initialization

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99456

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
   Priority|P3  |P1

[Bug debug/99457] gcc/gdb -gstabs+ is buggy.

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99457

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WORKSFORME

--- Comment #5 from Richard Biener  ---
Works for me.

[Bug c++/99459] [11 Regression] Many coroutines regressions on armv7hl-linux-gnueabi

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99459

Richard Biener  changed:

   What|Removed |Added

 Target||arm
   Priority|P3  |P1

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #36 from Richard Biener  ---
(In reply to Richard Biener from comment #35)
> (In reply to Richard Biener from comment #33)
> > Created attachment 50308 [details]
> > patch
> > 
> > I am testing the following.
> 
> It FAILs
> 
> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\\
> \\\$1[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

That's exactly the case we're after: a V2DI concat from two GPRs.

> FAIL: gcc.target/i386/avx512dq-concatv2di-1.c scan-assembler
> vpinsrq[^\\n\\r]*\$1[^\\n\\r]*%rsi[^\\n\\r]*%xmm16[^\\n\\r]*%xmm17

This is, like below, a MEM case.

> FAIL: gcc.target/i386/avx512vl-concatv2di-1.c scan-assembler
> vmovhps[^\\n\\r]*%[re]si[^\\n\\r]*%xmm18[^\\n\\r]*%xmm19

This one is because nonimmediate_gr_operand also matches a MEM, in this case
we apply the peephole to

(insn 12 11 13 2 (set (reg/v:V2DI 55 xmm19 [ c ])
        (vec_concat:V2DI (reg:DI 54 xmm18 [91])
            (mem:DI (reg/v/f:DI 4 si [orig:86 y ] [86]) [1 *y_8(D)+0 S8 A64])))

latency-wise memory isn't any better than a GPR so the decision to split
is reasonable.

> I'll see how to update those next week.

So I updated the above to check for vpunpcklqdq instead.
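As an aside, a source-level shape that produces the V2DI concat from two GPRs
(an illustration, not the testcase from the PR):

#include <immintrin.h>

/* V2DI concat from two GPRs - the case the peephole targets.  After the
   split, one GPR->XMM move (vmovq) can be scheduled early and the combine
   becomes vpunpcklqdq instead of vpinsrq.  */
__m128i concat (long long lo, long long hi)
{
  return _mm_set_epi64x (hi, lo);
}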

[Bug rtl-optimization/99462] New: Enhance scheduling to split instructions

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

Bug ID: 99462
   Summary: Enhance scheduling to split instructions
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Maybe the scheduler(s) can already do this (I have zero knowledge here).  For
example the x86 vec_concatv2di insn has alternatives that cause the instruction
to be split into multiple uops (vpinsrq, movhpd) when the 'insert' operand
is not XMM (but GPR or MEM).  We now have a peephole2 to split such cases:

+;; Further split pinsrq variants of vec_concatv2di to hide the latency
+;; of the GPR->XMM transition(s).
+(define_peephole2
+  [(match_scratch:DI 3 "Yv")
+   (set (match_operand:V2DI 0 "sse_reg_operand")
+        (vec_concat:V2DI (match_operand:DI 1 "sse_reg_operand")
+                         (match_operand:DI 2 "nonimmediate_gr_operand")))]
+  "TARGET_64BIT && TARGET_SSE4_1
+   && !optimize_insn_for_size_p ()"
+  [(set (match_dup 3)
+        (match_dup 2))
+   (set (match_dup 0)
+        (vec_concat:V2DI (match_dup 1)
+                         (match_dup 3)))])

but in reality this is only profitable when we either can execute
two "bad" move uops in parallel (thus when originally composing
two GPRs or two MEMs) or when we can schedule one "bad" move much
earlier.

Thus, can the scheduler already "split" an instruction - say split
away a load uop and issue it early when a scratch register is available?

(the reverse alternative is to not expose multi-uop insns before scheduling
and only merge them later - during scheduling?)

How does GCC deal with situations like this?

[Bug rtl-optimization/99462] Enhance scheduling to split instructions

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||amonakov at gcc dot gnu.org,
   ||law at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
CCing scheduler maintainers

[Bug target/99461] [11 Regression] ICE in extract_constrain_insn, at recog.c:2670 since r11-7526-g9105757a59b89019

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99461

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

--- Comment #1 from Richard Biener  ---
Looks like a dup.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #37 from Richard Biener  ---
So my analysis was partly wrong: the vpinsrq isn't an issue for the benchmark,
only the spilling is.
but only the spilling is.

Note that the other idea of disparaging vector CTORs more like with

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 260f87b..f8caf8e7dff 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21821,8 +21821,15 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 
       case vec_construct:
        {
-         /* N element inserts into SSE vectors.  */
+         /* N-element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         /* We cannot insert from GPRs directly but there's always a
+            GPR->XMM uop involved.  Account for that.
+            ???  Note that loads are already costed separately so this
+            eventually double-counts them.  */
+         if (!fp)
+           cost += (TYPE_VECTOR_SUBPARTS (vectype)
+                    * ix86_cost->hard_register.integer_to_sse);
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);

helps for generic and core-avx2 tuning:

t.c:10:3: note: Cost model analysis:
0x3858cd0 _6 1 times scalar_store costs 12 in body
0x3858cd0 _4 1 times scalar_store costs 12 in body
0x3858cd0 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3858cd0 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3858cd0 _15 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 _14 << 1 1 times scalar_stmt costs 4 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3858cd0  0 times vec_perm costs 0 in body
0x3858cd0 _15 << 1 1 times vector_stmt costs 4 in body
0x3858cd0 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3858cd0  1 times vec_construct costs 20 in prologue
0x3858cd0 _6 1 times unaligned_store (misalign -1) costs 12 in body
0x3858cd0 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3858cd0 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 44
t.c:10:3: missed: not vectorized: vectorization is not profitable.

but not for znver2:

t.c:10:3: note: Cost model analysis:
0x3703790 _6 1 times scalar_store costs 16 in body
0x3703790 _4 1 times scalar_store costs 16 in body
0x3703790 _5 ^ carry_10 1 times scalar_stmt costs 4 in body
0x3703790 _2 ^ _3 1 times scalar_stmt costs 4 in body
0x3703790 _15 << 1 1 times scalar_stmt costs 4 in body
0x3703790 _14 << 1 1 times scalar_stmt costs 4 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times scalar_stmt costs 4 in body
0x3703790  0 times vec_perm costs 0 in body
0x3703790 _15 << 1 1 times vector_stmt costs 4 in body
0x3703790 _5 ^ carry_10 1 times vector_stmt costs 4 in body
0x3703790  1 times vec_construct costs 20 in prologue
0x3703790 _6 1 times unaligned_store (misalign -1) costs 16 in body
0x3703790 BIT_FIELD_REF <_9, 64, 0> 1 times vec_to_scalar costs 4 in epilogue
0x3703790 BIT_FIELD_REF <_9, 64, 64> 1 times vec_to_scalar costs 4 in epilogue
t.c:10:3: note: Cost model analysis for part in loop 0:
  Vector cost: 52
  Scalar cost: 52
t.c:10:3: note: Basic block will be vectorized using SLP

apparently for znver{1,2,3} we choose a slightly higher load/store cost.

We could also try mitigating vectorization by decomposing the __int128
load in forwprop where we have

  else if (TREE_CODE (TREE_TYPE (lhs)) == VECTOR_TYPE
           && TYPE_MODE (TREE_TYPE (lhs)) == BLKmode
           && gimple_assign_load_p (stmt)
           && !gimple_has_volatile_ops (stmt)
           && (TREE_CODE (gimple_assign_rhs1 (stmt))
               != TARGET_MEM_REF)
           && !stmt_can_throw_internal (cfun, stmt))
    {
  /* Rewrite loads used only in BIT_FIELD_REF extractions to
 component-wise loads.  */

this was tailored to decompose, early, GCC vector extension loads that are
not supported by the HW.  Here we have

  _9 = MEM <__int128 unsigned> [(char * {ref-all})in_8(D)];
  _14 = BIT_FIELD_REF <_9, 64, 64>;
  _15 = BIT_FIELD_REF <_9, 64, 0>;

where the HW doesn't have any __int128 GPRs.  If we do not vectorize then
the RTL pipeline will eventually split the load.  If vectorization is
profitable then the vectorizer should be able to vectorize the resulting
split loads as well.  In this case this would cause actual costing of the
load (the re-use of the __int128 to-be-in-SSE reg is instead free) and also
cost the live lane extract for the retained integer code.  But that moves
the cost even more towards vectorizing since now a vector load (cost 12)
plus two live lane extracts (when fixed to cost sse_to_integer that's 2 * 6)
is used in place of two scalar loads (cost

[Bug tree-optimization/99473] redundant conditional zero-initialization not eliminated

2021-03-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99473

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Richard Biener  ---
g3 and g1 also behave differently because sinking happens in a way that the
pass's store-commoning code doesn't trigger on the sunk store (there's a dup
PR for this that I can't find right now).

cselim doesn't trigger because

  if ((TREE_CODE (lhs) != MEM_REF
       && TREE_CODE (lhs) != ARRAY_REF
       && TREE_CODE (lhs) != COMPONENT_REF)
      || !is_gimple_reg_type (TREE_TYPE (lhs)))
    return false;

lhs is a VAR_DECL and 'nontrap' only tracks pointers.  There's code to
actually handle auto-vars now, but the above still disallows bare decls.
Because the variable has its address taken, the transform will also
require -fallow-store-data-races.
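For illustration, the net effect of the transform on a bare decl (a minimal
sketch, not the PR's g3 testcase):

int g;

/* Before: the store happens only when the condition holds.  */
void before (int cond, int v)
{
  if (cond)
    g = v;
}

/* After cond_store_replacement: the store becomes unconditional, which can
   race with another thread when cond is false - hence the
   -fallow-store-data-races requirement.  */
void after (int cond, int v)
{
  int tmp = cond ? v : g;
  g = tmp;
}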

diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c
index ddd9d531b13..6f7efa29a1b 100644
--- a/gcc/tree-ssa-phiopt.c
+++ b/gcc/tree-ssa-phiopt.c
@@ -2490,9 +2490,8 @@ cond_store_replacement (basic_block middle_bb, basic_block join_bb,
   locus = gimple_location (assign);
   lhs = gimple_assign_lhs (assign);
   rhs = gimple_assign_rhs1 (assign);
-  if ((TREE_CODE (lhs) != MEM_REF
-       && TREE_CODE (lhs) != ARRAY_REF
-       && TREE_CODE (lhs) != COMPONENT_REF)
+  if ((!REFERENCE_CLASS_P (lhs)
+       && !DECL_P (lhs))
       || !is_gimple_reg_type (TREE_TYPE (lhs)))
     return false;


fixes g3 (with -fallow-store-data-races).  Queued for GCC 12.

g2 needs sinking/commoning of f (&x) for which there's yet another PR I think.

[Bug tree-optimization/99475] [10/11 Regression] bogus -Warray-bounds accessing an array element of empty structs

2021-03-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99475

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2
   Target Milestone|--- |10.3

[Bug target/99487] [10 Regression] ICE during RTL pass: final in expand_function_start on hppa-linux-gnu

2021-03-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99487

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2021-03-09
   Keywords||build
   Target Milestone|--- |10.3

--- Comment #1 from Richard Biener  ---
That's during bootstrap, it seems?  The backtrace is not very useful, it misses
the caller.  Are there more pieces scattered somewhere in the log?

[Bug tree-optimization/97104] [11 Regression] aarch64, SVE: ICE in vect_get_loop_mask since r11-3070-g783dc66f9cc

2021-03-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97104

Richard Biener  changed:

   What|Removed |Added

 CC||marxin at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org
   Keywords||needs-bisection

--- Comment #4 from Richard Biener  ---
Also gone latent (or fixed) now.  I also know nothing about the loop masks
code.

Extracting a GIMPLE testcase from one of the broken revs might carry this
over.

We're vectorizing the last testcase with SVE now and not ICEing.  So maybe it
was really fixed, who knows (but I don't see any live stmts).

Can somebody bisect what fixed this?

[Bug target/97513] [11 regression] aarch64 SVE regressions since r11-3822

2021-03-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97513

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2021-03-09
 Status|UNCONFIRMED |WAITING

--- Comment #5 from Richard Biener  ---
What's the current state of affairs?

[Bug target/99492] double and double _Complex not aligned consistently on AIX

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99492

Richard Biener  changed:

   What|Removed |Added

   Keywords||ABI

--- Comment #1 from Richard Biener  ---
I assume GCC puts it at offset 8?
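To illustrate the question (a guess at the shape involved; the PR has the
actual details):

/* Under the AIX "power" alignment rules a double member that is not first
   in the struct is only 4-byte aligned.  The report is that double and
   double _Complex are not treated the same way.  */
struct A { int i; double d; };           /* d at offset 4?  */
struct B { int i; double _Complex z; };  /* z at offset 8?  */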

[Bug other/99496] [11 regression] g++.dg/modules/xtreme-header-3_c.C ICEs after r11-7557

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99496

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.0
   Priority|P3  |P1

[Bug target/99497] _mm_min_ss/_mm_max_ss incorrect results when values known at compile time

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99497

--- Comment #3 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #2)
> And another question is if we without -ffast-math ever create
> MIN_EXPR/MAX_EXPR and what exactly are the rules for those, if it is safe to
> expand those into SMAX etc., or if those need to use UNSPECs too.

We don't create them w/o -ffinite-math-only -fno-signed-zeros.  We of course
eventually could, if there's a compare sequence matching the relaxed
requirements.

But MIN/MAX_EXPR should be safe to expand to smin/max always given their
semantics match.

[Bug testsuite/99498] [11 regression] new test case g++.dg/opt/pr99305.C in r11-7587 fails

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99498

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.0

[Bug c++/99500] [11 Regression] ICE: tree check: expected tree that contains 'decl minimal' structure, have 'error_mark' in cp_parser_requirement_parameter_list, at cp/parser.c:28828

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99500

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4

[Bug tree-optimization/99504] Missing memmove detection

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99504

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
   Last reconfirmed||2021-03-10

--- Comment #1 from Richard Biener  ---
The issue is that in the pixel case we have an aggregate assignment:

   [local count: 955630225]:
  # p_17 = PHI 
  # q_18 = PHI 
  # i_19 = PHI 
  q_9 = q_18 + 4;
  p_10 = p_17 + 4;
  *p_17 = *q_18;
  i_12 = i_19 + 1;
  if (n_8(D) != i_12)
goto ; [89.00%]
  else
goto ; [11.00%]

and that's not handled by vectorization or dependence analysis.

We might want to consider applying the same folding to this as we do for
memcpy, turning it into

  _42 = MEM <unsigned int> [q_18, (pixel *)0];
  MEM <unsigned int> [p_17, (pixel *)0] = _42;
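A sketch of the kind of source producing the above (the 4-byte struct matches
the pointer increments in the dump; the PR's actual testcase may differ):

typedef struct { unsigned char r, g, b, a; } pixel;

void copy (pixel *p, pixel *q, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    *p++ = *q++;   /* aggregate assignment - not recognized as memmove */
}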

[Bug fortran/99506] internal compiler error: in record_reference, at cgraphbuild.c:64

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99506

--- Comment #4 from Richard Biener  ---
This is a frontend issue, the FE produces an invalid static initializer for
'latt' (DECL_INITIAL):

{(real(kind=8)) latt100[(integer(kind=8)) i + -1] / 1.0e+2, (real(kind=8))
latt100[(integer(kind=8)) i + -1] / 1.0e+2,... }

if this should be dynamic initialization FEs are responsible for lowering
this.

I don't know Fortran well enough to say what 'parameter' means in this context:

   real(double),  parameter:: latt(jmax) = [(latt100(i)/100.d0, j=1,jmax)]

but the middle-end sees a readonly global (TREE_STATIC) variable.

[Bug c++/99508] [11 Regression] Asm labels declared inside a function are ignored

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99508

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1
  Known to work||10.2.1
Summary|Asm labels declared inside  |[11 Regression] Asm labels
   |a function are ignored  |declared inside a function
   ||are ignored
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Target Milestone|--- |11.0
   Keywords||wrong-code
   Last reconfirmed||2021-03-10
  Known to fail||11.0

--- Comment #1 from Richard Biener  ---
Confirmed.  Works with the C frontend.

[Bug gcov-profile/99512] New: Add counter annotation to allow store-data-races to be introduced with -fprofile-update=single

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99512

Bug ID: 99512
   Summary: Add counter annotation to allow store-data-races to be
introduced with -fprofile-update=single
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: gcov-profile
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rguenth at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

PR64928 shows that LIM, when applying store motion to the profile counters
emitted for -fprofile-update=single, adds code to avoid introducing store
data races.  But with -fprofile-update=single that's not required.

The proposal is to add a means to control -fallow-store-data-races at the
decl level (usable by users as well) via an attribute, to abstract the
middle-end's global flag check into a decl-specific query that can look up
the attribute, and to have coverage.c mark the profile counters this way
for -fprofile-update=single.
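A purely hypothetical user-level spelling of such an attribute (the name is
invented for illustration; nothing like it exists yet):

/* Hypothetical: opt this counter in to -fallow-store-data-races semantics
   regardless of the global flag.  */
static long counter __attribute__ ((allow_store_data_races));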

[Bug middle-end/64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in "phase opt and generate" with -ftest-coverage -fprofile-arcs

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928

--- Comment #36 from Richard Biener  ---
So the issue is still the same - one thing I noticed is that store-motion also
adds a flag for each counter update to avoid introducing store-data-races.
-fallow-store-data-races mitigates that part and speeds up the compilation
quite a bit.  In case there are threads involved you'd want
-fprofile-update=atomic
which then causes store-motion to give up and the compile-time is great
overall.

The original trigger of the regression is likely the marking of the profile
counters as not aliased - we might want to introduce another flag to
tell that store-data-races for the particular decl are not a consideration
(maybe even have some user-visible attribute for this).
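To illustrate what that flag looks like, a hand-written sketch (not compiler
output) of the code store motion generates for a conditionally updated
counter without -fallow-store-data-races:

extern long counter;

void count_taken (int n, const int *take)
{
  long tmp = counter;   /* counter promoted to a register over the loop */
  int written = 0;
  for (int i = 0; i < n; i++)
    if (take[i])
      {
        tmp += 1;
        written = 1;    /* flag inserted to avoid a store data race */
      }
  if (written)          /* guarded store instead of a plain counter = tmp */
    counter = tmp;
}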

Otherwise re-confirmed (I stripped options down to -O -fPIC -fprofile-arcs
-ftest-coverage):

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-2.o1-fib-2.i
1.84user 0.05system 0:01.90elapsed 99%CPU (0avgtext+0avgdata
160764maxresident)k
0inputs+0outputs (0major+58129minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-3.o1-fib-3.i 
10.15user 0.17system 0:10.32elapsed 99%CPU (0avgtext+0avgdata
726688maxresident)k
0inputs+0outputs (0major+265008minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-4.o1-fib-4.i 
43.60user 1.06system 0:44.68elapsed 99%CPU (0avgtext+0avgdata
6107260maxresident)k
0inputs+0outputs (0major+1765217minor)pagefaults 0swaps
rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i 
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
Command exited with non-zero status 1
143.09user 3.93system 2:28.29elapsed 99%CPU (0avgtext+0avgdata
24636148maxresident)k
37504inputs+0outputs (31major+6133278minor)pagefaults 0swaps

on the last which runs OOM adding -fallow-store-data-races does

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fallow-store-data-races
123.06user 0.45system 2:03.59elapsed 99%CPU (0avgtext+0avgdata
100maxresident)k
57304inputs+0outputs (68major+535127minor)pagefaults 0swaps

and -fprofile-update=atomic

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fprofile-update=atomic 
0.61user 0.02system 0:00.63elapsed 100%CPU (0avgtext+0avgdata
73236maxresident)k
72inputs+0outputs (0major+18284minor)pagefaults 0swaps

and -fno-tree-loop-im

rguenther@ryzen:/tmp> /usr/bin/time ~/install/gcc-11.0/usr/local/bin/gcc -S -O
-fPIC -fprofile-arcs -ftest-coverage fib-5.o1-fib-5.i -fno-tree-loop-im  
1.06user 0.01system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 90672maxresident)k
0inputs+0outputs (0major+24331minor)pagefaults 0swaps

I still wonder if you can produce an even smaller testcase where visualizing
the CFG is possible.  Unfortunately the source is mechanically generated
and following it is hard.  Say, a testcase that retains the basic structure
but ends up with just a few (two, fewer than ten) computed gotos?

[Bug gcov-profile/99512] Add counter annotation to allow store-data-races to be introduced with -fprofile-update=single

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99512

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Blocks||64928
   Severity|normal  |enhancement


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64928
[Bug 64928] [8/9/10/11 Regression] Inordinate cpu time and memory usage in
"phase opt and generate" with -ftest-coverage -fprofile-arcs

[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #1 from Richard Biener  ---
Confirmed.

 tree slp vectorization :  32.83 ( 84%)   0.05 ( 20%)  32.90 ( 83%)    62M ( 24%)

I suspect it was latent before, caused by excessive vector-size iteration.

I'll see what's the real cause here.

[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510

--- Comment #2 from Richard Biener  ---
Ah, OK.  We have a lot of vector CTORs that we "vectorize" with load
permutations
like { 484 506 } and that runs into the pre-existing issue (there's a PR
about this...) that we emit dead vector loads for all of the elements in the
group, including gaps.

Costing says they're even which possibly makes sense.

We do a build_aligned_type for each emitted stmt and for some reason
it's quite costly here (well, there's the awkward linear type variant list
to walk ...).

Caching should be possible but the load vectorization loop is already
quite awkward.  Meh.

The rev. likely triggered this because we didn't previously cost the scalar
root stmt (the CTOR itself, which we replace).  Doing that made the costing
profitable.
the costing side difficult - the vector load should be an epsilon more
expensive to avoid these issues.

Note that for some reason we have a gazillion type variants here.  Huh.
~36070 variants per type.  Ah.  And _that's_ because build_aligned_type does

  for (t = TYPE_MAIN_VARIANT (type); t; t = TYPE_NEXT_VARIANT (t))
    if (check_aligned_type (t, type, align))
      return t;

  t = build_variant_type_copy (type);
  SET_TYPE_ALIGN (t, align);
  TYPE_USER_ALIGN (t) = 1;


and check_aligned_type checks for an exact TYPE_USER_ALIGN match, but of
course if 'type' wasn't user-aligned originally it won't find the created
aligned type ...

Fixing that fixes the compile-time issue.
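A toy model of the blow-up (not GCC code; names simplified): the exact-match
requirement means a type without TYPE_USER_ALIGN never matches the
user-aligned variants it created earlier, so every call appends a new
variant and walks the ever-growing list:

#include <stdio.h>
#include <stdlib.h>

struct type { unsigned align; int user_align; struct type *next_variant; };

static struct type *
build_aligned (struct type *main_variant, unsigned align)
{
  /* Walk the variant list looking for an exact match, including
     user_align - which the variants created below never satisfy
     when the input type has user_align == 0.  */
  for (struct type *t = main_variant; t; t = t->next_variant)
    if (t->align == align && t->user_align == main_variant->user_align)
      return t;
  struct type *t = malloc (sizeof *t);
  t->align = align;
  t->user_align = 1;   /* like TYPE_USER_ALIGN (t) = 1 */
  t->next_variant = main_variant->next_variant;
  main_variant->next_variant = t;
  return t;
}

int main (void)
{
  struct type base = { 8, 0, NULL };
  for (int i = 0; i < 36070; i++)
    build_aligned (&base, 16);   /* each call adds a variant, so the
                                    lookup degenerates to O(n^2) */
  unsigned n = 0;
  for (struct type *t = &base; t; t = t->next_variant)
    n++;
  printf ("%u variants\n", n);
  return 0;
}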

[Bug tree-optimization/99510] [11 Regression] Compile time hog in build_aligned_type since r11-7123-g63538886d1f7fc7c

2021-03-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99510

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Richard Biener  ---
Fixed.
