[Bug tree-optimization/99398] New: Miss to optimize vector permutation fed by CTOR and CTOR/CST

2021-03-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99398

Bug ID: 99398
   Summary: Miss to optimize vector permutation fed by CTOR and
CTOR/CST
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

#include "altivec.h"

vector long long foo(long long a, long long b) {
  vector long long v1 = {a, 0};
  vector long long v2 = {b, 0};
  vector unsigned char vc = {0,1,2,3,4,5,6,7, 16,17,18,19,20,21,22,23};
  vector long long vres = (vector long long)vec_perm ((vector unsigned char)v1,
(vector unsigned char)v2, vc);
  return vres;
}

Compiled with gcc -Ofast -mcpu=power9, it generates (asm on BE, btw):

mtvsrdd 32,3,9
mtvsrdd 33,4,9
lxv 34,0(10)
vperm 2,0,1,2
blr

But it can be optimized into:

mtvsrdd 34,3,4
blr

The gimple in the optimized dump looks like:

__vector long foo (long long int a, long long int b)
{
  __vector long vres;
  __vector long v2;
  __vector long v1;
  __vector unsigned char _5;
  __vector unsigned char _6;
  __vector unsigned char _7;

   [local count: 1073741824]:
  v1_2 = {a_1(D), 0};
  v2_4 = {b_3(D), 0};
  _5 = VIEW_CONVERT_EXPR<__vector unsigned char>(v1_2);
  _6 = VIEW_CONVERT_EXPR<__vector unsigned char>(v2_4);
  _7 = VEC_PERM_EXPR <_5, _6, { 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21,
22, 23 }>;
  vres_8 = VIEW_CONVERT_EXPR<__vector long>(_7);
  return vres_8;

}

But it can look like:

__vector long foo (long long int a, long long int b)
{
  vector(2) long long int _10;

   [local count: 1073741824]:
  _10 = {a_1(D), b_3(D)};
  return _10;

}
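The missed identity can be checked with a small scalar model (a sketch; perm_bytes is a hypothetical helper, not a GCC API): selecting bytes 0-7 of {a, 0} and bytes 16-23 (i.e. bytes 0-7 of {b, 0}) is just the concatenation {a, b}.

```c
#include <assert.h>
#include <string.h>

/* Scalar model of VEC_PERM_EXPR on 16-byte vectors: sel[i] < 16 picks
   from v1, otherwise from v2 (illustrative helper only).  */
static void
perm_bytes (const unsigned char *v1, const unsigned char *v2,
            const unsigned char *sel, unsigned char *out)
{
  for (int i = 0; i < 16; i++)
    out[i] = sel[i] < 16 ? v1[sel[i]] : v2[sel[i] - 16];
}

/* With the mask above, the result is the bytes of a followed by the
   bytes of b, i.e. exactly the CTOR {a, b}.  */
static int
perm_is_ctor_ab (unsigned long long a, unsigned long long b)
{
  unsigned char v1[16] = { 0 }, v2[16] = { 0 }, out[16];
  unsigned char sel[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                            16, 17, 18, 19, 20, 21, 22, 23 };
  memcpy (v1, &a, 8);  /* v1 holds the bytes of {a, 0}, element 0 first */
  memcpy (v2, &b, 8);  /* v2 holds the bytes of {b, 0} */
  perm_bytes (v1, v2, sel, out);
  return memcmp (out, &a, 8) == 0 && memcmp (out + 8, &b, 8) == 0;
}
```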

[Bug tree-optimization/99398] Miss to optimize vector permutation fed by CTOR and CTOR/CST

2021-03-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99398

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2021-03-05

--- Comment #1 from Kewen Lin  ---
I have a patch that teaches simplify_permutation in tree-ssa-forwprop.c to look
through VIEW_CONVERT_EXPR.

[Bug tree-optimization/99398] Miss to optimize vector permutation fed by CTOR and CTOR/CST

2021-03-07 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99398

--- Comment #2 from Kewen Lin  ---
Created attachment 50329
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50329&action=edit
tested patch

[Bug tree-optimization/96129] [11 regression] gcc.dg/vect/vect-alias-check.c etc. FAIL

2020-10-20 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96129

--- Comment #4 from Kewen Lin  ---
As for the regressed failures, this is highly suspected to be a duplicate of PR96376.

[Bug tree-optimization/96376] [11 regression] vect/vect-alias-check.c and vect/vect-live-5.c fail on armeb

2020-10-21 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96376

Kewen Lin  changed:

   What|Removed |Added

 CC||ro at gcc dot gnu.org

--- Comment #4 from Kewen Lin  ---
*** Bug 96129 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/96129] [11 regression] gcc.dg/vect/vect-alias-check.c etc. FAIL

2020-10-21 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96129

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #5 from Kewen Lin  ---
Tested on Cfarm gcc102; confirmed it's due to
30fdaead5b7880c4e9f140618e26ad1c545642d5, so marked as a dup.

*** This bug has been marked as a duplicate of bug 96376 ***

[Bug tree-optimization/96376] [11 regression] vect/vect-alias-check.c and vect/vect-live-5.c fail on armeb

2020-10-21 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96376

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #5 from Kewen Lin  ---
Checking the case vect-alias-check.c on the sparc machine: although the
vectorizer does the versioning for the alignment requirement and is able to
clear the misaligned flag for misaligned DRs, it bypasses the DRs *b_20(D) and
MEM[(int *)b_20(D) + 4B], since they satisfy the condition check
!vect_relevant_for_alignment_p.

   if (aligned_access_p (dr_info)
   || !vect_relevant_for_alignment_p (dr_info))
continue;

More specifically, it's because these two DRs' steps are zero in the loop (they
access b[0] and b[1] invariantly).

   /* Scatter-gather and invariant accesses continue to address individual
 scalars, so vector-level alignment is irrelevant.  */
  if (STMT_VINFO_GATHER_SCATTER_P (stmt_info)
  || integer_zerop (DR_STEP (dr_info->dr)))
return false;
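For reference, a loop of the shape being discussed (an assumed reduction of vect-alias-check.c, not the exact testcase) looks like this; the reads b[0] and b[1] do not advance with i, so their DR_STEP is zero and they hit the early return above:

```c
#include <assert.h>

/* b[0] and b[1] are loop-invariant reads: their addresses do not
   change per iteration (DR_STEP == 0), so vector-level alignment
   is irrelevant for them.  */
void
sum_b01_into_a (int *a, const int *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[0] + b[1];
}
```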

A simple fix seems to be:

  diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 2b4421b5fb4..d5f52929f89 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1186,7 +1186,7 @@ vect_update_misalignment_for_peel (dr_vec_info *dr_info,

 /* Return true if alignment is relevant for DR_INFO.  */

-static bool
+bool
 vect_relevant_for_alignment_p (dr_vec_info *dr_info)
 {
   stmt_vec_info stmt_info = dr_info->stmt;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cec5c601268..8a04f55fdb1 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2369,7 +2369,8 @@ get_load_store_type (vec_info  *vinfo, stmt_vec_info
stmt_info,
   return false;
 }

-  if (*alignment_support_scheme == dr_unaligned_unsupported)
+  if (*alignment_support_scheme == dr_unaligned_unsupported
+  && vect_relevant_for_alignment_p (STMT_VINFO_DR_INFO (stmt_info)))
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 7c6de8397b3..067222ffe39 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1935,6 +1935,7 @@ extern tree vect_get_new_ssa_name (tree, enum
vect_var_kind,
 extern tree vect_create_addr_base_for_vector_ref (vec_info *,
  stmt_vec_info, gimple_seq *,
  tree, tree = NULL_TREE);
+bool vect_relevant_for_alignment_p (dr_vec_info *dr_info);

 /* In tree-vect-loop.c.  */
 extern widest_int vect_iv_limit_for_partial_vectors (loop_vec_info
loop_vinfo);

[Bug other/97705] [11 regression] cc.c-torture/unsorted/dump-noaddr.c.*r.ira fails after r11-4637

2020-11-03 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97705

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2020-11-04

--- Comment #2 from Kewen Lin  ---
Thanks for reporting and sorry for the failure. I did run regression
testing on P8 LE, but thought it was endianness-irrelevant and didn't run it on
BE.

[Bug testsuite/97705] [11 regression] cc.c-torture/unsorted/dump-noaddr.c.*r.ira fails after r11-4637

2020-11-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97705

--- Comment #3 from Kewen Lin  ---
The "-DMASK=2" dumping has more lines for register 282, which is introduced in
ira. Something weird causes ira to dump more contexts.

$ diff dump1/dump-noaddr.c.289r.ira dump2/dump-noaddr.c.289r.ira
107a108
>   r282 costs: BASE_REGS:0 GENERAL_REGS:0 FLOAT_REGS:312 ALTIVEC_REGS:312 
> VSX_REGS:312 GEN_OR_FLOAT_REGS:312 GEN_OR_VSX_REGS:312 LINK_REGS:468 
> CTR_REGS:468 LINK_OR_CTR_REGS:468 SPEC_OR_GEN_REGS:468 ALL_REGS:1872 MEM:312
373a375
>   r282 costs: GENERAL_REGS:0 FLOAT_REGS:312 ALTIVEC_REGS:312 VSX_REGS:312 
> GEN_OR_FLOAT_REGS:312 GEN_OR_VSX_REGS:312 LINK_REGS:468 CTR_REGS:468 
> LINK_OR_CTR_REGS:468 SPEC_OR_GEN_REGS:468 ALL_REGS:1872 MEM:312

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2020-11-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 96789, which changed state.

Bug 96789 Summary: x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-11-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #37 from Kewen Lin  ---
The degradation should be gone on Power now.

[Bug testsuite/97705] [11 regression] cc.c-torture/unsorted/dump-noaddr.c.*r.ira fails after r11-4637

2020-11-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97705

--- Comment #4 from Kewen Lin  ---
I think my commit just exposed a bug in ira. The newly introduced function
remove_scratches can bump max_regno, so the data structures
regstat_n_sets_and_refs and reg_info_p, which were allocated according to the
old max_regno, become stale. When print_pseudo_costs is called to dump register
information, it uses the latest regno bound from max_reg_num (), so it can
access regnos for which no data was allocated, and the values read are random
(out-of-bounds array access).

The fix can be to free/re-init/re-compute the relevant data structures when we
know max_regno has changed, e.g. when remove_scratches succeeds.
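The bug pattern can be sketched in isolation (illustrative names only, not ira's real data structures): side data sized for an old element count, later indexed with a regno allocated after that point. The fix mirrors the suggestion above, re-sizing the side data whenever the count grows.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for per-regno side data such as
   regstat_n_sets_and_refs: an array that must track max_regno.  */
struct reg_side_data
{
  int *refs;
  int n;     /* number of regnos the array was sized for */
};

/* Grow the side data when max_regno bumps (the missing step in the
   bug described above); new slots start out zeroed.  */
static void
side_data_resize (struct reg_side_data *sd, int new_n)
{
  if (new_n <= sd->n)
    return;
  sd->refs = realloc (sd->refs, new_n * sizeof (int));
  memset (sd->refs + sd->n, 0, (new_n - sd->n) * sizeof (int));
  sd->n = new_n;
}
```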

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-11-05 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Kewen Lin  ---
Should be done with latest trunk.

[Bug gcov-profile/97594] [11 Regression] new test case gcc.dg/tree-prof/pr97461.c execution failure

2020-11-05 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97594

--- Comment #3 from Kewen Lin  ---


(In reply to Martin Liška from comment #2)
> (In reply to Martin Liška from comment #1)
> > Mine, I see a strange error:
> > 
> > $ Program received signal SIGBUS, Bus error.
> > 0x3fffb7ceddbc in __GI__IO_link_in () from /lib64/libc.so.6
> > Missing separate debuginfos, use: debuginfo-install
> > glibc-2.17-307.el7.1.ppc64le
> > (gdb) bt
> > #0  0x3fffb7ceddbc in __GI__IO_link_in () from /lib64/libc.so.6
> > #1  0x3fffb7cebe58 in _IO_new_file_init_internal () from 
> > /lib64/libc.so.6
> 
> All right, so the test-case overloads malloc and returns a memory that is a
> static buffer. For some reason, it leads to SEGBUS.
> Do Power people know what's causing that?

I was testing the patch for PR97705 and hit this issue during regression
testing; I happened to notice this PR and just realized this one is also a
random issue (how lucky I am :-)).

Checked the assembly insn causing the SIGBUS:

   0x77cc6940 <+240>:   beq 0x77cc6b30 <__GI__IO_link_in+736>
   0x77cc6944 <+244>:   li  r9,1
   0x77cc6948 <+248>:   clrldi  r10,r10,32
=> 0x77cc694c <+252>:   lwarx   r8,0,r3
   0x77cc6950 <+256>:   subf.   r8,r10,r8

r3 0x100207e6  268568550

As the Power ISA points out, the EA for lwarx must be a multiple of 4. "If it is
not, either the system alignment error handler is invoked or the results are
boundedly undefined."

So the code of __GI__IO_link_in already assumes the address there has
reasonable alignment.

The manual of malloc/calloc says:

   RETURN VALUE
 The malloc() and calloc() functions return a pointer to the 
 allocated memory, which is  suitably  aligned  for any built-in
 type.  On error, these functions return NULL.  NULL may also be
 returned by a successful call to malloc() with a size of zero,
 or by a successful call to calloc() with nmemb or size equal to
 zero.

I think that assumption is reasonable; the addresses returned from a
user-overloaded malloc/calloc should honor this alignment requirement too,
adjusting the returned address accordingly.

The below small patch can get the case to pass.

$ diff ~/gcc/gcc-git/gcc/testsuite/gcc.dg/tree-prof/pr97461.c pr97461.c
20a21,26
> /* The malloc() and calloc() functions return a pointer to the allocated
>memory, which is suitably aligned for any built-in type.  Use 16
>bytes here as the basic alignment requirement for user-defined malloc
>and calloc.  See PR97594 for the details.  */
> #define ROUND_UP_FOR_16B_ALIGNMENT(x) ((x + 15) & (-16))
>
23c29
< memory_p += size;
---
> memory_p += ROUND_UP_FOR_16B_ALIGNMENT (size);
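The same fix can be seen in a self-contained toy bump allocator (hypothetical names; the real testsuite change is the diff above): rounding each request up to 16 bytes keeps every returned pointer 16-byte aligned, matching what malloc's documented contract lets callers like glibc's _IO_link_in assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUND_UP_FOR_16B_ALIGNMENT(x) (((x) + 15) & (size_t) -16)

/* A toy user-defined malloc over a static pool, with the alignment
   fix applied: each allocation size is rounded up so the next
   returned address stays 16-byte aligned (GCC attribute syntax).  */
static unsigned char memory[1024] __attribute__ ((aligned (16)));
static size_t memory_off;

static void *
toy_malloc (size_t size)
{
  void *p = memory + memory_off;
  memory_off += ROUND_UP_FOR_16B_ALIGNMENT (size);
  return p;
}
```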

[Bug tree-optimization/97744] [11 regression] 32 bit floating point result errors after r11-4637

2020-11-06 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97744

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2020-11-07
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #3 from Kewen Lin  ---
(In reply to seurer from comment #2)
> This particular test has been problematic before.  For example, see pr83497.
> Previously the change in output was from some floating point operations
> being reordered causing a minute rounding change but it was actually OK.
> 
> By adding -fno-associative-math or removing -ffast-math it will work.  We
> are probably just going to go with that.

Thanks for the information! I'll have a look at it.

[Bug rtl-optimization/97705] [11 regression] cc.c-torture/unsorted/dump-noaddr.c.*r.ira fails after r11-4637

2020-11-08 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97705

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Kewen Lin  ---
Should be fixed with latest trunk r11-4827.

[Bug tree-optimization/97744] [11 regression] 32 bit floating point result errors after r11-4637

2020-11-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97744

--- Comment #4 from Kewen Lin  ---
The additional fre4 pass run triggers this: disabling fre4 makes it pass
(disabling dse3 alone doesn't, so dse3 is unrelated). Further narrowing down
shows fre4 on the function MG3XDEMO is responsible. Checking the differences in
the optimized dumps, some FMA computation order differences caught my eye, e.g.:

original:
   [local count: 4095816269717]:
  # D__I_lsm0.2159_1324 = PHI <_269(33), _1322(32)>
  # D_lsm0.2160_1321 = PHI <_382(33), _1313(32)>
  # D_lsm0.2162_1312 = PHI <_418(33), _1304(32)>
  # doloop.2199_1211 = PHI 
  # ivtmp.2221_1207 = PHI 
  # ivtmp.2232_1201 = PHI 
  _827 = ivtmp.2266_1109 + 24;
  _826 = (real(kind=8) *) _827;

  ...
  _293 = MEM  [(real(kind=8)[0:D.3022] *)_816 + ivtmp.2232_1201 *
1];
  _294 = _288 + _293;
  _295 = ((_294));
  _296 = _295 * 2.5e-1;
  _15 = .FMA (_263, 5.0e-1, _296);
  _815 = ivtmp.2270_1092 + 32;
  _814 = (real(kind=8) *) _815;
  ...
  _362 = _418 + D_lsm0.2162_1312;
  _363 = ((_362));
  _12 = .FMA (_363, 6.25e-2, _15);
  _370 = .FMA (_337, 1.25e-1, _12);
  _1163 = (void *) ivtmp.2221_1207;
  MEM  [(real(kind=8)[0:D.3025] *)_1163 + 16B] = _370;

vs. with culprit commit:

   [local count: 4095816269717]:
  # D__I_lsm0.2159_1324 = PHI <_269(33), _1322(32)>
  # D_lsm0.2160_1321 = PHI <_382(33), _1313(32)>
  # D_lsm0.2162_1312 = PHI <_418(33), _1304(32)>
  # doloop.2200_11 = PHI 
  # ivtmp._213 = PHI 
  # ivtmp.2233_1325 = PHI 
  _952 = ivtmp.2267_1189 + 24;
  _951 = (real(kind=8) *) _952;
  ...
  _293 = MEM  [(real(kind=8)[0:D.3022] *)_941 + ivtmp.2233_1325 *
1];
  _365 = _290 + _293;
  _294 = _365 + D__I_lsm0.2159_1324;
  _295 = ((_294));
...
  _362 = _418 + D_lsm0.2162_1312;
  _363 = ((_362));
  _364 = _363 * 6.25e-2;
  _558 = .FMA (_263, 5.0e-1, _364);
  _12 = .FMA (_295, 2.5e-1, _558);
  _370 = .FMA (_337, 1.25e-1, _12);
  _1265 = (void *) ivtmp._213;
  MEM  [(real(kind=8)[0:D.3025] *)_1265 + 16B] = _370;

So I tried disabling the widening_mul pass, which makes it pass, and then I
further narrowed it down to convert_mult_to_fma.

So the commit changes the FMA computation order and triggers the final
comparison failure.  Bisecting on particular FMA transforms points to the one
in PSINV, which is called by MG3XDEMO several times.

  _896 = _8 * _902;
  _240 = _637 + _896;

vs.

  _240 = .FMA (_8,_902, _637);

Apart from the options Seurer mentioned, which make it pass, per
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html the option
-ffp-contract=off, which disables FMA generation, also makes this
pass.

Besides, searching the bug database shows this mgrid benchmark is well known to
have too small an absolute tolerance, e.g. PR35418.
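The rounding difference behind this can be shown with a tiny example (a generic floating-point sketch, not code from mgrid): computing the multiply and the add separately rounds the product first, while .FMA keeps the full product. With a = 1 + 2^-30, the exact square is 1 + 2^-29 + 2^-60, and the 2^-60 term is lost by the separate multiply.

```c
#include <assert.h>

/* a is exactly representable; its exact square 1 + 2^-29 + 2^-60
   needs 61 significand bits, so a plain double multiply rounds it
   to 1 + 2^-29, making the residual below exactly zero.  A fused
   .FMA (a, a, -(1 + 2^-29)) would instead yield the exact residual
   2^-60 -- hence results change once FMA contraction changes the
   evaluation order.  */
static double
rounded_square_residual (void)
{
  volatile double a = 1.0 + 0x1p-30;  /* volatile blocks contraction */
  volatile double c = 1.0 + 0x1p-29;  /* rounded value of a*a */
  return (double) (a * a) - c;
}
```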

[Bug tree-optimization/97744] [11 regression] 32 bit floating point result errors after r11-4637

2020-11-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97744

--- Comment #5 from Kewen Lin  ---
Btw, this is power7 specific; I found it passes with -mcpu=power8.

[Bug tree-optimization/97744] [11 regression] 32 bit floating point result errors after r11-4637

2020-11-17 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97744

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |INVALID

--- Comment #6 from Kewen Lin  ---
With the above findings, I think the culprit commit merely exposed the fragile
tolerance issue again; closing as invalid.

[Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da

2020-12-02 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

Kewen Lin  changed:

   What|Removed |Added

 CC||rguenther at suse dot de
   Last reconfirmed||2020-12-03
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
(In reply to Ilya Leoshkevich from comment #0)
> s390's vxe/popcount-1.c began to fail after PR96789 fix.

Sorry to see this regression.

...

> 
> that is, replaced a sequence of stores with a sequence of
> BIT_INSERT_EXPRs.
> 
> slp1 now says: "missed:  not vectorized: no grouped stores in basic
> block", presumably because it doesn't understand BIT_INSERT_EXPRs.

Yes, currently an SLP instance is built from grouped stores (also CTORs); I
expect Richi's ongoing SLP rework will extend it to support grouped loads. For
this BIT_INSERT_EXPR, I guess we can extend the vector CTOR handling to cover
it when the BIT_INSERT_EXPR chain exactly and fully covers all lanes, which is
equivalent to a vector CTOR.

CC @Richi. Hi Richi, what do you think of this?

[Bug tree-optimization/98113] [11 Regression] popcnt is not vectorized on s390 since f5e18dd9c7da

2020-12-02 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98113

--- Comment #2 from Kewen Lin  ---
(In reply to Kewen Lin from comment #1)
> (In reply to Ilya Leoshkevich from comment #0)
> > s390's vxe/popcount-1.c began to fail after PR96789 fix.
> 
> Sorry to see this regression.
> 
> ...
> 
> > 
> > that is, replaced a sequence of stores with a sequence of
> > BIT_INSERT_EXPRs.
> > 
> > slp1 now says: "missed:  not vectorized: no grouped stores in basic
> > block", presumably because it doesn't understand BIT_INSERT_EXPRs.
> 
> Yes, currently slp instance is built from group stores (also CTOR), I expect
> Richi's ongoing slp rework will extend it to support group loads. For this
> BIT_INSERT_EXPR, I guess we can extend vector CTOR handling to cover this if
> the BIT_INSERT_EXPR chain can only and fully cover all lanes which is
> equivalent to vector CTOR.
> 
> CC @Richi. Hi Richi, what do you think of this?

Or another idea: teach FRE to optimize such a BIT_INSERT_EXPR chain into a
vector CTOR?
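In source terms, the equivalence being proposed can be sketched with GCC's generic vector extensions (an illustration of the idea, not the FRE implementation): a chain of lane inserts that fully covers all lanes builds the same value as a constructor.

```c
#include <assert.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* Models a BIT_INSERT_EXPR chain covering every lane.  */
static v4si
build_by_inserts (int a, int b, int c, int d)
{
  v4si v = { 0, 0, 0, 0 };
  v[0] = a;
  v[1] = b;
  v[2] = c;
  v[3] = d;
  return v;
}

/* Models the equivalent vector CTOR.  */
static v4si
build_by_ctor (int a, int b, int c, int d)
{
  return (v4si) { a, b, c, d };
}
```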

[Bug tree-optimization/98138] New: BB vect fail to SLP one case

2020-12-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

Bug ID: 98138
   Summary: BB vect fail to SLP one case
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

Test case:

  extern void test(unsigned int t[4][4]);

  void foo(unsigned char *p1, int i1, unsigned char *p2, int i2)
  {
unsigned int tmp[4][4];
unsigned int a0, a1, a2, a3;

for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
  a0 = (p1[0] - p2[0]) + ((p1[4] - p2[4]) << 16);
  a1 = (p1[1] - p2[1]) + ((p1[5] - p2[5]) << 16);
  a2 = (p1[2] - p2[2]) + ((p1[6] - p2[6]) << 16);
  a3 = (p1[3] - p2[3]) + ((p1[7] - p2[7]) << 16);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[i][0] = t0 + t2;
  tmp[i][2] = t0 - t2;
  tmp[i][1] = t1 + t3;
  tmp[i][3] = t1 - t3;
}
test(tmp);
  }

The expected code on ppc64le can look like:

  // p1 byte 0 to byte 7 
  d1_0_7 = load_dword(p1)
  // p1+i1 b0 to b7, rename it as 8 to 15   
  d1_8_15 = load_dword(p1 + i1)
  d1_16_23 = load_dword(p1 + 2*i1) 
  d1_24_31 = load_dword(p1 + 3*i1)

  V_d1_0_15 = construct_vec(d1_0_7,d1_8_15) // vector char
  V_d1_16_31 = construct_vec(d1_16_23,d1_24_31)
  V_d1_0_3_all = vperm(V_d1_0_15, V_d1_0_15, 
  {0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27})
  V_d1_4_7_all = vperm(V_d1_0_15, V_d1_0_15, 
  {4 12 20 28 5 13 21 29 6 14 22 30 7 15 23 31})

  // Do the similar for p2 with i2, get V_d2_0_3_all, V_d2_4_7_all

  // Do the subtraction together (all 4x4 bytes)
  V_sub1 = V_d1_0_3_all - V_d2_0_3_all
  V_sub2 = V_d1_4_7_all - V_d2_4_7_all

  // Do some unpack and get the promoted vector int
  V_a0_tmp = vec_promote(V_sub2, {0 1 2 3}) // vector int {b4 b12 b20 b28}
  V_a0_1 = V_a0_tmp << 16
  V_a0_0 = vec_promote(V_sub1, {0 1 2 3}).  // vector int {b0 b8 b16 b24}
  // vector int {a0_iter0, a0_iter1, a0_iter2, a0_iter3}
  V_a0 = V_a0_0 + V_a0_1

  // Get the similar for V_a1, V_a2, V_a3

  // Compute t0/t1/t2/t3
  // vector int {t0_iter0, t0_iter1, t0_iter2, t0_iter3}
  V_t0 = V_a0 + V_a1  
  V_t1 = V_a0 - V_a1
  V_t2 = V_a2 + V_a3
  V_t3 = V_a2 - V_a3

  // Compute tmps
  // vector int {tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]}
  V_tmp0 = V_t0 + V_t2
  V_tmp2 = V_t0 - V_t2
  V_tmp1 = V_t1 + V_t3
  V_tmp3 = V_t1 - V_t3

  // Final construct the {tmp[0][0], tmp[0][1], tmp[0][2], tmp[0][3]} ...
  // with six further permutation on V_tmp0/V_tmp1/V_tmp2/V_tmp3

[Bug tree-optimization/98138] BB vect fail to SLP one case

2020-12-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #1 from Kewen Lin  ---
Similar case is x264_pixel_satd_8x4 in x264
https://github.com/mirror/x264/blob/4121277b40a667665d4eea1726aefdc55d12d110/common/pixel.c#L288

[Bug tree-optimization/98138] BB vect fail to SLP one case

2020-12-06 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #3 from Kewen Lin  ---
(In reply to Richard Biener from comment #2)
> So the expected vectorization builds vectors
> 
>  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
> 
> that's not SLP, SLP tries to build the
> 
>  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
> 
> vector and "succeeds" - the SLP tree turns out to be
> highly inefficient though.  So for the stores your desire
> is to see an interleaving scheme with VF 4 (the number of
> iterations).  But interleaving fails because it would require
> a VF of 16 and there are not enough iteration in the loop.
> 
> The classical SLP scheme degenerates (also due to the plus/minus
> mixed ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3}
> expression.
> 
> Starting SLP discovery from the grouped loads would get things going
> up to the above same expression.
> 
> So not sure what's the best approach to this case.  The testcase
> can be simplified still showing the SLP discovery issue:
> 
> extern void test(unsigned int t[4][4]);
> 
> void foo(int *p1, int i1, int *p2, int i2)
> {
>   unsigned int tmp[4][4];
>   unsigned int a0, a1, a2, a3;
> 
>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>   a0 = (p1[0] - p2[0]);
>   a1 = (p1[1] - p2[1]);
>   a2 = (p1[2] - p2[2]);
>   a3 = (p1[3] - p2[3]);
> 
>   int t0 = a0 + a1;
>   int t1 = a0 - a1;
>   int t2 = a2 + a3;
>   int t3 = a2 - a3;
> 
>   tmp[i][0] = t0 + t2;
>   tmp[i][2] = t0 - t2;
>   tmp[i][1] = t1 + t3;
>   tmp[i][3] = t1 - t3;
>   }
>   test(tmp);
> }
> 
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in the current implementation we get four grouped stores:
  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently.

When all these attempts fail, could we retry on the groups
  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
with one extra intermediate layer covering the original groups, then start
discovery from these newly adjusted groups? The built operands should be
isomorphic then. Maybe too hackish?
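The regrouping suggested above is a transpose of the store groups; as a plain scalar sketch (illustrative only, not vectorizer code):

```c
#include <assert.h>

/* Original SLP groups are the rows { tmp[i][0..3] }; the proposed
   retry groups are the columns { tmp[0..3][j] }, i.e. the transposed
   access pattern.  */
static void
transpose4 (unsigned int in[4][4], unsigned int out[4][4])
{
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      out[j][i] = in[i][j];
}
```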

[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230

2020-09-23 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Kewen Lin  ---
Richard's rework r11-3393 has taken care of the failure on
gcc.target/powerpc/p9-vec-length-epil-7.c.  All failures should be gone on
latest trunk.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #18 from Kewen Lin  ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around 
> > > this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Following this idea and relaxing the loop_outer (loop_father) restriction when
setting father_bbs, I can see FRE works as expected.  But it then runs rpo_vn
from cfun's entry to its exit. If that is considered costly, we could guard it
to take effect only when all the inner loops have been unrolled; in this case
all three loops are unrolled.
Besides, when SLP happens, FRE generates the BIT_FIELD_REFs and removes array
d, but for the scalar code it takes one more dse run after cunroll to get array
d eliminated. I guess that's not costly? Can whether one pass runs be
controlled from another pass? A global variable plus one extra parameter in
passes.def seems weird. If it is costly, we could instead factor out one
routine to be called directly rather than running a whole pass, like do_rpo_vn?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-25 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #19 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #17)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #15 from Kewen Lin  ---
> > (In reply to rguent...@suse.de from comment #14)
> > > On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > > > 
> > > > --- Comment #13 from Kewen Lin  ---
> > > > >   2) on Power, the conversion from unsigned char to unsigned short is 
> > > > > nop
> > > > > conversion, when we counting scalar cost, it's counted, then add 
> > > > > costs 32
> > > > > totally onto scalar cost. Meanwhile, the conversion from unsigned 
> > > > > short to
> > > > > signed short should be counted but it's not (need to check why 
> > > > > further). 
> > > > 
> > > > UH to SH conversion is true when calling vect_nop_conversion_p, so it's 
> > > > not
> > > > even put into the cost vector. 
> > > > 
> > > > tree_nop_conversion_p's comments saying:
> > > > 
> > > > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> > > >no instruction.  */
> > > > 
> > > > I may miss something here, but UH to SH conversion does need one 
> > > > explicit
> > > > extend instruction *extsh*, the precision/mode equality check looks 
> > > > wrong for
> > > > this conversion.
> > > 
> > > Well, it isn't a RTL predicate and it only needs extension because
> > > there's never a HImode pseudo but always SImode subregs.
> > 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do 
> > the
> > bb_info slp_instances revisits in finish_cost.
> 
> I think it's not something we should handle on GIMPLE.

Got it! For

  else if (vect_nop_conversion_p (stmt_info))
    continue;

would it be a good idea to change it to call record_stmt_cost like the others?
  1) introduce one vect_cost_for_stmt enum value, e.g. nop_stmt;
  2) builtin_vectorization_cost returns zero for it by default, as before;
  3) targets can adjust the cost according to their needs.
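The extsh point can be seen at the source level (a scalar illustration, assuming two's-complement conversion behavior): the UH -> SH conversion leaves the 16-bit pattern unchanged, but as soon as the value is widened for use as int, the signed copy must be sign-extended, which is a real instruction on Power.

```c
#include <assert.h>

/* Same 16-bit pattern, different widening behavior: widening the
   signed short needs a sign extension (extsh on Power), so the
   "nop" conversion is not free once the value is used as int.  */
static int
widen_as_signed (unsigned short x)
{
  short s = (short) x;  /* bit pattern unchanged */
  return s;             /* ... but this widening sign-extends */
}

static int
widen_as_unsigned (unsigned short x)
{
  return x;             /* zero-extends */
}
```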

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #25 from Kewen Lin  ---
> > 
> > Got it! For 
> > 
> >   else if (vect_nop_conversion_p (stmt_info))
> > continue;
> > 
> > Is it a good idea to change it to call record_stmt_cost like the others? 
> >   1) introduce one vect_cost_for_stmt enum type eg: nop_stmt
> >   2) builtin_vectorization_cost return zero for it by default as before.
> >   3) targets can adjust the cost according to its need
> 
> I think this early-out was added for the case where there was no cost but
> the target costed it.  So at least go back and look what target that was
> and see if it can be adjusted.

OK, thanks. The change is from commit r10-4592. I think most of the objects it
handles cost nothing on all targets, like VIEW_CONVERT_EXPR, UL/SL
bi-directional conversions etc., so it's good to differentiate them from
scalar_stmt, but the "continue" early-out makes it hard for targets to tweak
the cost of some tree_nop_conversion_p stmts.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #26 from Kewen Lin  ---
> > By following this idea, to release the restriction on loop_outer
> > (loop_father) when setting the father_bbs, I can see FRE works as
> > expectedly.  But it actually does the rpo_vn from cfun's entry to its exit.
> 
> Yeah, that's the reason we do not do it.  We could possibly restrict it
> to a containing loop, or if the containing loop is the whole function,
> restrict it to the original preheader block to the loop exits (which are
> no longer there, we'd need to pre-record those I think)

Thanks for the suggestion! 

I tried the idea of restricting it to run from the original preheader block to
the loop exits (pre-recording both as you said), but that still can't get the
array d eliminated; unfortunately this case requires VN to run across the
boundary between the original loops.  Now I ended up running whole-function VN
once if there aren't any loops left after unrolling.  I guess if there are no
loops, the CFG should be simple most of the time, and then not so costly?

> > Besides, when SLP happens, FRE gen the bit_field_ref and remove array d, but
> > for scalar codes it needs one more time dse run after cunroll to get array d
> > eliminated. I guess it's not costly? Can one pass be run or not controlled
> > by something in another pass? via global variable and add one parameter in
> > passes.def seems weird. If it's costly, probably we can go by factoring out
> > one routine to be called instead of running a pass, like do_rpo_vn?
> 
> No, we don't have a good way to schedule passes from other passes.  And yes,
> the way forward is to support key transforms on regions.  Oh, and every
> pass that does memory alias analysis (DSE, DCE, VN, etc.) is costly.
> 

OK, I'll have a look at DSE and try to make it support operating on regions,
although it may not help this case since it needs to operate across the loop
boundary.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #27 from Kewen Lin  ---
(In reply to Hongtao.liu from comment #22)
> > One of my workmates found that if we disable vectorization for SPEC2017
> > 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> > explicit function attribute
> > __attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.
> 
> For CLX, if we disable slp vectorization in sub4x4_dct by 
> __attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up by
> 4%.
> 
> > Thanks Richi! Should we take care of this case? or neglect this kind of
> > extension as "no instruction"? I was intent to handle it in target specific
> > code, but it isn't recorded into cost vector while it seems too heavy to do
> > the bb_info slp_instances revisits in finish_cost.
> 
> For i386 backend unsigned char --> unsigned short is no "no instruction",

Thanks for the information, it means it's target specific.

> but in this case
> ---
> 1033  _134 = MEM[(pixel *)pix1_295 + 2B];   
> 
> 1034  _135 = (short unsigned int) _134;
> ---
> 
> It could be combined and optimized to 
> ---
> movzbl  19(%rcx), %r8d
> ---
> 
> So, if "unsigned char" variable is loaded from memory, then the convertion
> would also be "no instruction", i'm not sure if backend cost model could
> handle such situation.

Probably you can try to tweak it in ix86_add_stmt_cost: when the statement is a
UB-to-UH conversion, further check whether the def of the input UB is a MEM.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #29 from Kewen Lin  ---
(In reply to Hongtao.liu from comment #28)
> > Probably you can try to tweak it in ix86_add_stmt_cost? when the statement
> 
> Yes, it's the place.
> 
> > is UB to UH conversion statement, further check if the def of the input UB
> > is MEM.
> 
> Only if there's no multi-use for UB. More generally, it's quite difficult to
> guess later optimizations for the purpose of more accurate vectorization
> cost model, :(.

Yeah, it's hard sadly. The generic cost modeling is rough; ix86_add_stmt_cost
is more fine-grained (at least more than what we have on Power :)). If you want
to check it further, it seems doable in the target-specific hook finish_cost
where you can get the whole vinfo object, but it could end up with very heavy
analysis and might not be worthwhile.

Would you mind checking whether running FRE and DSE just after cunroll also
fixes this degradation on x86? It worked for Power, so I hope it can help there
too.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #32 from Kewen Lin  ---
(In reply to Richard Biener from comment #31)
> (In reply to Kewen Lin from comment #29)
> > (In reply to Hongtao.liu from comment #28)
> > > > Probably you can try to tweak it in ix86_add_stmt_cost? when the 
> > > > statement
> > > 
> > > Yes, it's the place.
> > > 
> > > > is UB to UH conversion statement, further check if the def of the input 
> > > > UB
> > > > is MEM.
> > > 
> > > Only if there's no multi-use for UB. More generally, it's quite difficult 
> > > to
> > > guess later optimizations for the purpose of more accurate vectorization
> > > cost model, :(.
> > 
> > Yeah, it's hard sadly. The generic cost modeling is rough,
> > ix86_add_stmt_cost is more fine-grain (at least than what we have on Power
> > :)), if you want to check it more, it seems doable in target specific hook
> > finish_cost where you can get the whole vinfo object, but it could end up
> > with very heavy analysis and might not be worthy.
> > 
> > Do you mind to check if it can also fix this degradation on x86 to run FRE
> > and DSE just after cunroll? I found it worked for Power, hoped it can help
> > there too.
> 
> Btw, we could try sth like adding a TODO_force_next_scalar_cleanup to be
> returned from passes that see cleanup opportunities and have the pass
> manager queue that up, looking for a special marked pass and enabling
> that so we could have
> 
>   NEXT_PASS (pass_predcom);
>   NEXT_PASS (pass_complete_unroll);
>   NEXT_PASS (pass_scalar_cleanup);
>   PUSH_INSERT_PASSES_WITHIN (pass_scalar_cleanup);
> NEXT_PASS (pass_fre, false /* may_iterate */);
> NEXT_PASS (pass_dse);
>   POP_INSERT_PASSES ();
> 
> with pass_scalar_cleanup gate() returning false otherwise.  Eventually
> pass properties would match this better, or sth else.
> 

Thanks for the suggestion! Before cooking the patch, I have one question: it
looks like updating a function property would be enough, e.g. some pass sets
the property PROP_ok_for_cleanup and later pass_scalar_cleanup only runs for
functions with this property (checked in its gate). I'm not quite sure about
the reason for the TODO flag TODO_force_next_scalar_cleanup.

[Bug other/98437] New: confusing wording in the description of option -fsanitize=address

2020-12-24 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98437

Bug ID: 98437
   Summary: confusing wording in the description of option
-fsanitize=address
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

As raised in Qingnan's question[1] on the gcc-help mailing list, the last part
of the current description of option -fsanitize=address looks confusing.
Manual readers can easily misunderstand the wording below

   "Note that the only target this option is currently supported on is
AArch64." 

as being specific to option -fsanitize=address. But it is actually intended to
describe option -fsanitize=hwaddress, which cannot be combined with option
-fsanitize=address.

Both the hook TARGET_MEMTAG_CAN_TAG_ADDRESSES and commit r11-5375 show that
only option -fsanitize=hwaddress is AArch64-specific.

[1] https://gcc.gnu.org/pipermail/gcc-help/2020-December/139745.html

[Bug tree-optimization/98464] [11 Regression] ICE: tree check: expected class 'type', have 'exceptional' (error_mark) in tree_nop_conversion_p, at tree.c:12825 by r11-4637

2020-12-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #2 from Kewen Lin  ---
Thanks for reporting and bisecting, I'll investigate this first.

[Bug tree-optimization/98464] [11 Regression] ICE: tree check: expected class 'type', have 'exceptional' (error_mark) in tree_nop_conversion_p, at tree.c:12825 by r11-4637

2020-12-29 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 CC||rguenth at gcc dot gnu.org

--- Comment #3 from Kewen Lin  ---
My commit exposed one issue in FRE. The ICE occurs because one SSA_NAME used in
the loop's nb_iterations is actually in the free list.

(unsigned long) _54 - (unsigned long) v$ptr__25

arg:0 
nothrow
def_stmt
version:54 in-free-list>>

I noticed that FRE already has some code to handle the loop's nb_iterations,
but there is some inconsistency which makes FRE unable to replace the one in
the loop's nb_iterations while later replacing all other uses of _54.

When FRE is processing BB 19, it's able to replace _54 with _53 like:

   [local count: 789698041]:
  ...
  if (_26 != v$D4172$_M_impl$D4082$_M_finish$ptr__54)
goto ; [89.00%]

=>

   [local count: 789698041]:
  ...
  if (_26 != __new_finish$ptr__53)
goto ; [89.00%]

since it calls eliminate_stmt which works with def_bb instead of current bb:

sprime = eliminate_avail (gimple_bb (SSA_NAME_DEF_STMT (use)), use)

while rpo_vn_valueize only works with vn_context_bb.

For this particular case, _54's def_bb is BB 13 and _53's def_bb is BB 11,
which is fine for the dominated_by_p_w_unex check. But vn_context_bb is BB 19,
for which the dominated_by_p_w_unex check returns false.

[Bug tree-optimization/98464] [11 Regression] ICE: tree check: expected class 'type', have 'exceptional' (error_mark) in tree_nop_conversion_p, at tree.c:12825 by r11-4637

2021-01-03 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464

Kewen Lin  changed:

   What|Removed |Added

   Assignee|linkw at gcc dot gnu.org   |rguenth at gcc dot 
gnu.org

--- Comment #4 from Kewen Lin  ---
Posted one patch here[1], Richard pointed out it's incorrect and said he would
have a look later[2].

Hi Richi,

Passed this to you, thanks for your time in advance!

[1] https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562596.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562608.html

[Bug tree-optimization/98464] [11 Regression] ICE: tree check: expected class 'type', have 'exceptional' (error_mark) in tree_nop_conversion_p, at tree.c:12825 by r11-4637

2021-01-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464

--- Comment #8 from Kewen Lin  ---
(In reply to Richard Biener from comment #5)
> But this
> 
> sprime = eliminate_avail (gimple_bb (SSA_NAME_DEF_STMT (use)), use);
> 
> should make it more conservative (compared to the more desirable use of
> gimple_bb (stmt) aka 'b').
> 
> The issue is really that dominated_by_p_w_unex is not "transitive" with
> respect to an intermediate immediate dominator chain.
> 
> We can make that more consistent at least by doing sth like your patch but
> only in vn_valueize_wrapper.  

OK, just noticed that wrapper is only for simplify_replace_tree. Thanks for
fixing.  But I still don't get why it's buggy for rpo_vn_valueize.

[Bug tree-optimization/98464] [11 Regression] ICE: tree check: expected class 'type', have 'exceptional' (error_mark) in tree_nop_conversion_p, at tree.c:12825 by r11-4637

2021-01-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464

--- Comment #10 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #9)
> On Mon, 4 Jan 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98464
> > 
> > --- Comment #8 from Kewen Lin  ---
> > (In reply to Richard Biener from comment #5)
> > > But this
> > > 
> > > sprime = eliminate_avail (gimple_bb (SSA_NAME_DEF_STMT (use)), 
> > > use);
> > > 
> > > should make it more conservative (compared to the more desirable use of
> > > gimple_bb (stmt) aka 'b').
> > > 
> > > The issue is really that dominated_by_p_w_unex is not "transitive" with
> > > respect to an intermediate immediate dominator chain.
> > > 
> > > We can make that more consistent at least by doing sth like your patch but
> > > only in vn_valueize_wrapper.  
> > 
> > OK, just noticed that wrapper is only for simplify_replace_tree. Thanks for
> > fixing.  But I still don't get why it's buggy for rpo_vn_valueize.
> 
> Because rpo_vn_valueize argument can be a value handle SSA which does
> not need to be a definition on the path to the stmt we are simplifying.
> rpo_vn_valueize is called from (too) many contexts and I'm not 100%
> convinced we're only feeding it names available at the location we are
> processing.
> 

Thanks for the explanation! I'll keep this kind of possibility in mind.

> Note I think we should use 'b' and not SSA_NAME_DEF_STMT of the use but
> then this code is as-is since the rewrite and I don't feel tracking down
> fallout of changing this back now (but I'll try to remember for stage1,
> as well as somehow better dealing with dominated_by_w_unex)

Yeah, I totally agree that improving dominated_by_p_w_unex is better. The
existing checking in dominated_by_p_w_unex is limited; I hacked on it by
recursing one more argument depth with special handling of back edges before I
went in the SSA_NAME_DEF_STMT direction, but I thought it could enlarge the
scope FRE applies to and would definitely impact compilation time, which
doesn't look like a good direction, especially now in stage3.

[Bug c/89126] missing -Wtype-limits for int variables

2021-01-04 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89126

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #4 from Kewen Lin  ---
(In reply to Martin Sebor from comment #3)

> Eventually, after the function returns the inequality expression without a
> warning, c_parser_condition() calls c_fully_fold() on it which folds it into
> a constant, without a warning.

I happened to notice this PR when trying to answer the question in [1].  The
part doing the folding comes from match.pd.

/* Non-equality compare simplifications from fold_binary  */
(for cmp (lt gt le ge)
 /* Comparisons with the highest or lowest possible integer of
the specified precision will have known values.  */
 (simplify
  (cmp (convert?@2 @0) uniform_integer_cst_p@1)
...
 (if (wi::to_wide (cst) == max)
  (switch
   (if (cmp == GT_EXPR)
{ constant_boolean_node (false, type); })
   (if (cmp == GE_EXPR)
(eq @2 @1))
   (if (cmp == LE_EXPR)
{ constant_boolean_node (true, type); })

I noticed there is some warning support in match.pd, like
fold_overflow_warning. I'm not sure whether we can add similar support there.
Probably hard to get the location?

[1] https://gcc.gnu.org/pipermail/gcc-help/2021-January/139755.html

[Bug tree-optimization/98138] BB vect fail to SLP one case

2021-01-05 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #4 from Kewen Lin  ---
(In reply to Kewen Lin from comment #3)
> 
> IIUC, in current implementation, we get four grouped stores:
>   { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently 
> 
> When all these tryings fail, could we do some re-try on the groups
>   { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
> with one extra intermediate layer covering those original groups, then start
> from these newly adjusted groups? the built operands should isomorphic then.
> May be too hackish?

This idea seems impractical in the current framework. I was thinking about
whether we can use another vectorization scheme which is sub-optimal but fits
the current framework better. It only looks at everything within one loop
iteration and starts from the grouped stores { tmp[i][0], tmp[i][1],
tmp[i][2], tmp[i][3] }. It can be:

  w1_0123 = load_word(p1)
  V1_0123 = scalar_to_vec(w1_0123) // also with promotion
  w1_4567 = load_word(p1+4)
  V1_4567 = scalar_to_vec(w1_4567)

  w2_0123 = load_word(p2)
  V2_0123 = scalar_to_vec(w2_0123)
  w2_4567 = load_word(p2+4)
  V2_4567 = scalar_to_vec(w2_4567)

  V_a0123 = (V1_0123-V2_0123) + (V1_4567-V2_4567) << 16

  V1 = V_a0123  // {a0, a1, a2, a3}
  V2 = vperm (V1, <1, 0, 3, 2>) // {a1, a0, a3, a2}

  V3 = V1 - V2  // {t1, -t1, t3, -t3}
  V4 = V1 + V2  // {t0,  t0, t2,  t2}

  V5 = vperm (V3, V4, <4, 0, 6, 2>) // {t0, t1, t2, t3}
  V6 = vperm (V3, V4, <6, 2, 4, 0>) // {t2, t3, t0, t1}

  V7 = V5 - V6 // {t0 - t2, t1 - t3, t2 - t0, t3 - t1}
  V8 = V5 + V6 // {t0 + t2, t1 + t3, t2 + t0, t3 + t1}

  V9 = vperm (V7, V8, <4, 5, 0, 1>)

  vec_store (tmp[i], V9)

One simplified case can be:

extern void test(unsigned int t[4]);

void foo(int *p1, int i1, int *p2, int i2) {
  unsigned int tmp[4];
  unsigned int a0, a1, a2, a3;

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;

  tmp[0] = t0 + t2;
  tmp[2] = t0 - t2;
  tmp[1] = t1 + t3;
  tmp[3] = t1 - t3;

  test(tmp);
}

Currently we still can't vectorize this simple case. SLP doesn't go deep
enough to expose the grouped-load chance on the bottom nodes loading p1/p2. It
ends up early with the nodes { _13, _14, _13, _14 } and { _15, _16, _15, _16 }
because operands like {a0_28, a0_28, a0_28, a0_28} etc. satisfy the
all_uniform_p condition, so it goes "Building parent vector operands from
scalars instead".

One rough idea:
  1) Relax the all_uniform_p condition somehow to get the SLP instance build to
go deeper and get those p1/p2 loads as SLP nodes.
  2) Introduce one more vect_pattern recognizer to catch this kind of pattern
and transform the SLP instance as we expect. I assume we can see the whole SLP
instance, so we can transform it as we want here. We probably need some costing
condition to gate this pattern matching.
  3) If 2) fails, trim the SLP instance from those nodes which satisfy the
all_uniform_p condition to ensure it's the same as before.

Does it sound reasonable?

[Bug tree-optimization/98138] BB vect fail to SLP one case

2021-01-05 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #5 from Kewen Lin  ---
(In reply to Kewen Lin from comment #4)
> One rough idea seems:
>   1) Relax this condition all_uniform_p somehow to get SLP instance building
> to go deeper and get those p1/p2 loads as SLP nodes.
>   2) Introduce one more vect_pattern recognizer to catch this kind of
> pattern, transform the slp instance as we expect. I assume we can know the
> whole slp instance then we can transform it as we want here. Probably need
> some costing condition to gate this pattern matching.
>   3) If 2) fail, trim the slp instance from those nodes which satisfy
> all_uniform_p condition to ensure it's same as before.
> 

For 2), instead of vect_pattern with IFN, the appropriate place seems to be
vect_optimize_slp.

But after more thinking, building SLP instance starting from group loads
instead of group stores looks more straightforward. 

  a0 = (p1[0] - p2[0]);
  a1 = (p1[1] - p2[1]);
  a2 = (p1[2] - p2[2]);
  a3 = (p1[3] - p2[3]);

Building the vector  first looks more natural; then check the uses of all its
lanes and special patterns to build the vector  and
repeat similarly.

Hi Richi,

Is this a good example to justify building SLP instances starting from group loads?

[Bug tree-optimization/98138] BB vect fail to SLP one case

2021-01-11 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #7 from Kewen Lin  ---
(In reply to Richard Biener from comment #6)
> Starting from the loads is not how SLP discovery works so there will be
> zero re-use of code.  Sure - the only important thing is you end up
> with a valid SLP graph.
> 
> But going back to the original testcase and the proposed vectorization
> for power - is that faster in the end?
> 

Good question. I coded foo2 and foo3 with altivec built-in functions following
the proposals: foo2 is for the original version (#c0) and foo3 is for the
sub-optimal version (#c4). They have been verified with some random inputs to
ensure correct functionality.

I evaluated the timings on Power8/Power9/Power10 (pre-release) with option
-Ofast and the corresponding -mcpu=power{8,9,10} for the build. Using foo1 (the
scalar version) as the baseline of 100 and normalizing against it, the
speed-ups are below (bigger is better):

         foo1   foo2     foo3
    P8   100    163.26   164.97
    P9   100    101.15   101.60
    P10  100    172.42   159.10

The evaluation shows the vectorized versions give big gains on Power8 and
Power10, but the gain is trivial on Power9.  I tried running the P8 executable
on P9; there was no remarkable difference. There are some known issues on P9
for fp/vect-intensive workloads; it's probably due to that, since the
vectorized versions are full of vector code. Sigh, we might have to use
different costs for P9, but that's another story. Btw, I tried running the P9
executable on P10, and it also gains about as much as the P10 row shows. So
it's profitable to vectorize this, as the P8 and P10 testing shows.

> For the "rewrite" of the vectorizer into all-SLP we do have to address
> that "interleaving scheme not carried out as interleaving" at some point,
> but that's usually for loop vectorization - for BB vectorization all
> we have is optimize_slp.  I have patches that would build the vector load
> SLP node (you still have to kill that 'build from scalars' thing to make
> it trigger ).  But then we end up with a shared vector load node and
> N extract/splat operations at the 'scalar' points.  It's not entirely
> clear to me how to re-arrange the SLP graph at that point.
> 

Great, it's good to know we can end up with a shared vector load node. It looks
like we can further simplify the graph on top of it. For the #c4 simplified
case, assuming we have two shared vector load nodes
  N1 = 
  N2 = 

1) If all lanes of one shared vector load node only work with all lanes of
another shared vector load node, we can move up (postpone) the extract/splat
operations.  For this case, it needs 4 extract/splats for p1's lanes, 4
extract/splats for p2's lanes, and 4 minuses for {T,T,T,T} /T=a{0,1,2,3}/; it
can be simplified to one shared node N3 = N1 - N2 (1 minus) and 4 extract/splats.

2) And then we have the shared vector node ; before it is
extracted/splatted to {T,T,T,T} /T=a{0,1,2,3}/, we can check all the uses of
the extracted/splatted lanes, try to detect some patterns and simplify
further. For this case, it can be permuted to , then do the minus/plus and the
permutations to get the nodes N4 <_13, _14, _13, _14> and N5 <_15, _16, _15,
_16>, saving the extract/splats.

3) Further, we can detect the pattern on the SLP graph and skip generating N4
and N5, instead generating <_13, _14, _15, _16> and repeating the minus/plus
and permutations for the required t1/t2/t3/t4 computation.

> Btw, on current trunk the simplified testcase no longer runs into the
> 'scalar operand' build case but of course vectorization is thought to be
> not profitable.  pattern recog of the plus/minus subgraphs may help
> (not sure if ppc has those as instruction, x86 has).

You meant the simplified case in #c4? I rebased onto the latest trunk
(r11-6601) but still hit the 'scalar operand' build; unfortunately ppc doesn't
support that kind of instruction, sigh.

> 
> That said, "failure" to identify the common (vector) load is known
> and I do have experimental patches trying to address that but did
> not yet arrive at a conclusive "best" approach.

Thanks for the information!

[Bug tree-optimization/98138] BB vect fail to SLP one case

2021-01-11 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #8 from Kewen Lin  ---
Created attachment 49942
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49942&action=edit
vectorized with altivec built-in functions

[Bug rtl-optimization/100328] New: IRA doesn't model dup num constraint well

2021-04-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

Bug ID: 100328
   Summary: IRA doesn't model dup num constraint well
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

source: function LBM_performStreamCollideTRT in SPEC2017 519.lbm_r

This issue was exposed by O2 vectorization enablement evaluation on 519.lbm_r.

baseline option: -O2 -mcpu=power9 -ffast-math

test option: -O2 -mcpu=power9 -ffast-math -ftree-vectorize
 -fvect-cost-model=very-cheap

The score with the test options degrades by 1.66% against the baseline (1.74%
without the very-cheap cost model).

The hotspot LBM_performStreamCollideTRT isn't vectorized at all, but the
pre-pass of vectorization, if-conversion, exposes the issue. If-conversion
uses the newly copied loop as the scalar version after loop versioning, so
once vectorization fails, we end up with a loop which differs slightly from
before.

The difference mainly comes from: 

1) Different basic block placement. For this function, the fall-through BB and
branch BB are switched. The reason is that the newly copied loop's BBs are
adjusted into dom_order, while the idom insertion order changes when the idom
is set during copying. Anyway, it's acceptable.

2) SSA name differences. The newly copied loop can reuse some discarded SSA
names, and the canonicalization of commutative gimple operands changes some
orders.

I did a hack to filter out the fall-through/branch BB difference; the gap
becomes smaller but some remains. The remaining differences in the gimple are
some operand orders as mentioned above; the differences in the assembly are
some different insn choices, mainly on FMA-style insns. One remarkable
difference is the number of register copies:

  fmr + xxlor: 16 (baseline) vs 21 (test, respecting fall through)

In this function, there are many FMA-style expressions (27 FMA, 19 FMS, 11
FNMA). Their VSX_REG-version insns are destructive, and the define_insns look
like:

(define_insn "*nfma4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
(neg:SFDF
 (fma:SFDF
  (match_operand:SFDF 1 "gpc_reg_operand" ",wa,wa")
  (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
  (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa"]
  "TARGET_HARD_FLOAT"
  "@
   fnmadd %0,%1,%2,%3
   xsnmaddap %x0,%x1,%x2
   xsnmaddmp %x0,%x1,%x3"
  [(set_attr "type" "fp")
   (set_attr "isa" "*,,")])

(define_insn "*fms4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
(fma:SFDF
 (match_operand:SFDF 1 "gpc_reg_operand" ",wa,wa")
 (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
 (neg:SFDF (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa"]
...

(define_insn "*fma4_fpr"
  [(set (match_operand:SFDF 0 "gpc_reg_operand" "=,wa,wa")
(fma:SFDF
  (match_operand:SFDF 1 "gpc_reg_operand" "%,wa,wa")
  (match_operand:SFDF 2 "gpc_reg_operand" ",wa,0")
  (match_operand:SFDF 3 "gpc_reg_operand" ",0,wa")))]
...


Since the 1st alternative uses class FLOAT_REG, which is a subset of VSX_REG
(there are 64 VSX registers in total, and the FP registers share the first
32), in most cases the preferred rclass for these insns is VSX_REG. Assume we
have the expression:

  FMA A,B,C,D

If these four registers are all different, it cannot satisfy the alternatives
with a duplicate-number (matching) constraint. If it then has to use the
remaining (1st) alternative, and at the same time one of the operands isn't in
the low 32 VSX registers (so can't serve as an FP register), we have to
generate a register copy from the VSX register (high-numbered) to an FP
register (low-numbered).

How does the commutative operand order affect this?

IRA tries to create copies for register coalescing. For the FMA expression
above, assuming both B and C are dead at the current insn, it will have copies
A/B and A/C; later, when it forms threads, if A/B and A/C have the same freq,
the lower copy number comes first. This means the operand order can affect how
we form the thread; a different pulled-in allocno will probably produce a
different conflict set, which further affects global thread forming and the
final assignment.

But I think the root cause is that when we create copies for these FMA-style
insns, IRA doesn't fully consider the duplicate-number constraint. For
example, for FMS, if operands 1, 2 and 3 are dead, both 2 and 3 should take
higher priority in the copy queue.

I noticed there is one function, ira_get_dup_out_num, which is meant to create
this kind of copy, but the code below looks like it refuses to create one if
there is an alternative with a valid regclass and no need to spill.

  default:
{
  enum constraint_num cn = lookup_constraint (str);
  enum reg_class cl =

[Bug rtl-optimization/100328] IRA doesn't model dup num constraint well

2021-04-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

--- Comment #1 from Kewen Lin  ---
Created attachment 50715
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50715&action=edit
ira:consider matching cstr in all alternatives

With my limited understanding of IRA, I am not quite sure this patch goes in a
reasonable direction. It aims to check the matching constraint in all
alternatives: if there is an alternative with a matching constraint that also
matches the current preferred regclass, it records the output operand number
and further creates a copy for it. Normally this copy gets priority over
shuffle copies, and the matching constraint then gets satisfied with higher
probability, so reload doesn't have to create extra copies to meet the
matching constraint or the desirable register class.

For FMA A,B,C,D, I think ideally the copies A/B, A/C, A/D can first stay as
shuffle copies. Later, when any of A,B,C,D gets assigned a hardware register
which is a VSX register but not an FP register, we know we have to pay costs
once we can NOT go with the VSX alternatives, so at that point we can increase
the freq for the remaining related copies; once the matching constraint gets
satisfied, there aren't any extra costs to pay. This idea seems a bit
complicated in the current framework, so the proposed patch aggressively
emphasizes the matching constraint at the time copies are created.

FWIW, bootstrapped/regtested on powerpc64le-linux-gnu P9. The evaluation with
a full SPEC2017 run on Power9 shows 505.mcf_r +2.98%, 508.namd_r +3.37%,
519.lbm_r +2.51%; no remarkable degradation is observed.

[Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails

2021-05-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

Bug ID: 100794
   Summary: suboptimal code due to missing pre2 when vectorization
fails
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9; the
baseline is -O2 -mcpu=power9 -ffast-math while the test line is -O2
-mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.

One reduced C test case is as below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
for (int i = i_0; i < i_n; i++)
  {
double cff = 0.5 * diff2[j][i];
double cff1 = MIN (dZdx[k1][j][i], 0.0);
double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
double cff3 = MAX (dZdx[k2][j][i], 0.0);
double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);

FS[k2][j][i]
  = cff
* (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
   + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
   + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
   + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
  }
}

O2 fast:

   [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI 
  # ivtmp.33_102 = PHI 
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR ;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR ;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


O2 fast vect (very-cheap)
   [local count: 955630225]:
  # doloop.9_55 = PHI 
  # ivtmp.37_102 = PHI 
  # ivtmp.38_92 = PHI 
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];   // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load 
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


I found the root cause: in the baseline version, PRE manages to reuse some load
results from previous iterations, which saves some loads; while in the test
line version, with the check below:

  /* Inhibit the use of an inserted PHI on a loop header when
 the address of the memory reference is a simple induction
 variable.  In other cases the vectorizer won't do anything
 anyway (either it's loop invariant or a complicated
 expression).  */
  if (sprime
  && TREE_CODE (sprime) == SSA_NAME
  && do_pre
  && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

PRE doesn't optimize it, to avoid introducing a loop-carried dependence. That
makes sense. But unfortunately the expected downstream loop vectorization isn't
performed on the given loop, since the "very-cheap" cost model doesn't allow
the vectorizer to peel for niters. There seems to be no later pass trying to
optimize it, so it eventually results in sub-optimal code.

Rerunning PRE once after loop vectorization did fix the degradation, but I'm
not sure it's practical, since iterating PRE seems quite time-consuming. Or
should we tag this kind of loop and later run PRE just on the tagged ones? It
also seems impractical to predict whether a loop can be loop-vectorized later.
I'm also not sure whether some existing passes can be taught to handle this.

[Bug tree-optimization/99398] Miss to optimize vector permutation fed by CTOR and CTOR/CST

2021-05-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99398

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Kewen Lin  ---
Fixed.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #2 from Kewen Lin  ---
(In reply to Richard Biener from comment #1)

Thanks for the comments!

> There's predictive commoning which can do similar transforms and runs after
> vectorization.  It might be it doesn't handle these "simple" cases or that
> loop dependence info is not up to the task there.
> 

pcom does fix this problem, but it's only enabled by default at -O3. Could it
be considered for -O2? Or enabled at -O2 under some conditions, such as: only
for a loop where PRE skipped the loop-carried optimization and which isn't
vectorized afterwards?

> Another option is to avoid the PRE guard with the (very) cheap cost model
> at the expense of not vectorizing affected loops.
> 

OK, I will benchmark this to see its impact. The particular loops in
554.roms_r can be vectorized with the cheap cost model, and this benchmark got
improved (a bit) with the cheap cost model on both Power8 and Power9, so I
will just test the impact with the very-cheap cost model.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #4 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #3)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > 
> > --- Comment #2 from Kewen Lin  ---
> > (In reply to Richard Biener from comment #1)
> > 
> > Thanks for the comments!
> > 
> > > There's predictive commoning which can do similar transforms and runs 
> > > after
> > > vectorization.  It might be it doesn't handle these "simple" cases or that
> > > loop dependence info is not up to the task there.
> > > 
> > 
> > pcom does fix this problem, but it's enabled by default at -O3. Could it be
> > considered to be run at O2? Or enabled at O2 at some conditions such as: 
> > only
> > for one loop which skips loop carried optimization and isn't vectorized
> > further?
> 
> I think pcom should be enabled when vectorization is due to the 
> interaction with PRE.  It could be tamed down (it can do peeling/unrolling 
> which is why it is -O3) based on the vectorizer cost model active
> if only implicitely enabled ...   Things will get a bit messy I guess.
> 

Good point! I prefer this idea to the one guarding on the cost model in the sccvn code.

> > > Another option is to avoid the PRE guard with the (very) cheap cost model
> > > at the expense of not vectorizing affected loops.
> > > 
> > 
> > OK, I will benchmark this to see its impact. For the particular loops in
> > 554.roms_r, they can be vectorized at cheap cost model, this bmk got 
> > improved
> > at cheap cost model on both Power8 and Power9 (a bit though). So I will just
> > test the impact on very cheap cost model.
> 
> So another thing to benchmark would be to enable pcom but make sure
> 
>   /* Determine the unroll factor, and if the loop should be unrolled, 
> ensure
>  that its number of iterations is divisible by the factor.  */
>   unroll_factor = determine_unroll_factor (chains);
>   scev_reset ();
>   unroll = (unroll_factor > 1
> && can_unroll_loop_p (loop, unroll_factor, &desc));
> 
> is false for the cheap and very-cheap cost models unless
> flag_predictive_commoning is active.
> 

Thanks for the hints! One question: could we just enable a non-unrolling
version of pcom whenever it's enabled implicitly by flag_tree_loop_vectorize,
without considering the vect cost model? Although the very-cheap and cheap cost
models are very likely associated with -O2, users can still try the dynamic or
unlimited cost model at -O2, or the very-cheap/cheap cost model at -O3, so it
seems not good to map the cost model onto the unroll decision here directly. Or
maybe we should check the optimization level? Such as:

  virtual bool
  gate (function *)
  {
if (flag_predictive_commoning != 0)
  return true;
if (flag_tree_loop_vectorize)
  {
allow_unroll_p = optimize > 2;
return true;
  }
return false;
  }


> It's probably also a good idea to investigate whether the
> update_ssa calls in pcom can be delayed to until after all transforms
> have been done.

OK, will check this later.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #6 from Kewen Lin  ---
Created attachment 50894
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50894&action=edit
Method 1, implicitly enable pcom without unrolling once loop vectorization is
enabled but pcom isn't set explicitly.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #7 from Kewen Lin  ---
Created attachment 50895
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50895&action=edit
Method 2, let pre generate loop carried dependence for very cheap and cheap
cost model.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #8 from Kewen Lin  ---
Created attachment 50896
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50896&action=edit
M1 M2 SPEC2017 P9 eval result

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

--- Comment #9 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > 
> > --- Comment #4 from Kewen Lin  ---
> > (In reply to rguent...@suse.de from comment #3)
> > > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > > > 
> > > > --- Comment #2 from Kewen Lin  ---
> > > > (In reply to Richard Biener from comment #1)
> > > > 
> > > > Thanks for the comments!
> > > > 
> > > > > There's predictive commoning which can do similar transforms and runs 
> > > > > after
> > > > > vectorization.  It might be it doesn't handle these "simple" cases or 
> > > > > that
> > > > > loop dependence info is not up to the task there.
> > > > > 
> > > > 
> > > > pcom does fix this problem, but it's enabled by default at -O3. Could 
> > > > it be
> > > > considered to be run at O2? Or enabled at O2 at some conditions such 
> > > > as: only
> > > > for one loop which skips loop carried optimization and isn't vectorized
> > > > further?
> > > 
> > > I think pcom should be enabled when vectorization is due to the 
> > > interaction with PRE.  It could be tamed down (it can do 
> > > peeling/unrolling 
> > > which is why it is -O3) based on the vectorizer cost model active
> > > if only implicitely enabled ...   Things will get a bit messy I guess.
> > > 
> > 
> > Good point! I prefer this idea to the one guarding cost model in sccvn 
> > code. 
> > 
> > > > > Another option is to avoid the PRE guard with the (very) cheap cost 
> > > > > model
> > > > > at the expense of not vectorizing affected loops.
> > > > > 
> > > > 
> > > > OK, I will benchmark this to see its impact. For the particular loops in
> > > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got 
> > > > improved
> > > > at cheap cost model on both Power8 and Power9 (a bit though). So I will 
> > > > just
> > > > test the impact on very cheap cost model.
> > > 
> > > So another thing to benchmark would be to enable pcom but make sure
> > > 
> > >   /* Determine the unroll factor, and if the loop should be unrolled, 
> > > ensure
> > >  that its number of iterations is divisible by the factor.  */
> > >   unroll_factor = determine_unroll_factor (chains);
> > >   scev_reset ();
> > >   unroll = (unroll_factor > 1
> > > && can_unroll_loop_p (loop, unroll_factor, &desc));
> > > 
> > > is false for the cheap and very-cheap cost models unless
> > > flag_predictive_commoning is active.
> > > 
> > 
> > Thanks for the hints! One question: could we just enable non-unroll 
> > version of pcom if it's enabled by flag_tree_loop_vectorize implicitly 
> > without considering vect cost model? Although the very-cheap and cheap 
> > cost model are very likely associated to O2, users still can try dynamic 
> > or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it 
> > seems not good to map cost model onto unroll decision here directly. Or 
> > maybe we check the optimization level? such as:
> > 
> >   virtual bool
> >   gate (function *)
> >   {
> > if (flag_predictive_commoning != 0)
> >   return true;
> > if (flag_tree_loop_vectorize)
> >   {
> > allow_unroll_p = optimize > 2;
> 
> But what about -O2 -ftree-vectorize -fno-predictive-commoning?  IMHO
> we want to check global_options_set and for "implicit" pcom do not
> allow unrolling and for explicitely disabled pcom do not do any
> pcom at all.
> 
> Richard.

OK, the M1 patch follows the suggestion here: don't allow unrolling if pcom is
enabled implicitly by loop vectorization.

As the evaluation results show, the approach with implied pcom (M1) looks
better. With the very-cheap cost model, it can make wrf a bit better. With the
cheap cost model, the loop-carried dependence (M2) did affect some benchmarks
(especially wrf, -9%). The build time increase with M1 looks fine too.

Btw, with the two "update_ssa (TODO_update_ssa_only_virtuals)" call sites
commented out, it bootstrapped both the default way and with
--with-build-config=bootstrap-O3. I will check with regression testing and a
SPEC2017 build, but I guess that's a separate patch.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-05-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2021-05-31

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

2021-06-08 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Kewen Lin  ---
Fixed on trunk, will continue to refactor the tree_predictive_commoning_loop
and its callees into class and member functions as suggested.

[Bug rtl-optimization/100328] IRA doesn't model matching constraint well

2021-06-24 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

--- Comment #3 from Kewen Lin  ---
(In reply to Vladimir Makarov from comment #2)
> (In reply to Kewen Lin from comment #1)
> > Created attachment 50715 [details]
> > ira:consider matching cstr in all alternatives
> > 
> > With little understanding on ira, I am not quite sure this patch is on the
> > reasonable direction. It aims to check the matching constraint in all
> > alternatives, if there is one alternative with matching constraint and
> > matches the current preferred regclass, it will record the output operand
> > number and further create one copy for it. Normally it can get the priority
> > against shuffle copies and the matching constraint will get satisfied with
> > higher possibility, reload doesn't create extra copies to meet the matching
> > constraint or the desirable register class when it has to.
> > 
> > For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay as
> > shuffle copies, and later any of A,B,C,D gets assigned by one hardware
> > register which is a VSX register but not a FP register, which means it has
> > to pay costs once we can NOT go with VSX alternatives, so at that time we
> > can increase the freq for the remaining copies related to this, once the
> > matching constraint gets satisfied further, there aren't any extra costs to
> > pay. This idea seems a bit complicated in the current framework, so the
> > proposed patch aggressively emphasizes the matching constraint at the time
> > of creating copies.
> > 
> > FWIW bootstrapped/regtested on powerpc64le-linux-gnu P9. The evaluation with
> > Power9 SPEC2017 all run shows 505.mcf_r +2.98%, 508.namd_r +3.37%, 519.lbm_r
> > +2.51%, no remarkable degradation is observed.
> 
> Thank you for working on this issue.
> 
> The current implementation of ira_get_dup_out_num was specifically tuned for
> better register allocation for x86-64 div insns.
> 
> Your patch definitely improves code for power9 and I would say significantly
> (congratulations!).  The patch you proposed makes me think that it might
> work for major targets as well.
> 
> I would prefer to avoid introducing new parameter because there are too many
> of them already and its description is cryptic.
> 

Thanks for your comments, Vladimir!  Yeah, Segher also thought it can benefit
other targets and suggested making it on by default, so I've made this parameter
on by default in v2. If it turns out fine on x86-64 and aarch64 after some
testing and benchmarking, I think we can simply get rid of the parameter as you
suggested.

> It would be nice if you benchmark the patch on x86-64 too, If there is no
> overall degradation with new behaviour we could remove the parameter and
> make the new behaviour as a default. If it is not, well we will keep the
> parameter.
> 

Sorry, I don't have an x86-64 or aarch64 performance machine at hand. The new
version v2 was bootstrapped/regtested on powerpc64le-linux-gnu P9 and
x86_64-redhat-linux, but had some failures on aarch64 which I'm still
investigating. Once they are root-caused and fixed, I will ask around for folks
to help benchmark this.

> As for the patch itself, I don't like some variable names.  Sorry.  Could
> you use op_regno, out_regno, and present_alt instead of op_no, out_no, tot. 
> Please, in general use longer variable names reflecting their purpose as GCC
> developers reads code in many times more than writing it.

Got it, thanks for the suggestion! This part has been simplified with
recog_op_alt, hope it looks better.

[Bug rtl-optimization/100328] IRA doesn't model matching constraint well

2021-06-24 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2021-06-24
 Ever confirmed|0   |1

--- Comment #4 from Kewen Lin  ---
Created attachment 51059
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51059&action=edit
ira-respect-matching-constraint-v2

This v2 considers the situation where, for one preferred register class,
there can be two or more alternatives, one of which has the matching
constraint while another doesn't.  For the given operand, even if
it's assigned a hardware reg which doesn't meet the matching
constraint, it can simply use the alternative without the matching
constraint, so no register copy is needed.  One typical case is
define_insn *mov_internal2 on rs6000:

(define_insn "*mov_internal2"
  [(set (match_operand:CC 2 "cc_reg_operand" "=y,x,?y")
(compare:CC (match_operand:P 1 "gpc_reg_operand" "0,r,r")
(const_int 0)))
   (set (match_operand:P 0 "gpc_reg_operand" "=r,r,r") (match_dup 1))]
  ""
  "@
   cmpi %2,%0,0
   mr. %0,%1
   #"

So we shouldn't create a constraint copy for it.  For fma style insns on
rs6000, there are also several alternatives for the preferred regclass and
also only one with a matching constraint.  The difference is that for the
given operand, although no matching constraint applies in the alternative,
a matching constraint is applied to another input operand in that same
alternative; that is, when one matching constraint can be applied to more
than one input operand, the pattern has to have several alternatives like
this.  And creating constraint copies for all of these input operands with
the matching constraint is fine: once the matching constraint is honored on
one input operand, it implicitly disables the others due to the interference
relationship.  So this patch records and checks all the other alternatives,
which don't have the matching constraint but are within the preferred
classes, to see whether some input operand there has the same matching
constraint.

It also considers a possibly free register move within the same register
class, and disables this handling in that case, since the register copy to
meet the constraint is then considered free.

I re-evaluated SPEC2017 performance with the Ofast unroll option set;
benchmarks 508.namd_r and 519.lbm_r were observed to improve by 2.4% ~ 3.8%
on Power8 and Power9.

As mentioned before, it's bootstrapped/regtested on powerpc64le-linux-gnu P9
and x86_64-redhat-linux, but hit some regression failures on aarch64. I am
still investigating the only PASS->FAIL (the others are XFAIL->XPASS):

PASS->FAIL: gcc.target/aarch64/sve/acle/general/pr94683.c -march=armv8.2-a+sve
-moverride=tune=none  check-function-bodies test

In this case, the newly created constraint copy (previously a shuffle copy) is
expected, but the copy change somehow affects the full cost of register 92 due
to a conflict with reg 102. This needs more digging.

[Bug rtl-optimization/100328] IRA doesn't model matching constraint well

2021-06-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

--- Comment #5 from Kewen Lin  ---
Created attachment 51065
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51065&action=edit
ira: Consider matching constraint heavily with some parameter v3

The single aarch64-linux-gnu "PASS->FAIL" regression mentioned above exposes
one issue: how to set up the costs when either side of the matching constraint
is a hardware register (hw reg for short below) while several operands can have
the matching constraint. Without this patch, a constraint copy is simply taken
as a real move insn, so when one side is a hw reg, it sets the preference and
conflict cost to a big value; that is fine since we used to create a constraint
copy only when a single operand respects the matching constraint. With this
patch, we allow several input operands to respect the same matching constraint
(but in different alternatives), so even when one side is a hw reg, we can't
set the preference/conflict cost to a high value.

CASE 1:
  hw_reg0 = reg1, reg2, reg3 ... (op1/2/3 has the matching constraint on op0)

We can make reg1/reg2/reg3 prefer to be assigned hw_reg0 over the other
hw_regs. But since each of reg1/reg2/reg3 conflicts with the others, we can't
disparage reg1's conflict objects on hw_reg0, since that could prevent hw_reg0
from being assigned to reg2/reg3 and also affect their allocno copies. So
simply treat it as a shuffle copy.

CASE 2:
  reg0 = hw_reg1, reg2, reg3 ... (op1/2/3 has the matching constraint on op0)

hw_reg1 is just one option to assign to reg0; we shouldn't give it much cost,
otherwise it can disparage reg0's copies with reg2/reg3. The patch now simply
takes it as a shuffle copy, so it slightly prefers hw_reg1 but still has the
flexibility to go with some hw_reg from reg2/reg3. Note that we could catch
this early in function ira_get_dup_out_num, but I think it's clearer to leave
the handling in process_regs_for_copy near the CASE 1 handling, and easier to
tweak there in case we need to.

[Bug rtl-optimization/100328] IRA doesn't model matching constraint well

2021-06-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

--- Comment #6 from Kewen Lin  ---
Created attachment 51066
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51066&action=edit
aarch64 XPASS failure list

The patch v3 bootstrapped and regression-tested on x86_64-redhat-linux and
powerpc64le-linux-gnu, but still has some "XFAIL -> XPASS" regression failures
on aarch64-linux-gnu. I think these remaining failures are expected (the
assembly gets improved), but I need Richard's help to double-check/confirm them.

For example, for div_f16.c, mad_f32.c and sub_f64.c, the differing assembly
looks like below:

BEFORE:

div_h4_f16_x_untied:

mov z4.h, h4
movprfx z0, z1
fdiv    z0.h, p0/m, z0.h, z4.h
ret

mad_s4_f32_x_untied:

mov z4.s, s4
movprfx z0, z4
fmla    z0.s, p0/m, z1.s, z2.s
ret

sub_d4_f64_x_untied:

mov z4.d, d4
movprfx z0, z1
fsub    z0.d, p0/m, z0.d, z4.d
ret

AFTER:

div_h4_f16_x_untied:

mov z0.h, h4
fdivr   z0.h, p0/m, z0.h, z1.h
ret

mad_s4_f32_x_untied:

mov z0.s, s4
fmla    z0.s, p0/m, z1.s, z2.s
ret

sub_d4_f64_x_untied:

mov z0.d, d4
fsubr   z0.d, p0/m, z0.d, z1.d
ret

The assembly with this patch saves the movprfx.

[Bug target/101235] [11/12 Regression] Fails to bootstrap with binutils 2.32

2021-06-27 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101235

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
 CC||linkw at gcc dot gnu.org,
   ||segher at gcc dot gnu.org
   Last reconfirmed||2021-06-28

--- Comment #2 from Kewen Lin  ---
Fixed with r12-1738 on trunk, need one backport.

[Bug target/101235] [11/12 Regression] Fails to bootstrap with binutils 2.32

2021-06-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101235

--- Comment #3 from Kewen Lin  ---
Will backport the fix after July 7th, 2021 (two weeks after it landed on
trunk), provided this isn't urgent and the backport gets approved in the
meantime.

[Bug target/101235] [11/12 Regression] Fails to bootstrap with binutils 2.32

2021-06-28 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101235

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #4)
> (In reply to Kewen Lin from comment #3)
> > Will backport the fix after 2021 July 7th (two weeks since it's into trunk)
> > if this isn't urgent meanwhile got the backport approval.
> 
> The reason to not backport immediately is because you end up with a
> royal mess if your patch causes regressions.  A week is normally
> plenty: there will be lots of test results that can give you
> confidence things work.
> 
> If a patch caused build failures, we want that fixed as soon as
> possible, because we will not have test results at all otherwise :-)

Thanks for the note. I thought this was only a partial build issue (the build
still works with newer binutils), so I tried to be conservative. It built
successfully with the latest releases/gcc-11, and also bootstrapped and
regression-tested on releases/gcc-11 (with newer binutils for the regression
testing, of course).

Should be fixed on gcc-11.

[Bug target/108240] [13 Regression] ICE in emit_library_call_value_1 at gcc/calls.cc:4181 since r13-4894-gacc727cf02a144

2023-01-09 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

--- Comment #5 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #4)
> (In reply to Kewen Lin from comment #3)
> > With the culprit commit r13-4894, we always implicitly enable powerpc64 for
> > both explicit and implicit 64 bit, it's the same as before for the explicit
> > 64 bit case, but for the implicit 64 bit case, there is no chance for the
> > used cpu to unset powerpc64 (like this case). To keep it consistent with the
> > previous, the fix can be to only enable powerpc64 implicitly for explicit 64
> > bit, while let it be for implicit 64 bit.
> 
> No?  If the user says to use a CPU without 64-bit instructions, while the
> user also says we require 64-bit insns (via -m64), we should just error.

But both the previous behavior (before r13-4894) and the current behavior
(starting from r13-4894) honour an explicit -m64: they always enable
-mpowerpc64 at the same time, without any errors/warnings.

> Not hide the problem (and cause many more problems!)
> 

The behavior change is for the case without any explicit -m64 but where
TARGET_DEFAULT has 64 bit set (implicit -m64).  And yes, unlike the previous
behavior, the current behavior hides the error/warning and forces
-mpowerpc64, so I posted one patch at:

https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609492.html

It allows powerpc64 to be unset if the user specifies a CPU without 64-bit
instructions together with implicit 64 bit.

[Bug target/108272] [13 Regression] ICE in gen_movxo, at config/rs6000/mma.md:339

2023-01-09 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108272

--- Comment #4 from Kewen Lin  ---
(In reply to Arseny Solokha from comment #3)
> (In reply to Kewen Lin from comment #2)
> > Created attachment 54192 [details]
> > untested patch
> > 
> > Hi @Arseny, I hope this patch can help to clear all the ICEs about
> > unexpected uses of MMA opaque types in inline asm, that is to filter those
> > noises duplicated to this bug.
> 
> Indeed, I haven't seen such ICEs w/ the patch applied so far. Still get an
> ICE in gen_movoo, at config/rs6000/mma.md:292 when compiling
> gcc/testsuite/gcc.target/powerpc/pr96506-1.c w/ -m32, though. Do you want me
> to file another PR for that one?

Thanks @Arseny!  Yes, please file a new one. Thanks again.

[Bug target/108240] [13 Regression] ICE in emit_library_call_value_1 at gcc/calls.cc:4181 since r13-4894-gacc727cf02a144

2023-01-09 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

--- Comment #6 from Kewen Lin  ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > (In reply to Kewen Lin from comment #3)
> > > With the culprit commit r13-4894, we always implicitly enable powerpc64 
> > > for
> > > both explicit and implicit 64 bit, it's the same as before for the 
> > > explicit
> > > 64 bit case, but for the implicit 64 bit case, there is no chance for the
> > > used cpu to unset powerpc64 (like this case). To keep it consistent with 
> > > the
> > > previous, the fix can be to only enable powerpc64 implicitly for explicit 
> > > 64
> > > bit, while let it be for implicit 64 bit.
> > 
> > No?  If the user says to use a CPU without 64-bit instructions, while the
> > user also says we require 64-bit insns (via -m64), we should just error.
> 
> But both the previous behavior (before r13-4894) and the current behavior
> (starting from r13-4894) honour the given explicit -m64, it would always
> enable -mpowerpc64 at the same time without any errors/warnings.
> 

It's implied that when the user explicitly specifies -m64, the handling
neglects the impact of the CPU. I'm not sure whether that's intentional, but
the reason is probably that the underlying CPU is actually 64-bit in most
cases, so making -m64 win lets the compilation go forward.

If we change the behavior to error for both explicit and implicit 64 bit, some
compilations which worked in the past would start to fail (though it's arguable
that that would be expected). Note that for implicit 64 bit and no powerpc64,
we get errors on Linux but just warnings on darwin/aix (maybe more fallout
would show up there). So considering the current release phase, I'm inclined to
just make it consistent with the previous behavior, and try to adjust it (per
Segher's proposal) in the next release.

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-01-09 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2023-01-10
   Target Milestone|--- |13.0
 CC||linkw at gcc dot gnu.org

--- Comment #1 from Kewen Lin  ---
Thanks for reporting, confirmed!

Since ppc64le is a 64-bit-only configuration, it rejects -m32, but this can be
reproduced on ppc64 with options:

-m32 -mcpu=power10 -mno-altivec

Investigating.

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-01-09 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

--- Comment #2 from Kewen Lin  ---
This is a 32-bit specific issue; the function rs6000_pass_by_reference has:

  /* Allow -maltivec -mabi=no-altivec without warning.  Altivec vector
 modes only exist for GCC vector types if -maltivec.  */
  if (TARGET_32BIT && !TARGET_ALTIVEC_ABI && ALTIVEC_VECTOR_MODE (arg.mode))
{
  if (TARGET_DEBUG_ARG)
fprintf (stderr, "function_arg_pass_by_reference: AltiVec\n");
  return 1;
}

It assumes that AltiVec is on whenever we see those AltiVec vector modes, but that doesn't hold with the option combination in this case. rs6000_pass_by_reference returns true, so the generic code then tries to make a copy of the object and pass its address to the function being called; it invokes store_expr to store into the copy, which triggers the ICE.

And targetm.calls.function_arg is too late to raise the errors.

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-01-10 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

--- Comment #3 from Kewen Lin  ---
There seem to be two alternatives to fix this: one is to raise an error in rs6000_pass_by_reference like we do in rs6000_function_arg, but we would need something like mma_return_type_error to avoid erroring twice; the other is to teach rs6000_opaque_type_invalid_use_p about function call statements' arguments and return values. For now the argument and return value checking is against the modes directly rather than the types; maybe unifying them into rs6000_opaque_type_invalid_use_p is better?

Hi @Segher and @Peter, what do you think of this?

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-01-10 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

--- Comment #4 from Kewen Lin  ---
(In reply to Kewen Lin from comment #3)
> There seem to be two alternatives to fix this, one is to raise error in
> rs6000_pass_by_reference like what we do in rs6000_function_arg, but we need
> something like mma_return_type_error to avoid re-erroring; the other is to
> teach the function rs6000_opaque_type_invalid_use_p about function call
> statement on arguments and return values. For now, the argument and return
> value checking is against the modes directly rather than the types, maybe to
> unique them into rs6000_opaque_type_invalid_use_p is better?
> 
> Hi @Segher and @Peter, what do you think of this?


Ah, the latter is a bad idea, as the current checking is only for the non-MMA case, while the function argument and return value checking should happen all the time. On further thought, the fix can be the following (always return false for MMA modes, and let rs6000_function_arg raise the error message):

diff --git a/gcc/config/rs6000/rs6000-call.cc
b/gcc/config/rs6000/rs6000-call.cc
index 59c51fa3579..6767a1f541c 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -2013,6 +2013,11 @@ rs6000_pass_by_reference (cumulative_args_t, const
function_arg_info &arg)
 {
   if (TARGET_DEBUG_ARG)
 fprintf (stderr, "function_arg_pass_by_reference: AltiVec\n");
+  /* We do not allow MMA types being used as function arguments,
+ return false to avoid the ICE on the copying for passing by
+ reference.  */
+  if (TYPE_MODE (arg.type) == OOmode || TYPE_MODE (arg.type) == XOmode)
+return 0;
   return 1;
 }

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-01-10 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

--- Comment #6 from Kewen Lin  ---
(In reply to Peter Bergner from comment #5)
> (In reply to Kewen Lin from comment #1)
> > diff --git a/gcc/config/rs6000/rs6000-call.cc
> > b/gcc/config/rs6000/rs6000-call.cc
> > index 59c51fa3579..6767a1f541c 100644
> > --- a/gcc/config/rs6000/rs6000-call.cc
> > +++ b/gcc/config/rs6000/rs6000-call.cc
> > @@ -2013,6 +2013,11 @@ rs6000_pass_by_reference (cumulative_args_t, const
> > function_arg_info &arg)
> >  {
> >if (TARGET_DEBUG_ARG)
> >  fprintf (stderr, "function_arg_pass_by_reference: AltiVec\n");
> > +  /* We do not allow MMA types being used as function arguments,
> > + return false to avoid the ICE on the copying for passing by
> > + reference.  */
> > +  if (TYPE_MODE (arg.type) == OOmode || TYPE_MODE (arg.type) == XOmode)
> > +return 0;
> >return 1;
> >  }
> 
> This doesn't seem right.  The function comment for rs6000_pass_by_reference
> is:
> 
>A C expression that indicates when an argument must be passed by
>reference.  If nonzero for an argument, a copy of that argument is
>made in memory and a pointer to the argument is passed instead of
>the argument itself.  The pointer is passed in whatever way is
>appropriate for passing a pointer to that type.
> 
> By returning false here, that seems to imply that MMA types don't have to be
> passed by reference.  Does that imply they can be passed by value?  If so,
> that's not correct, since if we have a MMA type as a function
> argument/return type, it must be via a pointer to that type, so
> pass-by-reference.

Thanks @Peter for the comments! I think you are right. By checking what happens after pass-by-reference returns true, I found the argument (tree_value) is replaced with a pointer to that type, which is permitted. I'll go back to the approach of teaching rs6000_opaque_type_invalid_use_p about the invalid use of MMA type arguments (without MMA support) then.

[Bug target/108240] [13 Regression] ICE in emit_library_call_value_1 at gcc/calls.cc:4181 since r13-4894-gacc727cf02a144

2023-01-11 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

--- Comment #8 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #7)
> -m64 requires 64-bit instructions.  We will ICE if we try to generate code
> for -m64 without support for 64-bit insns enabled in the compiler.  For
> example, the stdu insn is required to implement the ABI sanely.
> 

The current behavior for an explicit command-line -m64 doesn't violate that: an explicitly given -m64 enables powerpc64 all the time, so an -m64 compilation always has 64-bit insns enabled. This holds both before and after r13-4894.

> If the user said they want a -mcpu= for a CPU that has no 64-bit insns,
> but also wants to use -m64, we should just say sorry, that won't fly.

I agree that this is a sensible thing to look into. But changing the behavior fully like this (on Linux, aix and darwin, in 64-bit environments with or without an explicit -m64) is a big adjustment compared with the previous behaviors.

For the case "explicit -m64 + cpu without 64-bit insns + Linux/aix/darwin" we didn't emit errors before, and for the cases "no explicit -m64 + cpu without 64-bit insns + aix/darwin" we only emitted warnings. Only for the case "no explicit -m64 + cpu without 64-bit insns + Linux" did we emit an error before r13-4894; the culprit commit changed it to not emit the error any more, and that part is the regression, which the proposed patch fixes.
For the other cases, where we didn't emit errors either before or after r13-4894, making them emit errors is new behavior; it could cost non-trivial effort (at least on testing and fixing possible fallouts).

[Bug target/108415] New: ICE in emit_library_call_value_1 at gcc/calls.cc:4181

2023-01-15 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108415

Bug ID: 108415
   Summary: ICE in emit_library_call_value_1 at gcc/calls.cc:4181
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
CC: bergner at gcc dot gnu.org, linkw at gcc dot gnu.org,
marxin at gcc dot gnu.org, segher at gcc dot gnu.org
Depends on: 108240
  Target Milestone: ---
Target: powerpc

+++ This bug was initially created as a clone of Bug #108240 +++

For the case
/home/marxin/Programming/gcc/gcc/testsuite/gfortran.dg/minloc_string_1.f90,

with options:

  -mmodulo -mcpu=401 -m64 (Linux and the default cpu power8)

Or options:

  -m32 -mcpu=401 -mmodulo (Linux BE and the default cpu power8)

we can see exactly the same ICE as in PR108240.

This bug is split off from PR108240 because it's a latent bug: the ICE itself actually has nothing to do with r13-4894, which reworked the powerpc64 handling.

The root cause is that -mcpu=401 implies soft float while -mmodulo implies ISA_3_0_MASKS_SERVER (VSX); these conflict with each other, but this check runs too early:

if (!TARGET_HARD_FLOAT)
  {
if (rs6000_isa_flags_explicit & OPTION_MASK_VSX)
  msg = N_("%<-mvsx%> requires hardware floating point");
else
  {
rs6000_isa_flags &= ~ OPTION_MASK_VSX;
rs6000_isa_flags_explicit |= OPTION_MASK_VSX;
  }
  }


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240
[Bug 108240] [13 Regression] ICE in emit_library_call_value_1 at
gcc/calls.cc:4181 since r13-4894-gacc727cf02a144

[Bug target/108415] ICE in emit_library_call_value_1 at gcc/calls.cc:4181

2023-01-15 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108415

Kewen Lin  changed:

   What|Removed |Added

URL||https://gcc.gnu.org/piperma
   ||il/gcc-patches/2023-January
   ||/609724.html
 Ever confirmed|0   |1
   Last reconfirmed||2023-01-16
   Target Milestone|--- |13.0
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

[Bug target/108240] [13 Regression] ICE in emit_library_call_value_1 at gcc/calls.cc:4181 since r13-4894-gacc727cf02a144

2023-01-15 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

--- Comment #9 from Kewen Lin  ---
I filed one new bug PR108415 for the ICE itself to avoid the confusion here.

This ICE is not a regression; it's a latent bug, because:

1) Without the culprit commit r13-4894 (e.g. using r13-4893 or reverting it), if we specify one more explicit option -m64, we see exactly the same ICE. It was just concealed because we emitted the error first on Linux without -m64 (r13-4894 changed it not to emit the error any more, so there is no error to conceal it now).

2) It even has nothing to do with the powerpc64 handling: this ICE can be produced on ppc64 Linux with the options "-m32 -mcpu=401 -mmodulo", so it's not 64-bit specific (and hence not powerpc64-handling related either).

I put one comment about root cause of this ICE in PR108415.

[Bug target/108240] [13 Regression] Error message missing since r13-4894-gacc727cf02a144 (then make concealed ICE exposed)

2023-01-15 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

Kewen Lin  changed:

   What|Removed |Added

Summary|[13 Regression] ICE in  |[13 Regression] Error
   |emit_library_call_value_1   |message missing since
   |at gcc/calls.cc:4181 since  |r13-4894-gacc727cf02a144
   |r13-4894-gacc727cf02a144|(then make concealed ICE
   ||exposed)
URL||https://gcc.gnu.org/piperma
   ||il/gcc-patches/2023-January
   ||/609492.html

--- Comment #10 from Kewen Lin  ---
Putting the ICE aside, the actual regression here is that for the given option set the expected error message below is missing:

  f951: Error: ‘-m64’ requires a PowerPC64 cpu

But note that we only emit this error message when there is no explicit -m64. That the ICE showed up with an explicit -m64 even without r13-4894 also proves this (since otherwise we would see the error message first instead of the ICE).

So to fix the regression itself, we only need to consider the case without any explicit -m64, which is what the proposed patch aims at.

[Bug target/108396] [12/13 Regression] PPCLE: vec_vsubcuq missing since r12-5752-gd08236359eb229

2023-01-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
Summary|[12/13 Regression] PPCLE:   |[12/13 Regression] PPCLE:
   |vec_vsubcuq missing |vec_vsubcuq missing since
   ||r12-5752-gd08236359eb229
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 CC||linkw at gcc dot gnu.org

--- Comment #2 from Kewen Lin  ---
Yes, it's a typo; it turns the macro definition into:

#define vec_vsubcuqP __builtin_vec_vsubcuq

Unfortunately we don't have the testing coverage in testsuite for the expected
name vec_vsubcuq (in rs6000-vecdefines.h):

$grep -ir vsubcuq gcc/testsuite/gcc.target/powerpc/
gcc/testsuite/gcc.target/powerpc//p8vector-int128-1.c:  return
__builtin_vec_vsubcuq (p, q);
gcc/testsuite/gcc.target/powerpc//p8vector-int128-1.c:/* { dg-final {
scan-assembler   "vsubcuq"   } } */
gcc/testsuite/gcc.target/powerpc//p8vector-builtin-8.c:/* { dg-final {
scan-assembler-times "vsubcuq" 2 } } */

FWIW, it started to fail from r12-5752-gd08236359eb229, which enabled the new bif framework, though commit r12-3169 introduced the typo, as Andrew pointed out.

[Bug target/108396] [12/13 Regression] PPCLE: vec_vsubcuq missing since r12-5752-gd08236359eb229

2023-01-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

--- Comment #4 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #3)
> (In reply to Kewen Lin from comment #2)
> > Unfortunately we don't have the testing coverage in testsuite for the
> > expected name vec_vsubcuq (in rs6000-vecdefines.h):
> 
> Is it hard to add one for all (or many) of the legacy builtins?  Do we want
> to test more than just if it compiles?

Good question. I think it's not hard to add several (grouped by their stanzas), and testing that they compile is enough. I just filed an internal issue to track this for GCC 14: visit all the bif names defined in the header file and find or create a test for each in the testsuite.

[Bug target/108240] [13 Regression] Error message missing since r13-4894-gacc727cf02a144 (then make concealed ICE exposed)

2023-01-17 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108240

--- Comment #13 from Kewen Lin  ---
(In reply to Martin Liška from comment #11)
> One more test-case that started to ICE with the same revision:
> 
> ./xgcc -B.
> /home/marxin/Programming/gcc/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-2.c
> -mcpu=e300c2 -mmodulo -c
> during RTL pass: reload
> /home/marxin/Programming/gcc/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-2.c:
> In function ‘test_mult’:
> /home/marxin/Programming/gcc/gcc/testsuite/gcc.dg/atomic/c11-atomic-exec-2.c:
> 74:1: internal compiler error: maximum number of generated reload insns per
> insn achieved (90)
>74 | }
>   | ^
> 0x778bb5af __libc_start_call_main
>   ../sysdeps/nptl/libc_start_call_main.h:58
> 0x778bb678 __libc_start_main_impl
>   ../csu/libc-start.c:381
> 0xb648e4 _start
>   ../sysdeps/x86_64/start.S:115
> Please submit a full bug report, with preprocessed source (by using
> -freport-bug).
> Please include the complete backtrace with any bug report.
> See  for instructions.

I think it has the same root cause as PR108415:

RS6000_CPU ("e300c2", PROCESSOR_PPCE300C2, OPTION_MASK_SOFT_FLOAT)

I also expect this ICE can be reproduced with an explicit -m64 at r13-4893 (before the culprit commit r13-4894).

[Bug target/108415] ICE in emit_library_call_value_1 at gcc/calls.cc:4181

2023-01-17 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108415

--- Comment #2 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #1)
> Soft float does not conflict with anything (anything that does not need
> FP registers that is).  But yes, we really should neuter -mmodulo.

The contradictory thing is that with the proposed option combination we can have TARGET_VSX and TARGET_SOFT_FLOAT (!TARGET_HARD_FLOAT) at the same time. :)

I agree that we should neuter -mmodulo, but note that this ICE isn't -mmodulo
specific.

With some more testing of the flags which the proposed patch (see the URL field) tries to take care of, the ICE can be reproduced with any one of the following:

  -mcpu=401 -m64 -mmodulo
  -mcpu=401 -m64 -mpower9-vector
  -mcpu=401 -m64 -mpower9-misc
  -mcpu=401 -m64 -mpower8-vector
  -mcpu=401 -m64 -mcrypto

For -mcpu=401 -m64 -mpower9-minmax, it emits the error below instead of the ICE:

  f951: Error: power9 target option is incompatible with ‘-mcpu=<xxx>’ for <xxx> less than power9

[Bug target/108415] ICE in emit_library_call_value_1 at gcc/calls.cc:4181

2023-01-17 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108415

--- Comment #4 from Kewen Lin  ---
(In reply to Peter Bergner from comment #3)
> (In reply to Kewen Lin from comment #2)
> > The contradictory thing is that we can have TARGET_VSX and TARGET_SOFT_FLOAT
> > (!TARGET_HARD_FLOAT) together with the proposed option combination. :)
> 
> I'm not sure what you mean with this comment, but just to be clear, we
> should not allow TARGET_VSX and TARGET_SFT_FLOAT (!TARGET_HARD_FLOAT) at the
> same time.

Sorry for the confusion and thanks for clarifying. I meant something similar: a given option combination like

  -mmodulo -mcpu=401 or -mcpu=401 -mpower9-vector or ...

makes us have both TARGET_VSX and TARGET_SOFT_FLOAT at the same time, which is contradictory (and shouldn't be allowed, as you noted).

[Bug target/108396] [12/13 Regression] PPCLE: vec_vsubcuq missing since r12-5752-gd08236359eb229

2023-02-12 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Kewen Lin  ---
Fixed on trunk.

[Bug target/108348] ICE in gen_movoo, at config/rs6000/mma.md:292

2023-02-12 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108348

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #11 from Kewen Lin  ---
Fixed on trunk and backported to related branches.

[Bug target/108272] [13 Regression] ICE in gen_movxo, at config/rs6000/mma.md:339

2023-02-12 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108272

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Kewen Lin  ---
Fixed on trunk and backported to related branches.

[Bug analyzer/107396] [13 regression] new test case gcc.dg/analyzer/pipe-glibc.c in r13-3466-g792f039fc37faa fails with excess errors

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107396

--- Comment #2 from Kewen Lin  ---
*** Bug 108726 has been marked as a duplicate of this bug. ***

[Bug target/108726] gcc.dg/analyzer/pipe-glibc.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108726

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE
 CC||linkw at gcc dot gnu.org

--- Comment #1 from Kewen Lin  ---
As per the comments in PR107396 (same symptom), this one is a duplicate of that one, and it's Power9 specific (but not BE only?).

*** This bug has been marked as a duplicate of bug 107396 ***

[Bug target/108728] gcc.dg/torture/float128-cmp-invalid.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108728

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-02-14
 CC||linkw at gcc dot gnu.org,
   ||meissner at gcc dot gnu.org,
   ||segher at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
Confirmed.

On Power8 it calls the library function __lekf2, while on Power9 it leverages xscmpuqp, which unfortunately doesn't raise the invalid-operation exception for qNaN. So the test sees an unexpected invalid-exception result.

[Bug target/108729] gcc.target/powerpc/bfp/scalar-test-data-class-12.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108729

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2023-02-14
 Status|UNCONFIRMED |NEW
 CC||linkw at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
Confirmed.

Error message is:

gcc/testsuite/gcc.target/powerpc/bfp/scalar-test-data-class-12.c:31:3: error:
‘__builtin_vsx_scalar_insert_exp’ requires the ‘-mcpu=power9’ option and either
the ‘-m64’ or ‘-mpowerpc64’ option

The test case needs:

  /* { dg-require-effective-target has_arch_ppc64 } */

as it uses the bif scalar_insert_exp, which is only available on 64-bit.

[Bug target/108731] gcc.target/powerpc/bfp/vec-test-data-class-9.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108731

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-02-14
 Ever confirmed|0   |1
 CC||linkw at gcc dot gnu.org

--- Comment #1 from Kewen Lin  ---
Confirmed, it's similar to PR108729: it needs the has_arch_ppc64 check for the used bif scalar_insert_exp.

[Bug testsuite/108731] gcc.target/powerpc/bfp/vec-test-data-class-9.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108731

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|NEW |RESOLVED
  Component|target  |testsuite

--- Comment #2 from Kewen Lin  ---
Dup.

*** This bug has been marked as a duplicate of bug 108729 ***

[Bug target/108729] gcc.target/powerpc/bfp/scalar-test-data-class-12.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108729

--- Comment #2 from Kewen Lin  ---
*** Bug 108731 has been marked as a duplicate of this bug. ***

[Bug target/108729] gcc.target/powerpc/bfp/scalar-test-data-class-12.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108729

--- Comment #3 from Kewen Lin  ---
PR108731 has another test case gcc.target/powerpc/bfp/vec-test-data-class-9.c
which needs to be updated too.

[Bug target/108699] gcc.c-torture/execute/builtin-bitops-1.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108699

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2023-02-14
 CC||linkw at gcc dot gnu.org,
   ||meissner at gcc dot gnu.org,
   ||segher at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Kewen Lin  ---
Confirmed.

This is due to a latent bug. With -Os it no longer tries to use SImode parity but tries the wider-mode TImode parity instead, which produces the wrong result.

The current vector parity (including TImode) support is wrong:

;; Vector parity
(define_insn "*p9v_parity<mode>2"
  [(set (match_operand:VParity 0 "register_operand" "=v")
(parity:VParity (match_operand:VParity 1 "register_operand" "v")))]
  "TARGET_P9_VECTOR"
  "vprtyb<wd> %0,%1"
  [(set_attr "type" "vecsimple")])

vprtyb[wdq] computes a byte-wise parity (it XORs only the least significant bit of each byte of the element); it doesn't match the RTL parity semantics directly.

[Bug target/108699] gcc.c-torture/execute/builtin-bitops-1.c fails on power 9 BE

2023-02-14 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108699

--- Comment #3 from Kewen Lin  ---
One more test case fails with abort on both LE & BE (with -Ofast -mcpu=power9):



#define N 16

unsigned long long vals[N];
unsigned int res[N];
unsigned int expects[N] = {0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0};

unsigned long long inputs[N]
  = {0xULL, 0x0001ULL, 0x8000ULL,
 0x0002ULL, 0x4000ULL, 0x0001ULL,
 0x8000ULL, 0xa5a5a5a5a5a5a5a5ULL, 0x5a5a5a5a5a5a5a5aULL,
 0xcafecafeULL, 0xcafecafeULL, 0xcafecafeULL,
 0x80706000ULL, 0xULL};

__attribute__ ((noipa)) void
init ()
{
  for (int i = 0; i < N; i++)
vals[i] = inputs[i];
}

__attribute__ ((noipa)) void
do_parity ()
{
  for (int i = 0; i < N; i++)
res[i] = __builtin_parityll (vals[i]);
}

int
main (void)
{
  init ();
  do_parity ();
  for (int i = 0; i < N; i++)
if (res[i] != expects[i])
  __builtin_abort();

  return 0;
}



[Bug target/108807] [11 regression] gcc.target/powerpc/vsx-builtin-10d.c fails after b29225597584b697762585e0b707b7cb4b427650 on power 9 BE

2023-02-15 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108807

Kewen Lin  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 CC||linkw at gcc dot gnu.org
   Last reconfirmed||2023-02-16

--- Comment #1 from Kewen Lin  ---
Confirmed.

It's likely an endianness issue; I'll have a look first.

[Bug target/108807] [11/12/13 regression] gcc.target/powerpc/vsx-builtin-10d.c fails after r11-6857-gb29225597584b6 on power 9 BE

2023-02-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108807

--- Comment #2 from Kewen Lin  ---
Created attachment 54478
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54478&action=edit
untested patch

The lvsr and lvsl sequences for generating permutation control vectors only work for LE, as the element ordering differs between LE and BE. The proposed patch fixes the gen function for generating permutation control vectors to take endianness into account.

It can fix the exposed failures on vsx-builtin-{9,10,11,14,16,18}d.c; a full testing run is ongoing.

[Bug target/108814] gcc.target/powerpc/pr79251-run.p9.c fails on power 9 BE

2023-02-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108814

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED
 CC||linkw at gcc dot gnu.org

--- Comment #1 from Kewen Lin  ---
Dup.

*** This bug has been marked as a duplicate of bug 108807 ***

[Bug target/108807] [11/12/13 regression] gcc.target/powerpc/vsx-builtin-10d.c fails after r11-6857-gb29225597584b6 on power 9 BE

2023-02-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108807

--- Comment #3 from Kewen Lin  ---
*** Bug 108814 has been marked as a duplicate of this bug. ***

[Bug testsuite/108810] gcc.target/powerpc/fold-vec-extract-double.p9.c fails on power 9 BE

2023-02-16 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108810

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2023-02-17
 Status|UNCONFIRMED |NEW
  Component|target  |testsuite
 CC||linkw at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
This is a test case issue:

double
testd_cst (vector double vd2)
{
  return vec_extract (vd2, 1);
}

gets xxlor on LE but xxpermdi on BE.

The scanned insn xxlor is for the high-order element, whose index is 0 on BE and 1 on LE.

The below diff can fix it:

diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p9.c
b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p9.c
index 6c515035d1a..100f680fd02 100644
--- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p9.c
+++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p9.c
@@ -18,9 +18,15 @@ testd_var (vector double vd2, signed int si)
   return vec_extract (vd2, si);
 }

+#ifdef __BIG_ENDIAN__
+#define HIGH_ORDER_ELEMENT_INDEX 0
+#else
+#define HIGH_ORDER_ELEMENT_INDEX 1
+#endif
+
 double
 testd_cst (vector double vd2)
 {
-  return vec_extract (vd2, 1);
+  return vec_extract (vd2, HIGH_ORDER_ELEMENT_INDEX);
 }

[Bug testsuite/108813] gcc.target/powerpc/pr101384-2.c fails on power 9 BE

2023-02-17 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108813

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2023-02-17
 CC||linkw at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
  Component|target  |testsuite

--- Comment #1 from Kewen Lin  ---
This is a test issue: GCC generates xxspltib rather than vspltis[whb] for the const vector. The patch below fixes it:

-/* { dg-final { scan-assembler-times {\mvspltis[whb] [^\n\r]*,-1\M} 9 } } */
+/* { dg-final { scan-assembler-times {\mvspltis[whb] [^\n\r]*,-1\M|\mxxspltib
[^\n\r]*,255\M} 9 } } */

[Bug rtl-optimization/108273] Inconsistent dfa state between debug and non-debug

2023-02-22 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108273

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2023-02-23
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
