[Bug tree-optimization/100756] [12 Regression] vect: Superfluous epilog created on s390x

2023-02-01 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756

--- Comment #8 from rdapp at gcc dot gnu.org ---
For completeness: we haven't observed any fallout on s390 since, and the
regression is fixed.

[Bug middle-end/106527] New: ICE with modulo scheduling dump (-fdump-rtl-sms)

2022-08-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106527

Bug ID: 106527
   Summary: ICE with modulo scheduling dump (-fdump-rtl-sms)
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: zhroma at gcc dot gnu.org
  Target Milestone: ---
  Host: s390
Target: s390

Hi,

On s390 we are observing more and more problems with -fmodulo-sched.  I
initially tried debugging an -fcompare-debug failure with -fmodulo-sched, but
we already ICE just from dumping via `-fdump-rtl-sms`.

The problem occurs when compiling the test case
 gcc.dg/sms-compare-debug-1.c
with
 gcc -O2 -fmodulo-sched  sms-compare-debug-1.c -fdump-rtl-sms:

sms-compare-debug-1.c:36:1: internal compiler error: in
linemap_ordinary_map_lookup, at libcpp/line-map.cc:1064
   36 | }
  | ^
0x2694499 linemap_ordinary_map_lookup
../../libcpp/line-map.cc:1064
0x2694ef7 linemap_macro_loc_to_exp_point
../../libcpp/line-map.cc:1561
0x266a5c5 expand_location_1
../../gcc/input.cc:243
0x266c54d expand_location(unsigned int)
../../gcc/input.cc:956
0x1513ecb insn_location(rtx_insn const*)
../../gcc/emit-rtl.cc:6558
0x24cb523 dump_insn_location
../../gcc/modulo-sched.cc:1250
0x24cb523 dump_insn_location
../../gcc/modulo-sched.cc:1246
0x24cf5d7 sms_schedule
../../gcc/modulo-sched.cc:1418
0x24d267f execute
../../gcc/modulo-sched.cc:3358

I didn't manage to simplify the test case further. It works fine on x86.

The ICE does not seem to occur with GCC 11, therefore I can bisect the issue if
it's of any help. Given the several other problems we're having with modulo
scheduling I figured it's better to ask for general guidance here first.

Regards
 Robin

[Bug rtl-optimization/105988] [10/11/12/13 Regression] ICE in linemap_ordinary_map_lookup, at libcpp/line-map.cc:1064 since r6-4873-gebedc9a3414d8422

2022-08-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105988

rdapp at gcc dot gnu.org changed:

   What|Removed |Added

 Target|x86_64-pc-linux-gnu |x86_64-pc-linux-gnu s390

--- Comment #6 from rdapp at gcc dot gnu.org ---
We are also seeing this on s390, along with several other problems with
-fmodulo-sched.  Is this pass here to stay, or is it safe to ignore all
issues/FAILs with it because it's going away anyway?

Regards
 Robin

[Bug target/106701] Compiler does not take into account number range limitation to avoid subtract from immediate

2022-08-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

rdapp at gcc dot gnu.org changed:

   What|Removed |Added

 Target|s390|s390 x86_64-linux-gnu
 CC||glisse at gcc dot gnu.org,
   ||rdapp at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
Summary|s390: Compiler does not |Compiler does not take into
   |take into account number|account number range
   |range limitation to avoid   |limitation to avoid
   |subtract from immediate |subtract from immediate

--- Comment #1 from rdapp at gcc dot gnu.org ---
Added x86 to targets because we don't seem to optimize this there either (at
least I didn't see it on my recent-ish GCC).

The following (not regtested) helps on s390

diff --git a/gcc/match.pd b/gcc/match.pd
index e486b4be282c..2ebbf68010f9 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7992,3 +7992,27 @@ and,
 (match (bitwise_induction_p @0 @2 @3)
  (bit_not
   (nop_convert1? (bit_xor@0 (convert2? (lshift integer_onep@1 @2)) @3
+
+/* cst - a -> cst ^ a if 0 <= a <= cst and integer_pow2p (cst + 1).  */
+#if GIMPLE
+(simplify
+ (minus INTEGER_CST@0 @1)
+ (with {
+wide_int w = wi::to_wide (@0) + 1;
+value_range vr;
+wide_int wmin = w;
+wide_int wmax = w;
+if (get_global_range_query ()->range_of_expr (vr, @1)
+   && vr.kind () == VR_RANGE)
+   {
+ wmin = vr.lower_bound ();
+ wmax = vr.upper_bound ();
+   }
+  }
+   (if (wi::exact_log2 (w) != -1
+   && wi::geu_p (wmin, 0)
+   && wi::leu_p (wmax, w))
+(bit_xor @0 @1))
+ )
+)
+#endif

but it can surely still be improved by some match.pd magic.  A second question
would be: do we unconditionally want to simplify this, or should it rather be
backend-dependent?

[Bug target/106701] Compiler does not take into account number range limitation to avoid subtract from immediate

2022-08-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

--- Comment #3 from rdapp at gcc dot gnu.org ---
I thought expand (or combine) was independent of value ranges.  What would be
the proper place for it then?

[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)

2022-08-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213

rdapp at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #6 from rdapp at gcc dot gnu.org ---
What's the mechanism to get range information at RTL level? The only related
thing I saw in (e.g.) simplify-rtx.cc is nonzero_bits and this does not seem to
be propagated from gimple.

[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)

2022-08-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213

--- Comment #8 from rdapp at gcc dot gnu.org ---
Hacked something together, inspired by the other cases that try two different
sequences.  Is this going in the right direction?  It works for me on s390.  I
see some regressions related to predictive commoning that I will look into.

diff --git a/gcc/expr.cc b/gcc/expr.cc
index c90cde35006b..395b4df2e214 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "backend.h"
 #include "target.h"
 #include "rtl.h"
+#include "tree-core.h"
 #include "tree.h"
 #include "gimple.h"
 #include "predict.h"
@@ -65,7 +66,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "rtx-vector-builder.h"
 #include "tree-pretty-print.h"
 #include "flags.h"

 /* If this is nonzero, we do not bother generating VOLATILE
around volatile memory references, and we are willing to
@@ -9358,6 +9359,21 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode,
  return simplify_gen_binary (MINUS, mode, op0, op1);
}

+  /* Convert const - A to A xor const if integer_pow2p (const + 1)
+and 0 <= A <= const.  */
+  if (code == MINUS_EXPR
+ && TREE_CODE (treeop0) == INTEGER_CST
+ && SCALAR_INT_MODE_P (mode)
+ && unsignedp
+ && wi::exact_log2 (wi::to_wide (treeop0) + 1) != -1)
+   {
+ rtx res = maybe_optimize_cst_sub (code, treeop0, treeop1,
+   mode, unsignedp, type,
+   target, subtarget);
+ if (res)
+   return res;
+   }
+
   /* No sense saving up arithmetic to be done
 if it's all in the wrong mode to form part of an address.
 And force_operand won't know whether to sign-extend or
@@ -12641,6 +12657,77 @@ maybe_optimize_mod_cmp (enum tree_code code, tree *arg0, tree *arg1)
   return code == EQ_EXPR ? LE_EXPR : GT_EXPR;
 }

+/* Optimize cst - x if integer_pow2p (cst + 1) and 0 <= x <= cst.  */
+
+rtx
+maybe_optimize_cst_sub (enum tree_code code, tree treeop0, tree treeop1,
+   machine_mode mode, int unsignedp, tree type,
+   rtx target, rtx subtarget)
+{
+  gcc_checking_assert (code == MINUS_EXPR);
+  gcc_checking_assert (TREE_CODE (treeop0) == INTEGER_CST);
+  gcc_checking_assert (TREE_CODE (TREE_TYPE (treeop1)) == INTEGER_TYPE);
+  gcc_checking_assert (wi::exact_log2 (wi::to_wide (treeop0) + 1) != -1);
+
+  if (!optimize)
+return NULL_RTX;
+
+  optab this_optab;
+  rtx op0, op1;
+
+  if (wi::leu_p (tree_nonzero_bits (treeop1), tree_nonzero_bits (treeop0)))
+{
+  expand_operands (treeop0, treeop1, subtarget, &op0, &op1,
+  EXPAND_NORMAL);
+  bool speed_p = optimize_insn_for_speed_p ();
+  do_pending_stack_adjust ();
+  start_sequence ();
+  this_optab = optab_for_tree_code (MINUS_EXPR, type,
+   optab_default);
+  rtx subi = expand_binop (mode, this_optab, op0, op1, target,
+  unsignedp, OPTAB_LIB_WIDEN);
+
+  rtx_insn *sub_insns = get_insns ();
+  end_sequence ();
+  start_sequence ();
+  this_optab = optab_for_tree_code (BIT_XOR_EXPR, type,
+   optab_default);
+  rtx xori = expand_binop (mode, this_optab, op0, op1, target,
+  unsignedp, OPTAB_LIB_WIDEN);
+  rtx_insn *xor_insns = get_insns ();
+  end_sequence ();
+  unsigned sub_cost = seq_cost (sub_insns, speed_p);
+  unsigned xor_cost = seq_cost (xor_insns, speed_p);
+  /* If costs are the same then use the other factor as a tie
+breaker.  */
+  if (sub_cost == xor_cost)
+   {
+ sub_cost = seq_cost (sub_insns, !speed_p);
+ xor_cost = seq_cost (xor_insns, !speed_p);
+   }
+
+  if (sub_cost <= xor_cost)
+   {
+ emit_insn (sub_insns);
+ return subi;
+   }
+
+  emit_insn (xor_insns);
+  return xori;
+}
+
+  return NULL_RTX;
+}
+
 /* Optimize x - y < 0 into x < 0 if x - y has undefined overflow.  */

 void
diff --git a/gcc/expr.h b/gcc/expr.h
index 035118324057..9c4f2ed02fcb 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -317,6 +317,8 @@ extern tree string_constant (tree, tree *, tree *, tree *);
 extern tree byte_representation (tree, tree *, tree *, tree *);

 extern enum tree_code maybe_optimize_mod_cmp (enum tree_code, tree *, tree *);
+extern rtx maybe_optimize_cst_sub (enum tree_code, tree, tree,
+  machine_mode, int, tree, rtx, rtx);
 extern void maybe_optimize_sub_cmp_0 (enum tree_code, tree *, tree *);

 /* Two different ways of generating switch statements.  */

[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)

2022-08-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213

--- Comment #9 from rdapp at gcc dot gnu.org ---
The regressions are unrelated and due to another patch that I still had on the
same branch.

[Bug target/106919] [13 Regression] RTL check: expected code 'set' or 'clobber', have 'if_then_else' in s390_rtx_costs, at config/s390/s390.cc:3672on s390x-linux-gnu since r13-2251-g1930c5d05ceff2

2022-09-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106919

--- Comment #8 from rdapp at gcc dot gnu.org ---
Yes, one of dst and dest is superfluous.  Looks good like that.  I already
bootstrapped the same patch locally, no regressions.

[Bug tree-optimization/100756] vect: Superfluous epilog created on s390x

2022-10-20 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756

rdapp at gcc dot gnu.org changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #4 from rdapp at gcc dot gnu.org ---
Anything that can/should be done here in case we'd still want to tackle it in
this P1 cycle?

[Bug middle-end/107617] New: SCC-VN with len_store and big endian

2022-11-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617

Bug ID: 107617
   Summary: SCC-VN with len_store and big endian
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: richard.guenther at gmail dot com
  Target Milestone: ---
Target: s

Created attachment 53871
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53871&action=edit
s390 patch for len_load/len_store

Hi,

Richard and I already quickly discussed this on the mailing list but I didn't
manage to make progress on the analysis as I was tied up with other things.
Figured I'd open a bug for tracking purposes and for the possibility of maybe
fixing it at a later stage.

I'm evaluating len_load/len_store support on s390 via the attached patch and
seeing a FAIL in

testsuite/gfortran.dg/power_3.f90

built with -march=z16 -O3 --param vect-partial-vector-usage=1

The problem seems to be that we evaluate a vector constant
{-1, 1, -1, 1} loaded with length 11 + 1(bias) = 12 as
{1, -1, 1} instead of {-1, 1, -1}.

Richard wrote on the mailing list:
> The error is probably in vn_reference_lookup_3 which assumes that
> 'len' applies to the vector elements in element order.  See the part
> of the code where it checks for internal_store_fn_p.  If 'len' is with
> respect to the memory and thus endianess has to be taken into
> account then for the IFN_LEN_STORE
> 
> else if (fn == IFN_LEN_STORE)
>   {
> pd.rhs_off = 0;
> pd.offset = offset2i;
> pd.size = (tree_to_uhwi (len)
>+ -tree_to_shwi (bias)) * BITS_PER_UNIT;
> if (ranges_known_overlap_p (offset, maxsize,
> pd.offset, pd.size))
>   return data->push_partial_def (pd, set, set,
>  offseti, maxsizei);
> 
> likely needs to adjust rhs_off from zero for big endian?

[Bug middle-end/107617] SCC-VN with len_store and big endian

2022-11-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617

rdapp at gcc dot gnu.org changed:

   What|Removed |Added

   Priority|P3  |P4

[Bug middle-end/107617] SCC-VN with len_store and big endian

2022-11-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617

--- Comment #1 from rdapp at gcc dot gnu.org ---
For completeness, the mailing list thread is here:

https://gcc.gnu.org/pipermail/gcc-patches/2022-September/602252.html

[Bug target/113827] New: MrBayes benchmark redundant load

2024-02-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

Bug ID: 113827
   Summary: MrBayes benchmark redundant load
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org,
pan2.li at intel dot com
Blocks: 79704
  Target Milestone: ---
Target: riscv

A hot block in the MrBayes benchmark (as used in the Phoronix testsuite) has a
redundant scalar load when vectorized.

Minimal example, compiled with -march=rv64gcv -O3

int foo (float **a, float f, int n)
{
  for (int i = 0; i < n; i++)
{
  a[i][0] /= f;
  a[i][1] /= f;
  a[i][2] /= f;
  a[i][3] /= f;
  a[i] += 4;
}
}

GCC:
.L3:
ld       a5,0(a0)
vle32.v  v1,0(a5)
vfmul.vv v1,v1,v2
vse32.v  v1,0(a5)
addi     a5,a5,16
sd       a5,0(a0)
addi     a0,a0,8
bne      a0,a4,.L3

The value of a5 doesn't change after the store to 0(a0).

LLVM:
.L3:
vle32.v   v8,(a1)
addi      a3,a1,16
sd        a3,0(a2)
vfdiv.vf  v8,v8,fa5
addi      a2,a2,8
vse32.v   v8,(a1)
bne       a2,a0,.L3


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704
[Bug 79704] [meta-bug] Phoronix Test Suite compiler performance issues

[Bug target/113827] MrBayes benchmark redundant load on riscv

2024-02-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

--- Comment #1 from Robin Dapp  ---
x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:

.L3:
movq    (%rdi), %rax
vmovups (%rax), %xmm1
vdivps  %xmm0, %xmm1, %xmm1
vmovups %xmm1, (%rax)
addq$16, %rax
movq    %rax, (%rdi)
addq$8, %rdi
cmpq%rdi, %rdx
jne .L3

So probably not target specific.  Costing?

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-02-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #4 from Robin Dapp  ---
Judging by the graph it looks like it was slow before, then got faster and now
slower again.  Is there some more info on why it got faster in the first place?
 Did the patch reverse something or is it rather a secondary effect?  I don't
have a zen4 handy to check.

[Bug target/114027] [14] RISC-V vector: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027

Robin Dapp  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org
   Last reconfirmed||2024-2-22
 Target|riscv   |x86_64-*-* riscv*-*-*
   ||aarch64-*-*

--- Comment #5 from Robin Dapp  ---
To me it looks like we treat e.g. c_53 = _43 ? prephitmp_13 : 0 as the only
reduction statement in the chain and simplify to MAX based on that wrong
assumption, when we actually have several.
(See "condition expression based on compile time constant").
--- Comment #6 from Robin Dapp  ---
Btw this fails on x86 and aarch64 for me with -fno-vect-cost-model.  So it
definitely looks generic.

[Bug target/114027] [14] RISC-V vector: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027

--- Comment #9 from Robin Dapp  ---
Argh,  I actually just did a gcc -O3 -march=native pr114027.c
-fno-vect-cost-model on cfarm188 with a recent-ish GCC but realized that I used
my slightly modified version and not the original test case.

long a;
int b[10][8] = {{},
{},
{},
{},
{},
{},
{0, 0, 0, 0, 0, 1, 1},
{1, 1, 1, 1, 1, 1, 1},
{1, 1, 1, 1, 1, 1, 1}};
int c;
int main() {
int d;
c = 0x;
for (; a < 6; a++) {
d = 0;
for (; d < 6; d++) {
c ^= -3L;
if (b[a + 3][d])
continue;
c = 0;
}
}

if (c == -3) {
return 0;
} else {
return 1;
}
}

This was from an initial attempt to minimize it further but I didn't really
verify if I'm breaking the test case by that (or causing undefined behavior).

With that I get a "1" with default options and "0" with -fno-tree-vectorize.
Maybe my snippet is broken then?

[Bug target/114028] [14] RISC-V rv64gcv_zvl256b: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114028

--- Comment #2 from Robin Dapp  ---
This is a target issue.  It looks like we try to construct a "superword"
sequence when the element size is already == Pmode.  Testing a patch.

[Bug middle-end/114109] New: x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

Bug ID: 114109
   Summary: x264 satd vectorization vs LLVM
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: enhancement
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
uint32_t s = ((a >> 15) & 0x10001) * 0x;
return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
uint32_t tmp[4][4];
uint32_t a0, a1, a2, a3;
int sum = 0;

for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
{
a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
{
  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;
  tmp[i][0] = t0 + t2;
  tmp[i][1] = t1 + t3;
  tmp[i][2] = t0 - t2;
  tmp[i][3] = t1 - t3;
};
}
for( int i = 0; i < 4; i++ )
{
{ int t0 = tmp[0][i] + tmp[1][i];
  int t1 = tmp[0][i] - tmp[1][i];
  int t2 = tmp[2][i] + tmp[3][i];
  int t3 = tmp[2][i] - tmp[3][i];
  a0 = t0 + t2;
  a2 = t0 - t2;
  a1 = t1 + t3;
  a3 = t1 - t3;
};
sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
}
return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv but x86 and aarch64 are pretty similar.  (See
https://godbolt.org/z/vzf5ha44r, which compares at -O3 -mavx512f.)

Vectorizing the first loop seems to be a costing issue.  By default we don't
vectorize and the code becomes much larger when disabling vector costing, so
the costing decision in itself seems correct.
Clang's version is significantly shorter and it looks like it just directly
vec_sets/vec_inits the individual elements.  On riscv it can be handled rather
elegantly with strided loads that we don't emit right now.
As there are only 4 active vector elements and the loop is likely load bound it
might be debatable whether LLVM's version is better?

The second loop we do vectorize (4 elements at a time) but end up with e.g.
four XORs for the four inlined abs2 calls while clang chooses a larger
vectorization factor and does all the xors in one.

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM)
but I guess the general case is still interesting?

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #2 from Robin Dapp  ---
It is vectorized with a higher zvl, e.g. zvl512b, refer
https://godbolt.org/z/vbfjYn5Kd.

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #4 from Robin Dapp  ---
Yes, as mentioned, vectorization of the first loop is debatable.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #6 from Robin Dapp  ---
Honestly, I don't know how to analyze/debug this without a zen4, in particular
as it only seems to happen with PGO.  I tried locally but of course the
execution time doesn't change (same as with zen3 according to the database).
Is there a way to obtain the binaries in order to tell a difference?

[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200

--- Comment #1 from Robin Dapp  ---
Took me a while to analyze this... needed more time than I'd like to admit to
make sense of the somewhat weird code created by fully unrolling and peeling.

I believe the problem is that we reload the output register of a vfmacc/fma via
vmv.v.v (subject to length masking) but we should be using vmv1r.v.  The result
is used by a reduction which always operates on the full length.  As annoying
as it was to find - it's definitely a good catch.

I'm testing a patch.  PR114202 is indeed a duplicate.  Going to add its test
case to the patch.

[Bug middle-end/114196] [13/14 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196

Robin Dapp  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=113163

--- Comment #2 from Robin Dapp  ---
To me this looks like it already came up in the context of early-break
vectorization (PR113163) but is not actually dependent on it.  I'm testing a
patch that disables epilogue peeling also without early break.

[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200

--- Comment #3 from Robin Dapp  ---
*** Bug 114202 has been marked as a duplicate of this bug. ***

[Bug target/114202] [14] RISC-V rv64gcv: miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114202

Robin Dapp  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Robin Dapp  ---
Same as PR114200.

*** This bug has been marked as a duplicate of bug 114200 ***

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #7 from Robin Dapp  ---
I built executables with and without the commit (-Ofast -march=znver4 -flto). 
There is no difference so it must really be something that happens with PGO.
I'd really need access to a zen4 box or the pgo executables at least.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #10 from Robin Dapp  ---
(In reply to Sam James from comment #9)
> (In reply to Filip Kastl from comment #8)
> > I'd like to help but I'm afraid I cannot send you the SPEC binaries with PGO
> > applied since SPEC is licensed nor can I give you access to a Zen4 computer.
> > I suppose someone else will have to analyze this bug.
> 
> Could you perhaps send only the gcda files so Robin can build again with
> -fprofile-use?

Yes, that would be helpful.

Or Filip builds the executables himself and posts (some of) the differences
here.  Maybe that also gets us a bit closer to the problem.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #16 from Robin Dapp  ---
Thank you!

I'm having a problem with the data, though.
Compiling with -Ofast -march=znver4 -mtune=znver4 -flto -fprofile-use=/tmp.
Would you mind showing your exact final options for compilation of e.g.
pbeampp.cc?

I see, similar-ish for both commits:
pbeampp.c:119:8: error: number of counters in profile data for function
'primal_bea_mpp' does not match its profile data (counter 'arcs', expected 20
and have 22) [-Werror=coverage-mismatch]

output.c:87:1: error: corrupted profile info: number of executions for edge 3-4
thought to be 1
output.c:87:1: error: corrupted profile info: number of executions for edge 3-5
thought to be -1
output.c:87:1: error: corrupted profile info: number of iterations for basic
block 5 thought to be -1

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #18 from Robin Dapp  ---
Hmm, doesn't help unfortunately.  A full command line for me looks like:

x86_64-pc-linux-gnu-gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG  -DWANT_STDC_PROTO
-Ofast -march=znver4 -mtune=znver4 -flto=32 -g -fprofile-use=/tmp
-DSPEC_CPU_LP64 pbeampp.c

Could you verify if it's exactly the same for you?  Maybe it would also help if
you explicitly specified znver4?

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #20 from Robin Dapp  ---
No change with -std=gnu99 unfortunately.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #22 from Robin Dapp  ---
Still the same problem unfortunately.

I'm a bit out of ideas - maybe your compiler executables could help?

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #24 from Robin Dapp  ---
I rebuilt GCC from scratch with your options but still have the same problem. 
Could our sources differ?  My SPEC version might not be the most recent but I'm
not aware that mcf changed at some point.

Just to be sure: I'm using r14-5075-gc05f748218a0d5 as the "before" commit.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #27 from Robin Dapp  ---
Can you try it with a simpler (non-SPEC) test?  Maybe there is still something
weird happening with SPEC's scripting.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #29 from Robin Dapp  ---
Yes, that also appears to work here.  There was no lto involved this time?
Now we need to figure out what's different with SPEC.

[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Robin Dapp  changed:

   What|Removed |Added

 Target|riscv*-*-*  |x86_64-*-* riscv*-*-*

--- Comment #2 from Robin Dapp  ---
At first glance it doesn't really look like a target issue.

Tried it on x86 and it fails as well with
-O3 -march=native pr114396.c -fno-vect-cost-model -fwrapv

short a = 0xF;
short b[16];

int main() {
for (int e = 0; e < 9; e += 1)
b[e] = a *= 0x5;

if (a != 2283)
__builtin_abort ();
}

[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #3 from Robin Dapp  ---
-O3 -mavx2 -fno-vect-cost-model -fwrapv seems to be sufficient.

[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #7 from Robin Dapp  ---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb49..f8f7ba0ccc1 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -9454,7 +9454,7 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree
init_expr,
wi::to_mpz (skipn, exp, UNSIGNED);
mpz_ui_pow_ui (mod, 2, TYPE_PRECISION (type));
mpz_powm (res, base, exp, mod);
-   begin = wi::from_mpz (type, res, TYPE_SIGN (type));
+   begin = wi::from_mpz (type, res, TYPE_SIGN (utype));
tree mult_expr = wide_int_to_tree (utype, begin);
init_expr = gimple_build (stmts, MULT_EXPR, utype,
  init_expr, mult_expr);

This helps for the test case.

[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #8 from Robin Dapp  ---
No fallout on x86 or aarch64.

Of course using false instead of TYPE_SIGN (utype) is also possible and maybe
clearer?

[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vector-cost-mode (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)

2024-03-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #5 from Robin Dapp  ---
So the result is -9 instead of 9 (or vice versa) and this happens (just) with
vectorization.  We only vectorize with -fwrapv.

From a first quick look, the following is what we have before vect:

(loop)
   [local count: 991171080]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;
(check)
 [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI 
  ...
  if (b_lsm.5_22 != -9)

I.e. b gets negated with every iteration and we check the second to last
against -9.

With vectorization we have:
(init)
   [local count: 82570744]:
  b_lsm.5_17 = b;

(vectorized loop)
   [local count: 247712231]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
_4 = -b_lsm.5_5;
  ...
  goto 

(epilogue)
   [local count: 82570741]:
  ...
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
  ...
  _25 = -b_lsm.5_7;

(check)
   [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI 
  if (b_lsm.5_22 != -9)

What looks odd here is that b_lsm.5_7's fallthrough argument is b_lsm.5_17 even
though we must have come through the vectorized loop (which negated b at least
once).  This makes us skip inversions.
Indeed, as b_lsm.5_22 is only dependent on the initial value of b it gets
optimized away and we compare b != -9.

Maybe I missed something but it looks like
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
should have b_lsm.5_5 or _4 as fallthrough argument.

[Bug tree-optimization/114485] [13/14 Regression] Wrong code with -O3 -march=rv64gcv on riscv or `-O3 -march=armv9-a` for aarch64

2024-03-27 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114485

--- Comment #4 from Robin Dapp  ---
Yes, the vectorization looks ok.  The extracted live values are not used
afterwards and therefore the whole vectorized loop is being thrown away.
Then we do one iteration of the epilogue loop, inverting the original c and end
up with -8 instead of 8.  This is pretty similar to what's happening in the
related PR.

We properly populate the phi in question in slpeel_update_phi_nodes_for_guard1:

c_lsm.7_64 = PHI <_56(23), pretmp_34(17)>

but vect_update_ivs_after_vectorizer changes that into

c_lsm.7_64 = PHI .

Just as a test, commenting out

  if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
  update_e);

at least makes us keep the VEC_EXTRACT and not fail anymore.

[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-04-02 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

Robin Dapp  changed:

   What|Removed |Added

 CC||ewlu at rivosinc dot com,
   ||rdapp at gcc dot gnu.org

--- Comment #7 from Robin Dapp  ---
There is some riscv fallout as well.  Edwin has the details.

[Bug rtl-optimization/108412] RISC-V: Negative optimization of GCSE && LOOP INVARIANTS

2023-08-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108412

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #3 from Robin Dapp  ---
I played around a bit with the scheduling model and the pressure-aware
scheduling.  -fsched-pressure alone does not seem to help because the problem
is indeed the latency of vector load and store.  The scheduler will try to keep
dependent loads and stores apart (for the number of cycles specified), and
then, after realizing there is nothing to put in between, lump everything
together at the end of the sequence.  That's a well-known but unfortunate
property of scheduling.

Will need to think of something but not resolved for now.

[Bug tree-optimization/111136] New: ICE in RISC-V test case since r14-3441-ga1558e9ad85693

2023-08-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111136

Bug ID: 111136
   Summary: ICE in RISC-V test case since r14-3441-ga1558e9ad85693
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
Target: riscv

The following RISC-V test case ICEs since r14-3441-ga1558e9ad85693
(mask_gather_load-11.c)

#define uint8_t unsigned char

void
foo (uint8_t *restrict y, uint8_t *restrict x,
 uint8_t *index,
 uint8_t *cond)
{
  for (int i = 0; i < 100; ++i)
{
  if (cond[i * 2])
y[i * 2] = x[index[i * 2]] + 1;
  if (cond[i * 2 + 1])
y[i * 2 + 1] = x[index[i * 2 + 1]] + 2;
}
}

I compiled with
build/gcc/cc1 -march=rv64gcv -mabi=lp64 -O3
--param=riscv-autovec-preference=scalable mask_gather_load-11.c

mask_gather_load-11.c: In function 'foo':
mask_gather_load-11.c:9:1: internal compiler error: in
get_group_load_store_type, at tree-vect-stmts.cc:2121
9 | foo (uint8_t *restrict y, uint8_t *restrict x,
  | ^~~
0x9e2fad get_group_load_store_type
../../gcc/tree-vect-stmts.cc:2121
0x9e2fad get_load_store_type
../../gcc/tree-vect-stmts.cc:2451
0x1ff7221 vectorizable_store
../../gcc/tree-vect-stmts.cc:8309
[...]

[Bug target/108271] Missed RVV cost model

2023-08-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108271

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #3 from Robin Dapp  ---
This is basically the same problem as PR108412.  As long as loads/stores have a
high(ish) latency and we mostly do load/store, they will tend to lump together
at the end of the function.  Setting vector load/store to a latency of <= 2
helps here and we might want to do this in order to avoid excessive spilling.
I had to deal with this before, e.g. in SPEC2006's calculix.
In the end insn scheduling wouldn't buy us anything and rather caused more
spilling, degrading performance.

[Bug tree-optimization/111136] ICE in RISC-V test case since r14-3441-ga1558e9ad85693

2023-08-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111136

--- Comment #4 from Robin Dapp  ---
All gather-scatter tests pass for me again (the given example in particular)
after applying this.

[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction

2023-08-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #1 from Robin Dapp  ---
We seem to decide that a slightly more expensive loop (one instruction more)
without an epilogue is better than a loop with an epilogue.  This looks
intentional in the vectorizer cost estimation and is not specific to our lack
of a costing model.  Hmm..

The main loops are (VLA):
.L3:
vsetvli a5,a1,e32,m1,tu,ma
slli    a4,a5,2
sub a1,a1,a5
vle32.v v2,0(a0)
add a0,a0,a4
vadd.vv v1,v2,v1
bne a1,zero,.L3

vs (VLS):
.L4:
vle32.v v1,0(a5)
vle32.v v2,0(sp)
addi    a5,a5,16
vadd.vv v1,v2,v1
vse32.v v1,0(sp)
bne a4,a5,.L4

This is doubly weird because of the spill of the accumulator.  We shouldn't be
generating this sequence but even if so, it should be more expensive.  This can
be achieved e.g. by the following example vectorizer cost function:

static int
riscv_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 tree vectype,
 int misalign ATTRIBUTE_UNUSED)
{
  unsigned elements;

  switch (type_of_cost)
{
  case scalar_stmt:
  case scalar_load:
  case scalar_store:
  case vector_stmt:
  case vector_gather_load:
  case vector_scatter_store:
  case vec_to_scalar:
  case scalar_to_vec:
  case cond_branch_not_taken:
  case vec_perm:
  case vec_promote_demote:
  case unaligned_load:
  case unaligned_store:
return 1;

  case vector_load:
  case vector_store:
return 3;

  case cond_branch_taken:
return 3;

  case vec_construct:
elements = estimated_poly_value (TYPE_VECTOR_SUBPARTS (vectype));
return elements / 2 + 1;

  default:
gcc_unreachable ();
}
}

For a proper loop like
vle32.v v2,0(sp)
.L4:
vle32.v v1,0(a5)
addi    a5,a5,16
vadd.vv v1,v2,v1
bne a4,a5,.L4
vse32.v v1,0(sp)
I'm not so sure anymore.  For large n this could be preferable depending on the
vectorization factor and other things.

[Bug target/110559] Bad mask_load/mask_store codegen of RVV

2023-08-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110559

--- Comment #3 from Robin Dapp  ---
I got back to this again today, now that pressure-aware scheduling is the
default.  As mentioned before, it helps but doesn't get rid of the spills.

Testing with the "generic ooo" scheduling model it looks like vector load/store
latency of 6 is too high.  Yet, even setting them to 1 is not enough to get rid
of spills entirely.  What helps is additionally lowering the vector alu latency
to 2 (from the default 3).

I'm not really sure how to properly handle this.  As far as I can tell spilling
is always going to happen if we try to "wait" for dependencies and delay the
dependent instructions.  In my experience the hardware does a better job at
live scheduling anyway and we only make things worse in several cases. 
Previously I experimented with setting the latency of most instructions to 1
with few exceptions and instead ensure a proper instruction mix i.e. trying to
keep every execution unit busy.  That's not a panacea either, though.

[Bug target/111311] New: RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable

2023-09-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311

Bug ID: 111311
   Summary: RISC-V regression testsuite errors with
--param=riscv-autovec-preference=scalable
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: jeremy.bennett at embecosm dot com, juzhe.zhong at rivai dot ai,
kito.cheng at gmail dot com, law at gcc dot gnu.org,
palmer at dabbelt dot com, vineetg at rivosinc dot com
  Target Milestone: ---

As discussed in yesterday's meeting this is the PR for all current FAILs in
GCC's regression test suite when running it with default vector support.
I used
--target-board=unix/-march=rv64gcv/--param=riscv-autovec-preference=scalable.

Below is the list of FAILs/... I got, hope the message doesn't get too large.

FAIL: gcc.c-torture/execute/pr53645-2.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects  (test for excess errors)
FAIL: gcc.c-torture/execute/pr53645.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects  (test for excess errors)
FAIL: gcc.c-torture/unsorted/dump-noaddr.c.*r.vsetvl,  -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  comparison
FAIL: gcc.dg/analyzer/pr105252.c (test for excess errors)
FAIL: gcc.dg/analyzer/pr96713.c (internal compiler error: in
emit_move_multi_word, at expr.cc:4079)
FAIL: gcc.dg/analyzer/pr96713.c (test for excess errors)
FAIL: c-c++-common/opaque-vector.c  -Wc++-compat  (internal compiler error: in
emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/opaque-vector.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/pr105998.c  -Wc++-compat  (internal compiler error: in
emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/pr105998.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/scal-to-vec2.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/spec-barrier-1.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/vector-compare-1.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/vector-compare-2.c  -Wc++-compat  (test for excess errors)
FAIL: c-c++-common/vector-scalar.c  -Wc++-compat  (internal compiler error: in
emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/vector-scalar.c  -Wc++-compat  (test for excess errors)
FAIL: gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c (test for excess errors)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 (test for warnings, line 72)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 (test for warnings, line 77)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 note (test for warnings, line 68)
FAIL: gcc.dg/Wstringop-overflow-70.c  (test for warnings, line 22)
XPASS: gcc.dg/attr-alloc_size-11.c missing range info for short (test for
warnings, line 51)
XPASS: gcc.dg/attr-alloc_size-11.c missing range info for signed char (test for
warnings, line 50)
FAIL: gcc.dg/pr100239.c (internal compiler error: in emit_move_multi_word, at
expr.cc:4079)
FAIL: gcc.dg/pr100239.c (test for excess errors)
FAIL: gcc.dg/pr100292.c (test for excess errors)
FAIL: gcc.dg/pr104992.c scan-tree-dump-times optimized " % " 9
FAIL: gcc.dg/pr105049.c (test for excess errors)
FAIL: gcc.dg/pr108805.c (test for excess errors)
FAIL: gcc.dg/pr34856.c (test for excess errors)
FAIL: gcc.dg/pr35442.c (test for excess errors)
FAIL: gcc.dg/pr42685.c (test for excess errors)
FAIL: gcc.dg/pr45105.c (test for excess errors)
FAIL: gcc.dg/pr53060.c (test for excess errors)
FAIL: gcc.dg/pr63914.c (test for excess errors)
FAIL: gcc.dg/pr70252.c (internal compiler error: in
gimple_expand_vec_cond_expr, at gimple-isel.cc:283)
FAIL: gcc.dg/pr70252.c (test for excess errors)
FAIL: gcc.dg/pr85430.c (test for excess errors)
FAIL: gcc.dg/pr85467.c (test for excess errors)
FAIL: gcc.dg/pr91441.c  at line 11 (test for warnings, line )
FAIL: gcc.dg/pr92301.c execution test
FAIL: gcc.dg/pr96453.c (test for excess errors)
FAIL: gcc.dg/pr96466.c (test for excess errors)
FAIL: gcc.dg/pr97238.c (internal compiler error: in emit_move_multi_word, at
expr.cc:4079)
FAIL: gcc.dg/pr97238.c (test for excess errors)
FAIL: gcc.dg/signbit-2.c scan-tree-dump-not optimized "s+>>s+31"
FAIL: gcc.dg/signbit-5.c execution test
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "Not unrolling loop, doesn't
roll"
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "likely upper bound: 6"
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "realistic bound: -1"
FAIL: gcc.dg/var-expand1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
FAIL: gcc.dg/vshift-6.c (test for excess errors)
FAIL: gcc.dg/vshift-7.c (test for excess errors)
FAIL: gcc.dg/ipa/ipa-sra-19.c (test for excess errors)
FAIL: gcc.dg/lto/pr83719 c_lto_pr83719_0.o assemble,  -flto -g -gsplit-dwarf 
FAIL: gcc.dg/pch/save-temps-1.c   -O0  -I. -Dwith_PCH (te

[Bug c/111337] ICE in gimple-isel.cc for RISC-V port

2023-09-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #1 from Robin Dapp  ---
This is gcc.dg/pr70252.c BTW.

What happens is that, starting with
  maskdest = (vec_cond mask1 1 0) >= (vec_cond mask2 1 0)
we fold to
  maskdest = mask1 >= (vec_cond (mask2 1 0))
and then sink the ">=" into the vec_cond so we end up with
  maskdest = vec_cond (mask2 ? mask1 : 0),
i.e. a vec_cond with a mask "data mode".

In gimple-isel, when the target does not provide a vcond_mask
implementation for that (which none does) we assert that the mask mode
be MODE_VECTOR_INT.

IMHO this should not happen and we should not sink comparisons (that could be
folded to masks) into vec_cond.

I'm preparing a patch that prevents the sinking of comparisons for mask types.

[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port

2023-09-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337

--- Comment #8 from Robin Dapp  ---
Yes, I doubt we would get much below 4 instructions with riscv specifics.

A quick grep yesterday didn't reveal any aarch64 or gcn patterns for those (as
long as they are not hidden behind some pattern replacement).  But aarch64
doesn't encounter this situation anyway as we fold differently before.

[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port

2023-09-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337

--- Comment #10 from Robin Dapp  ---
I would be OK with the riscv implementation, then we don't need to touch isel. 
Maybe a future vector extension will also help us here so we could just switch
the implementation then.

[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port

2023-09-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337

--- Comment #12 from Robin Dapp  ---
Yes, as far as I know.  I would also go ahead and merge the test suite patch
now as there is already a v2 fix posted.  Even if it's not the correct one it
will be done soon so we should not let that block enabling the test suite.

[Bug target/111317] RISC-V: Incorrect COST model for RVV conversions

2023-09-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317

--- Comment #1 from Robin Dapp  ---
I think the default cost model is not too bad for these simple cases.  Our
emitted instructions match gimple pretty well.

The thing we don't model is vsetvl.  We could ignore it under the assumption
that it is going to be rather cheap on most uarchs.

Something that needs to be fixed is the general costing used for
length-masking:

/* Each may need two MINs and one MINUS to update lengths in body
   for next iteration.  */
if (need_iterate_p)
  body_stmts += 3 * num_vectors;

We don't actually need min with vsetvl (they are our mins) so this would need
to be adjusted down, provided vsetvl is cheap.  

This is the scalar baseline:
.L3:
lw  a5,0(a0)
sd  a5,0(a1)
addi    a0,a0,4
addi    a1,a1,8
bne a4,a0,.L3


While this is what zvl128b would emit:
 .L3:
vsetvli a5,a2,e8,mf8,ta,ma
vle32.v v2,0(a0)
vsetvli a4,zero,e64,m1,ta,ma
vsext.vf2   v1,v2
vsetvli zero,a2,e64,m1,ta,ma
vse64.v v1,0(a1)
slli    a4,a5,2
add a0,a0,a4
slli    a4,a5,3
add a1,a1,a4
sub a2,a2,a5
bne a2,zero,.L3

With a vectorization factor of 2 (might effectively be higher of course but
possibly unknown at compile time) I'm not sure vectorization is always a win
and the costs actually reflect that.  If we disregard vsetvl for now we have 8
instructions in the vectorized loop and 2 * 4 instructions in the scalar loop
for the same amount of data.  Factoring in the vsetvls I'd say it's worse.
Once we statically know the VF is higher, we will vectorize.
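The trade-off above can be written down as a rough per-element instruction count (assuming one cycle per instruction and disregarding vsetvl, as in the text; the numbers are taken from the loops shown):

```python
# Scalar loop: lw/sd/addi/addi per element (branch disregarded).
scalar_per_elem = 4
# Vectorized zvl128b loop: 8 instructions per iteration, vsetvls excluded.
vector_per_iter = 8

def vector_per_elem(vf):
    """Instructions per element for a given vectorization factor."""
    return vector_per_iter / vf

# With VF == 2 the vector loop only breaks even -- and loses once the
# three vsetvls are factored in.
assert vector_per_elem(2) == scalar_per_elem
# With a (statically known) higher VF, vectorization wins.
assert vector_per_elem(4) < scalar_per_elem
print("VF 2:", vector_per_elem(2), "VF 4:", vector_per_elem(4))
```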

[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction

2023-09-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #2 from Robin Dapp  ---
With the current trunk we don't spill anymore:

(VLS)
.L4:
vle32.v v2,0(a5)
vadd.vv v1,v1,v2
addi    a5,a5,16
bne a5,a4,.L4

Considering just that loop I'd say costing works as designed.  Even though the
epilog and boilerplate code seems "crude" the main loop is as short as it can
be and is IMHO preferable.

.L3:
vsetvli a5,a1,e32,m1,tu,ma
slli    a4,a5,2
sub a1,a1,a5
vle32.v v2,0(a0)
add a0,a0,a4
vadd.vv v1,v2,v1
bne a1,zero,.L3

This has 6 instructions (disregarding the jump) and can't be faster than the 3
instructions for the VLS loop.  Provided we iterate often enough the VLS loop
should always be a win.

Regarding "looking slow" - I think ideally we would have the VLS loop followed
directly by the VLA loop for the residual iterations and next to no additional
statements.  That would require changes in the vectorizer, though.

In total: I think the current behavior is reasonable.

[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction

2023-09-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153

--- Comment #4 from Robin Dapp  ---
Yes, with VLS reduction this will improve.

On aarch64 + sve I see
loop inside costs: 2
This is similar to our VLS costs.

And their loop is indeed short:

ld1w    z30.s, p7/z, [x0, x2, lsl 2]
add x2, x2, x3
add z31.s, p7/m, z31.s, z30.s
whilelo p7.s, w2, w1
b.any   .L3

Not much to be squeezed out with a VLS approach.  I guess that's why.

[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS

2023-09-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #2 from Robin Dapp  ---
I played around with this a bit.  Emitting a COND_LEN in if-convert is easy:

_ifc__35 = .COND_ADD (_23, init_20, _8, init_20);

However, during reduction handling we rely on the reduction being a gimple
assign and a binary operation, so I needed to fix some places and indices as
well as use the proper mask.

What complicates things a bit is that we assume that "init_20" (i.e. the
reduction def) occurs once when we have it twice in the COND_ADD.  I just
special cased that for now.  Is this the proper thing to do?

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 23c6e8259e7..e99add3cf16 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3672,7 +3672,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared
*shared)
 static bool
 fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn)
 {
-  if (code == PLUS_EXPR)
+  if (code == PLUS_EXPR || code == IFN_COND_ADD)
 {
   *reduc_fn = IFN_FOLD_LEFT_PLUS;
   return true;
@@ -4106,8 +4106,11 @@ vect_is_simple_reduction (loop_vec_info loop_info,
stmt_vec_info phi_info,
   return NULL;
 }

-  nphi_def_loop_uses++;
-  phi_use_stmt = use_stmt;
+  if (use_stmt != phi_use_stmt)
+   {
+ nphi_def_loop_uses++;
+ phi_use_stmt = use_stmt;
+   }

@@ -7440,6 +7457,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (i == STMT_VINFO_REDUC_IDX (stmt_info))
continue;

+  if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
+   continue;
+

Apart from that I think what's mainly missing is making the added code nicer. 
Going to attach a tentative patch later.

[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS

2023-09-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401

--- Comment #3 from Robin Dapp  ---
Several other things came up, so I'm just going to post the latest status here
without having revised or tested it.  Going to try fixing it and testing
tomorrow.

--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3672,7 +3672,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared
*shared)
 static bool
 fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn)
 {
-  if (code == PLUS_EXPR)
+  if (code == PLUS_EXPR || code == IFN_COND_ADD)
 {
   *reduc_fn = IFN_FOLD_LEFT_PLUS;
   return true;
@@ -4106,8 +4106,13 @@ vect_is_simple_reduction (loop_vec_info loop_info,
stmt_vec_info phi_info,
   return NULL;
 }

-  nphi_def_loop_uses++;
-  phi_use_stmt = use_stmt;
+  /* We might have two uses in the same instruction, only count them as
+one. */
+  if (use_stmt != phi_use_stmt)
+   {
+ nphi_def_loop_uses++;
+ phi_use_stmt = use_stmt;
+   }
 }

   tree latch_def = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
@@ -6861,7 +6866,7 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo,
   gimple **vec_stmt, slp_tree slp_node,
   gimple *reduc_def_stmt,
   tree_code code, internal_fn reduc_fn,
-  tree ops[3], tree vectype_in,
+  tree *ops, int num_ops, tree vectype_in,
   int reduc_index, vec_loop_masks *masks,
   vec_loop_lens *lens)
 {
@@ -6883,11 +6888,24 @@ vectorize_fold_left_reduction (loop_vec_info
loop_vinfo,
 gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
  TYPE_VECTOR_SUBPARTS (vectype_in)));

-  tree op0 = ops[1 - reduc_index];
+  /* The operands either come from a binary operation or a COND_ADD operation.
+ The former is a gimple assign and the latter is a gimple call with four
+ arguments.  */
+  gcc_assert (num_ops == 2 || num_ops == 4);
+  bool is_cond_add = num_ops == 4;
+  tree op0, opmask;
+  if (!is_cond_add)
+op0 = ops[1 - reduc_index];
+  else
+{
+  op0 = ops[2];
+  opmask = ops[0];
+  gcc_assert (!slp_node);
+}
   int group_size = 1;
   stmt_vec_info scalar_dest_def_info;
-  auto_vec<tree> vec_oprnds0;
+  auto_vec<tree> vec_oprnds0, vec_opmask;
   if (slp_node)
 {
auto_vec<vec<tree> > vec_defs (2);
@@ -6903,9 +6921,18 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo,
   vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1,
 op0, &vec_oprnds0);
   scalar_dest_def_info = stmt_info;
+  if (is_cond_add)
+   {
+ vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1,
+opmask, &vec_opmask);
+ gcc_assert (vec_opmask.length() == 1);
+   }
 }

-  tree scalar_dest = gimple_assign_lhs (scalar_dest_def_info->stmt);
+  gimple *sdef = scalar_dest_def_info->stmt;
+  tree scalar_dest = is_gimple_call (sdef)
+  ? gimple_call_lhs (sdef)
+  : gimple_assign_lhs (scalar_dest_def_info->stmt);
   tree scalar_type = TREE_TYPE (scalar_dest);
   tree reduc_var = gimple_phi_result (reduc_def_stmt);

@@ -6945,7 +6972,11 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo,
   i, 1);
  signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS
(loop_vinfo);
  bias = build_int_cst (intQI_type_node, biasval);
- mask = build_minus_one_cst (truth_type_for (vectype_in));
+ /* If we have a COND_ADD take its mask.  Otherwise use {-1, ...}.  */
+ if (is_cond_add)
+   mask = vec_opmask[0];
+ else
+   mask = build_minus_one_cst (truth_type_for (vectype_in));
}

   /* Handle MINUS by adding the negative.  */
@@ -7440,6 +7471,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (i == STMT_VINFO_REDUC_IDX (stmt_info))
continue;

+  if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
+   continue;
+
   /* There should be only one cycle def in the stmt, the one
  leading to reduc_def.  */
   if (VECTORIZABLE_CYCLE_DEF (dt))
@@ -8211,8 +8245,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   vec_num = 1;
 }

-  code_helper code = canonicalize_code (op.code, op.type);
-  internal_fn cond_fn = get_conditional_internal_fn (code, op.type);
+  code_helper code (op.code);
+  internal_fn cond_fn;
+
+  if (code.is_internal_fn ())
+{
+  internal_fn ifn = internal_fn (op.code);
+  code = canonicalize_code (conditional_internal_fn_code (ifn), op.type);
+  cond_fn = ifn;
+}
+  else
+{
+  code = canonicalize_code (op.code, op.type);
+  cond_fn = get_conditional_internal_fn (code, op.type);
+}
+
   vec_loop_masks *masks = &LOOP_

[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS

2023-09-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401

--- Comment #6 from Robin Dapp  ---
Created attachment 55902
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55902&action=edit
Tentative

You're referring to the case where we have init = -0.0, the condition is false
and we end up wrongly doing -0.0 + 0.0 = 0.0?
I suppose -0.0 is the proper neutral element for PLUS (and WIDEN_SUM?) when
honoring signed zeros?  And 0.0 for MINUS?  Doesn't that also depend on the
rounding mode?

neutral_op_for_reduction could return a -0 for PLUS if we honor it for that
type.  Or is that too intrusive?

Guess I should add a test case for that as well.

Another thing is that swapping operands is not as easy with COND_ADD because
the addition would be in the else.  I'd punt for that case for now.

Next problem - might be a mistake on my side.  For avx512 we create a COND_ADD
but the respective MASK_FOLD_LEFT_PLUS is not available, causing us to create
numerous vec_extracts as fallback that increase the cost until we don't
vectorize anymore.

Therefore I added a
vectorized_internal_fn_supported_p (IFN_FOLD_LEFT_PLUS, TREE_TYPE (lhs)).
SLP paths and ncopies != 1 are excluded as well.  Not really happy with how the
patch looks now but at least the testsuites on aarch and x86 pass.

[Bug target/111488] New: ICE on riscv gcc.dg/vect/vect-126.c

2023-09-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488

Bug ID: 111488
   Summary: ICE on riscv gcc.dg/vect/vect-126.c
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
Target: riscv

I see an ICE in vect-126.c.  A small reproducer is:

int *a[1024], b[1024];

void
f1 (void)
{
  for (int i = 0; i < 1024; i++)
{
  int *p = &b[0];
  a[i] = p + i;
}
}

vect-126.c:18:1: internal compiler error: Segmentation fault
   18 | }
  | ^
0x111e61f crash_signal
../../gcc/toplev.cc:314
0xcfc91d mark_label_nuses
../../gcc/emit-rtl.cc:3755
0xcfc969 mark_label_nuses
../../gcc/emit-rtl.cc:3763
0xcfc969 mark_label_nuses
../../gcc/emit-rtl.cc:3763
0xcfc969 mark_label_nuses
../../gcc/emit-rtl.cc:3763

This happens after the splitter
(define_insn_and_split "*single_widen_fma".

At first glance it seems as if the insn sequence is corrupt as we're looking
into a  value but I haven't checked further.  This is likely the same
error that prevents several SPECfp test cases from building.  Can investigate further
tomorrow.

[Bug target/111488] ICE on riscv gcc.dg/vect/vect-126.c

2023-09-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488

Robin Dapp  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #1 from Robin Dapp  ---
Also happens in the rvv.exp testsuite now, e.g. gather_load_run-11.c.

[Bug target/111428] RISC-V vector: Flaky segfault in {min|max}val_char_{1|2}.f90

2023-09-21 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428

--- Comment #2 from Robin Dapp  ---
Reproduced locally.  The identical binary sometimes works and sometimes doesn't
so it must be a race...

[Bug target/111506] RISC-V: Failed to vectorize conversion from INT64 -> _Float16

2023-10-02 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506

Robin Dapp  changed:

   What|Removed |Added

 CC||joseph at codesourcery dot com

--- Comment #3 from Robin Dapp  ---
I just got back.

The problem with this is not -fno-trapping-math - it will vectorize just fine
with -ftrapping-math (and the vectorizer doesn't depend on it either).  We also
already have tests for this in rvv/autovec/conversions.

However, not all int64 values can be represented in the intermediate type int32
which is why we don't vectorize unless the range of a[i] is know to be inside
int32's range.  If I'm reading the C standard correctly it says such cases are
implementation-defined behavior and I'm not sure we should work around the
vectorizer by defining an expander that essentially hides the intermediate
type.
If that's an OK thing to do then I won't complain, though.

CC'ing jmyers and rsandi because they would know best.  

From what I can tell aarch64 also does not vectorize this and I wonder why
LLVM's behavior is dependent on -fno-trapping-math.

We have the same issue with the reversed conversion from _Float16 to int64.  In
that case trapping math is relevant, but we could apply the same logic
as in this patch and circumvent it by an expander.  To me this doesn't seem
right.

[Bug target/111600] [14 Regression] RISC-V bootstrap time regression

2023-10-02 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

Robin Dapp  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org

--- Comment #12 from Robin Dapp  ---
We're really at a point where just building becomes a burden and turnaround
times are annoyingly high.  My suspicion is that the large number of modes
combined with the number of insn patterns slows us down.  Juzhe added a lot of
VLS patterns (or rather added VLS modes to existing patterns) around the
Cauldron and this is where we saw the largest relative slowdown.

Maybe we need to bite the bullet and not use the convenience helpers anymore or
at least very sparingly?  I'm going to make some experiments on Wednesday to
see where that gets us.

[Bug target/111506] RISC-V: Failed to vectorize conversion from INT64 -> _Float16

2023-10-02 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506

--- Comment #5 from Robin Dapp  ---
Ah, thanks Joseph, so this at least means that we do not need
!flag_trapping_math here.

However, the vectorizer emulates the 64-bit integer to _Float16 conversion via
an intermediate int32_t and now the riscv expander does the same just without
the same restrictions.

I'm assuming the restrictions currently imposed on two-step vectorizing
conversions are correct.  For e.g. int64_t -> _Float16 we require the value
range of the source to fit in int32_t (first step int64_t -> int32_t).  For
_Float16 -> int64_t we require -fno-trapping-math (first step _Float16 ->
int32_t).  The latter follows from Annex F of the C standard.

Therefore, my general question would rather be:
- Is it OK to circumvent either restriction by pretending to have an
instruction that performs the conversion in two steps but doesn't actually do
the checks?  I.e. does "implementation-defined behavior" cover the vectorizer
checking one thing and one target not doing it?

In our case the int64_t -> int32_t conversion is implementation defined when
the source doesn't fit the target.

[Bug target/111600] [14 Regression] RISC-V bootstrap time regression

2023-10-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

--- Comment #16 from Robin Dapp  ---
Confirming that it's the compilation of insn-emit.cc which takes > 10 minutes. 
The rest (including auto generating of files) is reasonably fast.  Going to do
some experiments with it and see which pass takes the most time.

[Bug target/111600] [14 Regression] RISC-V bootstrap time regression

2023-10-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

--- Comment #18 from Robin Dapp  ---
Just finished an initial timing run, sorted, first 10:

Time variable                            usr           sys          wall          GGC
 phase opt and generate        : 567.60 ( 97%)  38.23 ( 87%) 608.13 ( 97%) 22060M ( 90%)
 callgraph functions expansion : 491.16 ( 84%)  31.48 ( 72%) 524.60 ( 83%) 18537M ( 75%)
 integration                   :  90.09 ( 15%)  11.68 ( 27%) 103.25 ( 16%) 13408M ( 54%)
 tree CFG cleanup              :  74.43 ( 13%)   1.02 (  2%)  74.66 ( 12%)   201M (  1%)
 callgraph ipa passes          :  70.16 ( 12%)   6.21 ( 14%)  76.66 ( 12%)  2921M ( 12%)
 tree STMT verifier            :  64.03 ( 11%)   3.52 (  8%)  67.61 ( 11%)      0  (  0%)
 tree CCP                      :  44.78 (  8%)   2.91 (  7%)  47.65 (  8%)   314M (  1%)
 integrated RA                 :  42.82 (  7%)   0.86 (  2%)  42.71 (  7%)   880M (  4%)
 `- tree CFG cleanup           :  30.57 (  5%)   0.38 (  1%)  32.03 (  5%)   198M (  1%)
 `- tree CCP                   :  29.78 (  5%)   0.05 (  0%)  29.87 (  5%)   168M (  1%)
 tree SSA verifier             :  28.07 (  5%)   1.42 (  3%)  30.91 (  5%)      0  (  0%)

Per-function sorted expansion time (first 10):
insn_code maybe_code_for_pred_indexed_store(int, machine_mode, machine_mode); 3.05
insn_code maybe_code_for_pred_indexed_load(int, machine_mode, machine_mode); 2.68
insn_code maybe_code_for_pred(int, machine_mode); 1.49
rtx_insn* gen_split_4213(rtx_insn*, rtx_def**); 1.33
insn_code maybe_code_for_pred_scalar(rtx_code, machine_mode); 1.18
rtx_insn* gen_split_1266(rtx_insn*, rtx_def**); 0.70
insn_code maybe_code_for_pred_slide(int, machine_mode); 0.51
insn_code maybe_code_for_pred_scalar(int, machine_mode); 0.34
insn_code maybe_code_for_pred_dual_widen(rtx_code, rtx_code, machine_mode); 0.30
insn_code maybe_code_for_pred_dual_widen_scalar(rtx_code, rtx_code, machine_mode); 0.29

Expanding all splitter functions (~8000) takes 214s, so roughly 40% of the
expansion time.
We wouldn't get rid of this even when not using the insn helpers.

[Bug target/111600] [14 Regression] RISC-V bootstrap time regression

2023-10-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

--- Comment #20 from Robin Dapp  ---
Mhm, why is your profile so different from mine?  I'm also on an x86_64 host
with a 13.2.1 host compiler (Fedora).
Is it because of the preprocessed source?  Or am I just reading the timing
report wrong?

[Bug target/111600] [14 Regression] RISC-V bootstrap time regression

2023-10-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

--- Comment #22 from Robin Dapp  ---
Ah, then it's not that different, your machine is just faster ;)

 callgraph ipa passes :  69.77 ( 11%)   5.97 ( 13%)  76.05 ( 12%)  2409M ( 10%)
 integration          :  91.95 ( 15%)  12.52 ( 27%) 105.93 ( 16%) 13408M ( 56%)
 tree CFG cleanup     :  76.98 ( 13%)   1.09 (  2%)  78.01 ( 12%)   201M (  1%)
 tree STMT verifier   :  66.62 ( 11%)   3.75 (  8%)  68.31 ( 10%)      0  (  0%)
 integrated RA        :  47.04 (  8%)   1.00 (  2%)  47.79 (  7%)   879M (  4%)
 tree CCP             :  44.31 (  7%)   3.00 (  6%)  48.39 (  7%)   314M (  1%)
 tree SSA verifier    :  31.40 (  5%)   1.60 (  3%)  32.25 (  5%)      0  (  0%)
 CFG verifier         :  14.93 (  2%)   0.74 (  2%)  16.53 (  3%)      0  (  0%)
 callgraph verifier   :  14.26 (  2%)   1.07 (  2%)  15.55 (  2%)      0  (  0%)
 tree operand scan    :  12.58 (  2%)   3.73 (  8%)  15.14 (  2%)  1649M (  7%)
 verify RTL sharing   :  11.70 (  2%)   0.89 (  2%)  13.31 (  2%)      0  (  0%)
 TOTAL                : 609.73         46.53        659.45        24127M

FWIW we are much faster with -fno-inline (somewhat expected but I didn't expect
a factor of 3):

 callgraph ipa passes :  53.47 ( 27%)   5.84 ( 26%)  59.52 ( 26%)  2231M ( 26%)
 tree STMT verifier   :  19.67 ( 10%)   1.95 (  9%)  21.47 ( 10%)      0  (  0%)
 tree SSA verifier    :  11.80 (  6%)   1.20 (  5%)  13.32 (  6%)      0  (  0%)
 integrated RA        :   8.73 (  4%)   0.72 (  3%)   9.83 (  4%)   898M ( 10%)
 verify RTL sharing   :   7.90 (  4%)   0.69 (  3%)   8.49 (  4%)      0  (  0%)
 scheduling 2         :   7.32 (  4%)   0.31 (  1%)   7.90 (  4%)    43M (  1%)
 tree PTA             :   6.68 (  3%)   0.69 (  3%)   7.51 (  3%)    71M (  1%)
 CFG verifier         :   6.67 (  3%)   0.81 (  4%)   7.29 (  3%)      0  (  0%)
 rest of compilation  :   6.42 (  3%)   0.93 (  4%)   6.88 (  3%)    89M (  1%)
 parser function body :   6.35 (  3%)   2.13 (  9%)   8.40 (  4%)   903M ( 11%)
 TOTAL                : 201.12         22.90        225.17         8575M

[Bug target/111428] RISC-V vector: Flaky segfault in {min|max}val_char_{1|2}.f90

2023-10-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428

--- Comment #3 from Robin Dapp  ---
Still difficult to track down.  The following is a smaller reproducer:

program main
  implicit none
  integer, parameter :: n=5, m=3
  integer, dimension(n,m) :: v
  real, dimension(n,m) :: r

  do
 call random_number(r)
 v = int(r * 2)
 if (count(v < 1) > 1) exit
  end do
  write (*,*) 'asdf'
end program main

I compiled libgfortran without vector but this doesn't change anything.  It's
really just the vectorization of that snippet but I haven't figured out why,
yet.  The stack before the random_number call looks identical.

Also tried valgrind, which complains about compares dependent on uninitialized
data (those only show up once compiled with vectorization).  However, I suspect
those are false positives after chasing them for some hours.

Going to try another angle of attack.  Maybe it's a really simple thing I
overlooked.

[Bug tree-optimization/111760] risc-v regression: COND_LEN_* incorrect fold/simplify in middle-end

2023-10-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #2 from Robin Dapp  ---
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629904.html

prevents the wrong code but still leaves us with a redundant negation (and it
is not the only missed optimization of that kind):

  vect_neg_xi_14.4_23 = -vect_xi_13.3_22;
  vect_res_2.5_24 = .COND_LEN_ADD ({ -1, ... }, vect_res_1.0_17,
vect_neg_xi_14.4_23, vect_res_1.0_17, _29, 0);

That's because my "hackaround" doesn't recognize a valueized sequence
_30 = vect_res_1.0_17 - vect_xi_13.3_22;

Of course I could (reverse valueize) recognize that again and convert it to a
COND_LEN... but that doesn't seem elegant at all.  There must be a simpler way
that I'm missing entirely right now.  That said, converting the last statement
of such a sequence should be sufficient?

[Bug tree-optimization/111760] risc-v regression: COND_LEN_* incorrect fold/simplify in middle-end

2023-10-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760

--- Comment #6 from Robin Dapp  ---
Yes, thanks for filing this bug separately.  The patch doesn't disable all of
those optimizations; of course I paid special attention not to mess them up.

The difference here is that we valueize, add statements to *seq and the last
statement is a
 _30 = bla.

Then we'd either need to "_30 = COND_LEN_MOVE bla" or predicate bla itself.

Surely there is a better way.

[Bug bootstrap/116146] Split insn-recog.cc

2024-07-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116146

--- Comment #3 from Robin Dapp  ---
On riscv insn-output is the largest file right now as well.  I have a local
patch that splits it - it's a bit cumbersome because the static initializer
needs to be made non-static, i.e. the initialization must be in an init function
and that needs to be called at some point.  But as a proof of concept it
worked.

Once I have more time (hah) I'm going to post a patch but it will still take a
while.

[Bug target/111600] [14/15 Regression] RISC-V bootstrap time regression

2024-07-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600

--- Comment #37 from Robin Dapp  ---
> The size of the partitions is a little uneven though. Using
> --with-emitinsn-partitions=48 I get some empty partitions and some bigger
> than 2MB:

> Another problematic file is insn-recog.cc which is 19MB and takes 1 hour+ to
> compile for me.

It's not very difficult to make the partitions even.  I have a patch locally
that follows the same approach Tamar took with the match split and it seems to
work nicely.  I haven't gotten around to testing and posting it yet, though.

[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl

2024-07-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149

--- Comment #1 from Robin Dapp  ---
> Still present when rvv_ta_all_1s=true is omitted.

My result is '0' when rvv_ta_all_1s=false, is that what you meant?

I didn't have time to check this in detail but it's not the missing else for
masked loads.  It looks like we should use the "tu" policy instead of "ta" when
doing those intermediate steps.  When I change everything with a vl != 4 (so 3
and 1) to the "tu" policy the result is correct.  Need to check where we go
wrong.

[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl

2024-07-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149

--- Comment #2 from Robin Dapp  ---
Correction, it's actually just the wx adds with a length of 1 and those should
be "tu".  Quite likely this only got exposed recently with the late-combine
change in place.

[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl

2024-07-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149

--- Comment #3 from Robin Dapp  ---
It looks like the problem is a wrong mode_idx attribute for the wx variants of
the adds.  The widening add's mode is the one of the non-widened input operand,
but for the wx/scalar variants this is a scalar mode instead of a vector mode.
That confuses avlprop so that it uses 1 instead of 4 as the vector length.

Testing a patch.

[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl

2024-08-01 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149

Robin Dapp  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #5 from Robin Dapp  ---
Fixed on trunk.

[Bug target/116202] RISC-V: Miscompile at -O3 with zvl256b

2024-08-03 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116202

Robin Dapp  changed:

   What|Removed |Added

 CC||pan2.li at intel dot com

--- Comment #1 from Robin Dapp  ---
Looks like a mistake in the SAT_TRUNC pattern.  Probably -1 instead of 1.

[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d

2024-08-20 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495

Robin Dapp  changed:

   What|Removed |Added

  Component|rtl-optimization|middle-end

--- Comment #6 from Robin Dapp  ---
Finally looking into this one.  The fix is pretty simple and it's similar to
other occurrences of smallest_int_mode_for_size.
smallest_int_mode_for_size expects to find at least one mode equal to or larger
than the provided size but in some cases this fails - in particular when we
have full-vector-size structures like here.

Testing a patch.

[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d

2024-08-20 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495

--- Comment #7 from Robin Dapp  ---
Ah, hmm, this doesn't seem to occur on trunk anymore for me.  It's still likely
latent.  Patrick, does it still happen for you?

[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d

2024-08-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495

Robin Dapp  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #10 from Robin Dapp  ---
Fixed.

[Bug target/116086] RISC-V: Hash mismatch with vectorized 557.xz_r at zvl128b and LMUL=m2

2024-08-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116086

Robin Dapp  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Robin Dapp  ---
Fixed.

As a reminder for posterity:  Richi called for unified subreg handling and
also argued (I agree) that LMUL > 1 VLS modes that are larger than a
minimum-sized vector need to be treated like VLA modes.  I don't think we do
that everywhere already but let's fix things as they arise.

[Bug target/116242] [meta-bug] Tracker for zvl issues in RISC-V

2024-08-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116242
Bug 116242 depends on bug 116086, which changed state.

Bug 116086 Summary: RISC-V: Hash mismatch with vectorized 557.xz_r at zvl128b 
and LMUL=m2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116086

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes

2024-09-05 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611

--- Comment #1 from Robin Dapp  ---
For the record, with the default -march=rv64gcv I don't see any LOAD_LANES,
with -march=rv64gcv -mrvv-vector-bits=zvl I do.

[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes

2024-09-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611

--- Comment #3 from Robin Dapp  ---
Actually we're already supposed to be handling all constant permutes.

Maybe what's in the way is

  /* FIXME: Explicitly disable VLA interleave SLP vectorization when we
 may encounter ICE for poly size (1, 1) vectors in loop vectorizer.
 Ideally, middle-end loop vectorizer should be able to disable it
 itself, We can remove the codes here when middle-end code is able
 to disable VLA SLP vectorization for poly size (1, 1) VF.  */
  if (!BYTES_PER_RISCV_VECTOR.is_constant ()
  && maybe_lt (BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL,
   poly_int64 (16, 16)))
return false;

which was introduced in r14-5917-g9f3f0b829b62f1.

I'm running the testsuite to see if it's still a problem.  If so, let's see if
we can work around the issue differently.

[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes

2024-09-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611

--- Comment #4 from Robin Dapp  ---
I just sent a patch to get rid of this early exit in our backend.

However, with the testsuite compile options
 -O3 -march=rv64gcv -fno-vect-cost-model
I still see MASK_LEN_LOAD_LANES.

[Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V

2024-09-17 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #7 from Robin Dapp  ---
I'm testing a patch that basically does what Richi proposes.

I was also playing around with mixed lane configurations where we potentially
reuse the pointer increment from another pointer update.  To me the code looked
promising and I think we could at least make it work for a subset of lane
configurations.  I didn't manage to get everything correct, though, so the
patch tries to only restore the status quo.

Some info about vsetvl because the question also came up at the Cauldron -
according to the vector spec it has the (for the compiler) annoying property
that it can basically set the length freely within a certain range.  This is
for load-balancing reasons and intended to give hardware implementations more
freedom.  (I'm not sure that is a useful tradeoff as the compiler's freedom is
significantly reduced)

vsetvl takes the "application vector length" (AVL) so the total number of
elements the whole loop wants to process and returns a vl.
VLMAX is the maximum number of elements a single vector (or vector group with
LMUL) can hold.

If the AVL is larger than VLMAX but <= 2 * VLMAX vsetvl can set vl to a value
inside the range
[ceil(AVL / 2), VLMAX].
So for e.g. AVL = 37, ceil(37/2) = 19 would, unfortunately, be a legal vl
value.
For the other possible values of AVL (<= VLMAX, > 2*VLMAX) the behavior is as
expected.

My hope is that most hardware implementations would take a saner approach and
have vsetvl always act as a "min (AVL, VLMAX)".  That would enable easy scalar
evolution and would possibly also allow mixed-lane settings with reuse of the
vl value.  I suppose we could have a target hook or target query mechanism that
asks for "sane" behavior of vsetvl?  Thus we could have optimized SELECT_VL
behavior for those implementations.

[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)

2024-04-03 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #8 from Robin Dapp  ---
I tried some things (for the related bug without -fwrapv) then got busy with
some other things.  I'm going to have another look later this week.

[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA

2024-04-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247

--- Comment #5 from Robin Dapp  ---
This fixes the test case for me locally, thanks.
I can run the testsuite with it later if you'd like.

[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA

2024-04-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247

--- Comment #6 from Robin Dapp  ---
Testsuite looks unchanged on rv64gcv.

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #1 from Robin Dapp  ---
Hmm, my local version is a bit older and seems to give the same result for both
-O2 and -O3.  At least a good starting point for bisection then.

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #2 from Robin Dapp  ---
Checked with the latest commit on a different machine but still cannot
reproduce the error.  PR114668 I can reproduce.  Maybe a copy and paste
problem?

[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668

--- Comment #2 from Robin Dapp  ---
This, again, seems to be a problem with bit extraction from masks.
For some reason I didn't add the VLS modes to the corresponding vec_extract
patterns.  With those in place the problem is gone because we go through the
expander which does the right thing.

I'm still checking what exactly goes wrong without those as there is likely a
latent bug.

[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

--- Comment #3 from Robin Dapp  ---
I think we have always maintained that this can definitely be a per-uarch
default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority of
> implementations, especially high performance ones would benefit from having
> more work to saturate the execution units with, since a larger LMUL works
> quite
> similar to loop unrolling.

One argument is reduced freedom for renaming and the out of order machinery. 
It's much easier to shuffle individual registers around than large blocks. 
Also lower-latency insns are easier to schedule than longer-latency ones and
faults, rejects, aborts etc. get proportionally more expensive.
I was under the impression that unrolling doesn't help a whole lot (sometimes
even slows things down a bit) on modern cores and certainly is not
unconditionally helpful.  Granted, I haven't seen a lot of data on it recently.
An exception is of course breaking dependency chains.

In general nothing stands in the way of having a particular tune target use
dynamic LMUL by default even now but nobody went ahead and posted a patch for
theirs.  One could maybe argue that it should be the default for in-order
uarchs?

Should it become obvious in the future that LMUL > 1 is indeed,
unconditionally, a "better unrolling" because of its favorable icache footprint
and other properties (which I doubt - happy to be proved wrong) then we will
surely re-evaluate the decision or rather arrive at a different consensus.

The data we publicly have so far is all in-order cores and my expectation is
that the picture will change once out-of-order cores hit the scene.

[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668

Robin Dapp  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Robin Dapp  ---
I didn't have the time to fully investigate but the default path without vec
extract is definitely broken for masks.  I'd probably sleep better if we fixed
that at some point but for now the obvious fix is to add the missing expanders.

Patrick, I'm still unable to reproduce PR114665 (maybe also a qemu
difference?).  Could you re-check with this fix?  Thanks.

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #5 from Robin Dapp  ---
Weird,  I tried your exact qemu version and still can't reproduce the problem.  
My results are always FFB5.

Binutils difference?  Very unlikely.  Could you post your QEMU_CPU settings
just to be sure?

[Bug middle-end/114733] [14] Miscompile with -march=rv64gcv -O3 on riscv

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114733

--- Comment #1 from Robin Dapp  ---
Confirmed, also shows up here.

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #1 from Robin Dapp  ---
Confirmed.
