[Bug tree-optimization/100756] [12 Regression] vect: Superfluous epilog created on s390x
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756 --- Comment #8 from rdapp at gcc dot gnu.org --- For completeness: I haven't observed any fallout on s390 since, and the regression is fixed.
[Bug middle-end/106527] New: ICE with modulo scheduling dump (-fdump-rtl-sms)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106527 Bug ID: 106527 Summary: ICE with modulo scheduling dump (-fdump-rtl-sms) Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: zhroma at gcc dot gnu.org Target Milestone: --- Host: s390 Target: s390

Hi,

on s390 we are observing more and more problems with -fmodulo-sched. I initially tried debugging an -fcompare-debug failure with -fmodulo-sched but we already ICE when just dumping via -fdump-rtl-sms.

The problem occurs when compiling the test case gcc.dg/sms-compare-debug-1.c with

  gcc -O2 -fmodulo-sched sms-compare-debug-1.c -fdump-rtl-sms:

sms-compare-debug-1.c:36:1: internal compiler error: in linemap_ordinary_map_lookup, at libcpp/line-map.cc:1064
   36 | }
      | ^
0x2694499 linemap_ordinary_map_lookup
        ../../libcpp/line-map.cc:1064
0x2694ef7 linemap_macro_loc_to_exp_point
        ../../libcpp/line-map.cc:1561
0x266a5c5 expand_location_1
        ../../gcc/input.cc:243
0x266c54d expand_location(unsigned int)
        ../../gcc/input.cc:956
0x1513ecb insn_location(rtx_insn const*)
        ../../gcc/emit-rtl.cc:6558
0x24cb523 dump_insn_location
        ../../gcc/modulo-sched.cc:1250
0x24cb523 dump_insn_location
        ../../gcc/modulo-sched.cc:1246
0x24cf5d7 sms_schedule
        ../../gcc/modulo-sched.cc:1418
0x24d267f execute
        ../../gcc/modulo-sched.cc:3358

I didn't manage to simplify the test case further. It works fine on x86. The ICE does not seem to occur with GCC 11, so I can bisect the issue if that helps. Given the several other problems we're having with modulo scheduling I figured it's better to ask for general guidance here first.

Regards, Robin
[Bug rtl-optimization/105988] [10/11/12/13 Regression] ICE in linemap_ordinary_map_lookup, at libcpp/line-map.cc:1064 since r6-4873-gebedc9a3414d8422
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105988 rdapp at gcc dot gnu.org changed: What|Removed |Added Target|x86_64-pc-linux-gnu |x86_64-pc-linux-gnu s390 --- Comment #6 from rdapp at gcc dot gnu.org --- We are also seeing this on s390, along with several other problems with -fmodulo-sched. Is this pass here to stay, or is it safe to ignore all issues/FAILs with it because it's going away anyway? Regards, Robin
[Bug target/106701] Compiler does not take into account number range limitation to avoid subtract from immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701 rdapp at gcc dot gnu.org changed: What|Removed |Added Target|s390|s390 x86_64-linux-gnu CC||glisse at gcc dot gnu.org, ||rdapp at gcc dot gnu.org, ||rguenth at gcc dot gnu.org Summary|s390: Compiler does not |Compiler does not take into |take into account number|account number range |range limitation to avoid |limitation to avoid |subtract from immediate |subtract from immediate --- Comment #1 from rdapp at gcc dot gnu.org --- Added x86 to targets because we don't seem to optimize this there either (at least I didn't see it on my recent-ish GCC). The following (not regtested) helps on s390:

diff --git a/gcc/match.pd b/gcc/match.pd
index e486b4be282c..2ebbf68010f9 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7992,3 +7992,27 @@ and,
 (match (bitwise_induction_p @0 @2 @3)
  (bit_not
   (nop_convert1? (bit_xor@0 (convert2? (lshift integer_onep@1 @2)) @3))))
+
+/* cst - a -> cst ^ a if 0 <= a <= cst and integer_pow2p (cst + 1).  */
+#if GIMPLE
+(simplify
+ (minus INTEGER_CST@0 @1)
+ (with {
+    wide_int w = wi::to_wide (@0) + 1;
+    value_range vr;
+    wide_int wmin = w;
+    wide_int wmax = w;
+    if (get_global_range_query ()->range_of_expr (vr, @1)
+	&& vr.kind () == VR_RANGE)
+      {
+	wmin = vr.lower_bound ();
+	wmax = vr.upper_bound ();
+      }
+  }
+  (if (wi::exact_log2 (w) != -1
+       && wi::geu_p (wmin, 0)
+       && wi::leu_p (wmax, w))
+   (bit_xor @0 @1))
+  )
+)
+#endif

but it can surely still be improved by some match.pd magic. A second question: do we unconditionally want to simplify this, or should it rather be backend dependent?
[Bug target/106701] Compiler does not take into account number range limitation to avoid subtract from immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701 --- Comment #3 from rdapp at gcc dot gnu.org --- I thought expand (or combine) was independent of value ranges. What would be the proper place for it then?
[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213 rdapp at gcc dot gnu.org changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #6 from rdapp at gcc dot gnu.org --- What's the mechanism to get range information at RTL level? The only related thing I saw in (e.g.) simplify-rtx.cc is nonzero_bits and this does not seem to be propagated from gimple.
[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213 --- Comment #8 from rdapp at gcc dot gnu.org --- Hacked something together, inspired by the other cases that try two different sequences. Does this go in the right direction? Works for me on s390. I see some regressions related to predictive commoning that I will look into.

diff --git a/gcc/expr.cc b/gcc/expr.cc
index c90cde35006b..395b4df2e214 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "backend.h"
 #include "target.h"
 #include "rtl.h"
+#include "tree-core.h"
 #include "tree.h"
 #include "gimple.h"
 #include "predict.h"
@@ -65,7 +66,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "rtx-vector-builder.h"
 #include "tree-pretty-print.h"
 #include "flags.h"

 /* If this is nonzero, we do not bother generating VOLATILE
    around volatile memory references, and we are willing to
@@ -9358,6 +9359,21 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode,
	  return simplify_gen_binary (MINUS, mode, op0, op1);
	}

+      /* Convert const - A to A xor const if integer_pow2p (const + 1)
+	 and 0 <= A <= const.  */
+      if (code == MINUS_EXPR
+	  && TREE_CODE (treeop0) == INTEGER_CST
+	  && SCALAR_INT_MODE_P (mode)
+	  && unsignedp
+	  && wi::exact_log2 (wi::to_wide (treeop0) + 1) != -1)
+	{
+	  rtx res = maybe_optimize_cst_sub (code, treeop0, treeop1,
+					    mode, unsignedp, type,
+					    target, subtarget);
+	  if (res)
+	    return res;
+	}
+
       /* No sense saving up arithmetic to be done
	 if it's all in the wrong mode to form part of an address.
	 And force_operand won't know whether to sign-extend or
@@ -12641,6 +12657,77 @@ maybe_optimize_mod_cmp (enum tree_code code, tree *arg0, tree *arg1)
   return code == EQ_EXPR ? LE_EXPR : GT_EXPR;
 }

+/* Optimize cst - x if integer_pow2p (cst + 1) and 0 <= x <= cst.  */
+
+rtx
+maybe_optimize_cst_sub (enum tree_code code, tree treeop0, tree treeop1,
+			machine_mode mode, int unsignedp, tree type,
+			rtx target, rtx subtarget)
+{
+  gcc_checking_assert (code == MINUS_EXPR);
+  gcc_checking_assert (TREE_CODE (treeop0) == INTEGER_CST);
+  gcc_checking_assert (TREE_CODE (TREE_TYPE (treeop1)) == INTEGER_TYPE);
+  gcc_checking_assert (wi::exact_log2 (wi::to_wide (treeop0) + 1) != -1);
+
+  if (!optimize)
+    return NULL_RTX;
+
+  optab this_optab;
+  rtx op0, op1;
+
+  if (wi::leu_p (tree_nonzero_bits (treeop1), tree_nonzero_bits (treeop0)))
+    {
+      expand_operands (treeop0, treeop1, subtarget, &op0, &op1,
+		       EXPAND_NORMAL);
+      bool speed_p = optimize_insn_for_speed_p ();
+      do_pending_stack_adjust ();
+      start_sequence ();
+      this_optab = optab_for_tree_code (MINUS_EXPR, type,
+					optab_default);
+      rtx subi = expand_binop (mode, this_optab, op0, op1, target,
+			       unsignedp, OPTAB_LIB_WIDEN);
+
+      rtx_insn *sub_insns = get_insns ();
+      end_sequence ();
+      start_sequence ();
+      this_optab = optab_for_tree_code (BIT_XOR_EXPR, type,
+					optab_default);
+      rtx xori = expand_binop (mode, this_optab, op0, op1, target,
+			       unsignedp, OPTAB_LIB_WIDEN);
+      rtx_insn *xor_insns = get_insns ();
+      end_sequence ();
+      unsigned sub_cost = seq_cost (sub_insns, speed_p);
+      unsigned xor_cost = seq_cost (xor_insns, speed_p);
+      /* If the costs are the same then use the other factor as a
+	 tie breaker.  */
+      if (sub_cost == xor_cost)
+	{
+	  sub_cost = seq_cost (sub_insns, !speed_p);
+	  xor_cost = seq_cost (xor_insns, !speed_p);
+	}
+
+      if (sub_cost <= xor_cost)
+	{
+	  emit_insn (sub_insns);
+	  return subi;
+	}
+
+      emit_insn (xor_insns);
+      return xori;
+    }
+
+  return NULL_RTX;
+}
+
 /* Optimize x - y < 0 into x < 0 if x - y has undefined overflow.  */

 void

diff --git a/gcc/expr.h b/gcc/expr.h
index 035118324057..9c4f2ed02fcb 100644
--- a/gcc/expr.h
+++ b/gcc/expr.h
@@ -317,6 +317,8 @@ extern tree string_constant (tree, tree *, tree *, tree *);
 extern tree byte_representation (tree, tree *, tree *, tree *);
 extern enum tree_code maybe_optimize_mod_cmp (enum tree_code, tree *, tree *);
+extern rtx maybe_optimize_cst_sub (enum tree_code, tree, tree,
+				   machine_mode, int, tree, rtx, rtx);
 extern void maybe_optimize_sub_cmp_0 (enum tree_code, tree *, tree *);

 /* Two different ways of generating switch statements.  */
[Bug middle-end/91213] Missed optimization: (sub X Y) -> (xor X Y) when Y <= X and isPowerOf2(X + 1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91213 --- Comment #9 from rdapp at gcc dot gnu.org --- The regressions are unrelated and due to another patch that I still had on the same branch.
[Bug target/106919] [13 Regression] RTL check: expected code 'set' or 'clobber', have 'if_then_else' in s390_rtx_costs, at config/s390/s390.cc:3672 on s390x-linux-gnu since r13-2251-g1930c5d05ceff2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106919 --- Comment #8 from rdapp at gcc dot gnu.org --- Yes, one of dst and dest is superfluous. Looks good like that. I already bootstrapped the same patch locally, no regressions.
[Bug tree-optimization/100756] vect: Superfluous epilog created on s390x
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100756 rdapp at gcc dot gnu.org changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #4 from rdapp at gcc dot gnu.org --- Anything that can/should be done here in case we'd still want to tackle it in this P1 cycle?
[Bug middle-end/107617] New: SCC-VN with len_store and big endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617 Bug ID: 107617 Summary: SCC-VN with len_store and big endian Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: richard.guenther at gmail dot com Target Milestone: --- Target: s Created attachment 53871 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53871&action=edit s390 patch for len_load/len_store

Hi,

Richard and I already discussed this quickly on the mailing list but I didn't manage to make progress analyzing it as I was tied up with other things. Figured I'd open a bug for tracking purposes and for the possibility of fixing it at a later stage.

I'm evaluating len_load/len_store support on s390 via the attached patch and am seeing a FAIL in testsuite/gfortran.dg/power_3.f90 built with

  -march=z16 -O3 --param vect-partial-vector-usage=1

The problem seems to be that we evaluate a vector constant {-1, 1, -1, 1} loaded with length 11 + 1 (bias) = 12 as {1, -1, 1} instead of {-1, 1, -1}.

Richard wrote on the mailing list:

> The error is probably in vn_reference_lookup_3 which assumes that
> 'len' applies to the vector elements in element order.  See the part
> of the code where it checks for internal_store_fn_p.  If 'len' is with
> respect to the memory and thus endianness has to be taken into
> account then for the IFN_LEN_STORE
>
>       else if (fn == IFN_LEN_STORE)
>         {
>           pd.rhs_off = 0;
>           pd.offset = offset2i;
>           pd.size = (tree_to_uhwi (len)
>                      + -tree_to_shwi (bias)) * BITS_PER_UNIT;
>           if (ranges_known_overlap_p (offset, maxsize,
>                                       pd.offset, pd.size))
>             return data->push_partial_def (pd, set, set,
>                                            offseti, maxsizei);
>
> likely needs to adjust rhs_off from zero for big endian?
[Bug middle-end/107617] SCC-VN with len_store and big endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617 rdapp at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P4
[Bug middle-end/107617] SCC-VN with len_store and big endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107617 --- Comment #1 from rdapp at gcc dot gnu.org --- For completeness, the mailing list thread is here: https://gcc.gnu.org/pipermail/gcc-patches/2022-September/602252.html
[Bug target/113827] New: MrBayes benchmark redundant load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827 Bug ID: 113827 Summary: MrBayes benchmark redundant load Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org, pan2.li at intel dot com Blocks: 79704 Target Milestone: --- Target: riscv

A hot block in the MrBayes benchmark (as used in the Phoronix testsuite) has a redundant scalar load when vectorized. Minimal example, compiled with -march=rv64gcv -O3:

void foo (float **a, float f, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i][0] /= f;
      a[i][1] /= f;
      a[i][2] /= f;
      a[i][3] /= f;
      a[i] += 4;
    }
}

GCC:

.L3:
        ld      a5,0(a0)
        vle32.v v1,0(a5)
        vfmul.vv        v1,v1,v2
        vse32.v v1,0(a5)
        addi    a5,a5,16
        sd      a5,0(a0)
        addi    a0,a0,8
        bne     a0,a4,.L3

The value at 0(a0) doesn't change after the store of a5, so the load at the top of the next iteration is redundant. LLVM:

.L3:
        vle32.v v8,(a1)
        addi    a3,a1,16
        sd      a3,0(a2)
        vfdiv.vf        v8,v8,fa5
        addi    a2,a2,8
        vse32.v v8,(a1)
        bne     a2,a0,.L3

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704 [Bug 79704] [meta-bug] Phoronix Test Suite compiler performance issues
[Bug target/113827] MrBayes benchmark redundant load on riscv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827 --- Comment #1 from Robin Dapp --- x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:

.L3:
        movq    (%rdi), %rax
        vmovups (%rax), %xmm1
        vdivps  %xmm0, %xmm1, %xmm1
        vmovups %xmm1, (%rax)
        addq    $16, %rax
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rdx
        jne     .L3

So probably not target specific. Costing?
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #4 from Robin Dapp --- Judging by the graph it looks like it was slow before, then got faster, and is now slower again. Is there some more info on why it got faster in the first place? Did the patch reverse something or is it rather a secondary effect? I don't have a zen4 handy to check.
[Bug target/114027] [14] RISC-V vector: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027 Robin Dapp changed: What|Removed |Added CC||rguenth at gcc dot gnu.org Last reconfirmed||2024-2-22 Target|riscv |x86_64-*-* riscv*-*-* ||aarch64-*-* --- Comment #5 from Robin Dapp --- To me it looks like we treat e.g.

  c_53 = _43 ? prephitmp_13 : 0

as the only reduction statement in the chain and simplify to MAX, when we actually have several. (See "condition expression based on compile time constant".)

--- Comment #6 from Robin Dapp --- Btw this fails on x86 and aarch64 for me with -fno-vect-cost-model. So it definitely looks generic.
[Bug target/114027] [14] RISC-V vector: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027 --- Comment #9 from Robin Dapp --- Argh, I actually just did a gcc -O3 -march=native pr114027.c -fno-vect-cost-model on cfarm188 with a recent-ish GCC but realized that I used my slightly modified version and not the original test case.

long a;
int b[10][8] = {{}, {}, {}, {}, {}, {},
                {0, 0, 0, 0, 0, 1, 1},
                {1, 1, 1, 1, 1, 1, 1},
                {1, 1, 1, 1, 1, 1, 1}};
int c;

int main() {
  int d;
  c = 0x;
  for (; a < 6; a++) {
    d = 0;
    for (; d < 6; d++) {
      c ^= -3L;
      if (b[a + 3][d])
        continue;
      c = 0;
    }
  }
  if (c == -3)
    return 0;
  else
    return 1;
}

This was from an initial attempt to minimize it further but I didn't really verify if I'm breaking the test case by that (or causing undefined behavior). With that I get a "1" with default options and "0" with -fno-tree-vectorize. Maybe my snippet is broken then?
[Bug target/114028] [14] RISC-V rv64gcv_zvl256b: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114028 --- Comment #2 from Robin Dapp --- This is a target issue. It looks like we try to construct a "superword" sequence when the element size is already == Pmode. Testing a patch.
[Bug middle-end/114109] New: x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 Bug ID: 114109 Summary: x264 satd vectorization vs LLVM Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
  uint32_t s = ((a >> 15) & 0x10001) * 0x;
  return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
  uint32_t tmp[4][4];
  uint32_t a0, a1, a2, a3;
  int sum = 0;
  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
    {
      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
      {
        int t0 = a0 + a1;
        int t1 = a0 - a1;
        int t2 = a2 + a3;
        int t3 = a2 - a3;
        tmp[i][0] = t0 + t2;
        tmp[i][1] = t1 + t3;
        tmp[i][2] = t0 - t2;
        tmp[i][3] = t1 - t3;
      };
    }
  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = tmp[0][i] + tmp[1][i];
        int t1 = tmp[0][i] - tmp[1][i];
        int t2 = tmp[2][i] + tmp[3][i];
        int t3 = tmp[2][i] - tmp[3][i];
        a0 = t0 + t2;
        a2 = t0 - t2;
        a1 = t1 + t3;
        a3 = t1 - t3;
      };
      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }
  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv but x86 and aarch64 are pretty similar. (Refer to https://godbolt.org/z/vzf5ha44r which compares at -O3 -mavx512f.) Vectorizing the first loop seems to be a costing issue. By default we don't vectorize, and the code becomes much larger when disabling vector costing, so the costing decision in itself seems correct.
Clang's version is significantly shorter; it looks like it just directly vec_sets/vec_inits the individual elements. On riscv this can be handled rather elegantly with strided loads, which we don't emit right now. As there are only 4 active vector elements and the loop is likely load bound, it might be debatable whether LLVM's version is better. The second loop we do vectorize (4 elements at a time) but we end up with e.g. four XORs for the four inlined abs2 calls, while clang chooses a larger vectorization factor and does all the XORs in one go. On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM) but I guess the general case is still interesting?
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 --- Comment #2 from Robin Dapp --- It is vectorized with a higher zvl, e.g. zvl512b, refer https://godbolt.org/z/vbfjYn5Kd.
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 --- Comment #4 from Robin Dapp --- Yes, as mentioned, vectorization of the first loop is debatable.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #6 from Robin Dapp --- Honestly, I don't know how to analyze/debug this without a zen4, in particular as it only seems to happen with PGO. I tried locally but of course the execution time doesn't change (same as with zen3 according to the database). Is there a way to obtain the binaries in order to tell a difference?
[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200 --- Comment #1 from Robin Dapp --- Took me a while to analyze this... needed more time than I'd like to admit to make sense of the somewhat weird code created by fully unrolling and peeling. I believe the problem is that we reload the output register of a vfmacc/fma via vmv.v.v (subject to length masking) but we should be using vmv1r.v. The result is used by a reduction which always operates on the full length. As annoying as it was to find - it's definitely a good catch. I'm testing a patch. PR114202 is indeed a duplicate. Going to add its test case to the patch.
[Bug middle-end/114196] [13/14 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196 Robin Dapp changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=113163 --- Comment #2 from Robin Dapp --- To me this looks like it already came up in the context of early-break vectorization (PR113163) but is not actually dependent on it. I'm testing a patch that disables epilogue peeling also without early break.
[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200 --- Comment #3 from Robin Dapp --- *** Bug 114202 has been marked as a duplicate of this bug. ***
[Bug target/114202] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114202 Robin Dapp changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from Robin Dapp --- Same as PR114200. *** This bug has been marked as a duplicate of bug 114200 ***
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #7 from Robin Dapp --- I built executables with and without the commit (-Ofast -march=znver4 -flto). There is no difference so it must really be something that happens with PGO. I'd really need access to a zen4 box or the pgo executables at least.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #10 from Robin Dapp ---

(In reply to Sam James from comment #9)
> (In reply to Filip Kastl from comment #8)
> > I'd like to help but I'm afraid I cannot send you the SPEC binaries with
> > PGO applied since SPEC is licensed nor can I give you access to a Zen4
> > computer. I suppose someone else will have to analyze this bug.
>
> Could you perhaps send only the gcda files so Robin can build again with
> -fprofile-use?

Yes, that would be helpful. Or Filip builds the executables himself and posts (some of) the differences here. Maybe that also gets us a bit closer to the problem.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #16 from Robin Dapp --- Thank you! I'm having a problem with the data, though. I'm compiling with -Ofast -march=znver4 -mtune=znver4 -flto -fprofile-use=/tmp. Would you mind showing your exact final options for the compilation of e.g. pbeampp.c? I see, similar-ish for both commits:

pbeampp.c:119:8: error: number of counters in profile data for function 'primal_bea_mpp' does not match its profile data (counter 'arcs', expected 20 and have 22) [-Werror=coverage-mismatch]
output.c:87:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 1
output.c:87:1: error: corrupted profile info: number of executions for edge 3-5 thought to be -1
output.c:87:1: error: corrupted profile info: number of iterations for basic block 5 thought to be -1
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #18 from Robin Dapp --- Hmm, that doesn't help unfortunately. A full command line for me looks like:

x86_64-pc-linux-gnu-gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -Ofast -march=znver4 -mtune=znver4 -flto=32 -g -fprofile-use=/tmp -DSPEC_CPU_LP64 pbeampp.c

Could you verify whether it's exactly the same for you? Maybe it would also help if you explicitly specified znver4?
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #20 from Robin Dapp --- No change with -std=gnu99 unfortunately.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #22 from Robin Dapp --- Still the same problem unfortunately. I'm a bit out of ideas - maybe your compiler executables could help?
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #24 from Robin Dapp --- I rebuilt GCC from scratch with your options but still have the same problem. Could our sources differ? My SPEC version might not be the most recent but I'm not aware that mcf changed at some point. Just to be sure: I'm using r14-5075-gc05f748218a0d5 as the "before" commit.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #27 from Robin Dapp --- Can you try it with a simpler (non SPEC) test? Maybe there is still something weird happening with SPEC's scripting.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #29 from Robin Dapp --- Yes, that also appears to work here. There was no lto involved this time? Now we need to figure out what's different with SPEC.
[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 Robin Dapp changed: What|Removed |Added Target|riscv*-*-* |x86_64-*-* riscv*-*-* --- Comment #2 from Robin Dapp --- At first glance it doesn't really look like a target issue. Tried it on x86 and it fails as well with -O3 -march=native pr114396.c -fno-vect-cost-model -fwrapv

short a = 0xF;
short b[16];

int main() {
  for (int e = 0; e < 9; e += 1)
    b[e] = a *= 0x5;
  if (a != 2283)
    __builtin_abort ();
}
[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #3 from Robin Dapp --- -O3 -mavx2 -fno-vect-cost-model -fwrapv seems to be sufficient.
[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #7 from Robin Dapp ---

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb49..f8f7ba0ccc1 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -9454,7 +9454,7 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
       wi::to_mpz (skipn, exp, UNSIGNED);
       mpz_ui_pow_ui (mod, 2, TYPE_PRECISION (type));
       mpz_powm (res, base, exp, mod);
-      begin = wi::from_mpz (type, res, TYPE_SIGN (type));
+      begin = wi::from_mpz (type, res, TYPE_SIGN (utype));
       tree mult_expr = wide_int_to_tree (utype, begin);
       init_expr = gimple_build (stmts, MULT_EXPR, utype,
				 init_expr, mult_expr);

This helps for the test case.
[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #8 from Robin Dapp --- No fallout on x86 or aarch64. Of course using false instead of TYPE_SIGN (utype) is also possible and maybe clearer?
[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vector-cost-mode (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476 --- Comment #5 from Robin Dapp --- So the result is -9 instead of 9 (or vice versa) and this happens (just) with vectorization. We only vectorize with -fwrapv.

From a first quick look, the following is what we have before vect:

  (loop) [local count: 991171080]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;

  (check) [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI
  ...
  if (b_lsm.5_22 != -9)

I.e. b gets negated with every iteration and we check the second-to-last value against -9. With vectorization we have:

  (init) [local count: 82570744]:
  b_lsm.5_17 = b;

  (vectorized loop) [local count: 247712231]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;
  ...
  goto

  (epilogue) [local count: 82570741]:
  ...
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
  ...
  _25 = -b_lsm.5_7;

  (check) [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI
  if (b_lsm.5_22 != -9)

What looks odd here is that b_lsm.5_7's fallthrough argument is b_lsm.5_17 even though we must have come through the vectorized loop (which negated b at least once). This makes us skip negations. Indeed, as b_lsm.5_22 only depends on the initial value of b, it gets optimized away and we compare b != -9. Maybe I missed something, but it looks like

  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>

should have b_lsm.5_5 or _4 as its fallthrough argument.
[Bug tree-optimization/114485] [13/14 Regression] Wrong code with -O3 -march=rv64gcv on riscv or `-O3 -march=armv9-a` for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114485 --- Comment #4 from Robin Dapp --- Yes, the vectorization looks ok. The extracted live values are not used afterwards and therefore the whole vectorized loop is thrown away. Then we do one iteration of the epilogue loop, inverting the original c, and end up with -8 instead of 8. This is pretty similar to what's happening in the related PR. We properly populate the phi in question in slpeel_update_phi_nodes_for_guard1:

  c_lsm.7_64 = PHI <_56(23), pretmp_34(17)>

but vect_update_ivs_after_vectorizer changes that into

  c_lsm.7_64 = PHI .

Just as a test, commenting out

  if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
    vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
				      update_e);

at least makes us keep the VEC_EXTRACT and not fail anymore.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Robin Dapp changed: What|Removed |Added CC||ewlu at rivosinc dot com, ||rdapp at gcc dot gnu.org --- Comment #7 from Robin Dapp --- There is some riscv fallout as well. Edwin has the details.
[Bug rtl-optimization/108412] RISC-V: Negative optimization of GCSE && LOOP INVARIANTS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108412 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #3 from Robin Dapp --- I played around a bit with the scheduling model and the pressure-aware scheduling. -fsched-pressure alone does not seem to help because the problem is indeed the latency of vector load and store. The scheduler will try to keep dependent loads and stores apart (for the number of cycles specified) and then, after realizing there is nothing to put in between, lump everything together at the end of the sequence. That's a well-known but unfortunate property of scheduling. I'll need to think of something, but this is not resolved for now.
[Bug tree-optimization/111136] New: ICE in RISC-V test case since r14-3441-ga1558e9ad85693
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111136 Bug ID: 111136 Summary: ICE in RISC-V test case since r14-3441-ga1558e9ad85693 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org Target Milestone: --- Target: riscv

The following RISC-V test case ICEs since r14-3441-ga1558e9ad85693 (mask_gather_load-11.c):

#define uint8_t unsigned char

void
foo (uint8_t *restrict y, uint8_t *restrict x, uint8_t *index, uint8_t *cond)
{
  for (int i = 0; i < 100; ++i)
    {
      if (cond[i * 2])
        y[i * 2] = x[index[i * 2]] + 1;
      if (cond[i * 2 + 1])
        y[i * 2 + 1] = x[index[i * 2 + 1]] + 2;
    }
}

I compiled with

  build/gcc/cc1 -march=rv64gcv -mabi=lp64 -O3 --param=riscv-autovec-preference=scalable mask_gather_load-11.c

mask_gather_load-11.c: In function 'foo':
mask_gather_load-11.c:9:1: internal compiler error: in get_group_load_store_type, at tree-vect-stmts.cc:2121
    9 | foo (uint8_t *restrict y, uint8_t *restrict x,
      | ^~~
0x9e2fad get_group_load_store_type
        ../../gcc/tree-vect-stmts.cc:2121
0x9e2fad get_load_store_type
        ../../gcc/tree-vect-stmts.cc:2451
0x1ff7221 vectorizable_store
        ../../gcc/tree-vect-stmts.cc:8309
[...]
[Bug target/108271] Missed RVV cost model
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108271 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #3 from Robin Dapp --- This is basically the same problem as PR108412. As long as loads/stores have a high(ish) latency and we mostly do load/store, they will tend to lump together at the end of the function. Setting vector load/store to a latency of <= 2 helps here and we might want to do this in order to avoid excessive spilling. I had to deal with this before, e.g. in SPEC2006's calculix. In the end insn scheduling wouldn't buy us anything and rather caused more spilling, causing performance degradation.
[Bug tree-optimization/111136] ICE in RISC-V test case since r14-3441-ga1558e9ad85693
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111136 --- Comment #4 from Robin Dapp --- All gather-scatter tests pass for me again (the given example in particular) after applying this.
[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153 --- Comment #1 from Robin Dapp --- We seem to decide that a slightly more expensive loop (one instruction more) without an epilogue is better than a loop with an epilogue. This looks intentional in the vectorizer cost estimation and is not specific to our lack of a costing model. Hmm..

The main loops are (VLA):

.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

vs (VLS):

.L4:
        vle32.v v1,0(a5)
        vle32.v v2,0(sp)
        addi    a5,a5,16
        vadd.vv v1,v2,v1
        vse32.v v1,0(sp)
        bne     a4,a5,.L4

This is doubly weird because of the spill of the accumulator. We shouldn't be generating this sequence but even if so, it should be more expensive. This can be achieved e.g. by the following example vectorizer cost function:

static int
riscv_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                  tree vectype, int misalign ATTRIBUTE_UNUSED)
{
  unsigned elements;

  switch (type_of_cost)
    {
    case scalar_stmt:
    case scalar_load:
    case scalar_store:
    case vector_stmt:
    case vector_gather_load:
    case vector_scatter_store:
    case vec_to_scalar:
    case scalar_to_vec:
    case cond_branch_not_taken:
    case vec_perm:
    case vec_promote_demote:
    case unaligned_load:
    case unaligned_store:
      return 1;

    case vector_load:
    case vector_store:
      return 3;

    case cond_branch_taken:
      return 3;

    case vec_construct:
      elements = estimated_poly_value (TYPE_VECTOR_SUBPARTS (vectype));
      return elements / 2 + 1;

    default:
      gcc_unreachable ();
    }
}

For a proper loop like

        vle32.v v2,0(sp)
.L4:
        vle32.v v1,0(a5)
        addi    a5,a5,16
        vadd.vv v1,v2,v1
        bne     a4,a5,.L4
        vse32.v v1,0(sp)

I'm not so sure anymore. For large n this could be preferable depending on the vectorization factor and other things.
[Bug target/110559] Bad mask_load/mask_store codegen of RVV
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110559 --- Comment #3 from Robin Dapp --- I got back to this again today, now that pressure-aware scheduling is the default. As mentioned before, it helps but doesn't get rid of the spills. Testing with the "generic ooo" scheduling model it looks like a vector load/store latency of 6 is too high. Yet, even setting them to 1 is not enough to get rid of spills entirely. What helps is additionally lowering the vector ALU latency to 2 (from the default 3). I'm not really sure how to properly handle this. As far as I can tell spilling is always going to happen if we try to "wait" for dependencies and delay the dependent instructions. In my experience the hardware does a better job at live scheduling anyway and we only make things worse in several cases. Previously I experimented with setting the latency of most instructions to 1 with few exceptions and instead ensuring a proper instruction mix, i.e. trying to keep every execution unit busy. That's not a panacea either, though.
[Bug target/111311] New: RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111311 Bug ID: 111311 Summary: RISC-V regression testsuite errors with --param=riscv-autovec-preference=scalable Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: jeremy.bennett at embecosm dot com, juzhe.zhong at rivai dot ai, kito.cheng at gmail dot com, law at gcc dot gnu.org, palmer at dabbelt dot com, vineetg at rivosinc dot com Target Milestone: ---

As discussed in yesterday's meeting this is the PR for all current FAILs in GCC's regression test suite when running it with default vector support. I used --target-board=unix/-march=rv64gcv/--param=riscv-autovec-preference=scalable. Below is the list of FAILs/... I got, hope the message doesn't get too large.

FAIL: gcc.c-torture/execute/pr53645-2.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects (test for excess errors)
FAIL: gcc.c-torture/execute/pr53645.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects (test for excess errors)
FAIL: gcc.c-torture/unsorted/dump-noaddr.c.*r.vsetvl, -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions comparison
FAIL: gcc.dg/analyzer/pr105252.c (test for excess errors)
FAIL: gcc.dg/analyzer/pr96713.c (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: gcc.dg/analyzer/pr96713.c (test for excess errors)
FAIL: c-c++-common/opaque-vector.c -Wc++-compat (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/opaque-vector.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/pr105998.c -Wc++-compat (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/pr105998.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/scal-to-vec2.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/spec-barrier-1.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/vector-compare-1.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/vector-compare-2.c -Wc++-compat (test for excess errors)
FAIL: c-c++-common/vector-scalar.c -Wc++-compat (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: c-c++-common/vector-scalar.c -Wc++-compat (test for excess errors)
FAIL: gcc.dg/Wstrict-aliasing-bogus-ref-all-2.c (test for excess errors)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 (test for warnings, line 72)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 (test for warnings, line 77)
XPASS: gcc.dg/Wstringop-overflow-47.c pr97027 note (test for warnings, line 68)
FAIL: gcc.dg/Wstringop-overflow-70.c (test for warnings, line 22)
XPASS: gcc.dg/attr-alloc_size-11.c missing range info for short (test for warnings, line 51)
XPASS: gcc.dg/attr-alloc_size-11.c missing range info for signed char (test for warnings, line 50)
FAIL: gcc.dg/pr100239.c (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: gcc.dg/pr100239.c (test for excess errors)
FAIL: gcc.dg/pr100292.c (test for excess errors)
FAIL: gcc.dg/pr104992.c scan-tree-dump-times optimized " % " 9
FAIL: gcc.dg/pr105049.c (test for excess errors)
FAIL: gcc.dg/pr108805.c (test for excess errors)
FAIL: gcc.dg/pr34856.c (test for excess errors)
FAIL: gcc.dg/pr35442.c (test for excess errors)
FAIL: gcc.dg/pr42685.c (test for excess errors)
FAIL: gcc.dg/pr45105.c (test for excess errors)
FAIL: gcc.dg/pr53060.c (test for excess errors)
FAIL: gcc.dg/pr63914.c (test for excess errors)
FAIL: gcc.dg/pr70252.c (internal compiler error: in gimple_expand_vec_cond_expr, at gimple-isel.cc:283)
FAIL: gcc.dg/pr70252.c (test for excess errors)
FAIL: gcc.dg/pr85430.c (test for excess errors)
FAIL: gcc.dg/pr85467.c (test for excess errors)
FAIL: gcc.dg/pr91441.c at line 11 (test for warnings, line )
FAIL: gcc.dg/pr92301.c execution test
FAIL: gcc.dg/pr96453.c (test for excess errors)
FAIL: gcc.dg/pr96466.c (test for excess errors)
FAIL: gcc.dg/pr97238.c (internal compiler error: in emit_move_multi_word, at expr.cc:4079)
FAIL: gcc.dg/pr97238.c (test for excess errors)
FAIL: gcc.dg/signbit-2.c scan-tree-dump-not optimized "s+>>s+31"
FAIL: gcc.dg/signbit-5.c execution test
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "Not unrolling loop, doesn't roll"
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "likely upper bound: 6"
FAIL: gcc.dg/unroll-8.c scan-rtl-dump loop2_unroll "realistic bound: -1"
FAIL: gcc.dg/var-expand1.c scan-rtl-dump loop2_unroll "Expanding Accumulator"
FAIL: gcc.dg/vshift-6.c (test for excess errors)
FAIL: gcc.dg/vshift-7.c (test for excess errors)
FAIL: gcc.dg/ipa/ipa-sra-19.c (test for excess errors)
FAIL: gcc.dg/lto/pr83719 c_lto_pr83719_0.o assemble, -flto -g -gsplit-dwarf
FAIL: gcc.dg/pch/save-temps-1.c -O0 -I. -Dwith_PCH (te
[Bug c/111337] ICE in gimple-isel.cc for RISC-V port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #1 from Robin Dapp --- This is gcc.dg/pr70252.c BTW. What happens is that, starting with

  maskdest = (vec_cond mask1 1 0) >= (vec_cond mask2 1 0)

we fold to

  maskdest = mask1 >= (vec_cond (mask2 1 0))

and then sink the ">=" into the vec_cond so we end up with

  maskdest = vec_cond (mask2 ? mask1 : 0)

i.e. a vec_cond with a mask "data mode". In gimple-isel, when the target does not provide a vcond_mask implementation for that (which none does) we assert that the mask mode be MODE_VECTOR_INT. IMHO this should not happen and we should not sink comparisons (that could be folded to masks) into vec_cond. I'm preparing a patch that prevents the sinking of comparisons for mask types.
[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337 --- Comment #8 from Robin Dapp --- Yes, I doubt we would get much below 4 instructions with riscv specifics. A quick grep yesterday didn't reveal any aarch64 or gcn patterns for those (as long as they are not hidden behind some pattern replacement). But aarch64 doesn't encounter this situation anyway as we fold differently before.
[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337 --- Comment #10 from Robin Dapp --- I would be OK with the riscv implementation, then we don't need to touch isel. Maybe a future vector extension will also help us here so we could just switch the implementation then.
[Bug middle-end/111337] ICE in gimple-isel.cc for RISC-V port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111337 --- Comment #12 from Robin Dapp --- Yes, as far as I know. I would also go ahead and merge the test suite patch now as there is already a v2 fix posted. Even if it's not the correct one it will be done soon so we should not let that block enabling the test suite.
[Bug target/111317] RISC-V: Incorrect COST model for RVV conversions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317 --- Comment #1 from Robin Dapp --- I think the default cost model is not too bad for these simple cases. Our emitted instructions match gimple pretty well. The thing we don't model is vsetvl. We could ignore it under the assumption that it is going to be rather cheap on most uarchs. Something that needs to be fixed is the general costing used for length-masking:

  /* Each may need two MINs and one MINUS to update lengths in body
     for next iteration.  */
  if (need_iterate_p)
    body_stmts += 3 * num_vectors;

We don't actually need the MINs with vsetvl (it does the min for us) so this would need to be adjusted down, provided vsetvl is cheap.

This is the scalar baseline:

.L3:
        lw      a5,0(a0)
        sd      a5,0(a1)
        addi    a0,a0,4
        addi    a1,a1,8
        bne     a4,a0,.L3

While this is what zvl128b would emit:

.L3:
        vsetvli a5,a2,e8,mf8,ta,ma
        vle32.v v2,0(a0)
        vsetvli a4,zero,e64,m1,ta,ma
        vsext.vf2 v1,v2
        vsetvli zero,a2,e64,m1,ta,ma
        vse64.v v1,0(a1)
        slli    a4,a5,2
        add     a0,a0,a4
        slli    a4,a5,3
        add     a1,a1,a4
        sub     a2,a2,a5
        bne     a2,zero,.L3

With a vectorization factor of 2 (it might effectively be higher of course but possibly unknown at compile time) I'm not sure vectorization is always a win, and the costs actually reflect that. If we disregard vsetvl for now we have 8 instructions in the vectorized loop and 2 * 4 instructions in the scalar loop for the same amount of data. Factoring in the vsetvls I'd say it's worse. Once we statically know the VF is higher, we will vectorize.
[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153 --- Comment #2 from Robin Dapp --- With the current trunk we don't spill anymore (VLS):

.L4:
        vle32.v v2,0(a5)
        vadd.vv v1,v1,v2
        addi    a5,a5,16
        bne     a5,a4,.L4

Considering just that loop I'd say costing works as designed. Even though the epilogue and boilerplate code seem "crude" the main loop is as short as it can be and is IMHO preferable.

.L3:
        vsetvli a5,a1,e32,m1,tu,ma
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vadd.vv v1,v2,v1
        bne     a1,zero,.L3

This has 6 instructions (disregarding the jump) and can't be faster than the 3 instructions of the VLS loop. Provided we iterate often enough the VLS loop should always be a win.

Regarding "looking slow" - I think ideally we would have the VLS loop followed directly by the VLA loop for the residual iterations and next to no additional statements. That would require changes in the vectorizer, though.

In total: I think the current behavior is reasonable.
[Bug c/111153] RISC-V: Incorrect Vector cost model for reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111153 --- Comment #4 from Robin Dapp --- Yes, with VLS reduction this will improve. On aarch64 + sve I see

  loop inside costs: 2

This is similar to our VLS costs. And their loop is indeed short:

        ld1w    z30.s, p7/z, [x0, x2, lsl 2]
        add     x2, x2, x3
        add     z31.s, p7/m, z31.s, z30.s
        whilelo p7.s, w2, w1
        b.any   .L3

Not much to be squeezed out with a VLS approach. I guess that's why.
[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #2 from Robin Dapp --- I played around with this a bit. Emitting a COND_LEN in if-convert is easy:

  _ifc__35 = .COND_ADD (_23, init_20, _8, init_20);

However, during reduction handling we rely on the reduction being a gimple assign and a binary operation, so I needed to fix some places and indices as well as use the proper mask. What complicates things a bit is that we assume that "init_20" (i.e. the reduction def) occurs once when we have it twice in the COND_ADD. I just special-cased that for now. Is this the proper thing to do?

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 23c6e8259e7..e99add3cf16 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3672,7 +3672,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 static bool
 fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn)
 {
-  if (code == PLUS_EXPR)
+  if (code == PLUS_EXPR || code == IFN_COND_ADD)
     {
       *reduc_fn = IFN_FOLD_LEFT_PLUS;
       return true;
@@ -4106,8 +4106,11 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
           return NULL;
         }

-      nphi_def_loop_uses++;
-      phi_use_stmt = use_stmt;
+      if (use_stmt != phi_use_stmt)
+        {
+          nphi_def_loop_uses++;
+          phi_use_stmt = use_stmt;
+        }
@@ -7440,6 +7457,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       if (i == STMT_VINFO_REDUC_IDX (stmt_info))
         continue;

+      if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
+        continue;
+

Apart from that I think what's mainly missing is making the added code nicer. Going to attach a tentative patch later.
[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401 --- Comment #3 from Robin Dapp --- Several other things came up, so I'm just going to post the latest status here without having revised or tested it. Going to try fixing it and testing tomorrow.

--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3672,7 +3672,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 static bool
 fold_left_reduction_fn (code_helper code, internal_fn *reduc_fn)
 {
-  if (code == PLUS_EXPR)
+  if (code == PLUS_EXPR || code == IFN_COND_ADD)
     {
       *reduc_fn = IFN_FOLD_LEFT_PLUS;
       return true;
@@ -4106,8 +4106,13 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
           return NULL;
         }

-      nphi_def_loop_uses++;
-      phi_use_stmt = use_stmt;
+      /* We might have two uses in the same instruction, only count them as
+         one.  */
+      if (use_stmt != phi_use_stmt)
+        {
+          nphi_def_loop_uses++;
+          phi_use_stmt = use_stmt;
+        }
     }

   tree latch_def = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
@@ -6861,7 +6866,7 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo,
                               gimple **vec_stmt, slp_tree slp_node,
                               gimple *reduc_def_stmt,
                               tree_code code, internal_fn reduc_fn,
-                              tree ops[3], tree vectype_in,
+                              tree *ops, int num_ops, tree vectype_in,
                               int reduc_index, vec_loop_masks *masks,
                               vec_loop_lens *lens)
 {
@@ -6883,11 +6888,24 @@
   gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
                         TYPE_VECTOR_SUBPARTS (vectype_in)));

-  tree op0 = ops[1 - reduc_index];
+  /* The operands either come from a binary operation or a COND_ADD operation.
+     The former is a gimple assign and the latter is a gimple call with four
+     arguments.  */
+  gcc_assert (num_ops == 2 || num_ops == 4);
+  bool is_cond_add = num_ops == 4;
+  tree op0, opmask;
+  if (!is_cond_add)
+    op0 = ops[1 - reduc_index];
+  else
+    {
+      op0 = ops[2];
+      opmask = ops[0];
+      gcc_assert (!slp_node);
+    }

   int group_size = 1;
   stmt_vec_info scalar_dest_def_info;
-  auto_vec vec_oprnds0;
+  auto_vec vec_oprnds0, vec_opmask;
   if (slp_node)
     {
       auto_vec > vec_defs (2);
@@ -6903,9 +6921,18 @@
       vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1,
                                      op0, &vec_oprnds0);
       scalar_dest_def_info = stmt_info;
+      if (is_cond_add)
+        {
+          vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, 1,
+                                         opmask, &vec_opmask);
+          gcc_assert (vec_opmask.length() == 1);
+        }
     }

-  tree scalar_dest = gimple_assign_lhs (scalar_dest_def_info->stmt);
+  gimple *sdef = scalar_dest_def_info->stmt;
+  tree scalar_dest = is_gimple_call (sdef)
+                       ? gimple_call_lhs (sdef)
+                       : gimple_assign_lhs (scalar_dest_def_info->stmt);
   tree scalar_type = TREE_TYPE (scalar_dest);
   tree reduc_var = gimple_phi_result (reduc_def_stmt);
@@ -6945,7 +6972,11 @@
                                    i, 1);
           signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
           bias = build_int_cst (intQI_type_node, biasval);
-          mask = build_minus_one_cst (truth_type_for (vectype_in));
+          /* If we have a COND_ADD take its mask.  Otherwise use {-1, ...}.  */
+          if (is_cond_add)
+            mask = vec_opmask[0];
+          else
+            mask = build_minus_one_cst (truth_type_for (vectype_in));
         }

   /* Handle MINUS by adding the negative.  */
@@ -7440,6 +7471,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
       if (i == STMT_VINFO_REDUC_IDX (stmt_info))
        continue;

+      if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
+        continue;
+
       /* There should be only one cycle def in the stmt, the one leading
          to reduc_def.  */
       if (VECTORIZABLE_CYCLE_DEF (dt))
@@ -8211,8 +8245,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
       vec_num = 1;
     }

-  code_helper code = canonicalize_code (op.code, op.type);
-  internal_fn cond_fn = get_conditional_internal_fn (code, op.type);
+  code_helper code (op.code);
+  internal_fn cond_fn;
+
+  if (code.is_internal_fn ())
+    {
+      internal_fn ifn = internal_fn (op.code);
+      code = canonicalize_code (conditional_internal_fn_code (ifn), op.type);
+      cond_fn = ifn;
+    }
+  else
+    {
+      code = canonicalize_code (op.code, op.type);
+      cond_fn = get_conditional_internal_fn (code, op.type);
+    }
+
   vec_loop_masks *masks = &LOOP_
[Bug middle-end/111401] Middle-end: Missed optimization of MASK_LEN_FOLD_LEFT_PLUS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111401 --- Comment #6 from Robin Dapp --- Created attachment 55902 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55902&action=edit Tentative

You're referring to the case where we have init = -0.0, the condition is false and we end up wrongly doing -0.0 + 0.0 = 0.0? I suppose -0.0 is the proper neutral element for PLUS (and WIDEN_SUM?) when honoring signed zeros? And 0.0 for MINUS? Doesn't that also depend on the rounding mode? neutral_op_for_reduction could return a -0 for PLUS if we honor it for that type. Or is that too intrusive? Guess I should add a test case for that as well.

Another thing is that swapping operands is not as easy with COND_ADD because the addition would be in the else. I'd punt for that case for now.

Next problem - might be a mistake on my side. For avx512 we create a COND_ADD but the respective MASK_FOLD_LEFT_PLUS is not available, causing us to create numerous vec_extracts as fallback that increase the cost until we don't vectorize anymore. Therefore I added a vectorized_internal_fn_supported_p (IFN_FOLD_LEFT_PLUS, TREE_TYPE (lhs)). SLP paths and ncopies != 1 are excluded as well. Not really happy with how the patch looks now but at least the testsuites on aarch and x86 pass.
[Bug target/111488] New: ICE on riscv gcc.dg/vect/vect-126.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488 Bug ID: 111488 Summary: ICE on riscv gcc.dg/vect/vect-126.c Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org Target Milestone: --- Target: riscv

I see an ICE in vect-126.c. A small reproducer is:

int *a[1024], b[1024];

void f1 (void)
{
  for (int i = 0; i < 1024; i++)
    {
      int *p = &b[0];
      a[i] = p + i;
    }
}

vect-126.c:18:1: internal compiler error: Segmentation fault
   18 | }
      | ^
0x111e61f crash_signal
        ../../gcc/toplev.cc:314
0xcfc91d mark_label_nuses
        ../../gcc/emit-rtl.cc:3755
0xcfc969 mark_label_nuses
        ../../gcc/emit-rtl.cc:3763
0xcfc969 mark_label_nuses
        ../../gcc/emit-rtl.cc:3763
0xcfc969 mark_label_nuses
        ../../gcc/emit-rtl.cc:3763

This happens after the splitter (define_insn_and_split "*single_widen_fma". At first glance it seems as if the insn sequence is corrupt as we're looking into a value but I haven't checked further. This is likely the same error that prevents several SPECfp testcases from building. Can investigate further tomorrow.
[Bug target/111488] ICE on riscv gcc.dg/vect/vect-126.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111488 Robin Dapp changed: What|Removed |Added CC||juzhe.zhong at rivai dot ai --- Comment #1 from Robin Dapp --- Also happens in the rvv.exp testsuite now, e.g. gather_load_run-11.c.
[Bug target/111428] RISC-V vector: Flaky segfault in {min|max}val_char_{1|2}.f90
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428 --- Comment #2 from Robin Dapp --- Reproduced locally. The identical binary sometimes works and sometimes doesn't so it must be a race...
[Bug target/111506] RISC-V: Failed to vectorize conversion from INT64 -> _Float16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506 Robin Dapp changed: What|Removed |Added CC||joseph at codesourcery dot com --- Comment #3 from Robin Dapp --- I just got back. The problem with this is not -fno-trapping-math - it will vectorize just fine with -ftrapping-math (and the vectorizer doesn't depend on it either). We also already have tests for this in rvv/autovec/conversions. However, not all int64 values can be represented in the intermediate type int32 which is why we don't vectorize unless the range of a[i] is known to be inside int32's range. If I'm reading the C standard correctly it says such cases are implementation-defined behavior and I'm not sure we should work around the vectorizer by defining an expander that essentially hides the intermediate type. If that's an OK thing to do then I won't complain, though. CC'ing jmyers and rsandi because they would know best. From what I can tell aarch64 also does not vectorize this and I wonder why LLVM's behavior is dependent on -fno-trapping-math. We have the same issue with the reversed conversion from _Float16 to int64. In that case trapping math is relevant, though, but we could apply the same logic as in this patch and circumvent it by an expander. To me this doesn't seem right.
[Bug target/111600] [14 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 Robin Dapp changed: What|Removed |Added CC||law at gcc dot gnu.org --- Comment #12 from Robin Dapp --- We're really at a point where just building becomes a burden and turnaround times are annoyingly high. My suspicion is that the large number of modes combined with the number of insn patterns slows us down. Juzhe added a lot of VLS patterns (or rather added VLS modes to existing patterns) around the Cauldron and this is where we saw the largest relative slowdown. Maybe we need to bite the bullet and not use the convenience helpers anymore or at least very sparingly? I'm going to make some experiments on Wednesday to see where that gets us.
[Bug target/111506] RISC-V: Failed to vectorize conversion from INT64 -> _Float16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111506 --- Comment #5 from Robin Dapp --- Ah, thanks Joseph, so this at least means that we do not need !flag_trapping_math here. However, the vectorizer emulates the 64-bit integer to _Float16 conversion via an intermediate int32_t and now the riscv expander does the same just without the same restrictions. I'm assuming the restrictions currently imposed on two-step vectorizing conversions are correct. For e.g. int64_t -> _Float16 we require the value range of the source fit in int32_t (first step int64_t -> int32_t). For _Float16 -> int64_t we require -fno-trapping-math (first step _Float16 -> int32_t). The latter follows from Annex F of the C standard. Therefore, my general question would rather be: - Is it OK to circumvent either restriction by pretending to have an instruction that performs the conversion in two steps but doesn't actually do the checks? I.e. does "implementation-defined behavior" cover the vectorizer checking one thing and one target not doing it? In our case the int64_t -> int32_t conversion is implementation defined when the source doesn't fit the target.
[Bug target/111600] [14 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 --- Comment #16 from Robin Dapp --- Confirming that it's the compilation of insn-emit.cc which takes > 10 minutes. The rest (including auto generating of files) is reasonably fast. Going to do some experiments with it and see which pass takes the most time.
[Bug target/111600] [14 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 --- Comment #18 from Robin Dapp --- Just finished an initial timing run, sorted, first 10:

Time variable                          usr            sys           wall           GGC
 phase opt and generate    : 567.60 ( 97%)  38.23 ( 87%) 608.13 ( 97%) 22060M ( 90%)
 callgraph functions expansion : 491.16 ( 84%)  31.48 ( 72%) 524.60 ( 83%) 18537M ( 75%)
 integration               :  90.09 ( 15%)  11.68 ( 27%) 103.25 ( 16%) 13408M ( 54%)
 tree CFG cleanup          :  74.43 ( 13%)   1.02 (  2%)  74.66 ( 12%)   201M (  1%)
 callgraph ipa passes      :  70.16 ( 12%)   6.21 ( 14%)  76.66 ( 12%)  2921M ( 12%)
 tree STMT verifier        :  64.03 ( 11%)   3.52 (  8%)  67.61 ( 11%)     0  (  0%)
 tree CCP                  :  44.78 (  8%)   2.91 (  7%)  47.65 (  8%)   314M (  1%)
 integrated RA             :  42.82 (  7%)   0.86 (  2%)  42.71 (  7%)   880M (  4%)
 `- tree CFG cleanup       :  30.57 (  5%)   0.38 (  1%)  32.03 (  5%)   198M (  1%)
 `- tree CCP               :  29.78 (  5%)   0.05 (  0%)  29.87 (  5%)   168M (  1%)
 tree SSA verifier         :  28.07 (  5%)   1.42 (  3%)  30.91 (  5%)     0  (  0%)

Per-function sorted expansion time (first 10):

insn_code maybe_code_for_pred_indexed_store(int, machine_mode, machine_mode);  3.05
insn_code maybe_code_for_pred_indexed_load(int, machine_mode, machine_mode);   2.68
insn_code maybe_code_for_pred(int, machine_mode);                              1.49
rtx_insn* gen_split_4213(rtx_insn*, rtx_def**);                                1.33
insn_code maybe_code_for_pred_scalar(rtx_code, machine_mode);                  1.18
rtx_insn* gen_split_1266(rtx_insn*, rtx_def**);                                0.70
insn_code maybe_code_for_pred_slide(int, machine_mode);                        0.51
insn_code maybe_code_for_pred_scalar(int, machine_mode);                       0.34
insn_code maybe_code_for_pred_dual_widen(rtx_code, rtx_code, machine_mode);    0.30
insn_code maybe_code_for_pred_dual_widen_scalar(rtx_code, rtx_code, machine_mode); 0.29

Expanding all splitter functions (~8000) takes 214s, so roughly 40% of the expansion time. This we wouldn't get rid of even when not using insn helpers.
[Bug target/111600] [14 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 --- Comment #20 from Robin Dapp --- Mhm, why is your profile so different from mine? I'm also on an x86_64 host with a 13.2.1 host compiler (Fedora). Is it because of the preprocessed source? Or am I just reading the timing report wrong?
[Bug target/111600] [14 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 --- Comment #22 from Robin Dapp --- Ah, then it's not that different, your machine is just faster ;)

 callgraph ipa passes      :  69.77 ( 11%)   5.97 ( 13%)  76.05 ( 12%)  2409M ( 10%)
 integration               :  91.95 ( 15%)  12.52 ( 27%) 105.93 ( 16%) 13408M ( 56%)
 tree CFG cleanup          :  76.98 ( 13%)   1.09 (  2%)  78.01 ( 12%)   201M (  1%)
 tree STMT verifier        :  66.62 ( 11%)   3.75 (  8%)  68.31 ( 10%)     0  (  0%)
 integrated RA             :  47.04 (  8%)   1.00 (  2%)  47.79 (  7%)   879M (  4%)
 tree CCP                  :  44.31 (  7%)   3.00 (  6%)  48.39 (  7%)   314M (  1%)
 tree SSA verifier         :  31.40 (  5%)   1.60 (  3%)  32.25 (  5%)     0  (  0%)
 CFG verifier              :  14.93 (  2%)   0.74 (  2%)  16.53 (  3%)     0  (  0%)
 callgraph verifier        :  14.26 (  2%)   1.07 (  2%)  15.55 (  2%)     0  (  0%)
 tree operand scan         :  12.58 (  2%)   3.73 (  8%)  15.14 (  2%)  1649M (  7%)
 verify RTL sharing        :  11.70 (  2%)   0.89 (  2%)  13.31 (  2%)     0  (  0%)
 TOTAL                     : 609.73         46.53        659.45        24127M

FWIW we are much faster with -fno-inline (somewhat expected but I didn't expect a factor of 3):

 callgraph ipa passes      :  53.47 ( 27%)   5.84 ( 26%)  59.52 ( 26%)  2231M ( 26%)
 tree STMT verifier        :  19.67 ( 10%)   1.95 (  9%)  21.47 ( 10%)     0  (  0%)
 tree SSA verifier         :  11.80 (  6%)   1.20 (  5%)  13.32 (  6%)     0  (  0%)
 integrated RA             :   8.73 (  4%)   0.72 (  3%)   9.83 (  4%)   898M ( 10%)
 verify RTL sharing        :   7.90 (  4%)   0.69 (  3%)   8.49 (  4%)     0  (  0%)
 scheduling 2              :   7.32 (  4%)   0.31 (  1%)   7.90 (  4%)    43M (  1%)
 tree PTA                  :   6.68 (  3%)   0.69 (  3%)   7.51 (  3%)    71M (  1%)
 CFG verifier              :   6.67 (  3%)   0.81 (  4%)   7.29 (  3%)     0  (  0%)
 rest of compilation       :   6.42 (  3%)   0.93 (  4%)   6.88 (  3%)    89M (  1%)
 parser function body      :   6.35 (  3%)   2.13 (  9%)   8.40 (  4%)   903M ( 11%)
 TOTAL                     : 201.12         22.90        225.17         8575M
[Bug target/111428] RISC-V vector: Flaky segfault in {min|max}val_char_{1|2}.f90
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111428 --- Comment #3 from Robin Dapp --- Still difficult to track down. The following is a smaller reproducer:

program main
  implicit none
  integer, parameter :: n=5, m=3
  integer, dimension(n,m) :: v
  real, dimension(n,m) :: r

  do
     call random_number(r)
     v = int(r * 2)
     if (count(v < 1) > 1) exit
  end do

  write (*,*) 'asdf'
end program main

I compiled libgfortran without vector support but this doesn't change anything. It's really just the vectorization of that snippet but I haven't figured out why, yet. The stack before the random_number call looks identical. Also tried valgrind which complains about compares dependent on uninitialized data (those only show up once compiled with vectorization). However I suspect those are false positives after chasing them for some hours. Going to try another angle of attack. Maybe it's a really simple thing I overlooked.
[Bug tree-optimization/111760] risc-v regression: COND_LEN_* incorrect fold/simplify in middle-end
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org, ||rguenth at gcc dot gnu.org --- Comment #2 from Robin Dapp --- https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629904.html prevents the wrong code but still leaves us with a redundant negation (and it is not the only missed optimization of that kind):

  vect_neg_xi_14.4_23 = -vect_xi_13.3_22;
  vect_res_2.5_24 = .COND_LEN_ADD ({ -1, ... }, vect_res_1.0_17,
                                   vect_neg_xi_14.4_23, vect_res_1.0_17, _29, 0);

That's because my "hackaround" doesn't recognize a valueized sequence

  _30 = vect_res_1.0_17 - vect_xi_13.3_22;

Of course I could (reverse valueize) recognize that again and convert it to a COND_LEN... but that doesn't seem elegant at all. There must be a simpler way that I'm missing entirely right now. That said, converting the last statement of such a sequence should be sufficient?
[Bug tree-optimization/111760] risc-v regression: COND_LEN_* incorrect fold/simplify in middle-end
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760 --- Comment #6 from Robin Dapp --- Yes, thanks for filing this bug separately. The patch doesn't disable all of those optimizations; of course I paid special attention not to mess with them. The difference here is that we valueize, add statements to *seq, and the last statement is a _30 = bla. Then we'd either need a "_30 = COND_LEN_MOVE bla" or to predicate bla itself. Surely there is a better way.
[Bug bootstrap/116146] Split insn-recog.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116146 --- Comment #3 from Robin Dapp --- On riscv insn-output is the largest file right now as well. I have a local patch that splits it - it's a bit cumbersome because the static initializer needs to be made non-static, i.e. the initialization must happen in an init function that is called at some point. But as a proof of concept it worked. Once I have more time (hah) I'm going to post a patch, but it will still take a while.
[Bug target/111600] [14/15 Regression] RISC-V bootstrap time regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111600 --- Comment #37 from Robin Dapp ---

> The size of the partitions is a little uneven though. Using
> --with-emitinsn-partitions=48 I get some empty partitions and some bigger
> than 2MB:
> Another problematic file is insn-recog.cc which is 19MB and takes 1 hour+
> to compile for me.

It's not very difficult to make the partitions even. I have a patch locally that follows the same approach Tamar took with the match split and it seems to work nicely. I haven't gotten around to testing and posting it yet, though.
[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149 --- Comment #1 from Robin Dapp ---

> Still present when rvv_ta_all_1s=true is omitted.

My result is '0' when rvv_ta_all_1s=false, is that what you meant? I didn't have time to check this in detail but it's not the missing else for masked loads. It looks like we should use the "tu" policy instead of "ta" when doing those intermediate steps. When I change everything with a vl != 4 (so 3 and 1) to the "tu" policy the result is correct. Need to check where we go wrong.
[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149 --- Comment #2 from Robin Dapp --- Correction, it's actually just the wx adds with a length of 1 and those should be "tu". Quite likely this only got exposed recently with the late-combine change in place.
[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149 --- Comment #3 from Robin Dapp --- It looks like the problem is a wrong mode_idx attribute for the wx variants of the adds. A widening add's mode is that of the non-widened input operand, but for the wx/scalar variants this is a scalar mode instead of a vector mode. That confuses avlprop so that it uses 1 instead of 4 as the vector length. Testing a patch.
[Bug target/116149] RISC-V: Miscompile at -O3 with zvl256b -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116149 Robin Dapp changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #5 from Robin Dapp --- Fixed on trunk.
[Bug target/116202] RISC-V: Miscompile at -O3 with zvl256b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116202 Robin Dapp changed: What|Removed |Added CC||pan2.li at intel dot com --- Comment #1 from Robin Dapp --- Looks like a mistake in the SAT_TRUNC pattern. Probably -1 instead of 1.
[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495 Robin Dapp changed: What|Removed |Added Component|rtl-optimization|middle-end --- Comment #6 from Robin Dapp --- Finally looking into this one. The fix is pretty simple and it's similar to other occurrences of smallest_int_mode_for_size. smallest_int_mode_for_size expects to find at least one mode equal to or larger than the provided size but in some cases this fails - in particular when we have full-vector-size structures like here. Testing a patch.
[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495 --- Comment #7 from Robin Dapp --- Ah, hmm, this doesn't seem to occur on trunk anymore for me. It's still likely latent. Patrick, does it still happen for you?
[Bug middle-end/115495] [15 Regression] ICE in smallest_mode_for_size, at stor-layout.cc:356 during combine on RISC-V rv64gcv_zvl256b at -O3 since r15-1042-g68b0742a49d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115495 Robin Dapp changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #10 from Robin Dapp --- Fixed.
[Bug target/116086] RISC-V: Hash mismatch with vectorized 557.xz_r at zvl128b and LMUL=m2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116086 Robin Dapp changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #11 from Robin Dapp --- Fixed. As a reminder for posterity here: Richi called for a unified subreg handling and also argued (I agree) that LMUL > 1 VLS modes that are larger than a minimum-sized vector need to be treated like VLA modes. I don't think we do that everywhere already but let's fix things as they arise.
[Bug target/116242] [meta-bug] Tracker for zvl issues in RISC-V
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116242 Bug 116242 depends on bug 116086, which changed state. Bug 116086 Summary: RISC-V: Hash mismatch with vectorized 557.xz_r at zvl128b and LMUL=m2 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116086 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611 --- Comment #1 from Robin Dapp --- For the record, with the default -march=rv64gcv I don't see any LOAD_LANES, with -march=rv64gcv -mrvv-vector-bits=zvl I do.
[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611 --- Comment #3 from Robin Dapp --- Actually we're already supposed to be handling all constant permutes. Maybe what's in the way is

  /* FIXME: Explicitly disable VLA interleave SLP vectorization when we may
     encounter ICE for poly size (1, 1) vectors in loop vectorizer.
     Ideally, middle-end loop vectorizer should be able to disable it
     itself, We can remove the codes here when middle-end code is able to
     disable VLA SLP vectorization for poly size (1, 1) VF.  */
  if (!BYTES_PER_RISCV_VECTOR.is_constant ()
      && maybe_lt (BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL,
                   poly_int64 (16, 16)))
    return false;

which was introduced in r14-5917-g9f3f0b829b62f1. I'm running the testsuite to see if it's still a problem. If so, let's see if we can work around the issue differently.
[Bug target/116611] Inefficient mix of contiguous and load-lane vectorization due to missing permutes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116611 --- Comment #4 from Robin Dapp --- I just sent a patch to get rid of this early exit in our backend. However, with the testsuite compile options -O3 -march=rv64gcv -fno-vect-cost-model I still see MASK_LEN_LOAD_LANES.
[Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573 --- Comment #7 from Robin Dapp --- I'm testing a patch that basically does what Richi proposes. I was also playing around with mixed lane configurations where we potentially reuse the pointer increment from another pointer update. To me the code looked promising and I think we could at least make it work for a subset of lane configurations. I didn't manage to get everything correct, though, so the patch tries to only restore the status quo.

Some info about vsetvl because the question also came up at the Cauldron: according to the vector spec it has the (for the compiler) annoying property that it can basically set the length freely within a certain range. This is for load-balancing reasons and intended to give hardware implementations more freedom. (I'm not sure that is a useful tradeoff as the compiler's freedom is significantly reduced.)

vsetvl takes the "application vector length" (AVL), i.e. the total number of elements the whole loop wants to process, and returns a vl. VLMAX is the maximum number of elements a single vector (or vector group with LMUL) can hold. If the AVL is larger than VLMAX but <= 2 * VLMAX, vsetvl can set vl to any value inside the range [ceil(AVL / 2), VLMAX]. So for e.g. AVL = 37, ceil(37/2) = 19 would, unfortunately, be a legal vl value. For the other possible values of AVL (<= VLMAX, > 2*VLMAX) the behavior is as expected.

My hope is that most hardware implementations take a saner approach and have vsetvl always act as a min (AVL, VLMAX). That would enable easy scalar evolution and would possibly also allow mixed-lane settings with reuse of the vl value. I suppose we could have a target hook or target query mechanism that asks for "sane" behavior of vsetvl? Thus we could have optimized SELECT_VL behavior for those implementations.
[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476 --- Comment #8 from Robin Dapp --- I tried some things (for the related bug without -fwrapv) then got busy with some other things. I'm going to have another look later this week.
[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247 --- Comment #5 from Robin Dapp --- This fixes the test case for me locally, thanks. I can run the testsuite with it later if you'd like.
[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247 --- Comment #6 from Robin Dapp --- Testsuite looks unchanged on rv64gcv.
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #1 from Robin Dapp --- Hmm, my local version is a bit older and seems to give the same result for both -O2 and -O3. At least a good starting point for bisection then.
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #2 from Robin Dapp --- Checked with the latest commit on a different machine but still cannot reproduce the error. PR114668 I can reproduce. Maybe a copy and paste problem?
[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668 --- Comment #2 from Robin Dapp --- This, again, seems to be a problem with bit extraction from masks. For some reason I didn't add the VLS modes to the corresponding vec_extract patterns. With those in place the problem is gone because we go through the expander which does the right thing. I'm still checking what exactly goes wrong without those as there is likely a latent bug.
[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686 --- Comment #3 from Robin Dapp --- I think we have always maintained that this can definitely be a per-uarch default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority
> of implementations, especially high performance ones would benefit from
> having more work to saturate the execution units with, since a larger LMUL
> works quite similar to loop unrolling.

One argument is reduced freedom for renaming and the out-of-order machinery. It's much easier to shuffle individual registers around than large blocks. Also, lower-latency insns are easier to schedule than longer-latency ones, and faults, rejects, aborts etc. get proportionally more expensive.

I was under the impression that unrolling doesn't help a whole lot (sometimes it even slows things down a bit) on modern cores and certainly is not unconditionally helpful. Granted, I haven't seen a lot of data on it recently. An exception is of course breaking dependency chains.

In general nothing stands in the way of having a particular tune target use dynamic LMUL by default even now, but nobody went ahead and posted a patch for theirs. One could maybe argue that it should be the default for in-order uarchs?

Should it become obvious in the future that LMUL > 1 is indeed, unconditionally, a "better unrolling" because of its favorable icache footprint and other properties (which I doubt - happy to be proved wrong) then we will surely re-evaluate the decision or rather have a different consensus. The data we publicly have so far is all from in-order cores and my expectation is that the picture will change once out-of-order cores hit the scene.
[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668 Robin Dapp changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #4 from Robin Dapp --- I didn't have the time to fully investigate but the default path without vec extract is definitely broken for masks. I'd probably sleep better if we fixed that at some point but for now the obvious fix is to add the missing expanders. Patrick, I'm still unable to reproduce PR114665 (maybe also a qemu difference?). Could you re-check with this fix? Thanks.
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #5 from Robin Dapp --- Weird, I tried your exact qemu version and still can't reproduce the problem. My results are always FFB5. Binutils difference? Very unlikely. Could you post your QEMU_CPU settings just to be sure?
[Bug middle-end/114733] [14] Miscompile with -march=rv64gcv -O3 on riscv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114733 --- Comment #1 from Robin Dapp --- Confirmed, also shows up here.
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #1 from Robin Dapp --- Confirmed.