[Bug tree-optimization/90332] New test case gcc.dg/vect/slp-reduc-sad-2.c in r270847 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90332

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linkw at gcc dot gnu.org

--- Comment #7 from Kewen Lin ---
(In reply to Richard Biener from comment #5)
> I don't see a vec_initv16qiv8qi on power either, so that might be it -
> there's no effective target for building a vector from halves (and I
> wonder how code-generation fares here).
>
> So an option is to simply xfail for all but x86_64-*-* and i?86-*-* ...
>
> Or try more fancy code-generation options (build from two large integer
> modes, but I don't see vec_initv2didi either).

It's weird: I found that rs6000 does support vec_initv2didi.

gcc/insn-opinit.c:  { 0x2f0a36, CODE_FOR_vec_initv2didi },
[Bug testsuite/94023] [9 regression] gcc.dg/vect/slp-perm-12.c fails starting with r9-5008
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94023 Kewen Lin changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Kewen Lin --- Fixed on trunk and backported.
[Bug testsuite/94019] [9 regression] gcc.dg/vect/vect-over-widen-17.c fails starting with g:370c2ebe8fa20e0812cd2d533d4ed38ee2d37c85, r9-1590
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94019 Kewen Lin changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Kewen Lin --- Fixed on trunk and backported.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org|linkw at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #3 from Kewen Lin ---
Yes, it very likely just exposes a latent bug. Anyway, I'll have a first look.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #4 from Kewen Lin ---
This was only exposed by my commit; it can also be reproduced without my
commit by using -fno-vect-cost-model.

Some loops we have for this case:

  ;; Loop 1
  ;;  header 3, latch 10
  ;;  depth 1, outer 0
  ;;  nodes: 3 10 8 23 25 34 35 26 29 32 33 38 4 11 37 31

  ;; Loop 2
  ;;  header 4, latch 11
  ;;  depth 2, outer 1
  ;;  nodes: 4 11

  ;; Loop 4
  ;;  header 26, latch 29
  ;;  depth 2, outer 1
  ;;  nodes: 26 29

When we do the versioning for loop 4 that is required for the aliasing check,
the related scalar_loop_iters is based on e2.2_31, which is defined in BB 4,
that is:

   [local count: 4343773762]:
  # e2.2_31 = PHI <_15(11), 1(37)>
  # ivtmp_14 = PHI

For the code:

  if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p))))
      && flow_bb_inside_loop_p (outermost, def_bb))
    outermost = superloop_at_depth (loop, bb_loop_depth (def_bb) + 1)

bb_loop_depth is 2, so the +1 makes the assertion in superloop_at_depth fail,
since the current loop 4 only has depth 2. I think the existing code assumes
that all operands in the stmts of cond_expr_stmt_list are defined in some
outer loop of the current one, but that assumption breaks in this case.

I guess the current scalar_loop_iters is valid? Then the fix can be:

--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -3312,7 +3312,13 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
          FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
            if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p))))
                && flow_bb_inside_loop_p (outermost, def_bb))
-             outermost = superloop_at_depth (loop, loop_depth (outermost) + 1);
+             {
+               /* Def block can be in either one outer loop of loop_to_version or
+                  one sibling of outer loop of loop_to_version.  */
+               class loop *common_loop
+                 = find_common_loop (def_bb->loop_father, loop);
+               outermost = superloop_at_depth (loop, loop_depth (common_loop) + 1);
+             }
        }
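
The depth arithmetic described in comment #4 can be sketched with a toy loop tree. This is only an illustration under stated assumptions: the toy_* names and the struct are hypothetical stand-ins, not GCC's cfgloop API, and the toy returns NULL where GCC's superloop_at_depth would hit its assertion. It models the loop nest from the dump (loops 2 and 4 are siblings inside loop 1) and compares the existing computation with the one proposed in the comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a loop: only the parent link and the nesting depth.
   Hypothetical type, not GCC's "class loop".  */
struct toy_loop
{
  struct toy_loop *outer; /* Enclosing loop, NULL for the function root.  */
  unsigned depth;         /* 0 for the root, 1 for outermost real loops.  */
};

/* Return the ancestor of LOOP (or LOOP itself) at DEPTH, or NULL when
   DEPTH is deeper than LOOP.  The real function asserts instead.  */
struct toy_loop *
toy_superloop_at_depth (struct toy_loop *loop, unsigned depth)
{
  if (depth > loop->depth)
    return NULL;
  while (loop->depth > depth)
    loop = loop->outer;
  return loop;
}

/* Return the innermost loop that contains both A and B.  */
struct toy_loop *
toy_find_common_loop (struct toy_loop *a, struct toy_loop *b)
{
  while (a->depth > b->depth)
    a = a->outer;
  while (b->depth > a->depth)
    b = b->outer;
  while (a != b)
    {
      a = a->outer;
      b = b->outer;
    }
  return a;
}

/* The loop nest from the dump: loops 2 and 4 are siblings in loop 1.  */
struct toy_loop toy_root = { NULL, 0 };
struct toy_loop toy_loop1 = { &toy_root, 1 };
struct toy_loop toy_loop2 = { &toy_loop1, 2 };
struct toy_loop toy_loop4 = { &toy_loop1, 2 };

/* Existing computation: uses the depth of the defining block's loop.  */
struct toy_loop *
toy_outermost_old (struct toy_loop *loop, struct toy_loop *def_loop)
{
  return toy_superloop_at_depth (loop, def_loop->depth + 1);
}

/* Proposed computation: uses the depth of the common loop instead.  */
struct toy_loop *
toy_outermost_new (struct toy_loop *loop, struct toy_loop *def_loop)
{
  struct toy_loop *common = toy_find_common_loop (def_loop, loop);
  return toy_superloop_at_depth (loop, common->depth + 1);
}
```

With the definition in sibling loop 2 and loop 4 being versioned, the old computation asks for depth 3 of a depth-2 loop (NULL in the toy), while the common-loop variant asks for depth 2 and gets loop 4 itself.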
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #6 from Kewen Lin ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 20 Mar 2020, linkw at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043
> >
> > --- Comment #4 from Kewen Lin ---
> > This was just exposed from my commit, it can also be reproduced without my
> > commit but with -fno-vect-cost-model.
> >
> > Some loops we have for this case:
> >   ;; Loop 1
> >   ;;  header 3, latch 10
> >   ;;  depth 1, outer 0
> >   ;;  nodes: 3 10 8 23 25 34 35 26 29 32 33 38 4 11 37 31
> >
> >   ;; Loop 2
> >   ;;  header 4, latch 11
> >   ;;  depth 2, outer 1
> >   ;;  nodes: 4 11
> >
> >   ;; Loop 4
> >   ;;  header 26, latch 29
> >   ;;  depth 2, outer 1
> >   ;;  nodes: 26 29
> >
> >
> > When we are doing versioning for loop4 required for aliasing check, the
> > related
> > scalar_loop_iters is based on e2.2_31, which is defined in BB 4, that is:
> >
> >    [local count: 4343773762]:
> >   # e2.2_31 = PHI <_15(11), 1(37)>
> >   # ivtmp_14 = PHI
> >
> >
> > For the codes:
> >
> >   if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p))))
> >       && flow_bb_inside_loop_p (outermost, def_bb))
> >     outermost = superloop_at_depth (loop, bb_loop_depth (def_bb) + 1)
> >
> > bb_loop_depth is 2, the +1 make the assertion in superloop_at_depth fail
> > since
> > the current loop 4 only has the depth 2. I think the existing code has the
> > assumption that all operands in stmts of cond_expr_stmt_list are defined in
> > some outer loop of current, but the assumption breaks in this case.
> >
> > I guess the current scalar_loop_iters is valid? Then the fix can be:
>
> What is not valid I think is that e2.2_31 should have a loop-closed PHI
> node which would place it in an outer loop. You'd have to see why
> either the loop-closed PHI is not present or why the aliasing check
> doesn't use that (it's more likely this)
>

Thanks for the confirmation Richi!

There is a loop-closed PHI for it in bb 33:

   [local count: 35145078524]:
  # e2.2_31 = PHI <_15(11), 1(31)>
  # ivtmp_14 = PHI
  _11 = (integer(kind=8)) e2.2_31;
  _12 = _10 + _11;
  _13 = _12 + -7;
  hx[_13] = 0;
  _15 = e2.2_31 + 1;
  ivtmp_23 = ivtmp_14 - 1;
  if (ivtmp_23 == 0)
    goto ; [11.00%]
  else
    goto ; [89.00%]

   [local count: 3865958617]:
  # _51 = PHI <_15(4)>

I'll further investigate why scalar_loop_iters is constructed directly from
e2.2_31 instead of _51.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #8 from Kewen Lin ---
> It's most likely either SCEV or expand_simple_operations looking through
> the single-arg PHI (which we should avoid for LC PHI nodes)

Thanks Richi, I found that the loop-closed PHI form was broken after we
finished the vectorization of loop 2: BB 38 was inserted, because the function
gimple_find_edge_insert_loc creates a new BB if the dest has PHIs, even if
they are unrelated.

  ;;   basic block 4, loop depth 2
  ;;    pred:       11
  ;;                37
  ...
  _15 = e2.2_31 + 1;
  ...
  if (ivtmp_59 >= 1)
    goto ; [100.00%]
  else
    goto ; [0.00%]
  ;;    succ:       38
  ;;                11

  ;;   basic block 38, loop depth 1
  ;;    pred:       4
  _40 = BIT_FIELD_REF ;
  ;;    succ:       33

  ;;   basic block 33, loop depth 1
  ;;    pred:       38
  # _51 = PHI <_15(38)>
  ;;    succ:       34

The alternatives seem to be:

1) extend the current gimple_find_edge_insert_loc to handle the PHI nodes:
   if the PHIs aren't related, just insert there; otherwise, insert some PHIs
   for the uses of those stmts, remove the related PHIs and create new
   assignments after those new stmts, or
2) call rewrite_into_loop_closed_ssa for each successful vectorization.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #10 from Kewen Lin ---
(In reply to Richard Biener from comment #9)
> OK, so it's indeed vectorizable_live_operation not paying attention to
> loop-closed SSA form.
>
> What it should do before building the lane extract is create a _new_
> loop-closed PHI node for the vectorized def (vec_lhs), and then
> demote the loop-closed PHI node for the scalar def (lhs) which should
> _always_ exist to a copy. So from
>
>
>  loop;
>
>  # lhs' = PHI
>
>
> go to
>
>  loop;
>
>  # vec_lhs' = PHI
>  new_tree = BIT_FIELD_REF ;
>  lhs' = new_tree;
>
> I think you can assert that the block of the loop-closed PHI
> (single_exit()->dest) always has a single predecessor, otherwise
> things will be more complicated.
>
> Can you try rework the code in this way? If that's too much just tell
> me and I'll take care of this.

Thanks Richi, I'll give it a shot! I'd like to check my understanding: with
the proposed fix, single_exit()->dest is always the correct BB to insert into,
so there is no chance of getting a new BB to insert into as with
gimple_find_edge_insert_loc. Is that right?
[Bug testsuite/93935] [9/10 regression] gcc.dg/vect/bb-slp-over-widen-2.c fails starting with r262371 (r10-6856)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93935 Kewen Lin changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #5 from Kewen Lin --- Should be fixed on both trunk and gcc-9.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #12 from Kewen Lin ---
Created attachment 48122
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48122&action=edit
ppc64le tested patch

Thanks Richi! A draft patch is attached to make sure I'm on the right track;
it has also been bootstrapped and regression tested.

I tried to reproduce the case where the stmts for the lane extraction are
empty (due to folding) with the test cases associated with that old commit,
but failed. I think we don't need to deal with it?

The new copy assignment that replaces the PHI cannot be caught by the LC-PHI
check in expand_simple_operations.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #14 from Kewen Lin ---
(In reply to Richard Biener from comment #13)
>
> +         /* Find all SSA NAMEs in stmts which is defined in current loop,
> create
> +            PHIs for them, and replace them with phi results accordingly.  */
> +         for (gsi = gsi_start (stmts); !gsi_end_p (gsi); gsi_next (&gsi))
> +           {
> +             gimple *stmt = gsi_stmt (gsi);
> +             update_stmt (stmt);
> +
> ...
>
> should not be necessary. What's missing in your patch is that when the
> current code has computed vec_lhs it needs to create a LC PHI node for it
> _before_ computing the lane extraction and instead use vec_lhs' there.

OK. I was thinking that the mask in the LOOP_VINFO_FULLY_MASKED_P part is
probably an SSA name and could live out of the loop; going by your comments,
that looks impossible. I will update the patch and send it for review after
testing. Thanks again!
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043 Kewen Lin changed: What|Removed |Added Attachment #48122|0 |1 is obsolete|| --- Comment #16 from Kewen Lin --- Created attachment 48125 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48125&action=edit untested patch
[Bug tree-optimization/90332] New test case gcc.dg/vect/slp-reduc-sad-2.c in r270847 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90332 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #9 from Kewen Lin --- Should be fixed by latest trunk on ppc64le P9.
[Bug tree-optimization/94401] pr92420.c fails on aarch64 since r10-7415
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401 Kewen Lin changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org CC||linkw at gcc dot gnu.org --- Comment #1 from Kewen Lin --- Thanks for reporting this! Do I need some special arch configuration options for the gcc build to reproduce this on some aarch machine in CFarm? or fine with default?
[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org,
                   |                            |wschmidt at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #4 from Kewen Lin ---
My commit extends the current elimination of scalar epilogue peeling for gaps,
which lets this case use int for the construction. But it reveals that the
existing handling misses the VMAT_CONTIGUOUS_REVERSE case: currently it
assumes the overrun happens at the high address end. That is true for almost
all cases, but in this case it is at the low address end. So for
VMAT_CONTIGUOUS_REVERSE we have to load the high part and put it in the latter
part of the constructed vector.

The IR before/after the commit looks like:

good:

  vect__9.16_80 = MEM [(int *)vectp_y.14_78];
  vect__9.17_81 = VEC_PERM_EXPR ;
  vect__9.18_82 = VEC_PERM_EXPR ;

bad:

  _30 = MEM[(int *)vectp_y.12_34];
  _20 = {_30, 0};
  vect__9.14_19 = VIEW_CONVERT_EXPR(_20);
  vect__9.15_61 = VEC_PERM_EXPR ;
  vect__9.16_54 = VEC_PERM_EXPR ;
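
To make the lane placement in comment #4 concrete, here is a minimal scalar sketch. It assumes a four-lane vector, a two-element half load, and a single reversing permute; the toy_* helpers are hypothetical and are not what the vectorizer emits. It only shows why the loaded half has to go into the latter lanes when the access is reversed.

```c
#include <assert.h>

#define TOY_NLANES 4

/* Reverse the lanes of IN into OUT, standing in for the reversing permute.  */
void
toy_reverse (const int in[TOY_NLANES], int out[TOY_NLANES])
{
  for (int i = 0; i < TOY_NLANES; i++)
    out[i] = in[TOY_NLANES - 1 - i];
}

/* Build a full vector from a two-element half loaded at HALF.  When
   HALF_IN_HIGH is nonzero the loaded half fills the latter two lanes and
   the leading lanes are zero; otherwise it fills the leading two lanes.  */
void
toy_build_from_half (const int *half, int half_in_high, int out[TOY_NLANES])
{
  for (int i = 0; i < TOY_NLANES; i++)
    out[i] = 0;
  int base = half_in_high ? TOY_NLANES / 2 : 0;
  out[base] = half[0];
  out[base + 1] = half[1];
}

/* Emulate a reversed access ending at Y[LAST] where only the two highest
   addressed elements are wanted; the two lower ones are the gap, so the
   overrun of a full-width load would be at the low address end.  */
void
toy_reverse_load (const int *y, int last, int half_in_high,
                  int out[TOY_NLANES])
{
  int tmp[TOY_NLANES];
  toy_build_from_half (&y[last - 1], half_in_high, tmp);
  toy_reverse (tmp, out);
}
```

With y = {10, 11, 12, 13} and last = 3, placing the half in the latter lanes gives {13, 12, 0, 0} after the reversal, so the wanted elements land in the leading lanes. Placing it first, as in the {_30, 0} constructor of the bad IR, gives {0, 0, 13, 12}.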
[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401 --- Comment #5 from Kewen Lin --- Created attachment 48150 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48150&action=edit untested patch This can fix the REG failures on aarch64.
[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043 Kewen Lin changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #19 from Kewen Lin --- should be fixed on trunk now.
[Bug tree-optimization/94043] [9 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #21 from Kewen Lin ---
(In reply to Richard Biener from comment #20)
> Re-open. It's marked as broken in GCC 9 so a backport is in order (if the
> issue really reproduces there).

Thanks for pointing it out. I'll backport it in two weeks if no regressions
are found on trunk.
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443 Kewen Lin changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org --- Comment #3 from Kewen Lin --- Thanks for reporting this, confirmed.
[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org|linkw at gcc dot gnu.org

--- Comment #7 from Kewen Lin ---
Thanks for reporting this; it looks like a duplicate of pr94443.
[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linkw at gcc dot gnu.org
             Status|UNCONFIRMED                 |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|linkw at gcc dot gnu.org
   Last reconfirmed|                            |2020-04-02
     Ever confirmed|0                           |1

--- Comment #3 from Kewen Lin ---
Thanks for reporting this, Mike. It looks like a duplicate of pr94443.
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #4 from Kewen Lin ---
This case has one conversion insn generated after the bit_field_ref. The patch
made a stupid mistake in using gsi_insert_before instead of
gsi_insert_seq_before, which drops the conversion insn.

The patch below makes it work. It also polishes the copy-related code a bit,
although that is not really necessary to make this case pass.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c9b6534..4c2c9f7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8050,7 +8050,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
       if (stmts)
        {
          gimple_stmt_iterator exit_gsi = gsi_after_labels (exit_bb);
-         gsi_insert_before (&exit_gsi, stmts, GSI_CONTINUE_LINKING);
+         gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);

          /* Remove existing phi from lhs and create one copy from new_tree.  */
          tree lhs_phi = NULL_TREE;
@@ -8060,10 +8060,10 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
              gimple *phi = gsi_stmt (gsi);
              if ((gimple_phi_arg_def (phi, 0) == lhs))
                {
-                 remove_phi_node (&gsi, false);
                  lhs_phi = gimple_phi_result (phi);
                  gimple *copy = gimple_build_assign (lhs_phi, new_tree);
-                 gsi_insert_after (&exit_gsi, copy, GSI_CONTINUE_LINKING);
+                 gsi_insert_after (&exit_gsi, copy, GSI_NEW_STMT);
+                 remove_phi_node (&gsi, false);
                  break;
                }
            }
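
The difference between inserting a single statement and inserting a whole sequence can be sketched with a toy linked list. The types and toy_* functions below are hypothetical and are not GCC's gimple iterator API. The sketch assumes, as the comment implies, that a statement sequence is referred to by its first statement, so a sequence can be passed where a single statement is expected without any complaint.

```c
#include <assert.h>
#include <stddef.h>

/* Toy statement node; a sequence is just a pointer to its first node.  */
struct toy_stmt
{
  const char *text;
  struct toy_stmt *next;
};

/* Toy basic block holding up to eight statement texts in order.  */
struct toy_block
{
  const char *stmts[8];
  size_t count;
};

/* Append only the single statement STMT, ignoring anything linked after it.  */
void
toy_insert_one (struct toy_block *bb, const struct toy_stmt *stmt)
{
  if (bb->count < 8)
    bb->stmts[bb->count++] = stmt->text;
}

/* Append every statement of the sequence starting at SEQ.  */
void
toy_insert_seq (struct toy_block *bb, const struct toy_stmt *seq)
{
  for (; seq != NULL; seq = seq->next)
    toy_insert_one (bb, seq);
}
```

If the sequence holds a lane extract followed by a conversion, toy_insert_one places only the extract. A later copy that uses the conversion result then refers to a name that was never defined in the block, which is the kind of breakage verify_ssa reports.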
[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #8 from Kewen Lin ---
May I ask for the configuration options?

I used an x86_64 machine in CFarm with this cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
...

I was unable to reproduce it with the default configuration, but I was able
to see the ICE for the failures with -march=znver2 specified. I suspect there
is some basic arch setting in your configuration? If so, I wonder whether
there is a more reasonable configuration option for the E5 machine; it would
help to catch regression failures like this.

Thanks in advance!
[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

--- Comment #10 from Kewen Lin ---
(In reply to H.J. Lu from comment #9)
> (In reply to Kewen Lin from comment #8)
> > May I ask for the configuration option?
> >
> > I used x86_64 machine in CFarm with cpuinfo
> >
>
> I used
>
> --prefix=/usr/10.0.1 --enable-clocale=gnu --with-system-zlib --enable-shared
> --enable-cet --with-demangler-in-ld --with-fpmath=sse

Thanks, but it didn't work on my side. I guess it's due to a different native
arch.

gcc -march=native -Q --help=target|grep march
  -march=                       corei7-avx

$ ./t.sh -march=znver2
internal compiler error: verify_ssa failed
$ ./t.sh -march=icelake-server
internal compiler error: verify_ssa failed
$ ./t.sh -march=corei7-avx
==> works fine.

I guess I can't just specify an arch option like --with-arch=znver2 for
configure, since the native arch probably lacks support for some znver2
instructions? I'm not familiar with the x86 arch; is that possible?
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443 Kewen Lin changed: What|Removed |Added CC||hjl.tools at gmail dot com --- Comment #5 from Kewen Lin --- *** Bug 94449 has been marked as a duplicate of this bug. ***
[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|ASSIGNED                    |RESOLVED

--- Comment #11 from Kewen Lin ---
Verified that the patch in pr94443 fixes these failures as well.

*** This bug has been marked as a duplicate of bug 94443 ***
[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449 --- Comment #12 from Kewen Lin --- Sorry, correction: corei7-avx is from system gcc. With my built gcc, it's sandybridge. But no difference for the pass/fail result.
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #7 from Kewen Lin ---
Yes, thanks Richi! I had the same update locally but didn't update it here.
The latest whole patch is:

diff --git a/gcc/testsuite/gcc.dg/vect/pr94443.c b/gcc/testsuite/gcc.dg/vect/pr94443.c
new file mode 100644
index 000..f8cbaf1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr94443.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=znver2" { target { x86_64-*-* i?86-*-* } } } */
+
+/* Check it to be compiled successfully without any ICE.  */
+
+int a;
+unsigned *b;
+
+void foo()
+{
+  for (unsigned i; i <= a; ++i, ++b)
+    ;
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c9b6534..b621f89 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8050,7 +8050,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
       if (stmts)
        {
          gimple_stmt_iterator exit_gsi = gsi_after_labels (exit_bb);
-         gsi_insert_before (&exit_gsi, stmts, GSI_CONTINUE_LINKING);
+         gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);

          /* Remove existing phi from lhs and create one copy from new_tree.  */
          tree lhs_phi = NULL_TREE;
@@ -8060,10 +8060,10 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
              gimple *phi = gsi_stmt (gsi);
              if ((gimple_phi_arg_def (phi, 0) == lhs))
                {
-                 remove_phi_node (&gsi, false);
                  lhs_phi = gimple_phi_result (phi);
                  gimple *copy = gimple_build_assign (lhs_phi, new_tree);
-                 gsi_insert_after (&exit_gsi, copy, GSI_CONTINUE_LINKING);
+                 gsi_insert_before (&exit_gsi, copy, GSI_SAME_STMT);
+                 remove_phi_node (&gsi, false);
                  break;
                }
            }

Still waiting for the regression testing result.
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #8 from Kewen Lin ---
> >
> > +                 remove_phi_node (&gsi, false);
>
> I prefer to have the PHI removed before you re-use its LHS.
>

Oops, I missed this; I will move it back when posting to the mailing list.
[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451 Kewen Lin changed: What|Removed |Added Resolution|DUPLICATE |FIXED --- Comment #6 from Kewen Lin --- Reproduced and verified with the proposed fix in pr94443, sorry for the trouble.
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443 Kewen Lin changed: What|Removed |Added CC||clyon at gcc dot gnu.org --- Comment #10 from Kewen Lin --- *** Bug 94456 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/94456] ICE in aarch64/sve/pr87815.c since r10-7491
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94456

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |linkw at gcc dot gnu.org
         Resolution|---                         |DUPLICATE

--- Comment #1 from Kewen Lin ---
Thanks for reporting; judging by the symptom, this should be a duplicate.

*** This bug has been marked as a duplicate of bug 94443 ***
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #13 from Kewen Lin ---
(In reply to Khem Raj from comment #11)
> this patch seems to be causing gcc ICE on ARM when compiling lz4 sources in
> kernel, lz4, vlc almost identical ICE is seen
>
> attached is the test case please compile it with -O3
>
> during GIMPLE pass: vect
> lz4.c: In function 'LZ4_compress_fast_extState':
> lz4.c:1180:5: internal compiler error: Segmentation fault
>  1180 | int LZ4_compress_fast_extState(void* state, const char* source,
> char* dest, int inputSize, int maxOutputSize, int acceleration)
>       |     ^~
> Please submit a full bug report,

Same symptom:

for SSA_NAME: _1689 in statement:
op_1747 = _1689;
during GIMPLE pass: vect
lz4.c:1180:5: internal compiler error: verify_ssa failed
0x100d0ab verify_ssa(bool, bool)

I verified that it can be fixed with the patch posted to the gcc-patches ML:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543137.html
[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401 Kewen Lin changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #7 from Kewen Lin --- Should be fixed now.
[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |DUPLICATE

--- Comment #7 from Kewen Lin ---
Correcting the status; it looks like it was updated by mistake somehow.

*** This bug has been marked as a duplicate of bug 94443 ***
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443 --- Comment #15 from Kewen Lin --- *** Bug 94451 has been marked as a duplicate of this bug. ***
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 94443, which changed state. Bug 94443 Summary: [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #17 from Kewen Lin ---
(In reply to Martin Liška from comment #16)
> Can we close it as fixed?

I guess so. This commit should be part of the backport for PR94393, but I
guess I can just leave that bug open and close this one? Please feel free to
correct me.
[Bug testsuite/94079] gfortran.dg/vect/pr83232.f90 fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94079

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

commit r10-7646-ge7c4084d11b957d925ba586f86db2f346fb3bfe0
Author: Kewen Lin
Date:   Wed Apr 8 21:52:00 2020 -0500

    [testsuite] Fix PR94079 by respecting vect_hw_misalign [PR94079]

    This is another vect case which requires special handling with
    vect_hw_misalign.  The alignment of the second part requires
    misaligned vector access supports.  This patch is to adjust
    the related guard condition and comments.

    Verified it on ppc64-redhat-linux (Power7 BE).

    2020-04-09  Kewen Lin

    gcc/testsuite/ChangeLog

        PR testsuite/94023
        * gfortran.dg/vect/pr83232.f90: Expect 2 rather than 3 times SLP
        on non-vect_hw_misalign targets.

The PR number in the commit was wrong; I have fixed it. The commit message is
pasted here manually.
[Bug tree-optimization/94043] [9 Regression] ICE in superloop_at_depth, at cfgloop.c:78
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043 Kewen Lin changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #24 from Kewen Lin --- Backported via r9-8506 and its related r9-8507.
[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451 Kewen Lin changed: What|Removed |Added Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org Last reconfirmed||2020-08-04 Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Kewen Lin --- Thanks for reporting! I will have a look at it.
[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

--- Comment #3 from Kewen Lin ---
(In reply to Richard Biener from comment #2)
> possibly a latent issue since the patch is supposed to be cost-only

Yes, this case also hits the ICE with -fno-vect-cost-model, even without the
culprit commit.

Without that commit, the costing says it's not profitable to vectorize the
epilogue further, while with it we are able to vectorize the epilogue. The
forced option -fdbg-cnt=vect_loop:1 only allows us to vectorize one loop, so
it skips the epilogue, which has the scalar mask_store statement from if-cvt
and has already been determined to be vectorized.

I'm not sure what the dbg counter should mean for loop vect. If it counts
original scalar loops, then both the main vectorized loop and the epilogue
loop to be vectorized should be vectorized. The fix could be:

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 26a1846..150bdcf 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1066,7 +1066,7 @@ try_vectorize_loop_1 (hash_table *&simduid_to_vf_htab,
       return ret;
     }

-  if (!dbg_cnt (vect_loop))
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !dbg_cnt (vect_loop))
     {
       /* Free existing information if loop is analyzed with some
          assumptions.  */

If the dbg counter counts all kinds of loops (main or epilogue), the fix seems
to be: add an interface to the dbg counter framework to query the remaining
allowed count, compare the remaining number with the number of epilogue loops
in vect_do_peeling, and then remove the excess epilogue loops there.
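
The effect of the proposed guard can be sketched with a toy counter. The struct and toy_* functions are hypothetical and do not reproduce GCC's dbgcnt framework; the sketch only assumes a counter that permits a fixed number of queries, in the spirit of -fdbg-cnt=vect_loop:1.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy debug counter: allows the first LIMIT queries, then refuses.  */
struct toy_counter
{
  unsigned used;
  unsigned limit;
};

/* Consume one unit of the counter; return false once it is exhausted.  */
bool
toy_counter_query (struct toy_counter *c)
{
  if (c->used >= c->limit)
    return false;
  c->used++;
  return true;
}

/* Existing guard: every loop, main or epilogue, consumes the counter.  */
bool
toy_may_vectorize_old (struct toy_counter *c, bool is_epilogue)
{
  (void) is_epilogue;
  return toy_counter_query (c);
}

/* Proposed guard: an epilogue follows its main loop without a query,
   mirroring "!LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !dbg_cnt (vect_loop)"
   as the bail-out condition.  */
bool
toy_may_vectorize_new (struct toy_counter *c, bool is_epilogue)
{
  return is_epilogue || toy_counter_query (c);
}
```

With a limit of 1, the old guard vectorizes the main loop and then rejects its epilogue, while the new guard lets the epilogue through and still charges the counter only once.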
[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451 --- Comment #5 from Kewen Lin --- Created attachment 49000 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49000&action=edit untested patch Just noticed the dbgcnt supports several intervals, if we want to count epilogue loop, we probably need to call dbgcnt in vect_do_peeling. One untested patch attached to show the idea.
[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

--- Comment #6 from Kewen Lin ---
(In reply to rguent...@suse.de from comment #4)
> On Wed, 5 Aug 2020, linkw at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451
> >
> > --- Comment #3 from Kewen Lin ---
> > (In reply to Richard Biener from comment #2)
> > > possibly a latent issue since the patch is supposed to be cost-only
> >
> > Yes, this case will hit ICE too with -fno-vect-cost-model even without the
> > culprit commit.
> >
> > Without that commit, the costing says it's not profitable to vectorize the
> > epilogue further, while with that we are able to vectorize the epilogue.
> > With
> > the forced option -fdbg-cnt=vect_loop:1, it only allows us to vectorize one
> > loop, so it skips the epilogue which has the scalar mask_store statement
> > from
> > if-cvt and is determined to be vectorized.
> >
> > I'm not sure what the dbg counter should mean for loop vect. If it's for the
> > original scalar loop, then the main vectorized loop and the epilogue loop
> > to be
> > vectorized should be vectorized. The fix could be:
> >
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index 26a1846..150bdcf 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -1066,7 +1066,7 @@ try_vectorize_loop_1 (hash_table
> > *&simduid_to_vf_htab,
> >        return ret;
> >      }
> >
> > -  if (!dbg_cnt (vect_loop))
> > +  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !dbg_cnt (vect_loop))
> >      {
> >        /* Free existing information if loop is analyzed with some
> >          assumptions.  */
> >
> > If the dbg counter is for all kinds of loop (main or epilogue), the fix
> > seems
> > to be: add one interface for dbg counter framework to query the remaining
> > allowed count, compare the remaining number and the number of epilogue
> > loops in
> > vect_do_peeling, then remove the exceeding epilogue loops there.
>
> I think the above patch is OK and is what was originally intended.
>
> Care to push it to master?

Thanks for the confirmation! I'll proceed with a formal patch.
[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

Kewen Lin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #8 from Kewen Lin ---
The proposed fix has been committed in
r11-2585-gea858d09571f3f6dcce92d8bfaf077f9d44c6ad6

Sorry, I forgot to put the PR number in the ChangeLog.
[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077 Kewen Lin changed: What|Removed |Added Last reconfirmed||2020-08-12 Ever confirmed|0 |1 CC||linkw at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Kewen Lin --- This issue only exists on gcc8 and gcc9, it's gone with gcc10 and trunk. The main difference is listed below: with gcc8/gcc9, the cost modeling says it's not profitable because of high cost realign vector load/store for vectorization body, that is: gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: Cost model analysis: Vector inside of loop cost: 32 Vector prologue cost: 6 Vector epilogue cost: 0 Scalar iteration cost: 4 Scalar outside cost: 0 Vector outside cost: 6 prologue iterations: 0 epilogue iterations: 0 gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: cost model: the vector iteration cost = 32 divided by the scalar iteration cost = 4 is greater or equal to the vectorization factor = 4. gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: not vectorized: vectorization not profitable. gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: not vectorized: vector version will never be profitable. 
While with gcc10 and trunk, the information looks like: gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: Cost model analysis: Vector inside of loop cost: 6 Vector prologue cost: 0 Vector epilogue cost: 0 Scalar iteration cost: 6 Scalar outside cost: 0 Vector outside cost: 0 prologue iterations: 0 epilogue iterations: 0 Calculated minimum iters for profitability: 0 gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note:Runtime profitability threshold = 4 gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note:Static estimate profitability threshold = 4 By tracing back, I noticed the difference comes from: gcc8/gcc9 can't force alignment of ref: a[i_12] gcc10/trunk: force alignment of a[i_12] I guess it's not a good idea to backport some patch to get the alignment forced (probably risky?), instead I think we can append an additional option -mefficient-unaligned-vsx together with -mvsx to ensure we can use unaligned vector load/store, or set the target requirement into powerpc_vsx_ok && vect_hw_misalign, both meet the original testing purpose. Hi @Jakub, what do you think of this?
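The "never profitable" decision in the gcc8/gcc9 dump can be sketched as a single comparison. This is a simplified model, not the vectorizer's actual code: `vector_never_profitable` is a hypothetical name, and it assumes the same vectorization factor of 4 for the gcc10 numbers.

```c
#include <stdbool.h>

/* One vector iteration replaces VF scalar iterations, so the vector body
   can only pay off if it costs less than VF scalar iterations.  */
static bool vector_never_profitable (int vec_inside_cost,
                                     int scalar_iter_cost, int vf)
{
  return vec_inside_cost >= scalar_iter_cost * vf;
}
```

With the gcc8/gcc9 costs (32, 4, 4) the check fires, since 32 >= 16; with the gcc10 costs (6, 6, 4) it does not, since 6 < 24.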
[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077 --- Comment #2 from Kewen Lin --- To be more specific, what makes the alignment forcing available is the default setting of -fcommon: -fno-common became the default in GCC 10, which makes decl_binds_to_current_def_p return true. I can observe this case fail with an explicit -fcommon.
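A minimal sketch of the kind of declaration involved, assuming a file-scope array without an initializer like the one in the test; `N` and `sum_a` are illustrative names, not taken from pr82374.c.

```c
#define N 1024

/* Tentative definition (no initializer).  With -fcommon this becomes a
   common symbol that the linker may merge with a definition from another
   translation unit; with -fno-common it is an ordinary definition that
   binds to this unit.  */
int a[N];

/* A simple loop over the global array, the kind of access whose
   alignment the vectorizer would like to force.  */
int sum_a (void)
{
  int s = 0;
  for (int i = 0; i < N; i++)
    s += a[i];
  return s;
}
```

Compiling this with -fcommon and then with -fno-common, and comparing the vectorizer dumps, should show whether the alignment of a[] can be forced.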
[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077 --- Comment #3 from Kewen Lin --- > > I can observe this case fail if with explicit -fcommon. I mean even with gcc10 or trunk.
[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

--- Comment #6 from Kewen Lin ---
(In reply to Jakub Jelinek from comment #5)
> I mean -fno-common, sorry.

Good idea, that works! I'll send a patch adding -fno-common to dg-options. Thanks for your suggestion!
[Bug testsuite/94077] gcc.dg/gomp/pr82374.c fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

Kewen Lin changed:

What       |Removed  |Added
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED

--- Comment #9 from Kewen Lin ---
Should be fixed now.
[Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Bug ID: 96789
Summary: x264: sub4x4_dct() improves when vectorization is disabled
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linkw at gcc dot gnu.org
Target Milestone: ---

One of my workmates found that if we disable vectorization for the SPEC2017 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with the explicit function attribute __attribute__((optimize("no-tree-vectorize"))), it speeds up by 4%.

The options used are: -O3 -mcpu=power9 -fcommon -fno-strict-aliasing -fgnu89-inline

I confirmed this finding, and it can be narrowed down further to SLP vectorization with the attribute __attribute__((optimize("no-tree-slp-vectorize"))).

I also checked this particular file with the r11-0 commit: the performance stays unchanged with or without the vectorization attribute. So I think it's a trunk regression, which probably exposes an SLP flaw or a cost modeling issue.
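For readers who want to try the same experiment, here is a sketch of how the attribute is attached to a function. The helper `pixel_sub_4x4` and the `NO_SLP` macro are hypothetical and only written in the style of the x264 pixel-difference code; this is not the actual x264 source.

```c
#include <stdint.h>

/* The optimize attribute is GCC-specific, so hide it from other
   compilers.  */
#if defined(__GNUC__) && !defined(__clang__)
# define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
# define NO_SLP
#endif

/* Hypothetical 4x4 pixel difference; SLP vectorization is disabled for
   this one function only.  */
NO_SLP void pixel_sub_4x4 (int16_t diff[16],
                           const uint8_t *pix1, int stride1,
                           const uint8_t *pix2, int stride2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        diff[y * 4 + x] = (int16_t) (pix1[x] - pix2[x]);
      pix1 += stride1;
      pix2 += stride2;
    }
}
```

Scoping the attribute to one function keeps the rest of the file on the normal -O3 pipeline, which makes a per-function timing comparison possible.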
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789 --- Comment #2 from Kewen Lin --- Created attachment 49124 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49124&action=edit sub4x4_dct SLP dumping
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789 --- Comment #3 from Kewen Lin --- Bisection shows it started to fail from r11-205.
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789 Kewen Lin changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org --- Comment #6 from Kewen Lin --- (In reply to Richard Biener from comment #4) > This delays some checks to eventually support part of the BB vectorization > which is what succeeds here. I suspect that w/o vectorization we manage > to elide the tmp[] array but with the part vectorization that occurs we > fail to do that. > > On the cost side there would be a lot needed to make the vectorization > not profitable: > > Vector inside of basic block cost: 8 > Vector prologue cost: 36 > Vector epilogue cost: 0 > Scalar cost of basic block: 64 > > the thing to double-check is > > 0x123b1ff0 1 times vec_construct costs 17 in prologue > > that is the cost of the V16QI construct > > _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565, > _576, _125, _143, _161, _179}; > Thanks Richard! I did some cost adjustment experiment last year and the cost for v16qi looks off indeed, but at that time with the cost tweaking for this the SPEC performance doesn't change, I guessed it's just we happened not have this kind of case to trap into. I'll have a look and re-evaluate it for this.
[Bug target/96933] New: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 Bug ID: 96933 Summary: inefficient code for char/short vec CTOR Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linkw at gcc dot gnu.org Target Milestone: --- When I'm investigate the vectorization cost for vec_construct, I happened to find the generated code for vector construction is inefficient with DIRECT_MOVE support. The test case looks like: vector unsigned char test_char(unsigned char f1, unsigned char f2, unsigned char f3, unsigned char f4, unsigned char f5, unsigned char f6, unsigned char f7, unsigned char f8, unsigned char f9, unsigned char f10, unsigned char f11, unsigned char f12, unsigned char f13, unsigned char f14, unsigned char f15, unsigned char f16) { vector unsigned char v = {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15, f16}; return v; } The generated code currently with -mcpu=power9: : 0: e8 ff a1 fb std r29,-24(r1) 4: f0 ff c1 fb std r30,-16(r1) 8: f8 ff e1 fb std r31,-8(r1) c: 60 00 a1 8b lbz r29,96(r1) 10: 68 00 c1 8b lbz r30,104(r1) 14: 70 00 e1 8b lbz r31,112(r1) 18: d1 ff 81 98 stb r4,-47(r1) 1c: d2 ff a1 98 stb r5,-46(r1) 20: 78 00 81 89 lbz r12,120(r1) 24: 80 00 01 88 lbz r0,128(r1) 28: 88 00 61 89 lbz r11,136(r1) 2c: 90 00 81 88 lbz r4,144(r1) 30: 98 00 a1 88 lbz r5,152(r1) 34: d0 ff 61 98 stb r3,-48(r1) 38: d3 ff c1 98 stb r6,-45(r1) 3c: d4 ff e1 98 stb r7,-44(r1) 40: d8 ff a1 9b stb r29,-40(r1) 44: d5 ff 01 99 stb r8,-43(r1) 48: d6 ff 21 99 stb r9,-42(r1) 4c: d7 ff 41 99 stb r10,-41(r1) 50: d9 ff c1 9b stb r30,-39(r1) 54: da ff e1 9b stb r31,-38(r1) 58: db ff 81 99 stb r12,-37(r1) 5c: dc ff 01 98 stb r0,-36(r1) 60: dd ff 61 99 stb r11,-35(r1) 64: de ff 81 98 stb r4,-34(r1) 68: df ff a1 98 stb r5,-33(r1) 6c: e8 ff a1 eb ld r29,-24(r1) 70: f0 ff c1 eb ld r30,-16(r1) 74: f8 ff e1 eb ld r31,-8(r1) 78: d9 ff 41 f4 lxv vs34,-48(r1) 7c: 20 00 80 4e blr But it can be more 
efficient with direct move and vector merge, such as: 0: 67 01 43 7c mtvsrd vs34,r3 4: 68 00 61 80 lwz r3,104(r1) 8: 60 00 61 81 lwz r11,96(r1) c: 67 01 64 7c mtvsrd vs35,r4 10: 70 00 81 80 lwz r4,112(r1) 14: 67 01 03 7d mtvsrd vs40,r3 18: 78 00 61 80 lwz r3,120(r1) 1c: 67 01 85 7c mtvsrd vs36,r5 20: 67 01 a6 7c mtvsrd vs37,r6 24: 67 01 07 7c mtvsrd vs32,r7 28: 67 01 28 7c mtvsrd vs33,r8 2c: 67 01 24 7d mtvsrd vs41,r4 30: 80 00 81 80 lwz r4,128(r1) 34: 0c 10 43 10 vmrghb v2,v3,v2 38: 67 01 63 7c mtvsrd vs35,r3 3c: 88 00 61 80 lwz r3,136(r1) 40: 67 01 eb 7c mtvsrd vs39,r11 44: 0c 20 85 10 vmrghb v4,v5,v4 48: 67 01 a4 7c mtvsrd vs37,r4 4c: 90 00 81 80 lwz r4,144(r1) 50: 0c 00 01 10 vmrghb v0,v1,v0 54: 67 01 23 7c mtvsrd vs33,r3 58: 98 00 61 80 lwz r3,152(r1) 5c: 67 01 c9 7c mtvsrd vs38,r9 60: 0c 38 e8 10 vmrghb v7,v8,v7 64: 67 01 04 7d mtvsrd vs40,r4 68: 0c 48 63 10 vmrghb v3,v3,v9 6c: 67 01 23 7d mtvsrd vs41,r3 70: 0c 28 a1 10 vmrghb v5,v1,v5 74: 67 01 2a 7c mtvsrd vs33,r10 78: 0c 40 09 11 vmrghb v8,v9,v8 7c: 0c 30 21 10 vmrghb v1,v1,v6 80: 4c 11 44 10 vmrglh v2,v4,v2 84: 4c 39 63 10 vmrglh v3,v3,v7 88: 4c 29 88 10 vmrglh v4,v8,v5 8c: 4c 01 a1 10 vmrglh v5,v1,v0 90: 8c 19 64 10 vmrglw v3,v4,v3 94: 8c 11 45 10 vmrglw v2,v5,v2 98: 57 13 43 f0 xxmrgld vs34,vs35,vs34
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin changed:

What             |Removed                       |Added
Status           |UNCONFIRMED                   |ASSIGNED
CC               |                              |bergner at gcc dot gnu.org,
                 |                              |linkw at gcc dot gnu.org,
                 |                              |segher at gcc dot gnu.org,
                 |                              |wschmidt at gcc dot gnu.org
Summary          |inefficient code for          |rs6000: inefficient code
                 |char/short vec CTOR           |for char/short vec CTOR
Last reconfirmed |                              |2020-09-04
Target           |                              |powerpc
Ever confirmed   |0                             |1
Assignee         |unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #2 from Kewen Lin ---
(In reply to Segher Boessenkool from comment #1)
> Is that actually faster though? The original has shorter dependency
> chains. Or is this to avoid some LHS/SHL?

Yes, I tested it with one constructed case: the original version takes 18.20s while the optimized version takes 8.40s. And yes, I guess it's due to LHS/SHL, similar to the vec_insert issue xionghu is working on.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #5 from Kewen Lin ---
(In reply to Segher Boessenkool from comment #4)
> Yes, timing suggests there is some SHL/LHS flush.
>
> On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> bytes into place at one), which reduces the number of moves from
> 16 to 8, and the number of merges from 15 to 7 (and reduces path
> length by 1). This sounds like a no-brainer win with that :-)

Good idea, it looks better on P9. One thing to double-check: currently there are no instructions like vmrgob and vmrgoh, so do the merges you mentioned, from vector bytes to vector short and from vector short to vector word, need an artificial control vector?
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #6 from Kewen Lin --- (In reply to Kewen Lin from comment #5) > (In reply to Segher Boessenkool from comment #4) > > Yes, timing suggests there is some SHL/LHS flush. > > > > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two > > bytes into place at one), which reduces the number of moves from > > 16 to 8, and the number of merges from 15 to 7 (and reduces path > > length by 1). This sounds like a no-brainer win with that :-) > > Good idea, it looks better on P9. One thing to double confirm, currently > there are no instructions like vmrgob and vmrgoh, so of the mergings you > mentioned from vector bytes to vector short and vector short to vector word > needs artificial control vector? Improve the patch to support mtvsrdd, the asm for char looks like: : 0: 00 00 4c 3c addis r2,r12,0 0: R_PPC64_REL16_HA .TOC. 4: 00 00 42 38 addir2,r2,0 4: R_PPC64_REL16_LO .TOC.+0x4 8: e8 ff a1 fb std r29,-24(r1) c: 00 00 a2 3f addis r29,r2,0 c: R_PPC64_TOC16_HA .rodata.cst16 10: f0 ff c1 fb std r30,-16(r1) 14: f8 ff e1 fb std r31,-8(r1) 18: 67 1b 24 7c mtvsrdd vs33,r4,r3 1c: 67 3b 28 7d mtvsrdd vs41,r8,r7 20: 68 00 c1 8b lbz r30,104(r1) 24: 78 00 e1 8b lbz r31,120(r1) 28: 00 00 bd 3b addir29,r29,0 28: R_PPC64_TOC16_LO.rodata.cst16 2c: 60 00 81 89 lbz r12,96(r1) 30: 70 00 61 89 lbz r11,112(r1) 34: 80 00 81 88 lbz r4,128(r1) 38: 88 00 61 88 lbz r3,136(r1) 3c: 90 00 01 89 lbz r8,144(r1) 40: 98 00 e1 88 lbz r7,152(r1) 44: 67 2b 46 7c mtvsrdd vs34,r6,r5 48: 67 4b aa 7d mtvsrdd vs45,r10,r9 4c: 09 00 9d f5 lxv vs44,0(r29) 50: 67 63 5e 7d mtvsrdd vs42,r30,r12 54: 67 5b 1f 7c mtvsrdd vs32,r31,r11 58: e8 ff a1 eb ld r29,-24(r1) 5c: f0 ff c1 eb ld r30,-16(r1) 60: 67 23 63 7d mtvsrdd vs43,r3,r4 64: f8 ff e1 eb ld r31,-8(r1) 68: 3b 0b 42 10 vpermr v2,v2,v1,v12 6c: 67 43 27 7c mtvsrdd vs33,r7,r8 70: 3b 4b ad 11 vpermr v13,v13,v9,v12 74: 3b 53 00 10 vpermr v0,v0,v10,v12 78: 3b 5b 21 10 vpermr v1,v1,v11,v12 7c: 97 11 4d f0 
xxmrglw vs34,vs45,vs34 80: 97 01 01 f0 xxmrglw vs32,vs33,vs32 84: 57 13 40 f0 xxmrgld vs34,vs32,vs34 88: 20 00 80 4e blr For: 1) mtvsrdd under TARGET_DIRECT_MOVE_128 2) mtvsrd under TARGET_DIRECT_MOVE 3) original The time evaluation on Power9 looks like 1) 7.28s 2) 7.41s 3) 18.19s
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #8 from Kewen Lin ---
(In reply to Segher Boessenkool from comment #7)
> There are vmrglb and vrghb etc.?

But these only handle the low or the high part separately. With mtvsrdd, both the low and the high doublewords hold values, and we don't have Vector Merge Even/Odd for char or short to merge them. For now I use one artificial control vector for the merging; correct me if I'm missing something.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #10 from Kewen Lin ---
(In reply to Segher Boessenkool from comment #9)
> I'm not sure what you mean.
>
> vmrglb merges the vectors
> abcdefghijklmnop
> and
> ABCDEFGHIJKLMNOP
> to
> iIjJkKlLmMnNoOpP
>
> ... ah, I see what you mean I guess.
>
> So, use something else instead? How about vpku*um?
>
> First vpkudum, xforming
> xxxAxxxB
> and
> xxxCxxxD
> into
> xxxAxxxBxxxCxxxD
>
> and then vpkuwum:
> xxxAxxxBxxxCxxxD
> and
> xxxExxxFxxxGxxxH
> into
> xAxBxCxDxExFxGxH
>
> and finally vpkuhum:
> xAxBxCxDxExFxGxH
> and
> xIxJxKxLxMxNxOxP
> into
> ABCDEFGHIJKLMNOP
>
> ?

Great, it works! Thanks for the advice. In my testing, for type char it's on par with the artificial control vector version (7.30s vs. 7.28s), while for type short it's better (28.66s vs. 31.52s). I'll update the posted patch to V2.
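The pack sequence suggested in comment #9 can be modeled in portable C to check the data flow. This is a scalar sketch only: `pack_dw`, `pack_w`, `pack_h` and `build_v16qi` are hypothetical names, arrays stand in for vector registers in the left-to-right order of the diagrams, and the real operand order and endianness of vpkudum/vpkuwum/vpkuhum are deliberately ignored.

```c
#include <stdint.h>

/* Each "pack modulo" step keeps the low half of every element of A,
   followed by the low half of every element of B.  */
static void pack_dw (const uint64_t a[2], const uint64_t b[2], uint32_t r[4])
{
  for (int i = 0; i < 2; i++)
    {
      r[i] = (uint32_t) a[i];
      r[i + 2] = (uint32_t) b[i];
    }
}

static void pack_w (const uint32_t a[4], const uint32_t b[4], uint16_t r[8])
{
  for (int i = 0; i < 4; i++)
    {
      r[i] = (uint16_t) a[i];
      r[i + 4] = (uint16_t) b[i];
    }
}

static void pack_h (const uint16_t a[8], const uint16_t b[8], uint8_t r[16])
{
  for (int i = 0; i < 8; i++)
    {
      r[i] = (uint8_t) a[i];
      r[i + 8] = (uint8_t) b[i];
    }
}

/* Build a 16-byte vector from 16 scalars: 8 two-element moves followed
   by 4 + 2 + 1 = 7 packs.  */
void build_v16qi (const uint8_t f[16], uint8_t out[16])
{
  uint64_t dw[8][2];
  uint32_t w[4][4];
  uint16_t h[2][8];

  for (int i = 0; i < 8; i++)   /* models one mtvsrdd per pair */
    {
      dw[i][0] = f[2 * i];
      dw[i][1] = f[2 * i + 1];
    }
  for (int i = 0; i < 4; i++)   /* doubleword -> word packs */
    pack_dw (dw[2 * i], dw[2 * i + 1], w[i]);
  for (int i = 0; i < 2; i++)   /* word -> halfword packs */
    pack_w (w[2 * i], w[2 * i + 1], h[i]);
  pack_h (h[0], h[1], out);     /* halfword -> byte pack */
}
```

The counts match the numbers quoted earlier in the thread: 8 moves and 7 combining steps, with no control vector to load.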
[Bug target/97019] New: rs6000:redundant rldicr fed to lvx/stvx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019 Bug ID: 97019 Summary: rs6000:redundant rldicr fed to lvx/stvx Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linkw at gcc dot gnu.org Target Milestone: --- When we do the early expansion for altivec built-in function vec_ld/vec_st, we can probably leave some redundant rldicr x,y,0,59 which aims to AND (-16) for the vector access address, since the lvx/stvx will do the aligned and with -16 themselves, they are useless. = test case extern int a, b, c; extern vector unsigned long long ev5, ev6, ev7, ev8; int test(unsigned char *pe) { vector unsigned long long v1, v2, v3, v4, v9; vector unsigned long long v5 = ev5; vector unsigned long long v6 = ev6; vector unsigned long long v7 = ev7; vector unsigned long long v8 = ev8; unsigned char *e = pe; do { if (a) { asm("memory"); v1 = __builtin_vec_ld(16, (unsigned long long *)e); v2 = __builtin_vec_ld(32, (unsigned long long *)e); v3 = __builtin_vec_ld(48, (unsigned long long *)e); e = e + 8; for (int i = 0; i < a; i++) { v4 = v5; v5 = __builtin_crypto_vpmsumd(v1, v6); v6 = __builtin_crypto_vpmsumd(v2, v7); v7 = __builtin_crypto_vpmsumd(v3, v8); e = e + 8; } } v5 = __builtin_vec_ld(16, (unsigned long long *)e); v6 = __builtin_vec_ld(32, (unsigned long long *)e); v7 = __builtin_vec_ld(48, (unsigned long long *)e); if (c) b = 1; } while (b); v9 = v4; int p = __builtin_unpack_vector_int128((vector __int128_t)v9, 0); return p; } command -m64 -O2 -mcpu=power8 Currently the function find_alignment_op in RTL swaps pass cares the case where have one single AND operation definition, we can extend it to check all definitions are AND operations and aligned with -16B.
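Why the rldicr is redundant can be shown with two small address helpers. This is a sketch with hypothetical function names that models only the address arithmetic, not the instructions themselves.

```c
#include <stdint.h>

/* rldicr x,y,0,59: rotate by 0 and keep bits 0..59 (IBM numbering),
   i.e. clear the low four bits, the same as AND with -16.  */
static uint64_t rldicr_0_59 (uint64_t addr)
{
  return addr & ~UINT64_C (15);
}

/* Address accessed by lvx/stvx: the low four bits of the effective
   address are ignored.  */
static uint64_t lvx_effective_address (uint64_t base, uint64_t index)
{
  return (base + index) & ~UINT64_C (15);
}
```

Masking is idempotent, so feeding a pre-masked address to lvx/stvx accesses the same 16-byte block as feeding the unmasked one.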
[Bug target/97019] rs6000:redundant rldicr fed to lvx/stvx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019

Kewen Lin changed:

What             |Removed                       |Added
Assignee         |unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org
Status           |UNCONFIRMED                   |ASSIGNED
CC               |                              |bergner at gcc dot gnu.org,
                 |                              |segher at gcc dot gnu.org,
                 |                              |wschmidt at gcc dot gnu.org
Keywords         |                              |missed-optimization
Ever confirmed   |0                             |1
Last reconfirmed |                              |2020-09-11
Target           |                              |powerpc
[Bug target/97019] rs6000:redundant rldicr fed to lvx/stvx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019

Kewen Lin changed:

What       |Removed  |Added
Resolution |---      |FIXED
Status     |ASSIGNED |RESOLVED

--- Comment #3 from Kewen Lin ---
Should be fixed on latest trunk now.
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin changed:

What             |Removed     |Added
Last reconfirmed |            |2020-09-16
Status           |UNCONFIRMED |ASSIGNED
Ever confirmed   |0           |1

--- Comment #7 from Kewen Lin ---
Two questions in mind, which need further digging:

1) From the assembly of the scalar/vector code, I don't see any stores into the temp array d (array diff in pixel_sub_wxh), but when modeling we consider the stores. On Power, two vector stores cost 2 while 16 scalar stores cost 16; it seems wrong to cost-model something useless. Later, the vector version needs 16 vector halfword extractions from these two halfword vectors, while in the scalar version the values are already in GPRs, so the vector version looks inefficient.

2) On Power, the conversion from unsigned char to unsigned short is a nop conversion, but it is counted when we compute the scalar cost, which adds a total of 32 to the scalar cost. Meanwhile, the conversion from unsigned short to signed short should be counted but is not (need to check why). The nop conversion costing looks like something we can handle in function rs6000_adjust_vect_cost_per_stmt. I tried to use the generic function tree_nop_conversion_p, but it only covers same mode/precision conversions. I will look for something else.
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #9 from Kewen Lin ---
(In reply to Richard Biener from comment #8)
> (In reply to Kewen Lin from comment #7)
> > Two questions in mind, need to dig into it further:
> > 1) from the assembly of scalar/vector code, I don't see any stores needed
> > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > consider the stores.
>
> Because when modeling they are still there. There's no good way around this.
>

I noticed the stores get eliminated during FRE. Can we consider running FRE once just before SLP? Or is that a bad idea because of compilation time?
[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

Kewen Lin changed:

What             |Removed                       |Added
Ever confirmed   |0                             |1
Status           |UNCONFIRMED                   |ASSIGNED
Assignee         |unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org
CC               |                              |linkw at gcc dot gnu.org
Last reconfirmed |                              |2020-09-17

--- Comment #1 from Kewen Lin ---
I'll take a look at this.
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789 --- Comment #11 from Kewen Lin --- (In reply to Richard Biener from comment #10) > (In reply to Kewen Lin from comment #9) > > (In reply to Richard Biener from comment #8) > > > (In reply to Kewen Lin from comment #7) > > > > Two questions in mind, need to dig into it further: > > > > 1) from the assembly of scalar/vector code, I don't see any stores > > > > needed > > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we > > > > consider the stores. > > > > > > Because when modeling they are still there. There's no good way around > > > this. > > > > > > > I noticed the stores get eliminated during FRE. Can we consider running FRE > > once just before SLP? a bad idea due to compilation time? > > Yeah, we already run FRE a lot and it is one of the more expensive passes. > > Note there's one point we could do better which is the embedded SESE FRE > run from cunroll which is only run before we consider peeling an outer loop > and thus not for the outermost unrolled/peeled code (but the question would > be from where / up to what to apply FRE to). On x86_64 this would apply to > the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds > quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it > matters for costing though). Thanks for the explanation! I'll look at it after checking 2). IIUC, the advantage to eliminate stores here looks able to get those things which is fed to stores and stores' consumers bundled, then get more things SLP-ed if available? > > I think I looked at this or a related testcase some time ago and split out > some PRs (can't find those right now). For example we are not considering > to simplify > > > the load permutations suggest that splitting the group into 4-lane pieces > would avoid doing permutes but then that would require target support > for V4QI and V4HI vectors. 
At least the loads could be considered > to be vectorized with strided-SLP, yielding 'int' loads and a vector > build from 4 ints. I'd need to analyze why we do not consider this. Good idea! Curious that is there some port where int load can not work well on 1-byte aligned address like trap?
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789 --- Comment #12 from Kewen Lin --- > Thanks for the explanation! I'll look at it after checking 2). IIUC, the > advantage to eliminate stores here looks able to get those things which is > fed to stores and stores' consumers bundled, then get more things SLP-ed if > available? Hmm, I think I was wrong, if both the feeding chain and consuming chain of the stores are SLP-ed, later FRE would be able to fuse them.
[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

--- Comment #3 from Kewen Lin ---
(In reply to akrl from comment #2)
> Thanks Kewen, unfortunately I've no Power setup. Sorry for the
> inconvenience.

My pleasure! If you are interested in running on Power machines, you can apply for access to the Power8/Power9 machines in the CFarm machine pool: https://cfarm.tetaneutral.net/machines/list/.
[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

--- Comment #4 from Kewen Lin ---
> gcc.target/powerpc/p9-vec-length-full-6.c

This is a test case issue: with Andrea's improvement, the 64bit/32bit pairs use full vectors instead of partial vectors.

> gcc.target/powerpc/p9-vec-length-epil-7.c

This exposed one problem: when we call vect_need_peeling_or_partial_vectors_p in function vect_analyze_loop_2, we are still in the analysis stage. If the loop is an epilogue loop, its loop_vinfo (e.g. LOOP_VINFO_INT_NITERS) hasn't been fixed up yet, so the function can give the wrong answer. For some 64bit type functions of this failing case, it returns false for the epilogue loops although the remaining iterations can't fill a full vector.

One simple fix is to exclude epilogue loops from this check.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ab627fbf029..7273e998a99 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2278,7 +2278,8 @@ start_over:
     {
       /* Don't use partial vectors if we don't need to peel the loop.  */
       if (param_vect_partial_vector_usage == 0
-          || !vect_need_peeling_or_partial_vectors_p (loop_vinfo))
+          || (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+              && !vect_need_peeling_or_partial_vectors_p (loop_vinfo)))
         LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
       else if (vect_verify_full_masking (loop_vinfo)
                || vect_verify_loop_lens (loop_vinfo))

Testing is ongoing.
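The changed condition in this comment can be read as a small truth function. The sketch below uses a hypothetical helper, `keep_partial_vectors`, with plain booleans standing in for the loop_vinfo queries; it leaves out the follow-up vect_verify_full_masking / vect_verify_loop_lens checks.

```c
#include <stdbool.h>

/* Returns whether partial vectors stay enabled for the loop being
   analyzed, following the patched condition.  */
static bool keep_partial_vectors (int partial_vector_usage,
                                  bool is_epilogue,
                                  bool need_peeling_or_partial)
{
  if (partial_vector_usage == 0
      || (!is_epilogue && !need_peeling_or_partial))
    return false;
  return true;
}
```

The only behavioral change is for epilogue loops: their possibly stale peeling answer is no longer allowed to switch partial vectors off.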
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #13 from Kewen Lin ---
> 2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further).

vect_nop_conversion_p returns true for the UH to SH conversion, so it's not even put into the cost vector.

The comment on tree_nop_conversion_p says:

/* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
   no instruction.  */

I may be missing something here, but the UH to SH conversion does need one explicit extend instruction, *extsh*, so the precision/mode equality check looks wrong for this conversion.
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #15 from Kewen Lin ---
(In reply to rguent...@suse.de from comment #14)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> >
> > --- Comment #13 from Kewen Lin ---
> > > 2) on Power, the conversion from unsigned char to unsigned short is nop
> > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > signed short should be counted but it's not (need to check why further).
> >
> > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > even put into the cost vector.
> >
> > tree_nop_conversion_p's comments saying:
> >
> > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> >    no instruction. */
> >
> > I may miss something here, but UH to SH conversion does need one explicit
> > extend instruction *extsh*, the precision/mode equality check looks wrong
> > for this conversion.
>
> Well, it isn't a RTL predicate and it only needs extension because
> there's never a HImode pseudo but always SImode subregs.

Thanks Richi! Should we take care of this case, or treat this kind of extension as "no instruction"? I intended to handle it in target-specific code, but it isn't recorded in the cost vector, and revisiting the bb_info slp_instances in finish_cost seems too heavy.
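A tiny example makes the disputed conversion concrete. `uh_to_sh` is a hypothetical name; the function only shows the source-level operation, and whether it ends up as an extsh depends on how the value is held in registers, as discussed in this thread.

```c
/* Reinterpret an unsigned halfword as signed and widen it to int.  The
   out-of-range conversion is implementation-defined in ISO C; GCC
   documents it as reduction modulo 2^16.  */
int uh_to_sh (unsigned short x)
{
  return (short) x;
}
```

At the type level the two 16-bit types have the same precision and mode, which is why the conversion is classified as a nop; the sign extension only becomes visible once the value is used in a wider register.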
[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132 --- Comment #3 from Kewen Lin --- PowerPC already supports vcond where A and B have the same mode or modes of the same size. As Richard pointed out, this case requires some packs: PowerPC would need to support vec_cmpv2dfv2di and vcond_mask_v4siv4si, where the comparison generates the mask, which is then converted to V4SI for use in the condition selection.
[Bug tree-optimization/92185] New: ICE when perform condition reduction vectorization on uchar ind var
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185 Bug ID: 92185 Summary: ICE when perform condition reduction vectorization on uchar ind var Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linkw at gcc dot gnu.org Target Milestone: --- TESTCASE: #include "tree-vect.h" extern void abort (void) __attribute__ ((noreturn)); #define N 27 unsigned char condition_reduction (short *a, short min_v) { unsigned char last = 0; for (unsigned char i = 0; i < 27; i++) if (a[i] < min_v) last = i; return last; } int main (void) { short a[27] = { 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25, 26, 27 }; check_vect (); int ret = condition_reduction (a, 10); if (ret != 18) abort (); return 0; } BTW, tree-vect.h is from gcc/testsuite/gcc.dg/vect/tree-vect.h Options: -Ofast -fno-inline -fdump-tree-vect-details -fvect-cost-model=unlimited ICE backtrace: 13 | condition_reduction (short *a, short min_v) | ^~~ 0x115016ff vect_create_epilog_for_reduction /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:4252 0x1150ccb3 vectorizable_live_operation(_stmt_vec_info*, gimple_stmt_iterator*, _slp_tree*, _slp_instance*, int, bool, vec*) /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:7478 0x114df9bf can_vectorize_live_stmts /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-stmts.c:10578 0x114e1933 vect_transform_stmt(_stmt_vec_info*, gimple_stmt_iterator*, _slp_tree*, _slp_instance*) /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-stmts.c:11031 0x1150e9d7 vect_transform_loop_stmt /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:7918 0x1150f73f vect_transform_loop(_loop_vec_info*) /home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:8133 0x1154acc7 try_vectorize_loop_1 /home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:982 0x1154aff3 try_vectorize_loop /home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:1035 0x1154b243 vectorize_loops() 
/home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:1115 0x1132976f execute /home/linkw/gcc/gcc-git-fix/gcc/tree-ssa-loop.c:414
[Bug tree-optimization/92185] ICE when perform condition reduction vectorization on uchar ind var
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185 --- Comment #3 from Kewen Lin --- (In reply to Richard Biener from comment #2) > Hmm, I can't reproduce this, I tried ppc64le and x86_64. Sorry, my local codebase is on r277221, trying latest trunk.
[Bug tree-optimization/92162] [10 Regression] ICE in vect_create_epilog_for_reduction, at tree-vect-loop.c:4252
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92162

Kewen Lin changed:

What |Removed |Added
CC   |        |linkw at gcc dot gnu.org

--- Comment #6 from Kewen Lin ---
*** Bug 92185 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/92185] ICE when perform condition reduction vectorization on uchar ind var
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185

Kewen Lin changed:

What       |Removed  |Added
Status     |RESOLVED |CLOSED
Resolution |FIXED    |DUPLICATE

--- Comment #5 from Kewen Lin ---
Confirmed that latest trunk has already fixed it, and bisection shows the same result as Martin pointed out (thanks Martin).

*** This bug has been marked as a duplicate of bug 92162 ***
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 92074, which changed state. Bug 92074 Summary: [10 regression] 26% performance regression on Spec2017 548.exchange2_r https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

Kewen Lin changed:

What       |Removed                       |Added
Status     |NEW                           |RESOLVED
Resolution |---                           |FIXED
Assignee   |unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org

--- Comment #7 from Kewen Lin ---
Verified and confirmed that the commit recovers the number.
[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127
Kewen Lin changed: What|Removed |Added CC||linkw at gcc dot gnu.org
--- Comment #3 from Kewen Lin ---
(In reply to Richard Biener from comment #2)
> I suggest to make the test less dependent on unrolling by placing
>
> #pragma GCC unroll 0
>
> before the inner loop which is likely unrolled now. I wonder whether
> the test tests profitability of outer loop vectorization (likely
> not profitable)? I see rs6000 adjusts unroll parameters as well.

Confirmed that the inner loop is completely unrolled after the suspected commit. I checked the dump details: the test checks whether vectorizing the inner loop is profitable, and outer loop vectorization fails well before any profitability determination.

/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:18:20: missed: versioning for alias required: can't determine dependence between *_7 and *_11
consider run-time aliasing test between *_7 and *_11
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:18:20: missed: runtime alias check not supported for outer loop.
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:13:4: missed: bad data dependence.
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:13:4: missed: couldn't vectorize loop
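For illustration only, the suggestion quoted above (placing `#pragma GCC unroll 0` before the inner loop) might look like the following minimal C sketch. The loop nest, the sizes ROWS and COLS, and the function name sum_matrix are hypothetical and are not taken from costmodel-fast-math-vect-pr29925.c. Per the GCC documentation, unroll values of 0 and 1 block unrolling of the loop that follows the pragma.

```c
#define ROWS 8   /* hypothetical sizes, chosen only for this sketch */
#define COLS 4

/* Sum a small matrix.  The pragma asks GCC not to unroll the inner
   loop, so later passes still see it as a loop.  */
static int
sum_matrix (int m[ROWS][COLS])
{
  int total = 0;
  for (int i = 0; i < ROWS; i++)
    {
#pragma GCC unroll 0
      for (int j = 0; j < COLS; j++)
        total += m[i][j];
    }
  return total;
}
```

The pragma has to sit immediately before the loop it applies to; here it affects only the inner `j` loop and leaves the outer loop to the usual heuristics.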
[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127 --- Comment #4 from Kewen Lin --- Author: linkw Date: Fri Nov 1 07:11:12 2019 New Revision: 277704 URL: https://gcc.gnu.org/viewcvs?rev=277704&root=gcc&view=rev Log: 2019-11-01 Kewen Lin PR testsuite/92127 * gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c: Disable unroll. * gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c: Likewise. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c
[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #5 from Kewen Lin --- Test case fix has been committed.
[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074 Kewen Lin changed: What|Removed |Added Status|RESOLVED|CLOSED --- Comment #8 from Kewen Lin --- Closed it.
[Bug testsuite/87306] test case gcc.dg/vect/bb-slp-pow-1.c fails with its introduction in r263290
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87306 --- Comment #6 from Kewen Lin --- Author: linkw Revision: 268003 Modified property: svn:log Modified: svn:log at Tue Nov 5 02:26:38 2019 -- --- svn:log (original) +++ svn:log Tue Nov 5 02:26:38 2019 @@ -1,3 +1,5 @@ +[PATCH, rs6000, testsuite] Fix PR87306 + PR target/87306 * gcc.dg/vect/bb-slp-pow-1.c: Modify to reflect that the loop is not vectorized on POWER unless hardware
[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127 --- Comment #6 from Kewen Lin --- Author: linkw Revision: 277704 Modified property: svn:log Modified: svn:log at Tue Nov 5 02:36:58 2019 -- --- svn:log (original) +++ svn:log Tue Nov 5 02:36:58 2019 @@ -1,4 +1,6 @@ -2019-11-01 Kewen Lin + PR testsuite/92127: Disable unrolling for some vect code model cases + + 2019-11-01 Kewen Lin PR testsuite/92127 * gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c: Disable unroll.
[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132
--- Comment #4 from Kewen Lin ---
Author: linkw
Date: Fri Nov 8 07:37:07 2019
New Revision: 277947

URL: https://gcc.gnu.org/viewcvs?rev=277947&root=gcc&view=rev
Log:
[rs6000]Fix PR92132 by adding vec_cmp and vcond_mask supports

To support full condition reduction vectorization, we have to define vec_cmp* and vcond_mask_*. This patch is to add related expands. Also add the missing vector fp comparison RTL pattern supports like: ungt, unge, unlt, unle, ne, lt and le.

gcc/ChangeLog

2019-11-08 Kewen Lin

    PR target/92132
    * config/rs6000/predicates.md
    (signed_or_equality_comparison_operator): New predicate.
    (unsigned_or_equality_comparison_operator): Likewise.
    * config/rs6000/rs6000.md (one_cmpl2): Remove expand.
    (one_cmpl3_internal): Rename to one_cmpl2.
    * config/rs6000/vector.md
    (vcond_mask_ for VEC_I and VEC_I): New expand.
    (vec_cmp for VEC_I and VEC_I): Likewise.
    (vec_cmpu for VEC_I and VEC_I): Likewise.
    (vcond_mask_ for VEC_F): New expand for float vector modes and
    same-size integer vector modes.
    (vec_cmp for VEC_F): Likewise.
    (vector_lt for VEC_F): New expand.
    (vector_le for VEC_F): Likewise.
    (vector_ne for VEC_F): Likewise.
    (vector_unge for VEC_F): Likewise.
    (vector_ungt for VEC_F): Likewise.
    (vector_unle for VEC_F): Likewise.
    (vector_unlt for VEC_F): Likewise.
    (vector_uneq): Expose name.
    (vector_ltgt): Likewise.
    (vector_unordered): Likewise.
    (vector_ordered): Likewise.

gcc/testsuite/ChangeLog

2019-11-08 Kewen Lin

    PR target/92132
    * gcc.target/powerpc/pr92132-fp-1.c: New test.
    * gcc.target/powerpc/pr92132-fp-2.c: New test.
    * gcc.target/powerpc/pr92132-int-1.c: New test.
    * gcc.target/powerpc/pr92132-int-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/powerpc/pr92132-fp-1.c
    trunk/gcc/testsuite/gcc.target/powerpc/pr92132-fp-2.c
    trunk/gcc/testsuite/gcc.target/powerpc/pr92132-int-1.c
    trunk/gcc/testsuite/gcc.target/powerpc/pr92132-int-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/predicates.md
    trunk/gcc/config/rs6000/rs6000.md
    trunk/gcc/config/rs6000/vector.md
    trunk/gcc/testsuite/ChangeLog
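As background for the term "condition reduction" used in the log above, here is a minimal, hypothetical C loop of the kind the vectorizer classifies that way: the reduction variable is overwritten only when a comparison holds. It is not the content of vect-cond-reduc-4.c or of the new pr92132 tests, and the function name and types are assumptions made for this sketch. Vectorizing such a loop needs a vector compare that yields a mask and a select driven by that mask, which is the role of the vec_cmp* and vcond_mask_* patterns named in the log.

```c
/* Return the last index i in [0, n) with a[i] < limit, or -1 if there
   is none.  'last' is a conditional reduction: it is only overwritten
   when the comparison is true.  */
static int
last_index_below (const int *a, int n, int limit)
{
  int last = -1;
  for (int i = 0; i < n; i++)
    if (a[i] < limit)
      last = i;
  return last;
}
```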
[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #5 from Kewen Lin --- Fixed on trunk.
[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464
Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-11-12 Ever confirmed|0 |1
--- Comment #1 from Kewen Lin ---
Before the regressed commit, the cost view looks like:

  0x13135eb0 ic[i_35] 2 times vector_stmt costs 2 in prologue
  0x13135eb0 ic[i_35] 1 times vector_stmt costs 1 in prologue
  0x13135eb0 ic[i_35] 1 times vector_load costs 1 in body
  0x13135eb0 ic[i_35] 1 times vec_perm costs 3 in body
  0x13135eb0 _5 1 times vector_store costs 1 in body
  .c:21:3: note: not using a fully-masked loop.
  cost model: prologue peel iters set to vf/2.
  cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown.
  0x13135eb0 1 times cond_branch_taken costs 3 in prologue
  0x13135eb0 1 times cond_branch_not_taken costs 1 in prologue
  0x13135eb0 1 times cond_branch_taken costs 3 in epilogue
  0x13135eb0 1 times cond_branch_not_taken costs 1 in epilogue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 2 in prologue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 2 in epilogue
  0x13135eb0 _5 2 times scalar_store costs 2 in prologue
  0x13135eb0 _5 2 times scalar_store costs 2 in epilogue
  .c:21:3: note: Cost model analysis:
    Vector inside of loop cost: 5
    Vector prologue cost: 11
    Vector epilogue cost: 8
    Scalar iteration cost: 2
    Scalar outside cost: 0
    Vector outside cost: 19
    prologue iterations: 2
    epilogue iterations: 2
    Calculated minimum iters for profitability: 19

With the commit, the cost view is changed to:

  0x13135eb0 ic[i_35] 2 times vector_stmt costs 2 in prologue
  0x13135eb0 ic[i_35] 1 times vector_stmt costs 1 in prologue
  0x13135eb0 ic[i_35] 1 times vector_load costs 2 in body
  0x13135eb0 ic[i_35] 1 times vec_perm costs 3 in body
  0x13135eb0 _5 1 times vector_store costs 1 in body
  .c:21:3: note: not using a fully-masked loop.
  cost model: prologue peel iters set to vf/2.
  cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown.
  0x13135eb0 1 times cond_branch_taken costs 3 in prologue
  0x13135eb0 1 times cond_branch_not_taken costs 1 in prologue
  0x13135eb0 1 times cond_branch_taken costs 3 in epilogue
  0x13135eb0 1 times cond_branch_not_taken costs 1 in epilogue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 4 in prologue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 4 in epilogue
  0x13135eb0 _5 2 times scalar_store costs 2 in prologue
  0x13135eb0 _5 2 times scalar_store costs 2 in epilogue
  .c:21:3: note: Cost model analysis:
    Vector inside of loop cost: 6
    Vector prologue cost: 13
    Vector epilogue cost: 10
    Scalar iteration cost: 3
    Scalar outside cost: 0
    Vector outside cost: 23
    prologue iterations: 2
    epilogue iterations: 2
    Calculated minimum iters for profitability: 12

The cost changes are expected: scalar and vector loads now cost more. As a result, the minimum profitable iteration count becomes smaller. I ran both the before and after executables with 10 invocations, 10 times each; the measured times are very close, and both average 65.23s. So the cost adjustment doesn't make this case worse. One possible fix is to lower the test case's iteration count to 11, which is below the current minimum profitable iteration count.
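The two thresholds in the dumps above can be reproduced by hand. The sketch below is a reconstruction under stated assumptions, not a copy of GCC's code: it follows the general shape of a break-even computation between the scalar and the vector loop, and it assumes a vectorization factor of 4, inferred from "peel iters set to vf/2" together with "prologue iterations: 2". The function and parameter names are made up for this example.

```c
/* Estimate the iteration count from which the vector loop is expected
   to be cheaper than the scalar loop.  All arguments are cost-model
   units as printed in the dump; vf is an assumed vectorization factor.
   Returns -1 if the vector body is never cheaper per vf iterations.  */
static int
min_profitable_iters (int vec_inside_cost, int vec_outside_cost,
                      int scalar_iter_cost, int scalar_outside_cost,
                      int vf, int peel_prologue, int peel_epilogue)
{
  int denom = scalar_iter_cost * vf - vec_inside_cost;
  if (denom <= 0)
    return -1;

  int iters = ((vec_outside_cost - scalar_outside_cost) * vf
               - vec_inside_cost * peel_prologue
               - vec_inside_cost * peel_epilogue) / denom;

  /* Integer division rounds down; add one if the scalar side is still
     no more expensive than the vector side at the truncated count.  */
  if (scalar_iter_cost * vf * iters
      <= vec_inside_cost * iters
         + (vec_outside_cost - scalar_outside_cost) * vf)
    iters++;
  return iters;
}
```

With the "before" numbers (inside 5, outside 19, scalar iteration 2, scalar outside 0, two peeled iterations on each side) this yields 19, and with the "after" numbers (6, 23, 3, 0) it yields 12, which agrees with the two "Calculated minimum iters for profitability" lines.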
[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464
--- Comment #3 from Kewen Lin ---
(In reply to Segher Boessenkool from comment #2)
> What is the testcase testing? Whether we can properly vectorize this
> code, right? And for p7 we now do it correctly, but thought it was
> too expensive before?

On Power7, the test verifies that the cost model treats the loop as not profitable, because of the high overhead of peeling to reach a vector-aligned address, and therefore does not vectorize it. The related patch changes the cost of load insns on Power7, which moves the minimum profitable iteration count from 19 to 12. Unluckily, the test case happens to use 12 as its iteration count (N-OFF), so it hits the threshold. According to the actual runtime measurement on this case (the result mentioned above), the vectorized version performs on par with the earlier non-vectorized version, so I believe the cost change is innocent here. One simple fix would be to lower the loop bound N to 15 instead of 16.
[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464 --- Comment #4 from Kewen Lin --- By the way, if I remove the check_vect and result verification code, the vectorized version performs very slightly better than the non-vectorized version. And yes, I think it was a bit off before.
[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464 --- Comment #5 from Kewen Lin --- Author: linkw Date: Thu Nov 14 05:57:12 2019 New Revision: 278195 URL: https://gcc.gnu.org/viewcvs?rev=278195&root=gcc&view=rev Log: [testsuite] Fix PR92464 by adjust test case loop bound The recent vectorization cost adjustment on load leads the profitable min iteration count to change from 19 to 12. The case happens to hit the threshold. This patch is to adjust the loop bound from 16 to 14. gcc/testsuite/ChangeLog 2019-11-14 Kewen Lin PR target/92464 * gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Adjust loop bound due to load cost adjustment. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464 Kewen Lin changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #6 from Kewen Lin --- Fixed on trunk by r278195.
[Bug target/92566] rs6000_preferred_simd_mode isn't very good
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-11-19 CC||linkw at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from Kewen Lin --- Currently we guard V2DImode under TARGET_VSX && TARGET_P8_VECTOR in rs6000.c.
[Bug target/92566] rs6000_preferred_simd_mode isn't very good
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566 --- Comment #2 from Kewen Lin --- Created attachment 47295 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47295&action=edit Guard V2DImode and V1TImode under VSX and P8VECTOR
[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534 Kewen Lin changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org --- Comment #3 from Kewen Lin --- I'd like to triage this one.
[Bug target/92566] rs6000_preferred_simd_mode isn't very good
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566 Kewen Lin changed: What|Removed |Added Attachment #47295|0 |1 is obsolete|| --- Comment #4 from Kewen Lin --- Created attachment 47306 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47306&action=edit Get possible mode and query by VECTOR_UNIT_NONE_P Updated as Segher's comment.
[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534
Kewen Lin changed: What|Removed |Added Status|NEW |ASSIGNED
--- Comment #4 from Kewen Lin ---
This is related to the realign vector load. It only fails with -mno-allow-movmisalign (which is disabled from Power8); I can't see the abort if I specify the option explicitly on Power8.

The generated IR below is incorrect:

  vectp.43_73 = &MEM[(int *)b_17(D) + 4B];
  vect__160.44_21 = __builtin_altivec_mask_for_load (vectp.43_73);
      ==> Here we use the vectp.43_73 (b+4), this is unexpected.
  vectp.46_20 = &MEM[(int *)b_17(D) + 4B];
  vectp.46_19 = vectp.46_20 + 18446744073709551612;
      ==> Here we use the vectp.46_19 (b)
  vectp.46_18 = vectp.46_19 & -16B;
  vect__160.47_206 = MEM [(int *)vectp.46_18];
  vectp.46_207 = vectp.46_19 + 15;
      ==> Here we use the vectp.46_19 (b) + 15
  vectp.46_208 = vectp.46_207 & -16B;
  vect__160.48_209 = MEM [(int *)vectp.46_208];
  vect__160.49_210 = REALIGN_LOAD ;
  vect__144.50_211 = VEC_PERM_EXPR ;

If I adjust the call as below, the test passes.

  msq = vect_setup_realignment (first_stmt_info_for_drptr && !slp_perm
                                ? first_stmt_info_for_drptr : first_stmt_info,
                                gsi, &realignment_token,
                                alignment_support_scheme, NULL_TREE, &at_loop);

I need more time to figure out whether this is reasonable.
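To make the address arithmetic in the IR above easier to follow, here is a small C sketch of the two aligned addresses that a realigning 16-byte load reads from: the block containing the start address and the block containing the address 15 bytes later. The helper names are hypothetical; the constants mirror the `& -16B` and `+ 15` in the dump, and 18446744073709551612 is simply -4 as an unsigned 64-bit value, which is how `b + 4` is turned back into `b`.

```c
#include <stdint.h>

/* Round an address down to the enclosing 16-byte boundary (addr & -16).  */
static uintptr_t
align_down_16 (uintptr_t addr)
{
  return addr & ~(uintptr_t) 15;
}

/* First aligned block touched by a 16-byte load starting at addr.  */
static uintptr_t
realign_low_block (uintptr_t addr)
{
  return align_down_16 (addr);
}

/* Last aligned block touched by the same load ((addr + 15) & -16).  */
static uintptr_t
realign_high_block (uintptr_t addr)
{
  return align_down_16 (addr + 15);
}
```

In the dump, both aligned loads are derived from `b`, while the mask comes from `b+4`; that mismatch is what the first annotation flags.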
[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534
Kewen Lin changed: What|Removed |Added CC||rguenth at gcc dot gnu.org
--- Comment #5 from Kewen Lin ---
Seurer told me this test has passed since the recent commit r278495 (thanks, Seurer!). I noticed that the commit guards on uniform_vector_p, so this case no longer tries to vectorize. I'm wondering whether, for other cases that reach that code path, the code below is safe enough:

  msq = vect_setup_realignment (first_stmt_info_for_drptr ? first_stmt_info_for_drptr : first_stmt_info

That is, can the situation seen here, where first_stmt_info is expected even though first_stmt_info_for_drptr has been assigned (the behavior of this test case before commit r278495), never happen? My fears may be imaginary, but my concern is that commit r278495 possibly just conceals a bug that this case used to expose.

Hi Richard B., since you are also the author of commit r275798, you might be the best person to answer that. Thanks in advance!
[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534 --- Comment #7 from Kewen Lin --- Thanks for your confirmation and notes! Yes, the realignment code doesn't take effect from Power8 on, since Power8 supports unaligned vector load/store. I'll study the code, follow your suggestion and cook some patches later.