[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #14 from sergey.shalnov at intel dot com ---
"we have a basic-block vectorizer. Do you propose to remove it?"
Definitely not! The SLP vectorizer is very good to have!

"What's the rationale for not using vector registers"
I just tried the "-fno-tree-slp-vectorize" option and found a performance gain with different -march= options.

I see some misunderstanding here, so let me clarify the original question with -march=znver1. I use the option set "-Ofast -mfpmath=sse -funroll-loops -march=znver1" for the experiments. For the basic block we are discussing we have (in vect_analyze_slp_cost() in tree-vect-slp.c:1897):

tmp[i_220][0] = _150;
tmp[i_220][2] = _147;
tmp[i_220][1] = _144;
tmp[i_220][3] = _141;
tmp[i_139][0] = _447;
tmp[i_139][2] = _450;
tmp[i_139][1] = _453;
tmp[i_139][3] = _456;
tmp[i_458][0] = _54;
tmp[i_458][2] = _56;
tmp[i_458][1] = _58;
tmp[i_458][3] = _60;

This is si->stmt printed in the loop that does the "vect_prologue" calculation. The SLP statistics for this BB are:

note: Cost model analysis:
  Vector inside of basic block cost: 64
  Vector prologue cost: 32
  Vector epilogue cost: 0
  Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

So there are 12 statements that are compiled into 3 vector instructions with 4 elements each (4 * int -> xmm):

group_size = 12
ncopies_for_cost = 3
nunits = 4

But I see "count" is 1 in the cost vector related to the prologue:

prologue_cost_vec = {m_vec = 0x3fc6e70 = {{count = 1, kind = vec_construct, stmt = , misalign = 0}}}
body_cost_vec = {m_vec = 0x3fc6f70 = {{count = 3, kind = vector_store, stmt = , misalign = 0}}}

Please correct me if I'm wrong, but I think we should have count = 3 in prologue_cost_vec. That would slightly change the "Vector prologue cost" and might influence the vectorizer's decision.

Sergey

PS Richard, I didn't catch your idea in the "but DOM isn't powerful enough" sentence. Could you please clarify it a bit? Thank you.
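For reference, a minimal sketch of the arithmetic behind these numbers (my own illustration, not GCC code):

/* A minimal sketch, not GCC code: 12 scalar stores packed into vectors
   of 4 ints give 3 vector copies, so one would expect 3 vec_construct
   prologue entries to match the 3 vector_store body entries.  */
unsigned expected_prologue_count (unsigned group_size, unsigned nunits)
{
  /* Ceiling division: one vector copy per nunits scalars.  */
  unsigned ncopies_for_cost = (group_size + nunits - 1) / nunits;
  return ncopies_for_cost;   /* expected_prologue_count (12, 4) == 3 */
}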
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #16 from sergey.shalnov at intel dot com ---
"it's one vec_construct operation - it's the task of the target to turn this into a cost comparable to vector_store"
I agree that the vec_construct operation cost is based on the target cost model. What I don't completely understand is why we have 1 vec_construct for 3 vector_store; I would expect one vec_construct for each vector_store operation. Looking at the asm, each "vmovaps %xmm, (%rsp)" has its own construction sequence (here I extracted only one vector_store of the 3 generated):

vec_construct:
 2d3: c5 79 6e dd           vmovd  %ebp,%xmm11
 310: c4 41 79 6e d0        vmovd  %r8d,%xmm10
 2e0: c4 63 21 22 e2 01     vpinsrd $0x1,%edx,%xmm11,%xmm12
 31a: c4 43 29 22 e9 01     vpinsrd $0x1,%r9d,%xmm10,%xmm13
 329: c4 41 11 6c f4        vpunpcklqdq %xmm12,%xmm13,%xmm14

vector_store:
 340: c5 78 29 74 24 d8     vmovaps %xmm14,-0x28(%rsp)
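To make the pairing concrete, here is a rough intrinsics equivalent of one such construct-plus-store pair (a sketch of my own, assuming SSE4.1 is available and dst is 16-byte aligned; the function name is illustrative, not from the testcase):

#include <immintrin.h>

/* Four GPR values gathered into one xmm (the vec_construct part:
   vmovd/vpinsrd/vpunpcklqdq), then one aligned vector store.  */
void construct_and_store (unsigned *dst, unsigned a, unsigned b,
                          unsigned c, unsigned d)
{
  __m128i lo = _mm_cvtsi32_si128 ((int) a);   /* vmovd */
  lo = _mm_insert_epi32 (lo, (int) b, 1);     /* vpinsrd */
  __m128i hi = _mm_cvtsi32_si128 ((int) c);
  hi = _mm_insert_epi32 (hi, (int) d, 1);
  __m128i v = _mm_unpacklo_epi64 (lo, hi);    /* vpunpcklqdq */
  _mm_store_si128 ((__m128i *) dst, v);       /* the vector_store */
}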
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #18 from sergey.shalnov at intel dot com --- Yes, I agree that the vector_store stage has its own vectorization cost, and each vector_store has a vector-construction stage. These stages are costed separately in GCC SLP (as you know). To better illustrate my point of view I would like to propose a patch. I haven't submitted it to the community yet because I think it is better to discuss it first.

index 0ca42b4..72e3123 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
             record_stmt_cost (prologue_cost_vec, 1, vector_load,
                               stmt_info, 0, vect_prologue);
           else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
+            record_stmt_cost (prologue_cost_vec, ncopies_for_cost, vec_construct,
                               stmt_info, 0, vect_prologue);
         }
     }
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #20 from sergey.shalnov at intel dot com --- Richard, I did a quick static analysis of your latest patch. Using the command line "-g -Ofast -mfpmath=sse -funroll-loops -march=znver1", your latest patch doesn't affect the issue I discussed, but it does affect the costs for the first loop. I thought the loop costs were calculated elsewhere (tree-vect-loop.c), but as I can see, everything is interconnected. The SLP block we discussed keeps the same statistics:

  Vector inside of basic block cost: 64
  Vector prologue cost: 32
  Vector epilogue cost: 0
  Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

The first loop was:

note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 48
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar iteration cost = 464 is greater or equal to the vectorization factor = 4.

and became:

note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 192
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 192
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar iteration cost = 464 is greater or equal to the vectorization factor = 4.

Sergey
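The decision quoted in these dumps boils down to the following comparison (a simplified sketch with names of my own choosing; the real check lives in the loop cost model):

#include <stdbool.h>

/* The loop is rejected once the vector iteration cost divided by the
   scalar iteration cost reaches the vectorization factor.  With the
   numbers above, 5392 / 464 > 4, so the loop stays "not profitable"
   before and after the patch; only the prologue (outside) cost moved
   from 48 to 192.  */
bool loop_vect_profitable (int vec_inside_cost, int scalar_iter_cost, int vf)
{
  return vec_inside_cost / scalar_iter_cost < vf;
}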
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #21 from sergey.shalnov at intel dot com --- Thanks Richard for your comments. Based on our discussion I've produced the attached patch and ran it on SPEC2017 intrate/fprate on a Skylake server (with [-Ofast -flto -march=skylake-avx512 -mfpmath=sse -funroll-loops]). Please note that I used your last proposed patch and changed the loop trip count calculation ("ncopies_for_cost * nunits / group_size" is always 1). I see the following performance changes:

SPEC CPU 2017 intrate
  500.perlbench_r  -0.7%
  525.x264_r       +7.2%
  Geomean:         +0.8%

SPEC CPU 2017 fprate
  527.cam4_r       -1.1%
  538.imagick_r    +4.7%
  544.nab_r        +3.6%
  Geomean:         +0.6%

I believe that after appropriate cost model tweaks a gain could be observed on other targets as well, but I haven't checked that carefully. The patch provides a good performance gain for the original case and a few other improvements. Can you please take a look at the patch and comment (or perhaps propose another method)?

Sergey

From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
From: Sergey Shalnov
Date: Tue, 9 Jan 2018 14:37:14 +0100
Subject: [PATCH, SLP] SLP_common_algo_changes

---
 gcc/config/i386/x86-tune-costs.h |  4 ++--
 gcc/tree-vect-slp.c              | 41 ++--
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 312467d..3e0f904 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},               /* cost of loading integer registers
                               in QImode, HImode and SImode.
                               Relative to reg-reg move (2).  */
-  {6, 6, 6},               /* cost of storing integer registers */
+  {6, 6, 4},               /* cost of storing integer registers.  */
   2,                       /* cost of reg,reg fld/fst */
   {6, 6, 8},               /* cost of loading fp registers
                               in SFmode, DFmode and XFmode */
@@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},       /* cost of loading SSE registers
                               in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},       /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},        /* cost of storing SSE registers
+  {8, 8, 8, 16, 32},       /* cost of storing SSE registers
                               in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},        /* cost of unaligned stores.  */
   2, 2,                    /* SSE->integer and integer->SSE moves */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 0ca42b4..7e63a1c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
       enum vect_def_type dt;
       if (!op || op == lhs)
         continue;
-      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+          && (dt == vect_constant_def || dt == vect_external_def))
         {
           /* Without looking at the actual initializer a vector of
              constants can be implemented as load from the constant pool.
-             ??? We need to pass down stmt_info for a vector type
-             even if it points to the wrong stmt.  */
-          if (dt == vect_constant_def)
-            record_stmt_cost (prologue_cost_vec, 1, vector_load,
-                              stmt_info, 0, vect_prologue);
-          else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
-                              stmt_info, 0, vect_prologue);
+             When all elements are the same we can use a splat.  */
+          unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+          tree elt = NULL_TREE;
+          unsigned nelt = 0;
+          for (unsigned j = 0; j < ncopies_for_cost; ++j)
+            for (unsigned k = 0; k < group_size; ++k)
+              {
+                if (nelt == 0)
+                  elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+                /* ??? We're just tracking whether all operands of a single
+                   vector initializer are the same, ideally we'd check if
+                   we emitted the same one already.  */
+                else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+                                           i))
+                  elt = NULL_TREE;
+                nelt++;
+                if (nelt == group_size)
+                  {
+                    /* ??? We need to pass down
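The core idea of the hunk above is the splat check; a standalone sketch of just that check, with illustrative names rather than GCC internals:

#include <stdbool.h>
#include <stddef.h>

/* Scan the scalar operands that would feed one vector initializer;
   if they are all identical, a single broadcast (splat) is enough,
   which should be costed cheaper than a full element-by-element
   vec_construct.  */
bool vector_init_is_splat (const int *ops, size_t nelts)
{
  for (size_t k = 1; k < nelts; ++k)
    if (ops[k] != ops[0])
      return false;   /* mixed elements: full vec_construct */
  return true;        /* all equal: a splat/broadcast suffices */
}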
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #24 from sergey.shalnov at intel dot com --- Richard, the latest "SLP costing for constants/externs improvement" patch generates the same code as the baseline for the test example. Are you sure that "num_vects_to_check" should be 1 if the vector is not constant? I would expect "num_vects_to_check = ncopies_for_cost;" here:

1863          else
1864            {
1865              num_vects_to_check = 1;
1866              nelt_limit = group_size;
1867            }

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #26 from sergey.shalnov at intel dot com --- Sorry, did you mean "arm_sve.h" on ARM? In that case we have machine-specific code in the common part of GCC. Should we make it a machine-dependent callback function, given that "num_vects_to_check = 1" is incorrect for SSE?
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #28 from sergey.shalnov at intel dot com --- Richard, thank you for your comments. I see that TYPE_VECTOR_SUBPARTS is constant for the test case, but multiple_p (group_size, const_nunits) returns 1 in this code:

  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
      && ! multiple_p (group_size, const_nunits))
    {
      num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
      nelt_limit = const_nunits;
    }
  else
    {
      num_vects_to_check = 1;
      nelt_limit = group_size;
    }

This is because group_size = 16 and const_nunits = 8 in the test case with -march=skylake-avx512, so control flow goes to "num_vects_to_check = 1". And in any case, ncopies_for_cost = 2, so the expression "ncopies_for_cost * const_nunits / group_size" would still be 1. Do you think there is any way to write the condition so that num_vects_to_check = 2?

Sergey
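To spell out the numbers, a standalone sketch of the branch selection (my own simplification; it drops the TYPE_VECTOR_SUBPARTS constancy check):

/* With group_size = 16 and const_nunits = 8, multiple_p (16, 8) holds,
   so the negated test fails and the else branch sets
   num_vects_to_check = 1; and even the then-branch formula would give
   ncopies_for_cost * const_nunits / group_size = 2 * 8 / 16 = 1.  */
static int multiple_p (unsigned a, unsigned b) { return a % b == 0; }

unsigned compute_num_vects_to_check (unsigned group_size,
                                     unsigned const_nunits,
                                     unsigned ncopies_for_cost)
{
  if (!multiple_p (group_size, const_nunits))
    return ncopies_for_cost * const_nunits / group_size;
  return 1;
}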
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #29 from sergey.shalnov at intel dot com --- Richard, thank you for your latest patch. I would like to clarify the multiple_p() usage in the if() clause. First of all, I assume that architectures with a fixed HW register size (x86) should take the if(){} branch, while architectures with unknown register size should take the else{}; that is what the is_constant() check is for. The problem is the multiple_p() call. In the test case we have group_size = 16 and const_nunits = 8, so multiple_p(16, 8) returns 1, and with -march=skylake-avx512 (or -march=znver1) we take the else{} branch, which is not correct. With the following change to your patch the test works as I expect:

--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1923,7 +1923,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
       unsigned HOST_WIDE_INT const_nunits;
       unsigned nelt_limit;
       if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
-          && ! multiple_p (group_size, const_nunits))
+          && multiple_p (group_size, const_nunits))
         {
           num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
           nelt_limit = const_nunits;
--

The other question is what we should do if group_size is, for example, 17. In that case (after my change) the wrong else{} branch is taken. I would propose to restructure this if() as follows:

  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
    {
      if (multiple_p (group_size, const_nunits))
        {
          num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
          nelt_limit = const_nunits;
        }
      else
        {
          ...not clear what we should have here...
        }
    }
  else
    {
      num_vects_to_check = 1;
      nelt_limit = group_size;
    }

Or perhaps multiple_p() should not be here at all?

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #31 from sergey.shalnov at intel dot com --- Richard, thank you for your latest patch. This patch is exactly what I've been discussing in this issue. I tested it with SPEC20[06|17] and see no performance/stability degradation. Could you please merge your latest patch into main trunk? I will then provide Intel-specific changes (in gcc/config/i386) that leverage your patch to improve performance for the provided testcase (step N2). Thank you
Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #33 from sergey.shalnov at intel dot com --- Richard, I'm not sure whether it is a regression or not; I see the code was visibly refactored in this commit from 2013: https://github.com/gcc-mirror/gcc/commit/ee6e9ba576099aed29f1097195c649fc796ecf5e. However, this fix is highly important for our workloads and really desirable for GCC 8. For example, your patch plus the SKX cost model changes give us ~+1% geomean on SPEC2017 intrate. It would be really good for us to get this patch into GCC 8. The planned cost model changes are (to be proposed separately):

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e943d13..d5e6ef6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},               /* cost of loading integer registers
                               in QImode, HImode and SImode.
                               Relative to reg-reg move (2).  */
-  {6, 6, 6},               /* cost of storing integer registers */
+  {6, 6, 3},               /* cost of storing integer registers */
   2,                       /* cost of reg,reg fld/fst */
   {6, 6, 8},               /* cost of loading fp registers
                               in SFmode, DFmode and XFmode */
@@ -1572,7 +1572,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},       /* cost of loading SSE registers
                               in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},       /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},        /* cost of storing SSE registers
+  {8, 8, 8, 14, 24},       /* cost of storing SSE registers
                               in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},        /* cost of unaligned stores.  */
   2, 2,                    /* SSE->integer and integer->SSE moves */

I know that you have some concerns about the costs above; I'm saving that for the next discussion because your patch is a good foundation to proceed with cost tuning.

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #36 from sergey.shalnov at intel dot com --- The patch that fixes the issue for SKX is at https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html I will close the PR after the patch has been merged. Thank you very much to all involved. Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

sergey.shalnov at intel dot com changed:

           What       |Removed     |Added
----------------------------------------------------------------
           Status     |NEW         |RESOLVED
           Resolution |---         |FIXED

--- Comment #38 from sergey.shalnov at intel dot com --- Implemented
[Bug rtl-optimization/85017] New: Missed constant propagation to index of array reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85017

            Bug ID: 85017
           Summary: Missed constant propagation to index of array reference
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sergey.shalnov at intel dot com
  Target Milestone: ---

I found an extra instruction generated on the x86_64 platform. I have no performance data to prove the performance gap, but other compilers generate more optimal code from my point of view.

Command line (the same with -O2 and -O3):
gcc -c -O1 test.c

The source code:

char arr[400];

char foo(int c)
{
    int i = c + 20;
    return arr[i];
}

Main trunk GCC generates the following:

        addl    $20, %edi
        movslq  %edi, %rdi
        movzbl  arr(%rdi), %eax
        ret

but other compilers do better (example with clang):

        movslq  %edi, %rax
        movb    arr+20(%rax), %al
        retq

It would be better to propagate the constant $20 into the addressing mode of the array load instead of adding it to the register explicitly. Could someone please advise me on a workaround or patch? I will provide performance data for this.

Sergey
[Bug target/83008] New: [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

            Bug ID: 83008
           Summary: [performance] Is it better to avoid extra instructions
                    in data passing between loops?
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sergey.shalnov at intel dot com
  Target Milestone: ---

I found strange code generated by GCC 8.0/7.x with the following command line options: -g -Ofast -march=skylake-avx512 -ftree-vectorize

Two loops are not vectorized. The first one is not vectorized because:

test.c:6:23: note: cost model: the vector iteration cost = 1488 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test.c:6:23: note: not vectorized: vectorization not profitable.
test.c:6:23: note: not vectorized: vector version will never be profitable.

The second one is not vectorized because:

test.c:20:23: note: step unknown.
test.c:20:23: note: reduction: not commutative/associative: sum_87 = (int) _61;
test.c:20:23: note: Unknown def-use cycle pattern.
test.c:20:23: note: Unsupported pattern.
test.c:20:23: note: not vectorized: unsupported use in stmt.
test.c:20:23: note: unexpected pattern.

Looking at the asm, we find a strange method of passing data in the "tmp" array between the loops:

...loop 1 body...
 13f:   41 8d 04 13             lea    (%r11,%rdx,1),%eax
 143:   c5 f9 6e d8             vmovd  %eax,%xmm3
 147:   c5 e1 62 db             vpunpckldq %xmm3,%xmm3,%xmm3
 14b:   c5 f9 62 c0             vpunpckldq %xmm0,%xmm0,%xmm0
 14f:   c5 f1 62 c9             vpunpckldq %xmm1,%xmm1,%xmm1
 153:   c5 e9 62 d2             vpunpckldq %xmm2,%xmm2,%xmm2
 157:   c5 e9 6c d2             vpunpcklqdq %xmm2,%xmm2,%xmm2
 15b:   c5 f1 6c c9             vpunpcklqdq %xmm1,%xmm1,%xmm1
 15f:   c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
 163:   c5 e1 6c db             vpunpcklqdq %xmm3,%xmm3,%xmm3
 167:   c4 e3 6d 38 c9 01       vinserti128 $0x1,%xmm1,%ymm2,%ymm1
 16d:   c4 e3 7d 38 c3 01       vinserti128 $0x1,%xmm3,%ymm0,%ymm0
 173:   62 f3 f5 48 3a c0 01    vinserti64x4 $0x1,%ymm0,%zmm1,%zmm0
 17a:   62 f1 fd 48 7f 44 24    vmovdqa64 %zmm0,-0x40(%rsp)
 181:   ff
 182:   8b 54 24 e0             mov    -0x20(%rsp),%edx
 186:   03 54 24 f0             add    -0x10(%rsp),%edx
...loop 2 body...

If I'm not mistaken, the algorithm is as follows:
1. Do the first loop and keep the values in GPRs.
2. Move these GPRs to XMMs.
3. Pack these XMMs into YMMs.
4. Pack these YMMs into a ZMM.
5. Spill the ZMM onto the stack.
6. Get the values from the stack into GPRs for the second loop.

Might it be better, from a performance perspective, to pass the values from the first loop directly to the second loop in GPRs (without all these vector registers)?
The reproducer is:

     1  int test(unsigned char * input1, unsigned char * input2)
     2  {
     3      unsigned int tmp[4][4];
     4      unsigned int var0, var1, var2, var3;
     5      int sum = 0;
     6      for (int i = 0; i < 4; i++, input1 += 4, input2 += 4) {
     7          var0 = (input1[0] + input2[0]) + (input1[4] + input2[4]);
     8          var1 = (input1[1] + input2[1]) + (input1[5] + input2[5]);
     9          var2 = (input1[2] + input2[2]) + (input1[6] + input2[6]);
    10          var3 = (input1[3] + input2[3]) + (input1[7] + input2[7]);
    11          int inter0 = var0 + var1;
    12          int inter1 = var0 + var1;
    13          int inter2 = var2 + var3;
    14          int inter3 = var2 + var3;
    15          tmp[i][0] = inter0 + inter2;
    16          tmp[i][2] = inter0 + inter2;
    17          tmp[i][1] = inter1 + inter3;
    18          tmp[i][3] = inter1 + inter3;
    19      }
    20      for (int i = 0; i < 4; i++) {
    21          int inter0 = tmp[0][i] + tmp[1][i];
    22          int inter1 = tmp[0][i] + tmp[1][i];
    23          int inter2 = tmp[2][i] + tmp[3][i];
    24          int inter3 = tmp[2][i] + tmp[3][i];
    25          var0 = inter0 + inter2;
    26          var2 = inter0 + inter2;
    27          var1 = inter1 + inter3;
    28          var3 = inter1 + inter3;
    29          sum += var0 + var1 + var2 + var3;
    30      }
    31
    32      return sum;
    33  }

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #1 from sergey.shalnov at intel dot com --- Created attachment 42616 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42616&action=edit reproducer
[Bug tree-optimization/65930] Reduction with sign-change not handled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930

sergey.shalnov at intel dot com changed:

           What       |Removed     |Added
----------------------------------------------------------------
           CC         |            |sergey.shalnov at intel dot com

--- Comment #4 from sergey.shalnov at intel dot com --- Richard, I can confirm that the issue exists and that it can be solved by making the data types of the accumulator and the expression equal. As I understand it, you propose to introduce an intermediate internal accumulator of uint32 type and store it into the (int) accumulator later; please correct me if I'm wrong. Anyway, may I propose a slightly simpler way to solve the issue? In the GIMPLE we have (at the place of the reduction, tree-vect-loop.c:2877):

sum_12 = (int) _6;

I believe the issue disappears if we change it to:

sum_12 = _6;

I'm not sure, but if I remember correctly the C standard treats this (uint->int) cast as "undefined behavior". In that case, the compiler could do this type change (possibly under some command-line switch). What do you think, could it help?

Sergey
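For illustration, a minimal example of the source-level workaround implied above, keeping the accumulator type equal to the expression type with a single cast outside the loop (an example of my own, not from the PR):

/* With an unsigned accumulator the reduction has no sign-changing
   cast inside the loop, so it can be recognized; the single
   uint->int cast happens once, after the loop.  */
int sum_bytes (const unsigned char *a, int n)
{
  unsigned int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return (int) sum;
}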
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #5 from sergey.shalnov at intel dot com --- (In reply to Richard Biener from comment #2)
> The strange code is because we perform basic-block vectorization resulting in
>
>   vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417, _417, _417, _417, _48, _48, _48, _48};
>   MEM[(unsigned int *)&tmp] = vect_cst__249;
>   _186 = tmp[0][0];
>   _185 = tmp[1][0];
> ...
>
> which for some reason is deemed profitable:
>
> t.c:32:12: note: Cost model analysis:
>   Vector inside of basic block cost: 24
>   Vector prologue cost: 64
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 192
> t.c:32:12: note: Basic block will be vectorized using SLP
>
> what is odd is that the single vector store is costed 24 while the 16 scalar
> int stores are costed 192. The vector build from scalars costs 64.
>
> I guess Honza's cost-model tweaks might have gone wrong here or we're hitting
> an oddity in the SLP costing.
>
> Even if it looks strange maybe the sequence _is_ profitable?
>
> The second loop would be vectorized if 'sum' was unsigned.

Richard,
No, the sequence is not profitable. If we don't use any vector registers here, the performance is better on all architectures. I'm talking about the vectorized code only here. I'm looking into the vect_stmt_relevant_p() function to implement additional restrictions and avoid basic-block vectorization when the loop is not vectorized. If you have any idea how to avoid vectorization in this particular place, please let me know.

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #6 from sergey.shalnov at intel dot com --- I found the bug report related to the vectorization issue in the second loop (reduction uint->int): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #8 from sergey.shalnov at intel dot com --- Richard, these are great changes, and I see that the first loop now vectorizes for the test example I provided with the gcc-8.0 main trunk. But I think the issue is a bit more complicated: vectorization of the first loop just hides the issue I reported. Currently I see the following for the test loop "for (int i = 0; i < 4; i++, input1 += 4, input2 += 4)":

test_bugzilla.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 1136
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0
test_bugzilla.c:6:5: note: Runtime profitability threshold = 4
test_bugzilla.c:6:5: note: Static estimate profitability threshold = 4
test_bugzilla.c:6:5: note: loop vectorized

If I slightly change the loop (to be closer to the real application), "for (int i = 0; i < 4; i++, input1 += stride1, input2 += stride2)", I get:

test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 5232
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 5232 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be profitable.

and the issue with the extra vector operations remains the same. I'm not sure, but I think it is really profitable to avoid vector register usage when the loop is not vectorized. Do you agree?

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #9 from sergey.shalnov at intel dot com --- Created attachment 42813 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42813&action=edit New reproducer Slightly changed first loop
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #11 from sergey.shalnov at intel dot com --- Richard,
"Is this about the "stupid" attempt to use as little AVX512 as possible"
No, it is not. I provided the asm listing at the beginning with zmm only to illustrate the issue more clearly; I used the "-mprefer-vector-width=512" option to get that asm output. I believe that removing these registers (any of them, including ymm and xmm) from the BB you mentioned would be profitable on all architectures.

"Sergey, can you test this?"
Yes, going to try this.

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #12 from sergey.shalnov at intel dot com --- Richard, your last proposal changed the generated code a bit. Currently it shows:

test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 62576
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 62576 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be profitable.

And it uses xmm + vpbroadcastd to spill tmp[] to the stack:

...
 1e7:   62 d2 7d 08 7c c9       vpbroadcastd %r9d,%xmm1
 1ed:   c4 c1 79 7e c9          vmovd  %xmm1,%r9d
 1f2:   62 f1 fd 08 7f 8c 24    vmovdqa64 %xmm1,-0x38(%rsp)
 1f9:   c8 ff ff ff
 1fd:   62 f2 7d 08 7c d7       vpbroadcastd %edi,%xmm2
 203:   c5 f9 7e d7             vmovd  %xmm2,%edi
 207:   62 f1 fd 08 7f 94 24    vmovdqa64 %xmm2,-0x28(%rsp)
 20e:   d8 ff ff ff
 212:   62 f2 7d 08 7c db       vpbroadcastd %ebx,%xmm3
 218:   c5 f9 7e de             vmovd  %xmm3,%esi
 21c:   62 f1 fd 08 7f 9c 24    vmovdqa64 %xmm3,-0x18(%rsp)
 223:   e8 ff ff ff
 227:   01 fe                   add    %edi,%esi
 229:   45 01 c8                add    %r9d,%r8d
 22c:   41 01 f0                add    %esi,%r8d
 22f:   8b 5c 24 dc             mov    -0x24(%rsp),%ebx
 233:   03 5c 24 ec             add    -0x14(%rsp),%ebx
 237:   8b 6c 24 bc             mov    -0x44(%rsp),%ebp
 23b:   03 6c 24 cc             add    -0x34(%rsp),%ebp
...

I think this is better from a performance perspective but, as I said before, not using vector registers here at all is the best option when no loops are vectorized. With the static loop increment (the first test case), the first loop is vectorized as before.

Sergey