On 2021/6/28 16:25, Richard Biener wrote:
> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/6/25 18:02, Richard Biener wrote:
>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luo...@linux.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2021/6/25 16:54, Richard Biener wrote:
>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>
>>>>>> From: Xiong Hu Luo <luo...@linux.ibm.com>
>>>>>>
>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>>>> on Power.  For example, it generates a mismatched address offset after
>>>>>> adjusting the IV update statement position:
>>>>>>
>>>>>> <bb 32> [local count: 70988443]:
>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>>   _34 = ref_180 + 18446744073709551615;
>>>>>>   _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>>>   if (_84 == _86)
>>>>>>     goto <bb 56>; [94.50%]
>>>>>>   else
>>>>>>     goto <bb 87>; [5.50%]
>>>>>>
>>>>>> Disabling it produces:
>>>>>>
>>>>>> <bb 32> [local count: 70988443]:
>>>>>>   _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>>   _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>>   if (_84 == _86)
>>>>>>     goto <bb 56>; [94.50%]
>>>>>>   else
>>>>>>     goto <bb 87>; [5.50%]
>>>>>>
>>>>>> The later loop unrolling pass can then benefit from the same address
>>>>>> offset being used with different base addresses, which reduces register
>>>>>> dependencies.  This patch improves performance by 10% for a typical
>>>>>> case on Power; no performance change is observed on X86 or AArch64,
>>>>>> since the small loops are not unrolled on those platforms.  Any comments?
>>>>>
>>>>> The case you quote is special in that if we hoisted the IV update before
>>>>> the other MEM _also_ used in the condition it would be fine again.
>>>>
>>>> Thanks.  I tried hoisting the IV update statement before the first MEM
>>>> (Fix 2); it shows even worse performance because the loop is not
>>>> unrolled (two more "base-1" statements are generated in GIMPLE, so
>>>> loop->ninsns is 11 and the small-loop unroller skips it).  Changing the
>>>> threshold from 10 to 12 in rs6000_loop_unroll_adjust makes it unroll
>>>> 2 times as well, but the performance is then the SAME as with the IV
>>>> update statement in the *MIDDLE* (trunk).  From the ASM we can see that
>>>> the index register %r4 is used in two iterations, which may be a
>>>> bottleneck for hiding instruction latency.
>>>>
>>>> So it seems reasonable that performance is better if the IV update
>>>> statement is kept *LAST* (Fix 1).
>>>>
>>>> (Fix 2):
>>>> <bb 32> [local count: 70988443]:
>>>>   ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>   _34 = ip_229 + 18446744073709551615;
>>>>   _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>   _33 = ref_180 + 18446744073709551615;
>>>>   _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>>>>   if (_84 == _86)
>>>>     goto <bb 56>; [94.50%]
>>>>   else
>>>>     goto <bb 87>; [5.50%]
>>>>
>>>>
>>>> .L67:
>>>>         lbzx %r12,%r24,%r4
>>>>         lbzx %r25,%r7,%r4
>>>>         cmpw %cr0,%r12,%r25
>>>>         bne %cr0,.L11
>>>>         mr %r26,%r4
>>>>         addi %r4,%r4,1
>>>>         lbzx %r12,%r24,%r4
>>>>         lbzx %r25,%r7,%r4
>>>>         mr %r6,%r26
>>>>         cmpw %cr0,%r12,%r25
>>>>         bne %cr0,.L11
>>>>         mr %r26,%r4
>>>> .L12:
>>>>         cmpdi %cr0,%r10,1
>>>>         addi %r4,%r26,1
>>>>         mr %r6,%r26
>>>>         addi %r10,%r10,-1
>>>>         bne %cr0,.L67
>>>>
>>>>>
>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>>>> condition actually uses the IV use stmt def, so it likely applies to
>>>>> too many cases.
>>>>>
>>>>> Unfortunately the introducing rev didn't come with a testcase,
>>>>> but still I think fixing up adjust_iv_update_pos is better than
>>>>> introducing a way to short-cut it via a per-target decision.
>>>>>
>>>>> One "fix" might be to add a check that either the condition
>>>>> lhs or rhs is the def of the IV use and the other operand
>>>>> is invariant. Or if it's of similar structure hoist across the
>>>>> other iv-use as well. Not that I understand the argument
>>>>> about the overlapping life-range.
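(For my own understanding, a minimal sketch of that first check -- my
reading of the suggestion, untested and not part of the patch at the end
of this mail -- inside adjust_iv_update_pos, right after the GIMPLE_COND
has been identified in "stmt", could be:

  gcond *cond = as_a <gcond *> (stmt);
  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);
  /* Require one operand of the condition to be defined by the IV use's
     statement; the invariance check on the other operand is omitted in
     this sketch.  */
  bool lhs_is_use = (TREE_CODE (lhs) == SSA_NAME
                     && SSA_NAME_DEF_STMT (lhs) == use->stmt);
  bool rhs_is_use = (TREE_CODE (rhs) == SSA_NAME
                     && SSA_NAME_DEF_STMT (rhs) == use->stmt);
  if (!lhs_is_use && !rhs_is_use)
    return;

The patch below takes a different route and checks both operands
instead.)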
>>>>>
>>>>> You also don't provide a complete testcase ...
>>>>>
>>>>
>>>> Attached the test code; I will also add it to the patch in a future
>>>> version.  The issue comes from a very small hot loop:
>>>>
>>>>   do {
>>>>     len++;
>>>>   } while (len < maxlen && ip[len] == ref[len]);
>>>
>>> unsigned int foo (unsigned char *ip, unsigned char *ref,
>>>                   unsigned int maxlen)
>>> {
>>>   unsigned int len = 2;
>>>   do {
>>>     len++;
>>>   } while (len < maxlen && ip[len] == ref[len]);
>>>   return len;
>>> }
>>>
>>> I can see the effect on this loop on x86_64 as well; we end up with
>>>
>>> .L6:
>>>         movzbl (%rdi,%rax), %ecx
>>>         addq $1, %rax
>>>         cmpb -1(%rsi,%rax), %cl
>>>         jne .L1
>>> .L3:
>>>         movl %eax, %r8d
>>>         cmpl %edx, %eax
>>>         jb .L6
>>>
>>> but without the trick it is
>>>
>>> .L6:
>>>         movzbl (%rdi,%rax), %r8d
>>>         movzbl (%rsi,%rax), %ecx
>>>         addq $1, %rax
>>>         cmpb %cl, %r8b
>>>         jne .L1
>>> .L3:
>>>         movl %eax, %r9d
>>>         cmpl %edx, %eax
>>>         jb .L6
>>
>> Verified this small piece of code on X86; there is no performance
>> change with or without adjust_iv_update_pos (I've checked that the
>> ASM is exactly the same as yours):
>>
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
>>
>> real 0m7.003s
>> user 0m6.996s
>> sys 0m0.000s
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out 1
>>
>> real 0m7.070s
>> user 0m7.068s
>> sys 0m0.000s
>>
>>
>> But for AArch64, current GCC generates similar code with or without
>> adjust_iv_update_pos; the runtime is 10.705s for both.
>>
>> .L6:
>>         ldrb w4, [x6, x3]
>>         add x3, x3, 1
>>         ldrb w5, [x1, x3]
>>         cmp w5, w4
>>         bne .L1
>> .L3:
>>         mov w0, w3
>>         cmp w2, w3
>>         bhi .L6
>>
>> No adjust_iv_update_pos:
>>
>> .L6:
>>         ldrb w5, [x6, x3]
>>         ldrb w4, [x1, x3]
>>         add x3, x3, 1
>>         cmp w5, w4
>>         bne .L1
>> .L3:
>>         mov w0, w3
>>         cmp w2, w3
>>         bhi .L6
>>
>>
>> But when built with an old GCC (7.4.1 20190424), it generates worse
>> code, and the runtime is 11.664s:
>>
>> .L6:
>>         ldrb w4, [x6, x3]
>>         add x3, x3, 1
>>         add x5, x1, x3
>>         ldrb w5, [x5, -1]
>>         cmp w5, w4
>>         bne .L1
>> .L3:
>>         cmp w3, w2
>>         mov w0, w3
>>         bcc .L6
>>
>>
>> In general, only Power shows a performance regression with
>> adjust_iv_update_pos, which is why I tried adding a target hook for it.
>> Is that reasonable?  Or should we just remove adjust_iv_update_pos,
>> since it doesn't help performance on X86 or other targets?
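(For reference, the timings above were taken with a driver of roughly
the following shape.  This is a reconstruction for illustration, not the
attached test.c; the buffer size and iteration count are made up:

  #include <cstdio>
  #include <cstring>

  __attribute__ ((noinline)) unsigned int
  foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
  {
    unsigned int len = 2;
    do
      len++;
    while (len < maxlen && ip[len] == ref[len]);
    return len;
  }

  int
  main (void)
  {
    enum { N = 1 << 16 };
    static unsigned char ip[N], ref[N];
    /* Identical buffers, so the loop always runs to maxlen.  */
    memset (ip, 'a', N);
    memset (ref, 'a', N);
    unsigned long long sum = 0;
    for (long i = 0; i < 100000; i++)
      sum += foo (ip, ref, N - 1);
    printf ("%llu\n", sum);
    return 0;
  }
)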
>
> It was Bin who added the functionality - maybe he can remember the cases
> he saw improvements for.
>
> I think we should instead try to change the code so that it doesn't apply to
> CONDs where both operands are defined by stmts using the IV?
> And this time add a testcase ;)
Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs,
or whether checking cond_lhs alone is enough, if the COND expr's LHS is
always processed first.  For example, adjust_iv_update_pos will be called
twice for this case; after the first call, the GIMPLE is updated to:
<bb 4> [local count: 1014686026]:
  _1 = (sizetype) len_8;
  _2 = ip_10(D) + _1;
  _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
  _5 = ref_12(D) + _1;
  _6 = *_5;
  ivtmp.15_16 = ivtmp.15_15 + 1;
  if (_3 == _6)
    goto <bb 6>; [94.50%]
  else
    goto <bb 10>; [5.50%]
At this moment, use->stmt is "_6 = *_5;", so it does not use the IV
directly?
[PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in COND [PR101250]
gcc/ChangeLog:

	PR middle-end/101250
	* tree-ssa-loop-ivopts.c (adjust_iv_update_pos): Check whether
	both COND operands use the IV.
---
gcc/testsuite/gcc.dg/tree-ssa/pr101250.c | 15 ++++++++++++
gcc/tree-ssa-loop-ivopts.c | 31 +++++++++++++++++++++++-
2 files changed, 45 insertions(+), 1 deletion(-)
create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
new file mode 100644
index 00000000000..70d3701de49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
+{
+  unsigned int len = 2;
+  do
+    {
+      len++;
+    }
+  while (len < maxlen && ip[len] == ref[len]);
+  return len;
+}
+
+/* { dg-final { scan-tree-dump-not "Reordering" "ivopts" } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 12a8a49a307..082097bff38 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -7352,7 +7352,7 @@ static void
 adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
 {
   tree var_after;
-  gimple *iv_update, *stmt;
+  gimple *iv_update, *stmt, *cond_stmt;
   basic_block bb;
   gimple_stmt_iterator gsi, gsi_iv;

@@ -7370,6 +7370,7 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
   if (gimple_code (stmt) != GIMPLE_COND)
     return;

+  cond_stmt = stmt;
   gsi_prev_nondebug (&gsi);
   stmt = gsi_stmt (gsi);
   if (stmt != iv_update)
@@ -7386,6 +7387,34 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
   if (stmt != use->stmt)
     return;

+  gcond *gcond_stmt = as_a <gcond *> (cond_stmt);
+  tree *cond_lhs = gimple_cond_lhs_ptr (gcond_stmt);
+  tree *cond_rhs = gimple_cond_rhs_ptr (gcond_stmt);
+  if (TREE_CODE (*cond_lhs) == SSA_NAME && TREE_CODE (*cond_rhs) == SSA_NAME)
+    {
+      /* If both sides are memory operations that use the IV's var_before,
+         moving the IV update stmt before the second memory operation would
+         cause an offset index mismatch, which may hurt performance after
+         unrolling, so return here to avoid the reordering.
+
+           _1 = (sizetype) len_8;
+           _2 = ip_10(D) + _1;
+           _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
+           _5 = ref_12(D) + _1;
+           _6 = *_5;
+           ivtmp.15_16 = ivtmp.15_15 + 1;
+           if (_3 == _6).  */
+
+      gimple *lhs_mem = SSA_NAME_DEF_STMT (*cond_lhs);
+      gimple *rhs_mem = SSA_NAME_DEF_STMT (*cond_rhs);
+      gimple *use_stmt;
+      imm_use_iterator use_iter;
+      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, cand->var_before)
+        if ((use_stmt == lhs_mem || use_stmt == rhs_mem)
+            && use_stmt != use->stmt)
+          return;
+    }
+
   if (TREE_CODE (gimple_assign_lhs (stmt)) != SSA_NAME)
     return;

--
2.27.0.90.geebb51ba8c
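(For manual verification outside the testsuite: compile with
-O2 -fdump-tree-ivopts-details and grep the ivopts dump for "Reordering";
with the patch the message no longer appears for this loop, which is what
the scan-tree-dump-not in the testcase checks.  Something along these
lines, where the dump file suffix varies by GCC version:

  $ ./xgcc -B. -O2 -fdump-tree-ivopts-details -S pr101250.c
  $ grep -c "Reordering" pr101250.c.*t.ivopts
  0
)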
--
Thanks,
Xionghu