When checking whether vectorised accesses at A and B are independent, the vectoriser falls back to tests of the form:
A + size <= B || B + size <= A But in the common case that "size" is just the constant size of a vector (or a small multiple), it would be more efficient to do: ((size_t) A - (size_t) B + size - 1) > (size - 1) * 2 This patch adds folds to do that. E.g. before the patch, the alias checks for: for (int j = 0; j < n; ++j) { for (int i = 0; i < 16; ++i) a[i] = (b[i] + c[i]) >> 1; a += step; b += step; c += step; } were: add x7, x1, 15 add x5, x0, 15 cmp x0, x7 add x7, x2, 15 ccmp x1, x5, 2, ls cset w8, hi cmp x0, x7 ccmp x2, x5, 2, ls cset w4, hi tst w8, w4 while after the patch they're: sub x7, x0, x1 sub x5, x0, x2 add x7, x7, 15 add x5, x5, 15 cmp x7, 30 ccmp x5, 30, 0, hi The old scheme needs: [A] one addition per vector pointer [B] two comparisons and one IOR per range check The new one needs: [C] one subtraction, one addition and one comparison per range check The range checks are then ANDed together, with the same number of ANDs either way. With conditional comparisons (such as on AArch64), we're able to remove the IOR between comparisons in the old scheme, but then need an explicit AND or branch when combining the range checks, as the example above shows. With the new scheme we can instead use conditional comparisons for the AND chain. So even with conditional comparisons, the new scheme should in practice be a win in almost all cases. Without conditional comparisons, the new scheme removes [A] and replaces [B] with an equivalent number of operations [C], so should always give fewer operations overall. Although each [C] is linear, multiple [C]s are easily parallelisable. A better implementation of the above would be: add x5, x0, 15 sub x7, x5, x1 sub x5, x5, x2 cmp x7, 30 ccmp x5, 30, 0, hi where we add 15 to "a" before the subtraction. Unfortunately, canonicalisation rules mean that even if we try to create it in that form, it gets folded into the one above instead. An alternative would be not to do this in match.pd and instead get tree-data-ref.c to do it itself. I started out that way but thought the match.pd approach seemed cleaner. Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf and x86_64-linux-gnu. OK to install? Richard 2018-07-20 Richard Sandiford <richard.sandif...@arm.com> gcc/ * match.pd: Optimise pointer range checks. gcc/testsuite/ * gcc.dg/pointer-range-check-1.c: New test. * gcc.dg/pointer-range-check-2.c: Likewise. Index: gcc/match.pd =================================================================== --- gcc/match.pd 2018-07-18 18:44:22.565914281 +0100 +++ gcc/match.pd 2018-07-20 11:24:33.692045585 +0100 @@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (inverse_conditions_p (@0, @2) && element_precision (type) == element_precision (op_type)) (view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1))))))) + +/* For pointers @0 and @2 and unsigned constant offset @1, look for + expressions like: + + A: (@0 + @1 < @2) | (@2 + @1 < @0) + B: (@0 + @1 <= @2) | (@2 + @1 <= @0) + + If pointers are known not to wrap, B checks whether @1 bytes starting at + @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes. + A is more efficiently tested as: + + ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2) + + as long as @1 * 2 doesn't overflow. B is the same with @1 replaced + with @1 - 1. */ +(for ior (truth_orif truth_or bit_ior) + (for cmp (le lt) + (simplify + (ior (cmp (pointer_plus:s @0 INTEGER_CST@1) @2) + (cmp (pointer_plus:s @2 @1) @0)) + (if (!flag_trapv && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))) + /* Convert the B form to the A form. */ + (with { offset_int off = wi::to_offset (@1) - (cmp == LE_EXPR ? 1 : 0); } + /* Always fails for negative values. */ + (if (wi::min_precision (off, UNSIGNED) * 2 <= TYPE_PRECISION (sizetype)) + /* It doesn't matter whether we use @2 - @0 or @0 - @2, so let + tree_swap_operands_p pick a canonical order. */ + (with { tree ptr1 = @0, ptr2 = @2; + if (tree_swap_operands_p (ptr1, ptr2)) + std::swap (ptr1, ptr2); } + (gt (plus (minus (convert:sizetype { ptr1; }) + (convert:sizetype { ptr2; })) + { wide_int_to_tree (sizetype, off); }) + { wide_int_to_tree (sizetype, off * 2); })))))))) Index: gcc/testsuite/gcc.dg/pointer-range-check-1.c =================================================================== --- /dev/null 2018-07-09 14:52:09.234750850 +0100 +++ gcc/testsuite/gcc.dg/pointer-range-check-1.c 2018-07-20 11:24:33.692045585 +0100 @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-ipa-icf -fdump-tree-optimized" } */ + +/* All four functions should be folded to: + + ((sizetype) a - (sizetype) b + 15) < 30. */ + +_Bool +f1 (char *a, char *b) +{ + return (a + 16 <= b) || (b + 16 <= a); +} + +_Bool +f2 (char *a, char *b) +{ + return (a + 15 < b) || (b + 15 < a); +} + +_Bool +f3 (char *a, char *b) +{ + return (a + 16 <= b) || (b + 16 <= a); +} + +_Bool +f4 (char *a, char *b) +{ + return (a + 15 < b) || (b + 15 < a); +} + +/* { dg-final { scan-tree-dump-times { = [^\n]* - [^\n]*;} 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times { = [^\n]*\ > [^\n]*;} 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-not {=[^\n]*\ < [^\n]*;} "optimized" } } */ +/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times { > 30} 4 "optimized" } } */ Index: gcc/testsuite/gcc.dg/pointer-range-check-2.c =================================================================== --- /dev/null 2018-07-09 14:52:09.234750850 +0100 +++ gcc/testsuite/gcc.dg/pointer-range-check-2.c 2018-07-20 11:24:33.692045585 +0100 @@ -0,0 +1,31 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-ipa-icf -fwrapv-pointer -fdump-tree-optimized" } */ + +_Bool +f1 (char *a, char *b) +{ + return (a + 16 <= b) || (b + 16 <= a); +} + +_Bool +f2 (char *a, char *b) +{ + return (a + 15 < b) || (b + 15 < a); +} + +_Bool +f3 (char *a, char *b) +{ + return (a + 16 <= b) || (b + 16 <= a); +} + +_Bool +f4 (char *a, char *b) +{ + return (a + 15 < b) || (b + 15 < a); +} + +/* { dg-final { scan-tree-dump-not { = [^\n]* - [^\n]*;} "optimized" } } */ +/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 8 "optimized" } } */ +/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */ +/* { dg-final { scan-tree-dump-times { \+ 16} 4 "optimized" } } */