Fold pointer range checks with equal spans

Richard Sandiford Fri, 20 Jul 2018 03:28:19 -0700

When checking whether vectorised accesses at A and B are independent,
the vectoriser falls back to tests of the form:


    A + size <= B || B + size <= A

But in the common case that "size" is just the constant size of a vector
(or a small multiple), it would be more efficient to do:

   ((size_t) A - (size_t) B + size - 1) > (size - 1) * 2

This patch adds folds to do that.  E.g. before the patch, the alias
checks for:

  for (int j = 0; j < n; ++j)
    {
      for (int i = 0; i < 16; ++i)
        a[i] = (b[i] + c[i]) >> 1;
      a += step;
      b += step;
      c += step;
    }

were:

        add     x7, x1, 15
        add     x5, x0, 15
        cmp     x0, x7
        add     x7, x2, 15
        ccmp    x1, x5, 2, ls
        cset    w8, hi
        cmp     x0, x7
        ccmp    x2, x5, 2, ls
        cset    w4, hi
        tst     w8, w4

while after the patch they're:

        sub     x7, x0, x1
        sub     x5, x0, x2
        add     x7, x7, 15
        add     x5, x5, 15
        cmp     x7, 30
        ccmp    x5, 30, 0, hi

The old scheme needs:

[A] one addition per vector pointer
[B] two comparisons and one IOR per range check

The new one needs:

[C] one subtraction, one addition and one comparison per range check

The range checks are then ANDed together, with the same number of
ANDs either way.

With conditional comparisons (such as on AArch64), we're able to remove
the IOR between comparisons in the old scheme, but then need an explicit
AND or branch when combining the range checks, as the example above shows.
With the new scheme we can instead use conditional comparisons for
the AND chain.

So even with conditional comparisons, the new scheme should in practice
be a win in almost all cases.  Without conditional comparisons, the new
scheme removes [A] and replaces [B] with an equivalent number of operations
[C], so should always give fewer operations overall.  Although each [C]
is linear, multiple [C]s are easily parallelisable.

A better implementation of the above would be:

        add     x5, x0, 15
        sub     x7, x5, x1
        sub     x5, x5, x2
        cmp     x7, 30
        ccmp    x5, 30, 0, hi

where we add 15 to "a" before the subtraction.  Unfortunately,
canonicalisation rules mean that even if we try to create it in
that form, it gets folded into the one above instead.

An alternative would be not to do this in match.pd and instead get
tree-data-ref.c to do it itself.  I started out that way but thought
the match.pd approach seemed cleaner.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-07-20  Richard Sandiford  <[email protected]>

gcc/
        * match.pd: Optimise pointer range checks.

gcc/testsuite/
        * gcc.dg/pointer-range-check-1.c: New test.
        * gcc.dg/pointer-range-check-2.c: Likewise.

Index: gcc/match.pd
===================================================================
--- gcc/match.pd        2018-07-18 18:44:22.565914281 +0100
+++ gcc/match.pd        2018-07-20 11:24:33.692045585 +0100
@@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (if (inverse_conditions_p (@0, @2)
         && element_precision (type) == element_precision (op_type))
     (view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))))))
+
+/* For pointers @0 and @2 and unsigned constant offset @1, look for
+   expressions like:
+
+   A: (@0 + @1 < @2) | (@2 + @1 < @0)
+   B: (@0 + @1 <= @2) | (@2 + @1 <= @0)
+
+   If pointers are known not to wrap, B checks whether @1 bytes starting at
+   @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes.
+   A is more efficiently tested as:
+
+   ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2)
+
+   as long as @1 * 2 doesn't overflow.  B is the same with @1 replaced
+   with @1 - 1.  */
+(for ior (truth_orif truth_or bit_ior)
+ (for cmp (le lt)
+  (simplify
+   (ior (cmp (pointer_plus:s @0 INTEGER_CST@1) @2)
+       (cmp (pointer_plus:s @2 @1) @0))
+   (if (!flag_trapv && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+    /* Convert the B form to the A form.  */
+    (with { offset_int off = wi::to_offset (@1) - (cmp == LE_EXPR ? 1 : 0); }
+     /* Always fails for negative values.  */
+     (if (wi::min_precision (off, UNSIGNED) * 2 <= TYPE_PRECISION (sizetype))
+      /* It doesn't matter whether we use @2 - @0 or @0 - @2, so let
+        tree_swap_operands_p pick a canonical order.  */
+      (with { tree ptr1 = @0, ptr2 = @2;
+             if (tree_swap_operands_p (ptr1, ptr2))
+               std::swap (ptr1, ptr2); }
+       (gt (plus (minus (convert:sizetype { ptr1; })
+                       (convert:sizetype { ptr2; }))
+                { wide_int_to_tree (sizetype, off); })
+          { wide_int_to_tree (sizetype, off * 2); }))))))))
Index: gcc/testsuite/gcc.dg/pointer-range-check-1.c
===================================================================
--- /dev/null   2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/pointer-range-check-1.c        2018-07-20 
11:24:33.692045585 +0100
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-ipa-icf -fdump-tree-optimized" } */
+
+/* All four functions should be folded to:
+
+   ((sizetype) a - (sizetype) b + 15) < 30.  */
+
+_Bool
+f1 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f2 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+_Bool
+f3 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f4 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+/* { dg-final { scan-tree-dump-times { = [^\n]* - [^\n]*;} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 4 "optimized" } } 
*/
+/* { dg-final { scan-tree-dump-times { = [^\n]*\ > [^\n]*;} 4 "optimized" } } 
*/
+/* { dg-final { scan-tree-dump-not {=[^\n]*\ < [^\n]*;} "optimized" } } */
+/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { > 30} 4 "optimized" } } */
Index: gcc/testsuite/gcc.dg/pointer-range-check-2.c
===================================================================
--- /dev/null   2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/pointer-range-check-2.c        2018-07-20 
11:24:33.692045585 +0100
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-ipa-icf -fwrapv-pointer -fdump-tree-optimized" } */
+
+_Bool
+f1 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f2 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+_Bool
+f3 (char *a, char *b)
+{
+  return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f4 (char *a, char *b)
+{
+  return (a + 15 < b) || (b + 15 < a);
+}
+
+/* { dg-final { scan-tree-dump-not { = [^\n]* - [^\n]*;} "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 8 "optimized" } } 
*/
+/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { \+ 16} 4 "optimized" } } */

Fold pointer range checks with equal spans

Reply via email to