Re: [PATCH] match.pd: Add std::pow folding optimizations.

Jennifer Schmitz Wed, 23 Oct 2024 01:55:23 -0700


> On 22 Oct 2024, at 13:14, Richard Biener <rguent...@suse.de> wrote:
> 
> 
> On Tue, 22 Oct 2024, Jennifer Schmitz wrote:
> 
>> 
>> 
>>> On 22 Oct 2024, at 11:05, Richard Biener <rguent...@suse.de> wrote:
>>> 
>>> 
>>> On Tue, 22 Oct 2024, Jennifer Schmitz wrote:
>>> 
>>>> 
>>>> 
>>>>> On 21 Oct 2024, at 10:51, Richard Biener <rguent...@suse.de> wrote:
>>>>> 
>>>>> 
>>>>> On Fri, 18 Oct 2024, Jennifer Schmitz wrote:
>>>>> 
>>>>>> This patch adds the following two simplifications in match.pd:
>>>>>> - pow (1.0/x, y) to pow (x, -y), avoiding the division
>>>>>> - pow (0.0, x) to 0.0, avoiding the call to pow.
>>>>>> The patterns are guarded by flag_unsafe_math_optimizations,
>>>>>> !flag_trapping_math, !flag_errno_math, !HONOR_SIGNED_ZEROS,
>>>>>> and !HONOR_INFINITIES.
>>>>>> 
>>>>>> Tests were added to confirm the application of the transform for float,
>>>>>> double, and long double.
>>>>>> 
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu and
>>>>>> x86_64-linux-gnu, no regression.
>>>>>> OK for mainline?
>>>>>> 
>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>> 
>>>>>> gcc/
>>>>>>    * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
>>>>>>    pow (0.0, x) -> 0.0.
>>>>>> 
>>>>>> gcc/testsuite/
>>>>>>    * gcc.dg/tree-ssa/pow_fold_1.c: New test.
>>>>>> ---
>>>>>> gcc/match.pd                               | 14 +++++++++
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 34 ++++++++++++++++++++++
>>>>>> 2 files changed, 48 insertions(+)
>>>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
>>>>>> 
>>>>>> diff --git a/gcc/match.pd b/gcc/match.pd
>>>>>> index 12d81fcac0d..ba100b117e7 100644
>>>>>> --- a/gcc/match.pd
>>>>>> +++ b/gcc/match.pd
>>>>>> @@ -8203,6 +8203,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>>>>>  (rdiv @0 (exps:s @1))
>>>>>>   (mult @0 (exps (negate @1)))))
>>>>>> 
>>>>>> + /* Simplify pow(1.0/x, y) into pow(x, -y).  */
>>>>>> + (if (! HONOR_INFINITIES (type)
>>>>>> +      && ! HONOR_SIGNED_ZEROS (type)
>>>>>> +      && ! flag_trapping_math
>>>>>> +      && ! flag_errno_math)
>>>>>> +  (simplify
>>>>>> +   (POW (rdiv:s real_onep@0 @1) @2)
>>>>>> +    (POW @1 (negate @2)))
>>>>> 
>>>>> This one shouldn't need HONOR_SIGNED_ZEROS?
>>>>> 
>>>>>> +
>>>>>> +  /* Simplify pow(0.0, x) into 0.0.  */
>>>>>> +  (simplify
>>>>>> +   (POW real_zerop@0 @1)
>>>>> 
>>>>> I think this needs !HONOR_NANS (type)?
>>>>> 
>>>>> Otherwise OK.
>>>> Thanks for the feedback, Richard and Andrew. I made the following changes 
>>>> to the patch (current version of the patch below):
>>>> - also applied the pattern to POWI and added tests for powi, powif, powil
>>>> - the first pattern is no longer gated under !HONOR_SIGNED_ZEROS, and the 
>>>> second one is additionally gated under !HONOR_NANS (type)
>>>> - added tests for powf16
>>> 
>>> Note powi is GCC internal, it doesn't set errno and it should be subject
>>> to different rules - I'd rather have patterns working on powi separate.
>> How about moving the patterns for POWI into the section 
>> flag_unsafe_math_optimizations && canonicalize_math_p () and not gating 
>> them on !flag_errno_math?
> 
> Sounds good.
> 
>>> 
>>>> Now, I am encountering two problems:
>>>> 
>>>> First, the transform is not applied for float16 (even with 
>>>> -fexcess-precision=16). Do you know what the problem could be?
>>> 
>>> I think you want to use POW_ALL instead of POW.  The generated
>>> cfn-operators.pd shows
>>> 
>>> (define_operator_list POW
>>>   BUILT_IN_POWF
>>>   BUILT_IN_POW
>>>   BUILT_IN_POWL
>>>   IFN_POW)
>>> (define_operator_list POW_FN
>>>   BUILT_IN_POWF16
>>>   BUILT_IN_POWF32
>>>   BUILT_IN_POWF64
>>>   BUILT_IN_POWF128
>>>   BUILT_IN_POWF32X
>>>   BUILT_IN_POWF64X
>>>   BUILT_IN_POWF128X
>>>   null)
>>> (define_operator_list POW_ALL
>>>   BUILT_IN_POWF
>>>   BUILT_IN_POW
>>>   BUILT_IN_POWL
>>>   BUILT_IN_POWF16
>>> ...
>>> 
>>> Note this comes at the expense of more generated code (in
>>> gimple/generic-match.cc).
>> Thanks, that solved the Float16 issue.
>>> 
>>>> Second, validation on aarch64 shows a regression in tests
>>>> - gcc.dg/recip_sqrt_mult_1.c and
>>>> - gcc.dg/recip_sqrt_mult_5.c,
>>>> because the pattern (POWI(1/x, y) -> POWI(x, -y)) is applied before the 
>>>> recip pass and prevents application of the recip-patterns. The reason for 
>>>> this might be that the single-use restriction only works if the integer 
>>>> argument is non-constant, but in the failing test cases, the integer 
>>>> argument is 2 and the pattern is applied despite the :s flag.
>>>> For example, my pattern is **not** applied (single-use restriction works) 
>>>> for:
>>>> double res, res2;
>>>> void foo (double a, int b)
>>>> {
>>>> double f (double);
>>>> double t1 = 1.0 / a;
>>>> res = __builtin_powi (t1, b);
>>>> res2 = f (t1);
>>>> }
>>>> 
>>>> But the pattern **is** applied and single-use restriction does **not** 
>>>> work for:
>>>> double res, res2;
>>>> void foo (double a)
>>>> {
>>>> double f (double);
>>>> double t1 = 1.0 / a;
>>>> res = __builtin_powi (t1, 2);
>>>> res2 = f (t1);
>>>> }
>>> 
>>> This must be because the result is a single operation.  :s only applies
>>> when the result has sub-expressions.  This is to make CSE work.
>>> The "fix" is to add explicit && single_use (@n) to override that
>>> behavior.  Note that I think the transform is good even when the
>>> division is used because the result reduces the dependence chain length.
>>> It's only when @2 is non-constant that we're introducing another
>>> stmt for the negation that re-introduces this latency (even if in
>>> practice it would be smaller).
>>> 
>>>> Possible options to resolve this are:
>>>> - gate pattern to run after recip pass
>>>> - do not apply pattern for POWI
>>> 
>>> - adjust the testcase (is the final outcome still good?)
>> Without the patch, there is one fewer fdiv instruction (below is the assembly 
>> for recip_sqrt_mult_1.c; _5.c is analogous):
>> No patch or with single_use of the result of the division:
>> foo:
>>      fmov d30, 1.0e+0
>>      fsqrt d31, d0
>>      adrp x0, .LANCHOR0
>>      add x1, x0, :lo12:.LANCHOR0
>>      fdiv d30, d30, d0
>>      fmul d0, d31, d30
>>      str d0, [x0, #:lo12:.LANCHOR0]
>>      stp d30, d31, [x1, 8]
>>      ret
>> 
>> With patch:
>> foo:
>>        fsqrt   d31, d0
>>        fmov    d30, 1.0e+0
>>        adrp    x1, .LANCHOR0
>>        add     x0, x1, :lo12:.LANCHOR0
>>        fdiv    d31, d30, d31
>>        fdiv    d30, d30, d0
>>        str     d31, [x1, #:lo12:.LANCHOR0]
>>        fmul    d31, d31, d0
>>        stp     d30, d31, [x0, 8]
>>        ret
>> So, we might want to use the single_use guard.
> 
> Yeah, this is because the powi inline expansion will add back
> the division.
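
To illustrate why the division comes back: a minimal sketch, assuming powi
with a negative exponent is expanded as repeated multiplication followed by
one reciprocal. The helper powi_sketch is hypothetical and is not GCC's
actual expansion code.

```c
/* Hypothetical helper, for illustration only: x**n by square-and-multiply.
   A negative exponent is handled by one reciprocal at the end.  */
double
powi_sketch (double x, int n)
{
  /* Take the magnitude as unsigned so that INT_MIN does not overflow.  */
  unsigned int m = n < 0 ? 0u - (unsigned int) n : (unsigned int) n;
  double result = 1.0;
  double base = x;

  while (m != 0)
    {
      if (m & 1)
        result *= base;   /* Multiply in the current power of x.  */
      base *= base;       /* Square for the next exponent bit.  */
      m >>= 1;
    }

  /* This is the division that reappears for negative exponents.  */
  return n < 0 ? 1.0 / result : result;
}
```

With this shape, powi_sketch (x, -2) still costs one division, so rewriting
powi (1.0/x, 2) to powi (x, -2) does not save a division, and adds one when
1.0/x is also used elsewhere.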
Below is the updated patch; I re-validated it with no regressions on aarch64 
and x86_64.
Thanks,
Jenni


This patch adds the following two simplifications in match.pd for
POW_ALL and POWI:
- pow (1.0/x, y) to pow (x, -y), avoiding the division
- pow (0.0, x) to 0.0, avoiding the call to pow.
The patterns are guarded by flag_unsafe_math_optimizations,
!flag_trapping_math, and !HONOR_INFINITIES.
The POW_ALL patterns are also gated under !flag_errno_math.
The second pattern is also guarded by !HONOR_NANS and
!HONOR_SIGNED_ZEROS.

Tests were added to confirm the application of the transform for
builtins pow, powf, powl, powi, powif, powil, and powf16.

The patch was bootstrapped and regtested on aarch64-linux-gnu and
x86_64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

gcc/
        * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
        pow (0.0, x) -> 0.0.

gcc/testsuite/
        * gcc.dg/tree-ssa/pow_fold_1.c: New test.
---
 gcc/match.pd                               | 28 +++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 42 ++++++++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 12d81fcac0d..6d9868d2bb1 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8203,6 +8203,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (rdiv @0 (exps:s @1))
     (mult @0 (exps (negate @1)))))
 
+ (for pow (POW_ALL)
+  (if (! HONOR_INFINITIES (type)
+       && ! flag_trapping_math
+       && ! flag_errno_math)
+   /* Simplify pow(1.0/x, y) into pow(x, -y).  */
+   (simplify
+    (pow (rdiv:s real_onep@0 @1) @2)
+     (pow @1 (negate @2)))
+
+   /* Simplify pow(0.0, x) into 0.0.  */
+   (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type))
+    (simplify
+     (pow real_zerop@0 @1)
+      @0))))
+
  (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type)
       && ! HONOR_NANS (type) && ! HONOR_INFINITIES (type)
       && ! flag_trapping_math
@@ -8561,6 +8576,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (mult (POW:s @0 @1) (POW:s @2 @1))
    (POW (mult @0 @2) @1))
 
+ (if (! HONOR_INFINITIES (type) && ! flag_trapping_math)
+  /* Simplify powi(1.0/x, y) into powi(x, -y).  */
+  (simplify
+   (POWI (rdiv@3 real_onep@0 @1) @2)
+   (if (single_use (@3))
+    (POWI @1 (negate @2))))
+
+  /* Simplify powi(0.0, x) into 0.0.  */
+  (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type))
+   (simplify
+    (POWI real_zerop@0 @1)
+     @0)))
+
  /* Simplify powi(x,y) * powi(z,y) -> powi(x*z,y). */
  (simplify
   (mult (POWI:s @0 @1) (POWI:s @2 @1))
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
new file mode 100644
index 00000000000..d98bcb0827e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized -fexcess-precision=16" } */
+/* { dg-add-options float16 } */
+/* { dg-require-effective-target float16_runtime } */
+/* { dg-require-effective-target c99_runtime } */
+
+extern void link_error (void);
+
+#define POW1OVER(TYPE1, TYPE2, CTY, TY)                        \
+  void                                                 \
+  pow1over_##TY (TYPE1 x, TYPE2 y)                     \
+  {                                                    \
+    TYPE1 t1 = 1.0##CTY / x;                           \
+    TYPE1 t2 = __builtin_pow##TY (t1, y);              \
+    TYPE2 t3 = -y;                                     \
+    TYPE1 t4 = __builtin_pow##TY (x, t3);              \
+    if (t2 != t4)                                      \
+      link_error ();                                   \
+  }                                                    \
+
+#define POW0(TYPE1, TYPE2, CTY, TY)                    \
+  void                                                 \
+  pow0_##TY (TYPE2 x)                                  \
+  {                                                    \
+    TYPE1 t1 = __builtin_pow##TY (0.0##CTY, x);                \
+    if (t1 != 0.0##CTY)                                        \
+      link_error ();                                   \
+  }                                                    \
+
+#define TEST_ALL(TYPE1, TYPE2, CTY, TY)                        \
+  POW1OVER (TYPE1, TYPE2, CTY, TY)                     \
+  POW0 (TYPE1, TYPE2, CTY, TY)
+
+TEST_ALL (double, double, , )
+TEST_ALL (float, float, f, f)
+TEST_ALL (_Float16, _Float16, f16, f16)
+TEST_ALL (long double, long double, L, l)
+TEST_ALL (double, int, , i)
+TEST_ALL (float, int, f, if)
+TEST_ALL (long double, int, L, il)
+
+/* { dg-final { scan-tree-dump-not "link_error" "optimized" } } */
-- 
2.44.0

> 
> Richard.
> 
>> Best,
>> Jennifer
>> 
>>> 
>>>> What are your thoughts on this?
>>>> Thanks,
>>>> Jennifer
>>>> 
>>>> This patch adds the following two simplifications in match.pd for POW
>>>> and POWI:
>>>> - pow (1.0/x, y) to pow (x, -y), avoiding the division
>>>> - pow (0.0, x) to 0.0, avoiding the call to pow.
>>>> The patterns are guarded by flag_unsafe_math_optimizations,
>>>> !flag_trapping_math, !flag_errno_math, and !HONOR_INFINITIES.
>>>> The second pattern is also guarded by !HONOR_NANS and
>>>> !HONOR_SIGNED_ZEROS.
>>>> 
>>>> Tests were added to confirm the application of the transform for
>>>> builtins pow, powf, powl, powi, powif, powil, and powf16.
>>>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu and
>>>> x86_64-linux-gnu, no regression.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>> 
>>>> gcc/
>>>>     * match.pd: Fold pow (1.0/x, y) -> pow (x, -y) and
>>>>     pow (0.0, x) -> 0.0.
>>>> 
>>>> gcc/testsuite/
>>>>     * gcc.dg/tree-ssa/pow_fold_1.c: New test.
>>>> ---
>>>> gcc/match.pd                               | 15 ++++++++
>>>> gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c | 42 ++++++++++++++++++++++
>>>> 2 files changed, 57 insertions(+)
>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
>>>> 
>>>> diff --git a/gcc/match.pd b/gcc/match.pd
>>>> index 12d81fcac0d..b061ef9dc91 100644
>>>> --- a/gcc/match.pd
>>>> +++ b/gcc/match.pd
>>>> @@ -8203,6 +8203,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>>>   (rdiv @0 (exps:s @1))
>>>>    (mult @0 (exps (negate @1)))))
>>>> 
>>>> + (for pow (POW POWI)
>>>> +  (if (! HONOR_INFINITIES (type)
>>>> +       && ! flag_trapping_math
>>>> +       && ! flag_errno_math)
>>>> +   /* Simplify pow(1.0/x, y) into pow(x, -y).  */
>>>> +   (simplify
>>>> +    (pow (rdiv:s real_onep@0 @1) @2)
>>>> +     (pow @1 (negate @2)))
>>>> +
>>>> +   /* Simplify pow(0.0, x) into 0.0.  */
>>>> +   (if (! HONOR_NANS (type) && ! HONOR_SIGNED_ZEROS (type))
>>>> +    (simplify
>>>> +     (pow real_zerop@0 @1)
>>>> +      @0))))
>>>> +
>>>> (if (! HONOR_SIGN_DEPENDENT_ROUNDING (type)
>>>>      && ! HONOR_NANS (type) && ! HONOR_INFINITIES (type)
>>>>      && ! flag_trapping_math
>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
>>>> new file mode 100644
>>>> index 00000000000..c38b7390478
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pow_fold_1.c
>>>> @@ -0,0 +1,42 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O2 -ffast-math -fdump-tree-optimized" } */
>>>> +/* { dg-add-options float16 } */
>>>> +/* { dg-require-effective-target float16_runtime } */
>>>> +/* { dg-require-effective-target c99_runtime } */
>>>> +
>>>> +extern void link_error (void);
>>>> +
>>>> +#define POW1OVER(TYPE1, TYPE2, CTY, TY)                      \
>>>> +  void                                                       \
>>>> +  pow1over_##TY (TYPE1 x, TYPE2 y)                   \
>>>> +  {                                                  \
>>>> +    TYPE1 t1 = 1.0##CTY / x;                         \
>>>> +    TYPE1 t2 = __builtin_pow##TY (t1, y);            \
>>>> +    TYPE2 t3 = -y;                                   \
>>>> +    TYPE1 t4 = __builtin_pow##TY (x, t3);            \
>>>> +    if (t2 != t4)                                    \
>>>> +      link_error ();                                 \
>>>> +  }                                                  \
>>>> +
>>>> +#define POW0(TYPE1, TYPE2, CTY, TY)                  \
>>>> +  void                                                       \
>>>> +  pow0_##TY (TYPE2 x)                                        \
>>>> +  {                                                  \
>>>> +    TYPE1 t1 = __builtin_pow##TY (0.0##CTY, x);              \
>>>> +    if (t1 != 0.0##CTY)                                      \
>>>> +      link_error ();                                 \
>>>> +  }                                                  \
>>>> +
>>>> +#define TEST_ALL(TYPE1, TYPE2, CTY, TY)                      \
>>>> +  POW1OVER (TYPE1, TYPE2, CTY, TY)                   \
>>>> +  POW0 (TYPE1, TYPE2, CTY, TY)
>>>> +
>>>> +TEST_ALL (double, double, , )
>>>> +TEST_ALL (float, float, f, f)
>>>> +TEST_ALL (_Float16, _Float16, f16, f16)
>>>> +TEST_ALL (long double, long double, L, l)
>>>> +TEST_ALL (double, int, , i)
>>>> +TEST_ALL (float, int, f, if)
>>>> +TEST_ALL (long double, int, L, il)
>>>> +
>>>> +/* { dg-final { scan-tree-dump-not "link_error" "optimized" } } */
>>>> 
>>> 
>>> --
>>> Richard Biener <rguent...@suse.de>
>>> SUSE Software Solutions Germany GmbH,
>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>> 
>> 
>> 
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
