On Mon, May 18, 2026 at 2:24 AM Guoce Feng <[email protected]> wrote:

> Hi Andrew,
>
> Thanks for the detailed explanation and for pointing me to PR123113 and
> the related patch series.
>
> I checked the scope of PR123113 more carefully. My understanding is that
> those patches mainly relax existing restrictions in phiopt / cselim /
> factor_out_conditional_operation around merge blocks with more than two
> predecessors. That certainly seems relevant to part of the problem. (Inbox
> <https://inbox.sourceware.org/gcc-bugs/?t=20251215070155&utm_source=chatgpt.com>
> )
>
> However, I do not think those patches fully cover the motivating case I
> had in mind.
>
> The vectorized example you showed has the form
>
> sqrt(expr1[i])
> sqrt(expr2[i])
> sqrt(expr3[i])
>
> where expr1/expr2/expr3 are already available values, so ifcvt can form
>
> select(select(...))
> sqrt(selected_value)
>
> quite naturally.
>
> In my motivating kernel, the operands of sqrt are themselves computed
> inside the individual control-flow paths. In simplified form, the three
> arms look more like
>
> sqrt(complex_expr1)
> sqrt(complex_expr2)
> sqrt(complex_expr3)
>
> where the third arm also contains extra intermediate computations before
> the final squared-distance expression is formed.
>
> Here is a reduced Compiler Explorer example that still does not get
> converted into the desired “blend operands first, then perform one sqrt”
> shape:
>
> https://godbolt.org/z/savsxzaWz
>
>
With my patch set with the above testcase with
-O3  -fno-math-errno  -fno-trapping-math, we get in ifcvt:
```
  _3 = _66 ? _21 : _16;
  _65 = _122 ? _11 : _3;
  distance_24 = __builtin_sqrt (_65);
  total_49 = distance_24 + total_53;
 ```
So it is working like you want it to work. There is only one sqrt in the
vectorized code.

As for the other expressions that could/should be factor out we definitely
need something better than just this.
LLVM has a pass which matching up the expressions and then doing some cost
model.

The factor_* code in ifcvt could be improved to see if it is the same
operation and produce 2/3 phis.
Right now the factor code is only designed to match if one of the operands
of the operation matches because that was easiest at the time of writing it.
We could do this as a "pre"-pass for the ifcvt'able loops too.

Thanks,
Andrea

 So the transformation I am interested in is still closer to:
>
> PHI <sqrt(a), sqrt(b), sqrt(c)>
>     =>
> sqrt(PHI <a, b, c>)
>
> in cases where a/b/c may be branch-local computed expressions, not just
> already-materialized values.
>
> Put differently:
>
> PR123113 improves factoring around multi-pred PHIs,
> but does GCC still lack a stronger common pure-operation
> factoring/sinking transformation for branch-local computed operands?
>
> I would be interested in your thoughts on whether this stronger form is
> considered worthwhile in GCC, and if so, whether you would expect it to
> belong in phiopt, ifcvt, or some other middle-end transformation.
>
> Thanks again for the clarification.
>
> Best regards,
> Guoce Feng
>
> ---- Replied Message ----
> From Andrew Pinski<[email protected]>
> <[email protected]>
> Date 5/18/2026 16:49
> To Guoce Feng<[email protected]> <[email protected]>
> Cc [email protected]<[email protected]>,
> <[email protected]>Richard Biener<[email protected]>,
> <[email protected]>Andrew Pinski<[email protected]>,
> <[email protected]>Jeff Law<[email protected]>
> <[email protected]>
> Subject Re: [RFC] Factoring common pure operations across PHIs / branches
> to reduce duplicated sqrt calls
>
> On Mon, May 18, 2026 at 1:16 AM Guoce Feng via Gcc <[email protected]> wrote:
>
>
>  Hi,
>
>  I am investigating an optimization opportunity in GCC related to
>  factoring common pure operations out of control-flow joins.
>
>  Consider code of this general form:
>
>  if (c1 <= 0.0)
>    distance = sqrt(expr1);
>  else if (c2 <= c1)
>    distance = sqrt(expr2);
>  else
>    distance = sqrt(expr3);
>
>
>  After control-flow simplification / if-conversion, this can effectively
>  lead to a form resembling:
>
>  distance = select(cond1,
>                    sqrt(expr1),
>                    select(cond2, sqrt(expr2), sqrt(expr3)));
>
>
>  For vectorized code, this may become "three vector sqrt operations plus
>  blends", while a more profitable form would be:
>
>  merged_expr = select(cond1,
>                       expr1,
>                       select(cond2, expr2, expr3));
>  distance = sqrt(merged_expr);
>
>
>  That is, transform a conceptual pattern like
>
>  phi(sqrt(a), sqrt(b), sqrt(c))
>
>
>  into
>
>  sqrt(phi(a, b, c))
>
>
>  when the operation is safe to factor out and doing so is profitable.
>
>  I noticed that LLVM has a related transformation in SimplifyCFG under
>  the sink-common-insts machinery. In GCC, I am trying to understand what
>  the most appropriate place would be for a similar optimization.
>
>  Two possible implementation directions seem plausible:
>
>    1.  Extend the existing PHI factoring logic in
>  tree-ssa-phiopt.cc, especially around
>  factor_out_conditional_operation, to also handle suitable
>  side-effect-free builtins / calls such as sqrt under the
>  appropriate semantic restrictions.
>
>
> So part of the problem with the above case at -Ofast is really
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123113 .
> I have a patch set which is mostly approved but I need to double check
> on the order of basic-blocks being looked at.
> The patches are:
> https://inbox.sourceware.org/gcc-patches/[email protected]/
> https://inbox.sourceware.org/gcc-patches/[email protected]/
> https://patchwork.sourceware.org/project/gcc/patch/[email protected]/
>
> In that order. (I can't seen to find the 3rd patch on inbox for some reason).
> With those patches we get:
> ```
>   <bb 2> [local count: 1073741824]:
>   if (c1_2(D) <= 0.0)
>     goto <bb 3>; [41.00%]
>   else
>     goto <bb 4>; [59.00%]
>
>   <bb 3> [local count: 440234144]:
>   distance_7 = __builtin_sqrt (expr1_6(D)); [tail call]
>   goto <bb 7>; [100.00%]
>
>   <bb 4> [local count: 633507680]:
>   if (c1_2(D) >= c2_3(D))
>     goto <bb 5>; [50.00%]
>   else
>     goto <bb 6>; [50.00%]
>
>   <bb 5> [local count: 316753840]:
>
>   <bb 6> [local count: 633507680]:
>   # _8 = PHI <expr3_4(D)(4), expr2_5(D)(5)>
>   distance_10 = __builtin_sqrt (_8); [tail call]
>
>   <bb 7> [local count: 1073741824]:
>   # distance_1 = PHI <distance_10(6), distance_7(3)>
>   return distance_1;
> ```
>
> Extending factor_* to handle the above could be done but I am not sure
> phiopt is the correct place.
> Note there is code in the ifcvt that does a similar thing as factor_*
> in phiopt which then does handle the rest of it and for:
> ```
> double f(double *expr1, double *expr2, double *expr3, double *c1, double *c2)
> {
> double t = 0.0;
> for(int i =0 ;i < 1024; i++)
> {
>   double distance = 0.0;
> if (c1[i] <= 0.0)
>   distance = __builtin_sqrt(expr1[i]);
> else if (c2[i] <= c1[i])
>   distance = __builtin_sqrt(expr2[i]);
> else
>   distance = __builtin_sqrt(expr3[i]);
> t+=distance;
> }
>   return t;
> }
>
> ```
>
> With -mavx512f (since I was too lazy to figure out what expr* was and
> too lazy to test non-x86_64 [since that is what I have built]) we get:
> ```
>   _7 = _34 ? _14 : _12;
>   _63 = _42 ? _6 : _7;
>   distance_17 = __builtin_sqrt (_63);
>   t_25 = distance_17 + t_29;
> ```
> In ifcvt.
>
> Which I assume is exactly what you were expecting.
>
>
>
>    2.  Add a more general "common instruction sinking" transformation to
>  tree-ssa-sink.cc, analogous in spirit to the existing common-store
>  sinking support, but for a restricted class of pure scalar
>  computations.
>
>  My current intuition is that the PHIOPT direction may be a better fit
>  for this specific pattern, because the optimization is fundamentally
>  about factoring a common operation across PHI/merge structure rather
>  than ordinary code sinking. However, I would appreciate guidance from
>  the GCC community on the preferred design.
>
>  I am especially interested in feedback on:
>
>    *   Whether tree-ssa-phiopt.cc is indeed the right place for this form of
>  transformation.
>
>    *   Whether handling builtins such as sqrt is acceptable under the
>  relevant floating-point / errno / exception semantics.
>
>    *   Whether this should be restricted to cases where the operation is
>  known to be side-effect-free and safe to move.
>
>    *   Whether profitability should be guided primarily by code size,
>  vectorization opportunities, or target-independent cost heuristics.
>
>    *   Whether there is existing work or a PR in this area that I should
>  build upon.
>
>
> See above.
>
>
>  A motivating reduced example comes from a geometric distance kernel in
>  which three branch-local sqrt computations could ideally be turned into
>  one sqrt after value merging, improving the shape of the vectorized code.
>
> See above.
>
> Thanks,
> Andrea
>
>
>  If this direction seems reasonable, I am interested in preparing a
>  prototype patch and test cases.
>
>  Thanks,
>  Guoce Feng
>
>

Reply via email to