From: Kyrylo Tkachov <[email protected]>
noce_convert_multiple_sets turns a single-arm IF-THEN-JOIN that writes
several registers into a series of conditional moves. An IF-THEN-ELSE-JOIN
diamond that writes the same registers on both arms was left to the branch,
even when every value is cheap to compute unconditionally. This shows up in
byte-stream decoders, where a small tag-driven diamond sits on the hot path
and the branch is hard to predict.
Reuse the existing multiple-sets machinery for the diamond. The else block
is validated like a then block (it may have a single set) and checked for
cross-arm independence with bbs_ok_for_cmove_arith, exactly as the
single-output diamond path in noce_try_cmove_arith already does. Its insns
are then emitted into their own registers ahead of the conditional moves, so
the move built for each then set selects between the then value and the else
value now held in the destination, rather than the register's incoming value.
A value produced only by the else arm cannot be selected this way and is
rejected. The else insns run ahead of the conditional moves, so if they
change a register the comparison reads the diamond is rejected, and if they
only clobber the condition code the shared compare is dropped so each move
re-materializes its own. The original cost weights both arms by their
probabilities, so the transform only fires when the conditional moves are
cheaper than the branch.
When the two arms load from different addresses this relies on the arm loads
having first been commoned into a single load with a selected address; arms
that still contain a memory load are not speculated. At -O3 -fsplit-paths can
duplicate the loop body before that commoning happens, leaving a load on each
arm and so blocking the conversion; building with -fno-split-paths (see
PR120892) keeps the single load and lets the diamond be converted. This
gives the ~20% gain on Snappy on aarch64.
I've proposed removing -fsplit-paths as a separate patch so that we get
this gain at plain -O3.
On SPEC CPU 2026 (intrate + fprate, aarch64) the diamond form fires 6226 extra
times at -O2 and 7730 at -O3. The single-arm (triangle) form is
unchanged. Each conversion is gated by the conditional-move cost
model, so hopefully only cheap diamonds are converted.
SPEC CPU 2026 -O3, iterations=3, single copy, on a Neoverse V2,
patched vs clean:
intrate +0.379%
fprate +0.157%
overall +0.281%
No benchmark regresses beyond run-to-run noise, the largest
gains are 708.sqlite_r +3.44%, 765.roms_r +1.07% and 722.palm_r +0.85%.
gcc/ChangeLog:
PR tree-optimization/125557
* ifcvt.cc (bb_ok_for_noce_convert_multiple_sets): Add REQUIRE_MULTIPLE
parameter defaulting to true; relax the more-than-one-set requirement
when it is false.
(noce_else_arm_only_shared_liveout): New function.
(noce_convert_multiple_sets): Record the else block; on success delete
it instead of removing the test->join edge for a diamond.
(noce_convert_multiple_sets_1): For a diamond emit the else block's
insns into their registers before the conditional moves, and drop or
reject the reused comparison if they clobber it.
(noce_process_if_block): Run noce_convert_multiple_sets on a diamond
when the else block is suitable, costing both arms.
gcc/testsuite/ChangeLog:
PR tree-optimization/125557
* gcc.target/aarch64/ifcvt_multiple_sets_diamond.c: New test.
* gcc.target/aarch64/ifcvt_multiple_sets_diamond_2.c: New test.
* gcc.c-torture/execute/ifcvt-diamond-1.c: New test.
Signed-off-by: Kyrylo Tkachov <[email protected]>
---
gcc/ifcvt.cc | 134 +++++++++++++++---
.../gcc.c-torture/execute/ifcvt-diamond-1.c | 132 +++++++++++++++++
.../aarch64/ifcvt_multiple_sets_diamond.c | 66 +++++++++
.../aarch64/ifcvt_multiple_sets_diamond_2.c | 48 +++++++
4 files changed, 364 insertions(+), 16 deletions(-)
create mode 100644 gcc/testsuite/gcc.c-torture/execute/ifcvt-diamond-1.c
create mode 100644
gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond.c
create mode 100644
gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond_2.c
diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
index 37e75746193..745ab80f2e9 100644
--- a/gcc/ifcvt.cc
+++ b/gcc/ifcvt.cc
@@ -3778,6 +3778,7 @@ noce_convert_multiple_sets (struct noce_if_info *if_info)
{
basic_block test_bb = if_info->test_bb;
basic_block then_bb = if_info->then_bb;
+ basic_block else_bb = if_info->else_bb;
basic_block join_bb = if_info->join_bb;
rtx_insn *jump = if_info->jump;
rtx_insn *cond_earliest;
@@ -3887,8 +3888,16 @@ noce_convert_multiple_sets (struct noce_if_info *if_info)
emit_insn_before_setloc (seq, if_info->jump,
INSN_LOCATION (insn_info.last ()->unmodified_insn));
- /* Clean up THEN_BB and the edges in and out of it. */
- remove_edge (find_edge (test_bb, join_bb));
+ /* Clean up the THEN (and, for a diamond, ELSE) block and the edges into and
+ out of the if-region. An IF-THEN-ELSE-JOIN has no test->join edge;
+ deleting ELSE_BB removes the test->else and else->join edges instead. */
+ if (else_bb)
+ {
+ delete_basic_block (else_bb);
+ num_true_changes++;
+ }
+ else
+ remove_edge (find_edge (test_bb, join_bb));
remove_edge (find_edge (then_bb, join_bb));
redirect_edge_and_branch_force (single_succ_edge (test_bb), join_bb);
delete_basic_block (then_bb);
@@ -3934,6 +3943,41 @@ noce_convert_multiple_sets_1 (struct noce_if_info
*if_info,
bool second_try = *last_needs_comparison != -1;
*use_cond_earliest = false;
+ /* For an IF-THEN-ELSE-JOIN, emit the else block's computations first, into
+ their original registers. The conditional move built for each then set
+ then selects between the then value and the else value now held in the
+ destination, rather than the register's incoming value. */
+ if (if_info->else_bb)
+ {
+ rtx_insn *else_last = last_active_insn (if_info->else_bb, false);
+ if (!noce_emit_bb (copy_rtx (PATTERN (else_last)), if_info->else_bb,
+ false))
+ {
+ end_sequence ();
+ return false;
+ }
+
+ /* These insns run ahead of the conditional moves. If they change a
+ register the comparison reads we cannot reuse it, so bail. If they
+ only clobber the condition code, drop the shared compare so that every
+ move re-materializes its own. */
+ for (rtx_insn *ei = get_insns (); ei; ei = NEXT_INSN (ei))
+ {
+ if (modified_in_p (cond, ei))
+ {
+ end_sequence ();
+ return false;
+ }
+ if (cc_cmp
+ && (modified_in_p (cc_cmp, ei)
+ || (rev_cc_cmp && modified_in_p (rev_cc_cmp, ei))))
+ {
+ cc_cmp = NULL_RTX;
+ rev_cc_cmp = NULL_RTX;
+ }
+ }
+ }
+
FOR_BB_INSNS (then_bb, insn)
{
/* Skip over non-insns. */
@@ -4242,13 +4286,16 @@ init_noce_multiple_sets_info (basic_block bb,
}
/* Return true iff basic block TEST_BB is suitable for conversion to a
- series of conditional moves. Also check that we have more than one
- set (other routines can handle a single set better than we would),
- and fewer than PARAM_MAX_RTL_IF_CONVERSION_INSNS sets. While going
- through the insns store the sum of their potential costs in COST. */
+ series of conditional moves. Unless REQUIRE_MULTIPLE is false, also check
+ that we have more than one set (other routines can handle a single set
+ better than we would); the secondary else block of a diamond may have a
+ single set. Require fewer than PARAM_MAX_RTL_IF_CONVERSION_INSNS sets.
+ While going through the insns store the sum of their potential costs in
+ COST. */
static bool
-bb_ok_for_noce_convert_multiple_sets (basic_block test_bb, unsigned *cost)
+bb_ok_for_noce_convert_multiple_sets (basic_block test_bb, unsigned *cost,
+ bool require_multiple = true)
{
rtx_insn *insn;
unsigned count = 0;
@@ -4293,11 +4340,13 @@ bb_ok_for_noce_convert_multiple_sets (basic_block
test_bb, unsigned *cost)
*cost += potential_cost;
/* If we would only put out one conditional move, the other strategies
- this pass tries are better optimized and will be more appropriate.
+ this pass tries are better optimized and will be more appropriate, so
+ require more than one set unless REQUIRE_MULTIPLE is false (the else
+ arm of a diamond, which is not a conversion candidate on its own).
Some targets want to strictly limit the number of conditional moves
that are emitted, they set this through PARAM, we need to respect
that. */
- return count > 1 && count <= param;
+ return count >= (require_multiple ? 2u : 1u) && count <= param;
}
/* Compute average of two given costs weighted by relative probabilities
@@ -4310,6 +4359,42 @@ average_cost (unsigned then_cost, unsigned else_cost,
edge e)
return else_cost + e->probability.apply ((signed) (then_cost - else_cost));
}
+/* Helper for the IF-THEN-ELSE-JOIN form of noce_convert_multiple_sets.
+ THEN_BB and ELSE_BB are the two arms of the diamond. Return true if every
+ register ELSE_BB computes that is live out of the diamond is also written by
+ THEN_BB. Such a register gets a conditional move selecting between the two
+ arms; a value produced only by ELSE_BB would survive unconditionally and so
+ cannot be handled here. */
+
+static bool
+noce_else_arm_only_shared_liveout (basic_block then_bb, basic_block else_bb)
+{
+ auto_bitmap then_dests;
+ rtx_insn *insn;
+
+ FOR_BB_INSNS (then_bb, insn)
+ if (active_insn_p (insn))
+ {
+ rtx set = single_set (insn);
+ if (set && REG_P (SET_DEST (set)))
+ bitmap_set_bit (then_dests, REGNO (SET_DEST (set)));
+ }
+
+ bitmap else_live_out = df_get_live_out (else_bb);
+ FOR_BB_INSNS (else_bb, insn)
+ if (active_insn_p (insn))
+ {
+ rtx set = single_set (insn);
+ if (set
+ && REG_P (SET_DEST (set))
+ && bitmap_bit_p (else_live_out, REGNO (SET_DEST (set)))
+ && !bitmap_bit_p (then_dests, REGNO (SET_DEST (set))))
+ return false;
+ }
+
+ return true;
+}
+
/* Given a simple IF-THEN-JOIN or IF-THEN-ELSE-JOIN block, attempt to convert
it without using conditional execution. Return TRUE if we were successful
at converting the block. */
@@ -4345,14 +4430,31 @@ noce_process_if_block (struct noce_if_info *if_info)
If a target re-uses the existing CC comparison we keep track of that
and add the costs before default noce_conversion_profitable_p. */
- unsigned potential_cost = if_info->original_cost;
unsigned old_cost = if_info->original_cost;
- if (!else_bb
- && HAVE_conditional_move
- && bb_ok_for_noce_convert_multiple_sets (then_bb, &potential_cost))
- {
- /* Temporarily set the original costs to what we estimated so
- we can determine if the transformation is worth it. */
+ unsigned ms_then_cost = 0, ms_else_cost = 0;
+ if (HAVE_conditional_move
+ && bb_ok_for_noce_convert_multiple_sets (then_bb, &ms_then_cost)
+ && (!else_bb
+ || (!if_info->then_else_reversed
+ && bb_ok_for_noce_convert_multiple_sets (else_bb, &ms_else_cost,
+ false)
+ && bbs_ok_for_cmove_arith (else_bb, then_bb, NULL_RTX)
+ && noce_else_arm_only_shared_liveout (then_bb, else_bb))))
+ {
+ /* The original code runs the comparison and one arm. Estimate that cost
+ (for a diamond weight the two arms by their probabilities) and let
+ noce_convert_multiple_sets convert only if the conditional moves come
+ out cheaper. */
+ unsigned potential_cost = old_cost + ms_then_cost;
+ if (else_bb)
+ {
+ if (optimize_bb_for_speed_p (test_bb))
+ potential_cost = old_cost + average_cost (ms_then_cost,
ms_else_cost,
+ find_edge (test_bb,
+ then_bb));
+ else
+ potential_cost = old_cost + ms_then_cost + ms_else_cost;
+ }
if_info->original_cost = potential_cost;
if (noce_convert_multiple_sets (if_info))
{
diff --git a/gcc/testsuite/gcc.c-torture/execute/ifcvt-diamond-1.c
b/gcc/testsuite/gcc.c-torture/execute/ifcvt-diamond-1.c
new file mode 100644
index 00000000000..dbd46e5d60d
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/ifcvt-diamond-1.c
@@ -0,0 +1,132 @@
+/* Runtime correctness of if-converted IF-THEN-ELSE-JOIN diamonds with
+ multiple output registers (noce_convert_multiple_sets). */
+
+long g1, g2, g3;
+
+__attribute__((noipa)) void
+diamond2 (long c, long x, long y)
+{
+ long a, b;
+ if (c & 3)
+ {
+ a = x + 1;
+ b = y - 2;
+ }
+ else
+ {
+ a = x * 4;
+ b = y + 9;
+ }
+ g1 = a;
+ g2 = b;
+}
+
+/* Then arm reads an earlier then output (arm-internal dependency). */
+__attribute__((noipa)) void
+diamond3 (long c, long p, long q)
+{
+ long a, b, d;
+ if (c > 0)
+ {
+ a = p ^ q;
+ b = a + 7;
+ d = q * 2;
+ }
+ else
+ {
+ a = p & q;
+ b = q | 1;
+ d = p - 3;
+ }
+ g1 = a;
+ g2 = b;
+ g3 = d;
+}
+
+/* A single output in the else arm; the other register keeps its incoming
+ value on the else path. */
+__attribute__((noipa)) void
+diamond_then2_else1 (long c, long x, long y)
+{
+ long a = x, b = y;
+ if (c < 0)
+ {
+ a = x + 100;
+ b = y + 200;
+ }
+ else
+ a = x - 50;
+ g1 = a;
+ g2 = b;
+}
+
+/* A live-out value produced only by the else arm. */
+__attribute__((noipa)) void
+diamond_else_only (long c, long p, long q)
+{
+ long a = 100, e = 200;
+ if (c & 4)
+ a = p + q;
+ else
+ {
+ a = p - q;
+ e = q * 2;
+ }
+ g1 = a;
+ g2 = e;
+}
+
+/* Swap idiom inside one arm. */
+__attribute__((noipa)) void
+diamond_swap (long c, long p, long q)
+{
+ long a = p, b = q;
+ if (c & 8)
+ {
+ long t = a;
+ a = b;
+ b = t;
+ }
+ else
+ {
+ a = a + b;
+ b = a - b;
+ }
+ g1 = a;
+ g2 = b;
+}
+
+int
+main (void)
+{
+ for (long c = -4; c <= 12; c++)
+ for (long p = -6; p <= 6; p++)
+ for (long q = -6; q <= 6; q++)
+ {
+ diamond2 (c, p, q);
+ if (g1 != ((c & 3) ? p + 1 : p * 4) || g2 != ((c & 3) ? q - 2 : q +
9))
+ __builtin_abort ();
+
+ diamond3 (c, p, q);
+ {
+ long ea = (c > 0) ? (p ^ q) : (p & q);
+ long eb = (c > 0) ? ea + 7 : (q | 1);
+ long ed = (c > 0) ? (q * 2) : (p - 3);
+ if (g1 != ea || g2 != eb || g3 != ed)
+ __builtin_abort ();
+ }
+
+ diamond_then2_else1 (c, p, q);
+ if (g1 != ((c < 0) ? p + 100 : p - 50) || g2 != ((c < 0) ? q + 200 :
q))
+ __builtin_abort ();
+
+ diamond_else_only (c, p, q);
+ if (g1 != ((c & 4) ? p + q : p - q) || g2 != ((c & 4) ? 200 : q * 2))
+ __builtin_abort ();
+
+ diamond_swap (c, p, q);
+ if (g1 != ((c & 8) ? q : p + q) || g2 != ((c & 8) ? p : (p + q) - q))
+ __builtin_abort ();
+ }
+ return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond.c
b/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond.c
new file mode 100644
index 00000000000..71b6f4be113
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond.c
@@ -0,0 +1,66 @@
+/* Test if-conversion of IF-THEN-ELSE-JOIN diamonds with multiple output
+ registers through noce_convert_multiple_sets. */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ce1" } */
+
+void sink2 (long, long);
+
+/* Two outputs, both arms write the same registers. */
+void
+diamond_arith (long c, long x, long y)
+{
+ long a, b;
+ if (c > 7)
+ {
+ a = x + 1;
+ b = y - 2;
+ }
+ else
+ {
+ a = x * 4;
+ b = y + 9;
+ }
+ sink2 (a, b);
+}
+
+/* Two outputs computed from constants on each arm. */
+void
+diamond_const (long c, long x, long y)
+{
+ long a, b;
+ if (c == 3)
+ {
+ a = 5;
+ b = 7;
+ }
+ else
+ {
+ a = 9;
+ b = 11;
+ }
+ sink2 (a, b);
+}
+
+/* Two outputs in the then arm, a single output in the else arm; the second
+ register keeps its incoming value on the else path. */
+void
+diamond_then2_else1 (long c, long x, long y)
+{
+ long a = x, b = y;
+ if (c < 0)
+ {
+ a = x + 100;
+ b = y + 200;
+ }
+ else
+ a = x - 50;
+ sink2 (a, b);
+}
+
+/* { dg-final { scan-rtl-dump-times "if-conversion succeeded through
noce_convert_multiple_sets" 3 "ce1" } } */
+
+/* The converted diamonds are branchless: no conditional branch remains. */
+/* { dg-final { scan-assembler-not "\tb\\.\[a-z\]+\t" } } */
+/* { dg-final { scan-assembler-not "\tcbn?z\t" } } */
+/* { dg-final { scan-assembler-not "\ttbn?z\t" } } */
+/* { dg-final { scan-assembler "\tcsel\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond_2.c
b/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond_2.c
new file mode 100644
index 00000000000..12a041fd036
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ifcvt_multiple_sets_diamond_2.c
@@ -0,0 +1,48 @@
+/* Each arm of the diamond loads from a selected address and advances a
+ pointer by a selected amount. Once the two arm loads are commoned the
+ diamond writes two registers on both arms (the load result and the
+ advance), which noce_convert_multiple_sets turns into conditional moves. */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-ce1" } */
+
+#include <stddef.h>
+#include <stdint.h>
+
+extern const int16_t lentab[256];
+
+static inline uint32_t
+extract (uint32_t val, size_t type)
+{
+ const uint64_t masks = 0x0000FFFF00FF0000ull;
+ return val & (uint32_t) ((masks >> (type * 16)) & 0xFFFF);
+}
+
+ptrdiff_t
+f (const uint8_t *ip, size_t tag, const uint8_t *end, ptrdiff_t op)
+{
+ do
+ {
+ const uint8_t *old_ip = ip;
+ ptrdiff_t lmo = lentab[tag];
+ size_t type = tag & 3;
+ if (type == 0)
+ {
+ size_t n = (tag >> 2) + 1;
+ tag = ip[n];
+ ip += n + 1;
+ }
+ else
+ {
+ tag = ip[type];
+ ip += type + 1;
+ }
+ uint32_t next = (uint32_t) old_ip[0] | ((uint32_t) old_ip[1] << 8);
+ ptrdiff_t extracted = extract (next, type);
+ op += lmo - extracted;
+ }
+ while (ip < end);
+ return op;
+}
+
+/* { dg-final { scan-rtl-dump "if-conversion succeeded through
noce_convert_multiple_sets" "ce1" } } */
+/* { dg-final { scan-assembler-times "\tcsinc\t" 2 } } */
--
2.50.1 (Apple Git-155)