AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit

This patch adds a new AArch64 RTL pass that narrows 64-bit general-purpose
register operations to 32-bit W-register operations when the upper 32 bits
of the register are known to be zero.
This is beneficial for the Olympus core, which prefers 32-bit W-registers
over 64-bit X-registers where possible. This is recommended by the updated
Olympus Software Optimization Guide, which will be published soon.
The pass can be enabled with -mnarrow-gp-writes and, when enabled, runs at
-O2 and above. It is off by default except for -mcpu=olympus.
---
In AArch64, each 64-bit X register has a corresponding 32-bit W register
that maps to its lower half. When we can guarantee that the upper 32 bits
are never used, we can safely narrow operations to use W registers instead.
For example, this code:
uint64_t foo(uint64_t a) {
  return (a & 255) + 3;
}
Currently compiles to:
and x8, x0, #0xff
add x0, x8, #3
But with this pass enabled, it optimizes to:
and x8, x0, #0xff
add w0, w8, #3 // Using W register instead of X
---
The pass operates in two phases:
1) Analysis Phase:
- Using RTL-SSA, iterates through extended basic blocks (EBBs)
- Computes nonzero bit masks for each register definition
- Recursively processes PHI nodes
- Identifies candidates for narrowing
2) Transformation Phase:
- Applies narrowing to validated candidates
- Converts DImode operations to SImode where safe
The pass runs late in the RTL pipeline, after register allocation, to ensure
stable def-use chains and avoid interfering with earlier optimizations.
---
nonzero_bits(src, DImode) is a function defined in rtlanal.cc that recursively
analyzes RTL expressions to compute a bitmask. However, nonzero_bits has a
limitation: when it encounters a register, it conservatively returns the mode
mask (all bits potentially set). Since this pass analyzes all defs in an
instruction, this information can be used to refine the mask. The pass maintains
a hash map of computed bit masks and installs a custom RTL hooks callback
to consult this mask when encountering a register.
---
PHI nodes require special handling to merge masks from all inputs. This is
done by combine_mask_from_phi, which handles three cases:
1. Input edge has a definition: This is the simplest case. For each such
input edge to the PHI, the def information is retrieved and its mask is
looked up.
2. Input edge has no definition: A conservative mask (all bits potentially
set) is assumed for that input.
3. Input edge is a PHI: combine_mask_from_phi is called recursively to
merge the masks of all incoming values.
---
When processing regular instructions, the pass handles two shapes: single
SET patterns, and PARALLEL patterns that combine a COMPARE with a SET.
Single SET instructions:
If the upper 32 bits of the source are known to be zero, then the instruction
qualifies for narrowing. Instead of just using lowpart_subreg for the source,
we define narrow_dimode_src to attempt further optimizations:
- Bitwise operations (AND/OR/XOR/ASHIFT): simplified via simplify_gen_binary
- IF_THEN_ELSE: simplified via simplify_gen_ternary
PARALLEL Instructions (Compare + SET):
The pass handles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
SET source equals the first operand of the COMPARE. Depending on the
condition-code mode of the compare, the pass checks that the required bits
are known to be zero:
- CC_Zmode/CC_NZmode: Upper 32 bits
- CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
If the instruction does not match the above patterns (or matches but cannot
be optimized), the pass still analyzes all of its definitions, so that every
definition has an entry in nzero_map.
---
When transforming the qualified instructions, the pass uses rtl_ssa::recog and
rtl_ssa::change_is_worthwhile to verify the new pattern and determine if the
transformation is worthwhile.
---
As an additional benefit, testing on Neoverse-V2 shows that instances of
'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
instructions after this pass narrows them.
---
The patch was bootstrapped and regtested on aarch64-linux-gnu with no
regressions.
OK for mainline?
Co-authored-by: Kyrylo Tkachov <[email protected]>
Signed-off-by: Soumya AR <[email protected]>
gcc/ChangeLog:
* config.gcc: Add aarch64-narrow-gp-writes.o.
* config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
pass_narrow_gp_writes before pass_cleanup_barriers.
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
* config/aarch64/tuning_models/olympus.h:
Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
* config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
* config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
* config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
* doc/invoke.texi: Document -mnarrow-gp-writes.
* config/aarch64/aarch64-narrow-gp-writes.cc: New file.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/narrow-gp-writes-1.c: New test.
* gcc.target/aarch64/narrow-gp-writes-2.c: New test.
* gcc.target/aarch64/narrow-gp-writes-3.c: New test.
* gcc.target/aarch64/narrow-gp-writes-4.c: New test.
* gcc.target/aarch64/narrow-gp-writes-5.c: New test.
* gcc.target/aarch64/narrow-gp-writes-6.c: New test.
* gcc.target/aarch64/narrow-gp-writes-7.c: New test.
