Forgot to add [PATCH] in the subject and didn't want the mail to
accidentally miss people's inboxes.

Thanks,
Soumya

> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
> 
> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
> 
> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> general purpose register operations to use 32-bit W-registers when the
> upper 32 bits of the register are known to be zero.
> 
> This is beneficial for the Olympus core, which prefers 32-bit W-registers
> over 64-bit X-registers where possible. This is recommended by the updated
> Olympus Software Optimization Guide, which will be published soon.
> 
> The pass can be controlled with -mnarrow-gp-writes and runs at -O2 and
> above. It is not enabled by default, except for -mcpu=olympus.
> 
> ---
> 
> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> that maps to its lower half.  When we can guarantee that the upper 32 bits
> are never used, we can safely narrow operations to use W registers instead.
> 
> For example, this code:
>    uint64_t foo(uint64_t a) {
>        return (a & 255) + 3;
>    }
> 
> Currently compiles to:
>    and x8, x0, #0xff
>    add x0, x8, #3
> 
> But with this pass enabled, it optimizes to:
>    and x8, x0, #0xff
>    add w0, w8, #3      // Using W register instead of X
> 
> ---
> 
> The pass operates in two phases:
> 
> 1) Analysis Phase:
>   - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>   - Computes nonzero bit masks for each register definition
>   - Recursively processes PHI nodes
>   - Identifies candidates for narrowing
> 2) Transformation Phase:
>   - Applies narrowing to validated candidates
>   - Converts DImode operations to SImode where safe
> 
> The pass runs late in the RTL pipeline, after register allocation, to ensure
> stable def-use chains and avoid interfering with earlier optimizations.
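> 
> In isolation, the two-phase shape looks roughly like the standalone sketch
> below (illustrative C++ with made-up Insn/Def placeholder types and helper
> names; the real pass works on RTL-SSA insns and defs):
> 
>    #include <cstdint>
>    #include <unordered_map>
>    #include <vector>
> 
>    struct Def { int id; };                    // stand-in for an RTL-SSA def
>    struct Insn { std::vector<Def *> defs; };  // stand-in for an instruction
> 
>    // nzero_map: per-definition mask of the bits that may be nonzero.
>    static std::unordered_map<int, uint64_t> nzero_map;
> 
>    // Hypothetical stand-ins for the real analysis and rewriting, which the
>    // pass does with RTL-level analysis and RTL-SSA changes.
>    static uint64_t compute_nonzero_mask (Insn *) { return ~0ULL; }
>    static void narrow_to_simode (Insn *) {}
> 
>    static bool upper32_zero (uint64_t mask)
>    {
>      return (mask & 0xffffffff00000000ULL) == 0;
>    }
> 
>    static void narrow_gp_writes (std::vector<Insn *> &insns)
>    {
>      std::vector<Insn *> candidates;
> 
>      // Phase 1: analysis.  Record a mask for every def and collect insns
>      // whose DImode result provably has its upper 32 bits clear.
>      for (Insn *insn : insns)
>        {
>          uint64_t mask = compute_nonzero_mask (insn);
>          for (Def *def : insn->defs)
>            nzero_map[def->id] = mask;
>          if (upper32_zero (mask))
>            candidates.push_back (insn);
>        }
> 
>      // Phase 2: transformation.  Only validated candidates are touched.
>      for (Insn *insn : candidates)
>        narrow_to_simode (insn);
>    }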
> 
> ---
> 
> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that
> recursively analyzes RTL expressions to compute a mask of the bits that may
> be nonzero. However, nonzero_bits has a limitation: when it encounters a
> register, it conservatively returns the mode mask (all bits potentially
> set). Since this pass analyzes all the defs in each instruction it visits,
> that information can be used to refine the mask. The pass maintains a hash
> map of computed bit masks and installs a custom RTL hooks callback to
> consult this map when nonzero_bits encounters a register.
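> 
> The refinement itself is small. Roughly, in standalone form (illustrative
> names; the real lookup is keyed by the RTL-SSA definition and wired in
> through the RTL hooks callback mentioned above):
> 
>    #include <cstdint>
>    #include <unordered_map>
> 
>    // Masks already computed for earlier definitions, keyed by a
>    // hypothetical register/def id.
>    static std::unordered_map<int, uint64_t> nzero_map;
> 
>    // When nonzero_bits reaches a register, consult the map instead of
>    // giving up; otherwise fall back to the conservative mode mask.
>    static uint64_t reg_nonzero_mask (int reg_id, uint64_t mode_mask)
>    {
>      auto it = nzero_map.find (reg_id);
>      if (it != nzero_map.end ())
>        return it->second & mode_mask;
>      return mode_mask;   // all bits potentially set
>    }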
> 
> ---
> 
> PHI nodes require special handling to merge masks from all inputs. This is
> done by combine_mask_from_phi, which tackles three cases (sketched below):
>   1. Input Edge has a Definition: This is the simplest case. For each input
>   edge to the PHI, the def information is retrieved and its mask is looked up.
>   2. Input Edge has no Definition: A conservative mask is assumed for that
>   input.
>   3. Input Edge is a PHI: Recursively call combine_mask_from_phi to
>   merge the masks of all incoming values.
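> 
> In standalone form, the shape of combine_mask_from_phi is roughly the
> following (illustrative types and names, not the patch's code; note that a
> real implementation also has to guard against PHI cycles, omitted here):
> 
>    #include <cstdint>
>    #include <vector>
> 
>    struct Phi;
> 
>    // A PHI input either has a defining insn with a cached mask, is itself
>    // another PHI, or has no visible definition at all.
>    struct PhiInput
>    {
>      bool has_def;
>      uint64_t def_mask;   // valid when has_def is true
>      Phi *nested_phi;     // non-null when the input is itself a PHI
>    };
> 
>    struct Phi { std::vector<PhiInput> inputs; };
> 
>    static const uint64_t conservative_mask = ~0ULL;  // all bits possibly set
> 
>    static uint64_t combine_mask_from_phi (const Phi *phi)
>    {
>      uint64_t mask = 0;
>      for (const PhiInput &in : phi->inputs)
>        {
>          if (in.has_def)
>            mask |= in.def_mask;                            // case 1
>          else if (in.nested_phi)
>            mask |= combine_mask_from_phi (in.nested_phi);  // case 3
>          else
>            mask |= conservative_mask;                      // case 2
>        }
>      return mask;
>    }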
> 
> ---
> 
> When processing regular instructions, the pass handles two shapes: single
> SET patterns and PARALLEL patterns that pair a SET with a compare.
> 
> Single SET instructions:
> 
> If the upper 32 bits of the source are known to be zero, then the instruction
> qualifies for narrowing (see the sketch below). Instead of just using
> lowpart_subreg for the source, we define narrow_dimode_src to attempt further
> optimizations:
> 
> - Bitwise and shift operations (AND/OR/XOR/ASHIFT): simplified via
>   simplify_gen_binary
> - IF_THEN_ELSE: simplified via simplify_gen_ternary
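> 
> The qualification test itself, and the reason the rewrite is safe, can be
> stated in isolation (standalone sketch, not the patch's code):
> 
>    #include <cstdint>
> 
>    // A DImode SET qualifies when the nonzero-bit mask of its source has no
>    // bits above bit 31.  The rewrite is then safe because the low 32 bits
>    // of AND/OR/XOR/shift/add results depend only on the low 32 bits of the
>    // operands, and an AArch64 write to a W register zero-extends into the
>    // full X register, so the narrowed form reproduces the original value.
>    static bool qualifies_for_narrowing (uint64_t src_nonzero_mask)
>    {
>      return (src_nonzero_mask & 0xffffffff00000000ULL) == 0;
>    }
> 
> In the foo () example above, the nonzero bits of (a & 255) + 3 are confined
> to the low bits, so the add qualifies and can write w0 instead of x0.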
> 
> PARALLEL Instructions (Compare + SET): 
> 
> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
> SET source equals the first operand of the COMPARE. Depending on the
> condition code for the compare, the pass checks that the required bits are
> zero (sketched below):
> 
> - CC_Zmode/CC_NZmode: Upper 32 bits
> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
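> 
> In standalone form, the required-bits check looks roughly like this
> (illustrative names; the real code inspects the CC mode of the COMPARE):
> 
>    #include <cstdint>
> 
>    enum class cc_kind { CC_Z, CC_NZ, CC_NZV };
> 
>    // Bits of the compared operand that must be provably zero before the
>    // flag-setting instruction can be narrowed to its W-register form.
>    static uint64_t required_zero_bits (cc_kind cc)
>    {
>      const uint64_t upper32 = 0xffffffff00000000ULL;
>      switch (cc)
>        {
>        case cc_kind::CC_Z:
>        case cc_kind::CC_NZ:
>          return upper32;                    // upper 32 bits
>        case cc_kind::CC_NZV:
>          return upper32 | (1ULL << 31);     // upper 32 bits plus bit 31
>        }
>      return ~0ULL;                          // anything else: be conservative
>    }
> 
>    static bool can_narrow_flag_setter (cc_kind cc, uint64_t nonzero_mask)
>    {
>      return (nonzero_mask & required_zero_bits (cc)) == 0;
>    }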
> 
> If the instruction does not match the above patterns (or matches but cannot
> be optimized), the pass still analyzes all its definitions so that nzero_map
> stays complete, i.e. every definition has an entry in it.
> 
> ---
> 
> When transforming the qualified instructions, the pass uses rtl_ssa::recog and
> rtl_ssa::change_is_worthwhile to verify the new pattern and determine if the
> transformation is worthwhile.
> 
> ---
> 
> As an additional benefit, testing on Neoverse-V2 shows that instances of
> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
> instructions after this pass narrows them.
> 
> ---
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
> regressions.
> OK for mainline?
> 
> Co-authored-by: Kyrylo Tkachov <[email protected]>
> Signed-off-by: Soumya AR <[email protected]>
> 
> gcc/ChangeLog:
> 
> * config.gcc: Add aarch64-narrow-gp-writes.o.
> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
> pass_narrow_gp_writes before pass_cleanup_barriers.
> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
> * config/aarch64/tuning_models/olympus.h:
> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
> * doc/invoke.texi: Document -mnarrow-gp-writes.
> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> 
> 
> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
