On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>
> Ping.
>
> I split the files from the previous mail so it's hopefully easier to review.

I can review this, but approval will have to wait until stage1. The pass is
too risky at this point in the release cycle.

I also wonder how much of this can or should be done at the gimple level in a
generic way, and whether there is a way to carry the zero-bits information
from the gimple level down to the RTL level so that we don't need to keep
recomputing it (this would be useful for other passes too).

Thanks,
Andrew Pinski

>
> Also CC'ing Alex Coplan to this thread.
>
> Thanks,
> Soumya
>
> > On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
> >
> > Hi Tamar,
> >
> > Attaching an updated version of this patch that enables the pass at -O2 and
> > above on aarch64, and can be optionally disabled with -mno-narrow-gp-writes.
> >
> > Enabling it by default at -O2 touched quite a large number of tests, which I
> > have updated in this patch.
> >
> > Most of the updates are straightforward and involve changing x registers to
> > (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> >
> > There are some tests (e.g. aarch64/int_mov_immediate_1.c) where the
> > representation of the immediate changes:
> >
> >   mov w0, 4294927974 -> mov w0, -39322
> >
> > This happens when the following RTL is narrowed to SI:
> >
> >   (set (reg/i:DI 0 x0)
> >        (const_int 4294927974 [0xffff6666]))
> >
> > The MSB moves to bit 31, which is set, so the output is printed as signed.
> >
> > Thanks,
> > Soumya
> >
> > > On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
> > >
> > > Ping.
> > >
> > > Thanks,
> > > Soumya
> > >
> > >> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
> > >>
> > >> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
> > >>
> > >> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> > >> general purpose register operations to use 32-bit W-registers when the
> > >> upper 32 bits of the register are known to be zero.
> > >>
> > >> This is beneficial for the Olympus core, which benefits from using 32-bit
> > >> W-registers over 64-bit X-registers if possible. This is recommended by the
> > >> updated Olympus Software Optimization Guide, which will be published soon.
> > >>
> > >> This pass can be controlled with -mnarrow-gp-writes and is active at -O2 and
> > >> above, but not enabled by default, except for -mcpu=olympus.
> > >>
> > >> ---
> > >>
> > >> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> > >> that maps to its lower half. When we can guarantee that the upper 32 bits
> > >> are never used, we can safely narrow operations to use W registers instead.
> > >>
> > >> For example, this code:
> > >>
> > >>   uint64_t foo(uint64_t a) {
> > >>     return (a & 255) + 3;
> > >>   }
> > >>
> > >> Currently compiles to:
> > >>
> > >>   and x8, x0, #0xff
> > >>   add x0, x8, #3
> > >>
> > >> But with this pass enabled, it optimizes to:
> > >>
> > >>   and x8, x0, #0xff
> > >>   add w0, w8, #3    // Using W register instead of X
> > >>
> > >> ---
> > >>
> > >> The pass operates in two phases:
> > >>
> > >> 1) Analysis Phase:
> > >>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> > >>    - Computes nonzero bit masks for each register definition
> > >>    - Recursively processes PHI nodes
> > >>    - Identifies candidates for narrowing
> > >> 2) Transformation Phase:
> > >>    - Applies narrowing to validated candidates
> > >>    - Converts DImode operations to SImode where safe
> > >>
> > >> The pass runs late in the RTL pipeline, after register allocation, to ensure
> > >> stable def-use chains and avoid interfering with earlier optimizations.
> > >>
> > >> ---
> > >>
> > >> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that
> > >> recursively analyzes RTL expressions to compute a bitmask. However,
> > >> nonzero_bits has a limitation: when it encounters a register, it
> > >> conservatively returns the mode mask (all bits potentially set). Since this
> > >> pass analyzes all defs in an instruction, this information can be used to
> > >> refine the mask. The pass maintains a hash map of computed bit masks and
> > >> installs a custom RTL hooks callback to consult this mask when encountering
> > >> a register.
> > >>
> > >> ---
> > >>
> > >> PHI nodes require special handling to merge masks from all inputs. This is
> > >> done by combine_mask_from_phi. Three cases are handled here:
> > >>
> > >> 1. Input Edge has a Definition: This is the simplest case. For each input
> > >>    edge to the PHI, the def information is retrieved and its mask is looked
> > >>    up.
> > >> 2. Input Edge has no Definition: A conservative mask is assumed for that
> > >>    input.
> > >> 3. Input Edge is a PHI: Recursively call combine_mask_from_phi to
> > >>    merge the masks of all incoming values.
> > >>
> > >> ---
> > >>
> > >> When processing regular instructions, the pass first tackles SET and
> > >> PARALLEL patterns with compare instructions.
> > >>
> > >> Single SET instructions:
> > >>
> > >> If the upper 32 bits of the source are known to be zero, then the
> > >> instruction qualifies for narrowing. Instead of just using lowpart_subreg
> > >> for the source, we define narrow_dimode_src to attempt further
> > >> optimizations:
> > >>
> > >> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
> > >>   simplify_gen_binary
> > >> - IF_THEN_ELSE: simplified via simplify_gen_ternary
> > >>
> > >> PARALLEL Instructions (Compare + SET):
> > >>
> > >> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
> > >> SET source equals the first operand of the COMPARE. Depending on the
> > >> condition code mode of the compare, the pass checks that the required bits
> > >> are zero:
> > >>
> > >> - CC_Zmode/CC_NZmode: Upper 32 bits
> > >> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
> > >>
> > >> If the instruction does not match the above patterns (or matches but cannot
> > >> be optimized), the pass still analyzes all its definitions, so that every
> > >> definition has an entry in nzero_map.
> > >>
> > >> ---
> > >>
> > >> When transforming the qualified instructions, the pass uses rtl_ssa::recog
> > >> and rtl_ssa::change_is_worthwhile to verify the new pattern and determine
> > >> if the transformation is worthwhile.
> > >>
> > >> ---
> > >>
> > >> As an additional benefit, testing on Neoverse-V2 shows that instances of
> > >> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
> > >> instructions after this pass narrows them.
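The 'and x1, x2, #0xffffffff' to 'mov w1, w2' observation, like the foo() example earlier in the description, rests on the architectural rule that a write to a W register zeroes the upper 32 bits of the corresponding X register. A minimal C sketch of that equivalence follows. The function names are hypothetical and only model the instructions; nothing here is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models 'and x1, x2, #0xffffffff': a 64-bit AND that keeps the low half. */
uint64_t and_low32(uint64_t x2)
{
    return x2 & UINT64_C(0xffffffff);
}

/* Models 'mov w1, w2': a 32-bit move whose result is zero-extended into x1. */
uint64_t mov_w(uint64_t x2)
{
    return (uint64_t)(uint32_t)x2;
}

/* Models 'add x0, x8, #3' on the masked value from the foo() example. */
uint64_t add_x(uint64_t a)
{
    return (a & 255) + 3;
}

/* Models 'add w0, w8, #3': 32-bit add, result zero-extended to 64 bits. */
uint64_t add_w(uint64_t a)
{
    return (uint64_t)((uint32_t)(a & 255) + 3u);
}

/* Hypothetical self-check: the narrow forms agree with the wide forms. */
void check_narrowing_equivalence(void)
{
    const uint64_t samples[] = {
        0, 1, 255, UINT64_C(0xffffffff),
        UINT64_C(0x123456789abcdef0), UINT64_MAX
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(and_low32(samples[i]) == mov_w(samples[i]));
        assert(add_x(samples[i]) == add_w(samples[i]));
    }
}
```

The add case is only equivalent because the sum cannot carry out of bit 31, which is the kind of fact the nonzero-bits analysis is there to establish.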
> > >>
> > >> ---
> > >>
> > >> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
> > >> regressions.
> > >> OK for mainline?
> > >>
> > >> Co-authored-by: Kyrylo Tkachov <[email protected]>
> > >> Signed-off-by: Soumya AR <[email protected]>
> > >>
> > >> gcc/ChangeLog:
> > >>
> > >>   * config.gcc: Add aarch64-narrow-gp-writes.o.
> > >>   * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
> > >>   pass_narrow_gp_writes before pass_cleanup_barriers.
> > >>   * config/aarch64/aarch64-tuning-flags.def
> > >>   (AARCH64_EXTRA_TUNING_OPTION):
> > >>   Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
> > >>   * config/aarch64/tuning_models/olympus.h:
> > >>   Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
> > >>   * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes):
> > >>   Declare.
> > >>   * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
> > >>   * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
> > >>   * doc/invoke.texi: Document -mnarrow-gp-writes.
> > >>   * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> > >>
> > >> gcc/testsuite/ChangeLog:
> > >>
> > >>   * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
> > >>   * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> > >>
> > >> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
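For readers following the analysis phase in the quoted description, here is a rough C model of nonzero-bit masks, the PHI merge, and the candidate test. It is a sketch under stated assumptions, not the patch's code. All names (nz_and, nz_add, nz_phi, upper32_known_zero) are hypothetical, the ADD rule is a simple conservative bound rather than what nonzero_bits in rtlanal.cc computes, and a nested PHI input is assumed to have been merged already by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define ALL_BITS UINT64_MAX   /* conservative mask: every bit may be set */

/* Set every bit at or below the highest set bit of v. */
uint64_t smear_down(uint64_t v)
{
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16; v |= v >> 32;
    return v;
}

/* AND: a result bit can be set only if it may be set in both operands. */
uint64_t nz_and(uint64_t m0, uint64_t m1)
{
    return m0 & m1;
}

/* ADD: each operand is numerically <= its mask, so the sum is <= m0 + m1. */
uint64_t nz_add(uint64_t m0, uint64_t m1)
{
    uint64_t bound = m0 + m1;
    if (bound < m0)               /* the bound itself wrapped: give up */
        return ALL_BITS;
    return smear_down(bound);
}

/* PHI: a bit may be set if any input may set it.  has_def[i] == 0 models
   an input edge with no definition, which gets the conservative mask. */
uint64_t nz_phi(const uint64_t *masks, const int *has_def, int n)
{
    uint64_t m = 0;
    for (int i = 0; i < n; i++)
        m |= has_def[i] ? masks[i] : ALL_BITS;
    return m;
}

/* A DImode definition is a narrowing candidate when bits 63:32 are zero. */
int upper32_known_zero(uint64_t mask)
{
    return (mask >> 32) == 0;
}

/* Walks the foo() example: (a & 255) + 3 with an unknown incoming a. */
void check_foo_example(void)
{
    uint64_t a = ALL_BITS;          /* incoming argument: nothing known */
    uint64_t t = nz_and(a, 255);    /* and x8, x0, #0xff */
    uint64_t r = nz_add(t, 3);      /* add x0, x8, #3    */
    assert(r == 0x1ff);
    assert(upper32_known_zero(r));
}
```

Merging PHI inputs with a bitwise OR keeps the result sound: a bit is treated as known zero only when every incoming value agrees that it is zero.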
