https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125148
--- Comment #12 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:a6ee91793b9f4d28ccd3fcc6f607f646d305a39e

commit r17-835-ga6ee91793b9f4d28ccd3fcc6f607f646d305a39e
Author: Tamar Christina <[email protected]>
Date:   Wed May 27 10:50:05 2026 +0100

    AArch64: fix the SVE->SIMD lowering optimization [PR125148]

    The optimization added in g:210d06502f22964c7214586c54f8eb54a6965bfd has an
    implementation bug which makes it generate bogus code.

    The optimization was supposed to convert SVE loads with a known predicate
    into Adv. SIMD loads without the predicate.

    The current implementation is done at expansion time, where the predicate
    is still clearly available. It does this by rewriting the load to an
    Adv. SIMD load and then taking a paradoxical subreg of the result into an
    SVE vector, i.e.

      (subreg:VNx16QI (reg:QI 111) 0)

    for a byte load with a VL1 predicate.

    The issue is that the SVE loads were UNSPECs before, so they didn't get
    optimized by passes like forwprop and cse. Adv. SIMD loads do. As such, in
    cases where you have a pattern such as:

      char p[] = {1,2,3,3};
      load (p, VL1)

    we used to generate

      mov     w0, 1
      strb    w0, [x19]
      ptrue   p7.b, vl1
      ld1b    z30.b, p7/z, [x19]

    which was dumb, but valid. With the above optimization the load is now
    eliminated and the constants are folded.

    However, in particular for scalars, AArch64 has an optimization that has
    been around for ages in which scalar FPR constants are created using vector
    broadcasting operations. It assumes scalars are accessed as scalars (as in,
    in the mode that created them). So the above gets optimized to

      movi    v30.8b, 0x1

    which is invalid. The original load requires the inactive elements to be
    zero, whereas the paradoxical subreg relies on the implicit (as in, not
    modelled in RTL) assumption that the load zeros the top bits, and doesn't
    keep in mind that the load can be optimized away.

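The mismatch described above can be sketched with a small portable C model. This is only an illustration, not code from the patch: the 16-byte register size and the helper names (model_ld1b_zeroing, model_movi_8b, model_regs_equal) are assumptions made for the example, and the model only mirrors the lane contents of the two instruction sequences.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for illustration: model a 128-bit SVE register as 16 bytes. */
#define VEC_BYTES 16

/* Models "ld1b z.b, p/z, [mem]" with a VLn predicate: the first `active`
   bytes come from memory, every inactive byte is zeroed.  */
static void model_ld1b_zeroing(uint8_t dst[VEC_BYTES], const uint8_t *mem,
                               size_t active)
{
    assert(active <= VEC_BYTES);
    memset(dst, 0, VEC_BYTES);
    memcpy(dst, mem, active);
}

/* Models "movi v.8b, imm": the immediate is broadcast to the low eight
   byte lanes and the rest of the register is cleared.  */
static void model_movi_8b(uint8_t dst[VEC_BYTES], uint8_t imm)
{
    memset(dst, 0, VEC_BYTES);
    memset(dst, imm, 8);
}

/* Returns 1 when both modelled registers hold identical bytes.  */
static int model_regs_equal(const uint8_t a[VEC_BYTES],
                            const uint8_t b[VEC_BYTES])
{
    return memcmp(a, b, VEC_BYTES) == 0;
}
```

Calling model_ld1b_zeroing with p = {1,2,3,3} and one active lane leaves only lane 0 set, while model_movi_8b with the folded constant 1 sets lanes 0 to 7, so the two modelled registers differ from lane 1 onwards.
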
    This patch fixes it by creating a full SVE vector of 0s and writing only
    the values we want to set using an INSR (i.e. using VL2 of bytes writes a
    short). It then provides patterns to optimize this:

    1. If it's still following a load, just emit the load.
    2. If it's not, then optimize it to a zeroing operation.

    So e.g. HI mode issues an

      fmov    h0, h0

    and so clears the top bits to zero.

    I chose this representation because even without the above optimizations
    it is semantically valid and will generate correct code.

    The alternative would be to delay this optimization to e.g. combine;
    however, we have two problems there:

    1. It's quite late, so the above constant cases, for instance, don't get
       optimized and we keep the pointless stores and loads.
    2. Our RTX costs don't model predicates, and so combine may not accept the
       combination since the replacement is more expensive.

    So I chose to keep the optimization early, but just replace the
    paradoxical subreg with a zero-extend.

    gcc/ChangeLog:

            PR target/125148
            * config/aarch64/aarch64-sve.md
            (*aarch64_vec_shl_insert_into_zero_<mode>,
            *aarch64_vec_shl_insert_into_zero_vnx16qi,
            *aarch64_vec_shl_insert_from_load_<mode>): New.
            * config/aarch64/aarch64.cc (aarch64_emit_load_store_through_mode):
            Replace paradoxical subreg with zero-extend.

    gcc/testsuite/ChangeLog:

            PR target/125148
            * gcc.target/aarch64/sve/highway_run.c: New test.
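
As a rough illustration of why the zero-extend representation stays correct even after the load is folded away, the sketch below contrasts it with the paradoxical-subreg view in plain C. It is a model under assumptions, not the patch itself: the 16-byte register and the helper names (model_paradoxical_view, model_insert_into_zero) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for illustration: model a 128-bit SVE register as 16 bytes. */
#define VEC_BYTES 16

/* Paradoxical-subreg view: only the low bytes are meaningful in RTL, so
   whatever the source register holds above them leaks into the result.  */
static void model_paradoxical_view(uint8_t dst[VEC_BYTES],
                                   const uint8_t src_reg[VEC_BYTES])
{
    memcpy(dst, src_reg, VEC_BYTES);
}

/* Zero-extend view: start from an all-zero vector and write only the
   `active` low bytes, so the upper lanes are zero by construction.  */
static void model_insert_into_zero(uint8_t dst[VEC_BYTES],
                                   const uint8_t src_reg[VEC_BYTES],
                                   size_t active)
{
    assert(active <= VEC_BYTES);
    memset(dst, 0, VEC_BYTES);
    memcpy(dst, src_reg, active);
}
```

If the source register was produced by a broadcast such as the movi above (low eight lanes set to 1), the paradoxical view keeps all eight lanes, whereas the insert-into-zero view with one active byte keeps only lane 0, and with two active bytes (the VL2-of-bytes, short-sized case) keeps lanes 0 and 1.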
