https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119393
Bug ID: 119393
Summary: [15 Regression] Worse vectorization of imagick_r hot loop on aarch64 since r15-5024-g2a2e6784074e1f
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: acoplan at gcc dot gnu.org
Target Milestone: ---

Created attachment 60836
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60836&action=edit
reduced LTO testcase

I noticed that imagick_r from SPEC CPU 2017 regressed by 3.78% on Neoverse V1
after the ifcombine change r15-5024-g2a2e6784074e1f7b679bc09b1a66982bf60645a5.
This is with -Ofast -flto=auto -mcpu=neoverse-v1+nosve -fomit-frame-pointer.

I've attached a reduced reproducer which (unfortunately) still requires LTO,
but it is at least fairly well reduced (only two small TUs).

With compilers built before/after the above commit, we can run the script from
the attached reproducer (repro.sh) and compare the resulting disassembly. The
first thing to observe is that there is significant code size growth:

$ wc -l good.dis bad.dis
 234 good.dis
 307 bad.dis

i.e. the function grows in size by 73 insns, or 31%.

Looking at the hot loop, before the above change (in good.dis) we have:

  c0:   4cdf045c    ld4     {v28.8h-v31.8h}, [x2], #64
  c4:   91000421    add     x1, x1, #0x1
  c8:   4f10a7bf    sxtl2   v31.4s, v29.8h
  cc:   0f10a7bd    sxtl    v29.4s, v29.4h
  d0:   4f10a797    sxtl2   v23.4s, v28.8h
  d4:   4f10a7d0    sxtl2   v16.4s, v30.8h
  d8:   0f20a7f3    sxtl    v19.2d, v31.2s
  dc:   0f20a7b2    sxtl    v18.2d, v29.2s
  e0:   4f20a7ff    sxtl2   v31.2d, v31.4s
  e4:   4f20a7bd    sxtl2   v29.2d, v29.4s
  e8:   4e61da73    scvtf   v19.2d, v19.2d
  ec:   4e61da52    scvtf   v18.2d, v18.2d
  f0:   4e61dbff    scvtf   v31.2d, v31.2d
  f4:   4e61dbbd    scvtf   v29.2d, v29.2d
  f8:   0f10a79c    sxtl    v28.4s, v28.4h
  fc:   0f10a7de    sxtl    v30.4s, v30.4h
 100:   4e7fd67f    fadd    v31.2d, v19.2d, v31.2d
 104:   4e7dd65d    fadd    v29.2d, v18.2d, v29.2d
 108:   0f20a6f5    sxtl    v21.2d, v23.2s
 10c:   0f20a794    sxtl    v20.2d, v28.2s
 110:   0f20a7d1    sxtl    v17.2d, v30.2s
 114:   4e7dd7fd    fadd    v29.2d, v31.2d, v29.2d
 118:   4f20a6f7    sxtl2   v23.2d, v23.4s
 11c:   4f20a79c    sxtl2   v28.2d, v28.4s
 120:   4f20a7de    sxtl2   v30.2d, v30.4s
 124:   4fdb13b9    fmla    v25.2d, v29.2d, v27.d[0]
 128:   0f20a61d    sxtl    v29.2d, v16.2s
 12c:   4f20a610    sxtl2   v16.2d, v16.4s
 130:   4e61dab5    scvtf   v21.2d, v21.2d
 134:   4e61daf7    scvtf   v23.2d, v23.2d
 138:   4e61da94    scvtf   v20.2d, v20.2d
 13c:   4e61db9c    scvtf   v28.2d, v28.2d
 140:   4e61dbbd    scvtf   v29.2d, v29.2d
 144:   4e61da10    scvtf   v16.2d, v16.2d
 148:   4e61da31    scvtf   v17.2d, v17.2d
 14c:   4e61dbde    scvtf   v30.2d, v30.2d
 150:   4e77d6b7    fadd    v23.2d, v21.2d, v23.2d
 154:   4e7cd69c    fadd    v28.2d, v20.2d, v28.2d
 158:   4e70d7b0    fadd    v16.2d, v29.2d, v16.2d
 15c:   4e7ed63e    fadd    v30.2d, v17.2d, v30.2d
 160:   4e7cd6fc    fadd    v28.2d, v23.2d, v28.2d
 164:   4e70d7d0    fadd    v16.2d, v30.2d, v16.2d
 168:   4fdb139a    fmla    v26.2d, v28.2d, v27.d[0]
 16c:   4fdb1218    fmla    v24.2d, v16.2d, v27.d[0]
 170:   eb01009f    cmp     x4, x1
 174:   54fffa61    b.ne    c0 <MorphologyApply.constprop.0+0xc0>  // b.any

but after the above change (in bad.dis), we have:

 164:   ad41701d    ldp     q29, q28, [x0, #32]
 168:   9101c3e8    add     x8, sp, #0x70
 16c:   ad40781a    ldp     q26, q30, [x0]
 170:   91000421    add     x1, x1, #0x1
 174:   4ebd1fa0    mov     v0.16b, v29.16b
 178:   91010000    add     x0, x0, #0x40
 17c:   4ebc1f81    mov     v1.16b, v28.16b
 180:   4ebe1fc2    mov     v2.16b, v30.16b
 184:   ad03fbfa    stp     q26, q30, [sp, #112]
 188:   4c40a119    ld1     {v25.16b, v26.16b}, [x8]
 18c:   4e0a201e    tbl     v30.16b, {v0.16b, v1.16b}, v10.16b
 190:   4ebd1fa3    mov     v3.16b, v29.16b
 194:   4e08233c    tbl     v28.16b, {v25.16b, v26.16b}, v8.16b
 198:   0f10a7da    sxtl    v26.4s, v30.4h
 19c:   4e09205d    tbl     v29.16b, {v2.16b, v3.16b}, v9.16b
 1a0:   4f10a7de    sxtl2   v30.4s, v30.8h
 1a4:   0f10a798    sxtl    v24.4s, v28.4h
 1a8:   4f10a79c    sxtl2   v28.4s, v28.8h
 1ac:   0f10a7b9    sxtl    v25.4s, v29.4h
 1b0:   4f10a7bd    sxtl2   v29.4s, v29.8h
 1b4:   0f20a70f    sxtl    v15.2d, v24.2s
 1b8:   0f20a790    sxtl    v16.2d, v28.2s
 1bc:   0f20a731    sxtl    v17.2d, v25.2s
 1c0:   0f20a7b2    sxtl    v18.2d, v29.2s
 1c4:   0f20a753    sxtl    v19.2d, v26.2s
 1c8:   0f20a7d4    sxtl    v20.2d, v30.2s
 1cc:   4f20a718    sxtl2   v24.2d, v24.4s
 1d0:   4f20a79c    sxtl2   v28.2d, v28.4s
 1d4:   4f20a739    sxtl2   v25.2d, v25.4s
 1d8:   4f20a7bd    sxtl2   v29.2d, v29.4s
 1dc:   4f20a75a    sxtl2   v26.2d, v26.4s
 1e0:   4f20a7de    sxtl2   v30.2d, v30.4s
 1e4:   4e61d9ef    scvtf   v15.2d, v15.2d
 1e8:   4e61db18    scvtf   v24.2d, v24.2d
 1ec:   4e61da10    scvtf   v16.2d, v16.2d
 1f0:   4e61db9c    scvtf   v28.2d, v28.2d
 1f4:   4e61da31    scvtf   v17.2d, v17.2d
 1f8:   4e61db39    scvtf   v25.2d, v25.2d
 1fc:   4e61da52    scvtf   v18.2d, v18.2d
 200:   4e61dbbd    scvtf   v29.2d, v29.2d
 204:   4e61da73    scvtf   v19.2d, v19.2d
 208:   4e61db5a    scvtf   v26.2d, v26.2d
 20c:   4e61da94    scvtf   v20.2d, v20.2d
 210:   4e61dbde    scvtf   v30.2d, v30.2d
 214:   4fdf11fb    fmla    v27.2d, v15.2d, v31.d[0]
 218:   4fdf130d    fmla    v13.2d, v24.2d, v31.d[0]
 21c:   4fdf1205    fmla    v5.2d, v16.2d, v31.d[0]
 220:   4fdf138c    fmla    v12.2d, v28.2d, v31.d[0]
 224:   4fdf1226    fmla    v6.2d, v17.2d, v31.d[0]
 228:   4fdf1336    fmla    v22.2d, v25.2d, v31.d[0]
 22c:   4fdf1255    fmla    v21.2d, v18.2d, v31.d[0]
 230:   4fdf13a7    fmla    v7.2d, v29.2d, v31.d[0]
 234:   4fdf126e    fmla    v14.2d, v19.2d, v31.d[0]
 238:   4fdf1344    fmla    v4.2d, v26.2d, v31.d[0]
 23c:   4fdf128b    fmla    v11.2d, v20.2d, v31.d[0]
 240:   4fdf13d7    fmla    v23.2d, v30.2d, v31.d[0]
 244:   eb01009f    cmp     x4, x1
 248:   54fff8e1    b.ne    164 <MorphologyApply.constprop.0+0x164>  // b.any

so we lose the ld4 and as a result end up using tbls to do the permutes
instead.

The change that ifcombine does by itself looks OK. It seems like the resulting
perturbations to the tree IR are enough to throw off the SLP vectorizer and we
end up with suboptimal code.

I'm currently trying to debug why this happens in the vectorizer, but thought
it might be worth sharing what I've found so far, at this point.
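
For readers without the attachment to hand, here is a rough C sketch of the general loop shape the good.dis listing above suggests: interleaved 16-bit channels are converted to double and accumulated with a loop-invariant weight. This is NOT the attached testcase. The struct layout, the names (struct pixel, weighted_sum) and the choice of three accumulated channels are assumptions made purely for illustration. An ld4 de-interleaves such four-element structures into four vector registers in a single load, which is the permute the bad.dis code instead performs with tbl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: four interleaved signed 16-bit channels. */
struct pixel { short c0, c1, c2, c3; };

/* Hypothetical reduction: accumulate k * channel for three of the four
   channels.  The stride-4 access pattern is what a de-interleaving
   load such as ld4 is designed for.  */
void weighted_sum(const struct pixel *p, size_t n, double k, double out[3])
{
    assert(p != NULL || n == 0);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        s0 += k * (double)p[i].c0;
        s1 += k * (double)p[i].c1;
        s2 += k * (double)p[i].c2;
    }
    out[0] = s0;
    out[1] = s1;
    out[2] = s2;
}
```

Whether a standalone loop like this reproduces the regression is untested; the report notes that the reduced reproducer still needs LTO.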