[Bug tree-optimization/119393] New: [15 Regression] Worse vectorization of imagick_r hot loop on aarch64 since r15-5024-g2a2e6784074e1f

acoplan at gcc dot gnu.org via Gcc-bugs Thu, 20 Mar 2025 09:42:26 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119393


            Bug ID: 119393
           Summary: [15 Regression] Worse vectorization of imagick_r hot
                    loop on aarch64 since r15-5024-g2a2e6784074e1f
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60836
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60836&action=edit
reduced LTO testcase

I noticed that imagick_r from SPEC CPU 2017 regressed by 3.78% on Neoverse V1
after the ifcombine change r15-5024-g2a2e6784074e1f7b679bc09b1a66982bf60645a5.

This is with -Ofast -flto=auto -mcpu=neoverse-v1+nosve -fomit-frame-pointer.

I've attached a reduced reproducer which (unfortunately) still requires LTO,
but it is at least fairly well reduced (only two small TUs).

With compilers built before/after the above commit, we can run the script from
the attached reproducer (repro.sh) and compare the resulting disassembly.

The first thing to observe is that there is significant code size growth:

$ wc -l good.dis bad.dis
  234 good.dis
  307 bad.dis

i.e. the function grows by 73 lines of disassembly (roughly 73 insns), or 31%.
Looking at the hot loop, before the above change (in good.dis) we have:

  c0:   4cdf045c        ld4     {v28.8h-v31.8h}, [x2], #64
  c4:   91000421        add     x1, x1, #0x1
  c8:   4f10a7bf        sxtl2   v31.4s, v29.8h
  cc:   0f10a7bd        sxtl    v29.4s, v29.4h
  d0:   4f10a797        sxtl2   v23.4s, v28.8h
  d4:   4f10a7d0        sxtl2   v16.4s, v30.8h
  d8:   0f20a7f3        sxtl    v19.2d, v31.2s
  dc:   0f20a7b2        sxtl    v18.2d, v29.2s
  e0:   4f20a7ff        sxtl2   v31.2d, v31.4s
  e4:   4f20a7bd        sxtl2   v29.2d, v29.4s
  e8:   4e61da73        scvtf   v19.2d, v19.2d
  ec:   4e61da52        scvtf   v18.2d, v18.2d
  f0:   4e61dbff        scvtf   v31.2d, v31.2d
  f4:   4e61dbbd        scvtf   v29.2d, v29.2d
  f8:   0f10a79c        sxtl    v28.4s, v28.4h
  fc:   0f10a7de        sxtl    v30.4s, v30.4h
 100:   4e7fd67f        fadd    v31.2d, v19.2d, v31.2d
 104:   4e7dd65d        fadd    v29.2d, v18.2d, v29.2d
 108:   0f20a6f5        sxtl    v21.2d, v23.2s
 10c:   0f20a794        sxtl    v20.2d, v28.2s
 110:   0f20a7d1        sxtl    v17.2d, v30.2s
 114:   4e7dd7fd        fadd    v29.2d, v31.2d, v29.2d
 118:   4f20a6f7        sxtl2   v23.2d, v23.4s
 11c:   4f20a79c        sxtl2   v28.2d, v28.4s
 120:   4f20a7de        sxtl2   v30.2d, v30.4s
 124:   4fdb13b9        fmla    v25.2d, v29.2d, v27.d[0]
 128:   0f20a61d        sxtl    v29.2d, v16.2s
 12c:   4f20a610        sxtl2   v16.2d, v16.4s
 130:   4e61dab5        scvtf   v21.2d, v21.2d
 134:   4e61daf7        scvtf   v23.2d, v23.2d
 138:   4e61da94        scvtf   v20.2d, v20.2d
 13c:   4e61db9c        scvtf   v28.2d, v28.2d
 140:   4e61dbbd        scvtf   v29.2d, v29.2d
 144:   4e61da10        scvtf   v16.2d, v16.2d
 148:   4e61da31        scvtf   v17.2d, v17.2d
 14c:   4e61dbde        scvtf   v30.2d, v30.2d
 150:   4e77d6b7        fadd    v23.2d, v21.2d, v23.2d
 154:   4e7cd69c        fadd    v28.2d, v20.2d, v28.2d
 158:   4e70d7b0        fadd    v16.2d, v29.2d, v16.2d
 15c:   4e7ed63e        fadd    v30.2d, v17.2d, v30.2d
 160:   4e7cd6fc        fadd    v28.2d, v23.2d, v28.2d
 164:   4e70d7d0        fadd    v16.2d, v30.2d, v16.2d
 168:   4fdb139a        fmla    v26.2d, v28.2d, v27.d[0]
 16c:   4fdb1218        fmla    v24.2d, v16.2d, v27.d[0]
 170:   eb01009f        cmp     x4, x1
 174:   54fffa61        b.ne    c0 <MorphologyApply.constprop.0+0xc0>  // b.any

but after the above change (in bad.dis), we have:

 164:   ad41701d        ldp     q29, q28, [x0, #32]
 168:   9101c3e8        add     x8, sp, #0x70
 16c:   ad40781a        ldp     q26, q30, [x0]
 170:   91000421        add     x1, x1, #0x1
 174:   4ebd1fa0        mov     v0.16b, v29.16b
 178:   91010000        add     x0, x0, #0x40
 17c:   4ebc1f81        mov     v1.16b, v28.16b
 180:   4ebe1fc2        mov     v2.16b, v30.16b
 184:   ad03fbfa        stp     q26, q30, [sp, #112]
 188:   4c40a119        ld1     {v25.16b, v26.16b}, [x8]
 18c:   4e0a201e        tbl     v30.16b, {v0.16b, v1.16b}, v10.16b
 190:   4ebd1fa3        mov     v3.16b, v29.16b
 194:   4e08233c        tbl     v28.16b, {v25.16b, v26.16b}, v8.16b
 198:   0f10a7da        sxtl    v26.4s, v30.4h
 19c:   4e09205d        tbl     v29.16b, {v2.16b, v3.16b}, v9.16b
 1a0:   4f10a7de        sxtl2   v30.4s, v30.8h
 1a4:   0f10a798        sxtl    v24.4s, v28.4h
 1a8:   4f10a79c        sxtl2   v28.4s, v28.8h
 1ac:   0f10a7b9        sxtl    v25.4s, v29.4h
 1b0:   4f10a7bd        sxtl2   v29.4s, v29.8h
 1b4:   0f20a70f        sxtl    v15.2d, v24.2s
 1b8:   0f20a790        sxtl    v16.2d, v28.2s
 1bc:   0f20a731        sxtl    v17.2d, v25.2s
 1c0:   0f20a7b2        sxtl    v18.2d, v29.2s
 1c4:   0f20a753        sxtl    v19.2d, v26.2s
 1c8:   0f20a7d4        sxtl    v20.2d, v30.2s
 1cc:   4f20a718        sxtl2   v24.2d, v24.4s
 1d0:   4f20a79c        sxtl2   v28.2d, v28.4s
 1d4:   4f20a739        sxtl2   v25.2d, v25.4s
 1d8:   4f20a7bd        sxtl2   v29.2d, v29.4s
 1dc:   4f20a75a        sxtl2   v26.2d, v26.4s
 1e0:   4f20a7de        sxtl2   v30.2d, v30.4s
 1e4:   4e61d9ef        scvtf   v15.2d, v15.2d
 1e8:   4e61db18        scvtf   v24.2d, v24.2d
 1ec:   4e61da10        scvtf   v16.2d, v16.2d
 1f0:   4e61db9c        scvtf   v28.2d, v28.2d
 1f4:   4e61da31        scvtf   v17.2d, v17.2d
 1f8:   4e61db39        scvtf   v25.2d, v25.2d
 1fc:   4e61da52        scvtf   v18.2d, v18.2d
 200:   4e61dbbd        scvtf   v29.2d, v29.2d
 204:   4e61da73        scvtf   v19.2d, v19.2d
 208:   4e61db5a        scvtf   v26.2d, v26.2d
 20c:   4e61da94        scvtf   v20.2d, v20.2d
 210:   4e61dbde        scvtf   v30.2d, v30.2d
 214:   4fdf11fb        fmla    v27.2d, v15.2d, v31.d[0]
 218:   4fdf130d        fmla    v13.2d, v24.2d, v31.d[0]
 21c:   4fdf1205        fmla    v5.2d, v16.2d, v31.d[0]
 220:   4fdf138c        fmla    v12.2d, v28.2d, v31.d[0]
 224:   4fdf1226        fmla    v6.2d, v17.2d, v31.d[0]
 228:   4fdf1336        fmla    v22.2d, v25.2d, v31.d[0]
 22c:   4fdf1255        fmla    v21.2d, v18.2d, v31.d[0]
 230:   4fdf13a7        fmla    v7.2d, v29.2d, v31.d[0]
 234:   4fdf126e        fmla    v14.2d, v19.2d, v31.d[0]
 238:   4fdf1344        fmla    v4.2d, v26.2d, v31.d[0]
 23c:   4fdf128b        fmla    v11.2d, v20.2d, v31.d[0]
 240:   4fdf13d7        fmla    v23.2d, v30.2d, v31.d[0]
 244:   eb01009f        cmp     x4, x1
 248:   54fff8e1        b.ne    164 <MorphologyApply.constprop.0+0x164>  // b.any

so we lose the ld4 and as a result end up using tbls to do the permutes
instead.  The change that ifcombine makes looks OK by itself, but the
resulting perturbations to the tree IR seem to be enough to throw off the SLP
vectorizer, and we end up with suboptimal code.  I'm still debugging why this
happens in the vectorizer, but thought it worth sharing what I've found so
far.
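
For readers less familiar with ld4, the kind of access pattern it serves can be
sketched in C.  This is not the attached reduced testcase and is not claimed to
reproduce the regression; the type pixel4 and the function weighted_sum are
hypothetical names chosen for illustration.  The sketch assumes four
interleaved signed 16-bit channels per pixel, of which three are converted to
double and accumulated with a loop-invariant weight, which is roughly the shape
suggested by the sxtl/scvtf/fmla sequence in the disassembly above.

```c
#include <stddef.h>

/* Hypothetical pixel layout: four interleaved signed 16-bit channels.
   The names are illustrative and do not come from the testcase. */
typedef struct {
    short c0, c1, c2, c3;
} pixel4;

/* Accumulate three of the four channels, each scaled by a loop-invariant
   weight w.  Each channel is read with a stride of four shorts, which is
   the access pattern a de-interleaving load such as ld4 serves directly. */
void weighted_sum(const pixel4 *p, size_t n, double w, double out[3])
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;

    for (size_t i = 0; i < n; i++) {
        s0 += w * (double)p[i].c0;
        s1 += w * (double)p[i].c1;
        s2 += w * (double)p[i].c2;
        /* c3 is deliberately unused in this sketch. */
    }

    out[0] = s0;
    out[1] = s1;
    out[2] = s2;
}
```

When a vectorizer treats the three channel reads as one interleaved load
group, it can de-interleave them with a single structured load.  Otherwise it
has to load contiguous vectors and shuffle the lanes apart, which is where
permutes such as tbl come in.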
