https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119361
Bug ID: 119361
Summary: RISC-V: x264 satd_4x4 stack spilling with mtune=generic-ooo for vls code but not on vla code
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ewlu at rivosinc dot com
Target Milestone: ---

Looking at the following code from x264 (SPEC2017):

#include <stdint.h>

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_4x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][2];
    uint32_t a0, a1, a2, a3, b0, b1;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = pix1[0] - pix2[0];
        a1 = pix1[1] - pix2[1];
        b0 = (a0+a1) + ((a0-a1)<<16);
        a2 = pix1[2] - pix2[2];
        a3 = pix1[3] - pix2[3];
        b1 = (a2+a3) + ((a2-a3)<<16);
        tmp[i][0] = b0 + b1;
        tmp[i][1] = b0 - b1;
    }
    for( int i = 0; i < 2; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        a0 = abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
        sum += ((uint16_t)a0) + (a0>>16);
    }
    return sum >> 1;
}

With -mtune=generic-ooo, the vls code spills to the stack but the vla code does not. Without -mtune=generic-ooo, both the vls and vla code spill to the stack.

        vsub.vv v5,v2,v1
        vsseg2e32.v     v4,(sp)
        vsetivli        zero,2,e32,mf2,ta,ma
        vmv.v.x v4,a5
        vmv.s.x v10,zero
        vle32.v v6,0(sp)
        vle32.v v3,0(a1)
        vle32.v v1,0(a3)
        vle32.v v2,0(a2)
        addi    sp,sp,32

The effect is easier to see with the -mno-autovec-segment flag enabled:

        vsetivli        zero,2,e64,m1,ta,ma
        vslidedown.vi   v3,v2,1
        vmv.x.s a5,v2
        vslidedown.vi   v2,v1,1
        sd      a5,0(sp)
        vmv.x.s a5,v3
        sd      a5,8(sp)
        vmv.x.s a5,v1
        sd      a5,16(sp)
        vmv.x.s a5,v2
        vle32.v v2,0(sp)
        sd      a5,24(sp)
        addi    a5,sp,8
        vle32.v v1,0(a5)

https://godbolt.org/z/GnEWMjr68

The vla code was also spilling to the stack before r15-3715-g77bd23a3e24.

I looked through the vect/optimized tree passes for differences, but the final optimized gimple output is (from what I can tell) the same for both. From my understanding, this means that the problem is somewhere in the backend?

Probably unrelated, but poking around in the ira dumps, I saw the following:

    r444: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r443: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r442: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r441: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r440: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r439: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
...
      Spill a30(r444,l0)
      Spill a31(r443,l0)
      Spill a32(r439,l0)
      Spill a33(r438,l0)

where these vregs correspond to the vmv.x.s insns in the -mno-autovec-segment snippet. I don't know what `preferred NO_REGS, alternative NO_REGS, allocno NO_REGS` means, but is it potentially a problem with the vmv expander definitions? Or are the vmvs only there because we spill to the stack?
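
As background for reading the reproducer (this is not part of the original analysis): abs2 treats its 32-bit argument as two packed 16-bit lanes, which is why the first loop builds values of the form x + (y<<16). The following is a minimal sketch of that behaviour, assuming both lane values stay well inside the signed 16-bit range. The names pack2 and abs2_self_check and the bound of 1024 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same helper as in the reproducer. */
static uint32_t abs2(uint32_t a)
{
    /* s holds 0xffff in each 16-bit lane whose sign bit is set. */
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    /* Per lane: (x + 0xffff) ^ 0xffff negates x when the lane is negative. */
    return (a + s) ^ s;
}

/* Hypothetical helper: pack two small signed values as lo + (hi << 16),
   using wrapping unsigned arithmetic as the first loop of the reproducer does. */
static uint32_t pack2(int lo, int hi)
{
    return (uint32_t)lo + ((uint32_t)hi << 16);
}

/* Hypothetical check: for every lane pair in [-1024, 1024], abs2 of the
   packed value equals the packed per-lane absolute values. */
void abs2_self_check(void)
{
    for (int lo = -1024; lo <= 1024; lo++) {
        for (int hi = -1024; hi <= 1024; hi++) {
            uint32_t expect = (uint32_t)abs(lo) + ((uint32_t)abs(hi) << 16);
            assert(abs2(pack2(lo, hi)) == expect);
        }
    }
}
```

The borrow that a negative low lane takes from the high lane when packing is undone by the carry out of the low lane in a+s, so the two lanes can be handled with ordinary 32-bit arithmetic.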