https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119361
Bug ID: 119361
Summary: RISC-V: x264 satd_4x4 stack spilling with
mtune=generic-ooo for vls code but not on vla code
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ewlu at rivosinc dot com
Target Milestone: ---
Looking at the following code from x264 (SPEC2017):
#include <stdint.h>

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_4x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][2];
    uint32_t a0, a1, a2, a3, b0, b1;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = pix1[0] - pix2[0];
        a1 = pix1[1] - pix2[1];
        b0 = (a0+a1) + ((a0-a1)<<16);
        a2 = pix1[2] - pix2[2];
        a3 = pix1[3] - pix2[3];
        b1 = (a2+a3) + ((a2-a3)<<16);
        tmp[i][0] = b0 + b1;
        tmp[i][1] = b0 - b1;
    }
    for( int i = 0; i < 2; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        a0 = abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
        sum += ((uint16_t)a0) + (a0>>16);
    }
    return sum >> 1;
}
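For anyone reading the reproducer without x264 context: each uint32_t in tmp
carries two signed 16-bit lanes, and abs2 takes the absolute value of both
lanes at once. The sketch below only illustrates that arithmetic. abs2 is
copied from the reproducer; pack2, lane_lo and lane_hi are hypothetical helpers
introduced here for illustration and are not part of x264 or of this report.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the reproducer: per-lane absolute value of two signed 16-bit
   values packed into one 32-bit word. */
static uint32_t abs2(uint32_t a)
{
    /* Bits 15 and 31 are the lane sign bits; s is 0xffff in each negative lane. */
    uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
    /* Adding s and then xoring with s negates the negative lanes only. */
    return (a + s) ^ s;
}

/* Hypothetical helper: pack two values the way the first loop does,
   i.e. lo + (hi << 16) in unsigned 32-bit arithmetic. */
static uint32_t pack2(int lo, int hi)
{
    return (uint32_t)lo + ((uint32_t)hi << 16);
}

/* Hypothetical helpers: read the two lanes back out. */
static unsigned lane_lo(uint32_t v) { return (uint16_t)v; }
static unsigned lane_hi(uint32_t v) { return v >> 16; }
```

With these helpers, abs2(pack2(-16, 16)) has 16 in both lanes, which is the
property the second loop relies on when it adds the two halves of a0 into sum.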
With -mtune=generic-ooo, the vls code spills to the stack while the vla code
does not. Without -mtune=generic-ooo, both the vls and vla code spill to the
stack:
vsub.vv v5,v2,v1
vsseg2e32.v v4,(sp)
vsetivli zero,2,e32,mf2,ta,ma
vmv.v.x v4,a5
vmv.s.x v10,zero
vle32.v v6,0(sp)
vle32.v v3,0(a1)
vle32.v v1,0(a3)
vle32.v v2,0(a2)
addi sp,sp,32
The effect is easier to see with the -mno-autovec-segment flag enabled:
vsetivli zero,2,e64,m1,ta,ma
vslidedown.vi v3,v2,1
vmv.x.s a5,v2
vslidedown.vi v2,v1,1
sd a5,0(sp)
vmv.x.s a5,v3
sd a5,8(sp)
vmv.x.s a5,v1
sd a5,16(sp)
vmv.x.s a5,v2
vle32.v v2,0(sp)
sd a5,24(sp)
addi a5,sp,8
vle32.v v1,0(a5)
https://godbolt.org/z/GnEWMjr68
The vla code was also spilling to the stack before r15-3715-g77bd23a3e24. I
looked through the vect and optimized tree dumps for differences, but the final
optimized gimple is (from what I can tell) the same in both cases. As I
understand it, that means the problem is somewhere in the backend?
Probably unrelated, but while poking around in the ira dumps I saw the following:
r444: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
r443: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
r442: preferred V_REGS, alternative NO_REGS, allocno V_REGS
r441: preferred V_REGS, alternative NO_REGS, allocno V_REGS
r440: preferred V_REGS, alternative NO_REGS, allocno V_REGS
r439: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
...
Spill a30(r444,l0)
Spill a31(r443,l0)
Spill a32(r439,l0)
Spill a33(r438,l0)
where these vregs correspond to the vmv.x.s insns in the -mno-autovec-segment
snippet. I don't know what `preferred NO_REGS, alternative NO_REGS, allocno
NO_REGS` means, but could it point to a problem with the vmv expander
definitions? Or are the vmvs only there because we spill to the stack?