https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119361

            Bug ID: 119361
           Summary: RISC-V: x264 satd_4x4 stack spilling with
                    mtune=generic-ooo for vls code but not on vla code
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ewlu at rivosinc dot com
  Target Milestone: ---

The following code is from x264 (SPEC2017):

#include <stdint.h>

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

static uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

int x264_pixel_satd_4x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][2];
    uint32_t a0, a1, a2, a3, b0, b1;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = pix1[0] - pix2[0];
        a1 = pix1[1] - pix2[1];
        b0 = (a0+a1) + ((a0-a1)<<16);
        a2 = pix1[2] - pix2[2];
        a3 = pix1[3] - pix2[3];
        b1 = (a2+a3) + ((a2-a3)<<16);
        tmp[i][0] = b0 + b1;
        tmp[i][1] = b0 - b1;
    }
    for( int i = 0; i < 2; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        a0 = abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
        sum += ((uint16_t)a0) + (a0>>16);
    }
    return sum >> 1;
}

With -mtune=generic-ooo, the vls code spills to the stack but the vla code
does not. Without -mtune=generic-ooo, both the vls and vla code spill to the
stack.

        vsub.vv v5,v2,v1
        vsseg2e32.v     v4,(sp)
        vsetivli        zero,2,e32,mf2,ta,ma
        vmv.v.x v4,a5
        vmv.s.x v10,zero
        vle32.v v6,0(sp)
        vle32.v v3,0(a1)
        vle32.v v1,0(a3)
        vle32.v v2,0(a2)
        addi    sp,sp,32

The effect is easier to see with the -mno-autovec-segment flag enabled:

        vsetivli        zero,2,e64,m1,ta,ma
        vslidedown.vi   v3,v2,1
        vmv.x.s a5,v2
        vslidedown.vi   v2,v1,1
        sd      a5,0(sp)
        vmv.x.s a5,v3
        sd      a5,8(sp)
        vmv.x.s a5,v1
        sd      a5,16(sp)
        vmv.x.s a5,v2
        vle32.v v2,0(sp)
        sd      a5,24(sp)
        addi    a5,sp,8
        vle32.v v1,0(a5)

https://godbolt.org/z/GnEWMjr68

vla code was also spilling to the stack before r15-3715-g77bd23a3e24. I looked
through the vect/optimized tree dumps for differences, but the final optimized
gimple output is (from what I can tell) the same. From my understanding, this
means that the problem is somewhere in the backend?

Probably unrelated, but while poking around in the ira dumps, I saw the
following:

    r444: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r443: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r442: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r441: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r440: preferred V_REGS, alternative NO_REGS, allocno V_REGS
    r439: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
...
      Spill a30(r444,l0)
      Spill a31(r443,l0)
      Spill a32(r439,l0)
      Spill a33(r438,l0)
where these vregs correspond to the vmv.x.s insns in the -mno-autovec-segment
snippet. I don't know what `preferred NO_REGS, alternative NO_REGS, allocno
NO_REGS` means, but is it potentially a problem with the vmv expander
definitions? Or are the vmvs only there because we spill to the stack?
