https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152

            Bug ID: 122152
           Summary: riscv64 uses a vector segmented load instead of a
                    vector strided load
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bergner at gcc dot gnu.org
  Target Milestone: ---

I'm marking this as a riscv target issue for now, until we know for sure
where the problem actually lies.

The following simplified test case, based on code from libquantum, shows a
problem where we emit a segmented vector load when a simple strided vector
load would be preferable for performance reasons:

linux$ cat bug.c
typedef struct
{
  unsigned int unused;
  unsigned int state;
} node_t;

void
test (node_t *reg, unsigned int value)
{
  for (long i = 0; i < 64; i++)
    reg[i].state |= value;
}

linux$ gcc -O3 -march=rv64gcv -S bug.c
linux$ cat bug.s
[snip]
        vsetvli a4,zero,e32,m1,ta,ma
        vmv.v.x v1,a1
        addi    a3,a0,4
        mv      a2,a3
        li      a4,63
        li      a7,8
.L2:
        vsetvli a5,a4,e32,m1,ta,ma
        vlseg2e32.v     v2,(a3)
        slli    a6,a5,3
        sub     a4,a4,a5
        add     a3,a3,a6
        vor.vv  v2,v1,v2
        vsse32.v        v2,0(a2),a7
        add     a2,a2,a6
        bne     a4,zero,.L2
        lw      a5,508(a0)
        or      a5,a5,a1
        sw      a5,508(a0)
        ret

Here, the vlseg2e32.v loads v2 with the reg[*].state values and v3 with the
reg[*].unused values.  Since the reg[*].unused values are never used, we load
them needlessly and also increase vector register pressure for no reason.  A
simple strided load that fetches only the reg[*].state values would suffice
and perform better.
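
For illustration, here is a rough sketch in RVV intrinsics of the strided
form we would like to see (hand-written, not compiler output; the function
name test_strided is made up, it reuses the node_t definition from bug.c
above, and it assumes the standard <riscv_vector.h> intrinsics with a byte
stride of sizeof (node_t) == 8, i.e. the same stride the vsse32.v store
already uses):

#include <riscv_vector.h>

/* Sketch: touch only the .state members via a strided load / or / strided
   store, analogous to vlse32.v / vor.vx / vsse32.v.  */
void
test_strided (node_t *reg, unsigned int value)
{
  unsigned int *p = &reg[0].state;
  long n = 64;
  while (n > 0)
    {
      size_t vl = __riscv_vsetvl_e32m1 (n);
      /* Strided load of reg[i].state only; stride is 8 bytes.  */
      vuint32m1_t v = __riscv_vlse32_v_u32m1 (p, sizeof (node_t), vl);
      v = __riscv_vor_vx_u32m1 (v, value, vl);
      /* Strided store back, matching the vsse32.v the compiler emits.  */
      __riscv_vsse32_v_u32m1 (p, sizeof (node_t), v, vl);
      p += 2 * vl;  /* advance vl node_t elements (2 unsigned ints each) */
      n -= vl;
    }
}

Looking at the vectorized GIMPLE, the second lane of the
.MASK_LEN_LOAD_LANES result is extracted but never used: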

  # RANGE [irange] long int [0, 63] MASK 0x3f VALUE 0x0
  # i_15 = PHI <i_12(5), 0(2)>
  # .MEM_16 = PHI <.MEM_47(5), .MEM_8(D)(2)>
  # ivtmp_19 = PHI <ivtmp_18(5), 64(2)>
  # PT = nonlocal null
  # vectp_reg.8_36 = PHI <vectp_reg.8_37(5), vectp_reg.9_35(2)>
  # PT = nonlocal null
  # vectp_reg.16_45 = PHI <vectp_reg.16_46(5), vectp_reg.9_35(2)>
  # ivtmp_48 = PHI <ivtmp_49(5), 63(2)>
  _50 = .SELECT_VL (ivtmp_48, POLY_INT_CST [4, 4]);
  # RANGE [irange] long unsigned int [0, 63] MASK 0x3f VALUE 0x0
  i.0_1 = (long unsigned intD.5) i_15;
  # RANGE [irange] long unsigned int [0, 0][8, 504] MASK 0x1f8 VALUE 0x0
  _2 = i.0_1 * 8;
  # PT = nonlocal null
  _3 = reg_9(D) + _2;
  ivtmp_34 = _50 * 8;
  # .MEM_51 = VDEF <.MEM_16>
  # USE = anything
  vect_array.10D.2933 = .MASK_LEN_LOAD_LANES (vectp_reg.8_36, 32B, { -1, ... }, _38(D), _50, 0);
  # VUSE <.MEM_51>
  vect__4.12_39 = vect_array.10D.2933[0];
  vect__5.15_42 = vect__4.12_39 | _41;
  # VUSE <.MEM_51>
  vect__4.14_40 = vect_array.10D.2933[1];    <--- vect__4.14_40 is never used
  # .MEM_52 = VDEF <.MEM_51>
  vect_array.10D.2933 ={v} {CLOBBER};

...and vect__4.14_40 is deleted in the dce pass immediately after
vectorization.

Running libquantum built with -mno-autovec-segment gives us an almost 5%
improvement, so it would be nice to get a strided load here.

I can imagine a define_split that matches this scenario could replace the
segmented load with a strided load, but wouldn't it be preferable to not
generate the segmented load in the first place when we know we're not going to
use half the data we're loading?

I'll note we saw this issue on Ascalon with the struct fields being "unsigned
long" and compiling with -O3 -march=rv64gcv_zvl256b.
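
For reference, that variant would look roughly like the following
(reconstructed from the description above rather than copied from the actual
Ascalon test case; in particular, making the value parameter "unsigned long"
as well is an assumption), compiled with -O3 -march=rv64gcv_zvl256b:

typedef struct
{
  unsigned long unused;
  unsigned long state;
} node_t;

void
test (node_t *reg, unsigned long value)
{
  for (long i = 0; i < 64; i++)
    reg[i].state |= value;
}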
