https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122152
Bug ID: 122152
Summary: riscv64 uses a vector segmented load instead of a
vector strided load
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bergner at gcc dot gnu.org
Target Milestone: ---
I'm marking this as a riscv target issue for now, until we know for sure where the problem lies.

The following simplified test case, based on code from libquantum, shows a problem where we emit a segmented vector load when a simple strided vector load would be preferable for performance reasons:
linux$ cat bug.c
typedef struct
{
  unsigned int unused;
  unsigned int state;
} node_t;

void
test (node_t *reg, unsigned int value)
{
  for (long i = 0; i < 64; i++)
    reg[i].state |= value;
}
linux$ gcc -O3 -march=rv64gcv -S bug.c
linux$ cat bug.s
[snip]
        vsetvli a4,zero,e32,m1,ta,ma
        vmv.v.x v1,a1
        addi    a3,a0,4
        mv      a2,a3
        li      a4,63
        li      a7,8
.L2:
        vsetvli a5,a4,e32,m1,ta,ma
        vlseg2e32.v     v2,(a3)
        slli    a6,a5,3
        sub     a4,a4,a5
        add     a3,a3,a6
        vor.vv  v2,v1,v2
        vsse32.v        v2,0(a2),a7
        add     a2,a2,a6
        bne     a4,zero,.L2
        lw      a5,508(a0)
        or      a5,a5,a1
        sw      a5,508(a0)
        ret
Here, the vlseg2e32.v loads v2 with the reg[*].state values and v3 with the
reg[*].unused values.  Since the reg[*].unused values are never used, loading
them is wasted work, and it needlessly increases vector register pressure.  A
simple strided load that fetches only the reg[*].state values would suffice and
perform better.

The vectorized GIMPLE makes the unused lane explicit:
# RANGE [irange] long int [0, 63] MASK 0x3f VALUE 0x0
# i_15 = PHI <i_12(5), 0(2)>
# .MEM_16 = PHI <.MEM_47(5), .MEM_8(D)(2)>
# ivtmp_19 = PHI <ivtmp_18(5), 64(2)>
# PT = nonlocal null
# vectp_reg.8_36 = PHI <vectp_reg.8_37(5), vectp_reg.9_35(2)>
# PT = nonlocal null
# vectp_reg.16_45 = PHI <vectp_reg.16_46(5), vectp_reg.9_35(2)>
# ivtmp_48 = PHI <ivtmp_49(5), 63(2)>
_50 = .SELECT_VL (ivtmp_48, POLY_INT_CST [4, 4]);
# RANGE [irange] long unsigned int [0, 63] MASK 0x3f VALUE 0x0
i.0_1 = (long unsigned intD.5) i_15;
# RANGE [irange] long unsigned int [0, 0][8, 504] MASK 0x1f8 VALUE 0x0
_2 = i.0_1 * 8;
# PT = nonlocal null
_3 = reg_9(D) + _2;
ivtmp_34 = _50 * 8;
# .MEM_51 = VDEF <.MEM_16>
# USE = anything
vect_array.10D.2933 = .MASK_LEN_LOAD_LANES (vectp_reg.8_36, 32B, { -1, ... },
_38(D), _50, 0);
# VUSE <.MEM_51>
vect__4.12_39 = vect_array.10D.2933[0];
vect__5.15_42 = vect__4.12_39 | _41;
# VUSE <.MEM_51>
vect__4.14_40 = vect_array.10D.2933[1]; <--- vect__4.14_40 is never used
# .MEM_52 = VDEF <.MEM_51>
vect_array.10D.2933 ={v} {CLOBBER};
...and vect__4.14_40 is deleted in the dce pass immediately after
vectorization.
Running libquantum with -mno-autovec-segment gives us an almost 5% improvement,
so it would be nice to get a strided load here.
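For illustration, the shape of code we'd like to get is roughly what the
following RVV-intrinsics version of the loop expresses.  This is just a sketch
I wrote for this report; the function name test_strided and the choice of
e32/m1 are mine, not anything GCC emits today:

#include <riscv_vector.h>

typedef struct
{
  unsigned int unused;
  unsigned int state;
} node_t;

/* Illustrative sketch: strided-load form of the loop from bug.c.  Only the
   state field is touched, using a byte stride of sizeof (node_t) == 8.  */
void
test_strided (node_t *reg, unsigned int value)
{
  unsigned int *p = &reg[0].state;
  for (long n = 64; n > 0; )
    {
      size_t vl = __riscv_vsetvl_e32m1 (n);
      /* vlse32.v: strided load of reg[i].state only.  */
      vuint32m1_t v = __riscv_vlse32_v_u32m1 (p, sizeof (node_t), vl);
      v = __riscv_vor_vx_u32m1 (v, value, vl);
      /* vsse32.v: strided store back to reg[i].state.  */
      __riscv_vsse32_v_u32m1 (p, sizeof (node_t), v, vl);
      p += 2 * vl;   /* advance vl node_t elements (2 uints each).  */
      n -= vl;
    }
}

This should map essentially one-to-one onto a vlse32.v / vor.vv / vsse32.v loop
and never touches the unused field.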
I can imagine a define_split that matches this scenario and replaces the
segmented load with a strided load, but wouldn't it be preferable not to
generate the segmented load in the first place, when we know we're not going to
use half of the data we're loading?
I'll note that we saw this issue on Ascalon with the struct fields being
"unsigned long" and compiling with -O3 -march=rv64gcv_zvl256b.