https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639
Bug ID: 120639
Summary: vect: Strided memory access type, stores with gaps?
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Target: riscv

In x264 we have several variations of the following loop:

void foo (uint8_t *dst, int i_dst_stride,
          uint8_t *src1, int i_src1_stride,
          uint8_t *src2, int i_src2_stride,
          int i_width, int i_height)
{
  for( int y = 0; y < i_height; y++ )
    {
      for( int x = 0; x < i_width; x++ )
        dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
      dst += i_dst_stride;
      src1 += i_src1_stride;
      src2 += i_src2_stride;
    }
}

There certainly is a costing tradeoff, but for smaller values of i_width we'd
ideally want to use strided loads and stores here.  Right now we vectorize the
inner loop with regular loads (and stores) of length i_width.  This is ideal
if i_width >= vector size.

What I could imagine is performing loop versioning dependent on i_width after
we have established that a strided access is possible.  But that only gets us
half-way.  The other issue is the memory access type: depending on the uarch
it might be desirable to build the store "piece by piece" rather than
contiguously.

Assuming i_width = 12, we could perform a 64-bit element strided load for the
first 8 elements of each row (src1[x..x+7], src1[x+stride..x+stride+7], ...),
then do the arithmetic and a strided store with gaps, i.e. the remaining
4 bytes of each row are left untouched.  Next, a 32-bit element strided load
for the remaining 4 elements (src1[x+8..x+11], src1[x+stride+8..x+stride+11],
...), then arithmetic and store as before.  A scalar model of this access
pattern is sketched at the end of this report.

I suppose that doesn't strictly fit our regular loop vectorization scheme and
might rather be an SLP approach (or a mixed one?).  I can only think of two
ways to handle this:

- Introduce some kind of loop over i_width that repeats the above scheme
  until i_width is reached.
- Increase "ncopies/vec_stmts" so we can accommodate i_width.

If possible, it would allow us to use full vectors rather than wasting
(1 - i_width / vector_size) of the vector capacity.  Obviously we'd have to
account for the fact that strided ops will usually be slower than contiguous
loads, but for e.g. i_width = 8 or i_width = 16 and 512-bit vectors there
certainly is some creative leeway.

Any ideas if and how we could make such an approach happen?
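
To make the i_width = 12 decomposition above concrete, here is a scalar model
of the intended access pattern.  This is only a sketch: the function name, the
vl parameter and the memcpy loops are illustrative stand-ins for what would
really be strided vector loads/stores (e.g. vlse64/vsse64 and vlse32/vsse32 on
RISC-V), not proposed vectorizer output.

#include <stdint.h>
#include <string.h>

/* Scalar model of the proposed access pattern for i_width = 12.
   "vl" rows are handled per outer iteration; each memcpy loop below
   corresponds to one strided vector load or store whose byte stride
   is the respective row stride.  */
static void
avg_width12_model (uint8_t *dst, int i_dst_stride,
                   uint8_t *src1, int i_src1_stride,
                   uint8_t *src2, int i_src2_stride,
                   int i_height, int vl)
{
  for (int y = 0; y < i_height; y += vl)
    {
      int rows = i_height - y < vl ? i_height - y : vl;

      /* 64-bit element strided load: bytes 0..7 of each of "rows" rows.  */
      for (int v = 0; v < rows; v++)
        {
          uint8_t a[8], b[8], r[8];
          memcpy (a, src1 + v * i_src1_stride, 8);
          memcpy (b, src2 + v * i_src2_stride, 8);
          for (int i = 0; i < 8; i++)
            r[i] = (a[i] + b[i] + 1) >> 1;
          /* 64-bit element strided store; leaves bytes 8..11 of each
             dst row untouched (the "gap").  */
          memcpy (dst + v * i_dst_stride, r, 8);
        }

      /* 32-bit element strided load/store: bytes 8..11 of each row.  */
      for (int v = 0; v < rows; v++)
        {
          uint8_t a[4], b[4], r[4];
          memcpy (a, src1 + v * i_src1_stride + 8, 4);
          memcpy (b, src2 + v * i_src2_stride + 8, 4);
          for (int i = 0; i < 4; i++)
            r[i] = (a[i] + b[i] + 1) >> 1;
          memcpy (dst + v * i_dst_stride + 8, r, 4);
        }

      dst += vl * i_dst_stride;
      src1 += vl * i_src1_stride;
      src2 += vl * i_src2_stride;
    }
}

The first store only touches 8 of the 12 bytes in every dst row; the second
pass then fills the remaining 4-byte gap, which is what "strided store with
gaps" refers to above.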