https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639

            Bug ID: 120639
           Summary: vect: Strided memory access type, stores with gaps?
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

In x264 we have several variations of the following loop:

#include <stdint.h>

void foo (uint8_t *dst,  int i_dst_stride,
          uint8_t *src1, int i_src1_stride,
          uint8_t *src2, int i_src2_stride,
          int i_width, int i_height)
{
  for( int y = 0; y < i_height; y++ )
    {
      for( int x = 0; x < i_width; x++ )
        dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
      dst  += i_dst_stride;
      src1 += i_src1_stride;
      src2 += i_src2_stride;
    }
}

There certainly is a costing tradeoff, but for smaller values of i_width we'd
ideally want to use strided loads and stores here.

Right now we vectorize the inner loop with regular (contiguous) loads and
stores of length i_width.  This is ideal if i_width >= vector size.
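
For illustration, roughly what this amounts to per row, written by hand with
RVV intrinsics (an assumption about the shape of the code, not actual GCC
output); with i_width much smaller than the vector length most of each vector
register goes unused:

#include <stdint.h>
#include <riscv_vector.h>

static void row_avg (uint8_t *dst, const uint8_t *src1,
                     const uint8_t *src2, int i_width)
{
  for (int x = 0; x < i_width; )
    {
      /* Contiguous loads/stores of at most i_width bytes per row.  */
      size_t vl = __riscv_vsetvl_e8m1 (i_width - x);
      vuint8m1_t a = __riscv_vle8_v_u8m1 (src1 + x, vl);
      vuint8m1_t b = __riscv_vle8_v_u8m1 (src2 + x, vl);
      /* (a + b + 1) >> 1 via widening add and narrowing shift.  */
      vuint16m2_t s = __riscv_vadd_vx_u16m2
        (__riscv_vwaddu_vv_u16m2 (a, b, vl), 1, vl);
      __riscv_vse8_v_u8m1 (dst + x, __riscv_vnsrl_wx_u8m1 (s, 1, vl), vl);
      x += vl;
    }
}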

What I could imagine is performing loop versioning dependent on i_width once we
have established that a strided access is possible.  But that only gets us
halfway.  The other issue is the memory access type:  depending on the uarch
it might be desirable to build the store "piece by piece" rather than
contiguously.
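
For the versioning part, a rough model of the idea (the threshold and the
helper names foo_rowwise/foo_strided are made up for illustration):

#include <stdint.h>

enum { VECTOR_BYTES = 64 };   /* stand-in for the (runtime) vector length */

void foo_rowwise (uint8_t *, int, uint8_t *, int, uint8_t *, int, int, int);
void foo_strided (uint8_t *, int, uint8_t *, int, uint8_t *, int, int, int);

void foo_versioned (uint8_t *dst, int i_dst_stride,
                    uint8_t *src1, int i_src1_stride,
                    uint8_t *src2, int i_src2_stride,
                    int i_width, int i_height)
{
  /* Version on i_width: narrow rows take the strided scheme, wide rows keep
     the current contiguous per-row vectorization.  */
  if (i_width * 2 <= VECTOR_BYTES)
    foo_strided (dst, i_dst_stride, src1, i_src1_stride,
                 src2, i_src2_stride, i_width, i_height);
  else
    foo_rowwise (dst, i_dst_stride, src1, i_src1_stride,
                 src2, i_src2_stride, i_width, i_height);
}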

Assuming i_width = 12, we could perform a 64-bit element strided load for the
first 8 elements of each row, i.e.

  src1[x..x+7], src1[x+stride..x+stride+7], ...

then do the arithmetic and a strided store (with gaps, i.e. 4 bytes of each
row are left unwritten).

Next, a 32-bit element strided load for the remaining 4 elements,

  src1[x+8..x+11], src1[x+stride+8..x+stride+11], ...

then arithmetic and store as before.
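
For concreteness, a minimal hand-written RVV intrinsics sketch of that scheme
for i_width = 12 (my own illustration, not what GCC would emit; it assumes the
8-byte chunks are sufficiently aligned for the strided 64-bit accesses and
leaves the 32-bit remainder part out):

#include <stdint.h>
#include <riscv_vector.h>

void foo_w12 (uint8_t *dst, int i_dst_stride,
              uint8_t *src1, int i_src1_stride,
              uint8_t *src2, int i_src2_stride,
              int i_height)
{
  for (int y = 0; y < i_height; )
    {
      /* One 64-bit element per row, i.e. vl rows per vector iteration.  */
      size_t vl = __riscv_vsetvl_e64m1 (i_height - y);

      const uint64_t *s1 = (const uint64_t *) (src1 + y * i_src1_stride);
      const uint64_t *s2 = (const uint64_t *) (src2 + y * i_src2_stride);
      uint64_t *d = (uint64_t *) (dst + y * i_dst_stride);

      /* Gather bytes 0..7 of vl consecutive rows with 64-bit strided loads,
         then view them as bytes again.  */
      vuint8m1_t a = __riscv_vreinterpret_v_u64m1_u8m1
        (__riscv_vlse64_v_u64m1 (s1, i_src1_stride, vl));
      vuint8m1_t b = __riscv_vreinterpret_v_u64m1_u8m1
        (__riscv_vlse64_v_u64m1 (s2, i_src2_stride, vl));

      /* (a + b + 1) >> 1 on all vl * 8 bytes.  */
      size_t vlb = vl * 8;
      vuint16m2_t s = __riscv_vadd_vx_u16m2
        (__riscv_vwaddu_vv_u16m2 (a, b, vlb), 1, vlb);
      vuint8m1_t r = __riscv_vnsrl_wx_u8m1 (s, 1, vlb);

      /* Strided store with gaps: only bytes 0..7 of each row are written;
         bytes 8..11 would be handled analogously with vlse32/vsse32.  */
      __riscv_vsse64_v_u64m1 (d, i_dst_stride,
                              __riscv_vreinterpret_v_u8m1_u64m1 (r), vl);

      y += vl;
    }
}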

I suppose that doesn't strictly fit our regular loop vectorization scheme and
might rather be an SLP approach (or a mixed one?).  I can only think of two
ways to handle this:
 - Introduce some kind of loop over i_width that applies the above scheme until
i_width is covered (see the scalar sketch after this list).
 - Increase "ncopies/vec_stmts" so we can accommodate i_width.
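
A scalar model of the first option (purely illustrative; the chunk sizes
8/4/1 are just an example): each chunk of a row is processed for all rows
before moving on, so in a vectorized version every chunk would map to one
strided load/store with the chunk size as element size.

#include <stdint.h>

void foo_chunked (uint8_t *dst, int i_dst_stride,
                  uint8_t *src1, int i_src1_stride,
                  uint8_t *src2, int i_src2_stride,
                  int i_width, int i_height)
{
  for (int x = 0; x < i_width; )
    {
      /* Pick the next chunk size; the vector version would use it as the
         element size of the strided accesses.  */
      int chunk = i_width - x >= 8 ? 8 : (i_width - x >= 4 ? 4 : 1);

      /* Walk all rows for this chunk, i.e. a strided access over the rows.  */
      for (int y = 0; y < i_height; y++)
        for (int i = 0; i < chunk; i++)
          dst[y * i_dst_stride + x + i]
            = (src1[y * i_src1_stride + x + i]
               + src2[y * i_src2_stride + x + i] + 1) >> 1;

      x += chunk;
    }
}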

If possible, this would allow us to use full vectors rather than wasting a
fraction (1 - i_width / vector_size) of the vector capacity.  Obviously we'd
have to account for the fact that strided ops will usually be slower than
contiguous loads and stores, but for e.g. i_width = 8 or i_width = 16 and
512-bit vectors there certainly is some creative leeway.

Any ideas on whether and how we could make such an approach happen?
