https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
> On 20.06.2025 at 16:17, rdapp at gcc dot gnu.org 
> <gcc-bugzi...@gcc.gnu.org> wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639
> 
> --- Comment #5 from Robin Dapp <rdapp at gcc dot gnu.org> ---
>> Well, consider the desired index vector being a real induction (just
>> store it somewhere).  If we can handle that, we should be able to
>> handle the scatter.  If not, we can't handle the scatter.
> 
> Hmm, I think I misunderstood.  You are arguing that we could build an induction
> variable based on the i_height loop, right?  So roughly like
> 
>  vect_vec_iv = {0, 1, ..., i_width};
>  for (... i_height)
>   {
>      ...
>      idxs = "[vect_vec_iv, vect_vec_iv + {i_dst_stride, ...}, ...]"
>      IFN_SCATTER_STORE (dst, idxs);
>      vect_vec_iv += {i_dst_stride, i_dst_stride, ...};
>   }?
> 
> I guess this can always be implemented as a scatter one way or another?
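
To make the index-vector induction concrete, here is a minimal scalar sketch
in C.  It is not GCC code: VL, scatter_store and copy_block_scatter are
hypothetical names, the source block is assumed to be densely packed, and the
scatter is emulated by a plain loop.  Several rows are packed into one vector,
so the index vector advances by rows_per_vec * dst_stride per iteration.

```c
#include <stddef.h>
#include <stdint.h>

enum { VL = 16 };  /* hypothetical number of byte lanes per vector */

/* Scalar stand-in for a scatter store: dst[idx[l]] = val[l] per active lane. */
static void scatter_store(uint8_t *dst, const size_t *idx,
                          const uint8_t *val, size_t lanes)
{
    for (size_t l = 0; l < lanes; l++)
        dst[idx[l]] = val[l];
}

/* Copy a width x height byte block from a densely packed src (assumption)
   into dst, whose rows are dst_stride bytes apart.  Requires 0 < width <= VL. */
void copy_block_scatter(uint8_t *dst, size_t dst_stride,
                        const uint8_t *src, size_t width, size_t height)
{
    if (width == 0 || width > VL)
        return;

    size_t rows_per_vec = VL / width;  /* rows packed into one vector */
    size_t idx[VL] = {0};

    /* Initial index vector: lane (r, c) maps to r * dst_stride + c. */
    for (size_t r = 0; r < rows_per_vec; r++)
        for (size_t c = 0; c < width; c++)
            idx[r * width + c] = r * dst_stride + c;

    for (size_t y = 0; y < height; y += rows_per_vec) {
        /* The last iteration may cover fewer rows (a masked tail). */
        size_t rows = height - y < rows_per_vec ? height - y : rows_per_vec;

        scatter_store(dst, idx, src + y * width, rows * width);

        /* Induction step: every lane advances by the rows just covered. */
        for (size_t l = 0; l < VL; l++)
            idx[l] += rows_per_vec * dst_stride;
    }
}
```

With width 2 and VL 16 this packs eight rows into one scatter instead of
issuing one two-element store per row.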
> 
> But my objective is actually two-fold in that I want to use the full vector
> size and also conflate as many elements as possible into a single one (i.e. 8
> chars into one uint64_t).  The second part helps gather/scatter as well as
> strided loads/stores independently as it reduces the number of individual
> elements (thus reducing the scatter/gather latency).
> 
> So I think in order to make full use of the vector size the induction approach
> can work as we construct the index vector appropriately.
> 
> For conflating/reinterpreting a subset of dynamic indices we IMHO need static
> code that is dynamically dispatched as described in my previous message.
> 
> I.e. a loop over i_width:
>  while (rem > 0)
>   {
>     if (rem == 8)
>        "scatter/strided store with 64-bit elements"
>     if (rem == 4)
>        "scatter/strided store with 32-bit elements"
>     rem -= elsz;
>   }
> 
> I realize that's not something we do at all right now, hence my initial
> question.  Irrespective of how/if something like that could be implemented (I
> can only imagine virtual/composition modes right now), is it even desirable in
> any way?  I know that it would help our uarch at least.

It would be possible to devise a versioning scheme, plus possibly an in-loop
dispatch, for this.  We currently cannot version for multiple vector variants,
but we would need to ensure that rem is handled?
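
As a rough illustration of what such an in-loop dispatch would have to do,
here is a scalar sketch in C.  It is not GCC code and not a proposed
implementation: strided_store and copy_block_dispatch are hypothetical names,
and the strided store is emulated with memcpy.  The loop over the width picks
the widest element size that still fits into rem, so every remainder is
eventually handled by the 1-byte case.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a strided store with elsz-byte elements:
   one element per row, rows dst_stride bytes apart. */
static void strided_store(uint8_t *dst, size_t dst_stride,
                          const uint8_t *src, size_t src_stride,
                          size_t height, size_t elsz)
{
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, elsz);
}

/* Copy a width x height byte block column chunk by column chunk.
   Returns the number of strided stores issued. */
size_t copy_block_dispatch(uint8_t *dst, size_t dst_stride,
                           const uint8_t *src, size_t src_stride,
                           size_t width, size_t height)
{
    size_t off = 0, rem = width, stores = 0;

    while (rem > 0) {
        /* Dynamic dispatch to the widest statically known element size. */
        size_t elsz = rem >= 8 ? 8 : rem >= 4 ? 4 : rem >= 2 ? 2 : 1;

        strided_store(dst + off, dst_stride, src + off, src_stride,
                      height, elsz);
        off += elsz;
        rem -= elsz;
        stores++;
    }
    return stores;
}
```

For a width of 13 this issues three strided stores (8 + 4 + 1 bytes per row)
rather than thirteen single-byte ones.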

