https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|increasing VF during SLP    |Inefficient codegen of
                   |vectorization permutes      |multistep zero-extends and
                   |unnecessarily               |LOAD_LANES

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> ends up
> 
> .L4:
>         ld4     {v28.8h - v31.8h}, [x4], 64
>         add     x3, x3, 64
>         uaddl2  v26.4s, v28.8h, v29.8h
>         uaddl   v28.4s, v28.4h, v29.4h
>         uaddw2  v0.4s, v26.4s, v30.8h
>         uaddw   v28.4s, v28.4s, v30.4h
>         uaddw2  v0.4s, v0.4s, v31.8h
>         uaddw   v28.4s, v28.4s, v31.4h
>         sxtl    v1.2d, v0.2s
>         sxtl    v27.2d, v28.2s
>         sxtl2   v0.2d, v0.4s
>         sxtl2   v28.2d, v28.4s
>         scvtf   v1.2d, v1.2d
>         scvtf   v27.2d, v27.2d
>         scvtf   v0.2d, v0.2d
>         scvtf   v28.2d, v28.2d
>         stp     q1, q0, [x3, -32]
>         stp     q27, q28, [x3, -64]
>         cmp     x5, x4
>         bne     .L4
> 
> we can now use widening plus and avoid the HI -> DF conversion penalty.

The conversion penalty is not the issue. This case is still bad because you're
sign-extending a zero-extended value, which is just a zero-extension anyway.
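
For reference, a scalar sketch of the redundancy (my own illustration, not the
testcase from this PR):

  #include <stdint.h>

  uint64_t
  sum_widen (uint16_t a, uint16_t b, uint16_t c, uint16_t d)
  {
    /* Like UADDL/UADDW: the operands are zero-extended from HI, so the
       32-bit sum is at most 4 * 65535 and its sign bit is clear.  */
    uint32_t s = (uint32_t) a + b + c + d;
    /* The SXTL therefore changes nothing: sign-extending a value whose
       sign bit is known to be zero is the same as zero-extending it.  */
    return (uint64_t) (int64_t) (int32_t) s;
  }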

But again, that's not what I'm complaining about here.

> It uses interleaving because there's no ld8 and when
> vect_lower_load_permutations decides heuristically to use load-lanes it
> tries to do so vector-size agnostic so it doesn't consider using two times
> ld4.
> 
> There _are_ permutes because of the use of 4 lanes to compute the single
> lane store in the reduction operation.  The vectorization for the unrolled
> loop not using load-lanes show them:
> 
>   vect_a1_53.10_234 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_232];
>   vectp_x.8_235 = vectp_x.8_232 + 16;
>   vect_a1_53.11_236 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_235];
>   vectp_x.8_237 = vectp_x.8_232 + 32;
>   vect_a1_53.12_238 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_237];
>   vectp_x.8_239 = vectp_x.8_232 + 48;
>   vect_a1_53.13_240 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_239];
>   _254 = VEC_PERM_EXPR <vect_a1_53.10_234, vect_a1_53.11_236, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
>   _255 = VEC_PERM_EXPR <vect_a1_53.12_238, vect_a1_53.13_240, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
>   _286 = VEC_PERM_EXPR <_254, _255, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
> ...
> 
> that's simply load-lanes open-coded.  If open-coding ld4 is better than using
> ld4 just make it not available to the vectorizer?  Similar to ld2 I suppose.

So I now realize that I missed a step in what LLVM is doing here.
You're right that there is a permute here, but open-coding the permute is
better for this case since the zero-extension from HI to DI can be done with a
permute as well, and that zero-extension permute can fold the load-lanes
permute into itself. That seems to be the trick LLVM is using.
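
Roughly this kind of thing (a NEON intrinsics sketch of the idea, simplified to
a single input vector; not the exact sequence LLVM emits): because out-of-range
TBL indices read as zero, a single byte permute can both de-interleave the HI
lanes and supply the zero high halves of the widened lanes, i.e. the extension
permute absorbs the load permute:

  #include <arm_neon.h>

  /* Zero-extend the odd u16 lanes of X (lanes 1, 3, 5, 7) into u32 lanes.
     Little-endian: u16 lane 1 is bytes 2-3, and index 255 is out of range
     for TBL, so it reads as a zero byte and gives the high half for free.  */
  uint32x4_t
  odd_lanes_zext (uint16x8_t x)
  {
    const uint8x16_t idx = { 2, 3, 255, 255, 6, 7, 255, 255,
                             10, 11, 255, 255, 14, 15, 255, 255 };
    return vreinterpretq_u32_u8 (vqtbl1q_u8 (vreinterpretq_u8_u16 (x), idx));
  }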

In my patch I added a new target hook,
targetm.vectorize.use_permute_for_promotion, which decides when to use a
permute instead of an unpack. I think the right solution here is to tweak the
LOAD_LANES heuristic not to use LOAD_LANES when the results feed into another
permute, since that permute can then be optimized, reducing the total number
of permutes needed.
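
For concreteness, the two forms of HI -> SI promotion the hook chooses between
look roughly like this in intrinsics (my sketch, not the hook or the vectorizer
output); the permute form is the one that can be combined with a neighbouring
permute:

  #include <arm_neon.h>

  /* Unpack style: UXTL/USHLL of the low half.  */
  uint32x4_t
  zext_unpack (uint16x8_t x)
  {
    return vmovl_u16 (vget_low_u16 (x));
  }

  /* Permute style: interleave with a zero vector (ZIP1); on little-endian
     the resulting u16 pairs read back as zero-extended u32 lanes.  */
  uint32x4_t
  zext_permute (uint16x8_t x)
  {
    return vreinterpretq_u32_u16 (vzip1q_u16 (x, vdupq_n_u16 (0)));
  }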

Does that sound good to you?
