[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

rguenther at suse dot de Thu, 07 Sep 2017 07:39:41 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846


--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On September 7, 2017 1:53:47 PM GMT+02:00, "jakub at gcc dot gnu.org"
<gcc-bugzi...@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
>
>--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
>(In reply to Richard Biener from comment #11)
>> that's not using the unpacking strategy (sum adjacent elements) but still the
>> vector shift approach (add upper/lower halves).  That's sth that can be
>> changed independently.
>> 
>> Waiting for final vec_extract/init2 optab settling.
>
>That should be settled now.
>
>BTW, for reductions in PR80324 I've added for avx512fintrin.h __MM512_REDUCE_OP
>which for reductions from 512-bit vectors uses smaller and smaller vectors,
>perhaps that is something we should use for the reductions emitted by the
>vectorizer too (perhaps through a target hook that would emit gimple for the
>reduction)?

Yeah, I have a patch that does this. The question is how to ask the target
whether the vector sizes share the same register set. For example, we wouldn't
want to narrow all the way down to MMX register size.

Doing this would also allow the adds for the 512-bit to 128-bit reduction to
execute in parallel.
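
For illustration only, here is a scalar C model of the two reduction shapes
mentioned in the quoted text: adding the upper half onto the lower half (so the
live width halves at every step, which is what narrowing to smaller vectors
amounts to) and summing adjacent elements. This is a sketch, not the patch or
the vectorizer's actual code; the function names and the 16-element limit are
made up for the example, and real code would use vector registers rather than
an array.

```c
#include <stddef.h>

#define MAX_LANES 16  /* hypothetical limit, e.g. 16 x 32-bit lanes in 512 bits */

/* "Add upper/lower halves": each step adds the upper half of the live
   lanes onto the lower half, so the live width halves (16 -> 8 -> 4 ...).
   n must be a power of two and at most MAX_LANES. */
static int reduce_halves(const int *v, size_t n)
{
    int tmp[MAX_LANES];
    for (size_t i = 0; i < n; i++)
        tmp[i] = v[i];
    for (size_t width = n / 2; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)  /* lanes within a step are independent */
            tmp[i] += tmp[i + width];
    return tmp[0];
}

/* "Sum adjacent elements": each step adds neighbouring pairs and packs
   the results into the low lanes. Same constraints on n as above. */
static int reduce_adjacent(const int *v, size_t n)
{
    int tmp[MAX_LANES];
    for (size_t i = 0; i < n; i++)
        tmp[i] = v[i];
    for (size_t width = n / 2; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];
    return tmp[0];
}
```

Both functions return the same sum for integer input; they differ only in
which lanes are combined at each step.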
