On 01/05/2018 02:01 AM, Richard Biener wrote:
> On Tue, 28 Nov 2017, Richard Biener wrote:
> 
>>
>> The following adds a new target hook, targetm.vectorize.split_reduction,
>> which allows the target to specify a preferred mode to perform the
>> final reduction in, using either vector shifts or scalar extractions.
>> Up to that mode the vector reduction result is reduced by combining
>> lowparts and highparts recursively.  This avoids lane-crossing operations
>> when doing AVX256 on Zen and Bulldozer and also speeds things up on
>> Haswell (I verified a ~20% speedup on Broadwell).
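>>
>> As an illustration (not part of the patch), the low/high combining for
>> an AVX2 V8SImode accumulator corresponds roughly to the intrinsics
>> sequence below; reduce_v8si is a hypothetical helper written only to
>> show the shape of the epilogue:
>>
>> #include <immintrin.h>
>>
>> /* Sketch only: reduce a V8SI accumulator by adding its 128-bit low
>>    and high halves, then finishing within the SSE register using
>>    byte shifts.  */
>> static int
>> reduce_v8si (__m256i acc)
>> {
>>   __m128i lo = _mm256_castsi256_si128 (acc);
>>   __m128i hi = _mm256_extracti128_si256 (acc, 1);
>>   __m128i v = _mm_add_epi32 (lo, hi);
>>   v = _mm_add_epi32 (v, _mm_srli_si128 (v, 8));
>>   v = _mm_add_epi32 (v, _mm_srli_si128 (v, 4));
>>   return _mm_cvtsi128_si32 (v);
>> }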
>>
>> Thus the patch implements the target hook on x86 to _always_ prefer
>> SSE modes for the final reduction.
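>>
>> A rough sketch of the shape such a hook takes for the integer modes
>> (not the exact code in the patch, which covers more modes) would be a
>> fragment in i386.c along these lines:
>>
>> /* Sketch only: prefer 128-bit SSE modes for the final reduction step
>>    so no lane-crossing operations are needed.  */
>> static machine_mode
>> ix86_split_reduction (machine_mode mode)
>> {
>>   switch (mode)
>>     {
>>     case E_V16SImode:
>>     case E_V8SImode:
>>       return V4SImode;
>>     case E_V8DImode:
>>     case E_V4DImode:
>>       return V2DImode;
>>     default:
>>       /* Returning the mode unchanged means no splitting.  */
>>       return mode;
>>     }
>> }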
>>
>> For the testcase in the bugzilla
>>
>> int sumint(const int arr[]) {
>>     arr = __builtin_assume_aligned(arr, 64);
>>     int sum=0;
>>     for (int i=0 ; i<1024 ; i++)
>>       sum+=arr[i];
>>     return sum;
>> }
>>
>> this changes -O3 -mavx512f code from
>>
>> sumint:
>> .LFB0:
>>         .cfi_startproc
>>         vpxord  %zmm0, %zmm0, %zmm0
>>         leaq    4096(%rdi), %rax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vpaddd  (%rdi), %zmm0, %zmm0
>>         addq    $64, %rdi
>>         cmpq    %rdi, %rax
>>         jne     .L2
>>         vpxord  %zmm1, %zmm1, %zmm1
>>         vshufi32x4      $78, %zmm1, %zmm0, %zmm2
>>         vpaddd  %zmm2, %zmm0, %zmm0
>>         vmovdqa64       .LC0(%rip), %zmm2
>>         vpermi2d        %zmm1, %zmm0, %zmm2
>>         vpaddd  %zmm2, %zmm0, %zmm0
>>         vmovdqa64       .LC1(%rip), %zmm2
>>         vpermi2d        %zmm1, %zmm0, %zmm2
>>         vpaddd  %zmm2, %zmm0, %zmm0
>>         vmovdqa64       .LC2(%rip), %zmm2
>>         vpermi2d        %zmm1, %zmm0, %zmm2
>>         vpaddd  %zmm2, %zmm0, %zmm0
>>         vmovd   %xmm0, %eax
>>
>> to
>>
>> sumint:
>> .LFB0:
>>         .cfi_startproc
>>         vpxord  %zmm0, %zmm0, %zmm0
>>         leaq    4096(%rdi), %rax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vpaddd  (%rdi), %zmm0, %zmm0
>>         addq    $64, %rdi
>>         cmpq    %rdi, %rax
>>         jne     .L2
>>         vextracti64x4   $0x1, %zmm0, %ymm1
>>         vpaddd  %ymm0, %ymm1, %ymm1
>>         vmovdqa %xmm1, %xmm0
>>         vextracti128    $1, %ymm1, %xmm1
>>         vpaddd  %xmm1, %xmm0, %xmm0
>>         vpsrldq $8, %xmm0, %xmm1
>>         vpaddd  %xmm1, %xmm0, %xmm0
>>         vpsrldq $4, %xmm0, %xmm1
>>         vpaddd  %xmm1, %xmm0, %xmm0
>>         vmovd   %xmm0, %eax
>>
>> and for -O3 -mavx2 from
>>
>> sumint:
>> .LFB0:
>>         .cfi_startproc
>>         vpxor   %xmm0, %xmm0, %xmm0
>>         leaq    4096(%rdi), %rax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vpaddd  (%rdi), %ymm0, %ymm0
>>         addq    $32, %rdi
>>         cmpq    %rdi, %rax
>>         jne     .L2
>>         vpxor   %xmm1, %xmm1, %xmm1
>>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>>         vpaddd  %ymm2, %ymm0, %ymm0
>>         vperm2i128      $33, %ymm1, %ymm0, %ymm2
>>         vpalignr        $8, %ymm0, %ymm2, %ymm2
>>         vpaddd  %ymm2, %ymm0, %ymm0
>>         vperm2i128      $33, %ymm1, %ymm0, %ymm1
>>         vpalignr        $4, %ymm0, %ymm1, %ymm1
>>         vpaddd  %ymm1, %ymm0, %ymm0
>>         vmovd   %xmm0, %eax
>>
>> to
>>
>> sumint:
>> .LFB0:
>>         .cfi_startproc
>>         vpxor   %xmm0, %xmm0, %xmm0
>>         leaq    4096(%rdi), %rax
>>         .p2align 4,,10
>>         .p2align 3
>> .L2:
>>         vpaddd  (%rdi), %ymm0, %ymm0
>>         addq    $32, %rdi
>>         cmpq    %rdi, %rax
>>         jne     .L2
>>         vmovdqa %xmm0, %xmm1
>>         vextracti128    $1, %ymm0, %xmm0
>>         vpaddd  %xmm0, %xmm1, %xmm0
>>         vpsrldq $8, %xmm0, %xmm1
>>         vpaddd  %xmm1, %xmm0, %xmm0
>>         vpsrldq $4, %xmm0, %xmm1
>>         vpaddd  %xmm1, %xmm0, %xmm0
>>         vmovd   %xmm0, %eax
>>         vzeroupper
>>         ret
>>
>> which besides being faster is also smaller (fewer prefixes).
>>
>> SPEC 2k6 results on Haswell (thus AVX2) are neutral.  As the change merely
>> affects reduction vectorization epilogues I didn't expect big effects,
>> except for loops that do not run many iterations (more likely with AVX512).
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Ok for trunk?
> 
> Ping?
> 
> Richard.
> 
>> The PR mentions some more tricks to optimize the sequence but
>> those look like backend-only optimizations.
>>
>> Thanks,
>> Richard.
>>
>> 2017-11-28  Richard Biener  <rguent...@suse.de>
>>
>>      PR tree-optimization/80846
>>      * target.def (split_reduction): New target hook.
>>      * targhooks.c (default_split_reduction): New function.
>>      * targhooks.h (default_split_reduction): Declare.
>>      * tree-vect-loop.c (vect_create_epilog_for_reduction): If the
>>      target requests, first reduce vectors by combining low and high
>>      parts.
>>      * tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
>>      (get_vectype_for_scalar_type_and_size): Export.
>>      * tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.
>>
>>      * doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
>>      * doc/tm.texi: Regenerate.
>>
>>      i386/
>>      * config/i386/i386.c (ix86_split_reduction): Implement
>>      TARGET_VECTORIZE_SPLIT_REDUCTION.
>>
>>      * gcc.target/i386/pr80846-1.c: New testcase.
>>      * gcc.target/i386/pr80846-2.c: Likewise.
I've got no objections here and you know this code far better than I.

OK.
jeff
