On Tue, Nov 7, 2017 at 6:36 AM, Kumar, Venkataramanan
<venkataramanan.ku...@amd.com> wrote:
> Hi,
>
> The attached patch implements an RTL pass which splits a generated FMA
> instruction into a MUL/ADD sequence.
> The pass is enabled for Zen and is applied when we find it profitable to
> split the FMA.
>
> On Zen, we found that for a tight loop with an FMA (reduction) operation,
> as shown below, generating MUL/ADD instead of FMA significantly improves
> performance.
>
> Example:
> double a[n],b[n],x;
> for(i=0;i<n;i++)
> {
>   x = x + a[i] *b[i];
> }
>
> On Zen:
> The latency of floating point ADD/SUB is 3 cycles and floating point MUL is
> 3-4 cycles [float, double]. The FMA instruction takes 5 cycles.
> There are 4 FPU pipes to handle floating point operations [AVX/SSE]. The FMA
> operation is handled in pipe 0 or 1.
> The ADD or SUB operation is handled in pipe 2 or 3. MUL is done in pipe 0 or
> 1.
>
> In the reduction pattern shown above, for every operation, the add operand
> in the loop is dependent on the previous iteration's result.
> If we generate an FMA instruction for the operation, it results in a 5 cycle
> dependent chain.
>
> On the other hand, if we generate MUL/ADD, the multiply operations are
> independent of each other and will be carried on parallel pipes.
> Given that MUL results are computed ahead, it results in a 3 cycle dependent
> chain of ADD instructions, which is more profitable than generating an FMA
> instruction.
>
> Based on the SPEC benchmarks analyzed, we have enabled splitting of FMA to
> MUL/ADD when we find in a loop a single FMA operation (reduction pattern) or
> a single chain of FMA operations where the dependency is only between the
> FMA add operand and the predecessor FMA's result operand.
> Also, we restricted it to 3 levels of nested loops.
>
> The patch is bootstrapped and regression tested. Also bootstrapped with the
> -march=znver1 flag.
> We ran SPEC benchmarks - CPU2006 and CPU2k17 on Zen AM4.
>
> We get very good improvement for the below benchmarks. Other benchmarks
> remain unaffected.
>
> CPU2006:
> 410.bwaves   (O2 -march=znver1): ~6%
> 410.bwaves   (O3 -march=znver1): ~11%
> 454.calculix (O2 -march=znver1): ~23%
> 454.calculix (O3 -march=znver1): ~24%
>
> CPU2k17:
> 503.bwaves_r (O2 -march=znver1): ~11%
> 503.bwaves_r (O3 -march=znver1): ~11%
> 510.parest_r (O2 -march=znver1): ~24%
> 510.parest_r (O3 -march=znver1): ~24%
> 510.parest_r (Ofast -march=znver1): ~24%
>
> Ok for trunk?
>
> 2017-11-06  Venkataramanan Kumar  <venkataramanan.ku...@amd.com>
>             Rohit arul raj Dharmakan  <rohitarulraj.dharma...@amd.com>
>
>         * config/i386/i386-passes.def: Add pass_handle_fma_split pass.
>         * config/i386/i386-protos.h (make_pass_handle_fma_split): New
>         Prototype.
>         * config/i386/i386.c (is_fma_insn): New.
>         (check_dependent_fma_pattern): Likewise.
>         (insn_defines_operand): Likewise.
>         (insn_uses_operand): Likewise.
>         (check_input_dependency): Likewise.
>         (check_output_dependency): Likewise.
>         (number_of_inner_loops): Likewise.
>         (is_fma_reduc_pattern_cand): Likewise.
>         (is_fma_chain): Likewise.
>         (split_fma_insns): Likewise.
>         (rest_of_handle_fma_split): Likewise.
>         (make_pass_handle_fma_split): Likewise.
>         (fma_analysis_results): New Enum.
>         (class pass_handle_fma_split): New Pass.
>         (pass_data_handle_fma_split): New pass data.
>         (ix86_target_string): Add -msplit-fma.
>         (ix86_option_override_internal): Handle new option.
>         * config/i386/i386.h (TARGET_SPLIT_FMA_OPTIMAL): New macro.
>         * config/i386/i386.opt (msplit-fma): New flag.
>         * config/i386/x86-tune.def (X86_TUNE_SPLIT_FMA_OPTIMAL): New tune.
>         * doc/invoke.texi (SPARC Options): Document -msplit-fma.
>
> 2017-11-06  Venkataramanan Kumar  <venkataramanan.ku...@amd.com>
>             Rohit arul raj Dharmakan  <rohitarulraj.dharma...@amd.com>
>
>         * gcc.target/i386/fma-split.c: New Test.
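
For readers following the latency argument in the quoted message, here is a minimal C sketch of it. It is a back-of-the-envelope model, not a description of what the pass does: the function names are hypothetical, the latencies are the Zen figures quoted above (double precision), and it ignores loads, pipe contention and loop overhead.

```c
/* Zen latencies as quoted in the message above (double precision). */
enum { FMA_LATENCY = 5, MUL_LATENCY = 4, ADD_LATENCY = 3 };

/* Hypothetical model: with one FMA per iteration, every FMA waits for the
   previous iteration's result, so the loop-carried chain costs 5 cycles
   per element. */
static long fma_chain_cycles(long n)
{
    return n * FMA_LATENCY;
}

/* Hypothetical model: with MUL/ADD, the multiplies do not depend on x and
   can run ahead; only the adds are on the loop-carried chain. The first
   add still has to wait for the first multiply. */
static long mul_add_chain_cycles(long n)
{
    return n > 0 ? MUL_LATENCY + n * ADD_LATENCY : 0;
}

/* The reduction written so that the product is visibly independent of x.
   Whether the compiler emits FMA or MUL/ADD for this source is still its
   own decision (for example under -ffp-contract=fast). */
static double dot_split(const double *a, const double *b, int n)
{
    double x = 0.0;
    for (int i = 0; i < n; i++) {
        double p = a[i] * b[i];  /* no dependency on x */
        x = x + p;               /* only this add is loop-carried */
    }
    return x;
}
```

Under this simplified model, a 1000-element reduction has a critical path of about 5000 cycles with FMA and about 3004 cycles with MUL/ADD. That is the direction of the improvement reported above, though the measured gains also depend on everything the model leaves out.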
Index: gcc/config/i386/i386.opt
===================================================================
--- gcc/config/i386/i386.opt	(revision 254211)
+++ gcc/config/i386/i386.opt	(working copy)
@@ -595,6 +595,10 @@
 mprefer-avx256
 Target Report Mask(PREFER_AVX256) Var(ix86_target_flags) Save
 Use 256-bit AVX instructions instead of 512-bit AVX instructions in the auto-vectorizer.
 
+msplit-fma
+Target Report Mask(SPLIT_FMA) Save
+Split FMA instructions when profitable.
+

Please use ix86_target_flags variable here, similar to how mprefer-avx256
is handled. Default target flags are already full and the build will
break for some targets.

Uros.
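
As an illustration of the review comment, the following C sketch mimics what a `Mask(SPLIT_FMA) Var(ix86_target_flags)` entry amounts to: the option gets a bit in a separate flags word instead of the default target flags. This is a stand-alone approximation under stated assumptions, not GCC source. The bit values are made up (the real ones are allocated by GCC's option scripts), and the variable here is only a stand-in for the generated one.

```c
/* Stand-in for the flags word named in the review; in GCC it is generated
   from the .opt files. */
static int ix86_target_flags;

/* Hypothetical bit assignments, chosen only for this sketch. */
#define OPTION_MASK_PREFER_AVX256 (1 << 0)
#define OPTION_MASK_SPLIT_FMA     (1 << 1)

/* Test macros in the style GCC generates for Mask() options attached to a
   non-default variable. */
#define TARGET_PREFER_AVX256 \
    ((ix86_target_flags & OPTION_MASK_PREFER_AVX256) != 0)
#define TARGET_SPLIT_FMA \
    ((ix86_target_flags & OPTION_MASK_SPLIT_FMA) != 0)
```

Setting `ix86_target_flags |= OPTION_MASK_SPLIT_FMA` makes `TARGET_SPLIT_FMA` evaluate to 1 without using any bit of the default target flags, which is the point of the request.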