https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291
--- Comment #3 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> It also looks like mips lacks implementation of any of the vectorizer cost
> hooks and thus defaults to default_builtin_vectorization_cost, which means
> that unaligned loads/stores have double cost.  And mips supports misaligned
> loads/stores via movmisalign (for MSA).  For daxpy:
>
>   for (i = 0; i < n; i++) {
>     dy[i] = dy[i] + da*dx[i];
>   }
>
> the above makes peeling for alignment of dy[] profitable (and I'd generally
> agree, because misaligned stores especially do have a real penalty - though
> likely not when the store queue is not contended, as is likely in this case).
>
> x86_64 peels for alignment as well and we get
>
> .L6:
>         movups  (%rax,%r8), %xmm1
>         addl    $1, %r9d
>         mulps   %xmm2, %xmm1
>         addps   (%r11,%r8), %xmm1
>         movaps  %xmm1, (%r11,%r8)
>         addq    $16, %r8
>         cmpl    %ebx, %r9d
>         jb      .L6
>
> and similar base+index addressing.  IVOPTs does see the indices are the same
> though.
>
>   # i_46 = PHI <i_36(7), 0(4)>
>   prolog_loop_adjusted_niters.6_48 = (sizetype) prolog_loop_niters.5_34;
>   niters.7_49 = niters.3_40 - prolog_loop_niters.5_34;
>   bnd.8_69 = niters.7_49 >> 2;
>   _75 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dy.12_74 = dy_15(D) + _75;
>   _80 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dx.15_79 = dx_16(D) + _80;
>   vect_cst__84 = {da_14(D), da_14(D), da_14(D), da_14(D)};
>   _88 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dy.20_87 = dy_15(D) + _88;
>
> shows the missed CSE from the vectorizer (and a redundant IV).

Do you mean the IV for the exit comparison?  Yes, iv_elimination could be
improved, especially since we know from the vectorization analysis that the
addresses cannot wrap.

> During DR analysis we can in theory keep a list of stmts that share the
> "same" DR (we have this for group reads already) and record the generated
> IVs on the "master" DR.

Grouping DRs is helpful, but not optimal.  For a case like PR68030, we have
starting addresses for the vector accesses as below:

  base1 + offset + init1;
  base1 + offset + init2;
  base1 + offset + init3;
  base1 + offset + init4;
  base2 + offset + init5;

We may still need to reassociate the constant part (init) out of offset.
Simply grouping/reassociating DRs of the same memory object still has a
problem with the last one, since it shares only the offset with the others,
not the base (see the sketch at the end of this comment).  I will send
patches for both the vectorizer and IVOPTs once the next stage1 opens.

> A region-based CSE/DCE would still be my preference in the end.
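
To make the address pattern above concrete, here is a minimal C sketch - not
the actual PR68030 testcase, the function name and bounds are made up - of a
loop whose data references take the base + offset + init shape:

  /* Hypothetical sketch, not the PR68030 testcase.  The four loads of u[]
     give DRs whose addresses are base1 + offset + {init1..init4}, and the
     store to r[] gives base2 + offset + init5, where "offset" is the part
     that varies with i and "init" is a small constant displacement.  */
  void
  foo (float *restrict r, const float *restrict u, float a, float b, int n)
  {
    for (int i = 1; i < n - 2; i++)
      r[i] = a * (u[i - 1] + u[i + 1]) + b * (u[i] + u[i + 2]);
  }

Grouping only the u[] DRs handles the first four addresses; the r[] DR shares
just the offset part, not the base, which is why reassociating the constant
init out of offset is still needed before the IVs can be shared.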