https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291
--- Comment #3 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> It also looks like mips lacks implementation of any of the vectorizer cost
> hooks and thus defaults to default_builtin_vectorization_cost, which means
> that unaligned loads/stores have double cost.  And mips supports misaligned
> loads/stores via movmisalign (for MSA).  For daxpy:
>
>   for (i = 0; i < n; i++) {
>     dy[i] = dy[i] + da*dx[i];
>   }
>
> the above makes peeling for alignment of dy[] profitable (and I'd generally
> agree, because misaligned stores especially do have a real penalty - though
> likely not when the store queue is not contended, as is likely in this case).
>
> x86_64 peels for alignment as well and we get
>
> .L6:
>         movups  (%rax,%r8), %xmm1
>         addl    $1, %r9d
>         mulps   %xmm2, %xmm1
>         addps   (%r11,%r8), %xmm1
>         movaps  %xmm1, (%r11,%r8)
>         addq    $16, %r8
>         cmpl    %ebx, %r9d
>         jb      .L6
>
> and similar base+index addressing.  IVOPTs does see the indices are the same
> though.
>
>   # i_46 = PHI <i_36(7), 0(4)>
>   prolog_loop_adjusted_niters.6_48 = (sizetype) prolog_loop_niters.5_34;
>   niters.7_49 = niters.3_40 - prolog_loop_niters.5_34;
>   bnd.8_69 = niters.7_49 >> 2;
>   _75 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dy.12_74 = dy_15(D) + _75;
>   _80 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dx.15_79 = dx_16(D) + _80;
>   vect_cst__84 = {da_14(D), da_14(D), da_14(D), da_14(D)};
>   _88 = prolog_loop_adjusted_niters.6_48 * 4;
>   vectp_dy.20_87 = dy_15(D) + _88;
>
> shows the missed CSE from the vectorizer (and a redundant IV).

Do you mean the IV for the exit comparison?  Yes, iv_elimination could be
improved, especially since we know from the vectorization analysis that the
addresses cannot wrap.

> During DR analysis we can in theory keep a list of stmts that share the
> "same" DR (we have this for group reads already) and record the generated
> IVs on the "master" DR.

Grouping DRs is helpful, but not optimal.  For a case like PR68030, we have
starting addresses for the vector accesses as below:

  base1 + offset + init1;
  base1 + offset + init2;
  base1 + offset + init3;
  base1 + offset + init4;
  base2 + offset + init5;

We may still need to reassociate the constant part (init) out of offset.
Simply grouping/reassociating DRs of the same memory object still has a
problem with the last one, since it shares only the offset with the others,
not the base (see the sketch at the end of this comment).  I will send
patches for both the vectorizer and IVOPTs once the next stage1 opens.

> A region-based CSE/DCE would still be my preference in the end.
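
To make the address pattern above concrete, here is a minimal C sketch - not
the actual PR68030 testcase, the function name and bounds are made up - of a
loop whose data references take the base + offset + init shape:

  /* Hypothetical sketch, not the PR68030 testcase.  The four loads of u[]
     give DRs whose addresses are base1 + offset + {init1..init4}, and the
     store to r[] gives base2 + offset + init5, where "offset" is the part
     that varies with i and "init" is a small constant displacement.  */
  void
  foo (float *restrict r, const float *restrict u, float a, float b, int n)
  {
    for (int i = 1; i < n - 2; i++)
      r[i] = a * (u[i - 1] + u[i + 1]) + b * (u[i] + u[i + 2]);
  }

Grouping only the u[] DRs handles the first four addresses; the r[] DR shares
just the offset part, not the base, which is why reassociating the constant
init out of offset is still needed before the IVs can be shared.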