https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438
--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Maxim Kuvyrkov from comment #9)
> I've looked into another case where inability to handle stores with gaps
> generates sub-optimal code.  I'm interested in spending some time on fixing
> this, provided some guidance in the vectorizer.
>
> Is it substantially more difficult to handle stores with gaps compared to
> loads with gaps?
>
> The following is [minimally] reduced from 462.libquantum:quantum_sigma_x(),
> which is #2 function in the 462.libquantum profile.  This cycle accounts for
> about 25% of total 462.libquantum time.
>
> ===
> struct node_struct
> {
>   float _Complex gap;
>   unsigned long long state;
> };
>
> struct reg_struct
> {
>   int size;
>   struct node_struct *node;
> };
>
> void
> func(int target, struct reg_struct *reg)
> {
>   int i;
>
>   for(i=0; i<reg->size; i++)
>     reg->node[i].state ^= ((unsigned long long) 1 << target);
> }
> ===
>
> This loop vectorizes into
>
>   <bb 5>:
>   # vectp.8_39 = PHI <vectp.8_40(6), vectp.9_38(4)>
>   vect_array.10 = LOAD_LANES (MEM[(long long unsigned int *)vectp.8_39]);
>   vect__5.11_41 = vect_array.10[0];
>   vect__5.12_42 = vect_array.10[1];
>   vect__7.13_44 = vect__5.11_41 ^ vect_cst__43;
>   _48 = BIT_FIELD_REF <vect__7.13_44, 64, 0>;
>   MEM[(long long unsigned int *)ivtmp_45] = _48;
>   ivtmp_50 = ivtmp_45 + 16;
>   _51 = BIT_FIELD_REF <vect__7.13_44, 64, 64>;
>   MEM[(long long unsigned int *)ivtmp_50] = _51;
>
> which then becomes for aarch64:
>
> .L4:
>         ld2     {v0.2d - v1.2d}, [x1]
>         add     w2, w2, 1
>         cmp     w2, w7
>         eor     v0.16b, v2.16b, v0.16b
>         umov    x4, v0.d[1]
>         st1     {v0.d}[0], [x1]
>         add     x1, x1, 32
>         str     x4, [x1, -16]
>         bcc     .L4

What I did for thunderx was create a vector cost model which caused this loop
not to be vectorized, to keep the regression from happening.  Note that this
might actually be better code for some microarchitectures.  I need to check
with the new processor we have in house, but that is next week or so.  I don't
know how much I can share next week though.