https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-02-02
          Component|c                           |tree-optimization
             Blocks|                            |53947
            Summary|Poor vectorisation of       |Poor vectorisation of
                   |additive reduction of       |additive reduction of
                   |complex array               |complex array, final SLP
                   |                            |reduction step inefficient
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The reduction loop itself is fine; it is the final reduction step
involving the SLP reduction result (we reduce two scalars) that is handled
less than optimally:

  <bb 3> [96.97%]:
  # i_16 = PHI <i_11(4), 0(2)>
  # p$real_13 = PHI <_17(4), 1.0e+0(2)>
  # p$imag_14 = PHI <_18(4), 0.0(2)>
  # ivtmp_34 = PHI <ivtmp_33(4), 32(2)>
  _1 = (long unsigned int) i_16;
  _2 = _1 * 8;
  _3 = x_9(D) + _2;
  _7 = REALPART_EXPR <*_3>;
  _12 = IMAGPART_EXPR <*_3>;
  _17 = _7 + p$real_13;
  _18 = _12 + p$imag_14;
  i_11 = i_16 + 1;
  ivtmp_33 = ivtmp_34 - 1;
  if (ivtmp_33 != 0)
    goto <bb 4>; [96.88%]
  else
    goto <bb 5>; [3.12%]

  <bb 4> [93.94%]:
  goto <bb 3>; [100.00%]

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  p_10 = COMPLEX_EXPR <_36, _35>;

Here we simply try to first produce _36 and _35 from the vectorized reduction
result and then build the complex function result:

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  # vect__17.8_22 = PHI <vect__17.8_24(3)>
  stmp__17.9_21 = BIT_FIELD_REF <vect__17.8_22, 32, 0>;
  stmp__17.9_20 = BIT_FIELD_REF <vect__17.8_22, 32, 32>;
  stmp__17.9_19 = BIT_FIELD_REF <vect__17.8_22, 32, 64>;
  stmp__17.9_15 = BIT_FIELD_REF <vect__17.8_22, 32, 96>;
  stmp__17.9_6 = BIT_FIELD_REF <vect__17.8_22, 32, 128>;
  stmp__17.9_5 = BIT_FIELD_REF <vect__17.8_22, 32, 160>;
  stmp__17.9_4 = BIT_FIELD_REF <vect__17.8_22, 32, 192>;
  stmp__17.9_29 = BIT_FIELD_REF <vect__17.8_22, 32, 224>;
  stmp__17.9_28 = stmp__17.9_21 + stmp__17.9_19;
  stmp__17.9_27 = stmp__17.9_20 + stmp__17.9_15;
  stmp__17.9_26 = stmp__17.9_28 + stmp__17.9_6;
  stmp__17.9_37 = stmp__17.9_27 + stmp__17.9_5;
  stmp__17.9_38 = stmp__17.9_26 + stmp__17.9_4;
  stmp__17.9_39 = stmp__17.9_37 + stmp__17.9_29;
  p_10 = COMPLEX_EXPR <stmp__17.9_38, stmp__17.9_39>;
  return p_10;

This doesn't take advantage of the fact that we can do this kind of final SLP
reduction more efficiently (I didn't try to decipher exactly what ICC does
here).  It may require ABI details or knowing that we can type-pun a vector
to a complex... (but only for complex float; for complex double the ABI
doesn't work out this way!)


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
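
For illustration only, the final reduction step from comment #1 can be
modelled in plain C.  This is a sketch under stated assumptions, not the
reporter's test case and not a proposed GCC change.  sum32_ref is a guess at
the source, reconstructed from the GIMPLE above (32 iterations, 8-byte
elements, accumulator starting at 1.0 + 0.0i).  All function names are
hypothetical.  reduce_scalar mirrors the BIT_FIELD_REF epilogue in the dump.
reduce_halving shows one generic alternative (fold the upper half onto the
lower half twice); whether ICC or a future GCC does exactly this is not
established here.  The two epilogues associate the additions differently, so
they are only interchangeable where reassociation is permitted, which the
vectorized float reduction already presupposes.

```c
#include <assert.h>
#include <complex.h>

#define LANES 8 /* 256-bit vector of 32-bit floats, as in the dump */

/* Assumed scalar source, reconstructed from the GIMPLE (hypothetical). */
float complex sum32_ref(const float complex *x)
{
    float complex p = 1.0f;
    for (int i = 0; i < 32; i++)
        p += x[i];
    return p;
}

/* Model of the vectorized loop body: four complex floats (eight lanes) per
   step.  Even lanes collect real parts, odd lanes imaginary parts.  A
   complex float has the same layout as float[2], so the cast is valid. */
void vector_loop(const float complex *x, float acc[LANES])
{
    for (int l = 0; l < LANES; l++)
        acc[l] = 0.0f;
    acc[0] = 1.0f; /* initial value 1.0 + 0.0i lives in lanes 0 and 1 */
    for (int i = 0; i < 32; i += 4) {
        const float *lanes = (const float *)(x + i);
        for (int l = 0; l < LANES; l++)
            acc[l] += lanes[l];
    }
}

/* Epilogue as in the dump: extract all eight lanes, then two scalar chains. */
float complex reduce_scalar(const float v[LANES])
{
    float re = v[0] + v[2];
    float im = v[1] + v[3];
    re += v[4];
    im += v[5];
    re += v[6];
    im += v[7];
    return re + im * I;
}

/* Alternative epilogue: two whole-vector folds leave {re, im} in the two
   lowest lanes, which is exactly the layout of a complex float. */
float complex reduce_halving(const float v[LANES])
{
    float h[4], q[2];
    for (int i = 0; i < 4; i++)
        h[i] = v[i] + v[i + 4]; /* upper 128 bits onto lower 128 bits */
    for (int i = 0; i < 2; i++)
        q[i] = h[i] + h[i + 2]; /* upper 64 bits onto lower 64 bits */
    return q[0] + q[1] * I;
}

/* Small self-check on integer-valued data, where float addition is exact. */
void self_check(void)
{
    float complex x[32];
    float acc[LANES];
    for (int i = 0; i < 32; i++)
        x[i] = (float)i + (float)(2 * i) * I;
    vector_loop(x, acc);
    assert(reduce_scalar(acc) == sum32_ref(x));
    assert(reduce_halving(acc) == sum32_ref(x));
}
```

The halving form needs two vector additions plus shuffles instead of eight
lane extracts and six scalar additions, and its result already sits in the
low 64 bits of the vector, which is the type-punning opportunity the comment
alludes to for complex float.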