https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                  |ASSIGNED
   Last reconfirmed|                             |2017-05-23
                 CC|                             |jakub at gcc dot gnu.org
          Component|target                       |tree-optimization
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                            |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Uh.  .optimized:

float sumfloat_omp(const float*) (const float * arr)
{
  unsigned long ivtmp.22;
  vector(8) float D__lsm0.19;
  const vector(8) float vect__23.18;
  const vector(8) float vect__4.16;
  float stmp_sum_19.12;
  vector(8) float vect__18.10;
  float D.2841[8];
  vector(8) float _10;
  void * _77;
  unsigned long _97;

  <bb 2> [1.00%]:
  arr_13 = arr_12(D);
  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];
  ivtmp.22_78 = (unsigned long) arr_13;
  _97 = ivtmp.22_78 + 4096;

...

  <bb 4> [1.00%]:
  MEM[(float *)&D.2841] = vect__23.18_58;
  vect__18.10_79 = MEM[(float *)&D.2841];
  stmp_sum_19.12_50 = [reduc_plus_expr] vect__18.10_79;
  return stmp_sum_19.12_50;

well, that explains it ;)  Coming from

  <bb 7> [99.00%]:
  # i_33 = PHI <i_25(8), 0(6)>
  # ivtmp_35 = PHI <ivtmp_28(8), 1024(6)>
  _21 = GOMP_SIMD_LANE (simduid.0_14(D));
  _1 = (long unsigned int) i_33;
  _2 = _1 * 4;
  _3 = arr_13 + _2;
  _4 = *_3;
  _22 = D.2841[_21];
  _23 = _4 + _22;
  D.2841[_21] = _23;
  i_25 = i_33 + 1;
  ivtmp_28 = ivtmp_35 - 1;
  if (ivtmp_28 != 0)
    goto <bb 8>; [98.99%]

So we perform the reduction in memory, and LIM then performs store-motion on
it, but the memset isn't inlined early enough to rewrite the decl into SSA
(CCP from GOMP_SIMD_VF is missing).  In DOM we have

  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];

and we do not fold that.  If OMP SIMD always zeros the vector then it could
also emit the maybe-easier-to-optimize

  WITH_SIZE_EXPR<_3, D.2841> = {};

Of course gimple_fold_builtin_memset should simply be improved to optimize a
now constant-size memset to "= {}".  I'll have a look.
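
For reference, the testcase is presumably of roughly the following shape.  This
is a minimal sketch reconstructed from the IL above (the function name, the
const float * argument, the 1024-iteration count and the +-reduction can be
read off the dump; the exact pragma and the flags, e.g. -Ofast -fopenmp-simd
-mavx, are assumptions, not taken from the bug report):

  /* Hypothetical reconstruction, not the reporter's testcase verbatim.  */
  float sumfloat_omp (const float *arr)
  {
    float sum = 0.0f;
  #pragma omp simd reduction(+:sum)
    for (int i = 0; i < 1024; i++)
      sum += arr[i];   /* lowered to a per-lane D.2841[GOMP_SIMD_LANE] += */
    return sum;
  }

If gimple_fold_builtin_memset learns to turn the constant-size memset into
"D.2841 = {};", then the following full-size load _10 = MEM[(float *)&D.2841]
should presumably fold to a constant zero vector as well, removing the memory
round-trip for the initial value of the store-motion temporary.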