https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
            Summary|Potential regression on       |[11/12/13 Regression]
                   |vectorization of left shift   |Potential regression on
                   |with constants since          |vectorization of left shift
                   |r11-5160-g9fc9573f9a5e94      |with constants since
                   |                              |r11-5160-g9fc9573f9a5e94
           Priority|P3                            |P2
                 CC|                              |rguenth at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org
   Target Milestone|---                           |11.5
             Status|NEW                           |ASSIGNED

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e

We added an optab for the widening left shift pattern there; however, the
operation requires a uniform shift constant to work. See
https://godbolt.org/z/4hqKc69Ke

The existing pattern that deals with this is vect_recog_widen_shift_pattern,
which is a scalar pattern. During build_slp it validates that the constants are
the same, and when they're not it aborts SLP. This is why we lose
vectorization. Eventually we hit V4HI, for which we have no widening shift
optab, and it vectorizes using that low VF.

This example shows a number of things wrong:

1. The generic costing seems off. This sequence shouldn't have been generated:
as a vector sequence it's more inefficient than the scalar sequence. Using
-mcpu=neoverse-n1 or any other costing structure correctly gives only scalar
code.

2. vect_recog_widen_shift_pattern is implemented in the wrong place. It
predates the existence of the SLP pattern matcher. Because of the uniform
requirements it's better to use the SLP pattern matcher, where we have access
to all the constants to decide whether the pattern is a match or not. That way
we don't abort SLP. Are you ok with this as a fix, Richi?

3. The epilogue costing seems off.
This example https://godbolt.org/z/YoPcWv6Td ends up generating an
exceptionally high epilogue cost, and so the vectorizer thinks vectorization at
the higher VF is not profitable.

*src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
/app/example.c:16:12: note: Cost model analysis for part in loop 0:
  Vector cost: 23
  Scalar cost: 17

For some reason it thinks it needs a scalar epilogue? Using
-fno-vect-cost-model gives the desired codegen.
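
For readers without the godbolt links to hand, here is a minimal sketch of the
uniform vs. non-uniform shift constant distinction that the comment above says
the widening shift optab depends on. It is NOT the testcase from this PR: the
function names and the shift amounts are made up for illustration, and no claim
is made about what any particular GCC version generates for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not the PR testcase.
   Uniform case: every lane is widened from 16 to 32 bits and shifted
   by the same constant (2).  */
void widen_shift_uniform(uint32_t *restrict dst, const uint16_t *restrict src)
{
    dst[0] = (uint32_t)src[0] << 2;
    dst[1] = (uint32_t)src[1] << 2;
    dst[2] = (uint32_t)src[2] << 2;
    dst[3] = (uint32_t)src[3] << 2;
}

/* Hypothetical example, not the PR testcase.
   Non-uniform case: same widening, but each lane uses a different
   shift constant, so a pattern that requires one shared constant
   cannot cover the whole group of statements.  */
void widen_shift_mixed(uint32_t *restrict dst, const uint16_t *restrict src)
{
    dst[0] = (uint32_t)src[0] << 1;
    dst[1] = (uint32_t)src[1] << 2;
    dst[2] = (uint32_t)src[2] << 3;
    dst[3] = (uint32_t)src[3] << 4;
}
```

Comparing the vectorizer dumps (e.g. with -fdump-tree-slp-details) for the two
functions is one way to see whether SLP discovery treats them differently.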