https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|increasing VF during SLP    |Inefficient codegen of
                    |vectorization permutes     |multistep zero-extends and
                    |unnecessarily               |LOAD_LANES

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> ends up
>
> .L4:
>         ld4     {v28.8h - v31.8h}, [x4], 64
>         add     x3, x3, 64
>         uaddl2  v26.4s, v28.8h, v29.8h
>         uaddl   v28.4s, v28.4h, v29.4h
>         uaddw2  v0.4s, v26.4s, v30.8h
>         uaddw   v28.4s, v28.4s, v30.4h
>         uaddw2  v0.4s, v0.4s, v31.8h
>         uaddw   v28.4s, v28.4s, v31.4h
>         sxtl    v1.2d, v0.2s
>         sxtl    v27.2d, v28.2s
>         sxtl2   v0.2d, v0.4s
>         sxtl2   v28.2d, v28.4s
>         scvtf   v1.2d, v1.2d
>         scvtf   v27.2d, v27.2d
>         scvtf   v0.2d, v0.2d
>         scvtf   v28.2d, v28.2d
>         stp     q1, q0, [x3, -32]
>         stp     q27, q28, [x3, -64]
>         cmp     x5, x4
>         bne     .L4
>
> we can now use widening plus and avoid the HI -> DF conversion penalty.

The conversion penalty is not the issue.  This case is still bad, since
you're sign-extending a zero extend, which is just a zero extend anyway.
But again, I'm not complaining about that here.

> It uses interleaving because there's no ld8 and when
> vect_lower_load_permutations decides heuristically to use load-lanes it
> tries to do so vector-size agnostic so it doesn't consider using two times
> ld4.
>
> There _are_ permutes because of the use of 4 lanes to compute the single
> lane store in the reduction operation.  The vectorization for the unrolled
> loop not using load-lanes shows them:
>
>   vect_a1_53.10_234 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_232];
>   vectp_x.8_235 = vectp_x.8_232 + 16;
>   vect_a1_53.11_236 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_235];
>   vectp_x.8_237 = vectp_x.8_232 + 32;
>   vect_a1_53.12_238 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_237];
>   vectp_x.8_239 = vectp_x.8_232 + 48;
>   vect_a1_53.13_240 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_239];
>   _254 = VEC_PERM_EXPR <vect_a1_53.10_234, vect_a1_53.11_236, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
>   _255 = VEC_PERM_EXPR <vect_a1_53.12_238, vect_a1_53.13_240, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
>   _286 = VEC_PERM_EXPR <_254, _255, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
> ...
>
> that's simply load-lanes open-coded.  If open-coding ld4 is better than
> using ld4 just make it not available to the vectorizer?  Similar to ld2 I
> suppose.

So I now realize that I missed a step in what LLVM is doing here.  You're
right that there is a permute here, but open-coding it is better in this
case because the zero extension from HI to DI can be done with a permute as
well, and that zero-extension permute can fold the load-lanes permute into
itself.  That seems to be the trick LLVM is using.

In my patch I added a new target hook,
targetm.vectorize.use_permute_for_promotion, that decides when to use a
permute vs. an unpack.

I think the right solution here is to tweak the LOAD_LANES heuristic to not
use load-lanes when the results feed into another permute, since that
permute can be optimized, reducing the total number of permutes needed.
Does that sound good to you?
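
For concreteness, a minimal C sketch of the kind of loop shape being
discussed (this is an illustrative guess at the pattern, not the PR's
actual testcase; the function name and signature are made up):

  /* Hypothetical reduced example: four interleaved unsigned-short lanes
     are summed, widened and converted to double each iteration.  The
     vectorizer can either use LD4 (LOAD_LANES) for the strided loads or
     open-code the de-interleave with VEC_PERM_EXPRs; the point above is
     that the HI -> DI zero-extension can itself be done as a permute,
     into which the open-coded de-interleave permute can be folded.  */
  void
  f (unsigned short *restrict x, double *restrict y, int n)
  {
    for (int i = 0; i < n; i++)
      y[i] = (double) (x[4 * i + 0] + x[4 * i + 1]
                       + x[4 * i + 2] + x[4 * i + 3]);
  }

Whether a loop like this ends up as ld4 followed by separate widening
steps, or as open-coded permutes that also perform the widening, is
exactly the heuristic decision discussed above.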