https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So when we have BB vectorization and costs like

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

we decide it's worthwhile to vectorize.  But that does not account for the
fact that the scalar code represents multiple vector lanes that can execute
in parallel, while the vector code, by combining the lanes, makes them data
dependent.  So considering a VF of at least 2, the scalar code could in the
ideal case run with a latency of 836/2 = 418 while the vector code would
have a latency of 776.  Of course the cost number we compute doesn't really
model overall latency; I will see if we can improve that for GCC 13.

What I'm after is that it's maybe not a good idea to compare scalar vs.
vector cost for BB vectorization in the way we currently do.  Without any
information on the dependences between the individually costed stmts it's
hard to do better, of course.  Still, the target could try simple
adjustments based on available issue width and execution resources, for
example assuming that two scalar ops can always issue and execute in
parallel and thus dividing the scalar cost by two (for BB vectorization
only, or for scalar stmts participating in SLP in the case of loop
vectorization; that information is not readily available, though).

There's always the possibility of special-casing more constrained resources
(integer/fp dividers, stores or shifts), but that already gets a bit
complicated because we cost scalar stmts independently, so the code doesn't
know how many of them will be combined into a single vector lane.  The
alternative is to do the biasing on the vector side, but there you do not
know the original scalar operations involved (consider patterns).  So it
might not be easy to adjust the current heuristics in a good way.
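To make the arithmetic concrete, here is a minimal self-contained C++ sketch
of the issue-width adjustment suggested above.  This is not actual GCC code:
the names cost_pair, bb_vect_profitable_p and issue_width are made up for
illustration, and the heuristic simply divides the summed scalar cost by an
assumed issue width before comparing against the vector cost.

/* Hypothetical sketch, not GCC internals.  Models the idea that
   independent scalar stmts can issue in parallel, so the scalar
   cost sum overstates the scalar code's latency.  */

#include <cstdio>

struct cost_pair
{
  unsigned scalar_cost;
  unsigned vector_cost;
};

/* Return true if BB vectorization looks profitable.  With
   issue_width == 1 this is the current raw comparison; a larger
   issue_width divides the scalar cost to approximate scalar lanes
   executing in parallel.  */
static bool
bb_vect_profitable_p (cost_pair c, unsigned issue_width)
{
  unsigned adjusted_scalar = c.scalar_cost / issue_width;
  return c.vector_cost < adjusted_scalar;
}

int
main ()
{
  cost_pair c = { 836, 776 };   /* Numbers from the dump above.  */
  printf ("raw comparison:      vectorize = %d\n",
	  bb_vect_profitable_p (c, 1));
  printf ("issue width of two:  vectorize = %d\n",
	  bb_vect_profitable_p (c, 2));
  return 0;
}

With the raw comparison the example vectorizes (776 < 836); with an assumed
issue width of 2 the adjusted scalar cost becomes 418 and vectorization is
rejected, matching the latency argument made above.  Whether a uniform
divisor is a good model is exactly what the constrained-resource caveat
(dividers, stores, shifts) calls into question.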