https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111905
Bug ID: 111905
Summary: -O3 vectorization terribly pessimizes the code for an already unrolled loop
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kamkaz at windowslive dot com
Target Milestone: ---

https://godbolt.org/z/rK4nEWovc

With -O2, the generated code is what you'd expect: the "manually unrolled" version, which processes the data in chunks (relying on the assumption that w > 16), shows a performance benefit.

With -O3, however, auto-vectorization kicks in for the already unrolled loop and the results are abysmal: a lot of unnecessary checks and (what I assume to be) dead code.

Clang doesn't seem to have this problem.
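For readers without access to the Godbolt link, here is a minimal sketch of the kind of code the report describes. It is NOT the reporter's code: the function names, the element type, the "add a constant" operation, and the chunk size of 16 are all assumptions chosen to match the description (a plain loop versus a manually unrolled loop that relies on w > 16).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel (an assumption, not taken from the bug report):
// add `delta` to each of `w` bytes, wrapping modulo 256.

// Baseline: one element per iteration.
void add_delta_simple(std::uint8_t* dst, const std::uint8_t* src,
                      std::size_t w, std::uint8_t delta) {
    for (std::size_t i = 0; i < w; ++i)
        dst[i] = static_cast<std::uint8_t>(src[i] + delta);
}

// Manually unrolled: 16 elements per iteration, written out by hand.
// Intended for w > 16, so the chunked loop runs at least once.
void add_delta_unrolled(std::uint8_t* dst, const std::uint8_t* src,
                        std::size_t w, std::uint8_t delta) {
    auto f = [delta](std::uint8_t v) {
        return static_cast<std::uint8_t>(v + delta);
    };
    std::size_t i = 0;
    // Main loop: full chunks of 16.
    for (; i + 16 <= w; i += 16) {
        dst[i + 0]  = f(src[i + 0]);   dst[i + 1]  = f(src[i + 1]);
        dst[i + 2]  = f(src[i + 2]);   dst[i + 3]  = f(src[i + 3]);
        dst[i + 4]  = f(src[i + 4]);   dst[i + 5]  = f(src[i + 5]);
        dst[i + 6]  = f(src[i + 6]);   dst[i + 7]  = f(src[i + 7]);
        dst[i + 8]  = f(src[i + 8]);   dst[i + 9]  = f(src[i + 9]);
        dst[i + 10] = f(src[i + 10]);  dst[i + 11] = f(src[i + 11]);
        dst[i + 12] = f(src[i + 12]);  dst[i + 13] = f(src[i + 13]);
        dst[i + 14] = f(src[i + 14]);  dst[i + 15] = f(src[i + 15]);
    }
    // Tail: the remaining 0..15 elements.
    for (; i < w; ++i)
        dst[i] = f(src[i]);
}
```

Compiling a file like this with g++ -O2 -S and again with g++ -O3 -S, then comparing the assembly for the unrolled function, is one way to look for the kind of difference described. Adding -fno-tree-vectorize to the -O3 build may help confirm whether the vectorizer is responsible. Whether this particular sketch triggers the same code generation as the reporter's example is untested.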