https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102404
Bug ID: 102404
Summary: Loop vectorized with 32 byte vectors actually uses 16 byte vectors
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: freddie at witherden dot org
Target Milestone: ---

Created attachment 51480
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51480&action=edit
Test case

Consider the loop on L11 of the attached file. Compiling as:

❯ gcc -march=tigerlake -Ofast -mprefer-vector-width=512 -S -fopenmp test.c -fopt-info
test.c:25:37: optimized: loop vectorized using 32 byte vectors
test.c:4:6: optimized: loop turned into non-loop; it never loops

which notes that (as requested) the loop has been vectorized using 32-byte (zmm) vectors. Inspecting the resulting assembly (also attached), we observe that it has actually been unrolled by a factor of two and then vectorized using 16-byte (ymm) vectors.

As a point of comparison, recent versions of Clang use 32-byte vectors for this loop, resulting in code which is half the size.
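
One rough way to cross-check the -fopt-info message against the generated assembly is to count how often each vector register class appears in the output of -S. The sketch below is not part of the attached test case: the helper name count_vector_registers is hypothetical, and it assumes the AT&T-syntax register names (%xmm, %ymm, %zmm) that gcc emits by default on x86-64.

```python
import re
from collections import Counter

# x86-64 vector register classes and their widths in bytes.
WIDTH_BYTES = {"xmm": 16, "ymm": 32, "zmm": 64}


def count_vector_registers(asm_text):
    """Count references to %xmmN, %ymmN and %zmmN in AT&T-syntax assembly.

    Hypothetical helper for illustration; it counts operand occurrences,
    not instructions, so one instruction can contribute several hits.
    """
    counts = Counter({name: 0 for name in WIDTH_BYTES})
    for match in re.finditer(r"%([xyz]mm)\d+", asm_text):
        counts[match.group(1)] += 1
    return counts
```

Assuming the assembly was written to test.s (the default output name for `gcc -S test.c`), calling count_vector_registers(open("test.s").read()) gives a per-class tally. Scalar floating-point code also uses xmm registers, so a non-zero xmm count alone does not show that the loop body was vectorized at 16 bytes.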