[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-26 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #12 from N Schaeffer --- I found the "offending" option, and it seems to be indeed a cost-model problem as Andrew Pinski said: good code is generated by: gcc -O2 -ftree-vectorize -march=skylake (since gcc 6.1) gcc -O1 -ftre

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #10 from N Schaeffer --- Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal code using vbroadcastsd with the following options: -O2 -march=skylake -ftree-vectorize
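
For readers without the bug report at hand, the kind of loop the title describes ("arrays of different multiplicity") might look roughly like the sketch below. This is not the test case attached to the report: the function name scale_blocks, the block size of 4, and the argument layout are assumptions chosen only for illustration. Each element of b scales four consecutive elements of a, which is the sort of access pattern where broadcasting one double across a vector register (as vbroadcastsd does) would be a natural fit.

```c
#include <stddef.h>

/* Hypothetical kernel, NOT the code from the bug report.
 * a holds 4*n doubles, b holds n doubles; each b[i] scales the
 * block a[4*i] .. a[4*i+3], so the two arrays are traversed at
 * different rates. */
void scale_blocks(double *restrict a, const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Inner loop over one block of 4 doubles (one 256-bit vector). */
        for (size_t j = 0; j < 4; j++) {
            a[4 * i + j] *= b[i];
        }
    }
}
```

To compare code generation, such a file could be compiled to assembly with the options quoted in the comment (-O2 -march=skylake -ftree-vectorize) and again with -O3 -march=skylake. Whether this particular sketch reproduces the reported behaviour has not been checked here.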

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #9 from N Schaeffer --- In addition, optimizing for size with -Os leads to a non-vectorized double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced by clang -Os) leads to 40 bytes. It is thus also a missed optimi

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #6 from N Schaeffer --- indeed, aarch64 assembly looks very good.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #4 from N Schaeffer --- ... and thank you for your quick reply!

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #3 from N Schaeffer --- I have not benchmarked. For 4 vmulpd doing the actual work, there are more than 40 permute/mov instructions, among which 24 vpermd instructions which have a 3 cycle latency. That is 6 vpermd per vmulpd. There

[Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 Bug ID: 114107 Summary: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 Product: gcc Version: 13.2.0 Status: UNCONFIRMED

[Bug tree-optimization/98563] [10/11 Regression] vectorization fails while it worked on gcc 9 and earlier

2021-01-07 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563 --- Comment #3 from N Schaeffer --- I'd like to add that when you say "vectorization of the basic block", the code generated is actually worse than non-vectorized naive code: it handles all loads and arithmetic operations in scalar mode (v*sd ins

[Bug tree-optimization/98563] regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563 --- Comment #1 from N Schaeffer --- I just found the -mprefer-vector-width=512 option, which forces the use of zmm registers. The reported regression, however, remains.

[Bug tree-optimization/98563] New: regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563 Bug ID: 98563 Summary: regression: vectorization fails while it worked on gcc 9 and earlier Product: gcc Version: 10.1.0 Status: UNCONFIRMED Severity: normal