https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #12 from N Schaeffer ---
I found the "offending" option, and it does indeed seem to be a cost-model
problem, as Andrew Pinski said.
Good code is generated by:
gcc -O2 -ftree-vectorize -march=skylake (since gcc 6.1)
gcc -O1 -ftre
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #10 from N Schaeffer ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal
code using vbroadcastsd with the following options:
-O2 -march=skylake -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #9 from N Schaeffer ---
In addition, optimizing for size with -Os yields a non-vectorized double loop
(51 bytes), while the vectorized loop with vbroadcastsd (produced by clang -Os)
takes only 40 bytes.
It is thus also a missed optimi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #6 from N Schaeffer ---
Indeed, the aarch64 assembly looks very good.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #4 from N Schaeffer ---
... and thank you for your quick reply!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #3 from N Schaeffer ---
I have not benchmarked.
For the 4 vmulpd instructions doing the actual work, there are more than 40
permute/mov instructions, including 24 vpermd instructions, each with a 3-cycle
latency. That is 6 vpermd per vmulpd.
There
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Bug ID: 114107
Summary: poor vectorization at -O3 when dealing with arrays of
different multiplicity, good with -O2
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563
--- Comment #3 from N Schaeffer ---
I'd like to add that when you say "vectorization of the basic block", the code
generated is actually worse than non-vectorized naive code: it handles all
loads and arithmetic operations in scalar mode (v*sd ins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563
--- Comment #1 from N Schaeffer ---
I just found the -mprefer-vector-width=512 option, which forces the use of zmm registers.
The reported regression however remains.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563
Bug ID: 98563
Summary: regression: vectorization fails while it worked on gcc
9 and earlier
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
10 matches