[Bug target/119702] PPCLE: Inefficient auto-vectorization for 64-bit shifts on Power9

avinashd at linux dot ibm.com via Gcc-bugs Tue, 05 Aug 2025 02:06:24 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119702


--- Comment #14 from Avinash Jayakar <avinashd at linux dot ibm.com> ---
(In reply to Surya Kumari Jangala from comment #12)
> Ok. We also need to tackle the original issue, which is that a shift left
> can be optimized by generating a vector add. Perhaps tackle this issue first?

I looked further into how the vector multiply is lowered to shifts. This happens
in the "tree vectorization slp" pass, which transforms the gimple into
vectorized form.
The logic for handling these generic patterns is written as pattern
recognition functions. The specific function here, "vect_synth_mult_by_constant",
does the same thing as "expand_mult_const" in the rtl expand pass, but on
gimple trees.
If I disable this multiply pattern during vectorization, then the expand pass
converts the mult to a shift instruction.
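
For readers following along, here is a minimal sketch of the kind of loop
being discussed. It is not the test case from this bug: the function names and
the constant 2 are hypothetical, chosen only to show the three equivalent
forms (multiply, shift, add) that the vectorizer and expand choose between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: multiply every 64-bit element by 2. */
void mul_by_2(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* The shift form that a multiply by a power of two can be synthesized into. */
void shl_by_1(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] << 1;
}

/* The add form: for unsigned values, x << 1 == x + x (modulo 2^64). */
void add_self(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}

/* Small self-check that all three forms produce the same results. */
void check_forms_agree(void)
{
    const uint64_t src[3] = {0, 7, UINT64_MAX};
    uint64_t a[3], b[3], c[3];

    mul_by_2(a, src, 3);
    shl_by_1(b, src, 3);
    add_self(c, src, 3);
    for (size_t i = 0; i < 3; i++)
        assert(a[i] == b[i] && b[i] == c[i]);
}
```

Comparing the vectorized output for these three functions (for example with
-O3 -mcpu=power9 -S) is one way to see which form each pass ends up choosing.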

If we want to convert the mult to an add in a machine-dependent pass, without
changing the machine-independent gimple and rtl passes, then the only way I see
is to combine the two instructions (splat and shift) into one add.
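
The equivalence such a combine would rely on can be sketched with the GCC/Clang
generic vector extension. This is only an illustration at the C level, not the
rtl pattern itself; the type name v2du and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Generic vector extension: two 64-bit lanes in a 16-byte vector. */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* Two-step form: materialize a splat of 1, then shift each lane by it. */
static v2du shift_by_splat(v2du x)
{
    const v2du ones = {1, 1};   /* the "splat" operand */
    return x << ones;           /* the per-lane shift */
}

/* One-step form: a single lane-wise add of the vector to itself. */
static v2du vec_add_self(v2du x)
{
    return x + x;
}

/* Returns 1 if both forms give the same result in both lanes. */
static int lanes_agree(uint64_t lo, uint64_t hi)
{
    v2du x = {lo, hi};
    v2du s = shift_by_splat(x);
    v2du a = vec_add_self(x);
    return s[0] == a[0] && s[1] == a[1];
}

/* Small self-check, including a lane whose top bit is shifted out. */
static void check_lanes(void)
{
    assert(lanes_agree(0, 1));
    assert(lanes_agree(UINT64_MAX, 0x8000000000000000ULL));
}
```

Since the lanes are unsigned, both forms wrap modulo 2^64, so replacing the
splat and shift pair with one add does not change the result.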
@Segher/@Surya, do you have any other suggestions?

(In reply to Segher Boessenkool from comment #13)
> mults.  And in most cases additions are faster than shifts (or you can do
> more
> of them concurrently or similar), so in many cases they are preferred, but
> that
> is not so super obvious already.  You might be able to do four adds
> concurrently,
> but you might be able to do two shifts concurrently additionally, so it all

Here concurrency means instruction-level parallelism, right? For example,
whether shifts and adds can execute concurrently depends on the number of
functional units in the processor? Just wanted to be on the same page.
