[Bug target/109797] 456.hmmer compiled with -O2 -flto regressed by 15% on AMD zen3 between r14-487-g6f18f344338b37 and r14-540-gb7fe38c14e5f1b

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 15 May 2023 00:28:38 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109797


--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Richard Biener from comment #4)
> > No, it's indeed plain -O2 with the default architecture level, thus SSE2
> > only.
> > 
> > For the case of "complex" expansions we might want to bite the bullet and
> > fix the GIGO vectorizer cost modeling in the target.  add_stmt_cost should
> > use ix86_multiplication_cost here passing down V2SImode.  There's cases
> > for various integer multiplication emulations but your patch didn't amend
> > this function for the new V2SImode emulation w/o SSE4.
> 
> Maybe the pattern should only be enabled for TARGET_SSE4_1, where it is
> implemented with native instruction. Using PMULUDQ, shuffling arguments and
> the result around looks like it will never be a win for V2SI, while it is
> worthwhile for V4SI, where the approach is taken from.

Yeah, in theory, if the vectorizer considered keeping some ops scalar,
it would be costed against two element extracts, two scalar multiplies,
and a build of a V2SI vector again.

Note even locally "bad" vectorizations can pay off if they are required
to vectorize a larger sequence.  But we are quite bad at estimating overall
latency (and resource utilization).  What also often happens is that
SSE op "badness" is marginal compared to the comparatively very high costs
of memory ops (I think their cost is overly high when from L1).  Esp. as
we consider the costs to be "latency" we probably should cost stores as
zero because we are not interested in when they complete.  IIRC we
consider a scalar store and a vector store to cost 12, so we "win" 12 by
eliding a single store.  In reality if it takes 1 cycle to "issue" the
store we should have them cost 1 so we get a slight win but not 12?
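
To make the store-cost argument concrete, here is a toy calculation (not
GCC code). Only the store cost of 12 is taken from the paragraph above,
and that figure is itself quoted from memory; the multiply costs (4 for
a scalar multiply, 16 for an emulated vector multiply), the two-lane
shape and the function name are made-up assumptions for illustration.

```python
# Toy model of the scalar-vs-vector comparison for a 2-lane sequence that
# ends in a store.  All numbers except the store cost of 12 are hypothetical.

def vectorization_gain(store_cost, scalar_op_cost, vector_op_cost, lanes=2):
    """Return scalar cost minus vector cost; positive means 'vectorize'."""
    # Scalar code: one op and one store per lane.
    scalar_total = lanes * (scalar_op_cost + store_cost)
    # Vector code: one (possibly emulated) vector op and one vector store.
    vector_total = vector_op_cost + store_cost
    return scalar_total - vector_total

# Stores costed at 12: eliding one store hides the expensive vector op.
gain_latency_stores = vectorization_gain(
    store_cost=12, scalar_op_cost=4, vector_op_cost=16)

# Stores costed at 1 (issue cost only): the expensive vector op dominates.
gain_issue_stores = vectorization_gain(
    store_cost=1, scalar_op_cost=4, vector_op_cost=16)

print(gain_latency_stores)  # 4  -> vectorization looks profitable
print(gain_issue_stores)    # -7 -> vectorization looks unprofitable
```

With these assumed numbers the decision flips purely because of how the
store is costed, which is the effect described above.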

The story is different for loads of course but even there 2 parallel
scalar loads vs. 1 vector load isn't 12 + 12 latency vs 12 but rather
12 + 1(?) vs 12.  But since we cost the scalar and vector IL separately
this is somewhat difficult to assess.  Maybe it would be easier to
compute a (local) "relative" cost, presenting related scalar and vector
stmts to the hook at the same time (or have the hook "guess" the scalar
ops - a good guess is usually straight-forward of course).
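
The load arithmetic and the "relative" cost idea can be sketched as
follows. This is a toy model, not the target hook: the function names
are hypothetical, the latency of 12 is the figure used above, and the
issue cost of 1 is the "1(?)" guess from the same paragraph.

```python
# Toy model of load costing.  LOAD_LATENCY is the figure used in the comment;
# ISSUE_COST is the guessed "1(?)" and is an assumption.
LOAD_LATENCY = 12
ISSUE_COST = 1

def serial_scalar_load_cost(n_loads):
    """Cost when every scalar load is charged its full latency."""
    return n_loads * LOAD_LATENCY

def parallel_scalar_load_cost(n_loads):
    """Cost when independent loads overlap: the first pays full latency,
    each further load only adds its issue cost."""
    return LOAD_LATENCY + (n_loads - 1) * ISSUE_COST

def relative_load_cost(n_scalar_loads):
    """Hypothetical 'relative' hook: it sees the scalar loads and the one
    vector load replacing them together and returns vector minus scalar
    cost (negative means the vector form is cheaper)."""
    vector_cost = LOAD_LATENCY
    return vector_cost - parallel_scalar_load_cost(n_scalar_loads)

print(serial_scalar_load_cost(2))    # 24, i.e. 12 + 12
print(parallel_scalar_load_cost(2))  # 13, i.e. 12 + 1
print(relative_load_cost(2))         # -1: the vector load saves 1, not 12
```

Costing both forms in one call is what lets the model account for the
overlap; with separate scalar and vector costing each scalar load is
charged independently.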

> The pattern was introduced as a partial fix for PR109690.
