> > We have multiple existing transformations that optimize SSE builtins
> > into different instructions when doing so is a win (we run the full
> > RTL optimization queue on them and do the usual instruction combining,
> > simplification and splitting). So I would say that we are OK with
> > changing the builtins into different instructions. After all, there
> > are asm statements if one really wants the precise instruction choice.
> >
> > With FMA however the situation is different because there are rounding
> > differences. Why can we convert multiplication+add into FMA without
> > -ffast-math in the first place?
>
> We do with -ffp-contract=fast which is the default for C.

Just looked it up. Fun :)
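For reference, a minimal example of what gets contracted (a made-up
testcase, not from the patch): with the C default of -ffp-contract=fast,
plain -O2 -mfma already turns the multiply+add below into a single
vfmadd instruction even without -ffast-math, while -ffp-contract=off
keeps the separate multiply and add.

  /* gcc -O2 -mfma -S contract.c */
  double
  mul_add (double a, double b, double c)
  {
    return a * b + c;  /* contracted to an FMA under -ffp-contract=fast */
  }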
> > An alternative would be to prevent the conversion in tree-ssa-mathops
> > (i.e. matching the accumulation pattern and having some target hook
> > specifying whether this is a good idea)?
> >
> > This looks like a useful optimization in general - I was just looking
> > into a similar loop from swim of spec2k.
>
> As for the implementation, this really feels like something the
> scheduler should do - split an instruction into two given the costs
> provided by the DFA? Maybe that doesn't integrate well with the usual
> "ready queue" design?

This is not the first time that instruction choice in general depends on
scheduling, or at least on the critical dependency chain (a sketch of the
FMA case is at the end of this mail). Another example is the decision
whether to use imul, which has, say, an extra cycle of latency, or a
discrete sequence of, say, 7 instructions which consumes a lot more
decoder bandwidth. In Pentium 4 times, when the CPU did two adds per
cycle, this was a big deal. I have never seen a very general way to
integrate this type of tradeoff into scheduling.

Moreover, our scheduling models for out-of-order CPUs like Zen are very
approximate. Any loop which has an FP load in it needs to cover 7 cycles
of latency, which translates to 7*4 instructions that could possibly
execute in between, so in our scheduler the pipeline is almost always
under very light load. So I would be a bit skeptical that tying this
heuristic into the scheduler will bring noticeable improvements.

LLVM seems to have some kind of model for the reorder buffer. It would
be interesting to see if it helps. Extending the DFA interface for that
is definitely doable.

> It seems to help only when there's very light load on the pipeline and
> thus instead of throughput the important metric is latency because of
> dependences (there's nothing to schedule in between).

Our cost models almost consistently optimize for latency and not for
throughput. I was wondering a bit how to account for throughput here,
but that is yet another issue.

Honza
>
> Richard.
>
> > Honza
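PS: to make the accumulation pattern above concrete, here is a purely
illustrative sketch (made-up testcase, not from the patch). In a
reduction like this the accumulator is a loop-carried dependence, so
once the multiply+add is contracted, every iteration waits for the full
FMA latency; if the FMA is split back, only the add stays on the
critical chain and the independent multiplies can overlap with it:

  /* With -ffp-contract=fast the multiply+add is contracted to an FMA,
     putting latency(fma) on the loop-carried chain through sum.  Split
     into a separate multiply and add, only latency(add) remains on the
     chain, and the multiplies of later iterations can run in parallel
     with it.  */
  double
  dot (const double *a, const double *b, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }

Whether splitting wins then depends on the difference between the FMA
and add latencies on the particular core, which is exactly the kind of
information the target hook (or the DFA) would have to provide.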