> > We have multiple existing transformations that optimize SSE builtins
> > into different instructions when doing so is a win (we run the full
> > RTL optimization queue on them and do the usual instruction combining,
> > simplification and splitting). So I would say that we are OK with
> > changing the builtins into different instructions. After all, there
> > are asm statements if one really wants the precise instruction choice.
> >
> > With FMA however the situation is different because there are rounding
> > differences. Why can we convert multiplication+add into FMA without
> > -ffast-math in the first place?
>
> We do with -ffp-contract=fast which is the default for C.

Just looked it up. Fun :)
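For reference, a minimal example of what gets contracted (a made-up
testcase, not from the patch): with the C default of -ffp-contract=fast,
plain -O2 -mfma already turns the multiply+add below into a single
vfmadd instruction even without -ffast-math, while -ffp-contract=off
keeps the separate multiply and add.

  /* gcc -O2 -mfma -S contract.c */
  double
  mul_add (double a, double b, double c)
  {
    return a * b + c;  /* contracted to an FMA under -ffp-contract=fast */
  }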
> > An alternative would be to prevent the conversion in tree-ssa-mathops
> > (i.e. matching the accumulation pattern and having some target hook
> > specifying whether this is a good idea)?
> >
> > This looks like a useful optimization in general - I was just looking
> > into a similar loop from swim of spec2k.
>
> As for the implementation, this really feels like something the
> scheduler should do - split an instruction into two given the costs
> provided by the DFA? Maybe that doesn't integrate well with the usual
> "ready queue" design?

This is not the first time that instruction choice in general depends on
scheduling, or at least on the critical dependency chain (a sketch of the
FMA case is at the end of this mail). Another example is the decision
whether to use imul, which has, say, an extra cycle of latency, or a
discrete sequence of, say, 7 instructions which consumes a lot more
decoder bandwidth. In Pentium 4 times, when the CPU did two adds per
cycle, this was a big deal. I have never seen a very general way to
integrate this type of tradeoff into scheduling.

Moreover, our scheduling models for out-of-order CPUs like Zen are very
approximate. Any loop which has an FP load in it needs to cover 7 cycles
of latency, which translates to 7*4 instructions that could possibly
execute in between, so in our scheduler the pipeline is almost always
under very light load. So I would be a bit skeptical that tying this
heuristic into the scheduler will bring noticeable improvements.

LLVM seems to have some kind of model for the reorder buffer. It would
be interesting to see if it helps. Extending the DFA interface for that
is definitely doable.

> It seems to help only when there's very light load on the pipeline and
> thus instead of throughput the important metric is latency because of
> dependences (there's nothing to schedule in between).

Our cost models almost consistently optimize for latency and not for
throughput. I was wondering a bit how to account for throughput here,
but that is yet another issue.

Honza
>
> Richard.
>
> > Honza
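PS: to make the accumulation pattern above concrete, here is a purely
illustrative sketch (made-up testcase, not from the patch). In a
reduction like this the accumulator is a loop-carried dependence, so
once the multiply+add is contracted, every iteration waits for the full
FMA latency; if the FMA is split back, only the add stays on the
critical chain and the independent multiplies can overlap with it:

  /* With -ffp-contract=fast the multiply+add is contracted to an FMA,
     putting latency(fma) on the loop-carried chain through sum.  Split
     into a separate multiply and add, only latency(add) remains on the
     chain, and the multiplies of later iterations can run in parallel
     with it.  */
  double
  dot (const double *a, const double *b, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }

Whether splitting wins then depends on the difference between the FMA
and add latencies on the particular core, which is exactly the kind of
information the target hook (or the DFA) would have to provide.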