On 01/27/2017 07:19 AM, Segher Boessenkool wrote:
On Fri, Jan 27, 2017 at 02:30:49PM +0100, Richard Biener wrote:
Ok, maybe with -fno-trapping-math we don't consider that case but even
then generating
a NaN is usually dreadfully slow so avoiding speculation of such insns
looks good in
any case (w/o considering its cost).
And -ffast-math includes -ffinite-math-only. No, the testcase never
takes the square root of a number smaller than zero; it isn't *that* slow ;-)
Well, the testcase as written doesn't but if you speculate the sqrt it might?
Yeah true. Except we have -ffast-math so we told the compiler that is
just fine to do.
Things slow down so much because there is a loop immediately followed
by a square root insn, and sched-rgn decides it is a good idea to move
it to inside the loop. Which is a bad idea no matter what the frequency
of the loop is because 1) we do not get such profiles very correct, and
2) sqrt is really expensive.
I understood that, but then moving something inside a loop is almost never a win.
It defaults to moving something if it has space for it in the schedule
and it is executed at least 40% of the time (I think).
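
As a rough illustration of that default (a simplified sketch, not the
actual sched-rgn code; the function name and the two inputs are
hypothetical, and the 40 is the "I think" figure from above):

```c
#include <stdbool.h>

/* Assumed threshold, taken from the "40% (I think)" figure above.  */
#define ASSUMED_MIN_SPEC_PROB 40

/* Hypothetical simplification of the decision: speculate when there is
   room in the schedule and the insn is likely enough to be executed.
   Note that the real cost of the insn is not an input at all.  */
static bool
worth_speculating_p (int exec_prob_percent, bool has_free_slot)
{
  return has_free_slot && exec_prob_percent >= ASSUMED_MIN_SPEC_PROB;
}
```

With a test like this, a sqrt that follows a loop passes as easily as an
add would, as long as the schedule appears to have a free slot.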
Can't "not modeled" insns be marked somehow in the pipeline description?
Well, the only thing from the pipeline description that is used here is
the insn latency, which isn't all that much higher than "normal" FP insns.
And simply "not described properly" won't do much good -- if we could
(without blowing up the automata) we would, and sched-rgn would then
still speculate this.
And I think this is the core of the issue. We have multiple ports that
don't necessarily fully describe the latency, issue rates, etc. of
certain insns like div/sqrt/rsqrt. There are good reasons for doing that.
Because of the partial description, the scheduler may think those insns
fit into a pipeline bubble within the loop, when in reality they do not.
The scheduler currently has no way of knowing what insns have this
property. While there are cases where we'd like to speculate a div or
sqrt to give it more time to complete without stalls -- there's no good
way to do that without fully describing them to the scheduler.
My preference would be either to somehow mark those insns as not fully
modeled and avoid speculating them, or to invent a target hook that
allows the scheduler to query the backend.
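
To make the hook idea concrete, here is a minimal standalone sketch.
Nothing in it is an existing GCC interface: the enum, the hook type and
all function names are hypothetical, and the 40% threshold is only the
assumed figure mentioned earlier in the thread.

```c
#include <stdbool.h>

/* Hypothetical instruction classes; a real port would inspect the insn.  */
enum insn_kind { INSN_ADD, INSN_FMUL, INSN_DIV, INSN_SQRT, INSN_RSQRT };

/* Hypothetical hook: does the pipeline description fully model KIND?  */
typedef bool (*fully_modeled_hook) (enum insn_kind kind);

/* Default for ports that describe everything: always true.  */
static bool
default_fully_modeled_p (enum insn_kind kind)
{
  (void) kind;
  return true;
}

/* Example port that only partially describes div/sqrt/rsqrt.  */
static bool
example_port_fully_modeled_p (enum insn_kind kind)
{
  switch (kind)
    {
    case INSN_DIV:
    case INSN_SQRT:
    case INSN_RSQRT:
      return false;
    default:
      return true;
    }
}

/* Scheduler side: never speculate an insn the backend says is not
   fully modeled; otherwise fall back to the usual test (room in the
   schedule, assumed 40% minimum execution probability).  */
static bool
may_speculate_p (fully_modeled_hook hook, enum insn_kind kind,
                 int exec_prob_percent, bool has_free_slot)
{
  if (!hook (kind))
    return false;
  return has_free_slot && exec_prob_percent >= 40;
}
```

The same query could in principle be made by other passes that
speculate, which is why a hook may be more useful than a marker that
only the scheduler sees.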
Note that these could be used elsewhere -- for example delay slot
scheduling and predication. Delay slot scheduling does speculation and
there are ports that simply refuse to allow certain instructions (div/sqrt
on one port, I think all FP stuff on another) to avoid these kinds of
problems.
Similarly nullification/predication often work by wiping out the final
posting of results into the register file. So imagine a non-pipelined
div/sqrt. Predicating a div/sqrt instruction will actually keep the
pipeline busy computing results that will be thrown away, preventing
other useful work from occurring. And, yes, this really does happen.
The PA suffered from these problems.
jeff