On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
https://bugzilla.gdcproject.org/show_bug.cgi?id=288

--- Comment #1 from Iain Buclaw <ibuc...@gdcproject.org> ---
See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(int)).

Well, you'd never use -O3 if you care about speed anyway. :-)

And they are not useless jumps; it's just the foreach loop unrolled in its entirety. You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.

https://explore.dgnu.org/g/vD3N4Y

It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
to give you more control.

https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html

I wouldn't hold my breath though (this is not strictly a bug).

Agreed. It is possibly not a bug: I don't see that the code is dysfunctional, though I haven't looked through it all. But since the backend is doing optimisation here with the unrolling, producing something sub-optimal for this admittedly weird code is imho a bug, in the sense that the goal of _optimisation_ is not attained.

And I do understand this is nothing to do with D, and I understand that this is unrolling.

But notice that the targets of the jumps are all the same location, and the sequence finishes off with an unconditional jump to that same location.


Not quite; if you look a little closer, some jump to other branches hidden in between.

I feel this is just a quirk of unrolling, in part, but that's not all: the jumps don't make sense.

    cmp #n / jxx L3
    cmp #m / jxx L3
    jmp L3

is what we have, so it all basically does absolutely nothing. Unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is meant as an incredibly bad way of testing ( x >= 1 && x <= 4 ), but with 30-odd tests it isn't very funny.


If you compile with -fdump-tree-optimized=stdout you will see that it's the middle-end that has lowered the code to a series of if jumps.

The backend consumer doesn't really have any chance for improving it.

I know this is merely debug-only code, but I am wondering what else might happen if you are misguided enough to use the crazy -O3 with unrolled loops that have conditionals in them.

My other complaint about GCC back-end's code generation is that it (sometimes) doesn't go for jump-less movcc-style operations when it can. For x86/x64, LDC sometimes generates jump-free code using conditional instructions where GCC does not.

I can fix the problem with GDC by using a single & instead of &&, which happens to be legal here. Sometimes in the past I have needed to take steps to make sure I can do such an operator-substitution trick in order to get jump-free, far faster code, in cases where the alternatives are extremely short (and side-effect-free) and branch-prediction failure is a certainty.


You could also try compiling with -O2. I couldn't really see this in your given example, but honestly, if you want to optimize really aggressively you must be willing to coax the compiler in strange ways anyway.

I don't know if there are ways in which the backend could try to ascertain whether the results of a given unrolling are really bad. In some cases they could be bad because the code is too long, causing problems with code-cache size, or because it won't fit into a loop buffer. A highly CPU-specific check against the generated code size would need to be carried out, at least in all cases where a loop remains (as opposed to full unrolling of a known trip count), since every kind of AMD and Intel processor is different, as Agner Fog warns us. Here, though, I didn't even explicitly ask for unrolling, so you might harshly say that it is the compiler's job to work out whether it is actually an anti-optimisation, regardless of the possible reasons why the result may be bad news, never mind just based on total generated code size not fitting into some per-CPU limit.


Well again, from past experience -O3 doesn't really care that much about code size or cache lines. All optimization passes which lower the code this way do so during SSA transformations, irrespective of what is being targeted.

My reason for reporting this was to inquire about how loop unrolling behaves in later versions of the back end, to ask about jump generation vs jump-free alternatives (LDC showing the correct way to do things), and to ask if there are any suboptimality nasties lurking in code that does not merely come down to driving an assert.

I would hope for an optimisation that handles the case of dense-packed cmp #1 / cmp #2 / cmp #3 / cmp #4 and so on, especially with no holes, in the case where _all the jumps go to the same target_, so it can be reduced down to a two-instruction range check, a huge optimisation. I would also hope that conditional jumps followed by an unconditional jump to the same target could be spotted and handled too. (A general low-level peephole optimisation then? i.e. jxx L1 / jmp L1 => jmp L1.)


But is it really sub-optimal what -O3 is doing? Have you benchmarked it? I certainly didn't.

---
bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
out ( ret )
{
    assert( ret == true || ret == false );
    debug
    {
        bool b = false;
        foreach( s; 0 .. 8 * x.sizeof )
            b = b || ( x == (1uL << s) );
        assert( ret == b );
    }
}
body
{
    return ( ( x & (x - 1) ) == 0 ) & (x > 0);
}

bool t1( int x )
{
    return IsPowerOf2( x );
}

void main(string[] args)
{
    int test()
    {
        return t1(cast(int)args.length);
    }

    import std.datetime.stopwatch;
    import std.stdio;

    auto r = benchmark!(test)(10000000);
    writeln(r);
}
---

[54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param max-peel-branches=32
[168 ms, 71 μs, and 1 hnsec]: gdc -O3 -fdebug --param max-peel-branches=24

Looks like, despite your complaint, it is 3x faster with what you call "strange nonsensical code generation". :-)

Perhaps this is all being generated too late, after the optimisations have already happened, and the opportunities those passes provide have been and gone. Would it be possible for the backend to include a number of repeat optimisation passes of certain kinds after unrolled code is generated, or doesn't it work like that?


I had a quick look, and there is a control parameter for this: '--param max-peel-branches'. The default value is 32; reducing it to 24 is enough to ensure that the complete unrolling never happens, but see the benchmarks above for why changing this may not be a good idea.

Anyway, this is not for you but for that particular backend, I suspect. I was wondering if someone could have a word, pass it on to the relevant people. I think it's worth making -O3 more generally usable rather than crazy, because at the moment it features good ideas gone bad.

No. As language implementers, our job is only to guarantee that the semantics never break. Anything else is not relevant to us.

In this case, though, it looks like everything is fine and there's nothing to report.
