On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:
On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:
https://bugzilla.gdcproject.org/show_bug.cgi?id=288

--- Comment #1 from Iain Buclaw <ibuc...@gdcproject.org> ---
See the long list of useless conditional jumps towards the end of the first function in the asm output (whose demangled name is test.t1(int)).

Well, you'd never use -O3 if you care about speed anyway. :-)

And they are not useless jumps; it's just the foreach loop unrolled in its entirety. You can see that it's a feature of the gcc-7 series and later: regardless of the target, they all produce the same unrolled loop.

https://explore.dgnu.org/g/vD3N4Y

It might be a nice experiment to add pragma(ivdep) and pragma(unroll) support
to give you more control.

https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html

I wouldn't hold my breath though (this is not strictly a bug).

Agreed. It is possibly not a bug: I don't see that the code is dysfunctional, though I haven't looked through it all. But since the backend is doing optimisation here with the unrolling, producing something sub-optimal for this admittedly weird code is imho a bug, in the sense that the goal of _optimisation_ is not attained.

And I do understand this is nothing to do with D, and I understand that this is unrolling.

But notice that the targets of the jumps are all the same location, and the sequence finishes off with an unconditional jump to that same location.


Not quite; if you look a little closer, some jump to other branches hidden in between.

I feel this is just a quirk of unrolling, in part, but that's not all: the jumps don't make sense.

    cmp #n / jxx L3
    cmp #m / jxx L3
    jmp L3

is what we have, so it all basically does absolutely nothing. Unless cmp 1 / cmp 2 / cmp 3 / cmp 4 is meant as an incredibly bad way of testing ( x >= 1 && x <= 4 ), but with 30-odd tests it isn't very funny.


If you compile with -fdump-tree-optimized=stdout you will see that it's the middle-end that has lowered the code to a series of if jumps.

The backend consumer doesn't really have any chance for improving it.

I know this is merely debug-only code, but I am wondering what else might happen if you are misguided enough to use the crazy -O3 with unrolled loops that have conditionals in them.

My other complaint about GCC back-end's code generation is that it (sometimes) doesn't go for jump-less movcc-style operations when it can. For x86/x64, LDC sometimes generates jump-free code using conditional instructions where GCC does not.

I can fix the problem with GDC by using a single & instead of &&, which happens to be legal here. Sometimes in the past I have needed to take steps to make sure I can do such an operator-substitution trick in order to get jump-free, far faster code, in cases where the alternatives are extremely short (and side-effect-free) and branch-prediction failure is a certainty.


You could also try compiling with -O2. I couldn't really see this in your given example, but honestly, if you want to optimize really aggressively you must be willing to coax the compiler in strange ways anyway.

I don't know if there are ways in which the backend could try to ascertain whether the results of a given unrolling are really bad. In some cases they could be bad because the code is too long, causing problems with code-cache size, or because it won't fit into a loop buffer. A highly CPU-specific check against the generated code size would need to be carried out, at least in all cases where a loop remains (as opposed to full unrolling of a known trip count), since every kind of AMD and Intel processor is different, as Agner Fog warns us. Here, though, I didn't even explicitly ask for unrolling, so you might harshly say that it is the compiler's job to work out whether it is actually an anti-optimisation, regardless of the possible reasons why the result may be bad news, never mind just based on total generated code size not fitting into some per-CPU limit.


Well again, from past experience -O3 doesn't really care that much about code size or cache lines. All optimization passes which lower the code this way do so during SSA transformations, irrespective of what is being targeted.

My reason for reporting this was to inquire about how loop unrolling behaves in later versions of the back end, to ask about jump generation vs jump-free alternatives (LDC showing the correct way to do things), and to ask if there are any suboptimality nasties lurking in code that does not merely come down to driving an assert.

I would hope for an optimisation that handles the case of dense-packed cmp #1 / cmp #2 / cmp #3 / cmp #4 and so on, especially with no holes, in the case where _all the jumps go to the same target_, so it can be reduced down to a two-instruction range check, a huge optimisation. I would also hope that conditional jumps followed by an unconditional jump to the same target could be spotted and handled too. (A general low-level peephole optimisation then? i.e. jxx L1 / jmp L1 => jmp L1.)


But is it really sub-optimal what -O3 is doing? Have you benchmarked it? I certainly didn't.

---
bool IsPowerOf2(T)(T x) pure nothrow @safe @nogc
out ( ret )
{
    assert( ret == true || ret == false );
    debug
    {
        bool b = false;
        foreach( s; 0 .. 8 * x.sizeof )
            b = b || ( x == (1uL << s) );
        assert( ret == b );
    }
}
body
{
    return ( ( x & (x - 1) ) == 0 ) & (x > 0);
}

bool t1( int x )
{
    return IsPowerOf2( x );
}

void main(string[] args)
{
    int test()
    {
        return t1(cast(int)args.length);
    }

    import std.datetime.stopwatch;
    import std.stdio;

    auto r = benchmark!(test)(10000000);
    writeln(r);
}
---

[54 ms, 299 μs, and 5 hnsecs]: gdc -O3 -fdebug --param max-peel-branches=32
[168 ms, 71 μs, and 1 hnsec]: gdc -O3 -fdebug --param max-peel-branches=24

Looks like, despite your complaint, it is 3x faster with what you call "strange nonsensical code generation". :-)

Perhaps this is all being generated too late, after the optimisations have already happened, and the opportunities those passes provide have been and gone. Would it be possible for the backend to include a number of repeat optimisation passes of certain kinds after unrolled code is generated, or doesn't it work like that?


I had a quick look, and there is a control parameter for this: '--param max-peel-branches'. The default value is 32; reducing it to 24 is enough to ensure that the complete unrolling never happens, but see the benchmarks above for why changing this may not be a good idea.

Anyway, this is not for you but for that particular backend, I suspect. I was wondering if someone could have a word, pass it on to the relevant people. I think it's worth making -O3 more generally usable rather than crazy, because at the moment it features good ideas gone bad.

No. As language implementers, our job is only to guarantee that the semantics never break. Anything else is not relevant to us.

In this case, though, it looks like everything is fine and there's nothing to report.
