https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79593

--- Comment #3 from Katsunori Kumatani <katsunori.kumatani at gmail dot com> ---
Hi, sorry I forgot to mention, I used Godbolt's Compiler Explorer to test it on
GCC 5 and 7 as I only have version 6 deployed on this machine.

On my end, it probably used march 'native' by default (?) but I omitted it for
obvious reasons. I think this matters because your output has "sahf" and doesn't
use the "fcomi" instructions, which means it targets an older architecture.

Try compiling with   -march=core2 -m32 -Ofast -mfpmath=387

By the way, I'm not talking about the fact that the value is used multiple
times, but that GCC "loads" it (pushes a copy onto the stack) and then pops it a
few instructions later, with nothing in between that requires the copy!

Version 5 does not do this. It's not the "double load" that is the issue, but
the "double load followed by a pop later", because it is useless and v5 does it
better.

Look at the following small example from the beginning of the output (cut to
make it shorter); try it on godbolt.org to see what I mean if you can't
reproduce it. If you compare them side by side, you'll notice only this small
difference:

GCC 6:
        fldz
        sub     esp, 20
        mov     eax, DWORD PTR [esp+24]
        mov     edx, DWORD PTR [esp+28]
        cmp     DWORD PTR [eax], edx
        jbe     .L1
        fld     DWORD PTR global_data
        fld     st(0)       # this
        fld     DWORD PTR global_data+4
        fxch    st(3)
        fucomip st, st(2)
        fstp    st(1)       # and this are useless/not in v5


GCC 5:
        fldz
        sub     esp, 20
        mov     eax, DWORD PTR [esp+24]
        mov     edx, DWORD PTR [esp+28]
        cmp     DWORD PTR [eax], edx
        jbe     .L2
        fld     DWORD PTR global_data
        fld     DWORD PTR global_data+4
        fxch    st(2)
        fucomip st, st(1)
        ja      .L20
        fxch    st(1)
        fsubr   DWORD PTR [eax+4]


As you can see, the only difference in this beginning part between version 5
and 6 is that 6 duplicates the top of the stack, only to pop it later.
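
To double-check that claim, here is a rough sketch that replays the x87
instructions from both listings on a toy model of the register stack. The class
and function names are made up for illustration, the values are symbolic
placeholders rather than real floats, and only the stack effects are modelled
(no flags, no integer instructions).

```python
class X87Stack:
    """Toy model of the x87 register stack; regs[0] is st(0), the top."""

    def __init__(self):
        self.regs = []
        self.compares = []  # operand pairs seen by fucomip
        self.peak = 0       # deepest the stack ever got

    def fld(self, value):
        # Push a value onto the register stack.
        self.regs.insert(0, value)
        self.peak = max(self.peak, len(self.regs))

    def fld_st(self, i):
        # fld st(i): push a copy of st(i).
        self.fld(self.regs[i])

    def fxch(self, i):
        # fxch st(i): swap st(0) with st(i).
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

    def fucomip(self, i):
        # fucomip st, st(i): compare st(0) with st(i), then pop.
        self.compares.append((self.regs[0], self.regs[i]))
        self.regs.pop(0)

    def fstp(self, i):
        # fstp st(i): copy st(0) into st(i), then pop.
        self.regs[i] = self.regs[0]
        self.regs.pop(0)


def gcc6():
    s = X87Stack()
    s.fld("0.0")             # fldz
    s.fld("global_data")
    s.fld_st(0)              # the extra duplicate
    s.fld("global_data+4")
    s.fxch(3)
    s.fucomip(2)
    s.fstp(1)                # the extra pop
    return s


def gcc5():
    s = X87Stack()
    s.fld("0.0")             # fldz
    s.fld("global_data")
    s.fld("global_data+4")
    s.fxch(2)
    s.fucomip(1)
    return s


a, b = gcc6(), gcc5()
print(a.regs == b.regs, a.compares == b.compares, a.peak, b.peak)
```

In this model both sequences compare the same two operands and leave the same
two values on the stack; the GCC 6 sequence just needs one more register slot
at its deepest point.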

I mean, even if it wanted to "store" the top in a different register on the
stack, it could just use the "fst" instruction, without the 'p' that implies a
pop. This way, it could still do its logic without having to duplicate the top
of the stack needlessly. It would get rid of the "fld st(0)" but keep the fst
if it needed the value later in another register.

Of course, in this example the duplication isn't so bad, because it uses few
registers. But it's a bad case for real code, because when the register stack
is full it will have to spill st(7) to memory in order to duplicate the top...
IMO it wastes register stack space for no reason (even if the load is cheap).
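
A rough sketch of that cost, under my own simplifying assumption that the
compiler always spills the deepest slot, st(7), when it needs room. The helper
name and the list-based model are hypothetical, not anything GCC does
internally.

```python
X87_SLOTS = 8  # the x87 register stack has eight slots, st(0)..st(7)


def duplicate_top(stack, spilled):
    """Model `fld st(0)`: push a copy of the top, spilling st(7) if full.

    stack[0] is st(0); `spilled` collects values stored out to memory.
    """
    if len(stack) == X87_SLOTS:
        # No free slot: the deepest value has to go to memory first.
        spilled.append(stack.pop())
    stack.insert(0, stack[0])


# Few live values: the duplicate is free apart from the extra slot.
small, small_spills = ["a", "b", "c"], []
duplicate_top(small, small_spills)

# All eight slots live: the same duplicate forces a store to memory.
full, full_spills = ["v%d" % i for i in range(8)], []
duplicate_top(full, full_spills)

print(len(small_spills), len(full_spills))
```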

In any case is it possible to make it behave like Godbolt's version 5? (idk
what settings it uses, though). I tested on it just to make sure it didn't
always behave this way and I was right at least with certain options...


BTW, any clue why version 7 does even worse with respect to that stack spill?
In versions 5 and 6, this part:

        ja      .L20
        fxch    st(1)
        fsubr   DWORD PTR [eax+4]

Becomes this in v7:

        mov     eax, DWORD PTR [eax+4]
        ja      .L20
        mov     DWORD PTR [esp], eax
        fld     DWORD PTR [esp]
        fsubrp  st(2), st

Which is obviously quite a big penalty just for subtracting a float from a long
double (*two* extra memory operations, three in total).

Thanks :)
