https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79593
--- Comment #3 from Katsunori Kumatani <katsunori.kumatani at gmail dot com> ---

Hi, sorry I forgot to mention, I used Godbolt's Compiler Explorer to test it on GCC 5 and 7, as I only have version 6 deployed on this machine. On my end, it probably used -march=native by default (?) but I omitted it for obvious reasons.

The reason I find this important is that in your case you have "sahf", and I see your code doesn't use the "fcomi" instruction, which means it targets an older architecture. Try compiling with:

  -march=core2 -m32 -Ofast -mfpmath=387

By the way, I'm not talking about the fact that it is used multiple times, but that it "loads" it (pushes it on the stack) and then pops it a bit later, without anything in between that requires this! Version 5 does not do this. It's not the "double load" that is the issue, but the "double load followed by a pop later", because it is useless and v5 does it better.

Look at the following small example from the beginning (cut to make it shorter); try it on godbolt.org to see what I mean if you can't reproduce it. If you compare them side by side you'll only notice this small difference:

GCC 6:

  fldz
  sub esp, 20
  mov eax, DWORD PTR [esp+24]
  mov edx, DWORD PTR [esp+28]
  cmp DWORD PTR [eax], edx
  jbe .L1
  fld DWORD PTR global_data
  fld st(0)                       # this
  fld DWORD PTR global_data+4
  fxch st(3)
  fucomip st, st(2)
  fstp st(1)                      # and this are useless/not in v5

GCC 5:

  fldz
  sub esp, 20
  mov eax, DWORD PTR [esp+24]
  mov edx, DWORD PTR [esp+28]
  cmp DWORD PTR [eax], edx
  jbe .L2
  fld DWORD PTR global_data
  fld DWORD PTR global_data+4
  fxch st(2)
  fucomip st, st(1)
  ja .L20
  fxch st(1)
  fsubr DWORD PTR [eax+4]

As you can see, the only difference in this beginning part between version 5 and 6 is that 6 duplicates the top of the stack, only to pop it later. I mean, even if it wanted to "store" the top in a different register on the stack, it could just use the "fst" instruction, without the 'p' which implies a pop.
This way, it could still do its logic but without having to duplicate the top of the stack needlessly. Thus it would get rid of the "fld st(0)" but not the fst, if it needed the value later in another register.

Of course in this example the duplication isn't so bad, because it uses few registers. But it's a bad case for real code because it will have to spill st(7) on the stack (when the register stack is full) in order to duplicate it... IMO it wastes register stack space for no reason (even if the load is cheap).

In any case, is it possible to make it behave like Godbolt's version 5? (idk what settings it uses, though). I tested on it just to make sure it didn't always behave this way, and I was right, at least with certain options...

BTW, any clue why version 7 does even worse with respect to that stack spill? In version 5 and 6, this part:

  ja .L20
  fxch st(1)
  fsubr DWORD PTR [eax+4]

Becomes this in v7:

  mov eax, DWORD PTR [eax+4]
  ja .L20
  mov DWORD PTR [esp], eax
  fld DWORD PTR [esp]
  fsubrp st(2), st

Which is obviously quite a big penalty just for subtracting a float from a long double (*two* extra memory operations, three in total).

Thanks :)
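
The claim that GCC 6's "fld st(0)" / "fstp st(1)" pair has no net effect can be checked with a small model of the x87 register stack. The following is only a sketch: the class X87Stack and the two helper functions are hypothetical names made up for illustration, only the x87 instructions from the two listings are modelled (the integer instructions and branches are ignored), and the values standing in for global_data and global_data+4 are arbitrary.

```python
class X87Stack:
    """Toy model of the x87 register stack; regs[0] plays the role of st(0)."""

    def __init__(self):
        self.regs = []        # regs[i] corresponds to st(i)
        self.max_depth = 0    # deepest the register stack ever got
        self.compares = []    # (st(0), st(i)) operand pairs seen by fucomip

    def fld(self, value):
        # fld mem / fldz: push a value onto the register stack
        self.regs.insert(0, value)
        self.max_depth = max(self.max_depth, len(self.regs))

    def fld_st(self, i):
        # fld st(i): push a copy of st(i)
        self.fld(self.regs[i])

    def fxch(self, i):
        # fxch st(i): swap st(0) and st(i)
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

    def fucomip(self, i):
        # fucomip st, st(i): compare st(0) with st(i), then pop
        self.compares.append((self.regs[0], self.regs[i]))
        self.regs.pop(0)

    def fstp_st(self, i):
        # fstp st(i): copy st(0) into st(i), then pop
        self.regs[i] = self.regs[0]
        self.regs.pop(0)


def gcc6_sequence(g0, g1):
    """x87 instructions from the GCC 6 listing, in order."""
    s = X87Stack()
    s.fld(0.0)      # fldz
    s.fld(g0)       # fld DWORD PTR global_data
    s.fld_st(0)     # fld st(0)            <- the extra duplicate
    s.fld(g1)       # fld DWORD PTR global_data+4
    s.fxch(3)       # fxch st(3)
    s.fucomip(2)    # fucomip st, st(2)
    s.fstp_st(1)    # fstp st(1)           <- the extra pop
    return s


def gcc5_sequence(g0, g1):
    """x87 instructions from the GCC 5 listing, up to the fucomip."""
    s = X87Stack()
    s.fld(0.0)      # fldz
    s.fld(g0)       # fld DWORD PTR global_data
    s.fld(g1)       # fld DWORD PTR global_data+4
    s.fxch(2)       # fxch st(2)
    s.fucomip(1)    # fucomip st, st(1)
    return s


if __name__ == "__main__":
    v6 = gcc6_sequence(1.5, 2.5)
    v5 = gcc5_sequence(1.5, 2.5)
    print("GCC 6:", v6.regs, v6.compares, "max depth", v6.max_depth)
    print("GCC 5:", v5.regs, v5.compares, "max depth", v5.max_depth)
```

Under this model both sequences compare the same two operands and leave the same values in st(0) and st(1); the GCC 6 sequence just needs one more register slot at its deepest point, which is the wasted register stack space described above.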