[Bug c/35271] New: Stack not aligned at mod 16 byte boundary in x86_64 code
The stack should be aligned at an 8 mod 16 boundary at function entry, under both Apple's ABI and the "System V Application Binary Interface AMD64 Architecture Processor Supplement, draft 0.99". GCC aligns the stack at a 0 mod 8 boundary.

To quote from the above-mentioned document:

  The end of the input argument area shall be aligned on a 16 byte boundary. In other words, the value (%rsp - 8) is always a multiple of 16 when control is transferred to the function entry point.

How come this hasn't been discovered before (at least I cannot find any bug reports about it)? It is because the x86 is very lax about alignment. But a few instructions are not that lax: MOVDQA will trigger a SIGSEGV on *nix systems. For other 16-byte loads and stores, misalignment is a performance issue.

I have no test case for this, although it is possible to trigger in a shared library under Darwin, since the runtime loader uses MOVDQA.

Note that this bug has been verified to exist also for x86_64-*-freebsd and, from reading the compiler sources, it also affects GNU/Linux.

-- 
           Summary: Stack not aligned at mod 16 byte boundary in x86_64 code
           Product: gcc
           Version: 4.2.2
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
 GCC build triplet: i386-apple-darwin8.11.1
  GCC host triplet: i386-apple-darwin8.11.1
GCC target triplet: i386-apple-darwin8.11.1

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #2 from tege-gcc at swox dot com 2008-02-21 13:49 ---
Created an attachment (id=15196)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15196&action=view)
Alignment test

This is not a strictly correct test case: it may fail even if the compiler aligns the stack properly, and it may pass even if the compiler does not correctly align the stack. One needs to read the assembly output to verify that a failure is due to bad stack alignment.

The idea is to check that two subsequent invocations of foo place a local variable at the same alignment mod 16. If the stack is aligned at 8 mod 16, as it should be directly after the call instruction, any local variable should get the same mod 16 alignment every time.

gcc 4.2.2 as well as gcc 4.2.3 fail this test on both Darwin and FreeBSD. The reason is that foo allocates 16 bytes on the stack, plus the 8 bytes implicitly allocated by call. This means the ABI-required stack alignment of 16 is not maintained.

$ gcc -O -m64 foo.c && ./a.out
Abort trap: 6 (core dumped)

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #4 from tege-gcc at swox dot com 2008-02-21 13:57 ---
(From update of attachment 15196)

#include <stdlib.h>

long align;

/* Compare the mod 16 alignment of a local variable in two nested
   invocations; return 0 if they agree.  */
int
foo (int flag)
{
  int variable;
  if (flag == 0)
    return (((long)&variable ^ align) & 0xf);
  align = (long)&variable;
  return foo (flag - 1);
}

int
main (void)
{
  if (foo (1) != 0)
    abort ();
  exit (0);
}

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #5 from tege-gcc at swox dot com 2008-02-21 14:01 --- The attachment is not the right file. I tried to "edit" it but I cannot find out how to do it. The proper test case is in the comment before this one. Sorry, I am bugzilla challenged. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #3 from tege-gcc at swox dot com 2008-02-21 13:53 ---
> Testcase? Because we do align it for both x86_64-* and i386-darwin.

Well, not as mandated in the 64-bit ABI.

> Now the SYSV i386 ABI says it should be aligned at 4 (word) bytes boundary.

This is hardly relevant, since that is a 32-bit ABI.

A test case is difficult to produce, since it seems hard to write anything portable. I made an attempt though; please see the comment attached to it.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #7 from tege-gcc at swox dot com 2008-02-21 22:01 --- Sorry, but you ought to read and understand what I write before you comment, otherwise it becomes rather pointless. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #13 from tege-gcc at swox dot com 2008-02-23 17:09 ---
Created an attachment (id=15214)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15214&action=view)
This is a minimized version of the original failing code.

I understand the reasoning about local calls. The problem here is with what *looks* like a local call: the calls to __gmp_mt_recalc_buffer from __gmp_randget_mt. In a shared library, the Darwin linker will replace these calls with calls to dyld_stub___gmp_mt_recalc_buffer, and that's where the crash happens.

One may argue that it is utterly silly to use runtime linker calls when the function is at a known offset in the same object, and that this is an Apple tools bug. I have not read any ABI document for Darwin, so I will rest my case.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #16 from tege-gcc at swox dot com 2008-02-23 18:27 ---
I don't know what a PLT entry looks like. They use the Mach-O object format, of which I know nothing.

Note that the new testcase does not have any recursive calls.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code
--- Comment #19 from tege-gcc at swox dot com 2008-02-24 20:39 ---
I believe the "local call" optimization is triggered when compiling __gmp_randget_mt() because its only call is to a function the compiler determines to be local. (One can easily untrigger the optimization by inserting a dummy call to foo() in __gmp_mt_recalc_buffer().)

> After all the [__gmp_mt_recalc_buffer()] function is global, not local, and can be overridden, so why would GCC assume that just because it knows its body it can ignore the usual alignment requirements for global functions?

I think we're using "global" and "local" in two senses here. Sure, __gmp_mt_recalc_buffer() is global in visibility, but it is local to the compilation unit, which is what counts for the semi-invalid optimization discussed. GCC bases its decision whether to call with an unaligned stack solely on the latter definition of "local".

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271
[Bug target/34451] New: Typo in gcc/config/i386.c (ix86_rtx_costs)
In gcc/config/i386.c, there is a typo in the function ix86_rtx_costs.

Code snippet:

  /* Compute costs correctly for widening multiplication.  */
  if ((GET_CODE (op0) == SIGN_EXTEND || GET_CODE (op1) == ZERO_EXTEND)

It should use op0 for both these tests.

-- 
           Summary: Typo in gcc/config/i386.c (ix86_rtx_costs)
           Product: gcc
           Version: 4.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34451
[Bug target/34451] Typo in gcc/config/i386.c (ix86_rtx_costs)
--- Comment #1 from tege-gcc at swox dot com 2007-12-13 07:21 --- This bug is present also in the svn head. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34451
[Bug target/34452] New: Multiply-by-constant pessimation
The multiply-by-constant optimization works poorly for x86 targets with multiply units with short latency.

Try some small values M for a sample program like,

  long f (long x) { return x * M; }

for e.g. the -mtune=k8 subtarget. One gets slow sequences of lea, add, sub, sal.

So what's wrong? Is it expmed.c's synth_mult that doesn't measure costs correctly? Or does i386.c provide poor cost measures?

I'd say that synth_mult's cost model is great for 3 operand machines, but not very good for x86. It does not see the additional mov instructions inserted in many of its most clever sequences, nor does it understand that small shifts are done with the 2 cycle lea instruction (when src!=dst).

Additionally, synth_mult thinks a 10 operation long sequence that costs 999 is the way to go if a single mult costs 1000. It does not take sequence length into account at all. Perhaps it should?

Let's look at some examples of generated code for -mtune=k8.

6: (11 bytes, >= 3 cycles)
        leaq    (%rdi,%rdi), %rax
        salq    $3, %rdi
        subq    %rax, %rdi

10: (12 bytes, >= 4 cycles)
        leaq    0(,%rdi,8), %rax
        leaq    (%rax,%rdi,2), %rax

11: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        subq    %rdi, %rax

13: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        addq    %rdi, %rax

etc, etc.

The cycle counts hold only if we get ideal parallel execution; otherwise one additional cycle will be needed.

The imul instruction needs 4 cycles (64-bit operation), is not alignment sensitive, and needs just one execution slot. It can therefore execute simultaneously with independent instructions, while the above sequences will use up much decode, execute, and retire resources.

What can be done about this?

The simple fix is to pretend multiplication is cheaper:

--- /u/gcc/gcc-4.2.2/gcc/config/i386/.~/i386.c.~1~ Sat Sep 1 17:28:30 2007
+++ /u/gcc/gcc-4.2.2/gcc/config/i386/i386.c Thu Dec 13 10:12:07 2007
@@ -17254,7 +17254,7 @@
          op0 = XEXP (op0, 0), mode = GET_MODE (op0);
        }
-      *total = (ix86_cost->mult_init[MODE_INDEX (mode)]
+      *total = (ix86_cost->mult_init[MODE_INDEX (mode)] - 1
                + nbits * ix86_cost->mult_bit
                + rtx_cost (op0, outer_code) + rtx_cost (op1, outer_code));

This avoids most of the problems, but we still get a 4 cycle, 2 lea sequence for M = 10.

Potential problem: This might affect other parts of the optimizer than synth_mult. That might be bad, but it might also be desirable.

Another fix would perhaps be to teach synth_mult to understand that it's generating code for a 2.5 operand machine (one that can only do "a x= b", not "a = b x c", for some operation x). We should teach it that there will be moves inserted for sequences that rely on a source register twice (more or less).

Letting synth_mult take sequence length into account would also make sense, I think. A cost of 1 per operation does not seem unreasonable.

(I wrote synth_mult originally.)

-- 
           Summary: Multiply-by-constant pessimation
           Product: gcc
           Version: 4.2.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452
[Bug target/34452] Multiply-by-constant pessimation
--- Comment #2 from tege-gcc at swox dot com 2007-12-13 10:52 ---
It does make sense to bluff somewhat about the costs of the few 3 operand instructions that we have:

  lea
  mul const, regx, regy

Exactly what cost to assign is not obvious. I think the nominal cost - epsilon is probably better than the nominal cost - 1 + epsilon. The latter is what Honza does.

But teaching synth_mult about 2.5 operandness might be the best way to make constant multiplication code come out optimal. I'm not sure how to do that, though. Peeking at the constraints flags? :-)

If I understand Honza right, he's trying to please reload more than synth_mult with his tweak.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452
[Bug c/21331] New: Incorrect folding of comparison
On ia64: gcc -O ~/bug.c
On powerpc: gcc -O -m64 ~/bug.c

The test case hits abort.

(This case came up when trying to compile GNU MP with gcc 4. I have yet to find a platform where gcc 4 works properly.)

This is bug.c:

#include <stdlib.h>

int
bar (void)
{
  return -1;
}

unsigned long
foo ()
{
  unsigned long retval;
  retval = bar ();
  if (retval == -1)
    return 0;
  return 3;
}

int
main (void)
{
  if (foo () != 0)
    abort ();
  return 0;
}

-- 
           Summary: Incorrect folding of comparison
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: critical
          Priority: P2
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: ia64-redhat-linux, powerpc-apple-darwin8
  GCC host triplet: ia64-redhat-linux, powerpc-apple-darwin8
GCC target triplet: ia64-redhat-linux, powerpc-apple-darwin8

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21331