[Bug c/35271] New: Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-20 Thread tege-gcc at swox dot com
The stack should be aligned at a 8 mod 16 boundary
at function entry, under both Apple's ABI and under
"System V Application Binary Interface AMD64
Architecture Processor Supplement, draft 0.99".

GCC aligns the stack at a 0 mod 8 boundary.

To quote from the above mentioned document:
  The end of the input argument area shall be aligned
  on a 16 byte boundary.  In other words, the value
  (%rsp - 8) is always a multiple of 16 when control
  is transferred to the function entry point.
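
The quoted rule can be restated as a small predicate.  This is only a
sketch, not part of the report: the stack pointer is modelled as a plain
integer, and the names entry_rsp_ok and demo_entry_alignment are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if RSP, taken as the stack pointer value
   at a function's entry point, satisfies the quoted rule that (%rsp - 8)
   is a multiple of 16, i.e. RSP is 8 mod 16.  */
int
entry_rsp_ok (uint64_t rsp)
{
  return (rsp - 8) % 16 == 0;
}

/* Hypothetical usage: the caller keeps %rsp at 0 mod 16 just before the
   call, and the call instruction pushes an 8-byte return address.  */
void
demo_entry_alignment (void)
{
  uint64_t before_call = 0x7fff0000;    /* 0 mod 16, arbitrary example */
  uint64_t at_entry = before_call - 8;  /* after the return address push */

  assert (entry_rsp_ok (at_entry));
  assert (!entry_rsp_ok (before_call));
}
```

Calling demo_entry_alignment from main should complete without tripping
an assertion.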

How come this hasn't been discovered before (at least I
cannot find any bug reports about it)?  It is because the
x86 is very lax about alignment.  But a few instructions
are not that lax: MOVDQA will trigger a SIGSEGV on *nix
systems when its memory operand is not 16-byte aligned.
For other 16-byte loads and stores, misalignment is a
performance issue.

I have no test case for this, although it is possible to
trigger it in a shared library under Darwin, since the runtime
loader uses MOVDQA.

Note that this bug has been verified to exist also for
x86_64-*-freebsd and, from reading the compiler sources,
it also affects GNU/Linux.


-- 
           Summary: Stack not aligned at mod 16 byte boundary in x86_64 code
           Product: gcc
           Version: 4.2.2
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
 GCC build triplet: i386-apple-darwin8.11.1
  GCC host triplet: i386-apple-darwin8.11.1
GCC target triplet: i386-apple-darwin8.11.1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271



[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-21 Thread tege-gcc at swox dot com


--- Comment #2 from tege-gcc at swox dot com  2008-02-21 13:49 ---
Created an attachment (id=15196)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15196&action=view)
Alignment test

This is not a strictly correct test case: it may fail even if
the compiler aligns the stack properly, and it may pass even
if the compiler does not correctly align the stack.  One needs
to read the assembly output to verify that a failure is due to
bad stack alignment.

The idea is to check that two subsequent invocations of foo
place a local variable at the same alignment mod 16.  If the
stack is aligned at 8 mod 16, as it should be directly after the
call instruction, any local variable should get the same mod
16 alignment every time.

gcc 4.2.2 as well as gcc 4.2.3 fails this test on both Darwin
and FreeBSD.  The reason is that foo allocates 16 bytes on the
stack, plus the 8 bytes implicitly allocated by call.  This
means the ABI required stack alignment of 16 is not maintained.

$ gcc -O -m64 foo.c  && ./a.out
Abort trap: 6 (core dumped)
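
The arithmetic behind the failure can be modelled without touching the
real stack.  The sketch below is an illustration under stated
assumptions, not the compiler's behaviour: it assumes each invocation
moves the stack pointer down by the frame size plus the 8 bytes pushed
by call, and the names local_alignment_diff and demo_local_alignment
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: each invocation of foo moves the stack pointer down
   by FRAME_BYTES of local frame plus the 8-byte return address pushed by
   call.  Returns ((addr1 ^ addr2) & 0xf) for the addresses of the same
   local in two successive invocations, which is what the test computes.  */
unsigned
local_alignment_diff (uint64_t first_local_addr, unsigned frame_bytes)
{
  uint64_t stride = (uint64_t) frame_bytes + 8;
  uint64_t second_local_addr = first_local_addr - stride;

  return (unsigned) ((first_local_addr ^ second_local_addr) & 0xf);
}

void
demo_local_alignment (void)
{
  /* 16-byte frame as described above: the stride is 24, so the low four
     address bits differ and the test case would abort.  */
  assert (local_alignment_diff (0x7fff00fc, 16) != 0);

  /* A 24-byte frame would give a stride of 32 and identical low bits.  */
  assert (local_alignment_diff (0x7fff00fc, 24) == 0);
}
```

Any stride that is 8 mod 16 flips bit 3 of the address, which is why the
test sees a nonzero result.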





[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-21 Thread tege-gcc at swox dot com


--- Comment #4 from tege-gcc at swox dot com  2008-02-21 13:57 ---
(From update of attachment 15196)
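#include <stdlib.h>

long align;

/* Compare the low four address bits of `variable' across two nested
   invocations; a nonzero result means the alignment mod 16 differed.  */
int
foo (int flag)
{
  int variable;
  if (flag == 0)
    return (((long)&variable ^ align) & 0xf);
  align = (long)&variable;
  return foo (flag - 1);
}

int
main ()
{
  if (foo (1) != 0)
    abort ();
  exit (0);
}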





[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-21 Thread tege-gcc at swox dot com


--- Comment #5 from tege-gcc at swox dot com  2008-02-21 14:01 ---
The attachment is not the right file.
I tried to "edit" it but I cannot find out how to do it.
The proper test case is in the comment before this one.
Sorry, I am bugzilla challenged.





[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-21 Thread tege-gcc at swox dot com


--- Comment #3 from tege-gcc at swox dot com  2008-02-21 13:53 ---
  Testcase?  Because we do align it for both x86_64-* and i386-darwin.

Well, not as mandated in the 64-bit ABI.

  Now the SYSV i386 ABI says it should be aligned at 4 (word)
  bytes boundary.

This is hardly relevant, since that is a 32-bit ABI.

A test case is difficult to produce, since it seems hard
to write anything portable.  I made an attempt though; please
see the comment attached to it.





[Bug c/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-21 Thread tege-gcc at swox dot com


--- Comment #7 from tege-gcc at swox dot com  2008-02-21 22:01 ---
Sorry, but you ought to read and understand what I write before
you comment, otherwise it becomes rather pointless.





[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-23 Thread tege-gcc at swox dot com


--- Comment #13 from tege-gcc at swox dot com  2008-02-23 17:09 ---
Created an attachment (id=15214)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=15214&action=view)
This is a minimized version of the original failing code.

I understand the reasoning about local calls.  The problem
here is with what *looks* like a local call, the calls to
__gmp_mt_recalc_buffer from __gmp_randget_mt.  But in a shared
library, the Darwin linker will replace these calls with calls
to dyld_stub___gmp_mt_recalc_buffer, and that's where the crash
happens.

One may argue that it is utterly silly to use runtime linker
calls when the function is at a known offset in the same object,
and that this is an Apple tools bug.  I have not read any ABI
document for Darwin, so I will rest my case.





[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-23 Thread tege-gcc at swox dot com


--- Comment #16 from tege-gcc at swox dot com  2008-02-23 18:27 ---
I don't know what a PLT entry looks like.  They use the Mach-O object
format, of which I know nothing.

Note that the new testcase does not have any recursive calls.





[Bug target/35271] Stack not aligned at mod 16 byte boundary in x86_64 code

2008-02-24 Thread tege-gcc at swox dot com


--- Comment #19 from tege-gcc at swox dot com  2008-02-24 20:39 ---
I believe the "local call" optimization is triggered when compiling
__gmp_randget_mt() because its only call is to a function the compiler
determines to be local.  (One can easily untrigger the optimization by
inserting a dummy call to foo() in __gmp_mt_recalc_buffer().)

  After all the [__gmp_mt_recalc_buffer()] function is global, not local,
  and can be overridden, so why would GCC assume that just because it
  knows its body it can ignore the usual alignment requirements for
  global functions?

I think we're using "global" and "local" in two senses here.  Sure,
__gmp_mt_recalc_buffer() is global in visibility, but it is local
to the compilation unit, which is what counts for the semi-invalid
optimization discussed.  GCC bases its decision whether to call with
an unaligned stack solely on the latter definition of "local".





[Bug target/34451] New: Typo in gcc/config/i386.c (ix86_rtx_costs)

2007-12-12 Thread tege-gcc at swox dot com
In gcc/config/i386.c, there is a typo in the function ix86_rtx_costs.
Code snippet:

  /* Compute costs correctly for widening multiplication.  */
  if ((GET_CODE (op0) == SIGN_EXTEND || GET_CODE (op1) == ZERO_EXTEND)

It should use op0 for both these tests.
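
To make the difference concrete, here is a sketch with stand-in types.
Nothing below is GCC code: the enum and the function names are
hypothetical and only mimic the shape of the two comparisons.

```c
#include <assert.h>

/* Hypothetical stand-ins for GCC's rtx codes; not the real definitions.  */
enum sketch_code { SK_REG, SK_SIGN_EXTEND, SK_ZERO_EXTEND };

/* The test as quoted: the second comparison looks at op1.  */
int
widening_test_as_quoted (enum sketch_code op0, enum sketch_code op1)
{
  return op0 == SK_SIGN_EXTEND || op1 == SK_ZERO_EXTEND;
}

/* The test as the report suggests: both comparisons look at op0.  */
int
widening_test_suggested (enum sketch_code op0, enum sketch_code op1)
{
  (void) op1;  /* deliberately unused */
  return op0 == SK_SIGN_EXTEND || op0 == SK_ZERO_EXTEND;
}

void
demo_widening_test (void)
{
  /* op0 is zero-extended and op1 is a plain register: the quoted form
     misses this case, the suggested form accepts it.  */
  assert (!widening_test_as_quoted (SK_ZERO_EXTEND, SK_REG));
  assert (widening_test_suggested (SK_ZERO_EXTEND, SK_REG));
}
```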


-- 
           Summary: Typo in gcc/config/i386.c (ix86_rtx_costs)
           Product: gcc
           Version: 4.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34451



[Bug target/34451] Typo in gcc/config/i386.c (ix86_rtx_costs)

2007-12-12 Thread tege-gcc at swox dot com


--- Comment #1 from tege-gcc at swox dot com  2007-12-13 07:21 ---
This bug is present also in the svn head.





[Bug target/34452] New: Multiply-by-constant pessimation

2007-12-13 Thread tege-gcc at swox dot com
The multiply-by-constant optimization works poorly for x86 targets
with multiply units with short latency.

Try some small values M for a sample program like,

  long f (long x) { return x * M; }

for e.g. the -mtune=k8 subtarget.  One gets slow sequences of
lea, add, sub, sal.

So what's wrong?  Is it expmed.c's synth_mult that doesn't measure
costs correctly?  Or does i386.c provide poor cost measures?

I'd say that synth_mult's cost model is great for 3 operand
machines, but not very good for x86.  It does not see the
additional mov instructions inserted in many of its most clever
sequences, nor does it understand that small shifts are done with
the 2 cycle lea instruction (when src!=dst).

Additionally, synth_mult thinks a 10 operation long sequence that
costs 999 is the way to go if a single mult costs 1000.  It does
not take sequence length into account at all.  Perhaps it should?

Let's look at some examples of generated code for -mtune=k8.

6:  (11 bytes, >= 3 cycles)
        leaq    (%rdi,%rdi), %rax
        salq    $3, %rdi
        subq    %rax, %rdi

10: (12 bytes, >= 4 cycles)
        leaq    0(,%rdi,8), %rax
        leaq    (%rax,%rdi,2), %rax

11: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        subq    %rdi, %rax

13: (21 bytes, >= 4 cycles)
        leaq    0(,%rdi,4), %rdx
        movq    %rdi, %rax
        salq    $4, %rax
        subq    %rdx, %rax
        addq    %rdi, %rax

etc, etc.
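
For readers who prefer C to assembly, the four sequences above compute
the following.  This is a sketch for checking the arithmetic only; the
function names are made up, and unsigned arithmetic is used so that
wraparound is well defined.

```c
#include <assert.h>

/* 6:  2x in one register, 8x in another, then 8x - 2x.  */
unsigned long
mul6 (unsigned long x)
{
  return (x << 3) - (x + x);
}

/* 10: 8x, then 8x + 2x in a single lea.  */
unsigned long
mul10 (unsigned long x)
{
  return (x << 3) + (x << 1);
}

/* 11: 16x - 4x - x.  */
unsigned long
mul11 (unsigned long x)
{
  return (x << 4) - (x << 2) - x;
}

/* 13: 16x - 4x + x.  */
unsigned long
mul13 (unsigned long x)
{
  return (x << 4) - (x << 2) + x;
}

/* Hypothetical check that each sequence matches a plain multiply.  */
void
demo_mul_sequences (void)
{
  unsigned long x;

  for (x = 0; x < 1000; x++)
    {
      assert (mul6 (x) == x * 6);
      assert (mul10 (x) == x * 10);
      assert (mul11 (x) == x * 11);
      assert (mul13 (x) == x * 13);
    }
}
```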

The cycle counts hold only if we get ideal parallel execution,
otherwise one additional cycle will be needed.  The imul
instruction needs 4 cycles (64-bit operation) and is not
alignment sensitive and needs just one execution slot.  It can
therefore execute simultaneously with independent instructions,
while the above sequences will use up much decode, execute,
and retire resources.

What can be done about this?

The simple fix is to pretend multiplication is cheaper:

--- /u/gcc/gcc-4.2.2/gcc/config/i386/.~/i386.c.~1~  Sat Sep  1 17:28:30 2007
+++ /u/gcc/gcc-4.2.2/gcc/config/i386/i386.c Thu Dec 13 10:12:07 2007
@@ -17254,7 +17254,7 @@
              op0 = XEXP (op0, 0), mode = GET_MODE (op0);
            }

-         *total = (ix86_cost->mult_init[MODE_INDEX (mode)]
+         *total = (ix86_cost->mult_init[MODE_INDEX (mode)] - 1
                    + nbits * ix86_cost->mult_bit
                    + rtx_cost (op0, outer_code) + rtx_cost (op1, outer_code));

This avoids most of the problems, but we still get a 4 cycle,
2 lea sequence for M = 10.

Potential problem: This might affect other parts of the optimizer
than synth_mult.  That might be bad, but it might also be desirable.

Another fix would perhaps be to teach synth_mult to understand that
it's generating code for a 2.5 operand machine (one that can only
do "a x= b", not "a = b x c", for some operation x).  We should
teach it that there will be moves inserted for sequences that rely
on a source register twice (more or less).

Letting synth_mult take sequence length into account would also
make sense, I think.  A cost of 1 per operation does not seem
unreasonable.
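
As a sketch of what a length-aware comparison could look like (this is
not synth_mult's actual code; prefer_sequence and its parameters are
hypothetical names chosen for illustration):

```c
#include <assert.h>

/* Hypothetical decision rule: prefer the shift/add sequence only if its
   cost, plus PER_OP_PENALTY for each operation in it, is below the cost
   of a single multiply.  */
int
prefer_sequence (int seq_cost, int n_ops, int mult_cost, int per_op_penalty)
{
  return seq_cost + n_ops * per_op_penalty < mult_cost;
}

void
demo_sequence_choice (void)
{
  /* The example from the report: a 10 operation sequence costing 999
     against a multiply costing 1000.  */
  assert (prefer_sequence (999, 10, 1000, 0));   /* length ignored */
  assert (!prefer_sequence (999, 10, 1000, 1));  /* 1009 is not < 1000 */
}
```

With a penalty of 1 per operation the long sequence loses to the single
multiply, while short sequences that are clearly cheaper still win.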

(I wrote synth_mult originally.)


-- 
           Summary: Multiply-by-constant pessimation
           Product: gcc
           Version: 4.2.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
GCC target triplet: i386-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34452



[Bug target/34452] Multiply-by-constant pessimation

2007-12-13 Thread tege-gcc at swox dot com


--- Comment #2 from tege-gcc at swox dot com  2007-12-13 10:52 ---
It does make sense to bluff somewhat about the costs of the few 3
operand instructions that we have:

  lea
  mul  const, regx, regy

Exactly what cost to assign is not obvious.  I think the nominal
cost - epsilon is probably better than the nominal cost - 1 + epsilon.
The latter is what Honza does.

But teaching synth_mult about 2.5 operandness might be the best way to
make constant multiplication code come out optimal.  I'm not sure how
to do that, though.  Peeking at the constraints flags?  :-)

If I understand Honza right, he's trying to please reload more
than synth_mult with his tweak.





[Bug c/21331] New: Incorrect folding of comparison

2005-05-02 Thread tege-gcc at swox dot com
On ia64:    gcc -O ~/bug.c
On powerpc: gcc -O -m64 ~/bug.c

The test case hits abort.

(This case came up when trying to compile GNU MP with gcc 4.
I have yet to find a platform where gcc 4 works properly.)

This is bug.c:

#include <stdlib.h>

int bar (void)
{  return -1;  }

unsigned long
foo ()
{ unsigned long retval;
  retval = bar ();
  if (retval == -1)  return 0;
  return 3;  }

main ()
{ if (foo () != 0)  abort ();
  return 0;  }
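
For reference, the test should not abort under the usual arithmetic
conversions: bar's int -1 becomes ULONG_MAX when stored in retval, and
the -1 in the comparison is converted the same way.  The sketch below
writes those conversions out explicitly; foo_explicit and
demo_comparison are hypothetical names, not part of bug.c.

```c
#include <assert.h>
#include <limits.h>

/* Same shape as foo in bug.c, with the implicit conversions spelled out.  */
unsigned long
foo_explicit (int from_bar)
{
  unsigned long retval = (unsigned long) from_bar;  /* -1 -> ULONG_MAX */

  /* In `retval == -1' the int -1 is converted to unsigned long too.  */
  if (retval == (unsigned long) -1)
    return 0;
  return 3;
}

void
demo_comparison (void)
{
  assert ((unsigned long) -1 == ULONG_MAX);
  assert (foo_explicit (-1) == 0);
  assert (foo_explicit (0) == 3);
}
```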

-- 
           Summary: Incorrect folding of comparison
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: critical
          Priority: P2
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tege-gcc at swox dot com
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: ia64-redhat-linux, powerpc-apple-darwin8
  GCC host triplet: ia64-redhat-linux, powerpc-apple-darwin8
GCC target triplet: ia64-redhat-linux, powerpc-apple-darwin8


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21331