Re: A question about redundant load elimination

2011-11-14 Thread Ye Joey
From the tree dump we can see that there are two loads of x, one
converted to unsigned and one kept signed. I guess that is why the
second load is not eliminated. There is apparently room to improve, though.

  int prephitmp.8;
  int * D.2027;
  unsigned int D.2026;
  unsigned int x.1;
  int x.0;

  # BLOCK 2 freq:1
  # PRED: ENTRY [100.0%]  (fallthru,exec)
  x.0_1 = x;
  x.1_2 = (unsigned int) x.0_1;  // unsigned move
  D.2026_3 = x.1_2 * 4;
  D.2027_5 = a_4(D) + D.2026_3;
  *D.2027_5 = 1;
  prephitmp.8_6 = x; // signed move
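
For illustration, the single load the question is after could be written by
hand as in the C sketch below. This is only a sketch under assumptions, not
what GCC emits: g_cached, idx, f_calls and the stub body of f are hypothetical
names added here. The rewrite is only equivalent to the original when the
store through a never overlaps x itself (an int * is allowed to alias a
global int), which may be one more reason the compiler keeps the reload.

```c
int x;
static int f_calls;          /* hypothetical counter, only for the demo */

/* Stub standing in for the external f() of the test case; it may change x. */
void f(void) { f_calls++; x = 2; }

void g_cached(int *a)
{
    int idx = x;             /* one load of x serves both the store and the compare */
    a[idx] = 1;              /* assumption: this store does not overlap x */
    if (idx == 100) {
        f();
        idx = x;             /* reload is required: f() may have written x */
    }
    a[idx] = 2;
}
```

The reload after the call cannot be avoided in any case, because f is external
and may write x.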

On Mon, Nov 14, 2011 at 4:01 PM, Jiangning Liu  wrote:
> Hi,
>
> For this test case,
>
> int x;
> extern void f(void);
>
> void g(int *a)
> {
>        a[x] = 1;
>        if (x == 100)
>                f();
>        a[x] = 2;
> }
>
> For trunk, the x86 assembly code is like below,
>
>        movl    x, %eax
>        movl    16(%esp), %ebx
>        movl    $1, (%ebx,%eax,4)
>        movl    x, %eax   // Is this a redundant one?
>        cmpl    $100, %eax
>        je      .L4
>        movl    $2, (%ebx,%eax,4)
>        addl    $8, %esp
>        .cfi_remember_state
>        .cfi_def_cfa_offset 8
>        popl    %ebx
>        .cfi_restore 3
>        .cfi_def_cfa_offset 4
>        ret
>        .p2align 4,,7
>        .p2align 3
> .L4:
>        .cfi_restore_state
>        call    f
>        movl    x, %eax
>        movl    $2, (%ebx,%eax,4)
>        addl    $8, %esp
>        .cfi_def_cfa_offset 8
>        popl    %ebx
>        .cfi_restore 3
>        .cfi_def_cfa_offset 4
>        ret
>
> Is the 2nd "movl x, %eax" redundant under a single-threaded programming
> model? If yes, can it be optimized away?
>
> Thanks,
> -Jiangning
>


Re: A new stack protector option?

2011-11-29 Thread Ye Joey
On Wed, Nov 30, 2011 at 7:53 AM, Han Shen(沈涵)  wrote:
> Hi, I propose to add to gcc a new option regarding stack protector -
> "-fstack-protector-strong", in addition to current gcc's
> "-fstack-protector-all", which protects ALL functions, and
> "-fstack-protector", which protects functions that have a big
> (signed/unsigned) char array or have alloca called.
>
> Background - sometimes stack-protector is too simple while
> stack-protector-all is overkill; for example, to build one of our core
> systems, we forcibly add "-fstack-protector-all" to all compile
> commands, which brings big performance penalty (due to extra stack
> guard/check insns on function prologue and epilogue) on both atom and
> arm. To use "-fstack-protector" is just regarded as not secure enough
> (only "protects" <2% functions) by the system secure team. So I'd like
> to add the option "-fstack-protector-strong", that hits the balance
> between "-fstack-protector" and "-fstack-protector-all".
Could you give more detail about when the proposed -strong would protect
the stack? If the new criteria are general security principles, maybe you
can just enhance -fstack-protector instead of introducing a new option.

Thanks - Joey


Re: Which Binutils should I use for performing daily regression test on trunk?

2011-12-22 Thread Ye Joey
On Thu, Dec 22, 2011 at 12:43 AM, Ian Lance Taylor  wrote:
> Terry Guo  writes:
>
>> I plan to set up daily regression test on trunk for target
>> ARM-NONE-EABI and post results to gcc-testresults mailing list. Which
>> Binutils should I use, the Binutils trunk or the latest released
>> Binutils? And which way is recommended, building from a combined tree
>> or building separately? If there is something I should pay attention
>> to, please let me know. Thanks very much.
>
> For gcc testing, the latest released binutils is normally fine.  You
> should only move to binutils trunk if there is some specific bug you
> need to work around temporarily.
>
> I personally would recommend building binutils separately.  If you
> choose to build a combined tree, then you should ignore the previous
> paragraph and always use binutils trunk.  For a combined tree you should
> always use sources from the same development date, so using gcc trunk
> implies using binutils trunk.
>
> Ian
A combined build with the latest gcc and binutils trunks has the advantage
of monitoring both trunks. I'd prefer this approach.

- Joey


RE: How to debug if scheduling in gcc is wrong?

2008-10-20 Thread Ye, Joey
袁立威 wrote:
> Hi, I'm a guy working with gcc4.1.1 on itanium2. In my work, some
> instrumentations are added by gcc. After instrumentation, all
> specint2000 benchmarks except gzip can successfully run with
> optimization flag -O3. There are some information list below:
No answer from me, but hopefully the following suggestions are useful. The
information posted here may not be sufficient for root cause analysis. Posting
the full patch would be more helpful.

As to the failure itself, I suggest you reduce it to a small case, or at least
find out exactly which function in gzip is miscompiled and split that function.
It might not be a scheduling problem. Finding exactly which instruction in the
.s file is wrong will help trace it back to the problem in your patch.

Thanks - Joey


RE: ia32 gcc-Debian 4.3.2-1 "rep ret" ?

2008-12-04 Thread Ye, Joey
Maybe the comment on the insn pattern that emits "rep\; ret" can explain it:
";; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
;; instruction Athlon and K8 have." 

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Etienne Lorrain
Sent: December 4, 2008 18:31
To: gcc@gcc.gnu.org
Subject: ia32 gcc-Debian 4.3.2-1 "rep ret" ?


 Hello,

 I did not find any documentation of a "rep ret" instruction, at
http://www.intel.com/design/processor/manuals/253667.pdf
 they just say: "The behavior of the REP prefix is undefined when used with 
non-strings instructions".

 Any pointers?
 Thanks,
 Etienne.

etienne:~$ gcc --version
gcc (Debian 4.3.2-1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

etienne:~$ cat tmp.c
void fct2(int);

void fct (int i, int a)
{
  a *= 2;
  if (i == 2)
    fct2(a);
}
etienne:~$ gcc -O2 -fomit-frame-pointer -S tmp.c -o tmp.s
etienne:~$ cat tmp.s
        .file   "tmp.c"
        .text
        .p2align 4,,15
.globl fct
        .type   fct, @function
fct:
        cmpl    $2, 4(%esp)
        movl    8(%esp), %eax
        je      .L5
        rep
        ret
        .p2align 4,,7
        .p2align 3
.L5:
        addl    %eax, %eax
        movl    %eax, 4(%esp)
        jmp     fct2
        .size   fct, .-fct
        .ident  "GCC: (Debian 4.3.2-1) 4.3.2"
        .section        .note.GNU-stack,"",@progbits
etienne:~$






How to define 2 bypasses for a single pair of insn_reservation

2009-01-05 Thread Ye, Joey
I am writing a scheduling model for the following instructions:

Insn1: mov %r1, %r2
Insn2: mov %r1, %r3
Insn3: foo %r2, %r3 (foo is a 3 op insn, for example, %r3 = %r3 << %r2)

Latency from insn1 to insn3 is x cycles, and latency from insn2 to insn3 is y 
cycles. x != y.

Both insn1 and insn2 are insn_reservation_mov. Insn3 is insn_reservation_foo.

When I tried to define bypasses for them, I found I couldn't. I can only define
one bypass from mov to foo, like this:
(define_bypass x "insn_reservation_mov" "insn_reservation_foo" "condition1")

If I define the following bypass as well, gcc reports an error:
(define_bypass y "insn_reservation_mov" "insn_reservation_foo" "condition2")

genautomata: bypass `insn_reservation_lea - insn_reservation_foo' is already 
defined

Can anyone help me with this, please?

Thanks - Joey
 


RE: How to define 2 bypasses for a single pair of insn_reservation

2009-01-05 Thread Ye, Joey
Maxim and Vladimir Wrote:
>>> Anyone can help me through this please?
>>>   
>> It was supposed to have two latency definitions at most (one in 
>> define_insn_reservation and another one in define_bypass).  That time it 
>> seemed enough for all processors supported by GCC.  It also simplified 
>> semantics definition when two bypass conditions returns true for the 
>> same insn pair.
>> 
>> If you really need more one bypass for insn pair, I could implement 
>> this.  Please, let me know.  In this case semantics of choosing latency 
>> time could be
>> 
>> o time in first bypass occurred in pipeline description whose condition 
>> returns true
>> o time given in define_insn_reservation
>
> I had a similar problem with ColdFire V4 scheduler model and the 
> solution for me was using adjust_cost() target hook; it is a bit 
> complicated, but it works fine.  Search m68k.c for 'bypass' for more 
> information, comments there describe the thing in sufficient detail.
Thanks Maxim and Vlad, I'll take a look at m68k.c before deciding whether it is
really necessary to extend the semantics.

Thanks - Joey


RE: How to define 2 bypasses for a single pair of insn_reservation

2009-01-06 Thread Ye, Joey
Maxim and Vladimir Wrote:
>>> Anyone can help me through this please?
>>>   
>> It was supposed to have two latency definitions at most (one in 
>> define_insn_reservation and another one in define_bypass).  That time it 
>> seemed enough for all processors supported by GCC.  It also simplified 
>> semantics definition when two bypass conditions returns true for the 
>> same insn pair.
>> 
>> If you really need more one bypass for insn pair, I could implement 
>> this.  Please, let me know.  In this case semantics of choosing latency 
>> time could be
>> 
>> o time in first bypass occurred in pipeline description whose condition 
>> returns true
>> o time given in define_insn_reservation
>
> I had a similar problem with ColdFire V4 scheduler model and the 
> solution for me was using adjust_cost() target hook; it is a bit 
> complicated, but it works fine.  Search m68k.c for 'bypass' for more 
> information, comments there describe the thing in sufficient detail.
Maxim, I read your implementation in m68k.c. IMHO it is a smart but tricky 
solution. For example, it depends on the assumption that 
targetm.sched.adjust_cost () is called immediately after bypass_p(). Also, the 
redundant checks and calls to min_insn_conflict_delay look inefficient. I'd 
prefer to extend the semantics to support more than one bypass.

Thanks - Joey


RE: How to define 2 bypasses for a single pair of insn_reservation

2009-01-06 Thread Ye, Joey
Vladimir Makarov [mailto:vmaka...@redhat.com] wrote:
> It was supposed to have two latency definitions at most (one in 
> define_insn_reservation and another one in define_bypass).  That time it 
> seemed enough for all processors supported by GCC.  It also simplified 
> semantics definition when two bypass conditions returns true for the 
> same insn pair.
> 
> If you really need more one bypass for insn pair, I could implement 
> this.  Please, let me know.  In this case semantics of choosing latency 
> time could be
> 
> o time in first bypass occurred in pipeline description whose condition 
> returns true
> o time given in define_insn_reservation
Maxim and I encountered the same problem, and I believe we won't be the last 
two unlucky guys. Can you please implement the extended semantics, which looks 
good to me?
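
To make the proposed semantics concrete, here is a small C model of the
selection rule: the first bypass whose condition returns true gives the
latency, otherwise the define_insn_reservation latency applies. It is only a
sketch of the rule as quoted above, not genautomata code; struct insn,
struct bypass, feeds_count, feeds_value and pick_latency are hypothetical
names invented for the illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified view of an insn: the register it writes and,
   for the 3-operand foo, the registers it reads as shift count and value. */
struct insn { int dest; int count_src; int value_src; };

typedef int (*bypass_guard)(const struct insn *producer,
                            const struct insn *consumer);

struct bypass { int latency; bypass_guard guard; };

/* Guard for "producer feeds the shift count of consumer". */
static int feeds_count(const struct insn *p, const struct insn *c)
{
    return p->dest == c->count_src;
}

/* Guard for "producer feeds the shifted value of consumer". */
static int feeds_value(const struct insn *p, const struct insn *c)
{
    return p->dest == c->value_src;
}

/* Proposed rule: latency of the first bypass, in description order, whose
   guard returns true; otherwise the default reservation latency. */
int pick_latency(const struct bypass *bypasses, size_t n, int default_latency,
                 const struct insn *producer, const struct insn *consumer)
{
    for (size_t i = 0; i < n; i++)
        if (bypasses[i].guard(producer, consumer))
            return bypasses[i].latency;
    return default_latency;
}
```

With insn1 writing %r2 and insn2 writing %r3, a table such as
{ {x, feeds_count}, {y, feeds_value} } would give x for insn1 -> insn3 and
y for insn2 -> insn3.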

Thanks - Joey



RE: How to define 2 bypasses for a single pair of insn_reservation

2009-01-06 Thread Ye, Joey
Maxim Kuvyrkov [mailto:ma...@codesourcery.com] wrote:
> Yes, it does depend on this assumption and the comment states exactly that.
What concerns me is that the assumption may be broken someday, unless the
scheduler guarantees it.

> Which check[s] do you have in mind, the gcc_assert's?  Also, out of 
> curiosity, what is inefficient about the use of min_insn_conflict_delay?
> 
> For the record, min_insn_conflict delay has nothing to do with emulating 
> two bypasses; this tweak makes scheduler faster by not adding 
> instructions to the ready list which makes haifa-sched.c:max_issue() do 
> its exhaustive-like search on a smaller set.
I admit your implementation is probably the best correct solution based on the
current semantics. I'm just too lazy to like writing that additional code and 
defining a new data structure, especially after Vladimir said he could extend
the semantics ;)

> Don't get me wrong, I'm not against adding support for N>1 bypasses; it 
> is not that easy though ;) .
No idea about the effort. But I guess you'd like to re-implement m68k with the 
2nd bypass when it is ready.

Thanks - Joey


Options of fixing biggest alignment in PR target/38736

2009-01-07 Thread Ye, Joey
This is about http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38736 and I'd rather 
discuss it on the gcc mailing list. Basically the problem is shown by the
following example:

Case 1 (on x86 or x86_64):
$ cat i.h
struct s
{
char dummy0;
// align at the maximum aligned boundary supported by this target.
char dummy __attribute__((aligned)); 
int data;
};
extern void foo(struct s*);
$ cat foo.c
#include "i.h"
void foo(struct s* input)
{ input->data = 1; }
$ cat main.c
#include "i.h"
extern void abort(void);
struct s g_s;
int main()
{
foo(&g_s);
if (g_s.data != 1) abort();
}

$ gcc -S foo.c
$ gcc -S main.c -mavx
$ gcc -o foo.exe foo.s main.s
$ ./foo.exe
Aborted

The reason is that the AVX target defines BIGGEST_ALIGNMENT as 32 bytes while
the non-AVX x86 target defines it as 16 bytes. Since __attribute__((aligned))
aligns struct members according to BIGGEST_ALIGNMENT, objects built by AVX and
non-AVX GCC end up with different struct layouts.

The options I can think of so far to solve this problem are:
Option 1: Leave BIGGEST_ALIGNMENT as it is today and modify all libraries and 
header files that use __attribute__((aligned)) in a way similar to i.h
Option 2: Define BIGGEST_ALIGNMENT as a fixed value of 16 bytes for all x86 
targets.
Option 3: Define BIGGEST_ALIGNMENT as a fixed value of 32 bytes for all x86 
targets, and extend it to 64 or more bytes in the future.

Option 1 follows the definition of __attribute__((aligned)) in the GCC manual,
and it works as expected to provide a way to align at the maximum target
alignment. However, fixing all libraries will be tedious and it is easy to miss
some. The documentation should also mention the potential issue with using this
feature.

Options 2 and 3 seem to be quick solutions, but their drawbacks are obvious.
Firstly, they don't follow the definition of __attribute__((aligned)) and can
confuse users. Secondly, they eliminate a convenient way for users to utilize
the maximum alignment supported in the x86 family. Also, very importantly, they
won't solve all problems, for example if i.h is like this:
Case 2:
$ cat i2.h
#ifdef __AVX__
#define aligned __aligned__(32)
#else
#define aligned __aligned__(16)
#endif
struct s
{
char dummy0;
char dummy __attribute__((aligned));
int data;
};
extern void foo(struct s*);

Furthermore, option 3 will result in different behavior for GCC 4.3- and GCC
4.4+: case 1 will still fail if foo.c is built by GCC 4.3- and main.c by 4.4+.

In summary, I don't see an obvious best way to solve PR38736. But IMHO 
option 1 is more reasonable.

Thanks - Joey


RE: Options of fixing biggest alignment in PR target/38736

2009-01-07 Thread Ye, Joey
From: Ian Lance Taylor [mailto:i...@google.com]:
> Therefore, I propose that we do the following:
> 
> 1) Introduce __attribute__ ((aligned (scalar))).  This will be
>documented as having a fixed value for each ABI.  The value will be
>guaranteed to be sufficient to hold any ordinary non-vector type.
>The default will be BIGGEST_ALIGNMENT.  The value for the
>x86/x86_64 will be 128.
> 
> 2) Introduce __attribute__ ((aligned (max))).  This will be documented
>as having the largest value available for any version of the
>architecture, and thus in particular it may change if new versions
>of the architecture are released.  The value will not change based
>on command line options which do not change the ABI; that is, if it
>is possible to link together two files compiled with different set
>of command line options and expect the result to work, then those
>command line options must not change the value of this attribute.
>The value will be guaranteed to be sufficient to hold any type,
>including any vector type.  The default will be BIGGEST_ALIGNMENT.
>The value for the x86/x86_64 will (presumably) be 256.
To me "new versions of the x86 architecture are released" usually means 
"changes based on a command line option". What if the default value
grows to 512 or even higher in the future?

Thanks - Joey


Suspicious missing tail call opportunity

2013-01-06 Thread Ye Joey
In the following example, the call to sbfoo isn't a tail call with -O2. GCC's
analysis concludes that the local variable may be referenced in sbfoo. Is that
a reasonable analysis? In other words, is it a legal program if bar stores
the address of local to a static variable, and sbfoo then accesses
it?

This issue causes a missed tail call opportunity in newlib, and thus
unnecessarily increases stack consumption.

a.c:
extern int sbfoo(void);
extern int bar(int *);
int foo()
{
int local = 0;
if (bar(&local)) return 0;

return sbfoo();
}

b.c:
int * g;
int bar(int *c) { g=c; return 0;}

int sbfoo() { return *g; }
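
A single-file variant of the same program shows why the analysis has to be
conservative. As far as the C lifetime rules go, local stays alive until foo
returns, so reading it through g inside sbfoo is well defined, and foo's frame
cannot be released before the call. The sketch only changes the initial value
to 7 (an arbitrary choice) so that the read is observable.

```c
static int *g;                       /* set by bar, read by sbfoo */

static int bar(int *c) { g = c; return 0; }

static int sbfoo(void) { return *g; }

int foo(void)
{
    int local = 7;                   /* arbitrary value so the read is visible */
    if (bar(&local))
        return 0;
    return sbfoo();                  /* local must still be alive during this call */
}
```

Here foo() returns 7. When bar is defined in another file, the compiler
cannot rule out that it saved the address this way.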


Re: Stellaris Non-Word-Aligned Write to SRAM Erratum

2013-01-15 Thread Ye Joey
On Fri, Jan 11, 2013 at 2:29 AM, Louis-Philippe Brais
 wrote:
> Hi all,
>
> The latest errata for Texas Instruments' Cortex-M3 family, updated
> last October [1], contains a disturbing new problem triggered by
> non-word-aligned writes to SRAM. This is the kind of errata that is
> effectively addressed with a compiler work-around. FWIW, it has
> already been addressed by a popular commercial toolchain vendor [2]. I
> was wondering if the GCC ARM maintainers were aware of this bug, and
> if somebody implemented or was working on a compiler work-around for
> this problem. I had a look at recent discussions and patches on the
> GCC mailing lists, but could not find anything. I'm looking for
> something along the lines of the -mfix-cortex-m3-ldrd fix, but for
> that new alignment write erratum.
>
> [1] http://www.ti.com/lit/er/spmz642b/spmz642b.pdf
> [2] 
> http://netstorage.iar.com/SuppDB/Public/UPDINFO/007040/arm/doc/infocenter/iccarm.ENU.html
>
> Thanks for your attention,
> LP Brais
I don't see any patch for this erratum. It should be a new option
rather than -mfix-cortex-m3-ldrd.

- Joey


Hoist across FP control register setting

2013-02-06 Thread Ye Joey
The following case attempts to set the floating point control register and
execute a floating point operation afterward. However, it doesn't work
as expected with -Os, as GCC hoists the multiply operation above the FP
control register setting.

As there is no register dependence between __set_FPSCR and the multiply,
hoisting can happen. There is indeed a structural dependence, but it can't
be expressed in GCC's semantics.

How about providing some kind of barrier that can prevent
such hoisting from happening?

int ftz;
float foo(float a, float b)
{
float r;
unsigned fpscr_orig = __get_FPSCR();
if (ftz) {
__set_FPSCR(fpscr_orig | 0x100);
r = a * b;
}
else {
__set_FPSCR(fpscr_orig & ~0x100);
r = a * b;
}
__set_FPSCR(fpscr_orig);
return r;
}
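
One workaround sometimes used for this kind of problem is an empty asm
statement that claims to modify the operands, so the multiply gets a data
dependence on something placed after the control register write. The sketch
below is only an illustration under assumptions: get_fpscr and set_fpscr are
hypothetical host-side stand-ins for __get_FPSCR and __set_FPSCR, FP_BARRIER
and mul_with_ftz are made-up names, the 0x100 mask is copied from the example
above, and whether the compiler keeps the asm ordered against the real
intrinsic depends on that intrinsic being a volatile asm itself.

```c
/* Stand-ins for the target intrinsics so the sketch builds on any host. */
static volatile unsigned fake_fpscr;
static unsigned get_fpscr(void) { return fake_fpscr; }
static void set_fpscr(unsigned v) { fake_fpscr = v; }

/* Empty asm that pretends to read and write v through memory.  Code that
   uses v afterwards cannot be scheduled before this point. */
#define FP_BARRIER(v) __asm__ __volatile__("" : "+m"(v) : : "memory")

float mul_with_ftz(float a, float b, int ftz)
{
    float r;
    unsigned orig = get_fpscr();

    set_fpscr(ftz ? (orig | 0x100u) : (orig & ~0x100u));
    FP_BARRIER(a);               /* a and b are "redefined" after the write, */
    FP_BARRIER(b);               /* so a * b cannot move above it            */
    r = a * b;
    FP_BARRIER(r);               /* force r to be computed before the restore */
    set_fpscr(orig);
    return r;
}
```

This costs a round trip through memory for each operand, which is one reason a
proper barrier in the compiler would be preferable.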


RE: [discuss] When is RBX used for base pointer?

2008-02-18 Thread Ye, Joey
On Wed, 13 Feb 2008, H.J. Lu wrote:
>>  Recent i386 use arbitrary register as GOT pointer only for leaf
>>  function.  When you call something, the GOT entry uses EBX too.
>>  We use RBX for large PIC model.  But I am with Michael here that I
>>  don't see reason why choice of register needs to be set in stone.
>>  We can probably use RBX for non-large-PIC and R12 elsewhere.

> Joey ran into issues when he didn't use a hard register to realign
> stack. It has something to do with reload. We really need some help
> here with reload.  Joey can explain it when he comes from vacation
> next week.

Michael, Jan,

When aligning the stack for functions that have dynamic stack
allocation, we must use a free callee-saved register in the prologue.
We named this hard register DRAP. It is worthwhile to emphasize that
*free* here means "free in prologue". After the prologue, a virtual
register will be used instead.

Given this definition of free, we can fix the DRAP register to simplify
the implementation. The original GCC only has limited cases that use a
callee-saved register in the prologue, such as setting the GOT pointer, as
far as I know. So choosing the DRAP register is easy: just avoid the GOT
pointer register, which is EBX in i386 and RBX in x86_64. As HJ said, R12
is a good candidate.

It will be more complicated if the GOT pointer register is not fixed. In
this case, the DRAP candidate must avoid the GOT register, or vice
versa. When does current GCC decide which register to use as GOT pointer?

Thanks - Joey


RE: [discuss] When is RBX used for base pointer?

2008-02-25 Thread Ye, Joey
Honza,

> Honza said:
> I am bit confused here.  If I wanted a free register in prologue only,
> I would probably look at the caller saved ones.  But I guess it is
> just typo.

> I don't see much value in making the register callee-saved especially
> if you say that virtual reg (pseudo?) is used afterward.
I'm sorry for the confusing wording. But I did mean a callee-saved register,
in case no caller-saved register is available. For i386, eax, edx and
ecx can all be used to pass register parameters, so there must be a
callee-saved register in reserve. Because of a bogus bug we said that only
callee-saved registers could be used. That has been clarified now.


> When you just need a temporary in prologue, I think you can go with
> RAX in most cases getting shortest code. It is used by x86-64 stdargs
> prologue and by i386 regparm. You can improve bit broken
> ix86_eax_live_at_start_p to test this. Using alternative choice if RAX
> is taken.
In case callee-save registers are available, ECX is a good candidate for
i386 because it is the last register used for parameter passing. RAX
is a good candidate for x86_64.

Thanks - Joey


RE: A proposal to align GCC stack

2008-03-20 Thread Ye, Joey
Ross, Christian,

Here are the patches to implement the idea we discussed before. Can you
take a look at it or try it?

http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01200.html
http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01199.html

Thanks - Joey


i386.md: *_mixed and *_sse

2008-04-22 Thread Ye, Joey
Hi,

In i386.md, alternative 1 of *fop_sf_comm_mixed duplicates
*fop_sf_comm_sse. Why do we define a _mixed pattern here?

(define_insn "*fop_sf_comm_mixed"
  [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
          [(match_operand:SF 1 "nonimmediate_operand" "%0,0")
           (match_operand:SF 2 "nonimmediate_operand" "fm,xm")]))]
  "TARGET_MIX_SSE_I387
   && COMMUTATIVE_ARITH_P (operands[3])
   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "* return output_387_binary_op (insn, operands);"
  [(set (attr "type")
        (if_then_else (eq_attr "alternative" "1")
           (if_then_else (match_operand:SF 3 "mult_operator" "")
              (const_string "ssemul")
              (const_string "sseadd"))
           (if_then_else (match_operand:SF 3 "mult_operator" "")
              (const_string "fmul")
              (const_string "fop"))))
   (set_attr "mode" "SF")])

(define_insn "*fop_sf_comm_sse"
  [(set (match_operand:SF 0 "register_operand" "=x")
        (match_operator:SF 3 "binary_fp_operator"
          [(match_operand:SF 1 "nonimmediate_operand" "%0")
           (match_operand:SF 2 "nonimmediate_operand" "xm")]))]
  "TARGET_SSE_MATH
   && COMMUTATIVE_ARITH_P (operands[3])
   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "* return output_387_binary_op (insn, operands);"
  [(set (attr "type")
        (if_then_else (match_operand:SF 3 "mult_operator" "")
           (const_string "ssemul")
           (const_string "sseadd")))
   (set_attr "mode" "SF")])

Thanks - Joey


Ask for help: constraints error

2008-06-20 Thread Ye, Joey
I got the following error after changing some GCC code. Can anyone give me
some hints about what's wrong here?

---
error: insn does not satisfy its constraints:
(insn:HI 690 689 1267 79 libgcc/config/libbid/bid_binarydecimal.c:146450
(parallel [
(set (mem/c:DI (plus:SI (reg:SI 2 cx [59])
(const_int -264 [0xfef8])) [1440
lC.3833+0 S8 A64])
(sign_extend:DI (reg:SI 0 ax [351])))
(clobber (reg:CC 17 flags))
(clobber (reg:SI 2 cx))
]) 123 {*extendsidi2_1} (nil))

*extendsidi2_1 is defined like this:
(define_insn "*extendsidi2_1"
  [(set (match_operand:DI 0 "nonimmediate_operand" "=*A,r,?r,?*o")
        (sign_extend:DI (match_operand:SI 1 "register_operand" "0,0,r,r")))
   (clobber (reg:CC FLAGS_REG))
   (clobber (match_scratch:SI 2 "=X,X,X,&r"))]
  "!TARGET_64BIT"
  "#")

Thanks - Joey


CFA expression failure

2008-06-24 Thread Ye, Joey
Daniel,

We generate the following DWARF2 instructions for the stack alignment
prologue. Basically we use an expression to calculate the CFA. But it runs
into segfaults in libmudflap and libjava. Do you have any hints about what's
wrong?

  DW_CFA_def_cfa: r4 (esp) ofs 4
  DW_CFA_offset: r8 (eip) at cfa-4
  DW_CFA_nop
  DW_CFA_nop

001c 002c 0020 FDE cie= pc=..0083
  DW_CFA_advance_loc: 1 to 0001
  DW_CFA_def_cfa_offset: 8
  DW_CFA_offset: r7 (edi) at cfa-8
  DW_CFA_advance_loc: 4 to 0005
  DW_CFA_def_cfa: r7 (edi) ofs 0
  DW_CFA_advance_loc: 7 to 000c
  DW_CFA_expression: r5 (ebp) (DW_OP_breg5: 0)
  DW_CFA_advance_loc: 37 to 0031
  DW_CFA_def_cfa_expression (DW_OP_breg5: -4; DW_OP_deref)
  DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8)
  DW_CFA_expression: r3 (ebx) (DW_OP_breg5: -12)

 <_Z3bariii>:
   0:   57                      push   %edi
   1:   8d 7c 24 08             lea    0x8(%esp),%edi
   5:   83 e4 e0                and    $0xffffffe0,%esp
   8:   ff 77 fc                pushl  -0x4(%edi)
   b:   55                      push   %ebp
   c:   89 e5                   mov    %esp,%ebp
   e:   81 ec 88 00 00 00       sub    $0x88,%esp
  14:   89 45 c4                mov    %eax,-0x3c(%ebp)
  17:   89 c8                   mov    %ecx,%eax
  19:   83 c0 1e                add    $0x1e,%eax
  1c:   83 e0 f0                and    $0xfffffff0,%eax
  1f:   89 5c 24 7c             mov    %ebx,0x7c(%esp)
  23:   89 b4 24 80 00 00 00    mov    %esi,0x80(%esp)
  2a:   89 bc 24 84 00 00 00    mov    %edi,0x84(%esp)
  31:   29 c4                   sub    %eax,%esp

Thanks - Joey


RE: CFA expression failure

2008-06-25 Thread Ye, Joey
It might be due to
  DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8)
  DW_CFA_expression: r3 (ebx) (DW_OP_breg5: -12)
After defining these registers via the CFA instead of r5, we got fewer
failures.

Thanks - Joey

-----Original Message-----
From: Daniel Jacobowitz [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 25, 2008 10:00 PM
To: H.J. Lu
Cc: Ye, Joey; gcc@gcc.gnu.org; Guo, Xuepeng
Subject: Re: CFA expression failure

On Tue, Jun 24, 2008 at 08:40:18PM -0700, H.J. Lu wrote:
> I think the problem is in uw_update_context_1.  REG_SAVED_EXP
> and REG_SAVED_VAL_EXP may use other registers as shown above:
> 
>DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8)
> 
> They should be handled last.  I am testing this patch. Does it
> make sense?

I think that rather than delaying such expressions, you need to
evaluate into a temporary context.  DW_OP_breg5 means the current
frame's %ebp; DW_CFA_expression: r5 describes the location of the
previous frame's %ebp.  They're different registers.  Otherwise this
is going to be too order-sensitive.

-- 
Daniel Jacobowitz
CodeSourcery
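
The temporary-context idea can be illustrated with a toy model in C. This is
not the libgcc unwinder: struct ctx, struct rule, unwind_step and the
word-indexed mem array are hypothetical simplifications. Every rule is
evaluated against the current frame's registers, and the results are committed
only after all rules have been evaluated, so the order of the rules does not
matter.

```c
enum { NREG = 8, EBX = 3, EBP = 5, ESI = 6 };

/* "Register r of the previous frame is saved at mem[cur[base] + off]". */
struct rule { int valid; int base; long off; };

struct ctx { long reg[NREG]; };

void unwind_step(struct ctx *cur, const struct rule rules[NREG],
                 const long *mem)
{
    struct ctx next = *cur;              /* temporary context */

    for (int r = 0; r < NREG; r++)
        if (rules[r].valid)
            /* Always read the base register from the current frame, never
               from a value already overwritten for the previous frame. */
            next.reg[r] = mem[cur->reg[rules[r].base] + rules[r].off];

    *cur = next;                         /* commit all registers at once */
}
```

If the loop wrote into cur directly, the rule for ebp would replace the
current frame's ebp before the rules for esi and ebx had used it.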


4.3 x86_64 Bootstrap breaks

2007-07-03 Thread Ye, Joey
With 4.3 trunk revision 126185 I got this on x86_64:

libtool: compile: unable to infer tagged configuration
libtool: compile: specify a tag with `--tag'
make[6]: *** [kill.lo] Error 1

Has anyone else seen the same?

Revision 126184 passes. It looks like the problem is in this check-in:
r126185 | kargl | 2007-07-02 10:47:21 +0800 (Mon, 02 Jul 2007) | 281 lines

Thanks - Joey 


RE: DFA Scheduler - unable to pipeline loads

2007-09-03 Thread Ye, Joey
Matt,

I just started working on pipeline descriptions and I'm confused about one
thing in your description.

For "integer", your cpu has a 1-cycle latency, but with 3 unit stages 
"issue,iu,wb". What does that mean? My understanding is that the number of 
units separated by "," should be equal to the latency. Am I right?

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
Sent: September 1, 2007 5:58
To: gcc@gcc.gnu.org
Subject: DFA Scheduler - unable to pipeline loads

Hi,

I am working with GCC-4.1.1 on a simple 5-stage scalar RISC processor
with the following description for loads and stores,

(define_insn_reservation "integer" 1
  (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
  "issue,iu,wb")

(define_insn_reservation "load" 3
  (eq_attr "type" "load")
  "issue,iu,wb")

(define_insn_reservation "store" 1
  (eq_attr "type" "store")
  "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is
expanded inline.

memcpy (&dst, &src, 16) ==>

load  1, rA + 4
store 1, rB + 4
load  2, rA + 8
store 2, rB + 8
...

Basically, instead of pipelining the loads, the current schedule
stalls the processor for two cycles on every dependent store. Here is
a dump from the .35.sched1 file.

;;   ==
;;   -- basic block 0 from 6 to 36 -- before reload
;;   ==

;;0--> 6r84=r5 :issue,iu,wb
;;1--> 13   r86=[`Ptr_Glob']   :issue,iu,wb
;;2--> 25   r92=0x5:issue,iu,wb
;;3--> 12   r85=[r84]  :issue,iu,wb
;;4--> 14   r87=[r86]  :issue,iu,wb
;;7--> 15   [r85]=r87  :issue,iu,wb
;;8--> 16   r88=[r86+0x4]  :issue,iu,wb
;;   11--> 17   [r85+0x4]=r88  :issue,iu,wb
;;   12--> 18   r89=[r86+0x8]  :issue,iu,wb
;;   15--> 19   [r85+0x8]=r89  :issue,iu,wb
;;   16--> 20   r90=[r86+0xc]  :issue,iu,wb
;;   19--> 21   [r85+0xc]=r90  :issue,iu,wb
;;   20--> 22   r91=[r86+0x10] :issue,iu,wb
;;   23--> 23   [r85+0x10]=r91 :issue,iu,wb
;;   24--> 26   [r84+0xc]=r92  :issue,iu,wb
;;   25--> 31   clobber r3 :nothing
;;   25--> 36   use r3 :nothing
;;  Ready list (final):
;;   total time = 25
;;   new head = 7
;;   new tail = 36

There is an obvious better schedule to be obtained. Here is one such
(hand-modified) schedule which just pipelines two of the loads to
obtain a shorter critical path length to the whole function (function
has only bb 0)

;;0--> 6r84=r5 :issue,iu,wb
;;1--> 13   r86=[`Ptr_Glob']   :issue,iu,wb
;;2--> 25   r92=0x5:issue,iu,wb
;;3--> 12   r85=[r84]  :issue,iu,wb
;;4--> 14   r87=[r86]  :issue,iu,wb
;;7--> 15   [r85]=r87  :issue,iu,wb
;;8--> 16   r88=[r86+0x4]  :issue,iu,wb
;;9--> 18   r89=[r86+0x8]  :issue,iu,wb
;;   10--> 20   r90=[r86+0xc]  :issue,iu,wb
;;   11--> 17   [r85+0x4]=r88  :issue,iu,wb
;;   12--> 19   [r85+0x8]=r89  :issue,iu,wb
;;   13--> 21   [r85+0xc]=r90  :issue,iu,wb
;;   14--> 22   r91=[r86+0x10] :issue,iu,wb
;;   17--> 23   [r85+0x10]=r91 :issue,iu,wb
;;   18--> 26   [r84+0xc]=r92  :issue,iu,mb_wb
;;   19--> 31   clobber r3 :nothing
;;   20--> 36   use r3 :nothing
;;  Ready list (final):
;;   total time = 20
;;   new head = 7
;;   new tail = 36

This schedule is 5 cycles faster.
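
The gap between the two orderings can be reproduced with a toy issue model in
C. It is only a sketch under assumptions and has nothing to do with the real
haifa scheduler: one instruction issues per cycle, in order, a loaded value is
ready 3 cycles after the load issues, and a store waits only for its data
register. struct op, last_issue_cycle and the two example sequences are
hypothetical names.

```c
#include <stddef.h>

enum { LOAD_LATENCY = 3, NREGS = 8 };

struct op { int is_load; int reg; };   /* reg: loaded into or stored from */

/* Cycle in which the last op of the sequence issues (-1 if it is empty). */
int last_issue_cycle(const struct op *ops, size_t n)
{
    int ready[NREGS] = {0};            /* cycle at which each register is valid */
    int cycle = -1;

    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;      /* single issue, program order */
        if (!ops[i].is_load && ready[ops[i].reg] > earliest)
            earliest = ready[ops[i].reg];   /* store waits for its data */
        cycle = earliest;
        if (ops[i].is_load)
            ready[ops[i].reg] = cycle + LOAD_LATENCY;
    }
    return cycle;
}

/* load/store pairs back to back, as in the sched1 dump */
static const struct op interleaved[8] = {
    {1, 0}, {0, 0}, {1, 1}, {0, 1}, {1, 2}, {0, 2}, {1, 3}, {0, 3}
};

/* all loads first, then the stores */
static const struct op pipelined[8] = {
    {1, 0}, {1, 1}, {1, 2}, {1, 3}, {0, 0}, {0, 1}, {0, 2}, {0, 3}
};
```

In this model the interleaved sequence issues its last store in cycle 15 and
the pipelined one in cycle 7, which matches the load-to-store spacing of 3
cycles visible in the dump above.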

I have read and re-read the material surrounding the DFA scheduler. I
understand that the heuristics optimize critical path length and not
stalls or other metrics. But in this case it is precisely the critical
path length that is shortened by the better schedule. I have been
examining various hooks available and for a while it seemed like
TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a
larger window to look for better candidates to schedule into the ready
queue. For instance, this discussion seems to say so.
http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise.
http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case the final conclusion seems
to be that the lookahead is useful only for multi-

RE: Designs for better debug info in GCC. Choice A or B?

2007-11-25 Thread Ye, Joey
I like option B. It would be very helpful for reducing software product
development time. Some software products are simply released at -O0 because
their developers are not confident releasing a version that differs from the
one they debugged and tested.

Also, on some systems -O0 simply doesn't work: the code is too slow or too big
to fit into flash memory. Developers have to suffer poor debuggability.

I believe it is valuable to have an option that generates code with fair 
performance/code size but almost full debuggability. And I believe it is not 
technically impossible.

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of J.C. Pizarro
Sent: November 25, 2007 7:46
To: gcc@gcc.gnu.org
Subject: Re: Designs for better debug info in GCC. Choice A or B?

Imagine that I'm using "-g -Os -finline-functions -funroll-loops".

There are different ways to generate "optimized AND debugged" code.

A) Whole-optimized but with dirty debugged information if possible.

When there is coredump from crash then its debugged information can
be not complete (with losses) but can be readable for humans.
This kind of strategy can't work well in "step to step" debuggers like
gdb, ddd, kgdb, ... but its code is whole-optimized same as stripped program.

B) Whole-debugged but partially optimized because of restricted requirements
to maintain the full debugged information without losses.

This kind of strategy works well in "step to step" debuggers like
gdb, ddd, kgdb, ... but its code is less whole-optimized and bigger than
stripped program.

Sincerely, J.C.Pizarro


A proposal to align GCC stack

2007-12-17 Thread Ye, Joey
-- 0. MOTIVATION --
Some local variables (such as those of __m128 type or marked with an
alignment attribute) require the stack to be aligned at a boundary larger
than the default stack boundary. Current GCC partially supports this with
limitations. We are proposing a new design to fully solve the problem.
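
A minimal sketch of the kind of local variable described above, assuming GCC's `aligned` attribute. The function names are hypothetical and only illustrate the requirement; whether the check succeeds depends on the compiler realigning the frame when the incoming stack alignment is smaller than 32 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes (must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* A local buffer that asks for 32-byte alignment, which is more than the
   default stack boundary on i386 (4 bytes) or x86_64 (16 bytes). */
int local_is_32_byte_aligned(void)
{
    char buf[64] __attribute__((aligned(32)));
    buf[0] = 1;                      /* touch the buffer so it is used */
    return is_aligned(buf, 32);
}
```

Calling `local_is_32_byte_aligned()` reports whether the compiler actually delivered the requested alignment for that frame.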


-- 1. CURRENT IMPLEMENTATION --
There are two ways current GCC supports bigger than default stack
alignment.  One is to make sure that the stack is aligned at the program
entry point, and then ensure that for each non-leaf function, its frame size
is aligned. This approach doesn't work when linking with libs or objects
compiled by other psABI-conforming compilers. Some problems are logged
as PR 33721. The other is to adjust stack alignment at the entry point of a
function if it is marked with __attribute__ ((force_align_arg_pointer))
or the -mstackrealign option is provided. This method guarantees the
alignment in most cases, but with the following problems and limitations:

*  Only 16 bytes alignment is supported
*  Adjusting stack alignment at each function prologue hurts performance
unnecessarily, because not all functions need bigger alignment. In fact,
commonly only those functions which have SSE variables defined locally
(either declared by the user or compiler generated internal temporary
variables) need corresponding alignment.
*  Doesn't support x86_64 for the cases when required stack alignment
is > 16 bytes
*  Emits inefficient and complicated prologue/epilogue code to adjust
stack alignment
*  Doesn't work with nested functions
*  Has a bug handling register parameters, which resulted in a cpu2006
failure. A patch is available as a workaround.

-- 2. NEW PROPOSAL: DESIGN --
Here, we propose a new design to fully support stack alignment while
overcoming above problems. The new design will
*  Support arbitrary alignment value, including 4,8,16,32...
*  Adjust function stack alignment only when necessary
*  Initial development will be on i386 and x86_64, but can be extended
to other platforms
*  Emit more efficient prologue/epilogue code
*  Coexist with special features like dynamic stack allocation (alloca),
nested functions, register parameter passing, PIC code and tail call
optimization
*  Be able to debug and unwind stack

2.1 Support arbitrary alignment value
Different source code and optimizations require different stack
alignment, as in the following table:

Feature         Alignment (bytes)
i386_ABI        4
x86_64_ABI      16
char            1
short           2
int             4
long            4/8*
long long       8
__m64           8
__m128          16
float           4
double          8
long double     4/16*
user specified  any power of 2

*Note: 4 for i386, 8/16 for x86_64
The new design will support any alignment value in this table.

2.2 Adjust function stack alignment only when necessary

Current GCC defines the following macros related to stack alignment:
i. STACK_BOUNDARY in bits, which is enforced by hardware, 32 for i386
and 64 for x86_64. It is the minimum stack boundary. It is fixed.
ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a
function. It may be set at command line and has no impact on stack
alignment at function entry. This proposal requires PREFERRED >= STACK,
and by default it is set to ABI_STACK_BOUNDARY.

This design will define a few more macros, or concepts not explicitly
defined in code:
iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
by the psABI, 32 for i386 and 128 for x86_64.  ABI_STACK_BOUNDARY >=
STACK_BOUNDARY. It is fixed for a given psABI.
iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack
alignment requirement, which depends on the alignment of its stack
variables: LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack
variable).
v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
function entry. If a function is marked with
__attribute__ ((force_align_arg_pointer)) or the -mstackrealign option is
provided, INCOMING = STACK_BOUNDARY. Otherwise,
INCOMING == MIN(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) because a
function can be called via psABI externally or called locally with
PREFERRED_STACK_BOUNDARY.
vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment required
by local variables and by calling other functions. REQUIRED_STACK_ALIGNMENT
== MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a non-leaf
function. For a leaf function, REQUIRED_STACK_ALIGNMENT ==
LOCAL_STACK_BOUNDARY.

This proposal won't adjust stack when INCOMING_STACK_BOUNDARY >=
REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
REQUIRED_STACK_ALIGNMENT, it will adjust stack to REQUIRED_STACK_ALIGNMENT
at prologue.
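
The decision rule of section 2.2 can be sketched in C as follows. This is only an illustration of the definitions as worded in this version of the proposal, not GCC code: the struct and function names are hypothetical, and all boundaries are in bits.

```c
#include <assert.h>

/* Per-function inputs, all in bits, named after the proposal's macros. */
struct stack_info {
    unsigned abi;        /* ABI_STACK_BOUNDARY */
    unsigned preferred;  /* PREFERRED_STACK_BOUNDARY */
    unsigned stack;      /* STACK_BOUNDARY */
    unsigned local;      /* LOCAL_STACK_BOUNDARY */
    int is_leaf;         /* nonzero if the function calls nothing */
    int force_realign;   /* force_align_arg_pointer or -mstackrealign */
};

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Item v: INCOMING_STACK_BOUNDARY. */
unsigned incoming_boundary(const struct stack_info *f)
{
    return f->force_realign ? f->stack : min_u(f->abi, f->preferred);
}

/* Item vi: REQUIRED_STACK_ALIGNMENT. */
unsigned required_alignment(const struct stack_info *f)
{
    return f->is_leaf ? f->local : max_u(f->local, f->preferred);
}

/* The prologue realigns only when INCOMING < REQUIRED. */
int needs_realign(const struct stack_info *f)
{
    assert(f->preferred >= f->stack);  /* the proposal requires PREFERRED >= STACK */
    return incoming_boundary(f) < required_alignment(f);
}
```

For example, on i386 with all three global boundaries at 32 bits, a leaf function with an __m128 local (128 bits) needs realignment, while a non-leaf function whose widest local is an int does not.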

2.3 Initial development on i386 and x86_64
We initially support i386 and x86_64. In this document we focus more on
i386 because it is harder to implement, given the restriction of having
a small register file.  But all that we discuss can be easily applied
to x86_64.

2.4 Emit more efficient prologue/epil

RE: A proposal to align GCC stack

2007-12-17 Thread Ye, Joey
Ross, HJ,

> 
> >Because I386 PIC requires BX as GOT pointer and I386 may use AX, DX
> >and CX as parameter passing registers, there are limited candidates for
> >this proposal to choose. Current proposal suggests EDI, because it won't
> >conflict with i386 PIC or regparm.
> 
> Could you pick a call-clobbered register in cases where one is available?
I think it is doable. In the current Apple engineer's code to support
-mstackrealign, hard register ECX is used. We need to add additional code to
find which caller-saved register is not used to pass parameters. If none of
them is available, we still have to use a callee-saved register like EDI.

> 
> >//  Reserve two stack slots and save return address 
> >//  and previous frame pointer into them. By
> >//  pointing new ebp to them, we build a pseudo 
> >//  stack for unwinding
> 
> Hmmm... I don't know much about the DWARF unwind information, but
> couldn't it handle this case without creating the "pseudo frame"?
> Or at least be extended so it could?

I haven't spent time investigating it yet. I agree it would be much more
beautiful without the "pseudo frame". I will be happy if a solution can be
found or suggested here. But I doubt it is worth the effort. Remember that
the "pseudo frame" is generated only when both stack adjustment and alloca
are present. That may not be common enough to impact performance.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of H.J. Lu
Sent: 2007-12-18 13:17
To: Ross Ridge
Cc: gcc@gcc.gnu.org
Subject: Re: A proposal to align GCC stack

On Mon, Dec 17, 2007 at 11:25:35PM -0500, Ross Ridge wrote:
> Ye, Joey writes:
> >i. STACK_BOUNDARY in bits, which is enforced by hardware, 32 for i386
> >and 64 for x86_64. It is the minimum stack boundary. It is fixed.
> 
> Strictly speaking by the above definition it would be 8 for i386.
> The hardware doesn't force the stack to be 32-bit aligned, it just
> performs poorly if it isn't.

We can change the wording.

> 
> >v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary
> >at function entry. If a function is marked with __attribute__
> >((force_align_arg_pointer)) or -mstackrealign option is provided,
> >INCOMING = STACK_BOUNDARY.  Otherwise, INCOMING == MIN(ABI_STACK_BOUNDARY,
> >PREFERRED_STACK_BOUNDARY) because a function can be called via psABI
> >externally or called locally with PREFERRED_STACK_BOUNDARY.
> 
> This section doesn't make sense to me.  The force_align_arg_pointer
> attribute and -mstackrealign assume that the ABI is being
> followed, while the -fpreferred-stack-boundary option effectively

According to Apple engineer who implemented the -mstackrealign,
on MacOS/ia32, psABI is 16byte, but -mstackrealign will assume
4byte, which is STACK_BOUNDARY.

> changes the ABI.  According to your definitions, I would think
> that INCOMING should be ABI_STACK_BOUNDARY in the first case,
> and MAX(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) in the second.

That isn't true since some .o files may not be compiled with
-fpreferred-stack-boundary or with a different value of
-fpreferred-stack-boundary.

> (Or just PREFERRED_STACK_BOUNDARY because a boundary less than the ABI's
> should be rejected during command line processing.)

On x86-64, ABI_STACK_BOUNDARY is 16byte, but the Linux kernel may
want to use 8 byte for PREFERRED_STACK_BOUNDARY.

> 
> >vi. REQUIRED_STACK_ALIGNMENT in bits, which is stack alignment required
> >by local variables and calling other function. REQUIRED_STACK_ALIGNMENT
> >== MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a
> >non-leaf function. For a leaf function, REQUIRED_STACK_ALIGNMENT ==
> >LOCAL_STACK_BOUNDARY.
> 
> Hmm... I think you should define STACK_BOUNDARY as the minimum
> alignment that ABI requires the stack pointer to keep at all times.
> ABI_STACK_BOUNDARY should be defined as the stack alignment the
> ABI requires at function entry.  In that case a leaf function's
> REQUIRED_STACK_ALIGNMENT should be MAX(LOCAL_STACK_BOUNDARY,
> STACK_BOUNDARY).

That is true since if the only local variable is char, LOCAL_STACK_BOUNDARY
will be 1. But we want the stack to be aligned at STACK_BOUNDARY.
We will update our proposal. 



H.J.


RE: A proposal to align GCC stack

2007-12-18 Thread Ye, Joey
 
Ross Ridge wrote:
> I'm currently using -fpreferred-stack-boundary without any trouble.
> Your proposal would in fact generate code to align stack when it's not
> necessary.  This would change the behaviour of -fpreferred-stack-boundary,
> hurting performance and that's unacceptable to me.
This proposal puts correctness first. So when the compiler can't
make sure a function is only called from functions with the same or bigger
preferred-stack-boundary, it will conservatively align the stack. One
optimization is to set INCOMING = PREFERRED for local functions. Do you think
that is more acceptable?

>> Ok, if people are using this flag to change the alignment to something
>> smaller than used by the standard ABI, then INCOMING should be
>> MAX(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY).
>
> On x86-64, ABI_STACK_BOUNDARY is 16byte, but the Linux kernel may
> want to use 8 byte for PREFERRED_STACK_BOUNDARY. INCOMING will
> be MIN(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) == 8 byte.

> Using MAX(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) also equals 8 in that
> case and preserves the behaviour -fpreferred-stack-boundary in every case.
I think HJ means MIN(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY). 
MAX(ABI, PREFERRED) == 16 in this case.
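
The arithmetic in this exchange can be checked with a short C sketch. The struct and function names are hypothetical; values are in bits, so the 8-byte and 16-byte figures above appear as 64 and 128.

```c
#include <assert.h>

/* The three candidate formulas for INCOMING discussed in the thread. */
struct incoming_candidates {
    unsigned min_abi_preferred;    /* MIN(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) */
    unsigned max_stack_preferred;  /* MAX(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) */
    unsigned max_abi_preferred;    /* MAX(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) */
};

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

struct incoming_candidates
evaluate_candidates(unsigned stack, unsigned abi, unsigned preferred)
{
    struct incoming_candidates c;

    assert(abi >= stack);  /* the proposal states ABI_STACK_BOUNDARY >= STACK_BOUNDARY */
    c.min_abi_preferred = min_u(abi, preferred);
    c.max_stack_preferred = max_u(stack, preferred);
    c.max_abi_preferred = max_u(abi, preferred);
    return c;
}
```

With the x86-64 kernel values (STACK 64, ABI 128, PREFERRED 64), the first two formulas both give 64 bits and the third gives 128 bits. The first two diverge once PREFERRED is raised above the ABI, for instance i386 with PREFERRED at 128 bits: MIN(ABI, PREFERRED) is 32 while MAX(STACK, PREFERRED) is 128.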

Thanks - Joey


A proposal to align GCC stack - update

2007-12-19 Thread Ye, Joey
Thanks to Ross and HJ for their comments. Here is the updated proposal:

Changes:
- value of REQUIRED_STACK_ALIGNMENT of leaf function
- value of INCOMING_STACK_BOUNDARY

-- 0. MOTIVATION --
Some local variables (such as those of __m128 type or marked with an
alignment attribute) require the stack to be aligned at a boundary larger
than the default stack boundary. Current GCC partially supports this with
limitations. We are proposing a new design to fully solve the problem.


-- 1. CURRENT IMPLEMENTATION --
There are two ways current GCC supports bigger than default stack
alignment.  One is to make sure that the stack is aligned at the program
entry point, and then ensure that for each non-leaf function, its frame size
is aligned. This approach doesn't work when linking with libs or objects
compiled by other psABI-conforming compilers. Some problems are logged
as PR 33721. The other is to adjust stack alignment at the entry point of a
function if it is marked with __attribute__ ((force_align_arg_pointer))
or the -mstackrealign option is provided. This method guarantees the
alignment in most cases, but with the following problems and limitations:

*  Only 16 bytes alignment is supported
*  Adjusting stack alignment at each function prologue hurts performance
unnecessarily, because not all functions need bigger alignment. In fact,
commonly only those functions which have SSE variables defined locally
(either declared by the user or compiler generated internal temporary
variables) need corresponding alignment.
*  Doesn't support x86_64 for the cases when required stack alignment
is > 16 bytes
*  Emits inefficient and complicated prologue/epilogue code to adjust
stack alignment
*  Doesn't work with nested functions
*  Has a bug handling register parameters, which resulted in a cpu2006
failure. A patch is available as a workaround.

-- 2. NEW PROPOSAL: DESIGN --
Here, we propose a new design to fully support stack alignment while
overcoming above problems. The new design will
*  Support arbitrary alignment value, including 4,8,16,32...
*  Adjust function stack alignment only when necessary
*  Initial development will be on i386 and x86_64, but can be extended
to other platforms
*  Emit more efficient prologue/epilogue code
*  Coexist with special features like dynamic stack allocation (alloca),
nested functions, register parameter passing, PIC code and tail call
optimization
*  Be able to debug and unwind stack

2.1 Support arbitrary alignment value
Different source code and optimizations require different stack
alignment, as in the following table:

Feature         Alignment (bytes)
i386_ABI        4
x86_64_ABI      16
char            1
short           2
int             4
long            4/8*
long long       8
__m64           8
__m128          16
float           4
double          8
long double     16
user specified  any power of 2

*Note: 4 for i386, 8 for x86_64
The new design will support any alignment value in this table.

2.2 Adjust function stack alignment only when necessary

Current GCC defines the following macros related to stack alignment:
i. STACK_BOUNDARY in bits, which is preferred by hardware, 32 for i386
and 64 for x86_64. It is the minimum stack boundary. It is fixed.
ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a
function. It may be set at command line and has no impact on stack
alignment at function entry. This proposal requires PREFERRED >= STACK,
and by default it is set to ABI_STACK_BOUNDARY.

This design will define a few more macros, or concepts not explicitly
defined in code:
iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
by the psABI, 32 for i386 and 128 for x86_64.  ABI_STACK_BOUNDARY >=
STACK_BOUNDARY. It is fixed for a given psABI.
iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack
alignment requirement, which depends on the alignment of its stack
variables: LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack
variable).
v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
function entry. If a function is marked with
__attribute__ ((force_align_arg_pointer)) or the -mstackrealign option is
provided, INCOMING = STACK_BOUNDARY. Otherwise,
INCOMING == PREFERRED_STACK_BOUNDARY. For those functions whose
PREFERRED is larger than ABI, it is the caller's responsibility to
invoke them with the appropriate PREFERRED.
vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment required
by local variables and by calling other functions. REQUIRED_STACK_ALIGNMENT
== MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a non-leaf
function. For a leaf function, REQUIRED_STACK_ALIGNMENT ==
MAX(LOCAL_STACK_BOUNDARY,STACK_BOUNDARY).

This proposal won't adjust stack when INCOMING_STACK_BOUNDARY >=
REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
REQUIRED_STACK_ALIGNMENT, it will adjust stack to REQUIRED_STACK_ALIGNMENT
at prologue.
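
A small C sketch of the two values changed in this update, following items v and vi as worded here. The function names are hypothetical, boundaries are in bits, and this is an illustration rather than GCC code.

```c
#include <assert.h>

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Item v (updated): INCOMING is STACK_BOUNDARY when realignment is forced,
   otherwise PREFERRED_STACK_BOUNDARY. */
unsigned incoming_updated(int force_realign, unsigned stack, unsigned preferred)
{
    assert(preferred >= stack);  /* the proposal requires PREFERRED >= STACK */
    return force_realign ? stack : preferred;
}

/* Item vi (updated): a leaf function is now floored at STACK_BOUNDARY,
   a non-leaf function at PREFERRED_STACK_BOUNDARY. */
unsigned required_updated(int is_leaf, unsigned local,
                          unsigned stack, unsigned preferred)
{
    return max_u(local, is_leaf ? stack : preferred);
}

/* The prologue realigns only when INCOMING < REQUIRED. */
int realign_updated(int force_realign, int is_leaf, unsigned local,
                    unsigned stack, unsigned preferred)
{
    return incoming_updated(force_realign, stack, preferred)
           < required_updated(is_leaf, local, stack, preferred);
}
```

A leaf function on i386 whose only local is a char (8 bits) now requires 32 bits rather than 8, which is the case raised in the review. With forced realignment and an __m128 local, INCOMING is 32 and REQUIRED is 128, so the prologue realigns.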

2.3 Initial development on i386 and x86_64
We initially support i386 and x86_64. In this document we focus more on
i386 becau

RE: A proposal to align GCC stack

2007-12-20 Thread Ye, Joey
Ye, Joey writes:
>> This proposal values correctness at first place. So when compile can't
>> make sure a function is only called from functions with the same or bigger
>> preferred-stack-boundary, it will conservatively align the stack. One
>> optimization is to set INCOMING = PREFERRED for local functions. Do you
>> think it more acceptable?

Ross Ridge wrote:
> Not really.  It might reduce the amount of unnecessary stack adjustment,
> but the performance regression would remain.  Changing the behaviour of
> -fpreferred-stack-boundary doesn't make it more correct.  It's supposed
> to change the ABI, it works as documented and, yes, if it's misused it
> will cause problems.  So will any number of GCC's ABI changing options.

> Look at it another way.  Let's say you were compiling x86_64 code with
> -fpreferred-stack-boundary=3, an 8-byte PREFERRED alignment.  As you
> know, this is different from the standard x86_64 ABI which requires a
> 16-byte alignment.  Now with your proposal, GCC's behaviour won't
> change, because it's safe to assume that incoming stack is at least
> 8-byte aligned.  There should be no change in the code GCC generates,
> with or without your proposal.  However, the outgoing stack won't be
> 16-byte aligned as the x86_64 ABI requires.  In this case, what also
> doesn't change is the fact that mixing code compiled with different
> -fpreferred-stack-boundary values doesn't work.  It's just as problematic
> and unsafe as it was before.

> So when you said "this proposal values correctness at first place",
> that really isn't true.  The proposal only addresses safety when
> preferred alignment is raised from the standard ABI's alignment.  You're
> conservatively aligning the incoming stack, but not the outgoing stack.
> You don't seem to be concerned about the problems that can arise when
> the preferred is raised above the ABI's.  Why?  My guess is that because
> "correctness" in this case would cause unacceptable regressions when
> compiling the x86_64 Linux kernel.
You are right. My proposal doesn't guarantee 100% correctness. In case
of PREFERRED < ABI, we hope the author knows what will happen.
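
For reference, the option value in this example is an exponent: GCC documents the option (as -mpreferred-stack-boundary=num) as keeping the stack aligned to 2 raised to num bytes. A minimal sketch of the conversion, with hypothetical helper names:

```c
#include <assert.h>

/* Byte alignment selected by a preferred-stack-boundary value of n. */
unsigned boundary_bytes(unsigned n)
{
    assert(n < 28);  /* keeps the bit count below within a 32-bit unsigned */
    return 1u << n;
}

/* The same boundary in bits, the unit used by the proposal's macros. */
unsigned boundary_bits(unsigned n)
{
    return boundary_bytes(n) * 8u;
}
```

So a value of 3 selects the 8-byte (64-bit) alignment discussed above, and a value of 4 selects the 16-byte (128-bit) alignment that the x86_64 psABI requires.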

> If you can understand why it would be unacceptable to change how
> -fpreferred-stack-boundary behaves when compiling the Linux kernel,
> then maybe you can understand why I don't find it acceptable for it to
> change when compiling my code.
I think I understand now. My updated proposal sets
INCOMING == PREFERRED, and -fpreferred-stack-boundary works
the same as before.

Thanks - Joey


RE: A proposal to align GCC stack

2007-12-20 Thread Ye, Joey
Andrew,

My proposal is not meant to be limited to i386/x86_64. Would you please
spend some time reviewing it to see whether it can solve the problem on
PowerPC? Your comments are welcome.

Thanks - Joey  

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Pinski
Sent: 2007-12-19 18:07
To: Ross Ridge
Cc: gcc@gcc.gnu.org
Subject: Re: A proposal to align GCC stack

On 12/18/07, Ross Ridge <[EMAIL PROTECTED]> wrote:
> Look at it another way.  Lets say you were compiling x86_64 code with
> -fpreferred-stack-boundary=3, an 8-byte PREFERRED alignment.

Can we stop talking about x86/x86_64-specific issues here?  I have a
use case for the PowerPC side of the Cell BE for variables greater
than the normal stack boundary alignment of 16 bytes.  They need to be
128-byte aligned for DMA transferring to the SPUs.

I already proposed a patch [1] to fix this use case but I have not
seen many replies yet.


Thanks,
Andrew Pinski

[1] http://gcc.gnu.org/ml/gcc-patches/2007-05/msg01167.html


RE: Re: A proposal to align GCC stack

2007-12-23 Thread Ye, Joey
Christian Schüler writes:

> Please go forward with this idea!

> The current implementation of force_align_arg_pointer has never worked for me.
This proposal should solve your problem. But to confirm, I'd like to know the
root cause. force_align_arg_pointer should have guaranteed 16-byte alignment.
Are you using data structures that require alignment larger than 16 bytes? Or
maybe you didn't specify force_align_arg_pointer for all of your functions?

Thanks - Joey