Re: A question about redundant load elimination
From the tree dump we can see that there are two loads of x: one feeds an unsigned conversion and the other is used as a signed value. I guess that is the reason the second load is not eliminated. Apparently there is room for improvement, though.

  int prephitmp.8;
  int * D.2027;
  unsigned int D.2026;
  unsigned int x.1;
  int x.0;

  # BLOCK 2 freq:1
  # PRED: ENTRY [100.0%] (fallthru,exec)
  x.0_1 = x;
  x.1_2 = (unsigned int) x.0_1; // unsigned move
  D.2026_3 = x.1_2 * 4;
  D.2027_5 = a_4(D) + D.2026_3;
  *D.2027_5 = 1;
  prephitmp.8_6 = x; // signed move

On Mon, Nov 14, 2011 at 4:01 PM, Jiangning Liu wrote:
> Hi,
>
> For this test case,
>
> int x;
> extern void f(void);
>
> void g(int *a)
> {
>   a[x] = 1;
>   if (x == 100)
>     f();
>   a[x] = 2;
> }
>
> For trunk, the x86 assembly code is like below,
>
>   movl x, %eax
>   movl 16(%esp), %ebx
>   movl $1, (%ebx,%eax,4)
>   movl x, %eax // Is this a redundant one?
>   cmpl $100, %eax
>   je .L4
>   movl $2, (%ebx,%eax,4)
>   addl $8, %esp
>   .cfi_remember_state
>   .cfi_def_cfa_offset 8
>   popl %ebx
>   .cfi_restore 3
>   .cfi_def_cfa_offset 4
>   ret
>   .p2align 4,,7
>   .p2align 3
> .L4:
>   .cfi_restore_state
>   call f
>   movl x, %eax
>   movl $2, (%ebx,%eax,4)
>   addl $8, %esp
>   .cfi_def_cfa_offset 8
>   popl %ebx
>   .cfi_restore 3
>   .cfi_def_cfa_offset 4
>   Ret
>
> Is the 2nd "movl x, %eax" is a redundant one for single thread programming
> model? If yes, can this be optimized away?
>
> Thanks,
> -Jiangning
Re: A new stack protector option?
On Wed, Nov 30, 2011 at 7:53 AM, Han Shen(沈涵) wrote:
> Hi, I propose to add to gcc a new option regarding stack protector -
> "-fstack-protector-strong", in addition to current gcc's
> "-fstack-protector-all", which protects ALL functions, and
> "-fstack-protector", which protects functions that have a big
> (signed/unsigned) char array or have alloca called.
>
> Background - some times stack-protector is too-simple while
> stack-protector-all over-kills, for example, to build one of our core
> systems, we forcibly add "-fstack-protector-all" to all compile
> commands, which brings big performance penalty (due to extra stack
> guard/check insns on function prologue and epilogue) on both atom and
> arm. To use "-fstack-protector" is just regarded as not secure enough
> (only "protects" <2% functions) by the system secure team. So I'd like
> to add the option "-fstack-protector-strong", that hits the balance
> between "-fstack-protector" and "-fstack-protector-all".

Are there any further details about when the proposed -strong variant will protect the stack? If the new criteria are general security principles, maybe you can just enhance -fstack-protector instead of introducing a new option.

Thanks - Joey
Re: Which Binutils should I use for performing daily regression test on trunk?
On Thu, Dec 22, 2011 at 12:43 AM, Ian Lance Taylor wrote: > Terry Guo writes: > >> I plan to set up daily regression test on trunk for target >> ARM-NONE-EABI and post results to gcc-testresults mailing list. Which >> Binutils should I use, the Binutils trunk or the latest released >> Binutils? And which way is recommended, building from a combined tree >> or building separately? If there is something I should pay attention >> to, please let me know. Thanks very much. > > For gcc testing, the latest released binutils is normally fine. You > should only move to binutils trunk if there is some specific bug you > need to work around temporarily. > > I personally would recommend building binutils separately. If you > choose to build a combined tree, then you should ignore the previous > paragraph and always use binutils trunk. For a combined tree you should > always use sources from the same development date, so using gcc trunk > implies using binutils trunk. > > Ian Combined build with latest gcc and binutils trunk has the advantage of monitoring both trunks. I'd prefer this approach. - Joey
RE: How to debug if scheduling in gcc is wrong?
袁立威 wrote:
> Hi, I'm a guy working with gcc4.1.1 on itanium2. In my work, some
> instrumentations are added by gcc. After instrumentation, all
> specint2000 benchmarks except gzip can successfully run with
> optimization flag -O3. There are some information list below:

I have no answer, but hopefully the following suggestions are useful. The information posted here may not be sufficient for root cause analysis; posting the full patch would be more helpful.

As for the failure itself, I suggest you reduce it to a small test case, or at least find out exactly which function in gzip is miscompiled and split that function out. It might not be a scheduling problem. Finding exactly which instruction in the .s file is wrong will help trace back to the problem in your patch.

Thanks - Joey
RE: ia32 gcc-Debian 4.3.2-1 "rep ret" ?
Maybe the comment on the insn pattern that emits "rep\; ret" can explain it:

";; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
;; instruction Athlon and K8 have."

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Etienne Lorrain
Sent: December 4, 2008 18:31
To: gcc@gcc.gnu.org
Subject: ia32 gcc-Debian 4.3.2-1 "rep ret" ?

Hello,

I did not find any documentation of a "rep ret" instruction, at http://www.intel.com/design/processor/manuals/253667.pdf they just say: "The behavior of the REP prefix is undefined when used with non-strings instructions". Any pointers?

Thanks, Etienne.

etienne:~$ gcc --version
gcc (Debian 4.3.2-1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

etienne:~$ cat tmp.c
void fct2(int);
void fct (int i, int a)
{
  a *= 2;
  if (i == 2)
    fct2(a);
}
etienne:~$ gcc -O2 -fomit-frame-pointer -S tmp.c -o tmp.s
etienne:~$ cat tmp.s
        .file "tmp.c"
        .text
        .p2align 4,,15
.globl fct
        .type fct, @function
fct:
        cmpl $2, 4(%esp)
        movl 8(%esp), %eax
        je .L5
        rep
        ret
        .p2align 4,,7
        .p2align 3
.L5:
        addl %eax, %eax
        movl %eax, 4(%esp)
        jmp fct2
        .size fct, .-fct
        .ident "GCC: (Debian 4.3.2-1) 4.3.2"
        .section .note.GNU-stack,"",@progbits
etienne:~$
How to define 2 bypasses for a single pair of insn_reservation
When I write a scheduling model for the following instructions:

Insn1: mov %r1, %r2
Insn2: mov %r1, %r3
Insn3: foo %r2, %r3 (foo is a 3 op insn, for example, %r3 = %r3 << %r2)

the latency from insn1 to insn3 is x cycles and the latency from insn2 to insn3 is y cycles, with x != y. Both insn1 and insn2 are insn_reservation_mov. Insn3 is insn_reservation_foo.

When I tried to define bypasses for them, I found I couldn't. I can only define one bypass from mov to foo, like this:

(define_bypass x "insn_reservation_mov" "insn_reservation_foo" "condition1")

If I also define the following bypass, gcc reports an error:

(define_bypass y "insn_reservation_mov" "insn_reservation_foo" "condition2")

genautomata: bypass `insn_reservation_lea - insn_reservation_foo' is already defined

Can anyone help me with this, please?

Thanks - Joey
RE: How to define 2 bypasses for a single pair of insn_reservation
Maxim and Vladimir wrote:
>>> Anyone can help me through this please?
>>>
>> It was supposed to have two latency definitions at most (one in
>> define_insn_reservation and another one in define_bypass). That time it
>> seemed enough for all processors supported by GCC. It also simplified
>> semantics definition when two bypass conditions returns true for the
>> same insn pair.
>>
>> If you really need more one bypass for insn pair, I could implement
>> this. Please, let me know. In this case semantics of choosing latency
>> time could be
>>
>> o time in first bypass occurred in pipeline description whose condition
>>   returns true
>> o time given in define_insn_reservation
>
> I had a similar problem with ColdFire V4 scheduler model and the
> solution for me was using adjust_cost() target hook; it is a bit
> complicated, but it works fine. Search m68k.c for 'bypass' for more
> information, comments there describe the thing in sufficient detail.

Thanks Maxim and Vlad. I'll take a look at m68k.c before deciding whether extending the semantics is really needed.

Thanks - Joey
RE: How to define 2 bypasses for a single pair of insn_reservation
Maxim and Vladimir wrote:
>>> Anyone can help me through this please?
>>>
>> It was supposed to have two latency definitions at most (one in
>> define_insn_reservation and another one in define_bypass). That time it
>> seemed enough for all processors supported by GCC. It also simplified
>> semantics definition when two bypass conditions returns true for the
>> same insn pair.
>>
>> If you really need more one bypass for insn pair, I could implement
>> this. Please, let me know. In this case semantics of choosing latency
>> time could be
>>
>> o time in first bypass occurred in pipeline description whose condition
>>   returns true
>> o time given in define_insn_reservation
>
> I had a similar problem with ColdFire V4 scheduler model and the
> solution for me was using adjust_cost() target hook; it is a bit
> complicated, but it works fine. Search m68k.c for 'bypass' for more
> information, comments there describe the thing in sufficient detail.

Maxim, I read your implementation in m68k.c. IMHO it is a smart but tricky solution. For example, it depends on the assumption that targetm.sched.adjust_cost () is called immediately after bypass_p(). Also, the redundant checks and the calls to min_insn_conflict_delay look inefficient. I'd prefer to extend the semantics to support more than one bypass.

Thanks - Joey
RE: How to define 2 bypasses for a single pair of insn_reservation
Vladimir Makarov [mailto:vmaka...@redhat.com] wrote:
> It was supposed to have two latency definitions at most (one in
> define_insn_reservation and another one in define_bypass). That time it
> seemed enough for all processors supported by GCC. It also simplified
> semantics definition when two bypass conditions returns true for the
> same insn pair.
>
> If you really need more one bypass for insn pair, I could implement
> this. Please, let me know. In this case semantics of choosing latency
> time could be
>
> o time in first bypass occurred in pipeline description whose condition
>   returns true
> o time given in define_insn_reservation

Maxim and I encountered the same problem, and I believe we won't be the last two unlucky guys. Can you please implement the extended semantics? They look good to me.

Thanks - Joey
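
The lookup order Vladimir proposes (the first bypass in description order whose condition is true, otherwise the define_insn_reservation latency) can be sketched in plain C. This is only a model of the proposed semantics, not genautomata or scheduler code: the struct, the guard functions, the register numbers, and the latencies (x = 2, y = 3, default 1) are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guard: given the register the producer writes and the two
   registers the consumer reads, say whether this bypass applies.  */
typedef int (*bypass_guard)(int producer_dest, int consumer_src1,
                            int consumer_src2);

struct bypass {
    int latency;        /* latency used when the guard returns true */
    bypass_guard guard; /* stands in for the define_bypass condition */
};

/* "condition1": the producer feeds the consumer's first source.  */
int feeds_src1(int dest, int src1, int src2) { (void)src2; return dest == src1; }

/* "condition2": the producer feeds the consumer's second source.  */
int feeds_src2(int dest, int src1, int src2) { (void)src1; return dest == src2; }

/* Proposed semantics: take the first bypass, in description order, whose
   guard is true; otherwise fall back to the reservation's own latency.  */
int pick_latency(const struct bypass *list, size_t n, int default_latency,
                 int dest, int src1, int src2)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (list[i].guard(dest, src1, src2))
            return list[i].latency;
    return default_latency;
}

/* Illustrative register numbers for %r2, %r3 and an unrelated %r4.  */
enum { R2 = 2, R3 = 3, R4 = 4 };

/* Latency from a mov writing producer_dest to "foo %r2, %r3".  */
int latency_to_foo(int producer_dest)
{
    static const struct bypass mov_to_foo[] = {
        { 2, feeds_src1 }, /* x: mov result is the %r2 operand of foo */
        { 3, feeds_src2 }, /* y: mov result is the %r3 operand of foo */
    };
    return pick_latency(mov_to_foo, sizeof mov_to_foo / sizeof mov_to_foo[0],
                        1, producer_dest, R2, R3);
}
```

With these made-up numbers, a mov into %r2 gets latency 2, a mov into %r3 gets latency 3, and a mov into any other register falls back to the default of 1. Because the first matching entry wins, the order of the bypasses in the description decides the result whenever two conditions are true at once.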
RE: How to define 2 bypasses for a single pair of insn_reservation
Maxim Kuvyrkov [mailto:ma...@codesourcery.com] wrote:
> Yes, it does depend on this assumption and the comment states exactly that.

What concerns me is that the assumption may be broken someday, unless the scheduler guarantees it.

> Which check[s] do you have in mind, the gcc_assert's? Also, out of
> curiosity, what is inefficient about the use of min_insn_conflict_delay?
>
> For the record, min_insn_conflict delay has nothing to do with emulating
> two bypasses; this tweak makes scheduler faster by not adding
> instructions to the ready list which makes haifa-sched.c:max_issue() do
> its exhaustive-like search on a smaller set.

I admit your implementation is probably the best correct solution under the current semantics. I'm just too lazy to enjoy writing that additional code and defining a new data structure, especially after Vladimir said he could extend the semantics ;)

> Don't get me wrong, I'm not against adding support for N>1 bypasses; it
> is not that easy though ;) .

I have no idea about the effort. But I guess you'd like to re-implement m68k with the 2nd bypass when it is ready.

Thanks - Joey
Options of fixing biggest alignment in PR target/38736
This is about http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38736, and I'd rather discuss it on the gcc mailing list.

Basically the problem is shown by the following example:

Case 1 (on x86 or x86_64):

$ cat i.h
struct s {
  char dummy0;
  // align at maximum aligned boundary supported by this target.
  char dummy __attribute__((aligned));
  int data;
};
extern void foo(struct s*);

$ cat foo.c
#include "i.h"
void foo(struct s* input)
{
  input->data = 1;
}

$ cat main.c
#include "i.h"
extern void abort(void);
struct s g_s;
int main()
{
  foo(&g_s);
  if (g_s.data != 1)
    abort();
}

$ gcc -S foo.c
$ gcc -S main.c -mavx
$ gcc -o foo.exe foo.s main.s
$ ./foo.exe
Aborted

The reason is that the AVX target defines BIGGEST_ALIGNMENT as 32 bytes while the non-AVX x86 target defines it as 16 bytes. Since __attribute__((aligned)) aligns struct members according to BIGGEST_ALIGNMENT, objects built by GCC with and without AVX end up with different struct layouts.

These are the options I can think of so far to solve this problem:

Option 1: Leave BIGGEST_ALIGNMENT as it is today and modify all libraries and header files that use __attribute__((aligned)) the way i.h does.

Option 2: Define BIGGEST_ALIGNMENT as a fixed value of 16 bytes for all x86 targets.

Option 3: Define BIGGEST_ALIGNMENT as a fixed value of 32 bytes for all x86 targets, and extend it to 64 or more bytes in the future.

Option 1 follows the definition of __attribute__((aligned)) in the GCC manual, and it works as expected by providing a way to align at the maximum target alignment. However, fixing all libraries will be tedious and it is easy to miss some. The documentation should also mention the potential issue with using this feature.

Option 2 and option 3 look like quick solutions, but their drawbacks are obvious. Firstly, they don't follow the definition of __attribute__((aligned)) and can confuse users. Secondly, they eliminate a convenient way for users to get the maximum alignment supported in the x86 family.
Also, very importantly, they won't solve every problem, for example if i.h looks like this:

Case 2:

$ cat i2.h
#ifdef __AVX__
#define aligned __aligned__(32)
#else
#define aligned __aligned__(16)
#endif
struct s {
  char dummy0;
  char dummy __attribute__((aligned));
  int data;
};
extern void foo(struct s*);

Furthermore, option 3 will result in different behavior for GCC 4.3- and GCC 4.4+: case 1 will still fail if foo.c is built by GCC 4.3- and main.c by 4.4+.

In summary, I don't see an obvious best way to solve PR38736. But IMHO option 1 is more reasonable.

Thanks - Joey
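
The layout mismatch behind Case 1 can be seen in a single translation unit by writing out the two alignments explicitly. This is only a sketch: the struct names s_align16 and s_align32 are made up here, and the explicit aligned(16) and aligned(32) stand in for what a bare __attribute__((aligned)) resolves to without and with -mavx according to the description above.

```c
#include <stddef.h>

/* Same members as struct s in i.h, with the alignment a non-AVX build
   is described as using (BIGGEST_ALIGNMENT of 16 bytes).  */
struct s_align16 {
    char dummy0;
    char dummy __attribute__((aligned(16)));
    int data;
};

/* Same members, with the alignment an AVX build is described as using
   (BIGGEST_ALIGNMENT of 32 bytes).  */
struct s_align32 {
    char dummy0;
    char dummy __attribute__((aligned(32)));
    int data;
};

/* Offset of 'dummy' in each layout: it follows the requested alignment.  */
size_t dummy_offset_align16(void) { return offsetof(struct s_align16, dummy); }
size_t dummy_offset_align32(void) { return offsetof(struct s_align32, dummy); }

/* Offset of 'data' in each layout: this is the field foo() writes and
   main() reads, so a mismatch here is what makes the test abort.  */
size_t data_offset_align16(void) { return offsetof(struct s_align16, data); }
size_t data_offset_align32(void) { return offsetof(struct s_align32, data); }
```

Since dummy sits at offset 16 in one layout and 32 in the other, data also lands at different offsets (20 and 36 on a target where int is 4 bytes with 4-byte alignment). foo() built one way therefore stores to a different location than main() built the other way reads.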
RE: Options of fixing biggest alignment in PR target/38736
From: Ian Lance Taylor [mailto:i...@google.com]:
> Therefore, I propose that we do the following:
>
> 1) Introduce __attribute__ ((aligned (scalar))). This will be
>    documented as having a fixed value for each ABI. The value will be
>    guaranteed to be sufficient to hold any ordinary non-vector type.
>    The default will be BIGGEST_ALIGNMENT. The value for the
>    x86/x86_64 will be 128.
>
> 2) Introduce __attribute__ ((aligned (max))). This will be documented
>    as having the largest value available for any version of the
>    architecture, and thus in particular it may change if new versions
>    of the architecture are released. The value will not change based
>    on command line options which do not change the ABI; that is, if it
>    is possible to link together two files compiled with different set
>    of command line options and expect the result to work, then those
>    command line options must not change the value of this attribute.
>    The value will be guaranteed to be sufficient to hold any type,
>    including any vector type. The default will be BIGGEST_ALIGNMENT.
>    The value for the x86/x86_64 will (presumably) be 256.

To me, "new versions of the architecture are released" usually means "change based on command line options" for x86. What happens when the default value grows to 512 or even higher in the future?

Thanks - Joey
Suspicious missing tail call opportunity
In the following example, the call to sbfoo isn't a tail call with -O2. GCC's analysis concludes that the local variable may be referenced in sbfoo. Is that a reasonable analysis? In other words, is it a legal program for bar to store the address of local in a static variable and for sbfoo to access it later?

This issue causes a missed tail call opportunity in newlib, and thus unnecessarily increased stack consumption.

a.c:
extern int sbfoo(void);
extern int bar(int *);
int foo()
{
  int local = 0;
  if (bar(&local))
    return 0;
  return sbfoo();
}

b.c:
int * g;
int bar(int *c) { g=c; return 0;}
int sbfoo() { return *g; }
Re: Stellaris Non-Word-Aligned Write to SRAM Erratum
On Fri, Jan 11, 2013 at 2:29 AM, Louis-Philippe Brais wrote: > Hi all, > > The latest errata for Texas Instruments' Cortex-M3 family, updated > last October [1], contains a disturbing new problem triggered by > non-word-aligned writes to SRAM. This is the kind of errata that is > effectively addressed with a compiler work-around. FWIW, it has > already been addressed by a popular commercial toolchain vendor [2]. I > was wondering if the GCC ARM maintainers were aware of this bug, and > if somebody implemented or was working on a compiler work-around for > this problem. I had a look at recent discussions and patches on the > GCC mailing lists, but could not find anything. I'm looking for > something along the lines of the -mfix-cortex-m3-ldrd fix, but for > that new alignment write erratum. > > [1] http://www.ti.com/lit/er/spmz642b/spmz642b.pdf > [2] > http://netstorage.iar.com/SuppDB/Public/UPDINFO/007040/arm/doc/infocenter/iccarm.ENU.html > > Thanks for your attention, > LP Brais I don't see any patch for this erratum. It should be a new option rather than -mfix-cortex-m3-ldrd. - Joey
Hoist across FP control register setting
The following case attempts to set the floating point control register and then execute a floating point operation. However, it doesn't work as expected with -Os, because GCC hoists the multiply above the FP control register setting. Since there is no register dependence between __set_FPSCR and the multiply, hoisting can happen. There is indeed a structural dependence, but it can't be expressed in GCC's semantics.

How about providing some kind of barrier that prevents such hoisting from happening?

int ftz;
float foo(float a, float b)
{
  float r;
  unsigned fpscr_orig = __get_FPSCR();
  if (ftz) {
    __set_FPSCR(fpscr_orig | 0x100);
    r = a * b;
  } else {
    __set_FPSCR(fpscr_orig & ~0x100);
    r = a * b;
  }
  __set_FPSCR(fpscr_orig);
  return r;
}
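
Until such a barrier exists, one possible workaround is to route the multiply's operands and result through empty volatile asm statements, which gives the multiply a data dependence on something the compiler will not delete or speculate. The sketch below is an assumption-laden illustration, not a GCC feature: FP_BARRIER is a made-up macro name, and get_fpscr_stub/set_fpscr_stub are hypothetical host-side stand-ins for __get_FPSCR/__set_FPSCR so the example compiles anywhere. It also assumes the real register write is itself a volatile asm (or has other side effects) that the compiler keeps in order relative to another volatile asm.

```c
/* Hypothetical stand-ins for the FPSCR intrinsics in the message; on the
   real target these would read and write the FP status/control register.  */
volatile unsigned fake_fpscr;
unsigned get_fpscr_stub(void) { return fake_fpscr; }
void set_fpscr_stub(unsigned v) { fake_fpscr = v; }

/* Made-up helper: an empty volatile asm that claims to read and modify x
   in memory.  Anything computed from x afterwards depends on this asm.  */
#define FP_BARRIER(x) __asm__ volatile("" : "+m"(x))

int ftz;

float foo(float a, float b)
{
    float r;
    unsigned fpscr_orig = get_fpscr_stub();

    if (ftz)
        set_fpscr_stub(fpscr_orig | 0x100);
    else
        set_fpscr_stub(fpscr_orig & ~0x100u);

    FP_BARRIER(a);  /* operands are only "known" after the mode change */
    FP_BARRIER(b);
    r = a * b;
    FP_BARRIER(r);  /* result must exist before the register is restored */

    set_fpscr_stub(fpscr_orig);
    return r;
}
```

The cost is that the "+m" constraint forces the values through memory, so this trades some code quality for ordering. A proper barrier in the compiler, as suggested above, would not need that.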
RE: [discuss] When is RBX used for base pointer?
On Wed, 13 Feb 2008, H.J. Lu wrote:
>> Recent i386 use arbitrary register as GOT pointer only for leaf
>> function. When you call something, the GOT entry uses EBX too.
>> We use RBX for large PIC model. But I am with Michael here that I don't
>> see reason why choice of register needs to be set in stone.
>> We can probably use RBX for non-large-PIC and R12 elsewhere.
> Joey ran into issues when he didn't use a hard register to realign stack.
> It has something to do with reload. We really need some help here with
> reload. Joey can explain it when he comes from vacation next week.

Michael, Jan,

When aligning the stack for functions that have dynamic stack allocation, we must use a free callee-saved register in the prologue. We named this hard register DRAP. It is worth emphasizing that *free* here means "free in the prologue". After the prologue, a virtual register is used instead.

Given this definition of free, we can fix the DRAP register to simplify the implementation. The original GCC has only a limited number of cases that use a callee-saved register in the prologue, such as setting the GOT pointer, as far as I know. So choosing the DRAP register is easy: just avoid the GOT pointer register, which is EBX on i386 and RBX on x86_64. As HJ said, R12 is a good candidate.

It will be more complicated if the GOT pointer register is not fixed. In that case, the DRAP candidate must avoid the GOT register, or vice versa. When does current GCC decide which register to use as the GOT pointer?

Thanks - Joey
RE: [discuss] When is RBX used for base pointer?
Honza,

> Honza said:
> I am bit confused here. If I wanted a free register in prologue only, I
> would probably look at the caller saved ones. But I gues it is just
> typo.
> I don't see much value in making the register callee-saved especially if
> you say that virtual reg (pseudo?) is used afterward.

I'm sorry for the confusing wording. But I did mean a callee-save register, for the case where no caller-save register is available. For i386, eax, edx and ecx can all be used to pass register parameters, so there must be a callee-save register in reserve. Because of a false bug report we said that only a callee-save register can be used. That has been clarified now.

> When you just need a temporary in prologue, I think you can go with RAX
> in most cases getting shortest code. It is used by x86-64 stdargs
> prologue and by i386 regparm. You can improve bit broken
> ix86_eax_live_at_start_p to test this. Using alternative choice if RAX
> is taken.

In case callee-save registers are available, ECX is a good candidate for i386 because it is the last register to be used for parameter passing. RAX is a good candidate for x86_64.

Thanks - Joey
RE: A proposal to align GCC stack
Ross, Christian, Here are the patches to implement the idea we discussed before. Can you take a look at it or try it? http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01200.html http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01199.html Thanks - Joey
I386.md: *_mixed and *_sse
Hi,

In i386.md, alternative 1 of *fop_sf_comm_mixed duplicates *fop_sf_comm_sse. Why do we define a _mixed pattern here?

(define_insn "*fop_sf_comm_mixed"
  [(set (match_operand:SF 0 "register_operand" "=f,x")
        (match_operator:SF 3 "binary_fp_operator"
          [(match_operand:SF 1 "nonimmediate_operand" "%0,0")
           (match_operand:SF 2 "nonimmediate_operand" "fm,xm")]))]
  "TARGET_MIX_SSE_I387
   && COMMUTATIVE_ARITH_P (operands[3])
   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "* return output_387_binary_op (insn, operands);"
  [(set (attr "type")
        (if_then_else (eq_attr "alternative" "1")
           (if_then_else (match_operand:SF 3 "mult_operator" "")
              (const_string "ssemul")
              (const_string "sseadd"))
           (if_then_else (match_operand:SF 3 "mult_operator" "")
              (const_string "fmul")
              (const_string "fop"))))
   (set_attr "mode" "SF")])

(define_insn "*fop_sf_comm_sse"
  [(set (match_operand:SF 0 "register_operand" "=x")
        (match_operator:SF 3 "binary_fp_operator"
          [(match_operand:SF 1 "nonimmediate_operand" "%0")
           (match_operand:SF 2 "nonimmediate_operand" "xm")]))]
  "TARGET_SSE_MATH
   && COMMUTATIVE_ARITH_P (operands[3])
   && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "* return output_387_binary_op (insn, operands);"
  [(set (attr "type")
        (if_then_else (match_operand:SF 3 "mult_operator" "")
           (const_string "ssemul")
           (const_string "sseadd")))
   (set_attr "mode" "SF")])

Thanks - Joey
Ask for help: constraints error
I got following error after changing some GCC code, can anyone give me some hints what's wrong here? --- error: insn does not satisfy its constraints: (insn:HI 690 689 1267 79 libgcc/config/libbid/bid_binarydecimal.c:146450 (parallel [ (set (mem/c:DI (plus:SI (reg:SI 2 cx [59]) (const_int -264 [0xfef8])) [1440 lC.3833+0 S8 A64]) (sign_extend:DI (reg:SI 0 ax [351]))) (clobber (reg:CC 17 flags)) (clobber (reg:SI 2 cx)) ]) 123 {*extendsidi2_1} (nil)) *extendsidi2_1 is like: (define_insn "*extendsidi2_1" [(set (match_operand:DI 0 "nonimmediate_operand" "=*A,r,?r,?*o") (sign_extend:DI (match_operand:SI 1 "register_operand" "0,0,r,r"))) (clobber (reg:CC FLAGS_REG)) (clobber (match_scratch:SI 2 "=X,X,X,&r"))] "!TARGET_64BIT" "#") Thanks - Joey
CFA expression failure
Daniel, We generate following DWARF2 instructions for stack alignment prologue. Basically we use expression to calculate CFA. But it run into some segfault in libmudflap and libjava. Do you have any hints what's wrong? DW_CFA_def_cfa: r4 (esp) ofs 4 DW_CFA_offset: r8 (eip) at cfa-4 DW_CFA_nop DW_CFA_nop 001c 002c 0020 FDE cie= pc=..0083 DW_CFA_advance_loc: 1 to 0001 DW_CFA_def_cfa_offset: 8 DW_CFA_offset: r7 (edi) at cfa-8 DW_CFA_advance_loc: 4 to 0005 DW_CFA_def_cfa: r7 (edi) ofs 0 DW_CFA_advance_loc: 7 to 000c DW_CFA_expression: r5 (ebp) (DW_OP_breg5: 0) DW_CFA_advance_loc: 37 to 0031 DW_CFA_def_cfa_expression (DW_OP_breg5: -4; DW_OP_deref) DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8) DW_CFA_expression: r3 (ebx) (DW_OP_breg5: -12) <_Z3bariii>: 0: 57 push %edi 1: 8d 7c 24 08 lea0x8(%esp),%edi 5: 83 e4 e0and$0xffe0,%esp 8: ff 77 fcpushl -0x4(%edi) b: 55 push %ebp c: 89 e5 mov%esp,%ebp e: 81 ec 88 00 00 00 sub$0x88,%esp 14: 89 45 c4mov%eax,-0x3c(%ebp) 17: 89 c8 mov%ecx,%eax 19: 83 c0 1eadd$0x1e,%eax 1c: 83 e0 f0and$0xfff0,%eax 1f: 89 5c 24 7c mov%ebx,0x7c(%esp) 23: 89 b4 24 80 00 00 00mov%esi,0x80(%esp) 2a: 89 bc 24 84 00 00 00mov%edi,0x84(%esp) 31: 29 c4 sub%eax,%esp Thanks - Joey
RE: CFA expression failure
It might be due to

DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8)
DW_CFA_expression: r3 (ebx) (DW_OP_breg5: -12)

After defining the registers relative to the CFA instead of r5, we got fewer failures.

Thanks - Joey

-----Original Message-----
From: Daniel Jacobowitz [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2008 10:00 PM
To: H.J. Lu
Cc: Ye, Joey; gcc@gcc.gnu.org; Guo, Xuepeng
Subject: Re: CFA expression failure

On Tue, Jun 24, 2008 at 08:40:18PM -0700, H.J. Lu wrote:
> I think the problem is in uw_update_context_1. REG_SAVED_EXP
> and REG_SAVED_VAL_EXP may use other registers as shown above:
>
>    DW_CFA_expression: r6 (esi) (DW_OP_breg5: -8)
>
> They should be handle last. I am testing this patch. Does it
> make senses?

I think that rather than delaying such expressions, you need to evaluate into a temporary context. DW_OP_breg5 means the current frame's %ebp; DW_CFA_expression: r5 describes the location of the previous frame's %ebp. They're different registers. Otherwise this is going to be too order-sensitive.

--
Daniel Jacobowitz
CodeSourcery
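
Daniel's point about order sensitivity can be illustrated with a toy model in C. This is not libgcc's unwinder: the context struct, the rule encoding, the memory array, and all addresses and values are assumptions invented for the example. Register numbers follow the dump above (r3 = ebx, r5 = ebp, r6 = esi), and each rule says "the caller's value of this register is stored at current-frame base register plus offset".

```c
enum { NREGS = 8, MEM_WORDS = 64 };

enum rule_kind { RULE_SAME, RULE_BREG };

/* Toy register rule: either unchanged, or saved at [base_reg + offset].  */
struct rule {
    enum rule_kind kind;
    int base_reg;
    int offset;
};

struct context {
    int reg[NREGS];
};

/* Toy memory: byte addresses are multiples of 4, one int per word.  */
static int load_word(const int *mem, int addr) { return mem[addr / 4]; }

/* Order-sensitive: later rules read registers already overwritten with
   the caller's values.  */
void update_in_place(struct context *ctx, const struct rule *rules,
                     const int *mem)
{
    for (int r = 0; r < NREGS; r++)
        if (rules[r].kind == RULE_BREG)
            ctx->reg[r] = load_word(mem, ctx->reg[rules[r].base_reg]
                                         + rules[r].offset);
}

/* Evaluate every rule against a snapshot of the current frame, then
   store the results into the context being built for the caller.  */
void update_via_snapshot(struct context *ctx, const struct rule *rules,
                         const int *mem)
{
    struct context cur = *ctx; /* temporary copy of the current frame */
    for (int r = 0; r < NREGS; r++)
        if (rules[r].kind == RULE_BREG)
            ctx->reg[r] = load_word(mem, cur.reg[rules[r].base_reg]
                                         + rules[r].offset);
}

/* Returns the recovered caller's esi (r6) for the chosen strategy.  */
int recovered_esi(int use_snapshot)
{
    int mem[MEM_WORDS] = { 0 };
    struct rule rules[NREGS] = { { RULE_SAME, 0, 0 } };
    struct context ctx = { { 0 } };

    ctx.reg[5] = 100;            /* current frame's ebp */
    mem[100 / 4] = 200;          /* caller's ebp saved at 0(%ebp) */
    mem[(100 - 8) / 4] = 11;     /* caller's esi saved at -8(%ebp) */
    mem[(100 - 12) / 4] = 22;    /* caller's ebx saved at -12(%ebp) */
    mem[(200 - 8) / 4] = 999;    /* unrelated data in the caller's frame */

    rules[5] = (struct rule){ RULE_BREG, 5, 0 };
    rules[6] = (struct rule){ RULE_BREG, 5, -8 };
    rules[3] = (struct rule){ RULE_BREG, 5, -12 };

    if (use_snapshot)
        update_via_snapshot(&ctx, rules, mem);
    else
        update_in_place(&ctx, rules, mem);
    return ctx.reg[6];
}
```

In this model the in-place update processes r5 before r6, so the r6 rule reads through the caller's ebp (200) and picks up the unrelated value 999, while the snapshot version reads through the current frame's ebp (100) and recovers 11. Delaying expression rules until last would fix this particular ordering, but only the snapshot approach is independent of rule order.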
4.3 x86_64 Bootstrap breaks
4.3 trunk revision 126185 I got at x86_64: libtool: compile: unable to infer tagged configuration libtool: compile: specify a tag with `--tag' make[6]: *** [kill.lo] Error 1 Anyone else got the same? 126184 passes. Looks like problems in this check: r126185 | kargl | 2007-07-02 10:47:21 +0800 (Mon, 02 Jul 2007) | 281 lines Thanks - Joey
RE: DFA Scheduler - unable to pipeline loads
Matt,

I just started working on pipeline descriptions and I'm confused about one thing in yours. For "integer", your CPU has a 1-cycle latency but a reservation with three unit stages, "issue,iu,wb". What does that mean? My understanding is that the number of units separated by "," should be equal to the latency. Am I right?

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Matt Lee
Sent: September 1, 2007 5:58
To: gcc@gcc.gnu.org
Subject: DFA Scheduler - unable to pipeline loads

Hi,

I am working with GCC-4.1.1 on a simple 5-pipe stage simple scalar RISC processors with the following description for loads and stores,

(define_insn_reservation "integer" 1
  (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
  "issue,iu,wb")
(define_insn_reservation "load" 3
  (eq_attr "type" "load")
  "issue,iu,wb")
(define_insn_reservation "store" 1
  (eq_attr "type" "store")
  "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is expanded inline.

memcpy (&dst, &src, 16) ==>

load 1, rA + 4
store 1, rB + 4
load 2, rA + 8
store 2, rB + 8
...

Basically, instead of pipelining the loads, the current schedule stalls the processor for two cycles on every dependent store. Here is a dump from the .35.sched1 file.
;; == ;; -- basic block 0 from 6 to 36 -- before reload ;; == ;;0--> 6r84=r5 :issue,iu,wb ;;1--> 13 r86=[`Ptr_Glob'] :issue,iu,wb ;;2--> 25 r92=0x5:issue,iu,wb ;;3--> 12 r85=[r84] :issue,iu,wb ;;4--> 14 r87=[r86] :issue,iu,wb ;;7--> 15 [r85]=r87 :issue,iu,wb ;;8--> 16 r88=[r86+0x4] :issue,iu,wb ;; 11--> 17 [r85+0x4]=r88 :issue,iu,wb ;; 12--> 18 r89=[r86+0x8] :issue,iu,wb ;; 15--> 19 [r85+0x8]=r89 :issue,iu,wb ;; 16--> 20 r90=[r86+0xc] :issue,iu,wb ;; 19--> 21 [r85+0xc]=r90 :issue,iu,wb ;; 20--> 22 r91=[r86+0x10] :issue,iu,wb ;; 23--> 23 [r85+0x10]=r91 :issue,iu,wb ;; 24--> 26 [r84+0xc]=r92 :issue,iu,wb ;; 25--> 31 clobber r3 :nothing ;; 25--> 36 use r3 :nothing ;; Ready list (final): ;; total time = 25 ;; new head = 7 ;; new tail = 36 There is an obvious better schedule to be obtained. Here is one such (hand-modified) schedule which just pipelines two of the loads to obtain a shorter critical path length to the whole function (function has only bb 0) ;;0--> 6r84=r5 :issue,iu,wb ;;1--> 13 r86=[`Ptr_Glob'] :issue,iu,wb ;;2--> 25 r92=0x5:issue,iu,wb ;;3--> 12 r85=[r84] :issue,iu,wb ;;4--> 14 r87=[r86] :issue,iu,wb ;;7--> 15 [r85]=r87 :issue,iu,wb ;;8--> 16 r88=[r86+0x4] :issue,iu,wb ;;9--> 18 r89=[r86+0x8] :issue,iu,wb ;; 10--> 20 r90=[r86+0xc] :issue,iu,wb ;; 11--> 17 [r85+0x4]=r88 :issue,iu,wb ;; 12--> 19 [r85+0x8]=r89 :issue,iu,wb ;; 13--> 21 [r85+0xc]=r90 :issue,iu,wb ;; 14--> 22 r91=[r86+0x10] :issue,iu,wb ;; 17--> 23 [r85+0x10]=r91 :issue,iu,wb ;; 18--> 26 [r84+0xc]=r92 :issue,iu,mb_wb ;; 19--> 31 clobber r3 :nothing ;; 20--> 36 use r3 :nothing ;; Ready list (final): ;; total time = 20 ;; new head = 7 ;; new tail = 36 This schedule is 5 cycles faster. I have read and re-read the material surrounding the DFA scheduler. I understand that the heuristics optimize critical path length and not stalls or other metrics. But in this case it is precisely the critical path length that is shortened by the better schedule. 
I have been examining various hooks available and for a while it seemed like TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a larger window to look for better candidates to schedule into the ready queue. For instance, this discussion seems to say so. http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html But a post that follows soon after seems to imply otherwise. http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html Both posts are from Vladimir. In any case the final conclusion seems to be that the lookahead is useful only for multi-
RE: Designs for better debug info in GCC. Choice A or B?
I like option B. It will be very helpful in reducing software product development time. Some software products are released built with -O0 because the developers are not confident about releasing a version that differs from the one they debugged and tested. Also, on some systems -O0 simply doesn't work, because the code is too slow or too big to fit into flash memory, and developers have to suffer poor debuggability.

I believe it is valuable to have an option that generates code with fair performance and code size but almost full debuggability. And I believe it is not technically impossible.

Thanks - Joey

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of J.C. Pizarro
Sent: November 25, 2007 7:46
To: gcc@gcc.gnu.org
Subject: Re: Designs for better debug info in GCC. Choice A or B?

To imagine that i'm using "-g -Os -finline-functions -funroll-loops". There are differences in how to generate "optimized AND debugged" code.

A) Whole-optimized but with dirty debugged information if possible. When there is coredump from crash then its debugged information can be not complete (with losses) but can be readable for humans. This kind of strategy can't work well in "step to step" debuggers like gdb, ddd, kgdb, ... but its code is whole-optimized same as stripped program.

B) Whole-debugged but partially optimized because of restricted requirements to maintain the full debugged information without losses. This kind of strategy works well in "step to step" debuggers like gdb, ddd, kgdb, ... but its code is less whole-optimized and bigger than stripped program.

Sincerely, J.C.Pizarro
A proposal to align GCC stack
-- 0. MOTIVATION --

Some local variables (such as those of __m128 type or marked with an alignment attribute) require the stack to be aligned at a boundary larger than the default stack boundary. Current GCC supports this only partially and with limitations. We are proposing a new design to fully solve the problem.

-- 1. CURRENT IMPLEMENTATION --

There are two ways current GCC supports bigger than default stack alignment. One is to make sure that the stack is aligned at the program entry point, and then ensure that for each non-leaf function, its frame size is aligned. This approach doesn't work when linking with libs or objects compiled by other psABI conforming compilers. Some problems are logged as PR 33721.

Another is to adjust stack alignment at the entry point of a function if it is marked with __attribute__ ((force_align_arg_pointer)) or the -mstackrealign option is provided. This method guarantees the alignment in most cases but has the following problems and limitations:

* Only 16 bytes alignment is supported
* Adjusting stack alignment at each function prologue hurts performance unnecessarily, because not all functions need bigger alignment. In fact, commonly only those functions which have SSE variables defined locally (either declared by the user or compiler generated internal temporary variables) need corresponding alignment.
* Doesn't support x86_64 for the cases when required stack alignment is > 16 bytes
* Emits inefficient and complicated prologue/epilogue code to adjust stack alignment
* Doesn't work with nested functions
* Has a bug handling register parameters, which resulted in a cpu2006 failure. A patch is available as a workaround.

-- 2. NEW PROPOSAL: DESIGN --

Here, we propose a new design to fully support stack alignment while overcoming the above problems. The new design will

* Support arbitrary alignment value, including 4,8,16,32...
* Adjust function stack alignment only when necessary * Initial development will be on i386 and x86_64, but can be extended to other platforms * Emit more efficient prologue/epilogue code * Coexist with special features like dynamic stack allocation (alloca), nested functions, register parameter passing, PIC code and tail call optimization * Be able to debug and unwind stack 2.1 Support arbitrary alignment value Different source code and optimizations requires different stack alignment, as in following table: Feature Alignment (bytes) i386_ABI4 x86_64_ABI 16 char1 short 2 int 4 long4/8* long long 8 __m64 8 __m128 16 float 4 double 8 long double 4/16* user specified any power of 2 *Note: 4 for i386, 8/16 for x86_64 The new design will support any alignment value in this table. 2.2 Adjust function stack alignment only when necessary Current GCC defines following macros related to stack alignment: i. STACK_BOUNDARY in bits, which is enforced by hardware, 32 for i386 and 64 for x86_64. It is the minimum stack boundary. It is fixed. ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a function. It may be set at command line and has no impact on stack alignment at function entry. This proposal requires PREFERRED >= STACK, and by default set to ABI_STACK_BOUNDARY This design will define a few more macros, or concepts not explicitly defined in code: iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified by psABI, 32 for i386 and 128 for x86_64. ABI_STACK_BOUNDARY >= STACK_BOUNDARY. It is fixed for a given psABI. iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack alignment requirement, which depends the alignment of its stack variables, LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack variable). v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at function entry. 
If a function is marked with __attribute__ ((force_align_arg_pointer)) or -mstackrealign option is provided, INCOMING = STACK_BOUNDARY. Otherwise, INCOMING == MIN(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) because a function can be called via psABI externally or called locally with PREFERRED_STACK_BOUNDARY. vi. REQUIRED_STACK_ALIGNMENT in bits, which is stack alignment required by local variables and calling other function. REQUIRED_STACK_ALIGNMENT == MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a non-leaf function. For a leaf function, REQUIRED_STACK_ALIGNMENT == LOCAL_STACK_BOUNDARY. This proposal won't adjust stack when INCOMING_STACK_BOUNDARY >= REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY < REQUIRED_STACK_ALIGNMENT, it will adjust stack to REQUIRED_STACK_ALIGNMENT at prologue. 2.3 Initial development on i386 and x86_64 We initially support i386 and x86_64. In this document we focus more on i386 because it is hard to implement because of the restriction of having a small register file. But all that we discuss can be easily applied to x86_64. 2.4 Emit more efficient prologue/epil
RE: A proposal to align GCC stack
Ross, HJ,

>
> >Because I386 PIC requires BX as GOT pointer and I386 may use AX, DX
> >and CX as parameter passing registers, there are limited candidates for
> >this proposal to choose. Current proposal suggests EDI, because it won't
> >conflict with i386 PIC or regparm.
>
> Could you pick a call-clobbered register in cases where one is available?

I think it is doable. In the code an Apple engineer wrote to support -mstackrealign, hard register ECX is used. We need to add code to find which caller-saved register is not used to pass parameters. If none of them is available, we still have to use a callee-saved register like EDI.

>
> >// Reserve two stack slots and save return address
> >// and previous frame pointer into them. By
> >// pointing new ebp to them, we build a pseudo
> >// stack for unwinding
>
> Hmmm... I don't know much about the DWARF unwind information, but
> couldn't it handle this case without creating the "pseudo frame"?
> Or at least be extended so it could?

I haven't spent time investigating it yet. I agree it would be much more beautiful without the "pseudo frame", and I will be happy if a solution can be found or suggested here. But I doubt it is worth the effort. Remember that a "pseudo frame" is generated only when both stack adjustment and alloca are present. That may not be common enough to affect performance.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of H.J. Lu
Sent: December 18, 2007 13:17
To: Ross Ridge
Cc: gcc@gcc.gnu.org
Subject: Re: A proposal to align GCC stack

On Mon, Dec 17, 2007 at 11:25:35PM -0500, Ross Ridge wrote:
> Ye, Joey writes:
> >i. STACK_BOUNDARY in bits, which is enforced by hardware, 32 for i386
> >and 64 for x86_64. It is the minimum stack boundary. It is fixed.
>
> Strictly speaking by the above definition it would be 8 for i386.
> The hardware doesn't force the stack to be 32-bit aligned, it just
> performs poorly if it isn't.

We can change the wording.

>
> >v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary
> >at function entry. If a function is marked with __attribute__
> >((force_align_arg_pointer)) or -mstackrealign option is provided,
> >INCOMING = STACK_BOUNDARY. Otherwise, INCOMING == MIN(ABI_STACK_BOUNDARY,
> >PREFERRED_STACK_BOUNDARY) because a function can be called via psABI
> >externally or called locally with PREFERRED_STACK_BOUNDARY.
>
> This section doesn't make sense to me. The force_align_arg_pointer
> attribute and -mstackrealign assume that the ABI is being
> followed, while the -fpreferred-stack-boundary option effectively

According to the Apple engineer who implemented -mstackrealign, on MacOS/ia32 the psABI is 16 bytes, but -mstackrealign will assume 4 bytes, which is STACK_BOUNDARY.

> changes the ABI. According your definitions, I would think
> that INCOMING should be ABI_STACK_BOUNDARY in the first case,
> and MAX(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) in the second.

That isn't true, since some .o files may not be compiled with -fpreferred-stack-boundary, or may be compiled with a different value of -fpreferred-stack-boundary.

> (Or just PREFERRED_STACK_BOUNDARY because a boundary less than the ABI's
> should be rejected during command line processing.)

On x86-64, ABI_STACK_BOUNDARY is 16 bytes, but the Linux kernel may want to use 8 bytes for PREFERRED_STACK_BOUNDARY.

>
> >vi. REQUIRED_STACK_ALIGNMENT in bits, which is stack alignment required
> >by local variables and calling other function. REQUIRED_STACK_ALIGNMENT
> >== MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a
> >non-leaf function. For a leaf function, REQUIRED_STACK_ALIGNMENT ==
> >LOCAL_STACK_BOUNDARY.
>
> Hmm... I think you should define STACK_BOUNDARY as the minimum
> alignment that ABI requires the stack pointer to keep at all times.
> ABI_STACK_BOUNDARY should be defined as the stack alignment the
> ABI requires at function entry. In that case a leaf function's
> REQUIRED_STACK_ALIGNMENT should be MAX(LOCAL_STACK_BOUNDARY,
> STACK_BOUNDARY).

That is true, since if the only local variable is a char, LOCAL_STACK_BOUNDARY will be 1. But we want the stack to be aligned at STACK_BOUNDARY. We will update our proposal.

H.J.
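For readers following the thread, the REQUIRED_STACK_ALIGNMENT rule with the leaf-function correction agreed above can be written down as a small C sketch. This is only an illustration under assumptions: the struct and the function names (stack_boundaries, required_stack_alignment, needs_realign) are made up for this example and are not GCC internals, and all values are in bits as in the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the boundaries named in the proposal.
   All values are in bits. None of these names exist in GCC itself. */
struct stack_boundaries {
    unsigned stack;      /* STACK_BOUNDARY */
    unsigned preferred;  /* PREFERRED_STACK_BOUNDARY */
    unsigned local;      /* LOCAL_STACK_BOUNDARY */
    unsigned incoming;   /* INCOMING_STACK_BOUNDARY */
};

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* REQUIRED_STACK_ALIGNMENT as discussed in the thread:
   non-leaf: MAX(LOCAL, PREFERRED)
   leaf:     MAX(LOCAL, STACK), the correction suggested by Ross Ridge. */
static unsigned required_stack_alignment(const struct stack_boundaries *b,
                                         bool is_leaf)
{
    assert(b->preferred >= b->stack);  /* the proposal requires PREFERRED >= STACK */
    return is_leaf ? max_u(b->local, b->stack)
                   : max_u(b->local, b->preferred);
}

/* The prologue realigns only when INCOMING < REQUIRED. */
static bool needs_realign(const struct stack_boundaries *b, bool is_leaf)
{
    return b->incoming < required_stack_alignment(b, is_leaf);
}
```

With the i386 values from the proposal (STACK = PREFERRED = INCOMING = 32), a leaf function whose only local is a char (8 bits) gets a required alignment of 32 and no realignment, while the same function with an __m128 local (128 bits) would need its stack realigned.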
RE: A proposal to align GCC stack
Ross Ridge wrote:
> I'm currently using -fpreferred-stack-boundary without any trouble.
> Your proposal would in fact generate code to align stack when it's not
> necessary. This would change the behaviour of -fpreferred-stack-boundary,
> hurting performance and that's unacceptable to me.

This proposal puts correctness first. So when the compiler can't make sure a function is only called from functions with the same or a bigger preferred-stack-boundary, it will conservatively align the stack. One optimization is to set INCOMING = PREFERRED for local functions. Would you find that more acceptable?

>> Ok, if people are using this flag to change the alignment to something
>> smaller than used by the standard ABI, then INCOMING should be
>> MAX(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY).
>
> On x86-64, ABI_STACK_BOUNDARY is 16byte, but the Linux kernel may
> want to use 8 byte for PREFERRED_STACK_BOUNDARY. INCOMING will
> be MIN(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) == 8 byte.

> Using MAX(STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY) also equals 8 in that
> case and preserves the behaviour -fpreferred-stack-boundary in every case.

I think HJ means MIN(ABI_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY). MAX(ABI, PREFERRED) == 16 in this case.

Thanks - Joey
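To make the MIN/MAX arithmetic in this exchange easy to check, here is a small C sketch that evaluates the candidate formulas side by side. It is only an illustration: the struct and function names are invented for this example, and the values are in bytes because that is how the x86-64 case is quoted above.

```c
#include <assert.h>

/* Hypothetical helper type: the three INCOMING formulas that appear
   in the discussion, evaluated for one set of boundaries (in bytes). */
struct incoming_candidates {
    unsigned min_abi_preferred;    /* MIN(ABI, PREFERRED)   */
    unsigned max_abi_preferred;    /* MAX(ABI, PREFERRED)   */
    unsigned max_stack_preferred;  /* MAX(STACK, PREFERRED) */
};

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

static struct incoming_candidates
incoming_candidates(unsigned stack, unsigned abi, unsigned preferred)
{
    struct incoming_candidates c;

    assert(abi >= stack);  /* the proposal states ABI >= STACK */
    c.min_abi_preferred = min_u(abi, preferred);
    c.max_abi_preferred = max_u(abi, preferred);
    c.max_stack_preferred = max_u(stack, preferred);
    return c;
}
```

For the Linux kernel case mentioned above (STACK = 8, ABI = 16, PREFERRED = 8), calling incoming_candidates(8, 16, 8) yields 8 for MIN(ABI, PREFERRED), 16 for MAX(ABI, PREFERRED) and 8 for MAX(STACK, PREFERRED), which matches the numbers quoted in the thread.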
A proposal to align GCC stack - update
Thanks to Ross and HJ for their comments. Here is the updated proposal.

Changes:
- value of REQUIRED_STACK_ALIGNMENT of leaf function
- value of INCOMING_STACK_BOUNDARY

-- 0. MOTIVATION --

Some local variables (such as those of __m128 type or marked with an alignment attribute) require the stack to be aligned at a boundary larger than the default stack boundary. Current GCC supports this only partially, with limitations. We are proposing a new design to fully solve the problem.

-- 1. CURRENT IMPLEMENTATION --

Current GCC supports larger-than-default stack alignment in two ways.

One is to make sure that the stack is aligned at the program entry point, and then ensure that for each non-leaf function, its frame size is aligned. This approach doesn't work when linking with libs or objects compiled by other psABI conforming compilers. Some problems are logged as PR 33721.

Another is to adjust stack alignment at the entry point of a function if it is marked with __attribute__ ((force_align_arg_pointer)) or the -mstackrealign option is provided. This method guarantees the alignment in most cases, but has the following problems and limitations:

* Only 16 bytes alignment is supported
* Adjusting stack alignment at each function prologue hurts performance unnecessarily, because not all functions need bigger alignment. In fact, commonly only those functions which have SSE variables defined locally (either declared by the user or compiler generated internal temporary variables) need corresponding alignment.
* Doesn't support x86_64 for the cases when required stack alignment is > 16 bytes
* Emits inefficient and complicated prologue/epilogue code to adjust stack alignment
* Doesn't work with nested functions
* Has a bug handling register parameters, which resulted in a cpu2006 failure. A patch is available as a workaround.

-- 2. NEW PROPOSAL: DESIGN --

Here, we propose a new design to fully support stack alignment while overcoming the above problems. The new design will

* Support arbitrary alignment value, including 4,8,16,32...
* Adjust function stack alignment only when necessary
* Initial development will be on i386 and x86_64, but can be extended to other platforms
* Emit more efficient prologue/epilogue code
* Coexist with special features like dynamic stack allocation (alloca), nested functions, register parameter passing, PIC code and tail call optimization
* Be able to debug and unwind stack

2.1 Support arbitrary alignment value

Different source code and optimizations require different stack alignment, as in the following table:

Feature           Alignment (bytes)
i386_ABI          4
x86_64_ABI        16
char              1
short             2
int               4
long              4/8*
long long         8
__m64             8
__m128            16
float             4
double            8
long double       16
user specified    any power of 2

*Note: 4 for i386, 8 for x86_64

The new design will support any alignment value in this table.

2.2 Adjust function stack alignment only when necessary

Current GCC defines the following macros related to stack alignment:

i. STACK_BOUNDARY in bits, which is preferred by hardware, 32 for i386 and 64 for x86_64. It is the minimum stack boundary. It is fixed.

ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a function. It may be set at command line and has no impact on stack alignment at function entry. This proposal requires PREFERRED >= STACK, and by default it is set to ABI_STACK_BOUNDARY.

This design will define a few more macros, or concepts not explicitly defined in code:

iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified by psABI, 32 for i386 and 128 for x86_64. ABI_STACK_BOUNDARY >= STACK_BOUNDARY. It is fixed for a given psABI.

iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack alignment requirement, which depends on the alignment of its stack variables: LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack variable).

v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at function entry. If a function is marked with __attribute__ ((force_align_arg_pointer)) or the -mstackrealign option is provided, INCOMING = STACK_BOUNDARY. Otherwise, INCOMING == PREFERRED_STACK_BOUNDARY. For those functions whose PREFERRED is larger than ABI, it is the caller's responsibility to invoke them with the appropriate PREFERRED.

vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment required by local variables and by calling other functions. REQUIRED_STACK_ALIGNMENT == MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a non-leaf function. For a leaf function, REQUIRED_STACK_ALIGNMENT == MAX(LOCAL_STACK_BOUNDARY,STACK_BOUNDARY).

This proposal won't adjust the stack when INCOMING_STACK_BOUNDARY >= REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY < REQUIRED_STACK_ALIGNMENT will it adjust the stack to REQUIRED_STACK_ALIGNMENT at prologue.

2.3 Initial development on i386 and x86_64

We initially support i386 and x86_64. In this document we focus more on i386 becau
RE: A proposal to align GCC stack
Ye, Joey writes:
>> This proposal values correctness at first place. So when compile can't
>> make sure a function is only called from functions with the same or bigger
>> preferred-stack-boundary, it will conservatively align the stack. One
>> optimization is to set INCOMING = PREFERRED for local functions. Do you
>> think it more acceptable?

Ross Ridge wrote:
> Not really. It might reduce the amount of unnecessary stack adjustment,
> but the performance regression would remain. Changing the behaviour of
> -fpreferred-stack-boundary doesn't make it more correct. It's supposed
> to change the ABI, it works as documented and, yes, if it's misused it
> will cause problems. So will any number of GCC's ABI changing options.
> Look at it another way. Let's say you were compiling x86_64 code with
> -fpreferred-stack-boundary=3, an 8-byte PREFERRED alignment. As you
> know, this is different from the standard x86_64 ABI which requires a
> 16-byte alignment. Now with your proposal, GCC's behaviour won't
> change, because it's safe to assume that incoming stack is at least
> 8-byte aligned. There should be no change in the code GCC generates,
> with or without your proposal. However, the outgoing stack won't be
> 16-byte aligned as the x86_64 ABI requires. In this case, what also
> doesn't change is the fact that mixing code compiled with different
> -fpreferred-stack-boundary values doesn't work. It's just as problematic
> and unsafe as it was before.
> So when you said "this proposal values correctness at first place",
> that really isn't true. The proposal only addresses safety when
> preferred alignment is raised from the standard ABI's alignment. You're
> conservatively aligning the incoming stack, but not the outgoing stack.
> You don't seem to be concerned about the problems that can arise when
> the preferred is raised above the ABI's. Why? My guess is that because
> "correctness" in this case would cause unacceptable regressions when
> compiling the x86_64 Linux kernel.

You are right. My proposal doesn't guarantee 100% correctness. In the case of PREFERRED < ABI, we hope the author knows what will happen.

> If you can understand why it would be unacceptable to change how
> -fpreferred-stack-boundary behaves when compiling the Linux kernel,
> then maybe you can understand why I don't find it acceptable for it to
> change when compiling my code.

I think I understand now. The updated version of my proposal sets INCOMING == PREFERRED, and -fpreferred-stack-boundary works the same as before.

Thanks - Joey
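As a footnote to this exchange, the relationship between the -fpreferred-stack-boundary=N value and the boundaries discussed here (N means 2^N bytes, so 3 means 8 bytes or 64 bits), together with the updated INCOMING rule, can be sketched in C. The function names are hypothetical and chosen only for this illustration; they are not part of GCC.

```c
#include <assert.h>
#include <stdbool.h>

/* -fpreferred-stack-boundary=N asks for a 2^N byte boundary.
   Return that boundary in bits, the unit used by the proposal. */
static unsigned boundary_bits_from_option(unsigned n)
{
    assert(n < 28);  /* keep (1u << n) * 8 well inside unsigned range */
    return (1u << n) * 8u;
}

/* INCOMING_STACK_BOUNDARY under the updated proposal:
   STACK_BOUNDARY when force_align_arg_pointer or -mstackrealign is
   in effect, otherwise PREFERRED_STACK_BOUNDARY. Values in bits. */
static unsigned incoming_stack_boundary(unsigned stack_bits,
                                        unsigned preferred_bits,
                                        bool force_realign)
{
    return force_realign ? stack_bits : preferred_bits;
}
```

With -fpreferred-stack-boundary=3 on x86_64 this gives PREFERRED = 64 bits and, without -mstackrealign, INCOMING = 64 bits as well, so a function whose locals need no more than 8-byte alignment is compiled without any extra realignment, as Ross expects.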
RE: A proposal to align GCC stack
Andrew,

My proposal is not supposed to be limited to i386/x86_64. Would you please spend some time reviewing it to see whether it can really solve the problem on PowerPC? Your comments are welcome.

Thanks - Joey

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Pinski
Sent: December 19, 2007 18:07
To: Ross Ridge
Cc: gcc@gcc.gnu.org
Subject: Re: A proposal to align GCC stack

On 12/18/07, Ross Ridge <[EMAIL PROTECTED]> wrote:
> Look at it another way. Lets say you were compiling x86_64 code with
> -fpreferred-stack-boundary=3, an 8-byte PREFERRED alignment.

Can we stop talking about x86/x86_64-specific issues here? I have a use case on the PowerPC side of the Cell BE for variables with alignment greater than the normal stack boundary alignment of 16 bytes. They need to be 128-byte aligned for DMA transfers to the SPUs. I already proposed a patch [1] to fix this use case, but I have not seen many replies yet.

Thanks,
Andrew Pinski

[1] http://gcc.gnu.org/ml/gcc-patches/2007-05/msg01167.html
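To illustrate the kind of use case Andrew describes, here is a minimal C sketch of a local buffer that asks for 128-byte alignment, plus a helper to check an address. The helper and function names are invented for this example; __attribute__((aligned(128))) is the existing GCC attribute. Whether a stack variable actually ends up at the requested alignment depends on the compiler and target, which is exactly the problem under discussion, so the sketch reports the result rather than assuming it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if p is aligned to `alignment` bytes (must be a power of 2). */
static bool is_aligned(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical example: a local buffer intended for DMA, requesting
   128-byte alignment. Returns whether the compiler honoured it. */
static bool local_buffer_is_dma_aligned(void)
{
    unsigned char buf[256] __attribute__((aligned(128)));

    buf[0] = 0;  /* touch the buffer so it is really allocated */
    return is_aligned(buf, 128);
}
```

A compiler that only guarantees a 16-byte stack boundary and does not realign the frame could return false here, which is the situation the proposal and Andrew's patch both aim to fix.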
RE: Re: A proposal to align GCC stack
Christian Schüler writes:
> Please go forward with this idea!
> The current implementation of force_align_arg_pointer has never worked for me.

This proposal should solve your problem. But to confirm, I'd like to know the root cause. force_align_arg_pointer should have guaranteed 16-byte alignment. Are you using data structures that require alignment larger than 16 bytes? Or maybe you didn't specify force_align_arg_pointer for all of your functions?

Thanks - Joey