Re: [PLUGIN] dlopen and RTLD_NOW
2011/9/5 Jakub Jelinek :
> On Mon, Sep 05, 2011 at 10:22:10AM -0700, Andrew Pinski wrote:
>> On Mon, Sep 5, 2011 at 1:10 AM, Jakub Jelinek wrote:
>> > That said, relying on lazy binding is terribly bad design.
>>
>> In fact I was going to say why can't those symbols be marked as weak
>> in your plugin? You don't even need to change the GCC headers, just
>> have an extra header that does:
>> #pargma weak
>
> s/pargma/pragma/. Yeah, making them weak will work just fine, independently
> on whether it is RTLD_NOW or not, or, when program is directly linked
> against it, with LD_BIND_NOW=1 or not.
>
> Jakub
>

Thanks, it works fine. I didn't know about weak symbols.

Romain Geissler
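As a rough illustration of the weak-symbol approach suggested in this thread, the sketch below declares a host symbol weak and tests its address before calling it. The names cxx_only_hook and call_hook_if_present are hypothetical and stand in for any cc1plus-only symbol and its caller; the sketch assumes GCC on an ELF target, where an unresolved weak symbol gets address zero.

```c
#include <stddef.h>

/* Hypothetical symbol, assumed to be defined only by cc1plus.  */
extern int cxx_only_hook (int);

/* Mark it weak: if the host program does not define it, the reference
   resolves to address 0 instead of making the load fail.  */
#pragma weak cxx_only_hook

/* Call the C++-only hook when the host provides it; otherwise return -1.  */
int
call_hook_if_present (int arg)
{
  if (cxx_only_hook != NULL)
    return cxx_only_hook (arg);
  return -1;  /* Symbol absent, so presumably loaded by cc1.  */
}
```

With this pattern the same shared object should load under both RTLD_NOW and lazy binding, because no strong undefined reference is left to resolve.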
Issue with delay slot scheduling?
Hi,

I am doing a private port in GCC 4.5.1. For my target I see some strange behavior in delay slot scheduling. On my target, the instructions in the delay slots are executed irrespective of whether the branch is taken or not. I have generated the following code after commenting out the call to 'relax_delay_slots' in the function 'dbr_schedule'.

RTL:

(insn 97 42 51 del1.c:19 (sequence [
      (jump_insn 61 42 38 del1.c:19 (set (pc)
              (if_then_else (ne (reg:CCF 34 CC)
                      (const_int 0 [0x0]))
                  (label_ref:PQI 86)
                  (pc))) 56 {conditional_branch} (expr_list:REG_BR_PRED (const_int 5 [0x5])
              (expr_list:REG_DEAD (reg:CCF 34 CC)
                  (expr_list:REG_BR_PROB (const_int 5000 [0x1388])
                      (nil -> 86)
      (insn 38 61 43 (set (mem/s/j:QI (reg/f:PQI 28 a0 [orig:62 D.1955 ] [62]) [0 bytes S1 A32])
              (reg:QI 1 g1 [orig:65 D.1938 ] [65])) 7 {movqi_op} (nil))
      (insn 43 38 51 (set (reg:QI 1 g1 [75])
              (ior:QI (reg:QI 1 g1 [orig:65 D.1938 ] [65])
                  (reg:QI 3 g3 [77]))) 31 {iorqi3} (expr_list:REG_EQUAL (ior:QI (reg:QI 1 g1 [orig:65 D.1938 ] [65])
                  (const_int 128 [0x80]))
              (nil)))
    ]) -1 (nil))

(code_label 51 97 52 1 "" [2 uses])

(note 52 51 73 [bb 4] NOTE_INSN_BASIC_BLOCK)

(jump_insn 73 52 72 (return) 72 {return_rts} (expr_list:REG_BR_PRED (const_int 12 [0xc])
        (nil)))

(barrier 72 73 86)

(code_label 86 72 41 5 "" [1 uses])

(note 41 86 45 [bb 5] NOTE_INSN_BASIC_BLOCK)

(insn 45 41 44 del1.c:20 (set (reg:QI 2 g2 [orig:68 ivtmp.7 ] [68])
        (plus:QI (reg:QI 2 g2 [orig:68 ivtmp.7 ] [68])
            (const_int 1 [0x1]))) 13 {addqi3} (nil))

(insn 44 45 101 del1.c:20 (set (mem/s/j:QI (reg/f:PQI 28 a0 [orig:62 D.1955 ] [62]) [0 bytes S1 A32])
        (reg:QI 1 g1 [75])) 7 {movqi_op} (expr_list:REG_DEAD (reg/f:PQI 28 a0 [orig:62 D.1955 ] [62])
        (expr_list:REG_DEAD (reg:QI 1 g1 [75])
            (nil

(code_label 101 44 79 7 "" [1 uses])

Corresponding code:

        jmp.ne  .L5;
        st      [a0], g1;       (INSN 38)
        or      g1, g1, g3;     (INSN 43)
.L1:
        rts;
        nop;
        nop;
.L5:
        add     g2, g2, 1;      (INSN 45)
        st      [a0], g1;       (INSN 44) -> deleted
.L7:

You can see that INSN 44 and INSN 38 are identical.
In 'relax_delay_slots', while processing INSN 97, the second call to 'try_merge_delay_insns' deletes INSN 44, which produces an incorrect result.

      /* If we own the thread opposite the way this insn branches, see if we
         can merge its delay slots with following insns.  */
      if (INSN_FROM_TARGET_P (XVECEXP (pat, 0, 1))
          && own_thread_p (NEXT_INSN (insn), 0, 1))
        try_merge_delay_insns (insn, next);
      else if (! INSN_FROM_TARGET_P (XVECEXP (pat, 0, 1))
               && own_thread_p (target_label, target_label, 0))
        try_merge_delay_insns (insn, next_active_insn (target_label));

Deleting INSN 44 would have been correct if the second delay slot insn had not modified g1. But looking at the comment on the function 'try_merge_delay_insns':

/* Try merging insns starting at THREAD which match exactly the insns in
   INSN's delay list.

   If all insns were matched and the insn was previously annulling, the
   annul bit will be cleared.

   For each insn that is merged, if the branch is or will be non-annulling,
   we delete the merged insn.  */

I think the REGOUT dependency on g1 between insns 38 and 43 in the delay slots is not being considered by 'try_merge_delay_insns'.

Is this a bug?

Regards,
Shafi
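The dependency described in this message can be pictured with a toy model. The sketch below is not GCC's RTL or the reorg.c API; struct toy_insn, the register bitmasks, and merge_is_safe are hypothetical names used only to show one necessary condition: a thread insn that textually matches a delay-slot insn may be deleted only if no later delay-slot insn writes a register the matched insn reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction: bitmasks of the registers it reads and writes.  */
struct toy_insn
{
  unsigned uses;  /* registers read */
  unsigned sets;  /* registers written */
};

/* Register bits, loosely following the names in the dump above.  */
enum { G1 = 1u << 1, G2 = 1u << 2, G3 = 1u << 3, A0 = 1u << 28 };

/* Return true if a thread insn identical to delay-slot insn SLOT can be
   deleted.  A later delay-slot insn that writes a register read by the
   matched insn means the thread copy would see a different value.  */
bool
merge_is_safe (const struct toy_insn *slots, size_t n_slots, size_t slot)
{
  assert (slot < n_slots);
  for (size_t i = slot + 1; i < n_slots; i++)
    if (slots[i].sets & slots[slot].uses)
      return false;
  return true;
}
```

Modelling INSN 38 as { A0 | G1, 0 } and INSN 43 as { G1 | G3, G1 }, merge_is_safe rejects the merge for slot 0, which matches the reporter's expectation that INSN 44 must be kept.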
Re: Issue with delay slot scheduling?
On 09/06/11 08:46, Mohamed Shafi wrote:
> Hi,
>
> I am doing a private port in GCC 4.5.1. For the my target i see some
> strange behavior in delay slot scheduling. For my target the
> instruction in the delay slots gets executed irrespective of whether
> the branch is taken or not. I have generated the following code
> after commenting out the call to 'relax_delay_slots' in the function
> 'dbr_schedule'.
[ ... ]

It looks like you have found a bug. While reorg.c is supposed to work
with targets that have multiple delay slots, it's not something that has
been extensively tested.

> I think REGOUT dependency of g1 between instructions 38 and 43 in
> the delay slot is not being considered by 'try_merge_delay_insns'.

You're probably correct.

Jeff
Is this correct behaviour?
Hi,

I compiled the following code with ARM GCC 4.6 (x86 with one of the 4.7 snapshots behaves similarly). I noticed that "a" is written to memory three times instead of being incremented by 3 and written once at the end. Doesn't restrict guarantee that "a" is not aliased by "p", so that the three "a++" statements can be combined?

Thanks,
Bingfeng Mei

int a;
int P[100];
void foo (int * restrict p)
{
  P[0] = *p;
  a++;
  P[1] = *p;
  a++;
  P[2] = *p;
  a++;
}

~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L2
        ldr     r1, [r3, #0]
        ldr     ip, [r0, #0]
        ldr     r2, .L2+4
        str     r4, [sp, #-4]!
        add     r4, r1, #1
        str     r4, [r3, #0]
        str     ip, [r2, #0]
        ldr     ip, [r0, #0]
        add     r4, r1, #2
        str     r4, [r3, #0]
        str     ip, [r2, #4]
        ldr     r0, [r0, #0]
        add     r1, r1, #3
        str     r0, [r2, #8]
        str     r1, [r3, #0]
        ldmfd   sp!, {r4}
        bx      lr
Re: Issue with delay slot scheduling?
> I am doing a private port in GCC 4.5.1. For the my target i see some
> strange behavior in delay slot scheduling. For my target the
> instruction in the delay slots gets executed irrespective of whether
> the branch is taken or not.

Early 4.5.x releases have known bugs in this area. You'd need to upgrade to 4.5.3 at least (or use the SVN 4.5 branch). That being said, targets with multiple delay slots are indeed relatively untested.

--
Eric Botcazou
Re: Is this correct behaviour?
On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei wrote:
> Hi,
> I compile the following code with arm gcc 4.6 (x86 is the similar with one of
> 4.7 snapshot).
> I noticed "a" is written to memory three times instead of being added by 3
> and written at the
> end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3 "a++" can be
> optimized?

No it does not.

> Thanks,
> Bingfeng Mei
>
> int a;
> int P[100];
> void foo (int * restrict p)
> {
>   P[0] = *p;
>   a++;
>   P[1] = *p;
>   a++;
>   P[2] = *p;
>   a++;
> }
>
> ~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99
>
> foo:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r3, .L2
>         ldr     r1, [r3, #0]
>         ldr     ip, [r0, #0]
>         ldr     r2, .L2+4
>         str     r4, [sp, #-4]!
>         add     r4, r1, #1
>         str     r4, [r3, #0]
>         str     ip, [r2, #0]
>         ldr     ip, [r0, #0]
>         add     r4, r1, #2
>         str     r4, [r3, #0]
>         str     ip, [r2, #4]
>         ldr     r0, [r0, #0]
>         add     r1, r1, #3
>         str     r0, [r2, #8]
>         str     r1, [r3, #0]
>         ldmfd   sp!, {r4}
>         bx      lr
>
>
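Since restrict alone does not produce the single store here, one manual option, not taken from the thread, is to keep the counter in a local variable and write it back once. This is only a sketch: foo_single_store is a hypothetical name, and the function is equivalent to the original foo only under the assumption that p never points to a, which the caller would have to guarantee.

```c
int a;
int P[100];

/* Same loads and stores to P as the original foo, but the counter is
   accumulated in a local and stored to the global exactly once.
   Assumption: p does not point to a.  */
void
foo_single_store (int *restrict p)
{
  int t = a;   /* single load of the global */
  P[0] = *p;
  t++;
  P[1] = *p;
  t++;
  P[2] = *p;
  t++;
  a = t;       /* single store of the global */
}
```

Because t is a local whose address is never taken, the compiler is free to fold the three increments into one addition regardless of what p may alias.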
RE: Is this correct behaviour?
> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 06 September 2011 16:42
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is this correct behaviour?
>
> On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei wrote:
> > Hi,
> > I compile the following code with arm gcc 4.6 (x86 is the similar
> with one of 4.7 snapshot).
> > I noticed "a" is written to memory three times instead of being added
> by 3 and written at the
> > end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3
> "a++" can be optimized?
>
> No it does not.

Then how do I tell compiler that "a" is not aliased if I have to use global variable?

>
> > Thanks,
> > Bingfeng Mei
> >
> > int a;
> > int P[100];
> > void foo (int * restrict p)
> > {
> >   P[0] = *p;
> >   a++;
> >   P[1] = *p;
> >   a++;
> >   P[2] = *p;
> >   a++;
> > }
> >
> > ~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99
> >
> > foo:
> >         @ args = 0, pretend = 0, frame = 0
> >         @ frame_needed = 0, uses_anonymous_args = 0
> >         @ link register save eliminated.
> >         ldr     r3, .L2
> >         ldr     r1, [r3, #0]
> >         ldr     ip, [r0, #0]
> >         ldr     r2, .L2+4
> >         str     r4, [sp, #-4]!
> >         add     r4, r1, #1
> >         str     r4, [r3, #0]
> >         str     ip, [r2, #0]
> >         ldr     ip, [r0, #0]
> >         add     r4, r1, #2
> >         str     r4, [r3, #0]
> >         str     ip, [r2, #4]
> >         ldr     r0, [r0, #0]
> >         add     r1, r1, #3
> >         str     r0, [r2, #8]
> >         str     r1, [r3, #0]
> >         ldmfd   sp!, {r4}
> >         bx      lr
> >
> >
Re: [PLUGIN] dlopen and RTLD_NOW
On 09/05/2011 12:50 AM, Romain Geissler wrote:
> Hi
>
> Is there any particular reason to load plugins with the RTLD_NOW option?
> This option forces .so symbol resolution to be completed at load time,
> but this could be done only when a symbol is needed (RTLD_LAZY).
>
> Here is the dlopen line in plugin.c:
> dl_handle = dlopen (plugin->full_name, RTLD_NOW | RTLD_GLOBAL);
>
> My issue is, I want to load the same plugin.so in both cc1 and cc1plus,
> but in the C++ case, I may need to reference some cc1plus specific
> symbols. I can check whether cc1 or cc1plus loaded the plugin and thus
> use custom C++ symbols only when present. With RTLD_NOW, the plugin
> fails to load in cc1 as symbol resolution is forced at load time.

Can you supply weak binding implementations for the missing functions? That might allow the linking to succeed.

David Daney
Re: [PLUGIN] dlopen and RTLD_NOW
On 09/06/2011 10:55 AM, David Daney wrote:
> On 09/05/2011 12:50 AM, Romain Geissler wrote:
>> Hi
>>
>> Is there any particular reason to load plugins with the RTLD_NOW option?
>> This option forces .so symbol resolution to be completed at load time,
>> but this could be done only when a symbol is needed (RTLD_LAZY).
>>
>> Here is the dlopen line in plugin.c:
>> dl_handle = dlopen (plugin->full_name, RTLD_NOW | RTLD_GLOBAL);
>>
>> My issue is, I want to load the same plugin.so in both cc1 and cc1plus,
>> but in the C++ case, I may need to reference some cc1plus specific
>> symbols. I can check whether cc1 or cc1plus loaded the plugin and thus
>> use custom C++ symbols only when present. With RTLD_NOW, the plugin
>> fails to load in cc1 as symbol resolution is forced at load time.
>
> Can you supply weak binding implementations for the missing functions?
> That might allow the linking to succeed.

... And if I had read the entire thread before responding, I would have seen that others had already suggested the same thing.

Sorry for the noise.

David Daney
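Re: Is this correct behaviour?
"Bingfeng Mei" writes:

> Then how do I tell compiler that "a" is not aliased if I have to use global
> variable?
>
>> > Thanks,
>> > Bingfeng Mei
>> >
>> > int a;
>> > int P[100];
>> > void foo (int * restrict p)
>> > {
>> >   P[0] = *p;
>> >   a++;
>> >   P[1] = *p;
>> >   a++;
>> >   P[2] = *p;
>> >   a++;
>> > }

How about

int a;
int P[100];

/* Declared before use; C99 has no implicit function declarations.  */
void foo1 (int * restrict p, int * restrict pp, int * restrict pa);

void foo (int * restrict p)
{
  foo1 (p, P, &a);
}

/* All three objects are reached only through restrict-qualified
   parameters, so the counter is updated through pa rather than
   through the global name.  */
void foo1 (int * restrict p, int * restrict pp, int * restrict pa)
{
  pp[0] = *p;
  (*pa)++;
  pp[1] = *p;
  (*pa)++;
  pp[2] = *p;
  (*pa)++;
}

Ian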
gcc-4.4-20110906 is now available
Snapshot gcc-4.4-20110906 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.4-20110906/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.4 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_4-branch revision 178615

You'll find:

 gcc-4.4-20110906.tar.bz2             Complete GCC

  MD5=a2aa3066e8b004051649ca4a0ab2af3e
  SHA1=da4655f17827c6012af66a94101f106411a3d170

Diffs from 4.4-20110830 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.4
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.
Re: Adding fstack-protector prologue to get_pc_thunk for targets with TARGET_PAD_SHORT_FUNCTION
On Thu, Jun 9, 2011 at 11:17 AM, Ian Lance Taylor wrote:
> asharif tools writes:
>
>> On Wed, Jun 8, 2011 at 10:32 PM, Ian Lance Taylor wrote:
>>> asharif tools writes:
>>>
>>>> function:
>>>>         call    __i686.get_pc_thunk.bx
>>>>         addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>>>>         movl    %gs:20, %eax            # Stack-guard init
>>>>         movl    %eax, -12(%ebp)         # Stack-guard init
>>>>
>>>> Now, what I want to do is move the stack guard initialization part
>>>> (consisting of the two instructions I have commented as "Stack-guard
>>>> init") into get_pc_thunk.bx for those functions that have both the
>>>> stack guard and a call to get_pc_thunk.bx. The compiler should
>>>> generate a "stack_guarded_get_pc_thunk.bx" that will move the %gs:20
>>>> value to the correct location on the stack instead of executing
>>>> nops. In this way some useful work can be done instead of nops.
>>>
>>> I don't understand how you can do that. The offset from %ebp will be
>>> different in different functions. When optimizing, it is likely to be
>>> an offset from %esp instead. The scratch register used may also be
>>> different; consider functions with __attribute__ ((regparm (2))), or the
>>> use of -mregparm=2.
>>
>> I see.
>>
>> Would it be possible for the caller of stack_protected_get_pc_thunk to
>> pass in this offset from gs in the return register (ebx in this case)
>> in all the cases you described?
>
> You mean the offset from %esp or %ebp. This would require an leal
> instruction, so now you are only saving one instruction. And that by
> itself would not be enough, because __stack_protected_get_pc_thunk would
> not know which register it could use to move the value. But you could
> have different variants of the function, or it could preserve the
> register. With those conditions, yes, I think it would be possible.
> But the savings seems fairly minimal to me, and it only matters on the
> Atom. Not that I want to stop you if you are interested.
Ian,

I got this to work with -O0 and a patch is attached for those who want to take a peek (it's a big hack right now and needs a lot of clean-up).

This is what it does:

1. When gcc decides to add a call to get_pc_thunk for accessing globals with -fPIE, it checks if the stack guard is present in the current function. If so, it notes the base register, the offset and the scratch register used to move the stack guard from gs:0x14 to the base of the stack.

2. During the emission of get_pc_thunk, it generates extra get_pc_thunk()-like functions that use the base register, offset and scratch register noted in step (1).

I learnt several things from implementing this and I want to improve on this implementation (of course a final clean-up would be required, like changing the static array of get_pc_thunks to a VEC() or GTY(), etc., before I put this patch up for review). But before that I want some input from you.

Here are some drawbacks of this current implementation:

a. The one of immediate concern is that -O2 doesn't work with it. The reason is that between the call to get_pc_thunk() and the assembly to move the stack guard to the stack, there could be a write to the base register that was noted in step (1) above. So I'd have to note the def of that register and make sure that the call to get_pc_thunk() as well as all uses of the return register is after that def.

b. It is too specific. I was thinking of scanning RTL instructions just before and after the get_pc_thunk() call and moving them to unique get_pc_thunk() functions instead of the nops that currently reside there. I could have a knob to control how many instructions to move there. For this transformation to be safe, I'd have to make sure offsets to esp are moved by 4 and the return register is not used in any of those instructions (because I want to fill up nops before the def of that return register).
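Steps 1 and 2 above can be pictured roughly as a small table keyed by the noted triple, with one thunk variant per distinct entry. This is a sketch under stated assumptions and not the attached patch: struct guard_key mirrors the three fields described in step 1, while guard_key_index, guard_thunk_name and the name format are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* One record per distinct (base register, offset, scratch register).  */
struct guard_key
{
  int stack_reg;
  int stack_offset;
  int scratch_reg;
};

#define MAX_GUARD_KEYS 0x100

static struct guard_key guard_keys[MAX_GUARD_KEYS];
static int guard_keys_size;

/* Return the index of KEY's thunk variant, recording it on first use so
   that functions sharing a triple also share one thunk.  */
int
guard_key_index (struct guard_key key)
{
  for (int i = 0; i < guard_keys_size; i++)
    if (guard_keys[i].stack_reg == key.stack_reg
        && guard_keys[i].stack_offset == key.stack_offset
        && guard_keys[i].scratch_reg == key.scratch_reg)
      return i;
  assert (guard_keys_size < MAX_GUARD_KEYS);
  guard_keys[guard_keys_size] = key;
  return guard_keys_size++;
}

/* Write a hypothetical per-variant thunk name into BUF.  */
void
guard_thunk_name (char *buf, size_t len, int index)
{
  snprintf (buf, len, "__stack_guarded_get_pc_thunk.%d", index);
}
```

A call site would look up its triple with guard_key_index and call the thunk named by guard_thunk_name; the emission pass would then walk guard_keys once and emit one thunk body per entry.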
For (b), I'd like to save the RTL of the instructions around the call to get_pc_thunk and delete them from the function. Then, in ix86_code_end(), I want to be able to re-emit that RTL in assembly form. Do you think that is feasible? Is there a utility function to print RTL in assembly form easily so I can just use output_asm_insn in ix86_code_end()?

>
> Ian
>

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 16d977e..1be797c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -8768,11 +8768,30 @@ ix86_setup_frame_addresses (void)
 
 static int pic_labels_used;
 
+typedef struct
+{
+  int stack_reg;
+  int stack_offset;
+  int scratch_reg;
+} stack_guard_code;
+
+
+/* TODO: Do this using a VEC */
+/*
+ DEF_VEC_P(stack_guard_code);
+DEF_VEC_ALLOC_P(stack_guard_code, gc);
+*
+ * static VEC(stack_guard_code, gc) stack_guard_codes; */
+static int stack_guard_codes_size;
+static stack_guard_code stack_guard_codes[0x100];
+
+int GET_PC_THUNK_NAME_SIZE = 0x100;
+
 /* Fills in the label name that should be used for a pc thunk for
    the given register.  */
 
 static void
-get_pc
Re: Issue with delay slot scheduling?
On 6 September 2011 20:50, Jeff Law wrote:
>
> On 09/06/11 08:46, Mohamed Shafi wrote:
>> Hi,
>>
>> I am doing a private port in GCC 4.5.1. For the my target i see some
>> strange behavior in delay slot scheduling. For my target the
>> instruction in the delay slots gets executed irrespective of whether
>> the branch is taken or not. I have generated the following code
>> after commenting out the call to 'relax_delay_slots' in the function
>> 'dbr_schedule'.
> [ ... ]
> It looks like you have found a bug. While reorg.c is supposed to work
> with targets that have multiple delay slots, it's not something that has
> been extensively tested.
>
>> I think REGOUT dependency of g1 between instructions 38 and 43 in
>> the delay slot is not being considered by 'try_merge_delay_insns'.
> You're probably correct.
>
> Jeff

How do I raise a bug report, mine being a private target?

Regards,
Shafi