Re: [PLUGIN] dlopen and RTLD_NOW

2011-09-06 Thread Romain Geissler
2011/9/5 Jakub Jelinek :
> On Mon, Sep 05, 2011 at 10:22:10AM -0700, Andrew Pinski wrote:
>> On Mon, Sep 5, 2011 at 1:10 AM, Jakub Jelinek  wrote:
>> > That said, relying on lazy binding is terribly bad design.
>>
>> In fact I was going to say why can't those symbols be marked as weak
>> in your plugin?  You don't even need to change the GCC headers, just
>> have an extra header that does:
>> #pargma weak
>
> s/pargma/pragma/.  Yeah, making them weak will work just fine, independently
> on whether it is RTLD_NOW or not, or, when program is directly linked
> against it, with LD_BIND_NOW=1 or not.
>
>        Jakub
>

Thanks, it works fine. I didn't know about weak symbols.

Romain Geissler
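
For readers who have not used weak symbols before, here is a minimal sketch of the technique suggested above. It assumes GCC (or a compatible compiler) on an ELF target; the names cxx_only_hook and call_hook_if_present are hypothetical and are not taken from the GCC plugin headers. An undefined weak symbol resolves to address 0 instead of causing a load failure, so the plugin can test for it at run time.

```c
#include <stddef.h>

/* Hypothetical stand-in for a symbol that only cc1plus would export.
   Declaring it weak lets the loader leave it unresolved (address 0)
   instead of failing, whether binding is lazy or immediate.  */
extern int cxx_only_hook (int) __attribute__ ((weak));

/* Call the hook when the host executable provides it; otherwise
   report its absence with -1.  */
int
call_hook_if_present (int arg)
{
  if (cxx_only_hook != NULL)
    return cxx_only_hook (arg);
  return -1;
}
```

When nothing defines cxx_only_hook, call_hook_if_present returns -1 rather than crashing or failing to load.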


Issue with delay slot scheduling?

2011-09-06 Thread Mohamed Shafi
Hi,

I am doing a private port in GCC 4.5.1. For my target I see some
strange behavior in delay slot scheduling. On my target the
instructions in the delay slots are executed irrespective of whether
the branch is taken or not. I have generated the following code after
commenting out the call to 'relax_delay_slots' in the function
'dbr_schedule'.

RTL:

(insn 97 42 51 del1.c:19 (sequence [
    (jump_insn 61 42 38 del1.c:19 (set (pc)
    (if_then_else (ne (reg:CCF 34 CC)
    (const_int 0 [0x0]))
    (label_ref:PQI 86)
    (pc))) 56 {conditional_branch}
(expr_list:REG_BR_PRED (const_int 5 [0x5])
    (expr_list:REG_DEAD (reg:CCF 34 CC)
    (expr_list:REG_BR_PROB (const_int 5000 [0x1388])
    (nil
 -> 86)
    (insn 38 61 43 (set (mem/s/j:QI (reg/f:PQI 28 a0 [orig:62
D.1955 ] [62]) [0 bytes S1 A32])
    (reg:QI 1 g1 [orig:65 D.1938 ] [65])) 7 {movqi_op} (nil))
    (insn 43 38 51 (set (reg:QI 1 g1 [75])
    (ior:QI (reg:QI 1 g1 [orig:65 D.1938 ] [65])
    (reg:QI 3 g3 [77]))) 31 {iorqi3}
(expr_list:REG_EQUAL (ior:QI (reg:QI 1 g1 [orig:65 D.1938 ] [65])
    (const_int 128 [0x80]))
    (nil)))
    ]) -1 (nil))

(code_label 51 97 52 1 "" [2 uses])

(note 52 51 73 [bb 4] NOTE_INSN_BASIC_BLOCK)

(jump_insn 73 52 72 (return) 72 {return_rts} (expr_list:REG_BR_PRED
(const_int 12 [0xc])
    (nil)))

(barrier 72 73 86)

(code_label 86 72 41 5 "" [1 uses])

(note 41 86 45 [bb 5] NOTE_INSN_BASIC_BLOCK)

(insn 45 41 44 del1.c:20 (set (reg:QI 2 g2 [orig:68 ivtmp.7 ] [68])
    (plus:QI (reg:QI 2 g2 [orig:68 ivtmp.7 ] [68])
    (const_int 1 [0x1]))) 13 {addqi3} (nil))

(insn 44 45 101 del1.c:20 (set (mem/s/j:QI (reg/f:PQI 28 a0 [orig:62
D.1955 ] [62]) [0 bytes S1 A32])
    (reg:QI 1 g1 [75])) 7 {movqi_op} (expr_list:REG_DEAD
(reg/f:PQI 28 a0 [orig:62 D.1955 ] [62])
    (expr_list:REG_DEAD (reg:QI 1 g1 [75])
    (nil

(code_label 101 44 79 7 "" [1 uses])


Corresponding code:

jmp.ne  .L5;
st  [a0], g1; (INSN 38)
or  g1, g1, g3;  (INSN 43)
.L1:
rts;
nop;
nop;
.L5:
add   g2, g2, 1;   (INSN 45)
st  [a0], g1;(INSN 44)  -> deleted
.L7:



You can see that INSN 44 and INSN 38 are identical. In
'relax_delay_slots', while processing INSN 97, the second call to
'try_merge_delay_insns' deletes INSN 44, and as a result incorrect
code is generated.

  /* If we own the thread opposite the way this insn branches, see if we
     can merge its delay slots with following insns.  */
  if (INSN_FROM_TARGET_P (XVECEXP (pat, 0, 1))
      && own_thread_p (NEXT_INSN (insn), 0, 1))
    try_merge_delay_insns (insn, next);
  else if (! INSN_FROM_TARGET_P (XVECEXP (pat, 0, 1))
           && own_thread_p (target_label, target_label, 0))
    try_merge_delay_insns (insn, next_active_insn (target_label));

Deleting INSN 44 would have been correct if the 2nd delay slot insn
had not modified g1. But the comment above the function
'try_merge_delay_insns' says:

/* Try merging insns starting at THREAD which match exactly the insns in
   INSN's delay list.

   If all insns were matched and the insn was previously annulling, the
   annul bit will be cleared.

   For each insn that is merged, if the branch is or will be non-annulling,
   we delete the merged insn.  */

I think REGOUT dependency of g1 between instructions 38 and 43 in the
delay slot is not being considered by 'try_merge_delay_insns'.

Is this a bug?

Regards,
Shafi
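
To make the reported miscompilation concrete, the branch-taken path can be modelled in a few lines of C. This is only a toy model of the listing above, assuming both delay slots always execute; the struct and function names are hypothetical, and the single byte mem stands for the location addressed by a0.

```c
/* Toy model of the registers and the single memory byte at [a0].  */
struct machine
{
  unsigned char g1, g2, g3;
  unsigned char mem;
};

/* Execute the branch-taken path of the listing.  KEEP_INSN_44 selects
   between the code before (1) and after (0) INSN 44 was deleted.
   Returns the final contents of [a0].  */
unsigned char
run_taken_path (struct machine m, int keep_insn_44)
{
  m.mem = m.g1;                          /* INSN 38, delay slot 1 */
  m.g1 = (unsigned char) (m.g1 | m.g3);  /* INSN 43, delay slot 2 */
  m.g2 = (unsigned char) (m.g2 + 1);     /* INSN 45 */
  if (keep_insn_44)
    m.mem = m.g1;                        /* INSN 44 */
  return m.mem;
}
```

With g1 = 0x01 and g3 = 0x80, the original sequence leaves 0x81 in memory, while the sequence without INSN 44 leaves 0x01, because the store in the first delay slot happens before g1 is updated.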


Re: Issue with delay slot scheduling?

2011-09-06 Thread Jeff Law
On 09/06/11 08:46, Mohamed Shafi wrote:
> Hi,
> 
> I am doing a private port in GCC 4.5.1. For the my target i see some 
> strange behavior in delay slot scheduling. For my target the 
> instruction in the delay slots gets executed irrespective of whether 
> the branch is taken or not. I have generated the following code
> after commenting out the call to 'relax_delay_slots' in the function 
> 'dbr_schedule'.
[ ... ]
It looks like you have found a bug.  While reorg.c is supposed to work
with targets that have multiple delay slots, it's not something that has
been extensively tested.

>> 
> I think REGOUT dependency of g1 between instructions 38 and 43 in
> the delay slot is not being considered by 'try_merge_delay_insns'.
You're probably correct.

Jeff
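
As an illustration of the kind of check that appears to be missing, here is a simplified sketch. It is not code from reorg.c: the slot_insn structure, the register bitmasks, and merge_is_safe are all hypothetical, and memory dependencies are ignored. The idea is that an insn in the thread may only be merged with an identical insn in delay slot IDX if no later delay slot writes a register that the insn reads or writes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an insn: bitmasks of the hard registers it
   reads (uses) and writes (sets).  */
struct slot_insn
{
  unsigned uses;
  unsigned sets;
};

/* Return true when the insn in delay slot IDX computes the same thing
   as an identical insn placed after the whole delay list, i.e. when no
   later slot overwrites a register it depends on.  */
bool
merge_is_safe (const struct slot_insn *slots, size_t n, size_t idx)
{
  for (size_t j = idx + 1; j < n; j++)
    if (slots[j].sets & (slots[idx].uses | slots[idx].sets))
      return false;
  return true;
}
```

In the reported case slot 0 (INSN 38) reads g1 and slot 1 (INSN 43) writes g1, so such a check would reject merging INSN 44 with INSN 38.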


Is this correct behaviour?

2011-09-06 Thread Bingfeng Mei
Hi, 
I compiled the following code with ARM GCC 4.6 (x86 behaves similarly
with a 4.7 snapshot). I noticed that "a" is written to memory three
times instead of being incremented by 3 and written once at the end.
Doesn't restrict guarantee that "a" won't be aliased by "p", so that
the three "a++" can be combined?

Thanks,
Bingfeng Mei

int a;
int P[100];
void foo (int * restrict p)
{
  P[0] = *p;
  a++;
  P[1] = *p;
  a++;
  P[2] = *p;
  a++;
}

~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99

foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L2
ldr r1, [r3, #0]
ldr ip, [r0, #0]
ldr r2, .L2+4
str r4, [sp, #-4]!
add r4, r1, #1
str r4, [r3, #0]
str ip, [r2, #0]
ldr ip, [r0, #0]
add r4, r1, #2
str r4, [r3, #0]
str ip, [r2, #4]
ldr r0, [r0, #0]
add r1, r1, #3
str r0, [r2, #8]
str r1, [r3, #0]
ldmfd   sp!, {r4}
bx  lr



Re: Issue with delay slot scheduling?

2011-09-06 Thread Eric Botcazou
> I am doing a private port in GCC 4.5.1. For the my target i see some
> strange behavior in delay slot scheduling. For my target the
> instruction in the delay slots gets executed irrespective of whether
> the branch is taken or not. 

Early 4.5.x releases have known bugs in this area.  You'd need to upgrade to 
4.5.3 at least (or use the SVN 4.5 branch).  That being said, targets with 
multiple delay slots are indeed relatively untested.

-- 
Eric Botcazou


Re: Is this correct behaviour?

2011-09-06 Thread Richard Guenther
On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei  wrote:
> Hi,
> I compile the following code with arm gcc 4.6 (x86 is the similar with one of 
> 4.7 snapshot).
> I noticed "a" is written to memory three times instead of being added by 3 
> and written at the
> end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3 "a++" can be 
> optimized?

No it does not.



RE: Is this correct behaviour?

2011-09-06 Thread Bingfeng Mei


> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 06 September 2011 16:42
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is this correct behaviour?
> 
> On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei  wrote:
> > Hi,
> > I compile the following code with arm gcc 4.6 (x86 is the similar
> with one of 4.7 snapshot).
> > I noticed "a" is written to memory three times instead of being added
> by 3 and written at the
> > end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3
> "a++" can be optimized?
> 
> No it does not.

Then how do I tell the compiler that "a" is not aliased if I have to use a
global variable?





Re: [PLUGIN] dlopen and RTLD_NOW

2011-09-06 Thread David Daney

On 09/05/2011 12:50 AM, Romain Geissler wrote:

> Hi
>
> Is there any particular reason to load plugins with the RTLD_NOW option?
> This option forces .so symbol resolution to be done completely at load time,
> but it could instead be done only when a symbol is needed (RTLD_LAZY).
>
> Here is the dlopen line in plugin.c:
> dl_handle = dlopen (plugin->full_name, RTLD_NOW | RTLD_GLOBAL);
>
> My issue is, I want to load the same plugin.so in both cc1 and cc1plus, but
> in the C++ case, I may need to reference some cc1plus specific symbols. I can
> check whether cc1 or cc1plus loaded the plugin and thus use custom C++
> symbols only when present. With RTLD_NOW, the plugin fails to load in cc1 as
> symbol resolution is forced at load time.



Can you supply weak binding implementations for the missing functions? 
That might allow the linking to succeed.


David Daney
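
For reference, the two binding modes under discussion differ only in the flag passed to dlopen. The sketch below is a hypothetical helper, not the code in GCC's plugin.c; it only uses the POSIX calls dlopen and dlerror. Note that on glibc, lazy binding applies to function references only, so references to variables are still resolved at load time even with RTLD_LAZY.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: open PATH with either lazy or immediate symbol
   binding.  Returns the handle, or NULL after printing the loader's
   error message.  */
void *
open_plugin (const char *path, int lazy)
{
  int flags = (lazy ? RTLD_LAZY : RTLD_NOW) | RTLD_GLOBAL;
  void *handle = dlopen (path, flags);

  if (handle == NULL)
    fprintf (stderr, "cannot load plugin %s: %s\n", path, dlerror ());
  return handle;
}
```

On older glibc versions this needs to be linked with -ldl.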



Re: [PLUGIN] dlopen and RTLD_NOW

2011-09-06 Thread David Daney

On 09/06/2011 10:55 AM, David Daney wrote:

> On 09/05/2011 12:50 AM, Romain Geissler wrote:
>> Hi
>>
>> Is there any particular reason to load plugins with the RTLD_NOW option?
>> This option forces .so symbol resolution to be done completely at load
>> time, but it could instead be done only when a symbol is needed
>> (RTLD_LAZY).
>>
>> Here is the dlopen line in plugin.c:
>> dl_handle = dlopen (plugin->full_name, RTLD_NOW | RTLD_GLOBAL);
>>
>> My issue is, I want to load the same plugin.so in both cc1 and
>> cc1plus, but in the C++ case, I may need to reference some cc1plus
>> specific symbols. I can check whether cc1 or cc1plus loaded the plugin
>> and thus use custom C++ symbols only when present. With RTLD_NOW, the
>> plugin fails to load in cc1 as symbol resolution is forced at load time.
>
> Can you supply weak binding implementations for the missing functions?
> That might allow the linking to succeed.



... And if I read the entire thread before responding, I would have seen 
that others had already suggested the same thing.


Sorry for the noise.

David Daney


Re: Is this correct behaviour?

2011-09-06 Thread Ian Lance Taylor
"Bingfeng Mei"  writes:

> Then how do I tell compiler that "a" is not aliased if I have to use global 
> variable? 
>
>> 
>> > Thanks,
>> > Bingfeng Mei
>> >
>> > int a;
>> > int P[100];
>> > void foo (int * restrict p)
>> > {
>> >  P[0] = *p;
>> >  a++;
>> >  P[1] = *p;
>> >  a++;
>> >  P[2] = *p;
>> >  a++;
>> > }

How about

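int a;
int P[100];
/* Declared before use so that foo can call it.  */
void foo1 (int * restrict p, int * restrict pp, int * restrict pa);
void foo (int * restrict p)
{
  foo1 (p, P, &a);
}
void foo1 (int * restrict p, int * restrict pp, int * restrict pa)
{
  pp[0] = *p;
  (*pa)++;   /* access "a" only through the restrict pointer */
  pp[1] = *p;
  (*pa)++;
  pp[2] = *p;
  (*pa)++;
}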

Ian
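
Another common workaround, shown here only as a sketch and not something proposed in this thread, is to accumulate into a local variable so that the global is loaded and stored once regardless of what the alias analysis can prove. The function name foo_local is hypothetical. The rewrite is equivalent to the original only if p does not point to a, which is exactly what the programmer is asserting.

```c
int a;
int P[100];

/* Same effect as the original foo when p does not point to a: the
   global is read once, incremented in a local, and written back once.  */
void
foo_local (int *restrict p)
{
  int tmp = a;      /* single load of the global */

  P[0] = *p;
  tmp++;
  P[1] = *p;
  tmp++;
  P[2] = *p;
  tmp++;
  a = tmp;          /* single store of the global */
}
```

Because tmp is a local whose address is never taken, the compiler is free to fold the three increments into one addition.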


gcc-4.4-20110906 is now available

2011-09-06 Thread gccadmin
Snapshot gcc-4.4-20110906 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.4-20110906/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.4 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_4-branch 
revision 178615

You'll find:

 gcc-4.4-20110906.tar.bz2 Complete GCC

  MD5=a2aa3066e8b004051649ca4a0ab2af3e
  SHA1=da4655f17827c6012af66a94101f106411a3d170

Diffs from 4.4-20110830 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.4
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Adding fstack-protector prologue to get_pc_thunk for targets with TARGET_PAD_SHORT_FUNCTION

2011-09-06 Thread asharif tools
On Thu, Jun 9, 2011 at 11:17 AM, Ian Lance Taylor  wrote:
> asharif tools  writes:
>
>> On Wed, Jun 8, 2011 at 10:32 PM, Ian Lance Taylor  wrote:
>>> asharif tools  writes:
>>>
>>>> function:
>>>>        call    __i686.get_pc_thunk.bx
>>>>        addl    $_GLOBAL_OFFSET_TABLE_, %ebx
>>>>        movl    %gs:20, %eax # Stack-guard init
>>>>        movl    %eax, -12(%ebp) # Stack-guard init
>>>>
>>>> Now, what I want to do is move the stack guard initialization part
>>>> (consisting of the two instructions I have commented as "Stack-guard
>>>> init") into get_pc_thunk.bx for those functions that have both the
>>>> stack guard and a call to get_pc_thunk.bx. The compiler should
>>>> generate a "stack_guarded_get_pc_thunk.bx" that will move the
>>>> %gs:20 value to the correct location on the stack instead of
>>>> executing nops. In this way some useful work can be done instead of
>>>> nops.
>>>
>>> I don't understand how you can do that.  The offset from %ebp will be
>>> different in different functions.  When optimizing, it is likely to be
>>> an offset from %esp instead.  The scratch register used may also be
>>> different; consider functions with __attribute__ ((regparm (2))), or the
>>> use of -mregparm=2.
>>
>> I see.
>>
>> Would it be possible for the caller of stack_protected_get_pc_thunk to
>> pass in this offset from gs in the return register (ebx in this case)
>> in all the cases you described?
>
> You mean the offset from %esp or %ebp.  This would require an leal
> instruction, so now you are only saving one instruction.  And that by
> itself would not be enough, because __stack_protected_get_pc_thunk would
> not know which register it could use to move the value.  But you could
> have different variants of the function, or it could preserve the
> register.  With those conditions, yes, I think it would be possible.
> But the savings seems fairly minimal to me, and it only matters on the
> Atom.  Not that I want to stop you if you are interested.


Ian, I got this to work with -O0 and a patch is attached for those who
want to take a peek (It's a big hack right now and needs a lot of
clean-up).

This is what it does:

1. When gcc decides to add a call to get_pc_thunk for accessing
globals with -fPIE, it checks if the stack guard is present in the
current function. If so, it notes the base register, the offset and
the scratch register used to move the stack guard from gs:0x14 to the
base of the stack.
2. During the emission of get_pc_thunk, it generates extra
get_pc_thunk()-like functions that use the base register, offset and
scratch register noted in step (1).
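
The bookkeeping in the two steps above can be sketched in isolation as follows. This is a simplified illustration, not the attached patch: the struct mirrors the one in the patch, but note_stack_guard_code, stack_guard_thunk_name, and the naming scheme are hypothetical. Each distinct (base register, offset, scratch register) triple gets one slot, so that one thunk variant is emitted per triple.

```c
#include <stdio.h>
#include <stddef.h>

typedef struct
{
  int stack_reg;      /* base register of the stack guard slot */
  int stack_offset;   /* offset of the slot from that register */
  int scratch_reg;    /* register used to copy the guard value */
} stack_guard_code;

#define MAX_STACK_GUARD_CODES 0x100

static stack_guard_code stack_guard_codes[MAX_STACK_GUARD_CODES];
static int stack_guard_codes_size;

/* Step 1: record a triple, reusing an existing entry when possible.
   Returns the entry's index, or -1 when the table is full.  */
int
note_stack_guard_code (int stack_reg, int stack_offset, int scratch_reg)
{
  for (int i = 0; i < stack_guard_codes_size; i++)
    if (stack_guard_codes[i].stack_reg == stack_reg
        && stack_guard_codes[i].stack_offset == stack_offset
        && stack_guard_codes[i].scratch_reg == scratch_reg)
      return i;

  if (stack_guard_codes_size == MAX_STACK_GUARD_CODES)
    return -1;

  stack_guard_codes[stack_guard_codes_size].stack_reg = stack_reg;
  stack_guard_codes[stack_guard_codes_size].stack_offset = stack_offset;
  stack_guard_codes[stack_guard_codes_size].scratch_reg = scratch_reg;
  return stack_guard_codes_size++;
}

/* Step 2: build a name for the thunk variant with the given index
   (hypothetical naming scheme).  Returns what snprintf returns.  */
int
stack_guard_thunk_name (char *buf, size_t len, int index)
{
  return snprintf (buf, len, "__stack_guarded_get_pc_thunk.%d", index);
}
```

A linear search is enough for a table of this size; a real implementation would presumably use one of GCC's own container types.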

I learnt several things from implementing this and I want to improve
on this implementation (of course a final clean-up would be required
like changing the static array of get_pc_thunks to a VEC() or GTY(),
etc. before I put this patch up for review). But before that I want
some input from you. Here are some drawbacks of this current
implementation:

a. The one of immediate concern is that -O2 doesn't work with it. The
reason is that between the call to get_pc_thunk() and the assembly to
move the stack guard to the stack, there could be a write to the base
register that was noted in step (1) above. So I'd have to note the def
of that register and make sure that the call to get_pc_thunk() as well
as all uses of the return register come after that def.
b. It is too specific. I was thinking of scanning RTL instructions
just before and after the get_pc_thunk() call and moving them to
unique get_pc_thunk() functions instead of the nops that currently
reside there. I could have a knob to control how many instructions to
move there. For this transformation to be safe, I'd have to make sure
offsets to esp are moved by 4 and the return register is not used in
any of those instructions (because I want to fill up nops before the
def of that return register).

For (b), I'd like to save the RTL of the instructions around the call
to get_pc_thunk and delete them from the function. Then, in
ix86_code_end(), I want to be able to re-emit that RTL in assembly
form. Do you think that is feasible? Is there a utility function to
print RTL in assembly form easily so I can just use output_asm_insn in
ix86_code_end()?

>
> Ian
>
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 16d977e..1be797c 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -8768,11 +8768,30 @@ ix86_setup_frame_addresses (void)
 
 static int pic_labels_used;
 
+typedef struct
+{
+  int stack_reg;
+  int stack_offset;
+  int scratch_reg;
+} stack_guard_code;
+
+
+/* TODO: Do this using a VEC */
+/* 
+ DEF_VEC_P(stack_guard_code);
+DEF_VEC_ALLOC_P(stack_guard_code, gc);
+*
+ * static VEC(stack_guard_code, gc) stack_guard_codes; */
+static int stack_guard_codes_size;
+static stack_guard_code stack_guard_codes[0x100];
+
+int GET_PC_THUNK_NAME_SIZE = 0x100;
+
 /* Fills in the label name that should be used for a pc thunk for
the given register.  */
 
 static void
-get_pc

Re: Issue with delay slot scheduling?

2011-09-06 Thread Mohamed Shafi
On 6 September 2011 20:50, Jeff Law  wrote:
>
> On 09/06/11 08:46, Mohamed Shafi wrote:
>> Hi,
>>
>> I am doing a private port in GCC 4.5.1. For the my target i see some
>> strange behavior in delay slot scheduling. For my target the
>> instruction in the delay slots gets executed irrespective of whether
>> the branch is taken or not. I have generated the following code
>> after commenting out the call to 'relax_delay_slots' in the function
>> 'dbr_schedule'.
> [ ... ]
> It looks like you have found a bug.  While reorg.c is supposed to work
> with targets that have multiple delay slots, it's not something that has
> been extensively tested.
>
>>>
>> I think REGOUT dependency of g1 between instructions 38 and 43 in
>> the delay slot is not being considered by 'try_merge_delay_insns'.
> You're probably correct.
>
> Jeff

How do I raise a bug report, given that mine is a private target?

Regards,
Shafi