Precompiled headers - still useful feature?

2015-05-27 Thread Martin Liška

Hello.

I would like to ask folks for their opinion on keeping support for precompiled
headers in future releases of GCC. From my point of view, the feature brings
some speed-up, but the question is whether it is worth it.

The last time I ran into precompiled headers was while rewriting the memory
allocation statistics infrastructure: when precompiled headers are used, GGC
memory is 'streamed' out and loaded back afterwards. Because of that, I was
unable to track some pointers that were allocated in the first phase of
compilation.

Here are numbers for the --disable-libstdcxx-pch option:

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz:
Bootstrap time w/ precompiled headers enabled: 35m47s (100.00%)
Bootstrap time w/ precompiled headers disabled: 36m27s (101.86%)

make -j9 check-target-libstdc++-v3 -k time:
precompiled headers enabled: 8m11s (100.00%)
precompiled headers disabled: 8m42s (106.31%)

Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz:
Bootstrap time w/ precompiled headers enabled: 57m35s (100.00%)
Bootstrap time w/ precompiled headers disabled: 57m12s (99.33%)

Feel free to send any statistics, opinions and ideas.

Thank you,
Martin


Re: Precompiled headers - still useful feature?

2015-05-27 Thread Markus Trippelsdorf
On 2015.05.27 at 10:14 +0200, Martin Liška wrote:
> I would like to ask folks for their opinion on keeping support for
> precompiled headers in future releases of GCC. From my point of view,
> the feature brings some speed-up, but the question is whether it is worth it.
> 
> The last time I ran into precompiled headers was while rewriting the
> memory allocation statistics infrastructure: when precompiled headers
> are used, GGC memory is 'streamed' out and loaded back afterwards.
> Because of that, I was unable to track some pointers that were
> allocated in the first phase of compilation.
> 
> Here are numbers for the --disable-libstdcxx-pch option:
> 
> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz:
> Bootstrap time w/ precompiled headers enabled: 35m47s (100.00%)
> Bootstrap time w/ precompiled headers disabled: 36m27s (101.86%)
> 
> make -j9 check-target-libstdc++-v3 -k time:
> precompiled headers enabled: 8m11s (100.00%)
> precompiled headers disabled: 8m42s (106.31%)
> 
> Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz:
> Bootstrap time w/ precompiled headers enabled: 57m35s (100.00%)
> Bootstrap time w/ precompiled headers disabled: 57m12s (99.33%)

Measuring the impact on bigger projects that use PCH, like Qt or Boost,
would perhaps be more informative.

And until C++ modules are implemented (unfortunately nobody is working
on this, AFAIK), PCH is still the only option left. So deprecating it
now seems premature.

-- 
Markus


Re: Precompiled headers - still useful feature?

2015-05-27 Thread Jonathan Wakely
On 27 May 2015 at 10:01, Markus Trippelsdorf wrote:
> And until C++ modules are implemented (unfortunately nobody is working
> on this, AFAIK), PCH is still the only option left. So deprecating it
> now seems premature.

I doubt anyone's going to implement them until they're specified; the
proposals are still evolving.


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-27 Thread Ilya Enkovich
2015-05-27 6:31 GMT+03:00 Jeff Law :
> On 05/25/2015 09:27 AM, Ilya Enkovich wrote:
>>
>> 2015-05-22 15:01 GMT+03:00 Ilya Enkovich :
>>>
>>> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich :

 2015-05-21 22:08 GMT+03:00 Vladimir Makarov :
>
> So, Ilya, to solve the problem you need to avoid sharing subregs for
> the
> correct LRA/reload work.
>
>

 Thanks a lot for your help! I'll fix it.

 Ilya
>>>
>>>
>>> I've fixed SUBREG sharing and got a missing spill. I added
>>> --enable-checking=rtl to check other possible bugs. Spill/fill code
>>> still seems incorrect because different sizes are used.  Shouldn't
>>> block me though.
>>>
>>> .L5:
>>>  movl16(%esp), %eax
>>>  addl$8, %esi
>>>  movl20(%esp), %edx
>>>  movl%eax, (%esp)
>>>  movl%edx, 4(%esp)
>>>  callcounter@PLT
>>>  movq-8(%esi), %xmm0
>>>  **movdqa  16(%esp), %xmm2**
>>>  pand%xmm0, %xmm2
>>>  movdqa  %xmm2, %xmm0
>>>  movd%xmm2, %edx
>>>  **movq%xmm2, 16(%esp)**
>>>  psrlq   $32, %xmm0
>>>  movd%xmm0, %eax
>>>  orl %edx, %eax
>>>  jne .L5
>>>
>>> Thanks,
>>> Ilya
>>
>>
>> I was wrong to assume that reloads with the wrong size wouldn't block
>> me. These reloads require memory to be aligned, which is not always
>> true. Here is what I have in RTL now:
>>
>> (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
>>  (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
>> {*movdi_internal}
>>   (nil))
>> ...
>> (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
>>  (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
>>  (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489
>> {*iorv2di3}
>>   (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
>>  (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
>>  (nil
>>
>> After reload I get:
>>
>> (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
>>  (mem/c:DI (plus:SI (reg/f:SI 7 sp)
>>  (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
>> {*movdi_internal}
>>   (nil))
>> (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
>>  (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
>>   (nil))
>> ...
>> (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
>>  (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
>>  (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
>> test.c:11 3489 {*iorv2di3}
>>
>>
>> The 'por' instruction requires memory to be aligned and fails in a
>> bigger testcase. There is also a movdqa via esp generated by reload.
>> Could this mean I still have some inconsistencies in the produced RTL?
>> Should I somehow transform the loads and stores?
>
> I'd start by looking at the AP->SP elimination step.  What's the defined
> stack alignment and whether or not a dynamic stack realignment is needed.
> If you don't have all that set up properly prior to the allocators, then
> they're not going to know what objects to align nor how to align them.

I looked into the assign_stack_local_1 call for this spill. LRA correctly
requests a 16-byte slot with 16-byte alignment, but assign_stack_local_1
reduces the alignment to 8 because the estimated stack alignment before RA
is 8 and the requested mode's (DI) alignment fits within it. Perhaps LRA
should pass the reg's biggest_mode when requesting a stack slot?

I handled it by increasing stack_alignment_estimated when transforming
some instructions to vector mode.

Thanks for help!

Ilya

>
> jeff
>


Relocations to use when eliding plts

2015-05-27 Thread Richard Henderson
There's one problem with the couple of patches that I've seen go by wrt eliding
PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):

They're currently using the same relocations used by data, and thus the linker
and dynamic linker must ensure that pointer equality is maintained.  Which
results in branch-to-branch-(to-branch) situations.

E.g. the attached test case, in which main has a PLT entry for function A in
a.so, and function B in b.so calls A.

$ LD_BIND_NOW=1 gdb main
...
(gdb) b b
Breakpoint 1 at 0x400540
(gdb) run
Starting program: /home/rth/x/main
Breakpoint 1, b () at b.c:2
2   void b(void) { a(); }
(gdb) si
2   void b(void) { a(); }
=> 0x77bf75f4 :callq  0x77bf74e0
(gdb)
0x77bf74e0 in ?? () from ./b.so
=> 0x77bf74e0:  jmpq   *0x20034a(%rip)# 0x77df7830
(gdb)
0x00400560 in a@plt ()
=> 0x400560 :jmpq   *0x20057a(%rip)# 0x600ae0
(gdb)
a () at a.c:2
2   void a() { printf("Hello, World!\n"); }
=> 0x77df95f0 :  sub$0x8,%rsp


If we use -fno-plt, we eliminate the first callq, but do still have two
consecutive jmpq's.

It seems to me that we ought to have different relocations for when we're only
going to use a pointer for branching, and for when we need a pointer to be
canonicalized for pointer comparisons.

In the linked image, we already have these: R_X86_64_GLOB_DAT vs
R_X86_64_JUMP_SLOT.  Namely, GLOB_DAT implies "data" (and therefore pointer
equality), while JUMP_SLOT implies "code" (and therefore we can resolve past
plt stubs in the main executable).

Which means that HJ's patch of May 16 (git hash 25070364) is less than ideal.
I do like the smaller PLT entries, but I don't like the fact that it now emits
GLOB_DAT for the relocations instead of JUMP_SLOT.


In the relocatable image, when we're talking about -fno-plt, we should think
about what relocation we'd like to emit.  Yes, the existing R_X86_64_GOTPCREL
works with existing toolchains, and there's something to be said for that.
However, if we're talking about adding a new relocation for relaxing an
indirect call via GOTPCREL, then:

If we want -fno-plt to be able to hoist function addresses, then we're going to
want the address that we load for the call to also not be subject to possible
jump-to-jump.

Unless we want the linker to do an unreasonable amount of x86 code examination
in order to determine mov vs call for relaxation, we need two different
relocations (preferably using the same assembler mnemonic, and thus the correct
relocation is enforced by the assembler).

On the users/hjl/relax branch (and posted on list somewhere), the new
relocation is called R_X86_64_RELAX_GOTPCREL.  I'm not keen on that "relax"
name, despite that being exactly what it's for.

I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation names.  That
is, the address is in the .got.plt section, it's a pc-relative relocation, and
it's being used by a call or load (mov) insn.

With those two, we can fairly easily relax call/jmp to direct branches, and mov
to lea.  Yes, LTO can perform the same optimization, but I'll also agree that
there are many projects for which LTO is both overkill and unworkable.

This does leave open other optimization questions, mostly around weak
functions.  Consider constructs like

if (foo) foo();

Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting the
possibility (not certainty) of jump-to-jump but definitely avoiding a separate
load insn and the latency implied by that?


Comments?


r~


test.tar
Description: Unix tar archive


Re: Relocations to use when eliding plts

2015-05-27 Thread H.J. Lu
On Wed, May 27, 2015 at 1:03 PM, Richard Henderson  wrote:
> There's one problem with the couple of patches that I've seen go by wrt 
> eliding
> PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):
>
> They're currently using the same relocations used by data, and thus the linker
> and dynamic linker must ensure that pointer equality is maintained.  Which
> results in branch-to-branch-(to-branch) situations.
>

Your test exposed a linker bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=18458

I checked in this patch to fix it.

-- 
H.J.
--
When pointer equality is needed, we can't replace PLT relocations with
GOT relocations for -z now.  This patch checks whether pointer equality is
needed before converting PLT relocations to GOT relocations.

bfd/

PR binutils/18458
* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
for now binding only if pointer equality isn't needed.
(elf_i386_allocate_dynrelocs): Use .plt.got section for now
binding only if pointer equality isn't needed.
* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
section for now binding only if pointer equality isn't needed.
(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
binding only if pointer equality isn't needed.

ld/testsuite/

PR binutils/18458
* ld-elf/shared.exp (build_tests): Build libpr18458a.so and
libpr18458b.so.
(run_tests): Run pr18458 test.
* ld-elf/pr18458a.c: New file.
* ld-elf/pr18458b.c: Likewise.
* ld-elf/pr18458c.c: Likewise.
From 8ded2ddc8bac501c1ee0706cb3d3ef3fb1c10b85 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Wed, 27 May 2015 14:32:24 -0700
Subject: [PATCH] Convert PLT reloc only if pointer equality isn't needed

When pointer equality is needed, we can't replace PLT relocations with
GOT relocations for -z now.  This patch checks whether pointer equality is
needed before converting PLT relocations to GOT relocations.

bfd/

	PR binutils/18458
	* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
	for now binding only if pointer equality isn't needed.
	(elf_i386_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.
	* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
	section for now binding only if pointer equality isn't needed.
	(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.

ld/testsuite/

	PR binutils/18458
	* ld-elf/shared.exp (build_tests): Build libpr18458a.so and
	libpr18458b.so.
	(run_tests): Run pr18458 test.
	* ld-elf/pr18458a.c: New file.
	* ld-elf/pr18458b.c: Likewise.
	* ld-elf/pr18458c.c: Likewise.
---
 bfd/ChangeLog  | 12 
 bfd/elf32-i386.c   |  5 +++--
 bfd/elf64-x86-64.c |  5 +++--
 ld/testsuite/ChangeLog | 10 ++
 ld/testsuite/ld-elf/pr18458a.c |  6 ++
 ld/testsuite/ld-elf/pr18458b.c |  6 ++
 ld/testsuite/ld-elf/pr18458c.c | 18 ++
 ld/testsuite/ld-elf/shared.exp |  9 +
 8 files changed, 67 insertions(+), 4 deletions(-)
 create mode 100644 ld/testsuite/ld-elf/pr18458a.c
 create mode 100644 ld/testsuite/ld-elf/pr18458b.c
 create mode 100644 ld/testsuite/ld-elf/pr18458c.c

diff --git a/bfd/ChangeLog b/bfd/ChangeLog
index 87a0bff..a8a0ad9 100644
--- a/bfd/ChangeLog
+++ b/bfd/ChangeLog
@@ -1,3 +1,15 @@
+2015-05-27  H.J. Lu  
+
+	PR binutils/18458
+	* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
+	for now binding only if pointer equality isn't needed.
+	(elf_i386_allocate_dynrelocs): Use .plt.got section for now
+	binding only if pointer equality isn't needed.
+	* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
+	section for now binding only if pointer equality isn't needed.
+	(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
+	binding only if pointer equality isn't needed.
+
 2015-05-26  H.J. Lu  
 
 	PR binutils/18437
diff --git a/bfd/elf32-i386.c b/bfd/elf32-i386.c
index 23d50e1..f3aee96 100644
--- a/bfd/elf32-i386.c
+++ b/bfd/elf32-i386.c
@@ -1885,7 +1885,8 @@ do_size:
   if (use_plt_got
 	  && h != NULL
 	  && h->plt.refcount > 0
-	  && ((info->flags & DF_BIND_NOW) || h->got.refcount > 0)
+	  && (((info->flags & DF_BIND_NOW) && !h->pointer_equality_needed)
+	  || h->got.refcount > 0)
 	  && htab->plt_got == NULL)
 	{
 	  /* Create the GOT procedure linkage table.  */
@@ -2323,7 +2324,7 @@ elf_i386_allocate_dynrelocs (struct elf_link_hash_entry *h, void *inf)
 {
   bfd_boolean use_plt_got;
 
-  if ((info->flags & DF_BIND_NOW))
+  if ((info->flags & DF_BIND_NOW) && !h->pointer_equality_needed)
 	{
 	  /* Don't use the regular PLT for DF_BIND_NOW. */
 	  h->plt.offset = (bfd_vma) -1;
diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index 4428f97..072c00b 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -2080,7 +2080,8 @@ do_size:
   if (use_plt_got
 	  && h != NULL
 	  && h->plt.refcount > 0
-	  && ((info->flags & DF_BIND_NOW) || h->got.refcount > 0)
+	  && (((info->

gcc-4.9-20150527 is now available

2015-05-27 Thread gccadmin
Snapshot gcc-4.9-20150527 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.9-20150527/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.9 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_9-branch 
revision 223783

You'll find:

 gcc-4.9-20150527.tar.bz2 Complete GCC

  MD5=ff0c439b3b8c30026c8707adc1998130
  SHA1=f3ae7b1a4b96ac3e250bfacb6616e901d3033162

Diffs from 4.9-20150520 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Relocations to use when eliding plts

2015-05-27 Thread H.J. Lu
On Wed, May 27, 2015 at 1:03 PM, Richard Henderson  wrote:
> There's one problem with the couple of patches that I've seen go by wrt 
> eliding
> PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):
>
> They're currently using the same relocations used by data, and thus the linker
> and dynamic linker must ensure that pointer equality is maintained.  Which
> results in branch-to-branch-(to-branch) situations.
>
> E.g. the attached test case, in which main has a PLT entry for function A in
> a.so, and function B in b.so calls A.
>
> $ LD_BIND_NOW=1 gdb main
> ...
> (gdb) b b
> Breakpoint 1 at 0x400540
> (gdb) run
> Starting program: /home/rth/x/main
> Breakpoint 1, b () at b.c:2
> 2   void b(void) { a(); }
> (gdb) si
> 2   void b(void) { a(); }
> => 0x77bf75f4 :callq  0x77bf74e0
> (gdb)
> 0x77bf74e0 in ?? () from ./b.so
> => 0x77bf74e0:  jmpq   *0x20034a(%rip)# 0x77df7830
> (gdb)
> 0x00400560 in a@plt ()
> => 0x400560 :jmpq   *0x20057a(%rip)# 0x600ae0
> (gdb)
> a () at a.c:2
> 2   void a() { printf("Hello, World!\n"); }
> => 0x77df95f0 :  sub$0x8,%rsp
>
>
> If we use -fno-plt, we eliminate the first callq, but do still have two
> consecutive jmpq's.
>
> It seems to me that we ought to have different relocations for when we're only
> going to use a pointer for branching, and for when we need a pointer to be
> canonicalized for pointer comparisons.
>
> In the linked image, we already have these: R_X86_64_GLOB_DAT vs
> R_X86_64_JUMP_SLOT.  Namely, GLOB_DAT implies "data" (and therefore pointer
> equality), while JUMP_SLOT implies "code" (and therefore we can resolve past
> plt stubs in the main executable).
>
> Which means that HJ's patch of May 16 (git hash 25070364) is less than ideal.
> I do like the smaller PLT entries, but I don't like the fact that it now
> emits GLOB_DAT for the relocations instead of JUMP_SLOT.

ld.so just does whatever is arranged by ld.  I am not sure changing ld.so
is a good idea.  I don't know what kind of optimization we can do when a
function is both called and has its address taken.

>
> In the relocatable image, when we're talking about -fno-plt, we should think
> about what relocation we'd like to emit.  Yes, the existing R_X86_64_GOTPCREL
> works with existing toolchains, and there's something to be said for that.
> However, if we're talking about adding a new relocation for relaxing an
> indirect call via GOTPCREL, then:
>
> If we want -fno-plt to be able to hoist function addresses, then we're going 
> to
> want the address that we load for the call to also not be subject to possible
> jump-to-jump.
>
> Unless we want the linker to do an unreasonable amount of x86 code examination
> in order to determine mov vs call for relaxation, we need two different
> relocations (preferably using the same assembler mnemonic, and thus the 
> correct
> relocation is enforced by the assembler).
>
> On the users/hjl/relax branch (and posted on list somewhere), the new
> relocation is called R_X86_64_RELAX_GOTPCREL.  I'm not keen on that "relax"
> name, despite that being exactly what it's for.
>
> I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation names.  That
> is, the address is in the .got.plt section, it's a pc-relative relocation, and
> it's being used by a call or load (mov) insn.

Since it is used for indirect calls, how about R_X86_64_INBR_GOTPCREL?

I updated the users/hjl/relax branch to convert the relocation in
*foo@GOTPCREL(%rip) from R_X86_64_GOTPCREL to R_X86_64_RELAX_GOTPCREL, so
that existing assembly code works automatically with a new binutils.

> With those two, we can fairly easily relax call/jmp to direct branches, and 
> mov
> to lea.  Yes, LTO can perform the same optimization, but I'll also agree that
> there are many projects for which LTO is both overkill and unworkable.
>
> This does leave open other optimization questions, mostly around weak
> functions.  Consider constructs like
>
> if (foo) foo();
>
> Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting the
> possibility (not certainty) of jump-to-jump but definitely avoiding a separate
> load insn and the latency implied by that?
>
>
> Comments?
>
>
> r~



-- 
H.J.