Precompiled headers - still useful feature?
Hello. I would like to ask folks for their opinion about support of precompiled headers in future releases of GCC. From my point of view, the feature brings some speed-up, but the question is whether it's worth it.

Last time I hit precompiled headers was when I was rewriting the memory allocation statistics infrastructure, where GGC memory is 'streamed' and loaded afterwards when precompiled headers are used. Because of that I was unable to track some pointers that were allocated in the first phase of compilation.

Here are numbers related to the --disable-libstdcxx-pch option:

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz:
Bootstrap time w/ precompiled headers enabled: 35m47s (100.00%)
Bootstrap time w/ precompiled headers disabled: 36m27s (101.86%)

make -j9 check-target-libstdc++-v3 -k time:
precompiled headers enabled: 8m11s (100.00%)
precompiled headers disabled: 8m42s (106.31%)

Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz:
Bootstrap time w/ precompiled headers enabled: 57m35s (100.00%)
Bootstrap time w/ precompiled headers disabled: 57m12s (99.33%)

Feel free to send any statistics, opinions and ideas.

Thank you,
Martin
Re: Precompiled headers - still useful feature?
On 2015.05.27 at 10:14 +0200, Martin Liška wrote:
> I would like to ask folks for their opinion about support of
> precompiled headers in future releases of GCC. From my point of view,
> the feature brings some speed-up, but the question is whether it's
> worth it.
>
> Last time I hit precompiled headers was when I was rewriting the
> memory allocation statistics infrastructure, where GGC memory is
> 'streamed' and loaded afterwards when precompiled headers are used.
> Because of that I was unable to track some pointers that were
> allocated in the first phase of compilation.
>
> Here are numbers related to the --disable-libstdcxx-pch option:
>
> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz:
> Bootstrap time w/ precompiled headers enabled: 35m47s (100.00%)
> Bootstrap time w/ precompiled headers disabled: 36m27s (101.86%)
>
> make -j9 check-target-libstdc++-v3 -k time:
> precompiled headers enabled: 8m11s (100.00%)
> precompiled headers disabled: 8m42s (106.31%)
>
> Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz:
> Bootstrap time w/ precompiled headers enabled: 57m35s (100.00%)
> Bootstrap time w/ precompiled headers disabled: 57m12s (99.33%)

Measuring the impact on bigger projects that use PCH, like Qt or Boost, would perhaps be more informative. And until C++ modules are implemented (unfortunately nobody is working on this AFAIK), PCH is still the only option left. So deprecating it now seems premature.

--
Markus
Re: Precompiled headers - still useful feature?
On 27 May 2015 at 10:01, Markus Trippelsdorf wrote:
> And until C++ modules are implemented (unfortunately nobody is working
> on this AFAIK), PCH is still the only option left. So deprecating it
> now seems premature.

I doubt anyone's going to implement them until they're specified; the proposals are still evolving.
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-27 6:31 GMT+03:00 Jeff Law:
> On 05/25/2015 09:27 AM, Ilya Enkovich wrote:
>> 2015-05-22 15:01 GMT+03:00 Ilya Enkovich:
>>> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich:
>>>> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov:
>>>>> So, Ilya, to solve the problem you need to avoid sharing subregs
>>>>> for the correct LRA/reload work.
>>>>
>>>> Thanks a lot for your help! I'll fix it.
>>>>
>>>> Ilya
>>>
>>> I've fixed SUBREG sharing and got a missing spill. I added
>>> --enable-checking=rtl to check for other possible bugs. Spill/fill
>>> code still seems incorrect because different sizes are used.
>>> Shouldn't block me though.
>>>
>>> .L5:
>>>         movl    16(%esp), %eax
>>>         addl    $8, %esi
>>>         movl    20(%esp), %edx
>>>         movl    %eax, (%esp)
>>>         movl    %edx, 4(%esp)
>>>         call    counter@PLT
>>>         movq    -8(%esi), %xmm0
>>>         **movdqa 16(%esp), %xmm2**
>>>         pand    %xmm0, %xmm2
>>>         movdqa  %xmm2, %xmm0
>>>         movd    %xmm2, %edx
>>>         **movq   %xmm2, 16(%esp)**
>>>         psrlq   $32, %xmm0
>>>         movd    %xmm0, %eax
>>>         orl     %edx, %eax
>>>         jne     .L5
>>>
>>> Thanks,
>>> Ilya
>>
>> I was wrong assuming reloads with a wrong size shouldn't block me.
>> These reloads require memory to be aligned, which is not always true.
>> Here is what I have in RTL now:
>>
>> (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
>>         (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
>> {*movdi_internal}
>>      (nil))
>> ...
>> (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
>>         (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
>>             (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489
>> {*iorv2di3}
>>      (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
>>         (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
>>             (nil))))
>>
>> After reload I get:
>>
>> (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
>>         (mem/c:DI (plus:SI (reg/f:SI 7 sp)
>>                 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
>> {*movdi_internal}
>>      (nil))
>> (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
>>         (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89
>> {*movdi_internal}
>>      (nil))
>> ...
>> (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
>>         (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
>>             (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
>> test.c:11 3489 {*iorv2di3}
>>
>> The 'por' instruction requires memory to be aligned and fails in a
>> bigger testcase. There is also a movdqa generated for esp by reload.
>> May it mean I still have some inconsistencies in the produced RTL?
>> Probably I should somehow transform loads and stores?
>
> I'd start by looking at the AP->SP elimination step. What's the
> defined stack alignment and whether or not a dynamic stack realignment
> is needed. If you don't have all that set up properly prior to the
> allocators, then they're not going to know what objects to align nor
> how to align them.

I looked into the assign_stack_local_1 call for this spill. LRA correctly requests 16 bytes size with 16 bytes alignment. But assign_stack_local_1 reduces the alignment to 8 because the estimated stack alignment before RA is 8 and the requested mode's (DI) alignment fits it. Probably LRA should pass the biggest_mode of the reg when requesting a stack slot? I handled it by increasing stack_alignment_estimated when transforming some instructions to vector mode.

Thanks for help!
Ilya

>
> jeff
>
Relocations to use when eliding plts
There's one problem with the couple of patches that I've seen go by wrt eliding PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):

They're currently using the same relocations used by data, and thus the linker and dynamic linker must ensure that pointer equality is maintained. Which results in branch-to-branch(-to-branch) situations.

E.g. the attached test case, in which main has a PLT entry for function A in a.so, and function B in b.so calls A.

$ LD_BIND_NOW=1 gdb main
...
(gdb) b b
Breakpoint 1 at 0x400540
(gdb) run
Starting program: /home/rth/x/main

Breakpoint 1, b () at b.c:2
2       void b(void) { a(); }
(gdb) si
2       void b(void) { a(); }
=> 0x77bf75f4 :  callq  0x77bf74e0
(gdb)
0x77bf74e0 in ?? () from ./b.so
=> 0x77bf74e0:  jmpq   *0x20034a(%rip)  # 0x77df7830
(gdb)
0x00400560 in a@plt ()
=> 0x400560 :  jmpq   *0x20057a(%rip)  # 0x600ae0
(gdb)
a () at a.c:2
2       void a() { printf("Hello, World!\n"); }
=> 0x77df95f0 :  sub    $0x8,%rsp

If we use -fno-plt, we eliminate the first callq, but still have two consecutive jmpqs.

It seems to me that we ought to have different relocations for when we're only going to use a pointer for branching, and when we need a pointer to be canonicalized for pointer comparisons.

In the linked image, we already have these: R_X86_64_GLOB_DAT vs R_X86_64_JUMP_SLOT. Namely, GLOB_DAT implies "data" (and therefore pointer equality), while JUMP_SLOT implies "code" (and therefore we can resolve past PLT stubs in the main executable).

Which means that HJ's patch of May 16 (git hash 25070364) is less than ideal. I do like the smaller PLT entries, but I don't like the fact that it now emits GLOB_DAT for the relocations instead of JUMP_SLOT.

In the relocatable image, when we're talking about -fno-plt, we should think about what relocation we'd like to emit. Yes, the existing R_X86_64_GOTPCREL works with existing toolchains, and there's something to be said for that.
However, if we're talking about adding a new relocation for relaxing an indirect call via GOTPCREL, then:

If we want -fno-plt to be able to hoist function addresses, then we're going to want the address that we load for the call to also not be subject to possible jump-to-jump.

Unless we want the linker to do an unreasonable amount of x86 code examination in order to determine mov vs call for relaxation, we need two different relocations (preferably using the same assembler mnemonic, and thus the correct relocation is enforced by the assembler).

On the users/hjl/relax branch (and posted on list somewhere), the new relocation is called R_X86_64_RELAX_GOTPCREL. I'm not keen on that "relax" name, despite that being exactly what it's for.

I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation names. That is, the address is in the .got.plt section, it's a pc-relative relocation, and it's being used by a call or load (mov) insn.

With those two, we can fairly easily relax call/jmp to direct branches, and mov to lea. Yes, LTO can perform the same optimization, but I'll also agree that there are many projects for which LTO is both overkill and unworkable.

This does leave open other optimization questions, mostly around weak functions. Consider constructs like

  if (foo) foo();

Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting the possibility (not certainty) of jump-to-jump but definitely avoiding a separate load insn and the latency implied by that?

Comments?

r~

[Attachment: test.tar, a Unix tar archive]
Re: Relocations to use when eliding plts
On Wed, May 27, 2015 at 1:03 PM, Richard Henderson wrote:
> There's one problem with the couple of patches that I've seen go by
> wrt eliding PLTs with -z now, and relaxing inlined PLTs (aka
> -fno-plt):
>
> They're currently using the same relocations used by data, and thus
> the linker and dynamic linker must ensure that pointer equality is
> maintained. Which results in branch-to-branch(-to-branch) situations.

Your test exposed a linker bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=18458

I checked in this patch to fix it.

--
H.J.
--
When pointer equality is needed, we can't replace PLT relocations with
GOT relocations for -z now. This patch checks if pointer equality is
needed before converting PLT relocations to GOT relocations.

bfd/

	PR binutils/18458
	* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
	for now binding only if pointer equality isn't needed.
	(elf_i386_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.
	* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
	section for now binding only if pointer equality isn't needed.
	(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.

ld/testsuite/

	PR binutils/18458
	* ld-elf/shared.exp (build_tests): Build libpr18458a.so and
	libpr18458b.so.
	(run_tests): Run pr18458 test.
	* ld-elf/pr18458a.c: New file.
	* ld-elf/pr18458b.c: Likewise.
	* ld-elf/pr18458c.c: Likewise.

From 8ded2ddc8bac501c1ee0706cb3d3ef3fb1c10b85 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Wed, 27 May 2015 14:32:24 -0700
Subject: [PATCH] Convert PLT reloc only if pointer equality isn't needed

When pointer equality is needed, we can't replace PLT relocations with
GOT relocations for -z now. This patch checks if pointer equality is
needed before converting PLT relocations to GOT relocations.

bfd/

	PR binutils/18458
	* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
	for now binding only if pointer equality isn't needed.
	(elf_i386_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.
	* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
	section for now binding only if pointer equality isn't needed.
	(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
	binding only if pointer equality isn't needed.

ld/testsuite/

	PR binutils/18458
	* ld-elf/shared.exp (build_tests): Build libpr18458a.so and
	libpr18458b.so.
	(run_tests): Run pr18458 test.
	* ld-elf/pr18458a.c: New file.
	* ld-elf/pr18458b.c: Likewise.
	* ld-elf/pr18458c.c: Likewise.
---
 bfd/ChangeLog                  | 12 ++++++++++++
 bfd/elf32-i386.c               |  5 +++--
 bfd/elf64-x86-64.c             |  5 +++--
 ld/testsuite/ChangeLog         | 10 ++++++++++
 ld/testsuite/ld-elf/pr18458a.c |  6 ++++++
 ld/testsuite/ld-elf/pr18458b.c |  6 ++++++
 ld/testsuite/ld-elf/pr18458c.c | 18 ++++++++++++++++++
 ld/testsuite/ld-elf/shared.exp |  9 +++++++++
 8 files changed, 67 insertions(+), 4 deletions(-)
 create mode 100644 ld/testsuite/ld-elf/pr18458a.c
 create mode 100644 ld/testsuite/ld-elf/pr18458b.c
 create mode 100644 ld/testsuite/ld-elf/pr18458c.c

diff --git a/bfd/ChangeLog b/bfd/ChangeLog
index 87a0bff..a8a0ad9 100644
--- a/bfd/ChangeLog
+++ b/bfd/ChangeLog
@@ -1,3 +1,15 @@
+2015-05-27  H.J. Lu
+
+	PR binutils/18458
+	* elf32-i386.c (elf_i386_check_relocs): Create .plt.got section
+	for now binding only if pointer equality isn't needed.
+	(elf_i386_allocate_dynrelocs): Use .plt.got section for now
+	binding only if pointer equality isn't needed.
+	* elf64-x86-64.c (elf_x86_64_check_relocs): Create .plt.got
+	section for now binding only if pointer equality isn't needed.
+	(elf_x86_64_allocate_dynrelocs): Use .plt.got section for now
+	binding only if pointer equality isn't needed.
+
 2015-05-26  H.J. Lu

 	PR binutils/18437

diff --git a/bfd/elf32-i386.c b/bfd/elf32-i386.c
index 23d50e1..f3aee96 100644
--- a/bfd/elf32-i386.c
+++ b/bfd/elf32-i386.c
@@ -1885,7 +1885,8 @@ do_size:
   if (use_plt_got
       && h != NULL
       && h->plt.refcount > 0
-      && ((info->flags & DF_BIND_NOW) || h->got.refcount > 0)
+      && (((info->flags & DF_BIND_NOW) && !h->pointer_equality_needed)
+	  || h->got.refcount > 0)
       && htab->plt_got == NULL)
     {
       /* Create the GOT procedure linkage table.  */
@@ -2323,7 +2324,7 @@ elf_i386_allocate_dynrelocs (struct elf_link_hash_entry *h, void *inf)
 {
   bfd_boolean use_plt_got;

-  if ((info->flags & DF_BIND_NOW))
+  if ((info->flags & DF_BIND_NOW) && !h->pointer_equality_needed)
     {
       /* Don't use the regular PLT for DF_BIND_NOW.  */
       h->plt.offset = (bfd_vma) -1;

diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index 4428f97..072c00b 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -2080,7 +2080,8 @@ do_size:
   if (use_plt_got
       && h != NULL
       && h->plt.refcount > 0
-      && ((info->flags & DF_BIND_NOW) || h->got.refcount > 0)
+      && (((info->
gcc-4.9-20150527 is now available
Snapshot gcc-4.9-20150527 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.9-20150527/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.9 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_9-branch revision 223783 You'll find: gcc-4.9-20150527.tar.bz2 Complete GCC MD5=ff0c439b3b8c30026c8707adc1998130 SHA1=f3ae7b1a4b96ac3e250bfacb6616e901d3033162 Diffs from 4.9-20150520 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.9 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: Relocations to use when eliding plts
On Wed, May 27, 2015 at 1:03 PM, Richard Henderson wrote:
> There's one problem with the couple of patches that I've seen go by
> wrt eliding PLTs with -z now, and relaxing inlined PLTs (aka
> -fno-plt):
>
> They're currently using the same relocations used by data, and thus
> the linker and dynamic linker must ensure that pointer equality is
> maintained. Which results in branch-to-branch(-to-branch) situations.
>
> E.g. the attached test case, in which main has a PLT entry for
> function A in a.so, and function B in b.so calls A.
>
> $ LD_BIND_NOW=1 gdb main
> ...
> (gdb) b b
> Breakpoint 1 at 0x400540
> (gdb) run
> Starting program: /home/rth/x/main
>
> Breakpoint 1, b () at b.c:2
> 2       void b(void) { a(); }
> (gdb) si
> 2       void b(void) { a(); }
> => 0x77bf75f4 :  callq  0x77bf74e0
> (gdb)
> 0x77bf74e0 in ?? () from ./b.so
> => 0x77bf74e0:  jmpq   *0x20034a(%rip)  # 0x77df7830
> (gdb)
> 0x00400560 in a@plt ()
> => 0x400560 :  jmpq   *0x20057a(%rip)  # 0x600ae0
> (gdb)
> a () at a.c:2
> 2       void a() { printf("Hello, World!\n"); }
> => 0x77df95f0 :  sub    $0x8,%rsp
>
> If we use -fno-plt, we eliminate the first callq, but still have two
> consecutive jmpqs.
>
> It seems to me that we ought to have different relocations for when
> we're only going to use a pointer for branching, and when we need a
> pointer to be canonicalized for pointer comparisons.
>
> In the linked image, we already have these: R_X86_64_GLOB_DAT vs
> R_X86_64_JUMP_SLOT. Namely, GLOB_DAT implies "data" (and therefore
> pointer equality), while JUMP_SLOT implies "code" (and therefore we
> can resolve past PLT stubs in the main executable).
>
> Which means that HJ's patch of May 16 (git hash 25070364) is less than
> ideal. I do like the smaller PLT entries, but I don't like the fact
> that it now emits GLOB_DAT for the relocations instead of JUMP_SLOT.

ld.so just does whatever is arranged by ld. I am not sure changing ld.so is a good idea.
I don't know what kind of optimization we can do when a function is called and its address is taken.

> In the relocatable image, when we're talking about -fno-plt, we should
> think about what relocation we'd like to emit. Yes, the existing
> R_X86_64_GOTPCREL works with existing toolchains, and there's
> something to be said for that.
>
> However, if we're talking about adding a new relocation for relaxing
> an indirect call via GOTPCREL, then:
>
> If we want -fno-plt to be able to hoist function addresses, then we're
> going to want the address that we load for the call to also not be
> subject to possible jump-to-jump.
>
> Unless we want the linker to do an unreasonable amount of x86 code
> examination in order to determine mov vs call for relaxation, we need
> two different relocations (preferably using the same assembler
> mnemonic, and thus the correct relocation is enforced by the
> assembler).
>
> On the users/hjl/relax branch (and posted on list somewhere), the new
> relocation is called R_X86_64_RELAX_GOTPCREL. I'm not keen on that
> "relax" name, despite that being exactly what it's for.
>
> I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation
> names. That is, the address is in the .got.plt section, it's a
> pc-relative relocation, and it's being used by a call or load (mov)
> insn.

Since it is used for indirect calls, how about R_X86_64_INBR_GOTPCREL?

I updated the users/hjl/relax branch to convert the relocation in *foo@GOTPCREL(%rip) from R_X86_64_GOTPCREL to R_X86_64_RELAX_GOTPCREL so that existing assembly code works automatically with a new binutils.

> With those two, we can fairly easily relax call/jmp to direct
> branches, and mov to lea. Yes, LTO can perform the same optimization,
> but I'll also agree that there are many projects for which LTO is both
> overkill and unworkable.
>
> This does leave open other optimization questions, mostly around weak
> functions.
> Consider constructs like
>
>   if (foo) foo();
>
> Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL,
> accepting the possibility (not certainty) of jump-to-jump but
> definitely avoiding a separate load insn and the latency implied by
> that?
>
> Comments?
>
> r~

--
H.J.