How can I tune gcc to move up simple common subexpression?

2013-11-08 Thread Konstantin Vladimirov
Hi,

Consider simple code:

typedef struct
{
  unsigned prev;
  unsigned next;
} foo_t;

void
foo (unsigned x, unsigned y)
{
  foo_t *ptr = (foo_t *)((void *)x);

  if (y != 0)
    {
      ptr->prev = y;
      ptr->next = x;
    }
  else
    {
      ptr->prev = 0; /* or explicitly ptr->prev = y; no difference */
      ptr->next = 0;
    }
}

GCC 4.7.2 and 4.8.1, both at -O2 and -Os, create code like:

testl %esi, %esi
movl %edi, %eax
jne .L5
movl $0, (%edi)
movl $0, 4(%rax)
ret
.L5:
movl %esi, (%edi)
movl %edi, 4(%rax)
ret

This can obviously be changed to:

testl %esi, %esi
movl %edi, %eax
movl %esi, (%edi)
jne .L5
movl $0, 4(%rax)
ret
.L5:
movl %edi, 4(%rax)
ret
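
At the C level, the transformation I am after corresponds to something
like this (only a sketch for illustration; the uintptr_t cast and the
function name are my additions, not part of the original test):

#include <stdint.h>

/* Uses the foo_t definition from above.  */
void
foo_rewritten (unsigned x, unsigned y)
{
  foo_t *ptr = (foo_t *)(void *)(uintptr_t)x;

  ptr->prev = y;                 /* common store: the else arm writes 0, which equals y there */
  ptr->next = (y != 0) ? x : 0;  /* only this store really differs between the arms */
}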

Maybe there are some options to make it behave this way? That is a question
for the gcc-help group.

The question for the gcc group is trickier:

Maybe on x86 it is not a big deal, but I am working on my private
backend, which has predicated instructions, and the second form is really
much preferable. Maybe I can somehow tune my backend to achieve
this effect? I can see that the 210r.csa pass (try_optimize_cfg before it,
not the pass itself) can move code up in some cases, but only with
very simple memory addressing and rather unstably; say, changing

 ptr->prev = y;
 ptr->next = x;

to

 ptr->prev = x;
 ptr->next = y;

may break everything just because next is the second member and is
addressed like M[%r+4].

Any ideas?

---
With best regards, Konstantin


Re: powerpc64 bootstrap broken due to libsanitizer merge from upstream

2013-11-08 Thread Richard Biener
On Fri, Nov 8, 2013 at 5:49 AM, Peter Bergner  wrote:
> On Fri, 2013-11-08 at 00:03 +0100, Steven Bosscher wrote:
>> powerpc64-linux bootstrap is broken by the libsanitizer merge:
>
> I already reported the failures here:
>
> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg00312.html
>
> It seems others have reported it breaks bootstrap for them as
> well on other arches.  It's sad it's been broken this long,
> given it affects so many people.  Anyway, the powerpc64-linux
> breakage is being tracked here:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59009

--disable-libsanitizer is your friend (I'm using that, since the merge broke
bootstrap on x86_64 for me, too).

Richard.

> Peter
>
>
>


Re: How can I tune gcc to move up simple common subexpression?

2013-11-08 Thread Richard Biener
On Fri, Nov 8, 2013 at 10:28 AM, Konstantin Vladimirov
 wrote:
> Hi,
>
> Consider simple code:
>
> typedef struct
> {
>   unsigned prev;
>   unsigned next;
> } foo_t;
>
> void
> foo( unsigned x, unsigned y)
>   {
> foo_t *ptr = (foo_t *)((void *)x);
>
> if (y != 0)
>   {
>  ptr->prev = y;
>  ptr->next = x;
>}
>  else
>{
>  ptr->prev = 0; /* or explicitly ptr->prev = y; no difference */
>  ptr->next = 0;
>}
> }
>
> GCC 4.7.2 and 4.8.1 both on O2 and Os creates code like:
>
> testl %esi, %esi
> movl %edi, %eax
> jne .L5
> movl $0, (%edi)
> movl $0, 4(%rax)
> ret
> .L5:
> movl %esi, (%edi)
> movl %edi, 4(%rax)
> ret
>
> Which can be obviously changed to:
>
> testl %esi, %esi
> movl %edi, %eax
> movl %esi, (%edi)
> jne .L5
> movl $0, 4(%rax)
> ret
> .L5:
> movl %edi, 4(%rax)
> ret
>
> May be there are some options to make it behave so? This is question
> for gcc-help group.
>
> Question for gcc group is trickier:
>
> Maybe on x86 it is not a big deal, but I am working on my private
> backend, which has predicated instructions, and the second form is really
> much preferable. Maybe I can somehow tune my backend to achieve
> this effect? I can see that the 210r.csa pass (try_optimize_cfg before it,
> not the pass itself) can move code up in some cases, but only with
> very simple memory addressing and rather unstably; say, changing
>
>  ptr->prev = y;
>  ptr->next = x;
>
> to
>
>  ptr->prev = x;
>  ptr->next = y;
>
> may break everything just because next is second member and addressed
> like M[%r+4].
>
> Any ideas?

IIRC there is some code hoisting in RTL GCSE, but it is not very strong.
Code hoisting in GIMPLE via PRE is still in progress (see PR23286).

Richard.

> ---
> With best regards, Konstantin


Re: Architecture maintainers: please define TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2013-11-08 Thread Andreas Schwab
"Joseph S. Myers"  writes:

> The test gcc.dg/atomic/c11-atomic-exec-5.c will indicate if this is 
> working correctly for your architecture, as long as your system supports 
> pthreads (required to run that test).  If any of the other 
> c11-atomic-exec-* tests are failing, you should fix that first as it 
> indicates a more serious issue with atomic operations on your target.

All of the tests require sync_long_long_runtime, so they are currently
useless.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


m68k optimisations?

2013-11-08 Thread Fredrik Olsson
I have this simple function:

#include <stdarg.h>

int sum_vec(int c, ...) {
va_list argptr;
va_start(argptr, c);
int sum = 0;
while (c--) {
int x = va_arg(argptr, int);
sum += x;
}
va_end(argptr);
return sum;
}


When compiling with "-fomit-frame-pointer -Os -march=68000 -c -S
-mshort" I get this assembly (I have manually added comments with
clock cycles per instruction and a total for a count of 0, 8 and n>0):
.even
.globl _sum_vec
_sum_vec:
lea (6,%sp),%a0 | 8
move.w 4(%sp),%d1   | 12
clr.w %d0   | 4
jra .L1 | 12
.L2:
add.w (%a0)+,%d0| 8
.L1:
dbra %d1,.L2| 16,12
rts | 16
| c==0: 8+12+4+12+12+16=64
| c==8: 8+12+4+12+(16+8)*8+12+16=256
| c==n: =64+24n

When instead compiling with "-fomit-frame-pointer -O3 -march=68000 -c
-S -mshort" I expect to get more aggressive optimisation than -Os, or
at least just as performant, but instead I get this:
.even
.globl _sum_vec
_sum_vec:
move.w 4(%sp),%d0   | 12
jeq .L2 | 12,8
lea (6,%sp),%a0 | 8
subq.w #1,%d0   | 4
and.l #65535,%d0| 16
add.l %d0,%d0   | 8
lea 8(%sp,%d0.l),%a1| 16
clr.w %d0   | 4
.L1:
add.w (%a0)+,%d0| 8
cmp.l %a0,%a1   | 8
jne .L1 | 12|8
rts | 16
.L2:
clr.w %d0   | 4
rts | 16
| c==0: 12+12+4+16=44
| c==8: 12+8+8+4+16+8+16+4+(8+8+12)*8-4+16=312
| c==n: =88+28n

The count==0 case is better. I can see what optimisation has been
tried for the loop, but it is just not working, since both the
initialisation for the loop and the loop itself become more costly.

Being a GCC beginner, I would like a few pointers as to how I should go
about fixing this.

// Fredrik


Re: Architecture maintainers: please define TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2013-11-08 Thread Joseph S. Myers
On Fri, 8 Nov 2013, Andreas Schwab wrote:

> "Joseph S. Myers"  writes:
> 
> > The test gcc.dg/atomic/c11-atomic-exec-5.c will indicate if this is 
> > working correctly for your architecture, as long as your system supports 
> > pthreads (required to run that test).  If any of the other 
> > c11-atomic-exec-* tests are failing, you should fix that first as it 
> > indicates a more serious issue with atomic operations on your target.
> 
> All of the tests require sync_long_long_runtime, so they are currently
> useless.

The tests do not require sync_long_long_runtime.  ISO C requires atomic 
operations to be available on types of all sizes, but they need not be 
lock-free and need not be inlined.  It is the function of libatomic to 
provide whatever atomic operations are not inlined or in libgcc, using 
locking when necessary.  If some required operation is not being provided 
by libatomic on some target, that's a pre-existing GCC bug on that target 
(the __atomic_* built-in functions are *also* meant to be available for 
types of all sizes on all targets, and I expect any such bug would have 
shown up with them as well).
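
For illustration, a minimal sketch (the type and function names below are
invented): an over-sized _Atomic object still compiles and works; when no
inline lock-free sequence exists, the operation goes through libatomic,
which typically means linking with -latomic.

#include <stdatomic.h>

struct pair { long long lo, hi; };   /* 16 bytes on LP64 targets */

static _Atomic struct pair shared;

void
publish (struct pair p)
{
  /* With no inline 16-byte atomic available, this lowers to a libatomic
     call (lock-based if necessary) rather than failing to compile.  */
  atomic_store (&shared, p);
}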

-- 
Joseph S. Myers
jos...@codesourcery.com


DWARF and atomic types

2013-11-08 Thread Joseph S. Myers
I realised that the C11 atomics changes didn't do anything to record 
atomic types as such in DWARF debug info - then found that DWARF4 didn't 
provide a way to specify the C11 _Atomic qualifier at all.  Could someone 
working with the DWARF committee get an appropriate tag into the next 
version of DWARF?

(I've filed bug 59051 for the lack of use of DW_tag_restrict_type for 
restricted pointers.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: Architecture maintainers: please define TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2013-11-08 Thread Andreas Schwab
"Joseph S. Myers"  writes:

> On Fri, 8 Nov 2013, Andreas Schwab wrote:
>
>> "Joseph S. Myers"  writes:
>> 
>> > The test gcc.dg/atomic/c11-atomic-exec-5.c will indicate if this is 
>> > working correctly for your architecture, as long as your system supports 
>> > pthreads (required to run that test).  If any of the other 
>> > c11-atomic-exec-* tests are failing, you should fix that first as it 
>> > indicates a more serious issue with atomic operations on your target.
>> 
>> All of the tests require sync_long_long_runtime, so they are currently
>> useless.
>
> The tests do not require sync_long_long_runtime.

True, but they require __atomic_store_16, which is generally not
available.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


Re: Architecture maintainers: please define TARGET_ATOMIC_ASSIGN_EXPAND_FENV

2013-11-08 Thread Joseph S. Myers
On Fri, 8 Nov 2013, Andreas Schwab wrote:

> > The tests do not require sync_long_long_runtime.
> 
> True, but they require __atomic_store_16, which is generally not
> available.

See my comments at 
 - my inclination 
is that the code resolving the overloaded builtins (unchanged by the 
addition of _Atomic support) shouldn't try to use the _16 versions unless 
targetm.scalar_mode_supported_p (TImode).

-- 
Joseph S. Myers
jos...@codesourcery.com


LRA: check_rtl modifies RTL instruction stream

2013-11-08 Thread Robert Suchanek
Hi Vladimir,

I have been looking into regression testing for mips16 with LRA enabled
and tried to understand and solve some ICEs. It was found in a narrowed
testcase (attached below) that there are two issues:

1. In the back end - pattern not recognized and hence ICE.
2. In the LRA - a bug that exposes the problem above.

The problem within LRA points to the check_rtl function. The function
does not only check the consistency of the instruction stream;
unfortunately, it accidentally modifies it as well.

The fragment of the RTL dump before check_rtl():

(insn 18 7 12 2 (set (reg/f:DI 197)
        (symbol_ref:DI ("a")  )) fpr-moves-7.c:7 280 {*movdi_64bit_mips16}
     (expr_list:REG_EQUIV (symbol_ref:DI ("a")  )
        (nil)))

After check_rtl(), movdi_64bit_mips16 turns into *lea64:

(insn 18 7 12 2 (parallel [
            (set (reg/f:DI 197)
                (symbol_ref:DI ("a")  ))
            (clobber (scratch:DI))
        ]) fpr-moves-7.c:7 258 {*lea64}
     (expr_list:REG_EQUIV (symbol_ref:DI ("a")  )
        (nil)))

What happens here is that check_rtl calls insn_invalid_p, and insn_invalid_p
tries to add clobber registers in the hope of matching a pattern. In our case,
adding a clobber does match *lea64 and insn_invalid_p generates a new
instruction. The reason for this is that reload_in_progress is not set
when LRA is running; otherwise, insn_invalid_p would be prevented from adding
clobbers.
The problem does not exist if we run with the classic reload.

One of the solutions I can think of is adding !lra_in_progress to insn_invalid_p
and setting this variable before check_rtl(), but I am not fully confident that
this is so trivial (I am new to the gcc hacking business). I see a number of
reasons why reload_in_progress is not used when LRA is running, and thus I am
not entirely sure that this change would not break anything else.
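
Roughly, the change I have in mind would look like this (only a sketch;
the exact guard in recog.c may not be literally what is shown here):

  /* In insn_invalid_p: only let recog () add clobbers when neither
     classic reload nor LRA is running.  The !lra_in_progress term is
     the proposed addition; the rest approximates the existing guard.  */
  int icode = recog (pat, insn,
                     (GET_CODE (pat) == SET
                      && !reload_completed
                      && !reload_in_progress
                      && !lra_in_progress)   /* proposed */
                     ? &num_clobbers : 0);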

Can you suggest how to guarantee check_rtl does not modify the insns?

The back end issue will be looked at separately by us.

The testcase compiled with -mips64 -mabi=64 -mips16 -msoft-float with LRA 
enabled:

char a[10];
char foo () {  return a[2]; }

Regards,
Robert Suchanek






Re: [RFC] Target compilation for offloading

2013-11-08 Thread Jakub Jelinek
On Fri, Nov 08, 2013 at 06:26:53PM +0400, Andrey Turetskiy wrote:
> Thanks.
> And a few questions about compiler options:
> 1) You've mentioned two options for offloading:
> -foffload-target= - to specify targets for offloading
> -foffload-target-= - to specify
> compiler options for different targets
> Do we really need two options to set up offloading?
> What do you think about, in my opinion, more compact way:
> -foffload- - if I want to offload for 'target name',
> but I don't want to specify any options
> -foffload-= - enable offloading for
> 'target name' and set options
> And compilation for several targets would look like:
> gcc -fopenmp -foffload-mic="-O3 -msse -m64" -foffload-ptx
> -foffload-hsail="-O2 -m32" file.c

I don't think it is a good idea to include the target name before =
in the name of the option, but perhaps you can use two =s:
-foffload-target=x86_64-k1om-linux="-O2 -mtune=foobar" -foffload-target=ptx-none

> 2) If the user doesn't specify target options directly, is target
> compilation done without any options, or does the compiler use those host
> options which are suitable for the target?

I think I've said this earlier: non-target-specific options from the original
compilation should be copied over, target-specific options discarded,
and the command-line-supplied overrides appended to that.

> 3) Do I understand right that options for different targets should be
> stored in different sections of the fat object file, and then the LTO frontend
> should read these options and run the target compilation with them?

No, I'd store in the LTO target IL only the original host compilation
options that weren't target specific (the opt* machinery has flags saying
what is target specific and what is not), so say -O2 -ftree-vrp would go there,
but say -march=corei7-avx would not.  And the -foffload-target= options
would only matter during linking.

Jakub


Re: [RFC] Target compilation for offloading

2013-11-08 Thread Andrey Turetskiy
Thanks.
And a few questions about compiler options:
1) You've mentioned two options for offloading:
-foffload-target= - to specify targets for offloading
-foffload-target-= - to specify
compiler options for different targets
Do we really need two options to set up offloading?
What do you think about the following, in my opinion more compact, approach:
-foffload- - if I want to offload for 'target name',
but I don't want to specify any options
-foffload-= - enable offloading for
'target name' and set options
And compilation for several targets would look like:
gcc -fopenmp -foffload-mic="-O3 -msse -m64" -foffload-ptx
-foffload-hsail="-O2 -m32" file.c
2) If the user doesn't specify target options directly, is target
compilation done without any options, or does the compiler use those host
options which are suitable for the target?
3) Do I understand right that options for different targets should be
stored in different sections of the fat object file, and then the LTO frontend
should read these options and run the target compilation with them?


On Thu, Nov 7, 2013 at 7:42 PM, Jakub Jelinek  wrote:
> On Thu, Nov 07, 2013 at 07:36:06PM +0400, Andrey Turetskiy wrote:
>> > Note, configure options should be either --with- or --enable- prefixed.
>> > Plus, it is probably better to use configuration triplets there.
>>
>> Do you mean smth like this:
>> configure --build=x86 --host=x86 --target=x86,mic,ptx
>> Then "make" should build 3 gcc: x86 native and crosses for mic and ptx.
>
> It can very well be just that the user should first
> mkdir ~/whatever-1; cd ~/whatever-1
> .../configure --target x86_64-k1om-linux --prefix=/whatever
> make; make install
> mkdir ~/whatever-2; cd ~/whatever-2
> .../configure --target ptx-none --prefix=/whatever
> make; make install
> and then
> mkdir ~/whatever-3; cd ~/whatever-3
> .../configure --with-offload-targets=x86_64-k1om-linux,ptx-none 
> --prefix=/whatever
> ?
> At least initially, because building several different compilers in one
> build directory would be kind of interesting, I'm not saying not doable,
> but there are other issues to be solved first.
>
> Jakub



-- 
Best regards,
Andrey Turetskiy


[gomp4] libgomp.c/target-1.c failing in fn2's GOMP_target_update

2013-11-08 Thread Thomas Schwinge
Hi!

On the gomp-4_0-branch, when using the ID 257 device (host fallback but
with non-shared memory), I see the libgomp.c/target-1.c test fail in
fn2's GOMP_target_update call:

libgomp: Trying to update [0x601a80..0x601a84) object that is not mapped

Is this a known issue?  (I have not yet started debugging that, and
figured someone more familiar with the code may perhaps easily be able to
tell what's going wrong.)

Breakpoint 1, 0x77316190 in exit () from 
/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x77316190 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x778c703d in gomp_fatal (fmt=fmt@entry=0x778d1418 "Trying 
to update [%p..%p) object that is not mapped") at [...]/libgomp/error.c:65
#2  0x778cfce6 in gomp_update (devicep=0x602030, 
mapnum=mapnum@entry=2, hostaddrs=hostaddrs@entry=0x7fffb370, 
sizes=sizes@entry=0x601a10 <.omp_data_sizes.10.1928>,
kinds=kinds@entry=0x601a00 <.omp_data_kinds.11.1929> "\022\032") at 
[...]/libgomp/target.c:462
#3  0x778d08ce in GOMP_target_update (device=device@entry=-1, 
openmp_target=openmp_target@entry=0x0, mapnum=mapnum@entry=2, 
hostaddrs=hostaddrs@entry=0x7fffb370, sizes=sizes@entry=0x601a10 
<.omp_data_sizes.10.1928>,
kinds=kinds@entry=0x601a00 <.omp_data_kinds.11.1929> "\022\032") at 
[...]/libgomp/target.c:568
#4  0x00400f55 in fn2 (x=, x@entry=128, y=, y@entry=4, z=, z@entry=6) at 
[...]/libgomp/testsuite/libgomp.c/target-1.c:44
#5  0x004011ca in main () at 
[...]/libgomp/testsuite/libgomp.c/target-1.c:79


The libgomp.c++/target-1.C test also fails; have not yet looked whether
it's the same issue.


That aside, I'm using the following patch to enable the ID 257 device
without having the LIBGOMP_PLUGIN_PATH environment variable set; OK for
gomp-4_0-branch?

libgomp: Always set up device 257 if no other device has been found.

libgomp/
* target.c (gomp_find_available_plugins): Don't skip device 257
setup.

diff --git libgomp/target.c libgomp/target.c
index c0730a7..d84a1fa 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -651,11 +651,11 @@ gomp_find_available_plugins (void)
 
   plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
   if (!plugin_path)
-return;
+goto out;
 
   dir = opendir (plugin_path);
   if (!dir)
-return;
+goto out;
 
   while ((ent = readdir (dir)) != NULL)
 {
@@ -675,7 +675,7 @@ gomp_find_available_plugins (void)
{
  num_devices = 0;
  closedir (dir);
- return;
+ goto out;
}
 
   devices[num_devices] = current_device;
@@ -686,6 +686,7 @@ gomp_find_available_plugins (void)
 }
   closedir (dir);
 
+ out:
   /* FIXME: Temporary hack for testing non-shared address spaces on host.
  We create device 257 just to check memory mapping.  */
   if (num_devices == 0)


Grüße,
 Thomas




Re: [gomp4] libgomp.c/target-1.c failing in fn2's GOMP_target_update

2013-11-08 Thread Jakub Jelinek
On Fri, Nov 08, 2013 at 04:29:03PM +0100, Thomas Schwinge wrote:
> 
> On the gomp-4_0-branch, when using the ID 257 device (host fallback but
> with non-shared memory), I see the libgomp.c/target-1.c test fail in
> fn2's GOMP_target_update call:
> 
> libgomp: Trying to update [0x601a80..0x601a84) object that is not mapped
> 
> Is this a known issue?  (I have not yet started debugging that, and
> figured someone more familiar with the code may perhaps easily be able to
> tell what's going wrong.)

That is expected; device 257 is just a temporary testing hack, which doesn't
support variables with the "omp declare target" attribute.
So, you can only use it for testcases that don't have any #pragma omp
declare target variables, or if they do, they only access them in #pragma
omp declare target functions, but never try to use them or anything related
to them in map/to/from clauses.
The plan is that, using the two proposed tables (a host table of host_addr/size
pairs and a corresponding target table of target_addr), libgomp will register
all those ranges in the mapping table during initialization of offloading for
a particular shared library resp. binary.

> That aside, I'm using the following patch to enable the ID 257 device
> without having the LIBGOMP_PLUGIN_PATH environment variable set; OK for
> gomp-4_0-branch?

I guess it is ok; once we have at least one supported offloading target,
hopefully we'll nuke device 257.

> libgomp: Always set up device 257 if no other device has been found.
> 
>   libgomp/
>   * target.c (gomp_find_available_plugins): Don't skip device 257
>   setup.
> 
> diff --git libgomp/target.c libgomp/target.c
> index c0730a7..d84a1fa 100644
> --- libgomp/target.c
> +++ libgomp/target.c
> @@ -651,11 +651,11 @@ gomp_find_available_plugins (void)
>  
>plugin_path = getenv ("LIBGOMP_PLUGIN_PATH");
>if (!plugin_path)
> -return;
> +goto out;
>  
>dir = opendir (plugin_path);
>if (!dir)
> -return;
> +goto out;
>  
>while ((ent = readdir (dir)) != NULL)
>  {
> @@ -675,7 +675,7 @@ gomp_find_available_plugins (void)
>   {
> num_devices = 0;
> closedir (dir);
> -   return;
> +   goto out;
>   }
>  
>devices[num_devices] = current_device;
> @@ -686,6 +686,7 @@ gomp_find_available_plugins (void)
>  }
>closedir (dir);
>  
> + out:
>/* FIXME: Temporary hack for testing non-shared address spaces on host.
>   We create device 257 just to check memory mapping.  */
>if (num_devices == 0)
> 
> 
> Grüße,
>  Thomas



Jakub


Vectorizer/alignment

2013-11-08 Thread Hendrik Greving
The code for a simple loop like

for (i = 0; i < LENGTH-1; i++) {
g_c[i] = g_a[i] + g_b[i];
}

looks good for g++ (4.9.0 20131028 (experimental)) (-O3 core-avx2)

.L2:
vmovdqa g_a(%rax), %ymm0 # 26 *movv8si_internal/2 [length = 8]
vpaddd g_b(%rax), %ymm0, %ymm0 # 27 *addv8si3/2 [length = 8]
addq $32, %rax # 29 *adddi_1/1 [length = 4]
vmovaps %ymm0, g_c-32(%rax) # 28 *movv8si_internal/3 [length = 8]
cmpq $39968, %rax # 31 *cmpdi_1/1 [length = 6]
jne .L2 # 32 *jcc_1 [length = 2]

but for gcc, I'm getting

.L4:
vmovdqu (%rsi,%rax), %xmm0 # 156 sse2_loaddquv16qi [length = 5]
vinserti128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0 # 157
avx_vec_concatv32qi/1 [length = 8]
addl $1, %edx # 161 *addsi_1/1 [length = 3]
vpaddd (%rdi,%rax), %ymm0, %ymm0 # 158 *addv8si3/2 [length = 5]
vmovups %xmm0, (%rcx,%rax) # 412 *movv16qi_internal/3 [length = 5]
vextracti128 $0x1, %ymm0, 16(%rcx,%rax) # 160 vec_extract_hi_v32qi/2
[length = 8]
addq $32, %rax # 162 *adddi_1/1 [length = 4]
cmpl $1248, %edx # 164 *cmpsi_1/1 [length = 6]
jbe .L4 # 165 *jcc_1 [length = 2]

unless I add "__attribute__ ((aligned (64)))" to g_a, g_b, g_c.

Two questions: Does C have different alignment requirements/specs than
C++ (I don't think so)? But if so, why does gcc not just align the
arrays (they are in the same module in my example...)? Leaving aside the
alignment question, why not just do avx2 (ymm) moves as g++ does?

I guess my question is: is this a bug or a feature?

Thanks,
Regards,
Hendrik


The Linux binutils 2.24.51.0.1 is released

2013-11-08 Thread H.J. Lu
It is also available as linux/release/2.24.51.0.1 tag at

https://sourceware.org/git/?p=binutils-gdb.git;a=summary


H.J.
---
This is the beta release of binutils 2.24.51.0.1 for Linux, which is
based on binutils 2013 1106 master branch on sourceware.org plus
various changes. It is purely for Linux.

All relevant patches in the patches/ directory have been applied to the source
tree.  You can take a look at patches/README to see what has been applied and
in what order.

Starting from the 2.23.52.0.2 release, when creating executables, the BFD
linker will issue an error for an undefined weak reference which is
defined in a shared library from DT_NEEDED.  Previously the BFD linker
would silently include the shared library from DT_NEEDED.

Starting from the 2.21.51.0.3 release, you must remove .ctors/.dtors
section sentinels when building glibc or other C run-time libraries.
Otherwise, you will run into:

http://sourceware.org/bugzilla/show_bug.cgi?id=12343

Starting from the 2.21.51.0.2 release, BFD linker has the working LTO
plugin support. It can be used with GCC 4.5 and above. For GCC 4.5, you
need to configure GCC with --enable-gold to enable LTO plugin support.

Starting from the 2.21.51.0.2 release, binutils fully supports compressed
debug sections.  However, compressed debug section isn't turned on by
default in assembler. I am planning to turn it on for x86 assembler in
the future release, which may lead to the Linux kernel bug messages like

WARNING: lib/ts_kmp.o (.zdebug_aranges): unexpected non-allocatable section.

But the resulting kernel works fine.

Starting from the 2.20.51.0.4 release, no diffs against the previous
release will be provided.

You can enable both gold and bfd ld with --enable-gold=both.  Gold will
be installed as ld.gold and bfd ld will be installed as ld.bfd.  By
default, ld.bfd will be installed as ld.  You can use the configure
option, --enable-gold=both/gold to choose gold as the default linker,
ld.  IA-32 binary and x86_64 binary tarballs are configured with
--enable-gold=both/ld --enable-plugins --enable-threads.

Starting from the 2.18.50.0.4 release, the x86 assembler no longer
accepts

fnstsw %eax

fnstsw stores 16bit into %ax and the upper 16bit of %eax is unchanged.
Please use

fnstsw %ax

Starting from the 2.17.50.0.4 release, the default output section LMA
(load memory address) has changed for allocatable sections from being
equal to VMA (virtual memory address), to keeping the difference between
LMA and VMA the same as the previous output section in the same region.

For

.data.init_task : { *(.data.init_task) }

LMA of .data.init_task section is equal to its VMA with the old linker.
With the new linker, it depends on the previous output section. You
can use

.data.init_task : AT (ADDR(.data.init_task)) { *(.data.init_task) }

to ensure that LMA of .data.init_task section is always equal to its
VMA. The linker script in the older 2.6 x86-64 kernel depends on the
old behavior.  You can add AT (ADDR(section)) to force LMA of
.data.init_task section equal to its VMA. It will work with both old
and new linkers. The x86-64 kernel linker script in kernel 2.6.13 and
above is OK.

The new x86_64 assembler no longer accepts

monitor %eax,%ecx,%edx

You should use

monitor %rax,%ecx,%edx

or
monitor

which works with both old and new x86_64 assemblers. They should
generate the same opcode.

The new i386/x86_64 assemblers no longer accept instructions for moving
between a segment register and a 32bit memory location, i.e.,

movl (%eax),%ds
movl %ds,(%eax)

To generate instructions for moving between a segment register and a
16bit memory location without the 16bit operand size prefix, 0x66,

mov (%eax),%ds
mov %ds,(%eax)

should be used. It will work with both new and old assemblers. The
assembler starting from 2.16.90.0.1 will also support

movw (%eax),%ds
movw %ds,(%eax)

without the 0x66 prefix. Patches for 2.4 and 2.6 Linux kernels are
available at

http://www.kernel.org/pub/linux/devel/binutils/linux-2.4-seg-4.patch
http://www.kernel.org/pub/linux/devel/binutils/linux-2.6-seg-5.patch

The ia64 assembler is now defaulted to tune for Itanium 2 processors.
To build a kernel for Itanium 1 processors, you will need to add

ifeq ($(CONFIG_ITANIUM),y)
CFLAGS += -Wa,-mtune=itanium1
AFLAGS += -Wa,-mtune=itanium1
endif

to arch/ia64/Makefile in your kernel source tree.

Please report any bugs related to binutils 2.24.51.0.1 to
hjl.to...@gmail.com

and

http://www.sourceware.org/bugzilla/

Changes from binutils 2.23.52.0.2:

1. Update from binutils 2013 1106.
2. Add Intel AVX-512 new instruction support.
3. Add Intel MPX new instruction support.
4. Update ld to support x86-64 large PIC model with TLS GD and LD sequences.
5. Fix ld to properly handle R_X86_64_DTPOFF64.  PR 15685.
6. Fix x86 assembler to properly check 64-bit register.
7. Update x86 assembler not to align text/data/b

Re: The Linux binutils 2.24.51.0.1 is released

2013-11-08 Thread H.J. Lu
I renamed the release tag to hjl/linux/release/2.24.51.0.1


H.J.
On Fri, Nov 8, 2013 at 9:25 AM, H.J. Lu  wrote:
> It is also available as linux/release/2.24.51.0.1 tag at
>
> https://sourceware.org/git/?p=binutils-gdb.git;a=summary
>
>
> H.J.

Re: Vectorizer/alignment

2013-11-08 Thread Richard Biener
Hendrik Greving  wrote:
>The code for a simple loop like
>
>for (i = 0; i < LENGTH-1; i++) {
>g_c[i] = g_a[i] + g_b[i];
>}
>
>looks good for g++ (4.9.0 20131028 (experimental)) (-O3 core-avx2)
>
>.L2:
>vmovdqa g_a(%rax), %ymm0 # 26 *movv8si_internal/2 [length = 8]
>vpaddd g_b(%rax), %ymm0, %ymm0 # 27 *addv8si3/2 [length = 8]
>addq $32, %rax # 29 *adddi_1/1 [length = 4]
>vmovaps %ymm0, g_c-32(%rax) # 28 *movv8si_internal/3 [length = 8]
>cmpq $39968, %rax # 31 *cmpdi_1/1 [length = 6]
>jne .L2 # 32 *jcc_1 [length = 2]
>
>but for gcc, I'm getting
>
>.L4:
>vmovdqu (%rsi,%rax), %xmm0 # 156 sse2_loaddquv16qi [length = 5]
>vinserti128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0 # 157
>avx_vec_concatv32qi/1 [length = 8]
>addl $1, %edx # 161 *addsi_1/1 [length = 3]
>vpaddd (%rdi,%rax), %ymm0, %ymm0 # 158 *addv8si3/2 [length = 5]
>vmovups %xmm0, (%rcx,%rax) # 412 *movv16qi_internal/3 [length = 5]
>vextracti128 $0x1, %ymm0, 16(%rcx,%rax) # 160 vec_extract_hi_v32qi/2
>[length = 8]
>addq $32, %rax # 162 *adddi_1/1 [length = 4]
>cmpl $1248, %edx # 164 *cmpsi_1/1 [length = 6]
>jbe .L4 # 165 *jcc_1 [length = 2]
>
>unless I add "__attribute__ ((aligned (64)));" g_a, g_b, g_c.
>
>2 questions: Does C have different alignment requirements/specs than
>C++ (I don't think so)?

Try -fno-common
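
The likely reason: with the default -fcommon, tentative definitions of the
arrays become common symbols, and the compiler cannot raise the alignment
of a common symbol (another TU might provide the definition), so the
vectorizer has to assume only the ABI alignment.  A self-contained sketch
(the LENGTH value and file layout are assumed, not taken from your mail):

/* vec.c */
#define LENGTH 10000

int g_a[LENGTH], g_b[LENGTH], g_c[LENGTH];  /* tentative definitions -> common symbols */

void
add_arrays (void)
{
  int i;
  for (i = 0; i < LENGTH - 1; i++)
    g_c[i] = g_a[i] + g_b[i];
}

/* gcc -O3 -march=core-avx2 -S vec.c             -> unaligned/split accesses
   gcc -O3 -march=core-avx2 -fno-common -S vec.c -> aligned ymm accesses,
   as does giving the arrays an initializer or an alignment attribute.  */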

Richard.

>But if so, why does gcc not just align the
>arrays (they are in the same module in my example...)? Let aside the
>alignment question, why not just do avx2 (ymm) moves as g++ does?
>
>Guess my question is, is this a bug or a feature?
>
>Thanks,
>Regards,
>Hendrik




Re: How can I tune gcc to move up simple common subexpression?

2013-11-08 Thread Jeff Law

On 11/08/13 02:28, Konstantin Vladimirov wrote:

typedef struct
{
   unsigned prev;
   unsigned next;
} foo_t;

void
foo( unsigned x, unsigned y)
   {
 foo_t *ptr = (foo_t *)((void *)x);

 if (y != 0)
   {
  ptr->prev = y;
  ptr->next = x;
}
  else
{
  ptr->prev = 0; /* or explicitly ptr->prev = y; no difference */
  ptr->next = 0;
}
}
Umm, you can't hoist ptr->prev before the conditional because that would 
change the meaning of this code.



I think you wanted the conditional to test y == 0, which exposes the code
hoisting opportunity for the ptr->prev assignment.  Once you fix the
testcase, the code in jump2 will hoist the assignment, resulting in:






.cfi_startproc
testl   %esi, %esi
movl%edi, %eax
movl$0, (%edi)
je  .L5
movl$0, 4(%rax)
ret
.p2align 4,,10
.p2align 3
.L5:
movl%edi, 4(%rax)
ret
.cfi_endproc
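
For reference, the adjusted testcase looks roughly like this (a sketch of
what I assume was intended; only the test is flipped):

#include <stdint.h>

typedef struct { unsigned prev; unsigned next; } foo_t;

void
foo (unsigned x, unsigned y)
{
  foo_t *ptr = (foo_t *)(void *)(uintptr_t)x;

  if (y == 0)          /* flipped test: both arms now store 0 into prev */
    {
      ptr->prev = y;   /* y is known to be zero here */
      ptr->next = x;
    }
  else
    {
      ptr->prev = 0;
      ptr->next = 0;
    }
}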


Jeff


Re: LRA: check_rtl modifies RTL instruction stream

2013-11-08 Thread Vladimir Makarov

On 11/8/2013, 9:13 AM, Robert Suchanek wrote:

Hi Vladimir,

I have been looking into regression testing for mips16 with LRA enabled
and tried to understand and solve some ICEs. It was found that in
a narrowed testcase (attached below) that there are two issues:

1. In the back end - pattern not recognized and hence ICE.
2. In the LRA - a bug that exposes the problem above.

The problem within LRA points to check_rtl function. The function
does not only check for the consistency of the instruction stream, but
unfortunately, it accidentally modifies it as well.

The fragment of the RTL dump before check_rtl():

(insn 18 7 12 2 (set (reg/f:DI 197)
  (symbol_ref:DI ("a")  )) fpr-moves-7.c:7 280 {*movdi_64bit_mips16}
  (expr_list:REG_EQUIV (symbol_ref:DI ("a")  )
 (nil)))

After check_rtl(), movdi_64bit_mips16 turns into *lea64:

(insn 18 7 12 2 (parallel [
 (set (reg/f:DI 197)
 (symbol_ref:DI ("a")  ))
 (clobber (scratch:DI))
 ]) fpr-moves-7.c:7 258 {*lea64}
  (expr_list:REG_EQUIV (symbol_ref:DI ("a")  )
 (nil)))

What happens here is that check_rtl calls insn_invalid_p and insn_invalid_p
tries to add clobber registers in the hope to match a pattern. In our case,
adding a clobber does match *lea64 and insn_invalid_p generates new
instruction. The reason for doing this is that reload_in_progress is not being 
set
when LRA is running. Otherwise, insn_invalid_p would be prevented to add 
clobbers.
The problem does not exist if we run it with the classic reload.

One of the solutions I can think of is adding !lra_in_progress to insn_invalid_p
and set this variable before check_rtl() but I am not fully confident that this
is so trivial (I am new to the gcc hacking business). I see a number of reasons
that reload_in_progress is not being used when LRA is used, thus, not entirely
sure if this change would not break anything else.

No, you are perfectly right.  At the end of LRA all insns should be
valid *without any change*.  That is the purpose of check_rtl.  There are
a lot of reasons why we use lra_in_progress and not reload_in_progress.
Unfortunately, there are too many places where reload_in_progress is and
was used, and apparently one of these places was overlooked.


So adding !lra_in_progress along with !reload_in_progress is the right
solution here.

Can you suggest how to guarantee check_rtl does not modify the insns?
The back end issue will be looked separately by us.


Thanks for finding this.  I guess you can submit the patch.

Please also use the gcc-patches mailing list for such discussions.  It is a
better and more appropriate place for this.





[RFC] Replace Java with Go in default languages

2013-11-08 Thread Jeff Law



GCJ has, IMHO, moved from active development into a deep maintenance
mode.  I suspect this is largely due to the change of focus of key
developers to OpenJDK and other projects.  GCJ played a role in
bootstrapping OpenJDK, both technically and politically, and had OpenJDK
not happened, I suspect GCJ would still be under active development.


The last news item related to Java was in 2009, and scanning the ChangeLog
doesn't show significant project activity (~14 changes in 2013, most of
which look like routine maintenance in the language front end).  There are
even fewer changes occurring in the runtime system.


I did some benchmarking using one of my slower systems (primarily 
because my faster systems are used for real work).  It's an older quad 
machine, but should give us a reasonable feel for how expensive java is 
to the bootstrap & regression testing process.


A default languages bootstrap takes 67 minutes on that box (-j4).  The 
times were consistent to within 20 seconds.  Disabling java brings that 
time down to 51 minutes, again with a variance of around 20 seconds. 
That means roughly 25% of the time to bootstrap is Java.


I didn't measure total testing time -- just the time to test Java, where 
it clocks in at 7 minutes (again -j4, though it's clearly not doing much 
in parallel).


Clearly bootstrapping and testing Java is expensive.  It's better than a 
while back (thanks to removing the static library build), but it's still 
a significant component of the bootstrap & test cycle we all do regularly.


We discussed removing libjava extensively in 2008, but never moved 
forward.  It's not entirely clear why from reviewing the thread. 
Additionally, I think the landscape around OpenJDK is a bit different 
now than then and thus it's time to revisit.


So instead of proposing that we just remove Java from the default
languages, I propose that we replace Java with Go.


Go uses -fnon-call-exceptions which is one of the things that was a bit 
unique about GCJ and Go appears to have a much more vibrant developer 
and user community than GCJ.  So we get the -fnon-call-exceptions 
testing we want and we're actually building a front-end that a larger 
community cares about.


A bootstrap with Go replacing Java clocks in at 56 minutes.  So we're 
still getting most of the improvement in bootstrap times.


Testing Go (compiler & runtime) takes about a minute longer than libjava 
(it's doing more in parallel, so serially Go would be considerably 
longer in testing).


Clearly switching from libjava to Go would be a significant improvement
in the bootstrap and regression test cycle.  On the box I tested we'd
see roughly a 15% improvement, and we'd still get testing of
-fnon-call-exceptions.



Thoughts or comments?







Re: [RFC] Replace Java with Go in default languages

2013-11-08 Thread Diego Novillo
On Fri, Nov 8, 2013 at 5:21 PM, Jeff Law  wrote:

> Thoughts or comments?

I fully support this.  I've been wanting to remove Java from the
default bootstrap for a long time now. Bringing in Go seems like a
good idea as well.


Diego.


Re: [RFC] Replace Java with Go in default languages

2013-11-08 Thread Ian Lance Taylor
On Fri, Nov 8, 2013 at 2:21 PM, Jeff Law  wrote:
>
> So instead of proposing that we just remove Java from the default languages,
> I propose that we replace Java with Go.

I'm certainly in favor of removing Java from the set of default
languages.

I'm less sure about adding Go.

Right now Go does not build on a range of targets, notably including
Windows, MacOS, AIX, and most embedded systems.  We would have to
disable it by default on targets that are not supported, which is
straightforward (we already have rules to disable java on targets it
does not support).  But to the extent that there are options like
-fnon-call-exceptions that are tested primarily by Java and Go, we
would get less coverage of those options, since we would not test them
on systems that Java supports but Go does not.

More seriously, the Go sources live in a separate repository, and are
copied to the GCC repo.  In practice this means that when Go breaks,
it can't be fixed until I am online to fix it.  I don't think it would
be good for GCC for a bootstrap break to depend on me.  Of course we
could change the rules somewhat, and let people commit changes to the
Go parts of the GCC repo which I would then have to copy out.  But
it's something to think about.

Ian


Re: [RFC] Detect most integer overflows.

2013-11-08 Thread Geert Bosch

On Oct 29, 2013, at 05:41, Richard Biener  wrote:

> For reference those
> (http://clang.llvm.org/docs/LanguageExtensions.html) look like
> 
>  if (__builtin_umul_overflow(x, y, &result))
>return kErrorCodeHackers;
> 
> which should be reasonably easy to support in GCC (if you factor out
> generating best code and just aim at compatibility).  Code-generation
> will be somewhat pessimized by providing the multiplication result
> via memory, but that's an implementation detail.
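
(For reference, a minimal sketch of how that builtin is used; the wrapper
below is made up for illustration.)

#include <stdbool.h>

/* Returns true and stores x * y into *out, or returns false on overflow,
   in which case *out is left untouched.  */
bool
checked_umul (unsigned x, unsigned y, unsigned *out)
{
  unsigned result;
  if (__builtin_umul_overflow (x, y, &result))
    return false;
  *out = result;
  return true;
}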

I've done the overflow checking in Gigi (Ada front end). Benchmarking
real world large Ada programs (where every integer operation is checked,
including array index computations etc.), I found the performance cost 
*very* small (less than 1% on typical code, never more than 2%). There
is a bit more cost in code size, but that is mostly due to the fact that
we want to generate error messages with correct line number information
without relying on backtraces.

The rest of the run time checks in Ada (especially index checks and range
checks) were far more costly (more on the order of 10-15%, but very
variable depending on code style).

A few things helped to make the cost small: the biggest one is that
typically one of the operands is known to be negative or positive.
Gigi will use Ada type information, and Natural or Positive integer
variables are very common.  So, if you compute X + C with C positive, 
you can write the conditional expression as:
(if X < Integer'Last - C then X + C else raise Constraint_Error)
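
In C terms the expanded check is roughly the following (only a sketch; the
file name, line number and the signature of the runtime raise routine are
guessed from the assembly below, not taken from the GNAT sources):

#include <limits.h>

extern void __gnat_rcheck_CE_Overflow_Check (const char *file, int line)
  __attribute__ ((noreturn));

int
ada_add (int x)
{
  const int c = 1;            /* the known-positive constant C */
  if (x < INT_MAX - c)        /* X < Integer'Last - C */
    return x + c;
  __gnat_rcheck_CE_Overflow_Check ("ada_add.adb", 3);
}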

On my x86-64 this generates something like:
__ada_add:
00  cmpl    $0x7fff,%edi
06  je      0x000c
08  leal    0x01(%rdi),%eax
0b  ret
0c  leaq    0x000d(%rip),%rdi
13  pushq   %rax
14  movl    $0x0003,%esi
19  xorl    %eax,%eax
1b  callq   ___gnat_rcheck_CE_Overflow_Check

While this may look like a lot, these operations are expanded
inline, and only the first three are on the normal execution
path. As the exception raise is a No_Return subprogram, it will
be moved to the end of the file. The jumps will both statically
and dynamically be treated as not-taken, and have very little cost.

Additionally, the comparison is visible for the optimizers, in
effect giving more value range information which can be used
for optimizing away further checks. The drawback of using any
"special" new operations is that we loose that aspect.

For the less common case in which neither operand has a known
sign, widening to 64-bits is the straightforward solution. For Ada,
we have a mode where we do this kind of widening for entire
expressions, so we only have to check on the final assignment.
The semantics here are that you'd get the mathematically correct
result, even if there was an intermediate overflow. The drawback
of this approach is that an overflow check may not fail, but
that suppressing the checks removes the widening and causes
wrong answers.

  -Geert




Re: [RFC] Detect most integer overflows.

2013-11-08 Thread Ondřej Bílka
On Fri, Nov 08, 2013 at 08:31:38PM -0500, Geert Bosch wrote:
> 
> On Oct 29, 2013, at 05:41, Richard Biener  wrote:
> 
> > For reference those
> > (http://clang.llvm.org/docs/LanguageExtensions.html) look like
> > 
> >  if (__builtin_umul_overflow(x, y, &result))
> >return kErrorCodeHackers;
> > 
> > which should be reasonably easy to support in GCC (if you factor out
> > generating best code and just aim at compatibility).  Code-generation
> > will be somewhat pessimized by providing the multiplication result
> > via memory, but that's an implementation detail.
> 
> I've done the overflow checking in Gigi (Ada front end). Benchmarking
> real world large Ada programs (where every integer operation is checked,
> including array index computations etc.), I found the performance cost 
> *very* small (less than 1% on typical code, never more than 2%). There
> is a bit more cost in code size, but that is mostly due to the fact that
> we want to generate error messages with correct line number information
> without relying on backtraces.
>
Overhead is mostly from additional branches that are not taken. We need
a more accurate measure of cache effects than code size, for example
looking at the increase in icache hits, which will not count code that
is never executed.

> The rest of the run time checks in Ada (especially index checks and range
> checks) were far more costly (more on the order of 10-15%, but very
> variable depending on code style).
> 
> A few things helped to make the cost small: the biggest one is that
> typically on of the operands is known to be negative or positive.
> Gigi will use Ada type information, and Natural or Positive integer
> variables are very common.  So, if you compute X + C with C positive, 
> you can write the conditional expression as:

On x64 the effect of this analysis is small; the processor does overflow
detection for you.

> (if X < Integer'Last - C then X + C else raise Constraint_Error)
> 

> On my x86-64 this generates something like:
> __ada_add:
> 00cmpl$0x7fff,%edi
> 06je  0x000c
> 08leal0x01(%rdi),%eax
> 0bret
> 0cleaq0x000d(%rip),%rdi
> 13pushq   %rax
> 14movl$0x0003,%esi
> 19xorl%eax,%eax
> 1bcallq   ___gnat_rcheck_CE_Overflow_Check
>

This has a redundant compare instruction that costs a cycle and 6 bytes.
You can just write:

   0:   83 c7 01                add    $0x1,%edi
   3:   71 03                   jno    0x8
   5:   89 f8                   mov    %edi,%eax
   7:   c3                      retq
   8:   48 8d 3d 0d 00 00 00    lea    0xd(%rip),%rdi
   f:   50                      push   %rax
  10:   be 03 00 00 00          mov    $0x3,%esi
  15:   31 c0                   xor    %eax,%eax
  17:   e8 00 00 00 00          callq  0x1c

When you know that one operand is positive or you deal with unsigned
values, you can replace jno with jnc, which is a bit faster on Sandy Bridge
processors and later, as the add/jnc pair is macro-fused but add/jno is not.
 
> While this may look like a lot, these operations are expanded
> inline, and only the first three are on the normal execution
> path. As the exception raise is a No_Return subprogram, it will
> be moved to the end of the file. The jumps will both statically
> and dynamically be treated as not-taken, and have very little cost.
> 
> Additionally, the comparison is visible for the optimizers, in
> effect giving more value range information which can be used
> for optimizing away further checks. The drawback of using any
> "special" new operations is that we loose that aspect.
> 
> For the less common case in which neither operand has a known
> sign, widening to 64-bits is the straightforward solution. For Ada,
> we have a mode where we do this kind of widening for entire
> expressions, so we only have to check on the final assignment.
> The semantics here are that you'd get the mathematically correct
> result, even if there was an intermediate overflow. The drawback
> of this approach is that an overflow check may not fail, but
> that suppressing the checks removes the widening and causes
> wrong answers.
> 
>   -Geert
>