Re: Question about find modifiable mems

2015-06-05 Thread shmeel gutl

On 04-Jun-15 03:54 AM, Jim Wilson wrote:

On 06/02/2015 11:39 PM, shmeel gutl wrote:

find_modifiable_mems was introduced to gcc 4.8 in september 2012. Is
there any documentation as to how it is supposed to help the haifa
scheduler?

The patch was submitted here
   https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html
and this message contains a brief explanation of what it is supposed to
do.  The explanation looks like a useful optimization, but perhaps it is
triggering in cases when it shouldn't.

Jim



Thanks, this is what I was looking for. From the comments, he didn't 
intend it to do what I saw. The problem is probably in my port and the 
very particular way that we handle instruction costs.

If I see a problem that isn't specific to my port, I will report back.
Shmeel



Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Ondřej Bílka
On Thu, Jun 04, 2015 at 02:34:40PM -0700, Andi Kleen wrote:
> The compiler has much more information than the headers.
> 
> - It can do alias analysis, so to avoid needing to handle overlap
> and similar.

It could, but it could also export that information, which would benefit
third parties.

> - It can (sometimes) determine alignment, which is important
> information for tuning.

In the general case yes, but here it's useless. Since most of these functions
see 16-byte-aligned arguments in less than 10% of calls, you shouldn't add a
cold branch to handle aligned data.

Also, as I mentioned in earlier bug reports, gcc currently doesn't handle
alignment well, so it doesn't optimize the following to zero for aligned code.

 align = ((uintptr_t) x) % 16;

If it did, you wouldn't need to go through gcc at all; a header could just
check the alignment with
__builtin_constant_p(((uintptr_t) x) % 16) && ((uintptr_t) x) % 16 == 0
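
A minimal sketch of that header trick (my_memset_aligned is a hypothetical
aligned fast path, not an existing glibc entry point): the inline branch is
taken only when gcc can fold the alignment test to a compile-time constant
zero, so unaligned callers pay nothing for it.

#include <stdint.h>
#include <string.h>

void *my_memset_aligned (void *x, int c, size_t n);  /* hypothetical fast path */

#define my_memset(x, c, n)                            \
  ((__builtin_constant_p (((uintptr_t) (x)) % 16)     \
    && ((uintptr_t) (x)) % 16 == 0)                   \
     ? my_memset_aligned ((x), (c), (n))              \
     : memset ((x), (c), (n)))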

> - With profile feedback it can use value histograms to determine the
> best code.
> 
The problem is that histograms are not enough, as I mentioned before. For
profiling you need to measure data that is actually useful, which differs per
function and should be collected in userspace.

For the best code you need to know things like the percentage of cache lines
resident in the L1, L2 and L3 caches to select the correct memset.

On ivy bridge I found that using rep stosq for memset(x,0,4096) is 20%
slower than a libcall for L1-cache-resident data, while 50% faster for data
outside the cache. How do you teach the compiler that?

Switch to 16-byte blocks on the pages below to see the graphs.

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memset_profile/results_rand_nocache/result.html

Likewise for memcpy I found that rte_memcpy is faster on copies of L1-cache-resident data.
That isn't very useful, as you cannot have many 8kb input and output
buffers both in the L1 cache. The reason is that it uses a 256-byte loop; that
advantage becomes nil for the L2 cache and a problem for the L3 cache, where it is slower.

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L2/result.html
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand_L3/result.html

Likewise for strcmp and co. you need to know the probabilities of where the
first byte mismatch occurs, and depending on that first add 0-4 bytewise
checks, followed perhaps by 8-byte checks and a libcall.
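
As a rough sketch, the simplest such header fast path could look like this
(STRCMP_FAST1 is a made-up macro; it double-evaluates its arguments, and only
the sign of the bytewise branch, not its exact value, is guaranteed to match
strcmp):

#include <string.h>

#define STRCMP_FAST1(a, b)                                               \
  (((const unsigned char *) (a))[0] != ((const unsigned char *) (b))[0] \
     ? (int) ((const unsigned char *) (a))[0]                           \
       - (int) ((const unsigned char *) (b))[0]                         \
     : strcmp ((a), (b)))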

> It may not use all of this today, but it could.
> 


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Jakub Jelinek
On Fri, Jun 05, 2015 at 11:02:03AM +0200, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 02:34:40PM -0700, Andi Kleen wrote:
> > The compiler has much more information than the headers.
> > 
> > - It can do alias analysis, so to avoid needing to handle overlap
> > and similar.
> 
> It could, but it could also export that information, which would benefit
> third parties.

How?

> > - It can (sometimes) determine alignment, which is important
> > information for tuning.
> 
> In the general case yes, but here it's useless. Since most of these functions
> see 16-byte-aligned arguments in less than 10% of calls, you shouldn't add a
> cold branch to handle aligned data.
> 
> Also, as I mentioned in earlier bug reports, gcc currently doesn't handle
> alignment well, so it doesn't optimize the following to zero for aligned code.
> 
>  align = ((uintptr_t) x) % 16;

That is simply not true.  E.g.
struct __attribute__((aligned (16))) S { char b[16]; };
struct S a;

unsigned long
foo (void)
{
  return (((unsigned long) &a) % 16);
}
is optimized into 0, and many other testcases are too; the CCP pass takes alignment
info into account and optimizes based on that.  If you are talking about the
result of malloc, supposedly it is because glibc headers don't properly mark
malloc with the alloc_align attribute yet.
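
For illustration, this is roughly how an allocator declaration can advertise
alignment to gcc (my_malloc and my_aligned_alloc are made-up names, not the
actual glibc declarations; both attributes are available in recent gcc):

#include <stddef.h>

/* Every returned pointer is promised to be 16-byte aligned.  */
void *my_malloc (size_t size)
     __attribute__ ((malloc, assume_aligned (16)));

/* The alignment is taken from the value of the first argument.  */
void *my_aligned_alloc (size_t align, size_t size)
     __attribute__ ((malloc, alloc_align (1)));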

> > - With profile feedback it can use value histograms to determine the
> > best code.
> > 
> The problem is that histograms are not enough, as I mentioned before. For
> profiling you need to measure data that is actually useful, which differs per
> function and should be collected in userspace.

For some builtin functions PGO can collect custom extra data that the
compiler can then use to decide how to expand the builtins.
E.g. for some string op builtins PGO already collects average alignment and
average size.

Jakub


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Ondřej Bílka
On Fri, Jun 05, 2015 at 11:23:12AM +0200, Jakub Jelinek wrote:
>
> That is simply not true.  E.g.
> struct __attribute__((aligned (16))) S { char b[16]; };
> struct S a;
> 
> unsigned long
> foo (void)
> {
>   return (((unsigned long) &a) % 16);
> }
> is optimized into 0, and many other testcases are too; the CCP pass takes alignment
> info into account and optimizes based on that.  If you are talking about the
> result of malloc, supposedly it is because glibc headers don't properly mark
> malloc with the alloc_align attribute yet.
>
Ok, I take that back. I just hadn't heard that we should use that
attribute. I thought that __attribute__ ((__malloc__)) implied that.
 
> > > - With profile feedback it can use value histograms to determine the
> > > best code.
> > > 
> > The problem is that histograms are not enough, as I mentioned before. For
> > profiling you need to measure data that is actually useful, which differs per
> > function and should be collected in userspace.
> 
> For some builtin functions PGO can collect custom extra data that the
> compiler can then use to decide how to expand the builtins.
> E.g. for some string op builtins PGO already collects average alignment and
> average size.
> 
Where should I look?


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Mikhail Maltsev
05.06.2015 13:02, Ondřej Bílka writes:
> Also, as I mentioned in earlier bug reports, gcc currently doesn't handle
> alignment well, so it doesn't optimize the following to zero for aligned code.
> 
>  align = ((uintptr_t) x) % 16;
> 
That is because GCC is conservative and supports some non-ABI-compliant
memory allocators which only guarantee 8-byte alignment, but

char *bar()
{
char *data = __builtin_malloc(64);
return data + ((unsigned long)data) % 8;
}

does get optimized to

bar:
.LFB1:
.cfi_startproc
movl    $64, %edi
jmp malloc
.cfi_endproc
-- 
Regards,
Mikhail Maltsev


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Ondřej Bílka
On Fri, Jun 05, 2015 at 09:26:33AM +0400, Mikhail Maltsev wrote:
> > The compiler has much more information than the headers.
> > - It can do alias analysis, so to avoid needing to handle overlap
> > and similar.
> > - It can (sometimes) determine alignment, which is important
> > information for tuning.
> > - With profile feedback it can use value histograms to determine the
> > best code.
> Value range information is also passed to, for example, the memcpy expander.
> Even if the exact length of the data being copied is not known at compile time,
> the known range can help the compiler to select a subset of the available
> algorithms and choose one of them at run time without using several
> branches and a huge jumptable (which is unavoidable in library code,
> because it has to deal with the general case).
>
You can't, as there is no such thing as an optimal algorithm. I did a simple
memset benchmark that tests 8-byte-aligned memset(x, 0, c) with a constant
argument on L1-cache-resident data.

As a compiler you simply cannot do that, as you would need to expand
memset(x, 0, 256) into one of:

a) rep stosb. On haswell that's the fastest, and it's what glibc memset-avx2 does.
b) a sequence of movdqa %xmm0, x(%rdi). That's optimal on ivy bridge;
that's the table lookup exported from libc that I talked about.
c) -mstringop-strategy=unrolled_loop. That's the fastest implementation on
core2.

Now tell me how you, as a compiler, would do the selection?

Also the data clearly shows that memset suffers from the same problem as memcmp,
and using a loop should be disabled by default. It's slower than the glibc one
from size 128 on, where you stop unrolling it into a movq sequence, for all tested
processors except core2, where the threshold is at 256 bytes (read below).

Of the variants only unrolled_loop is sometimes faster, but that's slower
than the full unrolling that I proposed.

As for using a jumptable, it's the only way to do that, as you need to map each
size to the appropriate address in a completely unrolled implementation. Also we
could use an unrolled implementation using vmovdqa %ymm0, (%rdi) on
haswell. Without that I could use a jumptable with powers of 2.
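
A toy sketch of the jumptable idea for constant sizes that are multiples of 8
up to 64 (illustrative only, not the libc table I refer to above): the switch
becomes a jump table whose cases fall through into an unrolled store sequence,
so each size enters at exactly the stores it needs.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
zero_small (unsigned char *p, size_t n)   /* n a multiple of 8, n <= 64 */
{
  uint64_t z = 0;
  switch (n / 8)
    {
    case 8: memcpy (p + 56, &z, 8);  /* fall through */
    case 7: memcpy (p + 48, &z, 8);  /* fall through */
    case 6: memcpy (p + 40, &z, 8);  /* fall through */
    case 5: memcpy (p + 32, &z, 8);  /* fall through */
    case 4: memcpy (p + 24, &z, 8);  /* fall through */
    case 3: memcpy (p + 16, &z, 8);  /* fall through */
    case 2: memcpy (p + 8,  &z, 8);  /* fall through */
    case 1: memcpy (p,      &z, 8);  /* fall through */
    case 0: break;
    }
}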

 
> BUT comparison functions (like str[n]cmp, memcmp) expand to something
> like "repz cmpsb", which is definitely suboptimal. IMHO, a quick and
> dirty (but working) solution would be to use a library call instead of
> relying on HAVE_cmpstr[n]si (except for short strings). A better solution
> would be to implement something like the movmem pattern (which is aware of
> value ranges, alignment, etc.) but for comparison patterns. I could try
> to write at least some pro

I wrote a recursive always_inline function showing how it should be done in
a sibling thread.

The main point is for gcc to recognize streq and memeq, i.e. cases where you
don't care about the strcmp sign, which makes the implementation almost trivial.
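
For example, a minimal sketch of an 8-byte memeq (an illustrative helper,
assuming unaligned loads are cheap): when only the zero/non-zero result is
needed, two loads and one comparison suffice, with no byte-order concerns at
all.

#include <stdint.h>
#include <string.h>

static inline int
memeq8 (const void *x, const void *y)
{
  uint64_t a, b;
  memcpy (&a, x, 8);   /* unaligned-safe loads */
  memcpy (&b, y, 8);
  return a == b;
}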


haswell

size 64

glibc.s         0.41
loop.s          0.64
memset-avx2.s   0.52
memset_tbl.s    0.43
rep8.s          0.91
unroll.s        0.44

size 96

glibc.s         0.63
loop.s          0.82
memset-avx2.s   0.52
memset_tbl.s    0.55
rep8.s          0.91
unroll.s        0.56

size 128

glibc.s         0.80
loop.s          1.02
memset-avx2.s   0.77
memset_tbl.s    0.64
rep8.s          0.92
unroll.s        0.64

size 192

glibc.s         1.06
loop.s          1.46
memset-avx2.s   0.95
memset_tbl.s    0.80
rep8.s          1.26
unroll.s        0.88

size 256

glibc.s         1.34
loop.s          1.87
memset-avx2.s   0.70
memset_tbl.s    0.93
rep8.s          1.46
unroll.s        1.10

size 384

glibc.s         1.64
loop.s          3.05
memset-avx2.s   0.85
memset_tbl.s    1.21
rep8.s          1.88
unroll.s        1.54

size 512

glibc.s         1.86
loop.s          3.59
memset-avx2.s   1.07
memset_tbl.s    1.45
rep8.s          2.28
unroll.s        2.00

size 1024

glibc.s         2.94
loop.s          5.80
memset-avx2.s   1.83
memset_tbl.s    2.54
rep8.s          2.81
unroll.s        3.78

size 2048

glibc.s         4.91
loop.s          10.44
memset-avx2.s   3.60
memset_tbl.s    4.65
rep8.s          4.91
unroll.s        7.50


i7_ivy_bridge



size 64

glibc.s         0.44
loop.s          0.99
memset-avx2.s   0.57
memset_tbl.s    0.44
rep8.s          0.96
unroll.s        0.48

size 96

glibc.s         0.64
loop.s          1.14
memset-avx2.s   0.56
memset_tbl.s    0.59
rep8.s          1.00
unroll.s        0.62

size 128

glibc.s         0.80
loop.s          1.45
memset-avx2.s   0.84
memset_tbl.s    0.66
rep8.s          1.14
unroll.s        0.69

size 192

glibc.s         1.04
loop.s          2.24
memset-avx2.s   0.97
memset_tbl.s    0.82
rep8.s          1.38
unroll.s        0.93

size 256

glibc.s         1.31
loop.s          2.83
memset-avx2.s   Command terminated by signal 4   0.00
memset_tbl.s    0.95
rep8.s          1.61
unroll.s        1.17

size 384

glibc.s         1.61
loop.s          4.13
memset-avx2.s   Command terminated by signal 4   0.00
memset_tbl.s    1.26
rep8.s          2.01
unroll.s        1.72

size 512

glibc.s         1.89
loop.s          4.99
memset-avx2.s   Command terminated by signal 4   0.00
memset_tbl.s    1.48
rep8.s          2.39
unroll.s        2.05

size 1024

glibc.s         3.01
loop.s          8.26
memset-avx2.s   Command terminated by signal 4   0.00
memset_tbl.s    3.31
rep8.s          3.43
unroll.s        3.84

size 2048

glibc.s         5.01
loop.s          14.89
memset-avx2.s   Command terminated by signal 4   0.00
memset_tbl.s    5.05
rep8.s          5.75
unroll.s        7.83


core2:



size 64

glibc.s 1

Re: Static Chain Register on iOS AArch64

2015-06-05 Thread Richard Henderson

On 06/04/2015 03:40 AM, Richard Earnshaw wrote:

The static chain register is pretty much private to a translation unit...


That was true when the static chain was restricted to trampolines.  Since Go 
has started using it for cross-translation-unit closures, that makes it part of 
the ABI.


I did raise this issue at the start of the year, when I submitted the patches. 
 At the time, folks seemed ok with the additional restriction.



r~


undefined behavior of signed left shifts (was Re: [PULL 00/40] ppc patch queue 2015-06-03)

2015-06-05 Thread Paolo Bonzini


On 05/06/2015 17:45, Peter Maydell wrote:
>>> ...but things like "(1U << 31)" are entirely valid.
>>
>> They're only valid until someone does a ~ on them.  I think it's
>> reasonable to forbid them in our coding standards, if we want to fix
>> ubsan's warning of (1 << 31).
>>
>> I don't think it's reasonable for compiler writers to exploit the
>> undefinedness of (1 << 31) anyway, and if it were possible to shut up
>> ubsan about this particular kind of undefined behavior, I would prefer it.
>
> I don't think it's reasonable for compiler writers to exploit
> undefined behaviour either, but historically they absolutely
> have done.

Most cases of undefined behavior are rooted in "you should never do that
anyway".  This is not the case for bitwise operations, since they are
not mathematical concepts and the representation of integers as bits is
only implementation-defined.
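
For concreteness, assuming a 32-bit int (just an illustration of the case
being discussed):

#include <stdint.h>

uint32_t ok  (void) { return 1U << 31; }   /* fully defined: 0x80000000 */
int      bad (void) { return 1 << 31; }    /* undefined in C99/C11: 2^31 does not
                                              fit in int; -fsanitize=shift flags it */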

> Absent a guarantee from gcc that it will never do
> so, I think we should avoid any UB in our code.

The GCC manual says "GCC does not use the latitude given in C99 and C11
only to treat certain aspects of signed '<<' as undefined, but this is
subject to change".  It would certainly be nice if they removed the
"this is subject to change" part.

Paolo


Successfully compiled and installed GCC 4.8.4

2015-06-05 Thread stevenyhw
Dear GCC developers,

I have just compiled and installed GCC 4.8.4 (see attached files). Any comments 
& suggestions are welcome. Thanks!

Yuhang Wang 
### Compilation environment
* CentOS 6.6
* x86_64 architecture (Intel(R) Xeon(R) CPU W3550  @ 3.07GHz)
* Compiler: gcc version 4.4.7 20120313 (Red Hat 4.4.7-11)
* The following programs were used for compiling gcc-4.8.4:
  1) dejagnu-1.5.3   2) gcc-4.7.4      3) libisl-0.11.1   4) libcloog-0.18.0
  5) libgmp-4.3.2    6) libmpc-0.8.1   7) libmpfr-2.4.2   8) tcl-8.6.4
  9) gnu-autoconf-2.64
cat <<'EOF' |
LAST_UPDATED: Obtained from SVN: tags/gcc_4_8_4_release revision 218947

Native configuration is x86_64-unknown-linux-gnu

=== gcc tests ===


Running target unix

=== gcc Summary ===

# of expected passes            95006
# of expected failures  267
# of unsupported tests  1607
/Scr/scr-test-steven/Programs/GCC/build_gcc-4.8.4/gcc/xgcc  version 4.8.4 (GCC) 

=== gfortran tests ===


Running target unix
FAIL: gfortran.dg/guality/pr41558.f90  -O2  line 7 s == 'foo'
FAIL: gfortran.dg/guality/pr41558.f90  -O3 -fomit-frame-pointer  line 7 s == 
'foo'
FAIL: gfortran.dg/guality/pr41558.f90  -O3 -fomit-frame-pointer -funroll-loops  
line 7 s == 'foo'
FAIL: gfortran.dg/guality/pr41558.f90  -O3 -fomit-frame-pointer 
-funroll-all-loops -finline-functions  line 7 s == 'foo'
FAIL: gfortran.dg/guality/pr41558.f90  -O3 -g  line 7 s == 'foo'
FAIL: gfortran.dg/guality/pr41558.f90  -Os  line 7 s == 'foo'

=== gfortran Summary ===

# of expected passes            44002
# of unexpected failures        6
# of expected failures  56
# of unsupported tests  73
/Scr/scr-test-steven/Programs/GCC/build_gcc-4.8.4/gcc/testsuite/gfortran/../../gfortran
  version 4.8.4 (GCC) 

=== g++ tests ===


Running target unix

=== g++ Summary ===

# of expected passes            54566
# of expected failures  292
# of unsupported tests  921
/Scr/scr-test-steven/Programs/GCC/build_gcc-4.8.4/gcc/testsuite/g++/../../xg++  
version 4.8.4 (GCC) 

=== objc tests ===


Running target unix

=== objc Summary ===

# of expected passes            2988
# of expected failures  6
# of unsupported tests  74
/Scr/scr-test-steven/Programs/GCC/build_gcc-4.8.4/gcc/xgcc  version 4.8.4 (GCC) 

=== boehm-gc tests ===


Running target unix

=== boehm-gc Summary ===

# of expected passes            12
# of unsupported tests  1
=== libatomic tests ===


Running target unix

=== libatomic Summary ===

# of expected passes            54
=== libffi tests ===


Running target unix

=== libffi Summary ===

# of expected passes            1819
# of unsupported tests  55
=== libgomp tests ===


Running target unix

=== libgomp Summary ===

# of expected passes            3090
=== libitm tests ===


Running target unix

=== libitm Summary ===

# of expected passes            26
# of expected failures  3
# of unsupported tests  1
=== libjava tests ===


Running target unix
XPASS: sourcelocation -O3 -findirect-dispatch output - source compiled test

=== libjava Summary ===

# of expected passes            2582
# of unexpected successes   1
# of expected failures  3
=== libmudflap tests ===


Running target unix
FAIL: libmudflap.c++/pass41-frag.cxx ( -O) execution test
FAIL: libmudflap.c++/pass41-frag.cxx (-O2) execution test
FAIL: libmudflap.c++/pass41-frag.cxx (-O3) execution test

=== libmudflap Summary ===

# of expected passes            1433
# of unexpected failures        3
=== libstdc++ tests ===


Running target unix

=== libstdc++ Summary ===

# of expected passes            9291
# of expected failures  45
# of unsupported tests  217

Compiler version: 4.8.4 (GCC) 
Platform: x86_64-unknown-linux-gnu
configure flags: CC=/Scr/scr-test-steven/install/gcc/4.7.4/bin/gcc 
CXX=/Scr/scr-test-steven/install/gcc/4.7.4/bin/g++ 
CPPFLAGS=-I/Scr/scr-test-steven/install/gcc/4.7.4/include 
LDFLAGS=-L/Scr/scr-test-steven/install/gcc/4.7.4/lib64 
--prefix=/Scr/scr-test-steven/install/gcc/4.8.4 --disable-multilib 
--with-gmp=/Scr/scr-test-steven/install/libgmp/4.3.2 
--with-mpfr=/Scr/scr-test-steven/install/libmpfr/2.4.2 
--with-mpc=/Scr/scr-test-steven/install/libmpc/0.8.1 
--with-cloog=/Scr/scr-test-steven/install/libcloog/0.18.0 
--with-isl=/Scr/scr-test-steven/install/libisl/0.11.1
EOF
Mail -s "Results for 4.8.4 (GCC) testsuite on x86_64-unknown-linux-gnu" nobody 
&&
true


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Andi Kleen
Ondřej Bílka  writes:
>
> On ivy bridge I found that using rep stosq for memset(x,0,4096) is 20%
> slower than a libcall for L1-cache-resident data, while 50% faster for data
> outside the cache. How do you teach the compiler that?

It would in theory be possible with autofdo. Profile with a cache-miss
event. Correlate. Maintain that information in addition to the basic
block frequencies.

Probably not simple, but definitely possible.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: Builtin expansion versus headers optimization: Reductions

2015-06-05 Thread Joseph Myers
On Fri, 5 Jun 2015, Mikhail Maltsev wrote:

> There are other issues with macros in glibc headers (well, not as
> significant as performance-related concerns, but never the less).
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=24016#c3
> 
> > #include <string.h>
> > int foo(void *x) {
> > return strcmp(x + 1, "test");
> > }
> > 
> > does not cause warnings when compiled with -Wpointer-arith -O1
> > (glibc v. 2.17). It can be reduced to:
> > 
> > int foo(void *x) {
> > return __extension__({ __builtin_strcmp(x + 1, "test"); });
> > }
> 
> In this case __extension__ inhibits -Wpointer-arith, because void*
> arithmetics is a valid GNU extension, but the user does not even know
> that he is using it.

In a previous message I suggested 
__unextension__ to allow macros to avoid suppressing such warnings inside 
the expansion of macro arguments (given that feature, lots of macro 
definitions would then need to change to use __unextension__ (ARG) instead 
of just (ARG)).

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: undefined behavior of signed left shifts (was Re: [PULL 00/40] ppc patch queue 2015-06-03)

2015-06-05 Thread Joseph Myers
On Fri, 5 Jun 2015, Paolo Bonzini wrote:

> The GCC manual says "GCC does not use the latitude given in C99 and C11
> only to treat certain aspects of signed '<<' as undefined, but this is
> subject to change".  It would certainly be nice if they removed the
> "this is subject to change" part.

The correct statement would be more complicated.  That is: the value 
returned is as documented, without that latitude being used for 
*optimization*, but (a) -fsanitize=undefined (and its subcase 
-fsanitize=shift) intends to follow exactly what the different standards 
specify when giving runtime errors and (b) the cases that are undefined 
are thereby not considered integer constant expressions (with consequent 
pedwarns-if-pedantic in various cases, and corner case effects on what's a 
null pointer constant).  (The only "subject to change" would be that if 
there are still missing cases from the runtime detection or the not 
treating as integer constant expressions, then those missing cases may be 
fixed.  I don't think it would be a good idea to add optimizations on this 
basis - for example, optimizations of x * 2 based on undefined overflow 
should not be applied to x << 1.)

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: undefined behavior of signed left shifts (was Re: [PULL 00/40] ppc patch queue 2015-06-03)

2015-06-05 Thread Peter Maydell
On 5 June 2015 at 16:55, Paolo Bonzini  wrote:
> The GCC manual says "GCC does not use the latitude given in C99 and C11
> only to treat certain aspects of signed '<<' as undefined, but this is
> subject to change".  It would certainly be nice if they removed the
> "this is subject to change" part.

Does clang provide a similar guarantee? I couldn't find one in
a quick scan through the docs, but I might be looking in the
wrong place.

thanks
-- PMM


[RFC] New memcmp expansion strategy.

2015-06-05 Thread Ondřej Bílka
Hi,

After trying to implement memcmp expansion for big endian I realized that it's
exactly what's needed.

On big endian we can just load the words into registers and compare them
directly, as the ordering is the same as memcmp's; for the little-endian case
we just have to byteswap them first.

The following expansion should be close to optimal on architectures that
have unaligned loads. It isn't worth doing without them, as emulating
unaligned loads bloats the implementation size, so it's better to do a libcall.
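
A minimal sketch of the core step (an illustrative helper, not the expansion
itself): on little endian, byteswapping two 8-byte loads makes an unsigned
word comparison agree with memcmp's lexicographic order; on big endian the
bswaps simply disappear.

#include <stdint.h>
#include <string.h>

static inline int
cmp8 (const void *x, const void *y)
{
  uint64_t a, b;
  memcpy (&a, x, 8);              /* unaligned-safe 8-byte loads */
  memcpy (&b, y, 8);
  a = __builtin_bswap64 (a);      /* not needed on big-endian targets */
  b = __builtin_bswap64 (b);
  return (a > b) - (a < b);       /* sign agrees with memcmp */
}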

I was surprised, when I checked how it expands the pattern
if (memcmp(x,y,n)) 
that it optimizes the byteswaps away; the expansion of that looks like the
best possible.

It could be improved on other platforms. First, I ignore 16-bit
operations as they tend to be slow. Instead I do an overlapping load
of maximal size. I could add a table of expansions without overlap and see
what's faster.
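
To illustrate the overlapping loads, a made-up helper for a 12-byte equality
test: the two 8-byte loads overlap by 4 bytes, so no 16-bit or byte-sized tail
is ever needed.

#include <stdint.h>
#include <string.h>

static inline int
equal12 (const void *x, const void *y)
{
  uint64_t a, b, c, d;
  memcpy (&a, x, 8);                       /* bytes 0..7  */
  memcpy (&b, y, 8);
  memcpy (&c, (const char *) x + 4, 8);    /* bytes 4..11, overlapping 4..7 */
  memcpy (&d, (const char *) y + 4, 8);
  return a == b && c == d;
}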

For a generic memcmp the assembly could be made shorter, as gcc duplicates the
tails. For the first 8 bytes I use the fact that a mismatch there is likely, so
I start with byteswapping them; the assembly of memcmp(x,y,23) is the following:

        movq    (%rdi), %rax
        movq    (%rsi), %rdx
.here:
        bswap   %rax
        bswap   %rdx
        cmpq    %rdx, %rax
        je      .L22
.L20:
        cmpq    %rax, %rdx
.L21:
        sbbl    %eax, %eax
        andl    $2, %eax
        subl    $1, %eax
        ret
.L22:
        movq    8(%rdi), %rax
        movq    8(%rsi), %rdx
        cmpq    %rdx, %rax
        je      .L15
        bswap   %rdx
        bswap   %rax
        jmp     .L20
...

You could save space by changing that into 

        movq    8(%rdi), %rax
        movq    8(%rsi), %rdx
        cmpq    %rdx, %rax
        jne     .here
        movq    15(%rdi), %rax
        movq    15(%rsi), %rdx
        cmpq    %rdx, %rax
        jne     .here
        xor     %rax, %rax
        ret

Also there is a bug in that the comparison gets duplicated:

        cmpq    %rdx, %rax
        je      .L22
.L20:
        cmpq    %rax, %rdx

Comments?


#include <stdint.h>
#include <string.h>

#undef memcmp
#define memcmp(x, y, n) (__builtin_constant_p (n) && n < 64   \
                         ? __memcmp_inline (x, y, n)          \
                         : memcmp (x, y, n))

#define LOAD8(x) (*((uint8_t *) (x)))
#define LOAD32(x) (*((uint32_t *) (x)))
#define LOAD64(x) (*((uint64_t *) (x)))

#define CHECK(tp, n)
#if __BYTE_ORDER == __LITTLE_ENDIAN
# define SWAP32(x) __builtin_bswap32 (LOAD32 (x))
# define SWAP64(x) __builtin_bswap64 (LOAD64 (x))
#else
# define SWAP32(x) LOAD32 (x)
# define SWAP64(x) LOAD64 (x)
#endif

#define __ARCH_64BIT 1

static __always_inline
int
check (uint64_t x, uint64_t y)
{
  if (x == y)
return 0;
  if (x > y)
return 1;

  return -1;
}

static __always_inline
int
check_nonzero (uint64_t x, uint64_t y)
{
  if (x > y)
return 1;

  return -1;
}


static __always_inline
int
__memcmp_inline (void *x, void *y, size_t n)
{
#define CHECK1 if (LOAD8 (x + i) - LOAD8 (y + i)) \
return check_nonzero (LOAD8 (x + i), LOAD8 (y + i)); i = i + 1;
#define CHECK4 if (i == 0 ? SWAP32 (x + i) - SWAP32 (y + i)\
  : LOAD32 (x + i) - LOAD32 (y + i)) \
return check_nonzero (SWAP32 (x + i), SWAP32 (y + i)); i = i + 4;
#define CHECK8 if (i == 0 ? SWAP64 (x + i) - SWAP64 (y + i)\
  : LOAD64 (x + i) - LOAD64 (y + i)) \
return check_nonzero (SWAP64 (x + i), SWAP64 (y + i)); i = i + 8;

#define CHECK1FINAL(o) return check (LOAD8 (x + i + o), LOAD8 (y + i + o));
#define CHECK4FINAL(o) return check (SWAP32 (x + i + o), SWAP32 (y + i + o));
#define CHECK8FINAL(o) return check (SWAP64 (x + i + o), SWAP64 (y + i + o));

#if __ARCH_64BIT == 0
# undef CHECK8
# undef CHECK8FINAL
# define CHECK8 CHECK4 CHECK4
# define CHECK8FINAL(o) CHECK4 CHECK4FINAL (o)
#endif

#define LOOP if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } \
if (i + 8 < n) { CHECK8 } 


  long i = 0;

  switch (n % 8)
{
case 0:
  if (n == 0)
return 0;

  LOOP; CHECK8FINAL (0);
case 1:
  LOOP CHECK1FINAL (0);
case 2:
  if (n == 2)
{
  CHECK1 CHECK1FINAL (0);
}
  LOOP CHECK4FINAL (-2);
case 3:
  if (n == 3)
{
  CHECK1 CHECK1 CHECK1FINAL (0);
}
  LOOP CHECK4FINAL (-1);
case 4:
  LOOP CHECK4FINAL (0);
case 5:
  if (n == 5)
{
  CHECK4 CHECK1FINAL (0);
}
#if __ARCH_64BIT
  LOOP CHECK8FINAL (-3);
#else
  LOOP CHECK4 CHECK1FINAL (0);
#endif
case 6:
  if (n == 6)
{
  CHECK4 CHECK4FINAL (-2);
}
  LOOP CHECK8FINAL (-2);
case 7:
  if (n == 7)
{
  CHECK4 CHECK4FINAL (-1);
}
  LOOP CHECK8FINAL (-1);
}
}

int
memcmp1 (char *x, char *y)
{
  return memcmp (x, y, 1);
}
int
memcmp10 (char *x, char *y)
{
  return memcmp (x, y, 10);
}
int
memcmp20 (char *x, char *y)
{
  return memcmp (x, y, 20);
}
int
memcmp30 (char *x, ch

debug-early branch merged into mainline

2015-06-05 Thread Aldy Hernandez

The debug-early work has been merged into mainline.

There is a known Ada failure which Eric B. knows about and approved, and 
for which there is an appropriate FIXME note in the Ada sources:


+FAIL: gnat.dg/specs/debug1.ads scan-assembler-times DW_AT_artificial 17

There is also a known regression in the testsuite that we've discussed 
before and will be fixed shortly.  It is an optimization issue:


+FAIL: gcc.dg/debug/dwarf2/stacked-qualified-types-3.c 
scan-assembler-times DIE \\([^\n]*\\) 
DW_TAG_(?:const|volatile|atomic|restrict)_type 8


Finally, as previously discussed there can be substantial increases in 
the size of the .debug_info sections for a minimum of cases.  This is 
immediately on my plate as of right now.  It is expected.  Please don't 
report this, or any of the above 2 failures.


Thanks to everyone involved in the design and review, particularly Jason 
and Richi who were there at each step of the way, and Michael Matz whose 
original patch this work is based off of.


Aldy