[Bug middle-end/54299] New: Array parameter does not allow for iterator syntax

2012-08-17 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54299

 Bug #: 54299
   Summary: Array parameter does not allow for iterator syntax
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


Compile the following code:

~~
int aa[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

int f(int arr[10])
{
  int s = 0;
  for (auto i : arr)
    s += i;
  return s;
}

int main()
{
  return f(aa);
}
~~

This fails with

u.cc: In function ‘int f(int*)’:
u.cc:18:17: error: ‘begin’ was not declared in this scope
u.cc:18:17: error: ‘end’ was not declared in this scope
u.cc:18:17: error: unable to deduce ‘auto’ from ‘’



This indicates that the parameter is seen as 'int *' instead of as
'int [10]'.  According to Andrew this is another problem caused by the
too-early decay of array arguments to pointers (bug 24666).

Changing the code as follows makes it compile:

~~
int aa[1][10] = { { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 } };

int f(int arr[1][10])
{
  int s = 0;
  for (auto i : arr[0])
    s += i;
  return s;
}

int main()
{
  return f(aa);
}
~~
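
For comparison, a minimal sketch (not part of the report) of the usual way to
keep the array type intact: take the parameter by reference.  A parameter
declared as 'int arr[10]' is adjusted to 'int *', whereas 'int (&arr)[10]'
remains a reference to an array, so the range-based for loop can determine the
bounds.  The name sum10 is made up for illustration.

```cpp
int aa[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

// Reference to an array of exactly 10 ints: the parameter type is not
// adjusted to a pointer, so the element count is still known.
int sum10(int (&arr)[10])
{
  int s = 0;
  for (auto i : arr)
    s += i;
  return s;
}
```

Calling sum10(aa) iterates over all ten elements.  The drawback is that only
arrays of exactly this size are accepted.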


[Bug target/54087] __atomic_fetch_add does not use xadd instruction

2012-08-23 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #8 from Ulrich Drepper  2012-08-23 
15:41:49 UTC ---
(In reply to comment #7)
> Check to see if it solves the problem as well.

I tested it.  It seems to work in all cases and does not disturb other
optimizations like comparisons with zero.


[Bug c++/54376] New: incorrect complaint about redefinition

2012-08-25 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54376

 Bug #: 54376
   Summary: incorrect complaint about redefinition
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


At least I think this is a compiler problem.  I cannot see anything wrong in
the libstdc++ code or in my test code.

Compiling the following code with 4.7.0 or even the current trunk version of
the compiler produces an error like the one below.  There is no double
inclusion problem, and yet the line reported for the redefinition is exactly
the same as the line of the definition.

If you take out one of the two variable definitions and its uses, the error
disappears, which indicates that the compiler doesn't distinguish the
instantiations correctly.

The other pairs of instantiations also produce the same type of mistake.


In file included from
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/random:50:0,
 from r3.cc:3:
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h:
In instantiation of ‘class std::lognormal_distribution’:
r3.cc:31:39:   required from here
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h:2279:9:
error: redefinition of ‘template bool std::operator==(const
std::lognormal_distribution<_RealType>&, const
std::lognormal_distribution<_RealType>&)’
/usr/lib/gcc/x86_64-redhat-linux/4.7.0/../../../../include/c++/4.7.0/bits/random.h:2279:9:
error: ‘template bool std::operator==(const
std::lognormal_distribution<_RealType>&, const
std::lognormal_distribution<_RealType>&)’ previously defined here



Source:

#include 
#include 
#include 


template<typename E, typename D>
void measure(const char *name, size_t n, E &e, D &d)
{
  typename D::result_type arr[n];
  typename D::result_type s = 0;
  for (int tries = 0; tries < 100; ++tries)
    {
      e.seed(1234);

      for (size_t i = 0; i < n; ++i)
        arr[i] = d(e);

      for (size_t i = 0; i < n; ++i)
        s += arr[i];
    }

  std::cout << name << " " << n << " = " << " " << s << std::endl;
}


int
main(void)
{
  std::mt19937 e2;
  std::lognormal_distribution d7;
  std::lognormal_distribution d8;
  //std::gamma_distribution d9;
  //std::gamma_distribution d10;
  //std::chi_squared_distribution d11;
  //std::chi_squared_distribution d12;
  //std::fisher_f_distribution d15;
  //std::fisher_f_distribution d16;
  //std::student_t_distribution d17;
  //std::student_t_distribution d18;
  //std::binomial_distribution d20;
  //std::binomial_distribution d21;
  //std::negative_binomial_distribution d24;
  //std::negative_binomial_distribution d25;
  //std::poisson_distribution d26;
  //std::poisson_distribution d27;


  for (size_t n = 10; n < 10; n *= 1.1)
    {
      measure("lognormal:32", n, e2, d7);
      measure("lognormal:64", n, e2, d8);
      //measure("gamma:32", n, e2, d9);
      //measure("gamma:64", n, e2, d10);
      //measure("chi_squared:32", n, e2, d11);
      //measure("chi_squared:64", n, e2, d12);
      //measure("fisher_f:32", n, e2, d15);
      //measure("fisher_f:64", n, e2, d16);
      //measure("student_t:32", n, e2, d17);
      //measure("student_t:64", n, e2, d18);
      //measure("binomial:32", n, e2, d20);
      //measure("binomial:64", n, e2, d21);
      //measure("negative_binomial:32", n, e2, d24);
      //measure("negative_binomial:64", n, e2, d25);
      //measure("poisson:32", n, e2, d26);
      //measure("poisson:64", n, e2, d27);
    }
}


[Bug c++/54376] incorrect complaint about redefinition

2012-08-25 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54376

--- Comment #10 from Ulrich Drepper  2012-08-25 
22:54:02 UTC ---
Created attachment 28085
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28085
Avoid nested inlined friend functions

This patch fixes the issue for me.  It also cleans the code.  There is
currently a lot of inconsistency as to where the operator== functions are
defined, all depending on whether they are friends or not.

With this patch all operator== are defined after the class and friend
declarations are used.
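
A minimal sketch of the pattern described above, assuming the patch follows
it: the class only declares the friend, and the single definition comes after
the class.  The class dist and the helper functions are hypothetical stand-ins
and not the libstdc++ code.  A friend template that is defined inside the
class body gets one definition for each instantiation of the class, which
appears to be what the compiler complains about.

```cpp
// Hypothetical stand-in for a distribution class; not the libstdc++ code.
template<typename T>
class dist
{
public:
  explicit dist(T m) : m_(m) { }

  // Declaration only.  Every instantiation of dist redeclares this
  // template, which is harmless.
  template<typename U>
  friend bool operator==(const dist<U>&, const dist<U>&);

private:
  T m_;
};

// Single definition, placed after the class.
template<typename U>
bool operator==(const dist<U>& a, const dist<U>& b)
{
  return a.m_ == b.m_;
}

// Two different instantiations used in the same translation unit.
bool same_float(float x, float y) { return dist<float>(x) == dist<float>(y); }
bool same_double(double x, double y) { return dist<double>(x) == dist<double>(y); }
```

Both instantiations can be used side by side because the operator template is
defined exactly once.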


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-08-30 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #2 from Ulrich Drepper  2012-08-30 
20:19:35 UTC ---
The instruction is generated by the compiler.  If you try to compile a new
compiler you have to make sure the tools used are recent enough to understand
the output of the compiler.


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-08-31 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #9 from Ulrich Drepper  2012-08-31 
17:46:41 UTC ---
(In reply to comment #8)
> Is it clear which are the specific requirements for the various x86* targets?
> I'm wondering if after all it's just matter of updating:
> http://gcc.gnu.org/install/specific.html

Indeed.  You cannot use old binutils if any of the code generated by the
compiler requires something newer.  If these dependencies are not wanted then
make the compiler emit .byte sequences when the new builtins are used.


> Since rdrand is only supported on Ivy Bridge processors, shouldn't
> src/c++11/random.cc have a fall through using rdtsc in case the processor
> doesn't support rdrand?

Read the code.  There is of course a fall-through for older processors.  This
is about the code generated by the compiler, not what is used at runtime.


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-09-02 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #15 from Ulrich Drepper  2012-09-02 
20:04:57 UTC ---
(In reply to comment #14)
> libstdc++ should check if rdrand is supported by assembler
> before using __builtin_ia32_rdrand32_step.

Every gcc feature should have a test.  When you added the built-in this should
have happened.  The unavailability of a recent-enough compiler should therefore
have been a problem for a long time.  It's just wrong to expect a compiler to
work with binutils versions which cannot handle all the output the compiler
produces.


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-09-03 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #20 from Ulrich Drepper  2012-09-04 
01:06:33 UTC ---
Created attachment 28127
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28127
Check for rdrand availability

How about this patch?  Not sure whether this handles cross-compiling.  It seems
to work for me.

I still think it's wrong to bother with obsolete assemblers...


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-09-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #36 from Ulrich Drepper  2012-09-05 
13:25:21 UTC ---
(In reply to comment #35)
> What will happen if the assembly accept rdrand, but not the CPU?

The code at runtime checks for the feature bit.  There will be no problem. 
This is *exclusively* a problem with obsolete assemblers.


[Bug bootstrap/54419] [4.8 Regression] Compiling libstdc++-v3/src/c++11/random.cc fails on platforms not knowing rdrand

2012-09-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419

--- Comment #37 from Ulrich Drepper  2012-09-05 
13:57:27 UTC ---
(In reply to comment #23)
> (though,
> apparently insufficient for i?86 - it should use either __get_cpuid, or
> __get_cpuid_max before __cpuid).

I fixed that.  The code now should work in theory also on those systems. 
Although the sheer size of all the code together will prevent these systems
from being used...
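
For illustration, a sketch of a runtime feature test of the kind discussed in
this thread.  It is not the code from the attachment.  It assumes GCC's
<cpuid.h> and uses __get_cpuid, which checks the maximum supported leaf before
issuing cpuid.  The function name cpu_has_rdrand is made up, and the bit
position (leaf 1, ECX bit 30) is written out instead of relying on a macro.

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns true if the CPU reports RDRAND support, false otherwise
// (including on non-x86 targets, where the check is compiled out).
bool cpu_has_rdrand()
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  // __get_cpuid returns 0 if the requested leaf is not supported.
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;
  // CPUID leaf 1, ECX bit 30 is the RDRAND feature bit.
  return (ecx & (1u << 30)) != 0;
#else
  return false;
#endif
}
```

A check like this only decides what is executed at runtime.  It does not help
with an assembler that cannot encode the instruction in the first place.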


[Bug c++/54825] New: ICE with vector extension

2012-10-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

 Bug #: 54825
   Summary: ICE with vector extension
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
  Host: x86_64-linux


While trying to convert some of libstdc++ to use gcc's vector extensions I ran
into this ICE.  The code /should/ be valid.


[Bug c++/54825] ICE with vector extension

2012-10-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #1 from Ulrich Drepper  2012-10-05 
13:58:21 UTC ---
In case the version number isn't making this clear, I tested this with the
current mainline code.  4.7 probably won't work at all since some of the
features used have been added to the C++ frontend after 4.7.


[Bug c++/54825] ICE with vector extension

2012-10-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #2 from Ulrich Drepper  2012-10-05 
13:59:26 UTC ---
Created attachment 28363
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28363
Reproducer

Why didn't BZ add the file?...


[Bug tree-optimization/54825] ICE with vector extension

2012-10-05 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54825

--- Comment #11 from Ulrich Drepper  2012-10-05 
15:12:18 UTC ---
(In reply to comment #7)
> Created attachment 28364 [details]
> patch
> 
> patch I am testing.

This seems to fix the problem for me, even with the original code and not the
reduced test case.


[Bug tree-optimization/54855] New: Unnecessary duplication when performing scalar operation on vector element

2012-10-08 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855

 Bug #: 54855
   Summary: Unnecessary duplication when performing scalar
operation on vector element
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


Take the following code:

#include <stdio.h>

typedef double v2df __attribute__((vector_size(16)));

int
main(int argc, char *argv[])
{
  v2df v = { 2.0, 2.0 };
  v2df v2 = { 2.0, 2.0 };
  while (argc-- > 1)
    {
      v[0] -= 1.0;
      v *= v2;
    }
  printf("%g\n", v[0] + v[1]);
  return 0;
}

It compiles as C and C++, both compilers behave the same.

When compiling on x86-64 (therefore with SSE enabled) it generates for the loop
this code:

  4003f0:   66 0f 28 c1     movapd %xmm1,%xmm0
  4003f4:   83 e8 01        sub    $0x1,%eax
  4003f7:   f2 0f 5c c2     subsd  %xmm2,%xmm0
  4003fb:   f2 0f 10 c8     movsd  %xmm0,%xmm1
  4003ff:   66 0f 58 c9     addpd  %xmm1,%xmm1
  400403:   75 eb           jne    4003f0

I.e., the value is pulled out of the vector, the subtraction is performed, and
then the scalar value is put back into the vector.

Instead the following sequence would have been completely sufficient:

    sub    $0x1,%eax
    subsd  %xmm2,%xmm1
    addpd  %xmm1,%xmm1
    jne    ...back

The subsd instruction doesn't touch the high parts of the register.


I know this is a special case, it only works if the scalar operation is for the
element zero of the vector.  But code can be designed like that.  I have some
code which would work nicely like this.  I don't know whether this translates
to other architectures as well.


[Bug libstdc++/54869] ext/random/simd_fast_mersenne_twister_engine/cons/default.cc FAILs

2012-10-09 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54869

--- Comment #4 from Ulrich Drepper  2012-10-09 
11:23:41 UTC ---
(In reply to comment #0)
> The new ext/random/simd_fast_mersenne_twister_engine/cons/default.cc testcase
> FAILs on Solaris/SPARC (both 32 and 64-bit):

That's expected.  I mentioned when I posted the patches that the implementation
is for little endian machines.  I don't have access to any big endian machines
and therefore didn't even try to make it work.

It might be sufficient, at end of _M_gen_rand, to swap the order of the four
32-bit words in a 128-bit word.  I never tested this, someone else will have to
do this.


[Bug c/47043] New: allow deprecating enum values

2010-12-22 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47043

   Summary: allow deprecating enum values
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


The deprecated attribute is nice and its use should be expanded.  Sometimes
enum values have to be deprecated and it would be useful if one could write
this:

enum { newval, oldval __attribute__ ((deprecated)) };

Any use of 'oldval' should provoke the usual warning.
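
A sketch of how the requested feature would be used.  Later GCC releases
(GCC 6 and newer) accept attributes on enumerators, so the following compiles
there.  The enum tag and the function names are made up for illustration.

```cpp
enum sample { newval, oldval __attribute__ ((deprecated)) };

// Uses only the current enumerator: no diagnostic is expected.
int current_value()
{
  return newval;
}

// Uncommenting the next function is expected to produce a
// -Wdeprecated-declarations warning for 'oldval'.
// int legacy_value() { return oldval; }
```

The deprecated enumerator keeps its value, so existing binaries are not
affected.  Only new uses in source code are flagged.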


[Bug c++/50734] New: const and pure attributes don't have the effect as in C

2011-10-14 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50734

 Bug #: 50734
   Summary: const and pure attributes don't have the effect as in
C
Classification: Unclassified
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


When the const and pure function attributes are used the compiler doesn't
generate the same code for C++ as for C.  Take this code:

extern int *f() __attribute__((pure));
int g(int *a) {
  int s = 0;
  for (int n = 0; n < 100; ++n)
    s += f()[a[n]];
  return s;
}

When compiled as C code the call to 'f' is hoisted out of the loop.  Similarly
when const instead of pure is used.  One can also define 'f' as extern "C"
without changing the result:

 <_Z1gPi>:
   0:   41 54                   push   %r12
   2:   49 89 fc                mov    %rdi,%r12
   5:   55                      push   %rbp
   6:   31 ed                   xor    %ebp,%ebp
   8:   53                      push   %rbx
   9:   31 db                   xor    %ebx,%ebx
   b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  10:   e8 00 00 00 00          callq  15 <_Z1gPi+0x15>
                        11: R_X86_64_PC32       f-0x4
  15:   49 63 14 1c             movslq (%r12,%rbx,1),%rdx
  19:   48 83 c3 04             add    $0x4,%rbx
  1d:   03 2c 90                add    (%rax,%rdx,4),%ebp
  20:   48 81 fb 90 01 00 00    cmp    $0x190,%rbx
  27:   75 e7                   jne    10 <_Z1gPi+0x10>
  29:   5b                      pop    %rbx
  2a:   89 e8                   mov    %ebp,%eax
  2c:   5d                      pop    %rbp
  2d:   41 5c                   pop    %r12
  2f:   c3                      retq

Versus the C code:

 :
   0:   53                      push   %rbx
   1:   31 c0                   xor    %eax,%eax
   3:   48 89 fb                mov    %rdi,%rbx
   6:   e8 00 00 00 00          callq  b
                        7: R_X86_64_PC32        f-0x4
   b:   31 d2                   xor    %edx,%edx
   d:   31 c9                   xor    %ecx,%ecx
   f:   90                      nop
  10:   48 63 34 13             movslq (%rbx,%rdx,1),%rsi
  14:   48 83 c2 04             add    $0x4,%rdx
  18:   03 0c b0                add    (%rax,%rsi,4),%ecx
  1b:   48 81 fa 90 01 00 00    cmp    $0x190,%rdx
  22:   75 ec                   jne    10
  24:   89 c8                   mov    %ecx,%eax
  26:   5b                      pop    %rbx
  27:   c3                      retq



When the very same code is compiled with the C++ compiler the call stays in the
loop.

Should there be a reason for this (which I cannot see, these are extensions and
gcc is not limited by a standard) the compiler should issue a warning and there
should be a way to get the behavior we get with the C compiler.


I checked this with the current Fedora x86-64 compiler

gcc version 4.6.1 20110908 (Red Hat 4.6.1-9) (GCC) 

This is most probably architecture-independent.
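
Until the C++ front end does this by itself, the hoisting can be written out
by hand.  The following is only a sketch of what the optimization amounts to:
the table, the definition of f, the function g_hoisted and its count parameter
are assumptions added so that the example is complete.

```cpp
int *f() __attribute__((pure));

// Hypothetical definition; the report only declares f.
static int table[4] = { 10, 20, 30, 40 };
int *f() { return table; }

// Same computation as g in the report, with the call moved out of the
// loop manually.  This is only valid because f is pure.
int g_hoisted(const int *a, int count)
{
  int *base = f();      // single call instead of one per iteration
  int s = 0;
  for (int n = 0; n < count; ++n)
    s += base[a[n]];
  return s;
}
```

With the attribute honored the compiler is free to perform this transformation
itself, which is what the C output above shows.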


[Bug middle-end/50963] New: TLS incompatible with -mcmodel=large & PIC

2011-11-02 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50963

 Bug #: 50963
   Summary: TLS incompatible with -mcmodel=large & PIC
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


Due to some build problems with the default models a program was compiled with
-mcmodel=large.  But that seems to be incompatible with TLS in PIC.  This tiny
code sequence blows up gcc as recently as 4.6.2 (from Fedora rawhide):

__thread int a;

int f(int b)
{
  return a;
}


The ICE message when compiled with 'g++ -c -mcmodel=large t.c -fpic' is:

t.c: In function ‘int f(int)’:
t.c:6:1: error: unrecognizable insn:
(call_insn/u 6 5 7 3 (parallel [
(set (reg:DI 0 ax)
(call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")) [0 S1 A8])
(const_int 0 [0])))
(unspec:DI [
(symbol_ref:DI ("a") [flags 0x10] )
] UNSPEC_TLS_GD)
]) t.c:5 -1
 (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
(nil))
(nil))
t.c:6:1: internal compiler error: in extract_insn, at recog.c:2109


[Bug tree-optimization/50984] New: Boolean return value expression clears register too often

2011-11-03 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50984

 Bug #: 50984
   Summary: Boolean return value expression clears register too
often
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-linux


Compile this code with the current HEAD gcc (or 4.5, I tried that as well) and
you see less than optimal code:

int
f(int a, int b)
{
  return a & 8 && b & 4;
}

For x86-64 I see this asm code:
    xorl    %eax, %eax
    andl    $8, %edi
    je      .L2
    xorl    %eax, %eax      <- Unnecessary !!!
    andl    $4, %esi
    setne   %al
.L2:
    rep
    ret

The compiler should realize that the second xor is unnecessary.


[Bug tree-optimization/53243] New: Use vector comparisons for if cascades

2012-05-04 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53243

 Bug #: 53243
   Summary: Use vector comparisons for if cascades
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-linux


Created attachment 27312
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27312
Test program (compile with and without -DOLD)

The vector units can perform multiple comparisons concurrently but this is not
used automatically in gcc in situations where it can lead to better
performance.  Assume a function like this:

void
f(float a)
{
  if (a < 1.0)
    cb(1);
  else if (a < 2.0)
    cb(2);
  else if (a < 3.0)
    cb(3);
  else if (a < 4.0)
    cb(4);
  else if (a < 5.0)
    cb(5);
  else if (a < 6.0)
    cb(6);
  else if (a < 7.0)
    cb(7);
  else if (a < 8.0)
    cb(8);
  else
    ++o;
}

In this case the first or second if is not marked with __builtin_expect as
likely, otherwise the following *might* not apply.

The routine can be rewritten for AVX machines like this:

void
f(float a)
{
  const __m256 fv = _mm256_set_ps(8.0,7.0,6.0,5.0,4.0,3.0,2.0,1.0);
  __m256 r = _mm256_cmp_ps(fv, _mm256_set1_ps(a), _CMP_LT_OS);
  int i = _mm256_movemask_ps(r);
  asm goto ("bsr %0, %0; jz %l[less1]; .pushsection .rodata; "
            "1: .quad %l2, %l3, %l4, %l5, %l6, %l7, %l8, %l9; .popsection; "
            "jmp *1b(,%0,8)"
            : : "r" (i) : :
            less1, less2, less3, less4, less5, less6, less7, less8, gt8);
  __builtin_unreachable ();
 less1:
  cb(1);
  return;
 less2:
  cb(2);
  return;
 less3:
  cb(3);
  return;
 less4:
  cb(4);
  return;
 less5:
  cb(5);
  return;
 less6:
  cb(6);
  return;
 less7:
  cb(7);
  return;
 less8:
  cb(8);
  return;
 gt8:
  ++o;
}

This might not generate the absolute best code but for the attached test
program it runs 20% faster.

The same technique can be applied to integer comparisons.  More complex if
cascades can also be simplified a lot by masking the integer bsr result
accordingly.  This should still be faster.
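
The idea can also be sketched without intrinsics or asm goto.  The following
is an illustration and not the code from the attachment: it builds the same
kind of bit mask with a plain loop (with AVX that loop corresponds to one
compare plus one movemask) and turns the mask into the number of the branch to
take.  The function name classify_mask and the return convention are made up.
The sketch records the tests which fail, so values exactly on a limit and NaN
end up in the same branch as in the original cascade.

```cpp
// Returns 1..8 for the branch the cascade would take and 9 for the
// final else branch.
int classify_mask(float a)
{
  static const float limits[8] =
    { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f };
  unsigned mask = 0;
  // Bit k is set when the k-th test "a < limits[k]" fails.
  for (int k = 0; k < 8; ++k)
    mask |= static_cast<unsigned>(!(a < limits[k])) << k;
  // The limits are sorted, so the failed tests form a run of low bits
  // and their count is the index of the first test that succeeds.
  return __builtin_popcount(mask) + 1;
}
```

The result can then be used as the index of a switch or jump table, which
replaces the chain of conditional jumps with a single indirect one.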


[Bug target/54087] New: __atomic_fetch_add does not use xadd instruction

2012-07-24 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

 Bug #: 54087
   Summary: __atomic_fetch_add does not use xadd instruction
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
Target: x86_64-redhat-linux


Compiling this code

int a;

int f1(int p)
{
  return __atomic_sub_fetch(&a, p, __ATOMIC_SEQ_CST) == 0;
}

int f2(int p)
{
  return __atomic_fetch_sub(&a, p, __ATOMIC_SEQ_CST) - p == 0;
}

you'll see that neither function uses the xadd instruction with the lock
prefix.  Instead an expensive emulation using cmpxchg is used:

 :
   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6
                        2: R_X86_64_PC32        a-0x4
   6:   89 c2                   mov    %eax,%edx
   8:   29 fa                   sub    %edi,%edx
   a:   f0 0f b1 15 00 00 00    lock cmpxchg %edx,0x0(%rip)        # 12
  11:   00
                        e: R_X86_64_PC32        a-0x4
  12:   75 f2                   jne    6
  14:   31 c0                   xor    %eax,%eax
  16:   85 d2                   test   %edx,%edx
  18:   0f 94 c0                sete   %al
  1b:   c3                      retq

This implementation not only is larger, it has possibly (unlikely) unbounded
cost and even if the cmpxchg succeeds right away it is costlier.  The last
point is especially true if the cache line for the variable in question is not
in the core's cache.  In this case the initial load causes an I->S transition
for the cache line and the cmpxchg an additional and possibly also very
expensive S->E transition.  Using xadd would cause an I->E transition.

The config/i386/sync.md file in the current tree contains a pattern for
atomic_fetch_add which does use xadd but it seems not to be used, even if
instead of the function parameter an immediate value is used.

;; For operand 2 nonmemory_operand predicate is used instead of
;; register_operand to allow combiner to better optimize atomic
;; additions of constants.
(define_insn "atomic_fetch_add"
  [(set (match_operand:SWI 0 "register_operand" "=")
(unspec_volatile:SWI
  [(match_operand:SWI 1 "memory_operand" "+m")
   (match_operand:SI 3 "const_int_operand")];; model
  UNSPECV_XCHG))
   (set (match_dup 1)
(plus:SWI (match_dup 1)
  (match_operand:SWI 2 "nonmemory_operand" "0")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_XADD"
  "lock{%;} %K3xadd{}\t{%0, %1|%1, %0}")


X86_ARCH_XADD should be defined for every architecture but i386.


[Bug target/54087] __atomic_fetch_add does not use xadd instruction

2012-08-01 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #3 from Ulrich Drepper  2012-08-01 
16:06:33 UTC ---
(In reply to comment #2)
> (In reply to comment #1)
> > Use __atomic_add_fetch and __atomic_fetch_sub in the testcase, and you will
> 
> Eh, __atomic_fetch_add.

Yes, but the compiler should automatically do this.  The extreme case is this:


int v;

int a(void)
{
  return __sync_sub_and_fetch(&v, 5);
}

int b(void)
{
  return __sync_add_and_fetch(&v, -5);
}


The second function does compile as expected.  The first doesn't, it uses
cmpxchg.

Shouldn't this be easy enough to fix by adding patterns for atomic_fetch_sub
and atomic_sub_fetch which match if the second parameter is a constant?  If
it's not a constant a bit more code is needed but that should be no problem
either.


[Bug target/54087] __atomic_fetch_add does not use xadd instruction

2012-08-02 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #4 from Ulrich Drepper  2012-08-02 
14:33:19 UTC ---
One more data point.  In a micro-benchmark which uses realistic code used in
production the change from

   __sync_sub_and_fetch(var, constant)

to

   __sync_add_and_fetch(var, -constant)

led to a 10% to 27% improvement in performance.  The cmpxchg use with the
necessary initial load and I->S cache transition really kills performance when
memory is highly contested.
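
The same rewrite also works when the operand is not a constant.  A minimal
sketch, with a made-up counter variable and function name, which uses unsigned
arithmetic so that negating the operand is well defined for every value:

```cpp
unsigned int counter = 100;

// Subtracts p from counter atomically and returns the new value.  The
// subtraction is expressed as an addition of the negated operand, which
// is the form that was observed to use xadd.
unsigned int sub_and_fetch_via_add(unsigned int p)
{
  return __atomic_add_fetch(&counter, 0u - p, __ATOMIC_SEQ_CST);
}
```

Whether a given compiler version really emits lock xadd for this has to be
checked in the generated code.  The sketch only shows that the two forms
compute the same value.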


[Bug target/54087] __atomic_fetch_add does not use xadd instruction

2012-08-02 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54087

--- Comment #6 from Ulrich Drepper  2012-08-03 
02:16:57 UTC ---
(In reply to comment #5)
> This patch introduces atomic_fetch_sub:

Seems to work nicely.


[Bug middle-end/54167] New: excessive alignment

2012-08-03 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54167

 Bug #: 54167
   Summary: excessive alignment
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


Compile the following code:

struct c
{
  int a, b;
  /*constexpr*/ c() : a(1), b(2) { }
};

c v;


The variable v will be defined with:

        .bss
        .align 16
        .type   v, @object
        .size   v, 8
v:
        .zero   8

The variable has alignment 16!

If you uncomment the constexpr and compile with -std=gnu++11 it can be seen
that the compiler does know what the correct alignment is:

        .globl  v
        .data
        .align 4
        .type   v, @object
        .size   v, 8
v:
        .long   1
        .long   2


This happens with the current svn version as well as with 4.7.0.


[Bug tree-optimization/52070] New: missing integer comparison optimization

2012-01-31 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52070

 Bug #: 52070
   Summary: missing integer comparison optimization
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


Compile this code with gcc 4.6.2:

#include 
size_t b;
int f(size_t a)
{
  return b == 0 || a < b;
}

For x86-64 I see this result:

f:      movq    b(%rip), %rdx
        movl    $1, %eax
        testq   %rdx, %rdx
        je      .L2
        xorl    %eax, %eax
        cmpq    %rdi, %rdx
        seta    %al
.L2:    rep ret

This can be done without a conditional jump:

f:      movq    b(%rip), %rdx
        xorl    %eax, %eax
        subq    $1, %rdx
        cmpq    %rdi, %rdx
        setae   %al
        rep ret

Unless the b==0 test is marked as likely I'd say this code is performing better
on all architectures.
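
The branch-free sequence corresponds to the following source-level rewrite.
This is a sketch with made-up function names.  It relies on unsigned
wrap-around: if b is 0 then b - 1 is the largest size_t value and the
comparison is always true, otherwise a <= b - 1 is the same as a < b.

```cpp
#include <cstddef>

std::size_t b;

// Form used in the report.
int f_branchy(std::size_t a)
{
  return b == 0 || a < b;
}

// Equivalent form with a single unsigned comparison against b - 1.
int f_branchless(std::size_t a)
{
  return a <= b - 1;
}
```

Both functions return the same value for every combination of a and b, so the
compiler could perform the transformation itself.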


[Bug middle-end/59521] New: __builtin_expect not effective in switch

2013-12-15 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521

Bug ID: 59521
   Summary: __builtin_expect not effective in switch
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: drepper.fsp at gmail dot com

When used in switch, __builtin_expect should reorder the comparisons
appropriately.  Take this code:

#include <stdio.h>
void
f(int ch) {
  switch (__builtin_expect(ch, 333)) {
    case 3: puts("a"); break;
    case 42: puts("e"); break;
    case 333: puts("i"); break;
  }
}

Current mainline (and also prior versions, I tested 4.8.2) produces code like
this with -O3:

 :
   0:   83 ff 2a                cmp    $0x2a,%edi
   3:   74 33                   je     38
   5:   81 ff 4d 01 00 00       cmp    $0x14d,%edi
   b:   74 1b                   je     28
   d:   83 ff 03                cmp    $0x3,%edi
  10:   74 06                   je     18
  12:   c3                      retq
  13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  18:   bf 00 00 00 00          mov    $0x0,%edi
  1d:   e9 00 00 00 00          jmpq   22
  22:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  28:   bf 00 00 00 00          mov    $0x0,%edi
  2d:   e9 00 00 00 00          jmpq   32
  32:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  38:   bf 00 00 00 00          mov    $0x0,%edi
  3d:   e9 00 00 00 00          jmpq   42

Instead the test for 333/$0x14d should have been moved to the front.
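
As long as the compiler does not reorder the tests, the expected value can be
tested by hand before the switch.  This is a possible workaround and not
something proposed in the report.  The function name pick and the convention
of returning the letter instead of printing it are assumptions made to keep
the sketch small.

```cpp
// Returns the letter the switch in the report would print, or 0 if no
// case matches.
char pick(int ch)
{
  // Test the expected value first and mark the branch as likely.
  if (__builtin_expect(ch == 333, 1))
    return 'i';
  switch (ch) {
    case 3:  return 'a';
    case 42: return 'e';
  }
  return 0;
}
```

The drawback is that the expectation is now encoded in the control flow
instead of being a hint the compiler can ignore.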


[Bug tree-optimization/51492] New: vectorizer generates unnecessary code

2011-12-09 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

 Bug #: 51492
   Summary: vectorizer generates unnecessary code
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com
 Build: x86_64-linux


Compile this code with 4.6.2 on a x86-64 machine with -O3:

#define SIZE 65536
#define WSIZE 64
unsigned short head[SIZE] __attribute__((aligned(64)));

void
f(void)
{
  for (unsigned n = 0; n < SIZE; ++n) {
    unsigned short m = head[n];
    head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
  }
}

The result I see is this:

 :
   0:   66 0f ef d2             pxor   %xmm2,%xmm2
   4:   b8 00 00 00 00          mov    $0x0,%eax
                        5: R_X86_64_32  head
   9:   66 0f 6f 25 00 00 00    movdqa 0x0(%rip),%xmm4        # 11
  10:   00
                        d: R_X86_64_PC32        .LC0-0x4
  11:   66 0f 6f 1d 00 00 00    movdqa 0x0(%rip),%xmm3        # 19
  18:   00
                        15: R_X86_64_PC32       .LC1-0x4
  19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  20:   66 0f 6f 00             movdqa (%rax),%xmm0
  24:   66 0f 6f c8             movdqa %xmm0,%xmm1
  28:   66 0f d9 c4             psubusw %xmm4,%xmm0
  2c:   66 0f 75 c2             pcmpeqw %xmm2,%xmm0
  30:   66 0f fd cb             paddw  %xmm3,%xmm1
  34:   66 0f df c1             pandn  %xmm1,%xmm0
  38:   66 0f 7f 00             movdqa %xmm0,(%rax)
  3c:   48 83 c0 10             add    $0x10,%rax
  40:   48 3d 00 00 00 00       cmp    $0x0,%rax
                        42: R_X86_64_32S        head+0x2
  46:   75 d8                   jne    20
  48:   f3 c3                   repz retq


There is a lot of unnecessary code.  The psubusw instruction alone is
sufficient.  The purpose of this instruction is to implement saturated
subtraction.  Why does gcc create all this extra code?  The code should just be

   movdqa (%rax), %xmm0
   psubusw %xmm1, %xmm0
   movdqa %xmm0, (%rax)

where %xmm1 has WSIZE in the 16-bit values.
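
Written with intrinsics the intended loop looks as follows.  This is a sketch
under the assumption that SSE2 is available (always the case on x86-64).  A
scalar fallback with the same result is included for other targets, and the
function name f_saturating is made up.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define SIZE 65536
#define WSIZE 64
unsigned short head[SIZE] __attribute__((aligned(64)));

void f_saturating(void)
{
#if defined(__SSE2__)
  // WSIZE replicated into all eight 16-bit lanes.
  const __m128i w = _mm_set1_epi16(WSIZE);
  for (std::size_t n = 0; n < SIZE; n += 8) {
    __m128i *p = reinterpret_cast<__m128i *>(&head[n]);
    // psubusw: unsigned 16-bit subtraction which clamps at zero.
    _mm_store_si128(p, _mm_subs_epu16(_mm_load_si128(p), w));
  }
#else
  // Scalar fallback with the same result.
  for (std::size_t n = 0; n < SIZE; ++n)
    head[n] = static_cast<unsigned short>(head[n] >= WSIZE
                                          ? head[n] - WSIZE : 0);
#endif
}
```

The aligned loads and stores are valid here because head is declared with
64-byte alignment and each step advances by 16 bytes.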


[Bug c++/51785] New: gets not anymore declared

2012-01-07 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51785

 Bug #: 51785
   Summary: gets not anymore declared
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


glibc 2.15 and later don't declare gets anymore for ISO C11 mode and if
_GNU_SOURCE is defined.  This causes problems with the cstdio header which
unconditionally uses

using ::gets;


Something has to be done about this.  If you want glibc to define a macro to
signal that gets is not declared let me know.  Otherwise recognize __USE_GNU.

The problem still applies to the trunk.


[Bug tree-optimization/51492] vectorizer does not support saturated arithmetic patterns

2012-01-08 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

--- Comment #2 from Ulrich Drepper  2012-01-08 
18:56:48 UTC ---
Note, this code appears in gzip and therefore IIRC in specCPU (in
deflate.c:fill_window).  Although when compiling gzip myself with that code
embedded in a larger function I cannot get the optimization to apply at all.

If this bug is fixed and the optimization is applied the spec numbers could go
up if specCPU is testing unzipping...


[Bug tree-optimization/52034] New: __builtin_copysign optimization suboptimal

2012-01-28 Thread drepper.fsp at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52034

 Bug #: 52034
   Summary: __builtin_copysign optimization suboptimal
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: drepper@gmail.com


The most trivial __builtin_copysign optimization is not optimal:

double f(double a, double b)
{
  return __builtin_copysign(a,b);
}

With gcc 4.6.2 this gets compiled to

    movapd  %xmm1, %xmm2
    andpd   .LC0(%rip), %xmm0
    andpd   .LC1(%rip), %xmm2
    orpd    %xmm2, %xmm0
    ret

There is no reason for %xmm1 to be duplicated to %xmm2.  This is sufficient:

    andpd   .LC0(%rip), %xmm0
    andpd   .LC1(%rip), %xmm1
    orpd    %xmm1, %xmm0
    ret

The same happens with more complicated code sequences.
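
For reference, the and/and/or sequence corresponds to the following bit
manipulation.  This is a sketch which assumes 64-bit IEEE 754 doubles.  The
two masks presumably play the role of the constants .LC0 and .LC1, whose
contents are not shown in the report, and the function name is made up.

```cpp
#include <cstdint>
#include <cstring>

double copysign_bits(double a, double b)
{
  std::uint64_t ua, ub;
  std::memcpy(&ua, &a, sizeof ua);
  std::memcpy(&ub, &b, sizeof ub);
  // Sign bit of an IEEE 754 double.
  const std::uint64_t sign = std::uint64_t(1) << 63;
  // Magnitude of a combined with the sign of b.
  const std::uint64_t ur = (ua & ~sign) | (ub & sign);
  double r;
  std::memcpy(&r, &ur, sizeof r);
  return r;
}
```

Both operands are only read once each, which is why no extra register copy
should be necessary when b is not needed afterwards.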


[Bug middle-end/59521] __builtin_expect not effective in switch

2017-07-14 Thread drepper.fsp at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59521

--- Comment #12 from Ulrich Drepper  ---
On Fri, Jul 14, 2017 at 2:17 PM, marxin at gcc dot gnu.org
 wrote:
> Maybe I miss something, but I would expect to sort all branches in
> emit_case_decision_tree as either predictors can sort branches, or one have a
> profile feedback. Having a chain of equal comparisons, that should be always
> beneficial, or?

I agree.  There seems to be no negative effect.  If you use a stable
sort algorithm the programmer can have influence when needed since the
program's order is preserved.  If the compiler has probability
information it should use it.  Note, I just mean the order of the
tests.  Deciding about placing code in cold sections is a different
story but this isn't what we're talking about here, right?
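
To illustrate the point about a stable sort, here is a sketch with a
hypothetical case_entry structure.  It is not a GCC data structure and not the
code in emit_case_decision_tree.  Cases with equal probability keep the order
in which the programmer wrote them.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one case label.
struct case_entry {
  int value;            // the case constant
  double probability;   // estimated probability that this case is taken
};

// Orders the tests by descending probability.  std::stable_sort keeps
// the source order among entries with equal probability.
void order_tests(std::vector<case_entry>& cases)
{
  std::stable_sort(cases.begin(), cases.end(),
                   [](const case_entry& x, const case_entry& y) {
                     return x.probability > y.probability;
                   });
}
```

For the switch from the original report, with 333 marked as expected, this
moves the test for 333 to the front and leaves 3 before 42.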