Calculating cosinus/sinus

2013-05-11 Thread jacob navia

Hi

When calculating cos/sin, gcc generates a call to a complicated 
routine that takes several thousand instructions to execute.


Suppose the value is stored in some XMM register, say xmm0, and the 
result should be in another XMM register, say xmm1.


Why doesn't it generate:

    movsd   %xmm0, (%rsp)
    fldl    (%rsp)
    fsin
    fstpl   (%rsp)
    movsd   (%rsp), %xmm1

My compiler system (lcc-win) is generating that when optimizations are 
ON. Maybe there are some flags in gcc that I am missing?


Or is there some other reason?
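
For reference, here is a minimal sketch of the kind of test case I am
compiling (not the exact source file):

/* sin-call.c -- minimal sketch; compile with e.g. "gcc -O2 -S sin-call.c"
   and look at the generated assembly */
#include <math.h>

double f(double x)
{
    return sin(x);   /* gcc emits a call to sin(); lcc-win emits the fsin sequence above */
}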

Thanks in advance for your attention.

jacob


Re: Calculating cosinus/sinus

2013-05-11 Thread Oleg Endo
Hi,

This question is not appropriate for this mailing list.
Please take any further discussions to the gcc-help mailing list.

On Sat, 2013-05-11 at 11:15 +0200, jacob navia wrote:
> Hi
> 
> When calculating cos/sin, gcc generates a call to a complicated 
> routine that takes several thousand instructions to execute.
> 
> Suppose the value is stored in some XMM register, say xmm0, and the 
> result should be in another XMM register, say xmm1.
> 
> Why doesn't it generate:
> 
>  movsd   %xmm0, (%rsp)
>  fldl    (%rsp)
>  fsin
>  fstpl   (%rsp)
>  movsd   (%rsp), %xmm1
> 
> My compiler system (lcc-win) is generating that when optimizations are 
> ON. Maybe there are some flags in gcc that I am missing?

These optimizations are usually turned on with -ffast-math.
You also have to make sure to select the appropriate CPU or architecture
type to enable the usage of certain instructions.
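
For example, something along these lines might work (an untested sketch;
fsin is an x87 instruction, so on x86-64 the x87 unit also has to be
selected for the math):

/* untested sketch: with the value kept in an x87 register, an invocation
   such as
       gcc -O2 -ffast-math -mfpmath=387 -S test.c
   is the kind of combination that can allow the fsin expansion */
#include <math.h>

double f(double x)
{
    return sin(x);
}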

For more information see:
http://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html

Cheers,
Oleg



Re: Calculating cosinus/sinus

2013-05-11 Thread Marc Glisse

On Sat, 11 May 2013, jacob navia wrote:


Hi

When calculating cos/sin, gcc generates a call to a complicated routine 
that takes several thousand instructions to execute.


Suppose the value is stored in some XMM register, say xmm0 and the result 
should be in another xmm register, say xmm1.


Why doesn't it generate:

   movsd   %xmm0, (%rsp)
   fldl    (%rsp)
   fsin
   fstpl   (%rsp)
   movsd   (%rsp), %xmm1

My compiler system (lcc-win) is generating that when optimizations are ON. 
Maybe there are some flags in gcc that I am missing?


Or is there some other reason?


fsin is slower and less precise than the libc SSE2 implementation.

--
Marc Glisse


Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

On 11/05/13 11:20, Oleg Endo wrote:

Hi,

This question is not appropriate for this mailing list.
Please take any further discussions to the gcc-help mailing list.

On Sat, 2013-05-11 at 11:15 +0200, jacob navia wrote:

Hi

When calculating cos/sin, gcc generates a call to a complicated
routine that takes several thousand instructions to execute.

Suppose the value is stored in some XMM register, say xmm0, and the
result should be in another XMM register, say xmm1.

Why doesn't it generate:

  movsd   %xmm0, (%rsp)
  fldl    (%rsp)
  fsin
  fstpl   (%rsp)
  movsd   (%rsp), %xmm1

My compiler system (lcc-win) is generating that when optimizations are
ON. Maybe there are some flags in gcc that I am missing?

These optimizations are usually turned on with -ffast-math.
You also have to make sure to select the appropriate CPU or architecture
type to enable the usage of certain instructions.

For more information see:
http://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html

Cheers,
Oleg

Sorry but I DID try:

gcc -ffast-math -S -O3 -finline-functions  tsin.c

That generates a call to sin() and NOT the assembly above.

And I DID look at the options page.


Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

On 11/05/13 11:30, Marc Glisse wrote:

On Sat, 11 May 2013, jacob navia wrote:


Hi

When calculating cos/sin, gcc generates a call to a complicated 
routine that takes several thousand instructions to execute.


Suppose the value is stored in some XMM register, say xmm0 and the 
result should be in another xmm register, say xmm1.


Why doesn't it generate:

   movsd   %xmm0, (%rsp)
   fldl    (%rsp)
   fsin
   fstpl   (%rsp)
   movsd   (%rsp), %xmm1

My compiler system (lcc-win) is generating that when optimizations 
are ON. Maybe there are some flags in gcc that I am missing?


Or is there some other reason?


fsin is slower and less precise than the libc SSE2 implementation.


Excuse me, but:

1) The fsin instruction is ONE instruction! The sin routine is (at
least) a thousand instructions! Even if the fsin instruction itself is
"slow", it should be a thousand times faster than the complicated
routine gcc calls.
2) With gcc the FPU uses a 64-bit mantissa, i.e. fsin will calculate
with a 64-bit mantissa and NOT only the 53 bits of SSE2. The fsin
instruction is more precise! (A quick check with float.h, below, shows
the 53 vs 64 bit widths.)
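
For what it's worth, a tiny sketch to print the two mantissa widths
(assuming an x86 target where long double is the 80-bit x87 format):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* double has a 53-bit mantissa; the x87 long double has 64 bits */
    printf("double mantissa bits:      %d\n", DBL_MANT_DIG);
    printf("long double mantissa bits: %d\n", LDBL_MANT_DIG);
    return 0;
}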

I think that gcc has a problem here. I am pointing you to this problem,
but please keep in mind that I am no newbie...

jacob


Re: Calculating cosinus/sinus

2013-05-11 Thread Robert Dewar

On 5/11/2013 5:42 AM, jacob navia wrote:


1) The fsin instruction is ONE instruction! The sin routine is (at
least) a thousand instructions! Even if the fsin instruction itself is
"slow", it should be a thousand times faster than the complicated
routine gcc calls.
2) With gcc the FPU uses a 64-bit mantissa, i.e. fsin will calculate
with a 64-bit mantissa and NOT only the 53 bits of SSE2. The fsin
instruction is more precise!


You are making conclusions based on naive assumptions here.


I think that gcc has a problem here. I am pointing you to this problem,
but please keep in mind that I am no newbie...


Sure, but that does not mean you are familiar with the intricacies
of accurate computation of transcendental functions!


jacob





Re: Calculating cosinus/sinus

2013-05-11 Thread Ondřej Bílka
On Sat, May 11, 2013 at 09:34:37AM -0400, Robert Dewar wrote:
> On 5/11/2013 5:42 AM, jacob navia wrote:
> 
> >1) The fsin instruction is ONE instruction! The sin routine is (at
> >least) thousand instructions!
> > Even if the fsin instruction itself is "slow" it should be thousand
> >times faster than the
> > complicated routine gcc calls.
> >2) The FPU is at 64 bits mantissa using gcc, i.e. fsin will calculate
> >with 64 bits mantissa and
> > NOT only 53 as SSE2. The fsin instruction is more precise!
> 
> You are making conclusions based on naive assumptions here.

As for 1), the only way is to measure it. Compile the following and we
will see who is right.

cat > sin.c << "EOF"
#include <math.h>

int main(){
  int i;
  double x=0;
  double ret=0;
  double f;
  for(i=0;i<1000;i++){
    ret+=sin(x);
    x+=0.3;
  }
  return ret;
}
EOF

gcc sin.c -O3 -lm -S 
cp sin.s fsin.s
# edit fsin.s so that the sin call is replaced by the fsin instruction
gcc sin.s -lm -o  sin; gcc fsin.s -lm -o fsin
for I in `seq 1 10` ; do
time ./sin
time ./fsin
done
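
If editing the generated assembly by hand is inconvenient, an equivalent
way to get the fsin variant is to wrap the instruction in inline asm
(untested sketch, x86/x86-64 only):

/* fsin.c -- same loop as sin.c above, but calling fsin through inline asm */
#include <math.h>

static double fsin_insn(double x)
{
    double r;
    /* "t" is the top of the x87 register stack; fsin replaces st(0) in place */
    __asm__ ("fsin" : "=t" (r) : "0" (x));
    return r;
}

int main(){
  int i;
  double x=0;
  double ret=0;
  for(i=0;i<1000;i++){
    ret+=fsin_insn(x);
    x+=0.3;
  }
  return ret;
}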


> >
> >I think that gcc has a problem here. I am pointing you to this problem,
> >but please keep in mind
> >I am no newbie...
> 
> Sure, but that does not mean you are familiar with the intricacies
> of accurate computation of transcendental functions!
> >
> >jacob
> >

-- 

Borg implants are failing


Re: Calculating cosinus/sinus

2013-05-11 Thread Robert Dewar



As for 1), the only way is to measure it. Compile the following and we
will see who is right.


Right, probably you should have done that before posting
anything! (I leave the experiment up to you!)


cat > sin.c << "EOF"
#include <math.h>

int main(){
  int i;
  double x=0;
  double ret=0;
  double f;
  for(i=0;i<1000;i++){
    ret+=sin(x);
    x+=0.3;
  }
  return ret;
}
EOF

gcc sin.c -O3 -lm -S
cp sin.s fsin.s
# edit fsin.s so that the sin call is replaced by the fsin instruction
gcc sin.s -lm -o  sin; gcc fsin.s -lm -o fsin
for I in `seq 1 10` ; do
time ./sin
time ./fsin
done




I think that gcc has a problem here. I am pointing you to this problem,
but please keep in mind that I am no newbie...


Sure, but that does not mean you are familiar with the intricacies
of accurate computation of transcendental functions!


jacob







Re: Calculating cosinus/sinus

2013-05-11 Thread Robert Dewar

On 5/11/2013 10:46 AM, Robert Dewar wrote:



As for 1), the only way is to measure it. Compile the following and we
will see who is right.


Right, probably you should have done that before posting
anything! (I leave the experiment up to you!)


And of course this experiment says nothing about accuracy!



Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

On 11/05/13 16:01, Ondřej Bílka wrote:

As for 1), the only way is to measure it. Compile the following and we
will see who is right.

cat > sin.c << "EOF"
#include <math.h>

int main(){
  int i;
  double x=0;
  double ret=0;
  double f;
  for(i=0;i<1000;i++){
    ret+=sin(x);
    x+=0.3;
  }
  return ret;
}
EOF

OK, I did a similar thing. I just compiled sin(argc) in main.
The results prove that you were right. The single fsin instruction
takes longer than several HUNDRED instructions (calls, jumps,
table lookups, what have you).

Gone are the times when an fsin would take 30 cycles or so.
Intel has destroyed the FPU.

But is this the case in real code?
The results are around 2 seconds for 100 million sin calculations
and 4 seconds for the same calculations doing fsin.

But the code used for the fsin solution is just a few bytes,
compared to the several hundred bytes of the sin function,
not counting the table lookups.

In the benchmark code all that code/data is in the L1 cache.
In real-life code you call the sin routine only now and then, and
the probability of it not being in the L1 cache is much higher;
I would say almost one if you do not call sin/cos VERY often.

For the time being I will go on generating the fsin code.
I will try to optimize Moshier's SIN function later on.

I suppose this group is for asking this kind of question. I
thank everyone who answered.


Yours sincerely

Jacob, the developer of lcc-win

(http://www.cs.virginia.edu/~lcc-win32)




Re: Calculating cosinus/sinus

2013-05-11 Thread Robert Dewar

On 5/11/2013 11:20 AM, jacob navia wrote:


OK, I did a similar thing. I just compiled sin(argc) in main.
The results prove that you were right. The single fsin instruction
takes longer than several HUNDRED instructions (calls, jumps,
table lookups, what have you).

Gone are the times when an fsin would take 30 cycles or so.
Intel has destroyed the FPU.


That's an unwarranted claim, but indeed the algorithm used
within the FPU is inferior to the one in the library. Not
so surprising: the one in the chip is old, and we have made
good advances in learning how to calculate things accurately.
Also, the library is using the fast new 64-bit arithmetic.
So none of this is (or should be) surprising.


In the benchmark code all that code/data is in the L1 cache.
In real life code you use the sin routine sometimes, and
the probability of it not being in the L1 cache is much higher,
I would say almost one if you do not do sin/cos VERY often.


But of course you don't really care about performance so much
unless you *are* using it very often. I would be surprised if
there are any real programs in which using the FPU instruction
is faster.

And as noted earlier in the thread, the library algorithm is
more accurate than the Intel algorithm, which is also not at
all surprising.


For the time being I will go on generating the fsin code.
I will try to optimize Moshier's SIN function later on.


Well I will be surprised if you can find significant
optimizations to that very clever routine. Certainly
you have to be a floating-point expert to even touch it!

Robert Dewar




Re: Calculating cosinus/sinus

2013-05-11 Thread Tim Prince

On 05/11/2013 11:25 AM, Robert Dewar wrote:

On 5/11/2013 11:20 AM, jacob navia wrote:


OK, I did a similar thing. I just compiled sin(argc) in main.
The results prove that you were right. The single fsin instruction
takes longer than several HUNDRED instructions (calls, jumps,
table lookups, what have you).

Gone are the times when an fsin would take 30 cycles or so.
Intel has destroyed the FPU.


That's an unwarranted claim, but indeed the algorithm used
within the FPU is inferior to the one in the library. Not
so surprising: the one in the chip is old, and we have made
good advances in learning how to calculate things accurately.
Also, the library is using the fast new 64-bit arithmetic.
So none of this is (or should be) surprising.


In the benchmark code all that code/data is in the L1 cache.
In real life code you use the sin routine sometimes, and
the probability of it not being in the L1 cache is much higher,
I would say almost one if you do not do sin/cos VERY often.


But of course you don't really care about performance so much
unless you *are* using it very often. I would be surprised if
there are any real programs in which using the FPU instruction
is faster.
Possible, if long double precision is needed, within the range where
fsin can deliver it.
I take it the use of a vector sin library is excluded (not available for
long double).


And as noted earlier in the thread, the library algorithm is
more accurate than the Intel algorithm, which is also not at
all surprising.
Range reduction for arguments well outside the basic 4 quadrants should
be better in the library (note that fsin gives up for |x| >= 2^63), but a
double library function can hardly be claimed to be generally more
accurate than a long double built-in.
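
For instance, a rough sketch for eyeballing the difference between the
double and long double library results (it assumes C99 sinl() is
available, and by itself says nothing about which result is closer to
the true value):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* a small argument, a moderate one, and one that stresses range reduction */
    double xs[] = { 0.5, 1e6, 1e19 };
    for (int i = 0; i < 3; i++) {
        double d = sin(xs[i]);
        long double ld = sinl((long double) xs[i]);
        printf("x=%g  sin=%.17g  sinl=%.20Lg  diff=%Lg\n",
               xs[i], d, ld, (long double) d - ld);
    }
    return 0;
}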


For the time being I will go on generating the fsin code.
I will try to optimize Moshier's SIN function later on.


Well I will be surprised if you can find significant
optimizations to that very clever routine. Certainly
you have to be a floating-point expert to even touch it!

Robert Dewar





--
Tim Prince



Re: OpenACC support in 4.9

2013-05-11 Thread Dinar Temirbulatov
Another interesting use-case for OpenACC and OpenMP is mixing both
standard annotations for the same loop:
  // Compute matrix multiplication.
  #pragma omp parallel for default(none) shared(A,B,C,size)
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
              pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
This means that OpenACC pragmas should be parsed before the OpenMP pass
(in case both standards are enabled), before the OpenMP pass changes the
annotated GIMPLE statements irrecoverably. In my view this use-case
could be handled, for example, in this way:
we could add some temporary variable, for example
"expand_gimple_with_openmp", and change the example above to something
like this just before the OpenMP pass:


if (expand_gimple_with_openmp) {
  #pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
} else {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
              pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
}
Later, at the Graphite pass, we could recognize that the statement is a
SCoP, produce a kernel for it, and then set the expand_gimple_with_openmp
heuristic to false so that the OpenMP version of the loop is eliminated,
or vice versa. But we have to make sure that the optimization passes do
not change our OpenACC GIMPLE in a way that makes it impossible to
parallelize.
   thanks, Dinar.

On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus  wrote:
> Jakub Jelinek wrote:
> [Fallback generation of CPU code]
>>
>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
>> behavior, if the code is for whatever reason not possible to run on the
>> accelerator, it should be executed on host [...]
>
> (I haven't checked, but is this a compile time or run-time requirement?)
>
>
>> Otherwise, the OpenMP runtime as well as the pragmas have a way to choose
>> which accelerator you want to run something on, as device id (integer), so
>> the OpenMP runtime library should maintain the list of supported
>> accelerators (say if you have two Intel MIC cards, and two AMD GPGPU
>> devices), and probably we'll need a compiler switch to say for which kinds
>> of accelerators we want to generate code for, plus the runtime could have
>> dlopened plugins for each of the accelerator kinds.
>
>
> At least two OpenACC implementations I know fail hard when the GPU is not
> available (nonexisting or if the /dev/... has not the right permissions).
> And three of them fail at compile time with an error message if an
> expression within a device section is not possible (e.g. calling some
> nondevice/noninlinable function).
>
> While it is convenient to have CPU fallback, it would be nice to know
> whether some code actually uses the accelerator - both at compile time and
> at run time. Otherwise, one thinks the GPU is used - without realizing
> that it isn't because, e.g. the device permissions are wrong - or one forgot
> to declare a certain function as target function.
>
> Besides having a flag which tells the compiler for which accelerator the
> code should be generated, also additional flags should be handled, e.g. for
> different versions of the accelerator. For instance, one accelerator model
> of the same series might support double-precision variables while another
> might not. - I assume that falling back to the CPU if the accelerator
> doesn't support a certain feature won't work and one will get an error in
> this case.
>
>
> Is there actually the need to handle multiple accelerators simultaneously?
> My impression is that both OpenACC and OpenMP 4 assume that there is only
> one kind of accelerator available besides the host. If I missed some fine
> print or something else  requires that there are multiple different
> accelerators, it will get more complicated - especially for those code
> section where the user didn't explicitly specify which one should be used.
>
>
> Finally, one should think about debugging. It is not really clear (to me)
> how to handle this best, but as the compiler generates quite some additional
> code (e.g. for copying the data around) and as printf debugging doesn't work
> on GPUs, it is not that easy. I wonder whether there should be an optional
> library like libgomp_debug which adds additional sanity checks (e.g. related
> to copying data to/from the GPU) and which allows diagnostic
> output to be printed when one sets an environment variable.
>
> Tobias


gcc-4.7-20130511 is now available

2013-05-11 Thread gccadmin
Snapshot gcc-4.7-20130511 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20130511/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.7 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_7-branch 
revision 198799

You'll find:

 gcc-4.7-20130511.tar.bz2 Complete GCC

  MD5=40d74e2c71b5a73f0dbbc3f82930e705
  SHA1=acf8232b8d29d1fe85f975413ca741c66cc3c530

Diffs from 4.7-20130504 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.7
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Baseless address space

2013-05-11 Thread Amir Gonnen
Hi,

I'm considering adding a named address space to our private gcc port.
This address space is accessed using special instructions with a very
limited addressing mode, "[index*8 + imm]": it only supports an index
scaled for 64-bit elements plus an immediate. The issue here is that
there is no base register. (Access to this address space is always
aligned to 64 bits.)

Up until now we have been using inline assembly to access this address
space. I've tried adding a named address space and defining the relevant
insn patterns and macros (such as TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P,
and adjusting TARGET_ADDRESS_COST), but without success so far. It looks
like the expander assumes there is always a base register and that it can
always fall back to the "[base]" addressing mode.
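
Roughly, the address-space hook I have looks like this (a simplified,
untested sketch with names changed; MY_AS stands for the port's
address-space number, and a real version would also honour the strict
argument):

static bool
my_addr_space_legitimate_address_p (enum machine_mode mode, rtx x,
                                    bool strict, addr_space_t as)
{
  if (as != MY_AS)
    return default_addr_space_legitimate_address_p (mode, x, strict, as);

  /* Accept (plus (mult (reg) (const_int 8)) (const_int imm)) and the
     degenerate form with just an immediate.  */
  if (CONST_INT_P (x))
    return true;
  if (GET_CODE (x) == PLUS
      && CONST_INT_P (XEXP (x, 1))
      && GET_CODE (XEXP (x, 0)) == MULT
      && REG_P (XEXP (XEXP (x, 0), 0))
      && CONST_INT_P (XEXP (XEXP (x, 0), 1))
      && INTVAL (XEXP (XEXP (x, 0), 1)) == 8)
    return true;

  return false;
}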

I'd appreciate any hints on whether there is already a target with such
an addressing mode, and on how hard it would be to modify gcc to support
it.

Thanks,

Amir