Calculating cosinus/sinus
Hi,

When calculating cos/sin, gcc generates a call to a complicated routine that takes several thousand instructions to execute.

Suppose the value is stored in some XMM register, say xmm0, and the result should end up in another XMM register, say xmm1. Why doesn't it generate:

    movsd  %xmm0, (%rsp)
    fldl   (%rsp)
    fsin
    fstpl  (%rsp)
    movsd  (%rsp), %xmm1

My compiler system (lcc-win) generates that when optimizations are ON. Maybe there are some flags in gcc that I am missing? Or is there some other reason?

Thanks in advance for your attention.

jacob
Re: Calculating cosinus/sinus
Hi,

This question is not appropriate for this mailing list. Please take any further discussions to the gcc-help mailing list.

On Sat, 2013-05-11 at 11:15 +0200, jacob navia wrote:
> When calculating cos/sin, gcc generates a call to a complicated
> routine that takes several thousand instructions to execute.
>
> Suppose the value is stored in some XMM register, say xmm0, and the
> result should end up in another XMM register, say xmm1.
>
> Why doesn't it generate:
>
>     movsd  %xmm0, (%rsp)
>     fldl   (%rsp)
>     fsin
>     fstpl  (%rsp)
>     movsd  (%rsp), %xmm1
>
> My compiler system (lcc-win) generates that when optimizations are
> ON. Maybe there are some flags in gcc that I am missing?

These optimizations are usually turned on with -ffast-math. You also have to make sure to select the appropriate CPU or architecture type to enable the usage of certain instructions. For more information see:

http://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html

Cheers,
Oleg
Re: Calculating cosinus/sinus
On Sat, 11 May 2013, jacob navia wrote:
> When calculating cos/sin, gcc generates a call to a complicated
> routine that takes several thousand instructions to execute.
>
> Suppose the value is stored in some XMM register, say xmm0, and the
> result should end up in another XMM register, say xmm1.
>
> Why doesn't it generate:
>
>     movsd  %xmm0, (%rsp)
>     fldl   (%rsp)
>     fsin
>     fstpl  (%rsp)
>     movsd  (%rsp), %xmm1
>
> My compiler system (lcc-win) generates that when optimizations are
> ON. Maybe there are some flags in gcc that I am missing? Or is there
> some other reason?

fsin is slower and less precise than the libc SSE2 implementation.

--
Marc Glisse
Re: Calculating cosinus/sinus
On 11/05/13 11:20, Oleg Endo wrote:
> This question is not appropriate for this mailing list. Please take
> any further discussions to the gcc-help mailing list.
>
> These optimizations are usually turned on with -ffast-math. You also
> have to make sure to select the appropriate CPU or architecture type
> to enable the usage of certain instructions. For more information see:
>
> http://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html

Sorry, but I DID try:

    gcc -ffast-math -S -O3 -finline-functions tsin.c

It will generate a call to sin() and NOT the assembly above. And I DID look at the options page.
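A minimal test case along these lines might look like the sketch below; the actual tsin.c from the thread is not shown, so this is an assumed reconstruction. On x86-64, GCC normally considers the inline x87 fsin expansion only when x87 math is selected, so adding -mfpmath=387 to -ffast-math may also be needed, depending on the GCC version and target:

    /* Hypothetical minimal test case (the real tsin.c is not shown in the
       thread).  Whether GCC expands the call to an inline fsin depends on
       the target, the selected FP math unit and the GCC version, e.g.:
           gcc -O3 -ffast-math -mfpmath=387 -S tsin.c                      */
    #include <math.h>

    double tsin(double x)
    {
        return sin(x);
    }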
Re: Calculating cosinus/sinus
On 11/05/13 11:30, Marc Glisse wrote:
> fsin is slower and less precise than the libc SSE2 implementation.

Excuse me, but:

1) The fsin instruction is ONE instruction! The sin routine is (at least) a thousand instructions! Even if the fsin instruction itself is "slow", it should be a thousand times faster than the complicated routine gcc calls.

2) The FPU runs with a 64-bit mantissa under gcc, i.e. fsin will calculate with a 64-bit mantissa and NOT only the 53 bits of SSE2. The fsin instruction is more precise!

I think that gcc has a problem here. I am pointing you to this problem, but please keep in mind I am no newbie...

jacob
Re: Calculating cosinus/sinus
On 5/11/2013 5:42 AM, jacob navia wrote:
> 1) The fsin instruction is ONE instruction! The sin routine is (at
> least) a thousand instructions! Even if the fsin instruction itself
> is "slow", it should be a thousand times faster than the complicated
> routine gcc calls.
>
> 2) The FPU runs with a 64-bit mantissa under gcc, i.e. fsin will
> calculate with a 64-bit mantissa and NOT only the 53 bits of SSE2.
> The fsin instruction is more precise!

You are making conclusions based on naive assumptions here.

> I think that gcc has a problem here. I am pointing you to this
> problem, but please keep in mind I am no newbie...

Sure, but that does not mean you are familiar with the intricacies of accurate computation of transcendental functions!
Re: Calculating cosinus/sinus
On Sat, May 11, 2013 at 09:34:37AM -0400, Robert Dewar wrote:
> On 5/11/2013 5:42 AM, jacob navia wrote:
>> 1) The fsin instruction is ONE instruction! The sin routine is (at
>> least) a thousand instructions! Even if the fsin instruction itself
>> is "slow", it should be a thousand times faster than the complicated
>> routine gcc calls.
>>
>> 2) The FPU runs with a 64-bit mantissa under gcc, i.e. fsin will
>> calculate with a 64-bit mantissa and NOT only the 53 bits of SSE2.
>> The fsin instruction is more precise!
>
> You are making conclusions based on naive assumptions here.

As for 1), the only way to settle this is to measure it. Compile the following and we will see who is right:

    /* sin.c */
    #include <math.h>

    int main()
    {
        int i;
        double x = 0;
        double ret = 0;

        for (i = 0; i < 1000; i++) {
            ret += sin(x);
            x += 0.3;
        }
        return ret;
    }

    gcc sin.c -O3 -lm -S
    cp sin.s fsin.s
    # change the sin implementation in fsin.s to use fsin
    gcc sin.s -lm -o sin; gcc fsin.s -lm -o fsin
    for I in `seq 1 10`; do
        time ./sin
        time ./fsin
    done

>> I think that gcc has a problem here. I am pointing you to this
>> problem, but please keep in mind I am no newbie...
>
> Sure, but that does not mean you are familiar with the intricacies of
> accurate computation of transcendental functions!

--
Borg implants are failing
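For the fsin variant, instead of hand-editing the generated fsin.s one could call a small wrapper from the benchmark loop. The sketch below is illustrative only and is not from the thread: it assumes an x86/x86-64 target with GCC extended asm, and the function name fsin_wrap is made up:

    /* Illustrative sketch: an x87 fsin wrapper via GCC extended asm, usable
       in place of sin() in the benchmark.  Assumes x86/x86-64. */
    static double fsin_wrap(double x)
    {
        double r;
        __asm__ ("fldl %1\n\t"   /* push x onto the x87 stack   */
                 "fsin\n\t"      /* st(0) = sin(st(0))          */
                 "fstpl %0"      /* pop the result to memory    */
                 : "=m" (r)
                 : "m" (x)
                 : "st");
        return r;
    }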
Re: Calculating cosinus/sinus
> As for 1), the only way to settle this is to measure it. Compile the
> following and we will see who is right.

Right, probably you should have done that before posting anything! (I leave the experiment up to you!)
Re: Calculating cosinus/sinus
On 5/11/2013 10:46 AM, Robert Dewar wrote:
>> As for 1), the only way to settle this is to measure it. Compile the
>> following and we will see who is right.
>
> Right, probably you should have done that before posting anything!
> (I leave the experiment up to you!)

And of course this experiment says nothing about accuracy!
Re: Calculating cosinus/sinus
On 11/05/13 16:01, Ondřej Bílka wrote:
> As for 1), the only way to settle this is to measure it. Compile the
> following and we will see who is right.

OK, I did a similar thing. I just compiled sin(argc) in main. The results prove that you were right: the single fsin instruction takes longer than several HUNDRED instructions (calls, jumps, table lookups, what have you). Gone are the times when an fsin would take 30 cycles or so. Intel has destroyed the FPU.

But is this the case in real code?

The results are around 2 seconds for 100 million sin calculations, and 4 seconds for the same calculations using fsin. But the code used for the fsin solution is just a few bytes, compared to the several hundred bytes of the sin function, not counting the table lookups. In the benchmark code all that code/data is in the L1 cache. In real-life code you call the sin routine only occasionally, and the probability of it not being in the L1 cache is much higher; I would say almost one if you do not do sin/cos VERY often.

For the time being I will go on generating the fsin code. I will try to optimize Moshier's SIN function later on.

I suppose this group is the right place for asking this kind of question. I thank everyone that answered.

Yours sincerely,

Jacob, the developer of lcc-win (http://www.cs.virginia.edu/~lcc-win32)
Re: Calculating cosinus/sinus
On 5/11/2013 11:20 AM, jacob navia wrote:
> OK, I did a similar thing. I just compiled sin(argc) in main. The
> results prove that you were right: the single fsin instruction takes
> longer than several HUNDRED instructions (calls, jumps, table lookups,
> what have you). Gone are the times when an fsin would take 30 cycles
> or so. Intel has destroyed the FPU.

That's an unwarranted claim, but indeed the algorithm used within the FPU is inferior to the one in the library. Not so surprising: the one in the chip is old, and we have made good advances in learning how to calculate things accurately. Also, the library is using the fast new 64-bit arithmetic. So none of this is (or should be) surprising.

> In the benchmark code all that code/data is in the L1 cache. In
> real-life code you call the sin routine only occasionally, and the
> probability of it not being in the L1 cache is much higher; I would
> say almost one if you do not do sin/cos VERY often.

But of course you don't really care about performance so much unless you *are* using it very often. I would be surprised if there are any real programs in which using the FPU instruction is faster. And as noted earlier in the thread, the library algorithm is more accurate than the Intel algorithm, which is also not at all surprising.

> For the time being I will go on generating the fsin code. I will try
> to optimize Moshier's SIN function later on.

Well, I will be surprised if you can find significant optimizations to that very clever routine. Certainly you have to be a floating-point expert to even touch it!

Robert Dewar
Re: Calculating cosinus/sinus
On 05/11/2013 11:25 AM, Robert Dewar wrote:
> But of course you don't really care about performance so much unless
> you *are* using it very often. I would be surprised if there are any
> real programs in which using the FPU instruction is faster.

Possible, if long double precision is needed, within the range where fsin can deliver it. I take it the use of a vector sin library is excluded (not available for long double).

> And as noted earlier in the thread, the library algorithm is more
> accurate than the Intel algorithm, which is also not at all
> surprising.

Reduction for ranges well outside the basic 4 quadrants should be better in the library (note that fsin gives up for |x| > 2^64), but a double library function can hardly be claimed to be generally more accurate than a long double built-in.

--
Tim Prince
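To make the range-reduction point concrete, the small program below (an illustrative sketch, not code from the thread) compares a naive reduction using double-precision 2*pi against the library's careful reduction for a large argument; the two results differ substantially because 2*pi rounded to double is not the true period:

    /* Illustrative sketch: naive range reduction with a rounded 2*pi loses
       the argument's phase for large x, while libm's sin() reduces the
       argument carefully.  Build with: gcc demo.c -lm (file name arbitrary). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0e18;                        /* far outside the basic quadrants */
        double naive   = sin(fmod(x, 2.0 * M_PI)); /* reduce with rounded 2*pi first */
        double careful = sin(x);                   /* let libm do the reduction      */
        printf("naive reduction: %.17g\n", naive);
        printf("libm sin:        %.17g\n", careful);
        return 0;
    }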
Re: OpenACC support in 4.9
Another interesting use-case for OpenACC and OpenMP is mixing both standards' annotations on the same loop:

    // Compute matrix multiplication.
    #pragma omp parallel for default(none) shared(A,B,C,size)
    #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
                        pcopyout(C[0:size][0:size])
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            float tmp = 0.;
            for (int k = 0; k < size; ++k) {
                tmp += A[i][k] * B[k][j];
            }
            C[i][j] = tmp;
        }
    }

This means that OpenACC pragmas should be parsed before the OpenMP pass (in case both standards are enabled), before the OpenMP pass changes the annotated GIMPLE statements irrecoverably. In my view this use-case could be handled, for example, in this way: we could add some temporary variable, say "expand_gimple_with_openmp", and change the example above to something like this just before the OpenMP pass:

    if (expand_gimple_with_openmp) {
        #pragma omp parallel for default(none) shared(A,B,C,size)
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                float tmp = 0.;
                for (int k = 0; k < size; ++k) {
                    tmp += A[i][k] * B[k][j];
                }
                C[i][j] = tmp;
            }
        }
    } else {
        #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
                            pcopyout(C[0:size][0:size])
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                float tmp = 0.;
                for (int k = 0; k < size; ++k) {
                    tmp += A[i][k] * B[k][j];
                }
                C[i][j] = tmp;
            }
        }
    }

Later, at the Graphite pass, we could recognize that the statement is a SCoP, produce a kernel for it, then assume the expand_gimple_with_openmp heuristic is false and eliminate the OpenMP version of the loop, or vice versa. But we have to make sure that optimization passes do not change our OpenACC GIMPLE in a way that makes it no longer parallelizable.

thanks,
Dinar.

On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus wrote:
> Jakub Jelinek wrote:
> [Fallback generation of CPU code]
>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the
>> required behavior; if the code is for whatever reason not possible
>> to run on the accelerator, it should be executed on the host [...]
>
> (I haven't checked, but is this a compile-time or run-time requirement?)
>
>> Otherwise, the OpenMP runtime as well as the pragmas have a way to
>> choose which accelerator you want to run something on, as a device id
>> (integer), so the OpenMP runtime library should maintain the list of
>> supported accelerators (say if you have two Intel MIC cards and two
>> AMD GPGPU devices), and probably we'll need a compiler switch to say
>> which kinds of accelerators we want to generate code for, plus the
>> runtime could have dlopened plugins for each of the accelerator kinds.
>
> At least two OpenACC implementations I know of fail hard when the GPU
> is not available (nonexistent, or the /dev/... node does not have the
> right permissions). And three of them fail at compile time with an
> error message if an expression within a device section is not possible
> (e.g. calling some non-device, non-inlinable function).
>
> While it is convenient to have CPU fallback, it would be nice to know
> whether some code actually uses the accelerator - both at compile time
> and at run time. Otherwise, one thinks the GPU is used without
> realizing that it isn't because, e.g., the device permissions are
> wrong, or one forgot to declare a certain function as a target
> function.
>
> Besides having a flag which tells the compiler for which accelerator
> the code should be generated, additional flags should also be handled,
> e.g. for different versions of the accelerator. For instance, one
> accelerator model of the same series might support double-precision
> variables while another might not. I assume that falling back to the
> CPU if the accelerator doesn't support a certain feature won't work,
> and one will get an error in this case.
>
> Is there actually a need to handle multiple accelerators
> simultaneously? My impression is that both OpenACC and OpenMP 4 assume
> that there is only one kind of accelerator available besides the host.
> If I missed some fine print, or something else requires that there are
> multiple different accelerators, it will get more complicated,
> especially for those code sections where the user didn't explicitly
> specify which one should be used.
>
> Finally, one should think about debugging. It is not really clear (to
> me) how to handle this best, but as the compiler generates quite some
> additional code (e.g. for copying the data around) and as printf
> debugging doesn't work on GPUs, it is not that easy. I wonder whether
> there should be an optional library like libgomp_debug which adds
> additional sanity checks (e.g. related to copying data to/from the
> GPU) and which allows printing diagnostic output when one sets an
> environment variable.
>
> Tobias
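As a concrete illustration of the device-id selection mentioned above, a minimal OpenMP 4.0 target region might look like the sketch below. This is an illustrative example only, assuming an OpenMP 4.0 capable compiler and runtime; the function name, array extents and device number are arbitrary:

    /* Minimal sketch of OpenMP 4.0 offloading with an explicit device id.
       Names and the device number are illustrative, not from the thread. */
    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma omp target device(0) map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }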
gcc-4.7-20130511 is now available
Snapshot gcc-4.7-20130511 is now available on

    ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20130511/

and on various mirrors; see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.7 SVN branch with the following options:

    svn://gcc.gnu.org/svn/gcc/branches/gcc-4_7-branch revision 198799

You'll find:

    gcc-4.7-20130511.tar.bz2    Complete GCC
                                MD5=40d74e2c71b5a73f0dbbc3f82930e705
                                SHA1=acf8232b8d29d1fe85f975413ca741c66cc3c530

Diffs from 4.7-20130504 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.7 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Baseless address space
Hi,

I'm considering adding a named address space to our private gcc port. This address space is accessed using special instructions with a very limited addressing mode, "[index*8 + imm]": it only supports an index scaled by 8 (64-bit words) plus an immediate. The issue here is that there is no base register. (Access to this address space is always aligned to 64 bits.)

Up until now we were using inline assembly to access this address space. I've tried adding a named address space and defining the relevant insn patterns and macros (such as TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P, and adjusting TARGET_ADDRESS_COST), but without success so far. It looks like the expander assumes there is always a base register and that it can always fall back to the "[base]" addressing mode.

I'd appreciate any hints on whether there is already a target with such an addressing mode, and how hard it would be to modify gcc to support it.

Thanks,
Amir
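As a rough starting point, the address-space hook for such a mode might be shaped like the sketch below. This is only an illustrative outline: the "myport_" names and the generic-space fallback are placeholders rather than code from any existing target, and it does not by itself address the expander's assumption that a base register is always available:

    /* Illustrative sketch of a TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P hook
       that accepts only "index*8 + imm" (no base register) in the special
       address space.  All "myport_" identifiers are placeholders. */
    static bool
    myport_addr_space_legitimate_address_p (enum machine_mode mode, rtx x,
                                            bool strict, addr_space_t as)
    {
      if (as == ADDR_SPACE_GENERIC)
        return myport_legitimate_address_p (mode, x, strict);

      rtx index = x;
      rtx offset = const0_rtx;

      /* Allow an optional constant displacement: (plus (mult reg 8) imm).  */
      if (GET_CODE (x) == PLUS && CONST_INT_P (XEXP (x, 1)))
        {
          index = XEXP (x, 0);
          offset = XEXP (x, 1);
        }

      /* Require a register index scaled by 8 and keep 64-bit alignment.  */
      return GET_CODE (index) == MULT
             && REG_P (XEXP (index, 0))
             && CONST_INT_P (XEXP (index, 1))
             && INTVAL (XEXP (index, 1)) == 8
             && (INTVAL (offset) & 7) == 0;
    }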