On Sun, May 12, 2013 at 02:14:31PM +0200, David Brown wrote:
> On 11/05/13 17:20, jacob navia wrote:
> >On 11/05/13 16:01, Ondřej Bílka wrote:
> >>As for 1), the only way is to measure that. Compile the following and
> >>we will see who is right.
> >>
> >>cat > sin.c <<'EOF'
> >>#include <math.h>
> >>
> >>int main(){ int i;
> >>  double x=0;
> >>
> >>  double ret=0;
> >>  double f;
> >>  for(i=0;i<10000000;i++){
> >>    ret+=sin(x);
> >>    x+=0.3;
> >>  }
> >>  return ret;
> >>}
> >>EOF
> >OK I did a similar thing. I just compiled sin(argc) in main.
> >The results prove that you were right. The single fsin instruction
> >takes longer than several HUNDRED instructions (calls, jumps,
> >table lookup, what have you)
> >
> >Gone are the times when an fsin would take 30 cycles or so.
> >Intel has destroyed the FPU.
> >
>
> What makes you so sure that it takes more than 30 cycles to execute
> hundreds of instructions in the library? Modern CPUs often do
> several instructions per cycle (I am not considering multiple cores
> here). They can issue several instructions per cycle, and predicted
> jumps can often be eliminated entirely in the decode stages.
>

To clarify the numbers here: a 30-cycle library call is unrealistic. The
latency of the call itself plus the overhead of saving and restoring xmm
registers is often more than 30 cycles. A library sin takes around 150
cycles for normal inputs.
fsin is slower for several reasons. One is that its performance depends
on the input: according to
http://www.agner.org/optimize/instruction_tables.pdf fsin takes about
20-100 cycles. The second problem is that the
xmm->memory->fpu->memory->xmm round trip is expensive, and there is a
performance penalty when switching between fpu and xmm instructions.

> The moral here is that /you/ need to benchmark /your/ code on /your/
> processor - don't jump to conclusions, or accept other benchmarks as
> giving the complete picture.
>

Agreed.
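
For anyone who wants to repeat the measurement on their own machine, here is a minimal sketch of a timing harness in C. It is not the code anyone in this thread ran: the names time_loop and fsin_x87 are made up for illustration, the fsin wrapper assumes GCC or Clang inline asm on x86, and clock() measures CPU time at a fairly coarse resolution.

```c
#include <time.h>

/* Hypothetical helper (not from the thread): calls fn on the inputs
 * 0, step, 2*step, ... n times and returns the elapsed CPU time in
 * seconds. The running sum is stored in *sum_out so the compiler
 * cannot discard the calls as dead code. */
double time_loop(double (*fn)(double), long n, double step, double *sum_out)
{
    double x = 0.0, sum = 0.0;
    long i;
    clock_t start = clock();
    for (i = 0; i < n; i++) {
        sum += fn(x);
        x += step;
    }
    clock_t end = clock();
    *sum_out = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Assumption: GCC-style inline asm on x86. The "+t" constraint places x
 * in st(0), so on x86-64 the compiler has to move the value from an xmm
 * register through memory to the x87 stack and back again. */
double fsin_x87(double x)
{
    __asm__("fsin" : "+t"(x));
    return x;
}
#endif
```

Calling time_loop(sin, 10000000, 0.3, &sum) (with <math.h>, linked with -lm) and then time_loop(fsin_x87, 10000000, 0.3, &sum) gives two times to compare. The call through a function pointer should add roughly the same overhead to both candidates, but the absolute numbers will still depend on compiler flags and on the processor.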