speed of simple loops on x86_64 using opencc vs gcc
Hi,

I run some tests of simple number-crunching loops whenever new
architectures and compilers arise.

These tests on recent Intel architectures show similar performance
between the gcc and icc compilers, at full optimization.

However, a recent test on x86_64 showed the open64 compiler
outstripping gcc by a factor of 2 to 3.  I tried all the obvious
flags; nothing helped.

Versions: gcc 4.5.2, Open64 4.2.5.
CPU: AMD Phenom(tm) II X4 840 Processor.

A peek at the assembler makes the reason clear, though: even with -O3,
gcc is not unrolling the loops in this code, but opencc does, and
profits.

Attached find the C file.  It's not pretty, but the guts are in the
small routine double_array_mults_by_const().  For your convenience,
the assembler for the innermost loop, generated by the two compilers
with the -S flag, is also attached.

---
Building and running:

$ gcc --std=c99 -O3 -Wall -pedantic mults_by_const.c
$ ./a.out
double array mults by const             450 ms [  1.013193]

$ opencc -std=c99 -O3 -Wall mults_by_const.c
$ ./a.out
double array mults by const             170 ms [  1.013193]
---

Now, gcc -O3 should have turned on loop unrolling.  I tried turning it
on explicitly, without success.  By the way, -march=native and
-ffast-math did not affect the time at all either.

Cheers!

--- mults_by_const.c ---

#ifdef __ICC
#include <mathimf.h>
#else
#include <math.h>
#endif

/* timer stuff */
#include <sys/time.h>
#include <sys/resource.h>
#define __USE_XOPEN2K 1
#include <stdio.h>
#include <stdlib.h>

static const int who = RUSAGE_SELF;
static struct rusage local;
static time_t tv_sec;

#define START_CLOCK()  getrusage( who, &local )

#define MS_SINCE()  ( tv_usec = local.ru_utime.tv_usec,                   \
                      tv_sec  = local.ru_utime.tv_sec,                    \
                      getrusage( who, &local ),                           \
                      (long)( ( local.ru_utime.tv_sec - tv_sec ) * 1000   \
                            + ( local.ru_utime.tv_usec - tv_usec ) / 1000 ) )

#ifdef __suseconds_t_defined
static suseconds_t tv_usec;
#else
static long tv_usec;
#endif

/* test parameters */
enum { ITERATIONS = 131072, size = 8192 };

static void double_array_mults_by_const( double dvec[] );

int main( int argc, char *argv[] )
{
    double * restrict dvec = 0;
    void **dvecptr = (void **)&dvec;

    if( 0 == posix_memalign( dvecptr, 16, size * sizeof(double) ) )
    {
        double_array_mults_by_const( dvec );
    }
    return 0;
}

void double_array_mults_by_const( double * restrict dvec )
{
    long i, j;
    const double dval = 1.001;

    for( i = 0; i < size; i++ )
        dvec[i] = 1.0;

    START_CLOCK();

    for( j = 0; j < ITERATIONS; j++ )
        for( i = 0; i < size; i++ )
            dvec[i] *= dval;

    printf( "%-38s %4ld ms [%10.6f]\n",
            "double array mults by const", MS_SINCE(), dvec[0] );
}
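For reference, here is roughly what the unrolled timed loop looks like at the
C level.  This is only a sketch: the unroll factor of 4 is illustrative (the
actual factor and scheduling opencc chooses are only visible in the attached
opencc.asm), and it assumes size is a multiple of 4, which holds here
(size = 8192).

    /* hand-unrolled version of the timed loop -- illustrative only */
    for( j = 0; j < ITERATIONS; j++ )
        for( i = 0; i < size; i += 4 )
        {
            dvec[i]     *= dval;
            dvec[i + 1] *= dval;
            dvec[i + 2] *= dval;
            dvec[i + 3] *= dval;
        }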
Re: speed of simple loops on x86_64 using opencc vs gcc
Hi Richard!

On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther wrote:
> On Thu, Sep 8, 2011 at 12:31 AM, Steve White wrote:
>> Hi,
>>
>> I run some tests of simple number-crunching loops whenever new
>> architectures and compilers arise.
>>
>> These tests on recent Intel architectures show similar performance
>> between gcc and icc compilers, at full optimization.
>>
>> However a recent test on x86_64 showed the open64 compiler
>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>> flags; nothing helped.
>
> Like -funroll-loops?
>

Let's turn it around: what would be a good set of flags for improving
the speed of simple loops such as these on x86_64?

In fact, I did try -funroll-loops and several others, but I somehow
fooled myself.  (Maybe partly because, as I wrote, I was under the
impression -O3 turned this on by default.)

With -funroll-loops, the performance is improved a lot:

$ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
$ ./a.out
double array mults by const             320 ms [  1.013193]

which puts it only a factor of 2 slower than the open64 -O3 result.

Furthermore, -march=native improves it yet more:

$ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic mults_by_const.c
$ ./a.out
double array mults by const             300 ms [  1.013193]

Now it's only 70% slower than the open64 result.

I also tried these flags

    -floop-optimize
    -fmove-loop-invariants
    -fprefetch-loop-arrays
    -fprofile-use

but saw no further improvement.

So I drop my claim of knowing what the problem is (and repent of even
having tried).  Simple searches on the web turn up a lot of
experiments, nothing definitive.

FWIW, also attached is the whole assembler file generated with the
above settings.  To my eye, the gcc assembler is a great deal more
complicated and does a lot more stuff, besides being slower.

Thanks!
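One note on the last flag in that list: -fprofile-use only has an effect when
profile data from an earlier -fprofile-generate training run is present, so if
that step was skipped it would have been a no-op.  A sketch of the two-step
workflow, using the same command line as above:

    $ gcc --std=c99 -O3 -funroll-loops -fprofile-generate mults_by_const.c
    $ ./a.out                      # training run, writes .gcda profile data
    $ gcc --std=c99 -O3 -funroll-loops -fprofile-use mults_by_const.c
    $ ./a.out                      # recompiled using the recorded profile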
proposal: warning for incomplete multichars
Hi!

This concerns multiple-character integer constants, e.g. 'abcd', as
discussed in the C99 standard in subsection 6.4.4.4.  We'll call them
"multichars".

First: everybody agrees multichars are non-portable and therefore to
be avoided.  That said, there are real-life situations where they are
very natural.  People use them, portability notwithstanding.

Presently cpp has options to turn all multichar warnings on, or all
off.  Furthermore, it properly warns of multichars that are too big
for the int type.  This is good.

An edge case is that of *incomplete* multichars, such as

    'abc'

While gcc arranges for this to equal '\0abc', as one might expect (and
as many applications assume), other compilers do not.  The C99
standard is silent on the point.  (Note this behavior is independent
of endian-ness.)

The primary issue with complete multichars is that of endian-ness, but
that can be handled in various programmatic ways.  Incomplete
multichars, by contrast, are impossible to detect programmatically and
are non-portable even between compilers on the same architecture.
Furthermore, in many applications, a multichar with other than four
characters is always a typo.

What is meant by an incomplete multichar depends on context: on a
64-bit architecture, an int isn't completely specified by four
characters (without a specification of padding and endian-ness).
Also, there are applications where two-character multichars are always
intended.

I propose therefore a default warning for an incomplete multichar, to
complement the existing option -Wno-multichar, to be turned on by an
option something like

    -Wmultichar-besides=<n>

where <n> is 2, 4, or 8, which would turn on warnings for any
multichar of length other than <n> bytes.

Cheers!
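To make the edge case concrete, a minimal example (the behavior noted in the
comments is gcc's; another compiler may give a different value for the
incomplete constant, which is exactly the portability problem described
above):

    #include <stdio.h>

    int main( void )
    {
        int full    = 'abcd';  /* complete: four characters fill the int     */
        int partial = 'abc';   /* incomplete: gcc treats it as '\0abc', but  */
                               /* other compilers may not; C99 is silent     */

        printf( "full    = 0x%08x\n", (unsigned)full );
        printf( "partial = 0x%08x\n", (unsigned)partial );
        return 0;
    }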
speed of double-precision divide
Hi,

I recently revised some speed tests of basic CPU operations.  There
were a few surprises, but one was that a test of double-precision
divide was a factor of ten slower when compiled with gcc than with the
Intel compiler icc.  This was with full optimization turned on, on an
Intel Duo (Yonah) processor.

I figured gcc was simply not using SSE2, and icc was.  But that is not
the case at all.  While gcc produces apparently straightforward SSE2
assembler, icc does something quite different.  What's going on?

Find the .c file attached.  Assembler snippets follow.

gcc has this (gcc -std=c99 -O3 -msse2 -mfpmath=sse -lm -S dt.c)

.L27:
    movapd  (%esi,%eax), %xmm3      ; move 2 dbls at *(esi+eax) to xmm3
    divpd   192(%esp,%eax), %xmm3   ; (192 is xmm2) *(esp+eax), result -> xmm3
    movapd  %xmm3, (%esi,%eax)      ; move 2 dbls from xmm3 back
    addl    $16, %eax               ; add 16 (len of 2 doubles) to eax
    cmpl    $16384, %eax            ; compare eax to 1024 * 16
    jne     .L27                    ; if not equal, do it again

icc has this (icc -Wall -w2 -fast -c dt.c)

                                # LOE eax xmm2
..B1.69:                        # Preds ..B1.71 ..B1.68
    movsd     8336(%esp,%eax,8), %xmm1          #108.30
    movsd     _2il0floatpacket.13, %xmm0        #108.2
    divsd     24720(%esp,%eax,8), %xmm0         #108.2
    unpcklpd  %xmm2, %xmm1                      #108.30
    xorl      %edx, %edx                        #
    movddup   %xmm0, %xmm0                      #108.2
    movddup   %xmm0, %xmm0                      #108.2
                                # LOE eax edx xmm0 xmm1 xmm2
..B1.70:                        # Preds ..B1.70 ..B1.69
    mulpd     %xmm0, %xmm1                      #108.2
    mulpd     %xmm0, %xmm1                      #108.2
    mulpd     %xmm0, %xmm1                      #108.2
    mulpd     %xmm0, %xmm1                      #108.2
    addl      $8, %edx                          #
    cmpl      $131072, %edx                     #108.2
    jb        ..B1.70                           # Prob 99%   #108.2

--- dt.c ---

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <sys/resource.h>

enum { ITERATIONS = 131072, size = 2048 };

inline void
double_array_divs_variable( double * restrict dvec1, double * restrict dvec2 )
{
    long i, j;

    for( j = 0; j < ITERATIONS; j++ )
        for( i = 0; i < size; i++ )
            dvec1[i] /= dvec2[i];
}

static const int who = RUSAGE_SELF;
static struct rusage local;
static time_t tv_sec;
static long tv_usec;

void START_CLOCK()
{
    getrusage( who, &local );
}

long MS_SINCE()
{
    return tv_usec = local.ru_utime.tv_usec,
           tv_sec  = local.ru_utime.tv_sec,
           getrusage( who, &local ),
           (long)( ( local.ru_utime.tv_sec - tv_sec ) * 1000
                 + ( local.ru_utime.tv_usec - tv_usec ) / 1000 );
}

int main( int argc, char *argv[] )
{
    double *dvec1, *dvec2;
    const char *compiler = NULL;
    long i;

    printf( " SpeedTest >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n" );

#ifdef __INTEL_COMPILER
    compiler = "INTEL";
#elif defined( __PATHSCALE__ )
    compiler = "PathScale";
#elif defined( __PGI )
    compiler = "Portland Group";
#elif defined( __GNUC__ )
    compiler = "Gnu gcc";
#endif

    posix_memalign( (void **)&dvec1, 16, size * sizeof(double) );
    posix_memalign( (void **)&dvec2, 16, size * sizeof(double) );

    printf( " C version" );
    if( compiler )
        printf( ", %s compiler ", compiler );
    printf( "\n" );
    printf( " size of int: %zu  size of long: %zu  size of double: %zu\n",
            sizeof( int ), sizeof( long ), sizeof( double ) );
    printf( " %i iterations of each test.  ", ITERATIONS );
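Reading the icc listing, the compiler appears to have interchanged the two
loops and replaced the repeated divisions by a single division plus repeated
multiplications by the reciprocal (a transformation that -fast licenses, since
it changes rounding).  At the C level it would look roughly like the sketch
below; the function name is just for illustration, and the real code
additionally works on two packed doubles at a time and unrolls the multiply
loop:

    /* sketch of the transformation inferred from the icc listing */
    void double_array_divs_variable_interchanged( double * restrict dvec1,
                                                  double * restrict dvec2 )
    {
        long i, j;

        for( i = 0; i < size; i++ )
        {
            const double recip = 1.0 / dvec2[i];  /* one division per element */
            double x = dvec1[i];

            for( j = 0; j < ITERATIONS; j++ )
                x *= recip;               /* only multiplies in the hot loop */

            dvec1[i] = x;
        }
    }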
Re: speed of double-precision divide
Hi, Andrew!

Thanks for the suggestion, but it didn't make any difference for me.
Neither the speed nor the assembler was significantly altered.

Which version of gcc did you use?  Mine is 4.4.1.

I threw everything at it:

gcc -std=c99 -Wall -pedantic -O3 -ffast-math -mmmx -msse -msse2 -mfpmath=sse -mtune=pentium-m -o dt dt.c -lm -lc

I should say, I have tried a lot of other combinations.  I have never
got gcc to perform well with this test.  You will also see that I
thought of alignment, and tried to correct for that.

Never mind icc for the moment, with whatever trick it may be doing.
Why is the SSE2 division so slow, compared to multiplication?

Change one character in the division test to make a multiplication
test; the difference in speed is an order of magnitude.  Try it
yourself!

Thanks!

On 23.01.10, Andrew Pinski wrote:
> On Sat, Jan 23, 2010 at 8:47 AM, Steve White wrote:
> > gcc has this (gcc -std=c99 -O3 -msse2 -mfpmath=sse -lm -S dt.c)
> > icc has this (icc -Wall -w2 -fast -c dt.c)
>
> icc's -fast is equivalent to gcc's -ffast-math option, which you did
> not supply, so you are comparing apples to oranges.
>
> Note supplying -ffast-math will have gcc pull the division out
> of the loop, which should speed up your program with some loss of
> precision.
>
> Thanks,
> Andrew Pinski
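For anyone reproducing this, the "one character" in question is the operator
in the inner loop of double_array_divs_variable(); changing the first line
below into the second turns the division test into the multiplication test:

    dvec1[i] /= dvec2[i];   /* division test: slow                        */
    dvec1[i] *= dvec2[i];   /* multiplication test: ~10x faster here      */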
Re: speed of double-precision divide
Richard,

Could you provide us with a good reference for the latencies and other
speed issues of SSE operations?  What I've found is scattered and hard
to compare.

Frankly, I was under the misconception that each of these SSE
operations was meant to be accomplished in a single clock cycle
(although I knew there were various other issues).

Cheers!

On 23.01.10, Richard Guenther wrote:
> On Sat, Jan 23, 2010 at 6:33 PM, Steve White wrote:
> > Hi, Andrew!
> ...
> >
> > Never mind icc for the moment, with whatever trick it may be doing.
> > Why is the SSE2 division so slow, compared to multiplication?
> >
> > Change one character in the division test to make a multiplication test.
> > It is an order of magnitude difference in speed.
>
> It's because multiplication latency is about 4 cycles while division is
> about 20; also, one multiplication can be issued per cycle, while only
> every 17th instruction can be a division (AMD Fam10 values).
>
> GCC performs loop interchange with -ftree-loop-linear, but the pass
> is scheduled in an unfortunate place, so no further optimization happens.
>
> Richard.
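A rough back-of-envelope using Richard's figures (they are AMD Fam10 numbers,
so only indicative for the Yonah in question): the gcc division loop retires
one divpd per two doubles, at roughly one division every 17 cycles, i.e. about
8.5 cycles per element; the multiplication version can sustain about one mulpd
per cycle, i.e. about 0.5 cycles per element.  That is a factor of roughly 17,
consistent with the order-of-magnitude difference observed when changing the
'/' to '*'.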