speed of simple loops on x86_64 using opencc vs gcc

2011-09-07 Thread Steve White
Hi,

I run some tests of simple number-crunching loops whenever new
architectures and compilers arise.

These tests on recent Intel architectures show similar performance
between gcc and icc compilers, at full optimization.

However a recent test on x86_64 showed the open64 compiler
outstripping gcc by a factor of 2 to 3.  I tried all the obvious
flags; nothing helped.

Versions: gcc 4.5.2, Open64 4.2.5.  AMD Phenom(tm) II X4 840 Processor.

A peek at the assembler makes the reason clear, though.  Even with -O3, gcc
does not unroll the loops in this code, but opencc does, and profits from it.
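
For reference, the transformation at issue is roughly the following (a
hand-unrolled sketch of the inner loop, by a factor of 4; the compilers
choose their own factor):

	/* size is 8192, a multiple of 4, so no cleanup loop is needed here */
	for( i = 0; i < size; i += 4 )
	{
		dvec[i]     *= dval;
		dvec[i + 1] *= dval;
		dvec[i + 2] *= dval;
		dvec[i + 3] *= dval;
	}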

Attached is the C file.  It's not pretty, but the guts are in the
small routine double_array_mults_by_const().

For your convenience, also attached is the assembler for the innermost
loop, generated by the two compilers with the -S flag.
---
Building and running:

$ gcc --std=c99 -O3 -Wall -pedantic mults_by_const.c
$ ./a.out
double array mults by const 450 ms [  1.013193]

$ opencc -std=c99 -O3 -Wall mults_by_const.c
$ ./a.out
double array mults by const 170 ms [  1.013193]
---
Now, the gcc -O3 should have turned on loop unrolling. I tried turning
it on explicitly without success.

By the way, I also tried
-march=native
and
-ffast-math
but neither affected the time at all.

Cheers!
#ifdef __ICC
  #include <mathimf.h>	/* the archive stripped the header names; the math headers here are a guess */
#else
  #include <math.h>
#endif

/* timer stuff  */
#include <sys/time.h>
#include <sys/resource.h>
#define __USE_XOPEN2K	1
#include <stdlib.h>
#include <stdio.h>

static const int who = RUSAGE_SELF;
static struct rusage local;
static time_t tv_sec;
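/* START_CLOCK() samples the process's user CPU time; MS_SINCE() returns
   the user-CPU milliseconds elapsed since the previous sample.  */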
#define START_CLOCK() getrusage(who, &local)
#define MS_SINCE( ) ( tv_usec = local.ru_utime.tv_usec, tv_sec = local.ru_utime.tv_sec, \
			getrusage( who, &local), \
			(long)( ( local.ru_utime.tv_sec - tv_sec ) * 1000 \
+ ( local.ru_utime.tv_usec - tv_usec ) / 1000 ) )

#ifdef __suseconds_t_defined
static suseconds_t tv_usec;
#else
static long tv_usec;
#endif

/* test parameters  */
enum {
	ITERATIONS = 131072,
	size = 8192
};

static void double_array_mults_by_const( double dvec[] );

int
main( int argc, char *argv[] )
{
	double	* restrict dvec = 0;
	void	**dvecptr = (void **)&dvec;

	if( 0 == posix_memalign( dvecptr, 16, size * sizeof(double) ) )
	{
		double_array_mults_by_const( dvec );
	}

	return 0;
}

void
double_array_mults_by_const( double * restrict dvec )
{
	long		i, j;
	const double	dval = 1.001;

	for( i = 0; i < size; i++ )
		dvec[i] = 1.0;

	START_CLOCK();

	for( j = 0; j < ITERATIONS; j++ )
		for( i = 0; i < size; i++ )
			dvec[i] *= dval;
	
	printf( "%-38s %4ld ms [%10.6f]\n",
			"double array mults by const", MS_SINCE(), dvec[0] );
}


gcc.asm
Description: Binary data


opencc.asm
Description: Binary data


Re: speed of simple loops on x86_64 using opencc vs gcc

2011-09-08 Thread Steve White
Hi Richard!

On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther
 wrote:
> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
>  wrote:
>> Hi,
>>
>> I run some tests of simple number-crunching loops whenever new
>> architectures and compilers arise.
>>
>> These tests on recent Intel architectures show similar performance
>> between gcc and icc compilers, at full optimization.
>>
>> However a recent test on x86_64 showed the open64 compiler
>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>> flags; nothing helped.
>
> Like -funroll-loops?
>

** Let's turn it around:  What, then, is a good set of flags for
improving speed in simple loops such as these on x86_64?

In fact, I did try -funroll-loops and several others, but I somehow
fooled myself (maybe partly because, as I wrote, I was under the
impression that -O3 turned it on by default).

With -funroll-loops, the performance is improved a lot.

$ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
$ ./a.out
double array mults by const 320 ms [  1.013193]

That puts it only about a factor of 2 slower than the open64 -O3 result.

Furthermore, -march=native improves it yet more.

$ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic
mults_by_const.c
$ ./a.out
double array mults by const 300 ms [  1.013193]

Now it's only about 75% slower than the open64 result.

I tried these flags
   -floop-optimize  -fmove-loop-invariants -fprefetch-loop-arrays -fprofile-use
but saw no further improvements.

So I drop my claim of knowing what the problem is (and repent of even
having tried before.)

Simple searches on the web turn up a lot of experiments, nothing definitive.

FWIW, also attached is the whole assembler file generated with the
above settings.

To my eye, the gcc assembler is a great deal more complicated, and
does a lot more stuff, besides being slower.

Thanks!


mults_by_const.s.gz
Description: GNU Zip compressed data


proposal: warning for incomplete multichars

2011-10-03 Thread Steve White
Hi!

This concerns multiple character integer constants, e.g.
'abcd'
as discussed in the C99 standard in subsection 6.4.4.4.
We'll call them "multichars".

First:  everybody agrees multichars are non-portable and therefore to
be avoided.
That said, there are real-life situations where they are very natural.
 People use them, portability notwithstanding.

Presently cpp has options to turn all multichar warnings on, or all
off.  Furthermore, it properly warns of multichars that are too big
for the int type.  This is good.

An edge case is that of *incomplete* multichars, such as
'abc'
While gcc arranges for this to equal '\0abc', as one might expect (and
as many applications assume), other compilers do not.  The C99
standard is silent on the point.
(Note this behavior is independent of endian-ness.)

The primary issue with complete multichars is that of endian-ness, but
that can be handled in various programmatic ways.
Incomplete multichars, by contrast, are impossible to detect
programmatically and are non-portable even between compilers on the
same architecture.  Furthermore, in many applications, a multichar
with other than four characters is always a typo.

What is meant by an incomplete multichar depends: on a 64-bit
architecture, an int isn't completely specified by four characters
(without a specification of padding and endian-ness).  Also, there are
applications where two-character multichars are always intended.

I propose therefore a default warning for incomplete multichars, to
complement the existing option -Wno-multichar, to be turned on by an
option something like
-Wmultichar-besides=N
where N is 2, 4, or 8, which would turn on warnings for any
multichar of a length other than N bytes.
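
As an illustration (hypothetical, since the flag is only a proposal): with
something like -Wno-multichar -Wmultichar-besides=4, only the incomplete
constant below should draw a warning.

	#include <stdio.h>

	int main( void )
	{
		int tag  = 'RIFF';	/* four characters: complete, no warning   */
		int oops = 'abc';	/* three characters: incomplete, would warn */

		printf( "%#x %#x\n", tag, oops );
		return 0;
	}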


Cheers!


speed of double-precision divide

2010-01-23 Thread Steve White
Hi,

I recently revised some speed tests of basic CPU operations. 
There were a few surprises; one was that a test of double-precision
divide was a factor of ten slower when compiled with gcc than with the
Intel compiler icc.

This was with full optimization turned on, with an Intel Duo (Yonah)
processor.

I figured gcc was simply not using SSE2, and icc was.

But that is not the case at all.  While gcc produces apparently SSE2
assembler, icc does something quite different.

What's going on?

Find the .c file attached.  Assembler snippets follow.


gcc has this (gcc -std=c99 -O3 -msse2 -mfpmath=sse -lm -S dt.c)

.L27:
movapd  (%esi,%eax), %xmm3   ;move 2 dbls at *(esi+eax) to xmm3
divpd   192(%esp,%eax), %xmm3;(192 is xmm2) *(esp+eax), result->xmm3
movapd  %xmm3, (%esi,%eax)   ;move 2 dbls from xmm3 back
addl$16, %eax;add 16 (len of 2 doubles) to eax
cmpl$16384, %eax ;compare eax to 1024 * 16
jne .L27 ;if not equal, do it again


icc has this (icc -Wall -w2 -fast -c dt.c)

# LOE eax xmm2
..B1.69:# Preds ..B1.71 ..B1.68
movsd 8336(%esp,%eax,8), %xmm1  #108.30
movsd _2il0floatpacket.13, %xmm0#108.2
divsd 24720(%esp,%eax,8), %xmm0 #108.2
unpcklpd  %xmm2, %xmm1  #108.30
xorl  %edx, %edx#
movddup   %xmm0, %xmm0  #108.2
movddup   %xmm0, %xmm0  #108.2
# LOE eax edx xmm0 xmm1 xmm2
..B1.70:# Preds ..B1.70 ..B1.69
mulpd %xmm0, %xmm1  #108.2
mulpd %xmm0, %xmm1  #108.2
mulpd %xmm0, %xmm1  #108.2
mulpd %xmm0, %xmm1  #108.2
addl  $8, %edx  #
cmpl  $131072, %edx #108.2
jb..B1.70   # Prob 99%  #108.2
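
For what it's worth, my reading of the icc code above (just an interpretation
of the assembler, not anything documented): it seems to interchange the loops
and hoist the divide, so each element is divided once and then only multiplied
in the hot loop, roughly like this:

	for( i = 0; i < size; i++ )
	{
		double r = 1.0 / dvec2[i];	/* one divide per element          */

		for( j = 0; j < ITERATIONS; j++ )
			dvec1[i] *= r;		/* only multiplies in the hot loop */
	}

Note that this is not bit-for-bit the same as doing the divisions one at a
time, so presumably such a rewrite needs relaxed floating-point rules.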


-- 
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
| Steve White +49(331)7499-202
| e-Science / AstroGrid-D   Zi. 35  Bg. 20
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
| Astrophysikalisches Institut Potsdam (AIP)
| An der Sternwarte 16, D-14482 Potsdam
|
| Vorstand: Prof. Dr. Matthias Steinmetz, Peter A. Stolz
|
| Stiftung privaten Rechts, Stiftungsverzeichnis Brandenburg: III/7-71-026
| -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
#include <stdio.h>	/* header names were stripped in the archive; restored to what the code uses */
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>

#include <math.h>	/* guessed from the -lm link flag */
enum {
	ITERATIONS = 131072,
	size = 2048
};

inline void
double_array_divs_variable( double * restrict dvec1, double * restrict dvec2 )
{
	long	i, j;

	for( j = 0; j < ITERATIONS; j++ )
		for( i = 0; i < size; i++ )
			dvec1[i] /= dvec2[i];
}
static const int who = RUSAGE_SELF;
static struct rusage local;
static time_t tv_sec;
static long tv_usec;

void START_CLOCK()
{
	getrusage( who, &local );
}

long
MS_SINCE()
{
	return tv_usec = local.ru_utime.tv_usec, tv_sec = local.ru_utime.tv_sec,
		getrusage( who, &local ),
		(long)( ( local.ru_utime.tv_sec - tv_sec ) * 1000
			+ ( local.ru_utime.tv_usec - tv_usec ) / 1000 );
}

int
main( int argc, char *argv[] )
{
	double		*dvec1, *dvec2;
	const char	*compiler = NULL;
	long		i;

	printf( " SpeedTest >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n" );
#ifdef __INTEL_COMPILER
	compiler = "INTEL";
#elif defined( __PATHSCALE__ )
	compiler = "PathScale";
#elif defined( __PGI )
	compiler = "Portland Group";
#elif defined( __GNUC__ )
	compiler = "Gnu gcc";
#endif
	posix_memalign( &dvec1, 16, size * sizeof(double) );
	posix_memalign( &dvec2, 16, size * sizeof(double) );

	printf( " C version" );
	if( compiler )
		printf( ", %s compiler ", compiler );
	printf( "\n" );
	printf( " size of int: %zu  size of long: %zu  size of double: %zu\n",
		sizeof( int ), sizeof( long ), sizeof( double ) );
	printf( " %i iterations of each test. ", ITERATIONS );

Re: speed of double-precision divide

2010-01-23 Thread Steve White
Hi, Andrew!

Thanks for the suggestion, but it didn't make any difference for me.
Neither the speed nor the assembler was significantly altered.

Which version of gcc did you use?  Mine is 4.4.1.

I threw everything at it:
gcc -std=c99 -Wall -pedantic -O3 -ffast-math -mmmx -msse -msse2 
-mfpmath=sse -mtune=pentium-m -o dt dt.c -lm -lc
I should say, I have tried a lot of other combinations.  
I have never got gcc to perform well with this test.  You will also
see that I thought of alignment, and tried to correct for that.

Never mind icc for the moment, with whatever trick it may be doing.
Why is SSE2 division so slow compared to multiplication?

Change one character in the division test and it becomes a multiplication
test; the difference in speed is an order of magnitude.

Try it yourself!
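
For concreteness, the one character is the operator in the inner loop of
double_array_divs_variable():

	/* division test; change /= to *= and the same loop runs about
	   ten times faster */
	for( j = 0; j < ITERATIONS; j++ )
		for( i = 0; i < size; i++ )
			dvec1[i] /= dvec2[i];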

Thanks!

On 23.01.10, Andrew Pinski wrote:
> On Sat, Jan 23, 2010 at 8:47 AM, Steve White  wrote:
> > gcc has this (gcc -std=c99 -O3 -msse2 -mfpmath=sse -lm -S dt.c)
> > icc has this (icc -Wall -w2 -fast -c dt.c)
> 
> icc's -fast is equivalent to gcc's -ffast-math option, which you did
> not supply, so you are comparing apples to oranges.
> 
> Note supplying -ffast-math will have gcc pull the division out
> of the loop, which should speed up your program with some loss of
> precision.
> 
> Thanks,
> Andrew Pinski
> 



Re: speed of double-precision divide

2010-01-24 Thread Steve White
Richard,

Could you provide us with a good reference for the latencies and other
speed issues of SSE operations?  What I've found is scattered and hard
to compare.

Frankly, I was under the misconception that each of these SSE operations
was meant to be accomplished in a single clock cycle (although I knew there
are various other issues).
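
A rough back-of-envelope with the numbers you quote below (AMD Fam10
figures applied to my Yonah, so take it only as indicative), using
size = 2048 and ITERATIONS = 131072 from the test:

  packed ops per pass:  size / 2      =  1024
  passes:               ITERATIONS    =  131072
  packed ops in total:                ~  1.3e8

  divpd at ~1 per 17-20 cycles        ~  2.3e9 to 2.7e9 cycles
  mulpd at ~1 per cycle               ~  1.3e8 cycles

That is roughly a factor of 20 in throughput, which at least is consistent
with the order-of-magnitude gap I measured between the divide and multiply
versions of the test.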

Cheers!

On 23.01.10, Richard Guenther wrote:
> On Sat, Jan 23, 2010 at 6:33 PM, Steve White  wrote:
> > Hi, Andrew!
> >
...
> >
> > Nevermind icc for the moment, with whatever trick it may be doing.
> > Why is the SSE2 division so slow, compared to multiplication?
> >
> > Change one character in the division test to make a multiplication test.
> > It is an order of magnitude difference in speed.
> 
> It's because multiplication latency is like 4 cycles while division is about
> 20; also, one multiplication can be issued per cycle, while only every
> 17th instruction can be a division (AMD Fam10 values).
> 
> GCC performs loop interchange with -ftree-loop-linear but the pass
> is scheduled in an unfortunate place so no further optimization happens.
> 
> Richard.
> 
