optimizing predictable branches on x86

2008-02-25 Thread Nick Piggin
Hi list,

gcc-4.3 appears to make quite heavy use of cmov to eliminate
conditional branches on the x86(-64) architecture, even for
branches that are determined to be predictable.

The problem with this is that the data dependency introduced
by the cmov can restrict execution, whereas a correctly
predicted branch will not. The Intel optimization manual says
not to use cmov for predictable branches.

I have a test case which I /believe/ demonstrates that a cond
jump is not a great deal slower in the case where no data
dependency hazards are hit, and can be quite a bit faster
in the case where there is a dependency -- if the branch is
predictable. It also shows that if the branch is unpredictable
then cmov can be a good win (as expected).

Compiled with:
gcc-4.3 -O3 -falign-functions=64 -falign-loops=64 -falign-jumps=64
-falign-labels=64 -march=core2/opteron

Opteron:
 no deps,   predictable -- C    code took   8.92ns per iteration
 no deps,   predictable -- cmov code took   9.09ns per iteration
 no deps,   predictable -- jmp  code took   9.60ns per iteration [1]
has deps,   predictable -- C    code took  20.04ns per iteration
has deps,   predictable -- cmov code took  18.09ns per iteration
has deps,   predictable -- jmp  code took  14.97ns per iteration [2]
 no deps, unpredictable -- C    code took   8.92ns per iteration
 no deps, unpredictable -- cmov code took   9.09ns per iteration
 no deps, unpredictable -- jmp  code took  15.19ns per iteration [3]
has deps, unpredictable -- C    code took  32.07ns per iteration
has deps, unpredictable -- cmov code took  33.15ns per iteration
has deps, unpredictable -- jmp  code took  69.04ns per iteration [4]

[1] jmp is slightly slower, can it be improved?
[2] nice improvement here
[3] mispredict penalty, cmov is a big win
[4] mispredict which includes load missing cache!

Core2:
 no deps,   predictable -- C    code took   4.24ns per iteration
 no deps,   predictable -- cmov code took   4.27ns per iteration
 no deps,   predictable -- jmp  code took   4.24ns per iteration
has deps,   predictable -- C    code took   6.04ns per iteration
has deps,   predictable -- cmov code took   6.59ns per iteration
has deps,   predictable -- jmp  code took   5.04ns per iteration
 no deps, unpredictable -- C    code took   4.24ns per iteration
 no deps, unpredictable -- cmov code took   4.25ns per iteration
 no deps, unpredictable -- jmp  code took   8.79ns per iteration
has deps, unpredictable -- C    code took  49.78ns per iteration
has deps, unpredictable -- cmov code took  50.28ns per iteration
has deps, unpredictable -- jmp  code took  75.67ns per iteration

Core2 follows a similar pattern, although it's not seeing any
slowdown in the "no deps, predictable, jmp" case like K8 does.

Any comments? (please cc me) Should gcc be using conditional jumps
more often, e.g. in the case of __builtin_expect()?

Thanks,
Nick
//#define noinline __attribute__((noinline))
#define noinline
#define likely(x) __builtin_expect(!!(x), 1)

static noinline int test_cmov(int a, int b)
{
	int ret;

	/* a is overwritten by the cmov, so it must be a read-write
	 * operand, not a plain input; flags are clobbered by cmpl */
	asm volatile (
		"cmpl	%1, %2\n\t"
		"cmovle	%2, %1\n\t"
		"movl	%1, %0\n\t"
		: "=r" (ret), "+r" (a)
		: "r" (b)
		: "cc");

	return ret;
}

static noinline int test_jmp(int a, int b)
{
	int ret;

	asm volatile (
		"cmpl	%1, %2\n\t"
		"jle	1f\n\t"
		"movl	%1, %0\n\t"
		"2:\n\t"
		".subsection 1\n\t"
		"1:\n\t"
		"movl	%2, %0\n\t"
		"jmp	2b\n\t"
		".previous\n\t"
		: "=r" (ret)
		: "r" (a), "r" (b)
		: "cc");

	return ret;
}

static noinline int test_c(int a, int b)
{
	int ret;

	if (likely(a < b))
		ret = a;
	else
		ret = b;

	return ret;
}

#define ITERS	(100*1024*1024)	/* assumed value; the constant in the archived post appears truncated */
#define SIZE	(4*1024*1024)
static int array1[SIZE];
static int array2[SIZE];
static volatile int result[SIZE];

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <assert.h>

static void print_time(const char *str, struct timeval *s, struct timeval *e, unsigned long iters)
{
	unsigned long long us, ns;
	double ns_iter;

	us = 1000000ULL*(e->tv_sec - s->tv_sec) + (e->tv_usec - s->tv_usec);
	ns = us * 1000;
	ns_iter = (double)ns / iters;

	printf("%s took % 6.2lfns per iteration\n", str, ns_iter);
}

int main(void)
{
	struct timeval s, e;
	int i;

	assert(test_cmov(1, 2) == test_c(1, 2));
	assert(test_cmov(1, 1) == test_c(1, 1));
	assert(test_cmov(2, 1) == test_c(2, 1));
	assert(test_jmp(1, 2) == test_c(1, 2));
	assert(test_jmp(1, 1) == test_c(1, 1));
	assert(test_jmp(2, 1) == test_c(2, 1));

	for (i = 0; i < SIZE; i++) {
		array1[i] = i;
		array2[i] = i + 1;
		result[i] = 0;
	}

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_c(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps,   predictable -- Ccode", &s, &e, ITERS);

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_cmov(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps,   predictable -- cmov code", &s, &e, ITERS);

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_jmp(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps,   predictable -- jmp  code", &s, &e, ITERS);

Re: optimizing predictable branches on x86

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 21:14, Jan Hubicka wrote:
> Hi,
>
> > Core2 follows a similar pattern, although it's not seeing any
> > slowdown in the "no deps, predictable, jmp" case like K8 does.
> >
> > Any comments? (please cc me) Should gcc be using conditional jumps
> > more often eg. in the case of __builtin_expect())?
>
> The problem is that in general GCC's branch prediction algorithms are
> very poor at predicting the predictability of a branch: they are pretty
> good at guessing the outcome, but that is it.

Yes, I guess this would be tricky. I wonder if there would be any use
in having a __builtin_predictable() type of hint. I know there
are cases where we could use it in Linux (eg. we have a lot of
tunable things, but usually they aren't changed often). Then again,
maybe the benefit of such annotations would be too small
to bother about.

Linux generally has reasonable likely/unlikely annotations, so I
guess we can first wait and see whether predictable-branch optimization
gives any benefit there (or for those doing benchmarks with profile
feedback).

> Only cases we do so quite reliably IMO are:
>   1) loop branches that are not interesting for cmov conversion
>   2) branches leading to noreturn calls, also not interesting
>   3) builtin_expect mentioned.
>   4) when profile feedback is around to some degree (ie we know when the
>   branch is very likely or very unlikely. We don't simulate what
>   hardware will do on it).

At least on x86 it should also be a good idea to know which way
the branch is going to go: x86 has no explicit branch hints, so
you really want to be able to optimize for the cold branch
predictor case when converting from cmov to conditional branches.


> I guess we can implement the machinery for 3 and 4 (in fact once
> I played with adding an EDGE_PREDICTABLE_P predicate that basically
> tested if the estimated probability of the branch is <5% or >95%) but
> never got really noticeable improvements out of it and gave up.
>
> It was before Core2 times, so it might be helping now.  But it needs
> updating for the backend cost interface as ifcvt is a bit inflexible in
> this.
> I had BRANCH_COST and PREDICTABLE_BRANCH_COST macros.

cmov performance seems to be pretty robust (I was surprised it is so
good), so it definitely seems like the right thing to do by default.
It will be hard to beat, but I hope there is room for some improvement.

Thanks,
Nick



Re: optimizing predictable branches on x86

2008-03-02 Thread Nick Piggin
On Wednesday 27 February 2008 03:06, J.C. Pizarro wrote:
> Compiling and executing the code of Nick Piggin at
> http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html
>
> in my old Athlon64 Venice 3200+ 2.0 GHz,
> 3 GiB DDR400, 32-bit kernel, gcc 3.4.6, I got
>
> $ gcc -O3 -falign-functions=64 -falign-loops=64 -falign-jumps=64
> -falign-labels=64 -march=i686 foo.c -o foo
> $ ./foo
>  no deps,   predictable -- C    code took   10.08ns per iteration
>  no deps,   predictable -- cmov code took   11.07ns per iteration
>  no deps,   predictable -- jmp  code took   11.25ns per iteration
> has deps,   predictable -- C    code took   26.66ns per iteration
> has deps,   predictable -- cmov code took   35.44ns per iteration
> has deps,   predictable -- jmp  code took   18.89ns per iteration
>  no deps, unpredictable -- C    code took   10.17ns per iteration
>  no deps, unpredictable -- cmov code took   11.07ns per iteration
>  no deps, unpredictable -- jmp  code took   22.51ns per iteration
> has deps, unpredictable -- C    code took  104.02ns per iteration
> has deps, unpredictable -- cmov code took  107.19ns per iteration
> has deps, unpredictable -- jmp  code took  176.18ns per iteration

Thanks for the numbers... just be careful, sometimes the numbers
are a bit funny: e.g. "C" should be very similar code to "cmov", but
sometimes the numbers vary more than I'd like (e.g. the 26 vs 35ns
case), which I guess is due to gcc having better control over code
generation in the C case. This should apply to jmp as well, and
indeed if I compile for i586 (without cmov), then the C code uses a
jump, and can be even slightly faster than my jmp asm.

So this is really a demonstration / guideline only, and I think it
would have to be implemented natively in gcc before you can look
really closely at the numbers.

Thanks,
Nick




Re: [RFA] optimizing predictable branches on x86

2008-03-03 Thread Nick Piggin
On Monday 03 March 2008 22:38, Jan Hubicka wrote:
> Hi,
> I had to tweak the testcase a bit to not compute minimum: GCC optimizes
> this early into MIN_EXPR, throwing away any profile information.  If we
> get serious here we can maintain it via histogram, but I am not sure it
> is worth the effort, at least until the IL is sanitized and expansion is
> cleaned up on the tuples branch.
>
> I also had to fix a bug in branch prediction ignoring __builtin_expect
> in any early-inlined function, and update your testcase to not use
> __builtin_expect in the predictable case.

I guess you mean, not to use it in the _unpredictable_ case?


> However this is what I get on AthlonXP:
>  no deps,   predictable -- C    code took   13.71ns per iteration
>  no deps,   predictable -- cmov code took   13.83ns per iteration
>  no deps,   predictable -- jmp  code took   13.94ns per iteration
> has deps,   predictable -- C    code took   15.54ns per iteration
> has deps,   predictable -- cmov code took   22.21ns per iteration
> has deps,   predictable -- jmp  code took   16.55ns per iteration
>  no deps, unpredictable -- C    code took   13.99ns per iteration
>  no deps, unpredictable -- cmov code took   13.76ns per iteration
>  no deps, unpredictable -- jmp  code took   26.12ns per iteration
> has deps, unpredictable -- C    code took  120.37ns per iteration
> has deps, unpredictable -- cmov code took  120.76ns per iteration
> has deps, unpredictable -- jmp  code took  165.82ns per iteration

At least for the __builtin_expect case, I guess this is showing
that gcc now does exactly what we'd like of it.


> The patch is quite SPEC neutral, saving 190Kb in FDO binaries.  Still I
> think it is worthwhile to have, especially because I do believe that all
> the target COST predicates should be populated by a hotness argument so
> we get the same results for -Os or for -O2 with profile feedback
> specifying that nothing is executed, or if one marks all functions cold.
> At the moment profile feedback with all functions not executed leads to
> code smaller than -O2 but closer to -O2 than -Os, so there is quite some
> fruit here. With LTO, or for codebases with more __builtin_expect and
> cold hints like the kernel or libstdc++, we can get a lot of these
> benefits without FDO too.

I hope so too. For the kernel we have some parts where
__builtin_expect is used quite a lot and noticeably helps, and it could
help even more if we cut down the use of cmov too. I guess on
architectures with even more predicated instructions it could be
even more useful.

Thanks,
Nick



Re: [RFA] optimizing predictable branches on x86

2008-03-03 Thread Nick Piggin
On Tuesday 04 March 2008 00:01, Jan Hubicka wrote:
> > On Monday 03 March 2008 22:38, Jan Hubicka wrote:

> > I hope so too. For the kernel we have some parts where
> > __builtin_expect is used quite a lot and noticeably helps, and it
> > could help even more if we cut down the use of cmov too. I guess on
> > architectures with even more predicated instructions it could be
> > even more useful.
>
> Looking at the kernel's __builtin_expect usage, I think we ought to do
> a lot better at taking the hints than we do now.  In particular we
> should be able to derive more from
>  likely (test1||test2) and likely (test1&&test2)
> and similar cases.  I will try to prepare a patch for this later too.

Yes, it is used for multiple conditions like that. I would also like it
to work nicely for mixed hints such as:

if (likely(test1) && unlikely(test2))

(and similarly with ||). It may seem unusual, but I have used it before,
and the __builtin_expect is also often hidden inside macros/inlines.