optimizing predictable branches on x86
Hi list,

gcc-4.3 appears to make quite heavy use of cmov to eliminate conditional branches on the x86(-64) architecture, even for branches that are determined to be predictable. The problem with this is that the data dependency introduced by the cmov can restrict execution, whereas a correctly predicted branch will not. The Intel optimization manual says not to use cmov for predictable branches.

I have a test case which I /believe/ demonstrates that a conditional jump is not a great deal slower in the case where no data dependency hazards are hit, and can be quite a bit faster in the case where there is a dependency -- if the branch is predictable. It also shows that if the branch is unpredictable then cmov can be a good win (as expected).

Compiled with:

  gcc-4.3 -O3 -falign-functions=64 -falign-loops=64 -falign-jumps=64
    -falign-labels=64 -march=core2/opteron

Opteron:
   no deps, predictable   -- C code took     8.92ns per iteration
   no deps, predictable   -- cmov code took  9.09ns per iteration
   no deps, predictable   -- jmp code took   9.60ns per iteration [1]
  has deps, predictable   -- C code took    20.04ns per iteration
  has deps, predictable   -- cmov code took 18.09ns per iteration
  has deps, predictable   -- jmp code took  14.97ns per iteration [2]
   no deps, unpredictable -- C code took     8.92ns per iteration
   no deps, unpredictable -- cmov code took  9.09ns per iteration
   no deps, unpredictable -- jmp code took  15.19ns per iteration [3]
  has deps, unpredictable -- C code took    32.07ns per iteration
  has deps, unpredictable -- cmov code took 33.15ns per iteration
  has deps, unpredictable -- jmp code took  69.04ns per iteration [4]

[1] jmp is slightly slower; can it be improved?
[2] nice improvement here
[3] mispredict penalty, cmov is a big win
[4] mispredict which includes the load missing cache!
Core2:
   no deps, predictable   -- C code took     4.24ns per iteration
   no deps, predictable   -- cmov code took  4.27ns per iteration
   no deps, predictable   -- jmp code took   4.24ns per iteration
  has deps, predictable   -- C code took     6.04ns per iteration
  has deps, predictable   -- cmov code took  6.59ns per iteration
  has deps, predictable   -- jmp code took   5.04ns per iteration
   no deps, unpredictable -- C code took     4.24ns per iteration
   no deps, unpredictable -- cmov code took  4.25ns per iteration
   no deps, unpredictable -- jmp code took   8.79ns per iteration
  has deps, unpredictable -- C code took    49.78ns per iteration
  has deps, unpredictable -- cmov code took 50.28ns per iteration
  has deps, unpredictable -- jmp code took  75.67ns per iteration

Core2 follows a similar pattern, although it is not seeing any slowdown in the "no deps, predictable, jmp" case like K8 does.

Any comments? (please cc me) Should gcc be using conditional jumps more often, e.g. in the case of __builtin_expect()?

Thanks,
Nick

//#define noinline __attribute__((noinline))
#define noinline
#define likely(x) __builtin_expect(!!(x), 1)

static noinline int test_cmov(int a, int b)
{
	int ret;

	asm volatile (
		"cmpl %1, %2\n\t"
		"cmovle %2, %1\n\t"
		"movl %1, %0\n\t"
		: "=r" (ret) : "r" (a), "r" (b));

	return ret;
}

static noinline int test_jmp(int a, int b)
{
	int ret;

	asm volatile (
		"cmpl %1, %2\n\t"
		"jle 1f\n\t"
		"movl %1, %0\n\t"
		"2:\n\t"
		".subsection 1\n\t"
		"1:\n\t"
		"movl %2, %0\n\t"
		"jmp 2b\n\t"
		".previous\n\t"
		: "=r" (ret) : "r" (a), "r" (b));

	return ret;
}

static noinline int test_c(int a, int b)
{
	int ret;

	if (likely(a < b))
		ret = a;
	else
		ret = b;

	return ret;
}

#define ITERS 1
#define SIZE (4*1024*1024)
static int array1[SIZE];
static int array2[SIZE];
static volatile int result[SIZE];

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <assert.h>

static void print_time(const char *str, struct timeval *s, struct timeval *e,
			unsigned long iters)
{
	unsigned long long us, ns;
	double ns_iter;

	us = 1000000*(e->tv_sec - s->tv_sec) + (e->tv_usec - s->tv_usec);
	ns = us * 1000;
	ns_iter = (double)ns / iters;
	ns /= iters;

	printf("%s took % 6.2lfns per iteration\n", str, ns_iter);
}

int main(void)
{
	struct timeval s, e;
	int i;

	assert(test_cmov(1, 2) == test_c(1, 2));
	assert(test_cmov(1, 1) == test_c(1, 1));
	assert(test_cmov(2, 1) == test_c(2, 1));
	assert(test_jmp(1, 2) == test_c(1, 2));
	assert(test_jmp(1, 1) == test_c(1, 1));
	assert(test_jmp(2, 1) == test_c(2, 1));

	for (i = 0; i < SIZE; i++) {
		array1[i] = i;
		array2[i] = i + 1;
		result[i] = 0;
	}

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_c(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps, predictable -- C code", &s, &e, ITERS);

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_cmov(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps, predictable -- cmov code", &s, &e, ITERS);

	gettimeofday(&s, NULL);
	for (i = 0; i < ITERS; i++) {
		result[i%SIZE] = test_jmp(array1[i%SIZE], array2[i%SIZE]);
	}
	gettimeofday(&e, NULL);
	print_time(" no deps, predictable -- jmp code", &s, &e, ITERS);
Re: optimizing predictable branches on x86
On Tuesday 26 February 2008 21:14, Jan Hubicka wrote:
> Hi,
>
> > Core2 follows a similar pattern, although it's not seeing any
> > slowdown in the "no deps, predictable, jmp" case like K8 does.
> >
> > Any comments? (please cc me) Should gcc be using conditional jumps
> > more often, e.g. in the case of __builtin_expect()?
>
> The problem is that in general GCC's branch prediction algorithms are
> very poor at predicting the predictability of a branch: they are
> pretty good at guessing the outcome, but that is it.

Yes, I guess this would be tricky. I wonder if there would be use in having a __builtin_predictable() type of thing. I know there are cases where we could use it in Linux (e.g. we have a lot of tunable things, but usually they aren't changed often). Then again, maybe the benefit of doing these annotations would be too small to bother with.

Linux generally has reasonable likely/unlikely annotations, so I guess we can first wait to see if predictable-branch optimization gives any benefit there (or for those doing benchmarks with profile feedback).

> The only cases where we do so quite reliably IMO are:
> 1) loop branches, which are not interesting for cmov conversion
> 2) branches leading to noreturn calls, also not interesting
> 3) builtin_expect, as mentioned
> 4) when profile feedback is around, to some degree (i.e. we know when
>    the branch is very likely or very unlikely; we don't simulate what
>    the hardware will do on it)

At least on x86 it should also be a good idea to know which way the branch is going to go, because x86 doesn't have explicit branch hints: you really want to be able to optimize the cold-branch-predictor case when converting from cmov to conditional branches.

> I guess we can implement the machinery for 3 and 4 (in fact I once
> played with adding an EDGE_PREDICTABLE_P predicate that basically
> tested whether the estimated probability of the branch is <5% or >95%)
> but never got really noticeable improvements out of it and gave up.
> It was before Core2 times, so it might be helping now. But it needs
> updating for the backend cost interface, as ifcvt is a bit inflexible
> in this. I had BRANCH_COST and PREDICTABLE_BRANCH_COST macros.

cmov performance seems to be pretty robust (I was surprised it is so good), so it definitely seems like the right thing to do by default. It will be hard to beat, but I hope there is room for some improvement.

Thanks,
Nick
Re: optimizing predictable branches on x86
On Wednesday 27 February 2008 03:06, J.C. Pizarro wrote:
> Compiling and executing the code of Nick Piggin at
> http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html
>
> on my old Athlon64 Venice 3200+ 2.0 GHz,
> 3 GiB DDR400, 32-bit kernel, gcc 3.4.6, I got
>
> $ gcc -O3 -falign-functions=64 -falign-loops=64 -falign-jumps=64
>   -falign-labels=64 -march=i686 foo.c -o foo
> $ ./foo
>  no deps, predictable   -- C code took    10.08ns per iteration
>  no deps, predictable   -- cmov code took 11.07ns per iteration
>  no deps, predictable   -- jmp code took  11.25ns per iteration
> has deps, predictable   -- C code took    26.66ns per iteration
> has deps, predictable   -- cmov code took 35.44ns per iteration
> has deps, predictable   -- jmp code took  18.89ns per iteration
>  no deps, unpredictable -- C code took    10.17ns per iteration
>  no deps, unpredictable -- cmov code took 11.07ns per iteration
>  no deps, unpredictable -- jmp code took  22.51ns per iteration
> has deps, unpredictable -- C code took   104.02ns per iteration
> has deps, unpredictable -- cmov code took 107.19ns per iteration
> has deps, unpredictable -- jmp code took 176.18ns per iteration

Thanks for the numbers... just be careful, sometimes the numbers are a bit funny: e.g. "C" should be very similar code to "cmov", but sometimes the numbers vary more than I'd like (e.g. the 26 vs 35ns case), which I guess is due to gcc having better control over the code generation in the C case. This should apply to jmp as well, and indeed if I compile for i586 (without cmov), then the C code uses jmp, and can be even slightly faster than my jmp asm.

So this is really a demonstration / guideline only, and I think it would have to be implemented natively in gcc before you can look really closely at the numbers.

Thanks,
Nick
Re: [RFA] optimizing predictable branches on x86
On Monday 03 March 2008 22:38, Jan Hubicka wrote:
> Hi,
> I had to tweak the testcase a bit to not compute a minimum: GCC
> optimizes this early into MIN_EXPR, throwing away any profile
> information. If we get serious here we can maintain it via a
> histogram, but I am not sure it is worth the effort, at least until
> the IL is sanitized and expansion is cleaned up with the tuple branch.
>
> I also had to fix a bug in branch prediction that ignored
> __builtin_expect in any early-inlined function, and update your
> testcase to not use __builtin_expect in the predictable case.

I guess you mean, not to use it in the _unpredictable_ case?

> However this is what I get on AthlonXP:
>  no deps, predictable   -- C code took    13.71ns per iteration
>  no deps, predictable   -- cmov code took 13.83ns per iteration
>  no deps, predictable   -- jmp code took  13.94ns per iteration
> has deps, predictable   -- C code took    15.54ns per iteration
> has deps, predictable   -- cmov code took 22.21ns per iteration
> has deps, predictable   -- jmp code took  16.55ns per iteration
>  no deps, unpredictable -- C code took    13.99ns per iteration
>  no deps, unpredictable -- cmov code took 13.76ns per iteration
>  no deps, unpredictable -- jmp code took  26.12ns per iteration
> has deps, unpredictable -- C code took   120.37ns per iteration
> has deps, unpredictable -- cmov code took 120.76ns per iteration
> has deps, unpredictable -- jmp code took 165.82ns per iteration

At least for the __builtin_expect case, I guess this is showing that gcc now does exactly what we'd like of it.

> The patch is quite SPEC-neutral, saving 190Kb in FDO binaries. Still I
> think it is worthwhile to have, especially because I believe that all
> the target COST predicates should be populated by a hotness argument,
> so we get the same results for -Os as for -O2 with profile feedback
> specifying that nothing is executed, or if one marks all functions
> cold.
> At the moment, profile feedback with all functions not executed leads
> to code smaller than -O2, but closer to -O2 than to -Os, so there is
> quite some fruit here. With LTO, or for codebases with more
> __builtin_expect and cold hints like the kernel or libstdc++, we can
> get a lot of these benefits without FDO too.

I hope so too. For the kernel we have some parts where __builtin_expect is used quite a lot and noticeably helps, and it could help even more if we cut down the use of cmov too. I guess on architectures with even more predicated instructions it could be even more useful.

Thanks,
Nick
Re: [RFA] optimizing predictable branches on x86
On Tuesday 04 March 2008 00:01, Jan Hubicka wrote:
> > On Monday 03 March 2008 22:38, Jan Hubicka wrote:
> > I hope so too. For the kernel we have some parts where
> > __builtin_expect is used quite a lot and noticeably helps, and could
> > help even more if we cut down the use of cmov too. I guess on
> > architectures with even more predicated instructions it could be
> > even more useful too.
>
> Looking at the kernel's __builtin_expect usage, I think we ought to do
> a lot better at taking the hints than we do now. In particular we
> should be able to derive more from
>   likely(test1 || test2) and likely(test1 && test2)
> and similar cases. I will try to prepare a patch for this later too.

Yes, it is used for multiple conditions like that. I would also like it to work nicely for:

  if (likely(test1) && unlikely(test2))

It may seem unusual, but I have used it before, and also often the __builtin_expect is hidden inside macros/inlines.