http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186
--- Comment #13 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-10-26 16:36:05 UTC ---
> This multiplication transformation is incorrect if the loop wraps
> (unsigned always wraps; never overflows).

I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
here), which "works" for additions and multiplications, so if there is
wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e., correctly
wrapped.

On my Core2duo 2.53GHz with -Ofast the run time is ~1.2s for 2*10^9
elementary loop iterations, i.e. 0.6ns per iteration or ~1.5 clock cycles per
iteration. For a perfect vectorization and no loop overhead, I would expect a
minimum of 0.5 clock cycle per iteration. If you get anything below this
number, it means that the loop

    for (; a < b; a++) sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm it by checking
whether the timing behaves as O(len) or not). Apparently clang does this
(valid) transformation while gcc doesn't for any of the options I have tried.

Note that if I write such a loop, it is because I am interested in the timing
of the loop, not in the result, which I have known for more than 40 years!
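
The equivalence claimed above can be checked with a small sketch in C. This is an illustration only, not code from the bug report or from either compiler: the names sum_loop, sum_closed and tri are hypothetical, and 64-bit unsigned operands (n=64) are assumed. One detail needs care: dividing by 2 does not commute with reduction modulo 2^n, so the sketch halves the even factor of n*(n-1) before multiplying, which keeps the division exact.

```c
#include <stdint.h>

/* Reference version: the loop from the comment, in wrapping 64-bit
 * unsigned arithmetic. */
uint64_t sum_loop(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (; a < b; a++)
        sum += a;
    return sum;
}

/* n*(n-1)/2 modulo 2^64. One of n and n-1 is even; halving that factor
 * before the (wrapping) multiplication keeps the division exact. */
static uint64_t tri(uint64_t n)
{
    uint64_t m = n - 1;          /* wraps when n == 0; the result is still 0 */
    if (n % 2 == 0)
        return (n / 2) * m;
    return n * (m / 2);
}

/* Closed form of the loop: (b*(b-1) - a*(a-1))/2 modulo 2^64.
 * Like the loop, it yields 0 when a >= b. */
uint64_t sum_closed(uint64_t a, uint64_t b)
{
    if (a >= b)
        return 0;
    return tri(b) - tri(a);      /* unsigned subtraction wraps modulo 2^64 */
}
```

Both functions should agree for any a and b, including ranges whose partial sums wrap. Timing sum_loop for several values of b-a is one way to see whether a compiler has kept the O(len) loop or replaced it with the closed form.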