http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #13 from Dominique d'Humieres <dominiq at lps dot ens.fr> 
2010-10-26 16:36:05 UTC ---
> This multiplication transformation is incorrect if the loop wraps  
> (unsigned always wraps; never overflows).

I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
here) which "works" for additions and multiplications, so if there is wrapping,
the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.
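The claim above can be checked with a small sketch in C. The names loop_sum,
tri, closed_sum and check are hypothetical and not from the bug report, and
64-bit unsigned operands are assumed. One caveat in the sketch: the division
by 2 is applied to whichever factor is even before multiplying, since halving
an already wrapped product would lose the top bit.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the loop from the report, in wrapping unsigned 64-bit arithmetic. */
static uint64_t loop_sum(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (; a < b; a++)
        sum += a;
    return sum;
}

/* n*(n-1)/2 modulo 2^64. Exactly one of n and n-1 is even, so halve that
 * factor first; the wrapping multiply then gives the correctly reduced value. */
static uint64_t tri(uint64_t n)
{
    return (n % 2 == 0) ? (n / 2) * (n - 1) : n * ((n - 1) / 2);
}

/* Closed form for the sum over [a, b), modulo 2^64. An empty range
 * (a >= b) gives 0, matching a loop that never executes. */
static uint64_t closed_sum(uint64_t a, uint64_t b)
{
    return (a < b) ? tri(b) - tri(a) : 0;
}

/* Assert that the loop and the closed form agree on one range. */
static void check(uint64_t a, uint64_t b)
{
    assert(loop_sum(a, b) == closed_sum(a, b));
}
```

For instance, check(0x8000000000000000ULL, 0x8000000000000004ULL) covers a
range whose true sum exceeds 2^64; both functions return the wrapped value 6.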

On my Core 2 Duo at 2.53 GHz with -Ofast, the run time is ~1.2 s for 2*10^9
elementary loop iterations, i.e. 0.6 ns per iteration or ~1.5 clock cycles per
iteration. For a perfect vectorization and no loop overhead, I would expect a
minimum of 0.5 clock cycles per iteration. If you get anything below this
number, it means that the loop

    for (; a < b; a++)
        sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm this by checking
whether the timing behaves as O(len) or not). Apparently clang does this
(valid) transformation, while gcc doesn't for any of the options I have tried.
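The O(len) check could be sketched as follows. The helper names sum_range and
seconds_for are hypothetical, and clock() is only a coarse CPU-time source, so
this is an illustration under those assumptions rather than a careful
benchmark.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* The loop under test: sum of a over [a, b) in unsigned 64-bit arithmetic. */
static uint64_t sum_range(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (; a < b; a++)
        sum += a;
    return sum;
}

/* Hypothetical helper: CPU seconds spent summing [0, len). The sum is stored
 * through `result` so that the compiler cannot discard the computation. */
static double seconds_for(uint64_t len, uint64_t *result)
{
    assert(result != NULL);
    clock_t start = clock();
    *result = sum_range(0, len);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for with len and then 2*len (for len around 10^9) should
roughly double the time if the loop is really executed; a time that stays
near zero for both suggests the compiler substituted the closed form.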

Note that if I write such a loop, it is because I am interested in the timing
of the loop, not in the result, which I have known for more than 40 years!
