http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29874
--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-08 10:03:22 UTC ---

I raised the number of FFTs to 10000000 and get

         -O2     -O3     -O3 -ffast-math   -O3 -ffast-math -funroll-loops
3.3-H    7.32    7.47    7.48              7.39
4.1      7.21    7.22    7.18              7.21
4.3      7.21    7.20    7.20              7.34
4.5      7.27    7.27    7.21              7.34
4.6      7.09    7.06    7.01              7.16

I don't have a 64bit 3.4 compiler handy, but 3.3-H is the hammer branch so should be close to 3.4.

Thus I can't reproduce the slowdown (but I don't have a real 3.4) and 4.6 looks promising here. The generated code looks quite good, though we still have some stack spills left (not sure if due to required temporaries).

ICC 12.0 does not manage to come close to the above performance, the best I found was -fast -xHOST which makes the benchmark take 7.30.
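
To make the comparison in the timings above easier to read, here is a minimal Python sketch that expresses each compiler's result as a percent change relative to 3.3-H for the same flag set. The numbers are copied from the comment; the names `FLAGS`, `TIMES` and `relative_change` are made up for this illustration, and the unit of the timings is not stated in the report (lower is assumed to be better).

```python
# Timings copied from the comment above; unit not stated, lower assumed better.
FLAGS = ["-O2", "-O3", "-O3 -ffast-math", "-O3 -ffast-math -funroll-loops"]
TIMES = {
    "3.3-H": [7.32, 7.47, 7.48, 7.39],
    "4.1":   [7.21, 7.22, 7.18, 7.21],
    "4.3":   [7.21, 7.20, 7.20, 7.34],
    "4.5":   [7.27, 7.27, 7.21, 7.34],
    "4.6":   [7.09, 7.06, 7.01, 7.16],
}


def relative_change(baseline, version):
    """Percent change of `version` vs `baseline` per flag set (negative = faster)."""
    return [
        100.0 * (t - b) / b
        for b, t in zip(TIMES[baseline], TIMES[version])
    ]


if __name__ == "__main__":
    # One row per compiler, one column per flag set (same order as FLAGS).
    for version in TIMES:
        changes = relative_change("3.3-H", version)
        print(version.ljust(6), "  ".join(f"{c:+6.2f}%" for c in changes))
```

Read this way, 4.6 comes out roughly 3 to 6 percent faster than 3.3-H on every flag set, which matches the "can't reproduce the slowdown" conclusion. These are single runs, so small differences may well be noise.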