https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Feb 2022, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
>
> --- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #12)
> > Just as data-point on znver2 Uros testcase shows
> >
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
> > rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out
> > 19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata
> > 1528maxresident)k
> > 0inputs+0outputs (0major+76minor)pagefaults 0swaps
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod
>
> You should use -fno-builtin-fmodf in the above compile flags.

Oops, yes.  Then the glibc version is

22.53user 0.00system 0:22.53elapsed 99%CPU (0avgtext+0avgdata
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so indeed for float the x87 inline version is faster when benchmarked
this way.  For double it's

19.31user 0.00system 0:19.31elapsed 99%CPU (0avgtext+0avgdata
1536maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps

vs.

18.47user 0.00system 0:18.47elapsed 99%CPU (0avgtext+0avgdata
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so glibc is a bit faster here while the x87 version is of course
similar.

Avoiding the libcall can of course avoid spilling SSE regs around
the call.

So what remains is really the special case in blender doing
fmod (x, 1.) which can eventually be optimized with SSE4.