https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497
Bug ID: 94497 Summary: Branchless clamp in the general case gets a branch in a particular case ? Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: grasland at lal dot in2p3.fr Target Milestone: --- (Triage note: I think this is probably a compiler middle-end or back-end issue, but I am not knowledgeable enough about the structure of the GCC codebase to pick the right component.) --- I am trying to make a floating-point computation autovectorization-friendly, without mandating the use of -ffast-math for optimal performance as that is a numerical stability and compiler portability hazard. This turned out to be an interesting exercise in IEEE-754 pedantry, of course, but I can live with that. However, while trying to optimize a "clamp" computation, I ended up at a point where the behavior of the GCC optimizer just does not make sense to me and I could use the opinion of an expert. Consider the following functions: ``` double fast_min(double x, double y) { return (x < y) ? x : y; } double fast_max(double x, double y) { return (x > y) ? x : y; } ``` The definitions of fast_min and fast_max are carefully crafted to match the semantics of x86's min and max instruction family, and indeed if I compile this code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions depending on which vector instruction sets are enabled. This is exactly what I wanted, so far I'm happy. And if I now try to use these min and max functions to write a clamp function... ``` double fast_clamp(double x, double min, double max) { return fast_max(fast_min(x, max), min); } ``` ...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short and sweet: ``` fast_clamp(double, double, double): minsd xmm0, xmm2 maxsd xmm0, xmm1 ret ``` Where this perfect picture becomes tainted, however, is as soon as I try to _use_ this function with certain min/max arguments. ``` double use_fast_clamp(double x) { return fast_clamp(x, 0.0, 1.0); } ``` All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3 mode! ``` use_fast_clamp(double): movapd xmm1, xmm0 movsd xmm0, QWORD PTR .LC0[rip] comisd xmm0, xmm1 jbe .L13 maxsd xmm1, QWORD PTR .LC1[rip] movapd xmm0, xmm1 .L13: ret .LC0: .long 0 .long 1072693248 .LC1: .long 0 .long 0 ``` I can make the generated code go back to a minsd/maxsd pair if I enable -ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations), but to the best of my knowledge, I shouldn't need fast-math flags here. Further, even if I did forget about an IEEE-754 oddity that requires fast-math flags, it would still mean that the above compilation of the general fast_clamp function is incorrect: if this compilation output should work for any pair of "min" and "max" double-precision arguments, then it trivially should work when the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer is doing something strange here. --- This is the most minimal example of this behavior that I managed to come up with. Using only the fast_min or fast_math functions in isolation will behave as expected and codegen into a single minsd or maxsd: ``` double use_fast_min(double x) { return fast_min(x, 1.0); } double use_fast_max(double x) { return fast_max(x, 0.0); } ``` I observed similar behavior on any GCC build I could get my hands on, all the way from the most recent GCC trunk build currently available on godbolt (10.0.1 20200405) to the most ancient build provided by godbolt (4.1.2). Both my local system and godbolt run are Linux-based. My local GCC build was configured with ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d --enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver --disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp --disable-libssp --disable-libvtv --disable-cet --disable-libcc1 --enable-plugin --with-bugurl=https://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos --enable-version-specific-runtime-libs --with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-9 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --with-build-config=bootstrap-lto-lean --enable-link-mutex --build=x86_64-suse-linux --host=x86_64-suse-linux As for godbolt builds, it is easy to go to godbolt.org and add a -v to the compiler options of the build you're interested in, so I will invite you to do that instead of cluttering this already long bug report further. --- FWIW, clang 10 behaves the way I would expect without fast-math flags (and also generates the zero in place with a xorpd instead of loading it from memory, which is kind of cool), but I'm well aware of the danger of comparing the floating-point behavior of various compiler optimizers. So I wouldn't read too much into that: ``` .LCPI5_0: .quad 4607182418800017408 # double 1 use_fast_clamp(double): # @use_fast_clamp(double) minsd xmm0, qword ptr [rip + .LCPI5_0] xorpd xmm1, xmm1 maxsd xmm0, xmm1 ret ``` If you like to experiment on godbolt too, here's my setup: https://godbolt.org/z/eD-guY .