https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921
--- Comment #16 from Arjan van de Ven <arjan at linux dot intel.com> ---
A comparable testcase (but optimized to generate smaller asm) is this:

    #include <algorithm>

    void RELU(float *buffer, int size)
    {
        float *ptr = (float *) __builtin_assume_aligned(buffer, 64);
        int i;

        for (i = 0; i < (size * 8); i++) {
            float f = ptr[i];
            ptr[i] = std::max(float(0), f);
        }
    }

Without vectorization, this generates the following core asm on x86 (-mavx2):

    vmovss  (%rdi), %xmm0
    addq    $4, %rdi
    vmaxss  %xmm1, %xmm0, %xmm0
    vmovss  %xmm0, -4(%rdi)
    cmpq    %rax, %rdi
    jne     .L6

but with vectorization enabled one gets

    vmovaps (%rdi), %ymm1
    addl    $1, %eax
    addq    $32, %rdi
    vcmpltps        %ymm1, %ymm2, %ymm0
    vandps  %ymm1, %ymm0, %ymm0
    vmovaps %ymm0, -32(%rdi)
    cmpl    %eax, %esi
    ja      .L4

In other words, the compiler trusts vmax[sp]s in the non-vector case, but does not trust it in the vector case without -ffast-math.

When adding -ffast-math to the vectorized case, one gets

    vmaxps  (%rdi), %ymm1, %ymm0
    addl    $1, %eax
    addq    $32, %rdi
    vmovaps %ymm0, -32(%rdi)
    cmpl    %eax, %esi
    ja      .L4

as the core loop, which is the expected outcome for this case.

I will make the argument that GCC is wrong not to trust vmaxps in the vectorization case on x86, because it clearly trusts vmaxss in the non-vector case.

The attached patch will make this so, but it does so for all architectures, not just x86. I will seek help to turn this into a proper patch, but wanted to put it here for now to keep track of it.
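
To make the comparison concrete, here is a minimal scalar sketch (not part of the attached patch) that models the three per-element computations discussed in this comment and checks that they agree bit for bit, including for NaN and -0.0 inputs. The helper names (relu_std_max, relu_cmp_and, relu_x86_max, same_bits, check_relu_models) are hypothetical. The model of the x86 max instruction assumes that the zero constant is the second source operand in Intel operand order, which is how the scalar loop appears to use it; the register holding zero is not shown in the asm, so that is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// What the source loop computes per element.
// std::max(a, b) is defined as (a < b) ? b : a.
inline float relu_std_max(float f) {
    return std::max(float(0), f);
}

// Hypothetical scalar model of the vcmpltps + vandps pair:
// build an all-ones mask where 0 < f, then AND it with the bits of f.
inline float relu_cmp_and(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::uint32_t mask = (0.0f < f) ? 0xFFFFFFFFu : 0u;
    bits &= mask;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Hypothetical scalar model of x86 MAXSS/MAXPS with src1 = f, src2 = 0
// (assumed operand order): the result is src1 only when src1 > src2;
// otherwise (including NaN inputs and two zeros) it is src2.
inline float relu_x86_max(float f) {
    const float zero = 0.0f;
    return (f > zero) ? f : zero;
}

// Bit-exact equality, so that +0.0 and -0.0 are told apart.
inline bool same_bits(float a, float b) {
    std::uint32_t x, y;
    std::memcpy(&x, &a, sizeof x);
    std::memcpy(&y, &b, sizeof y);
    return x == y;
}

// Checks that all three models agree on ordinary and special inputs.
inline void check_relu_models() {
    const float inf = std::numeric_limits<float>::infinity();
    const float nan = std::numeric_limits<float>::quiet_NaN();
    const float samples[] = {1.5f, -2.0f, 0.0f, -0.0f, inf, -inf, nan};
    for (float f : samples) {
        const float expected = relu_std_max(f);
        assert(same_bits(relu_cmp_and(f), expected));
        assert(same_bits(relu_x86_max(f), expected));
    }
}
```

Under these assumptions the three forms produce identical results for every sample, which is consistent with the argument above. If the operands of the max instruction were swapped (zero as the first source), a NaN or -0.0 input would be returned instead of +0.0, so the operand order matters when the instruction is used without -ffast-math.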