https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71921

--- Comment #16 from Arjan van de Ven <arjan at linux dot intel.com> ---
A comparable testcase, reduced so that it generates smaller asm, is this:

#include <algorithm>
void RELU(float *buffer, int size)
{
        float *ptr = (float *) __builtin_assume_aligned(buffer, 64);
        int i;
        for (i = 0; i < (size * 8); i++) {
                float f = ptr[i];
                ptr[i] = std::max(float(0), f);
        }
}


Without vectorization, this generates the following core loop on x86
(-mavx2):

        vmovss  (%rdi), %xmm0
        addq    $4, %rdi
        vmaxss  %xmm1, %xmm0, %xmm0
        vmovss  %xmm0, -4(%rdi)
        cmpq    %rax, %rdi
        jne     .L6

With vectorization enabled, one gets:

        vmovaps (%rdi), %ymm1
        addl    $1, %eax
        addq    $32, %rdi
        vcmpltps        %ymm1, %ymm2, %ymm0
        vandps  %ymm1, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpl    %eax, %esi
        ja      .L4

In other words, the compiler trusts vmaxss in the non-vector case, but does
not trust vmaxps in the vector case unless -ffast-math is given.
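
To make the comparison concrete, here is a small portable model of the three
forms. It is only a sketch, not GCC code and not the attached patch. The helper
names (bits, relu_std_max, relu_cmp_and, x86_max_model, relu_vmax) are made up
for illustration. It assumes that the register holding the constant (%xmm1 in
the scalar loop, %ymm2 in the vector loop) contains 0.0, which the excerpts do
not show, and it models the x86 max rule as "src1 > src2 ? src1 : src2", so
the second source operand is returned for NaN inputs and when both inputs are
zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Bit pattern of a float, so +0.0, -0.0 and NaN can be told apart.
std::uint32_t bits(float x)
{
        std::uint32_t u;
        std::memcpy(&u, &x, sizeof u);
        return u;
}

// What the testcase asks for: std::max(a, b) is (a < b) ? b : a.
float relu_std_max(float f)
{
        return std::max(float(0), f);
}

// Model of one lane of vcmpltps + vandps: an all-ones mask where 0 < f,
// ANDed with the bits of f.
float relu_cmp_and(float f)
{
        std::uint32_t mask = (0.0f < f) ? 0xFFFFFFFFu : 0u;
        std::uint32_t r = bits(f) & mask;
        float out;
        std::memcpy(&out, &r, sizeof out);
        return out;
}

// Model of the x86 max rule (assumption stated above): the second source
// is returned unless src1 compares strictly greater.
float x86_max_model(float src1, float src2)
{
        return (src1 > src2) ? src1 : src2;
}

// Assumed operand order of the scalar loop: f is the first source and
// 0.0 is the second.
float relu_vmax(float f)
{
        return x86_max_model(f, 0.0f);
}

// True if all three forms produce bit-identical results for f.
bool all_forms_agree(float f)
{
        std::uint32_t want = bits(relu_std_max(f));
        return want == bits(relu_cmp_and(f)) && want == bits(relu_vmax(f));
}

// Spot checks on ordinary values and on the awkward ones.
void check_relu_forms()
{
        const float inf = std::numeric_limits<float>::infinity();
        const float nan = std::numeric_limits<float>::quiet_NaN();
        const float samples[] = { 1.5f, -2.0f, 0.0f, -0.0f, inf, -inf, nan };
        for (float f : samples)
                assert(all_forms_agree(f));
}
```

Under this model the max form with 0.0 as the second source agrees with the
compare-and-mask form on every sample, including NaN and -0.0. The operand
order matters: with the operands swapped, x86_max_model(0.0f, nan) returns the
NaN rather than 0.0.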

When -ffast-math is added to the vectorized case, one gets

        vmaxps  (%rdi), %ymm1, %ymm0
        addl    $1, %eax
        addq    $32, %rdi
        vmovaps %ymm0, -32(%rdi)
        cmpl    %eax, %esi
        ja      .L4

as the core loop, which is the expected outcome for this case.

I will make the argument that gcc is wrong not to trust vmaxps in the
vectorized case on x86, because it clearly trusts vmaxss in the non-vector case.
The attached patch makes this so, but does it for all architectures, not
just x86. I will seek help to turn this into a proper patch, but wanted to put
it here for now to keep track of it.
