[Bug tree-optimization/72517] [7 Regression] 436.cactusADM: More than 40% regression in O3 and Ofast on AMD bdver4 m/c.

rguenth at gcc dot gnu.org Tue, 26 Jul 2016 03:02:10 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72517


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2016-07-26
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that the patch should strictly allow more vectorization, _especially_ in
the case where we see strided group stores.  For the strided group store case
we extract sub-vectors that might not be natively supported (like V2SI) and
then store them.

gcc.dg/vect/slp-45.c

contains a parametric testcase that lets you explore things.
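
For readers unfamiliar with the term, a strided group store has roughly the
following shape.  This is only a minimal sketch: the function name, element
type and group size of two are hypothetical and are not copied from slp-45.c.

```c
#include <stddef.h>

/* Hypothetical example, not taken from slp-45.c: each iteration stores a
   group of two adjacent elements, and consecutive groups are `stride`
   elements apart, where `stride` is only known at run time.  */
void
store_pairs_strided (const unsigned char *restrict in,
                     unsigned char *restrict out,
                     size_t n, size_t stride)
{
  for (size_t i = 0; i < n; ++i)
    {
      out[i * stride + 0] = in[2 * i + 0];
      out[i * stride + 1] = in[2 * i + 1];
    }
}
```

If the loads are vectorized as one wide vector, each two-element group has to
be extracted from it as a sub-vector before it can be stored, which is the
situation described above.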

I wonder what the difference is for -fopt-info-vec before/after the revision
and which of the now vectorized loops are in hot areas of the benchmark.

On x86_64 I see that for example extracting V2QI from V16QI works just fine:

        pextrw  $1, %xmm0, %ecx
        movw    %cx, (%rdx)

but for some reason extracting QImode from V16QI causes us to spill to the
stack (in an awkward way due to char aliasing?):

        movdqa  (%rdi), %xmm0
        movslq  %edx, %rdx
        movaps  %xmm0, 392(%rsp)
        movzbl  392(%rsp), %eax
        movaps  %xmm0, 616(%rsp)
        movaps  %xmm0, 600(%rsp)
        movaps  %xmm0, 584(%rsp)
        movb    %al, (%rsi)
        movzbl  617(%rsp), %eax
        movzbl  587(%rsp), %ecx
        movdqa  16(%rdi), %xmm1
        movb    %al, 1(%rsi)
...

(in the slp-45.c testcase this is the foo_char_3 function).

Extracting V4QI from V16QI also results in awkward code.

I suspect that both the expander and the target are to blame (and the
vectorizer for its coarse cost model with just a single cost for
vector-to-element).  Similar to vec_construct, vec_extract is designed
for element extracts, not sub-vector extracts.  Expansion does some
tricks with punning the source vector but only to supported modes.
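
The punning idea can be sketched in plain C.  This is an illustration under
stated assumptions, not the expander's actual code, and both helper names are
hypothetical.  A two-byte group of a 16-byte vector is the same bits as one
16-bit lane of that vector viewed as eight 16-bit lanes, which is why the V2QI
case above can come out as a single pextrw.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch two-byte group number `group` (0..7) from a
   16-byte vector by viewing the same bits as eight 16-bit lanes.  */
uint16_t
extract_pair_via_pun (const uint8_t bytes[16], unsigned group)
{
  uint16_t lanes[8];
  memcpy (lanes, bytes, sizeof lanes);  /* well-defined type pun */
  return lanes[group];
}

/* Hypothetical helper: store the extracted group back out as two bytes,
   as a two-byte sub-vector store would.  */
void
store_pair (uint8_t *dst, uint16_t pair)
{
  memcpy (dst, &pair, sizeof pair);
}
```

Calling extract_pair_via_pun (v, 1) followed by store_pair (p, ...) writes
bytes 2 and 3 of v to p[0] and p[1], independent of endianness, because both
steps copy the bytes unchanged.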

That said, slp-45.c contains plenty of examples that create non-profitable
vectorizations, but knowing the case in question for cactusADM would be nice.
