[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

rguenth at gcc dot gnu.org Tue, 07 Jun 2016 02:15:30 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-06-07
                 CC|                            |jamborm at gcc dot gnu.org,
                   |                            |mliska at suse dot cz
          Component|other                       |tree-optimization
             Blocks|                            |53947
            Summary|2x slower than clang        |2x slower than clang
                   |summing small float array   |summing small float array,
                   |                            |GCC should consider larger
                   |                            |vectorization factor for
                   |                            |"unrolling" reductions
     Ever confirmed|0                           |1

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
An interesting observation is that IPA-CP clones sum32 for n == 1024 but,
for some unknown reason, considers the alignment of 'a' unusable:

Lattices:
  Node: main/35:
  Node: sum32/34:
    param [0]: VARIABLE
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE
    param [1]: VARIABLE
               1024 [from: 35(99000)] [loc_time: 65, loc_size: 10, prop_time: 0, prop_size: 0]
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE

Evaluating opportunities for sum32/34.
 - considering value 1024 for param #1 n (caller_count: 1)
     good_cloning_opportunity_p (time: 65, size: 10, freq_sum: 99000) -> evaluation: 643500, threshold: 500
  Creating a specialized node of sum32/34.
    replacing param #1 n with const 1024
                Accounting size:7.00, time:72.78 on predicate:(true)
                Accounting size:3.00, time:2.00 on new predicate:(not inlined)
     the new node is sum32.constprop/43.


If 'noinline' also disables IPA-CP cloning in LLVM, the testcase should add
'noclone' as well to make the comparison fair.
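
As a minimal sketch of what that would look like: the testcase itself is not
quoted in this comment, so the signature of sum32 below is an assumption based
only on the dump (param 0 is 'a', param 1 is 'n'), and the NO_IPA macro is a
hypothetical helper name.  'noclone' is a GCC attribute, so the sketch guards
it for compilers that do not recognize it.

```c
#include <assert.h>
#include <stddef.h>

/* NO_IPA is a hypothetical helper macro, not part of the original testcase.
   'noclone' is GCC-specific; clang only gets 'noinline' here. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_IPA __attribute__((noinline, noclone))
#else
#define NO_IPA __attribute__((noinline))
#endif

/* Assumed signature: the dump only tells us the parameter names. */
NO_IPA float sum32(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

With both attributes GCC should neither inline sum32 into main nor create a
constant-propagated clone for n == 1024 like the sum32.constprop node in the
dump above.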

The vectorizer decides to peel the loop for alignment (as usual...) and thus
creates both a prologue and an epilogue loop.  That shouldn't matter in practice
but it likely obfuscates the code enough to make the job of IVOPTs harder.
If the intent was to have nothing known about alignment and 'n' in sum32,
the above cannot be avoided anyway.  We also peel both the prologue and the
epilogue loop.

clang 3.6 (the one I have locally) doesn't peel for alignment and thus uses
unaligned loads and unrolls the loop by 2 only.  It doesn't do any IPA CP
with -Ofast.

Note that the difference between clang's unrolling and GCC's unrolling is that
clang uses two accumulators while GCC just processes multiple loads on
the same accumulator with its unrolling.  Thus clang can exploit parallelism
in the CPU pipeline while GCC serializes the additions through the dependence
chain on the single accumulator.
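
The difference can be sketched in scalar C.  This is an illustration only, not
the code either compiler emits: in the real case the accumulators are vector
registers, and the function names here are hypothetical.  Splitting a float
sum across accumulators changes the order of the additions, which is why a
compiler only does it when reassociation is allowed (for example with -Ofast).

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by 2, one accumulator: every add depends on the previous one. */
float sum_one_acc(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float s = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s += a[i];
        s += a[i + 1];      /* must wait for the add above */
    }
    if (i < n)
        s += a[i];          /* leftover element for odd n */
    return s;
}

/* Unrolled by 2, two accumulators: the two adds per iteration are
   independent, so they can overlap in the pipeline. */
float sum_two_acc(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];     /* does not depend on s0 */
    }
    if (i < n)
        s0 += a[i];         /* leftover element for odd n */
    return s0 + s1;         /* combine the partial sums once at the end */
}
```

Both variants do the same number of additions in the loop; only the length of
the dependence chain differs.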

This means that it is the vectorizer that needs to consider using a larger
vectorization factor rather than post-vectorization unrolling (that's likely
to pay off only for reductions).  I wonder what LLVM's heuristic is here.

The IPA-CP alignment question remains though.  Martins?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
