[PATCH] fix vectorizer performance problem on cygwin hosted cross compiler

Jim Wilson Thu, 19 Nov 2015 23:22:35 -0800

A cygwin hosted cross compiler to aarch64-linux, compiling a C version
of linpack with -Ofast, produces code that runs 17% slower than code
from a linux hosted compiler.  The problem shows up in the vect dump,
where the cygwin compiler makes different vectorization decisions than
the linux compiler.  That happens because tree-vect-data-refs.c calls
qsort in vect_analyze_data_ref_accesses, and the newlib and glibc qsort
routines sort the list differently.  I can reproduce the same problem
on linux by adding the newlib qsort sources to a gcc build.  For an
x86_64 target, I see about a 30% performance loss using the newlib
qsort.


The qsort trouble turns out to be a problem in the qsort comparison
function, dr_group_sort_cmp.  It does this:
  if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
    {
      cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
      if (cmp != 0)
        return cmp;
    }
operand_equal_p calls STRIP_NOPS, so it considers two trees to be the
same even if they differ only by a NOP_EXPR.  However, compare_tree
does not call STRIP_NOPS, so it orders trees with NOP_EXPRs differently
than trees without.  The result is that, depending on which array entry
gets used as the qsort pivot point, you can get very different sorts.
The newlib qsort happens to choose a bad pivot for this testcase, and
the glibc qsort happens to choose a good one.  This then triggers a
cascading problem in vect_analyze_data_ref_accesses, which assumes that
array entries that pass the operand_equal_p test for the base address
will end up adjacent, and will only vectorize in that case.

For a contrived example, suppose we have four entries to sort: (plus Y
8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)).  Suppose we
choose the mult as the pivot point. The plus sorts before because
tree_code plus is less than mult. The pointer_plus sorts after for the
same reason. The nop sorts equal. So we end up with plus, mult, nop,
pointer_plus. The mult and nop are then combined into the same
vectorization group.  Now suppose we choose the pointer_plus as the
pivot point. The plus and mult sort before. The nop sorts after. The
final result is plus, mult, pointer_plus, nop. And we fail to
vectorize as the mult and nop are not adjacent as they should be.
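
The contrived example above can be modelled outside of GCC.  The
following is only a toy sketch: every toy_* name is hypothetical, the
enum merely assumes the relative tree_code order used in the example
(plus < mult < pointer_plus < nop), and the integer id stands in for
the operand comparison that the real compare_tree performs.
toy_partition does a single stable three-way partition around a chosen
pivot, which is enough to show the two different orderings for these
four entries.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for tree codes, in the order the example assumes.  */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

struct toy_tree
{
  enum toy_code code;
  const struct toy_tree *op0;	/* Wrapped operand for TOY_NOP, else NULL.  */
  int id;			/* Stand-in for comparing operands.  */
};

/* Models STRIP_NOPS: peel off enclosing TOY_NOP wrappers.  */
static const struct toy_tree *
toy_strip_nops (const struct toy_tree *t)
{
  while (t->code == TOY_NOP && t->op0 != NULL)
    t = t->op0;
  return t;
}

/* Models operand_equal_p: equality is decided after stripping nops.  */
static int
toy_operand_equal_p (const struct toy_tree *a, const struct toy_tree *b)
{
  a = toy_strip_nops (a);
  b = toy_strip_nops (b);
  return a->code == b->code && a->id == b->id;
}

/* Models the unpatched compare_tree: no stripping before comparing.  */
static int
toy_compare_tree (const struct toy_tree *a, const struct toy_tree *b)
{
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* Models the base address step of dr_group_sort_cmp.  */
static int
toy_group_cmp (const struct toy_tree *a, const struct toy_tree *b)
{
  if (!toy_operand_equal_p (a, b))
    return toy_compare_tree (a, b);
  return 0;
}

/* One stable three-way partition of IN around IN[PIVOT]: entries that
   compare less come first, then equal, then greater.  */
static void
toy_partition (const struct toy_tree **in, size_t n, size_t pivot,
	       const struct toy_tree **out)
{
  size_t k = 0;
  assert (pivot < n);
  for (int want = -1; want <= 1; want++)
    for (size_t i = 0; i < n; i++)
      {
	int c = toy_group_cmp (in[i], in[pivot]);
	if ((c > 0) - (c < 0) == want)
	  out[k++] = in[i];
      }
  assert (k == n);
}
```

Calling toy_partition on the four entries with the mult as the pivot
leaves the mult and the nop adjacent; with the pointer_plus as the
pivot, the nop lands after the pointer_plus instead.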

When I modify compare_tree to call STRIP_NOPS, this problem goes away.
I get the same sort from both the newlib and glibc qsort functions,
and I get the same linpack performance from a cygwin hosted compiler
and a linux hosted compiler.
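
As a rough illustration of why stripping helps, here is a toy model of
a comparison that strips nops before comparing.  It is a sketch under
stated assumptions, not the GCC code: all demo_* names are hypothetical,
the enum order is assumed as in the contrived example, and the id field
stands in for the real operand comparison.  Because a nop-wrapped tree
and its operand now compare equal, the ordering is consistent and the
C library qsort places them next to each other whichever pivot it
picks.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for tree codes, in the order the example assumes.  */
enum demo_code { DEMO_PLUS, DEMO_MULT, DEMO_POINTER_PLUS, DEMO_NOP };

struct demo_tree
{
  enum demo_code code;
  const struct demo_tree *op0;	/* Wrapped operand for DEMO_NOP, else NULL.  */
  int id;			/* Stand-in for comparing operands.  */
};

/* Models STRIP_NOPS: peel off enclosing DEMO_NOP wrappers.  */
static const struct demo_tree *
demo_strip_nops (const struct demo_tree *t)
{
  while (t->code == DEMO_NOP && t->op0 != NULL)
    t = t->op0;
  return t;
}

/* Models a compare_tree that strips nops from both sides first.  */
static int
demo_compare_tree (const struct demo_tree *a, const struct demo_tree *b)
{
  a = demo_strip_nops (a);
  b = demo_strip_nops (b);
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* qsort callback over an array of const struct demo_tree pointers.  */
static int
demo_qsort_cmp (const void *pa, const void *pb)
{
  const struct demo_tree *a = *(const struct demo_tree *const *) pa;
  const struct demo_tree *b = *(const struct demo_tree *const *) pb;
  return demo_compare_tree (a, b);
}

/* Sort V[0..N-1] in place with the C library qsort.  */
static void
demo_sort (const struct demo_tree **v, size_t n)
{
  assert (v != NULL || n == 0);
  qsort (v, n, sizeof *v, demo_qsort_cmp);
}
```

Since qsort is not required to be stable, the relative order of the
mult and the nop is unspecified, but they always end up adjacent.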

This patch was tested with an x86_64 bootstrap and make check.  There
were no regressions.  I've also done a SPEC CPU2000 run with and
without the patch on aarch64-linux, and there is no performance change.
And I've verified it by building linpack for aarch64-linux with a
cygwin hosted cross compiler, an x86_64 hosted cross compiler, and an
aarch64 native compiler.

Jim

2015-11-19  Jim Wilson  <jim.wil...@linaro.org>

	* tree-vect-data-refs.c (compare_tree): Call STRIP_NOPS.

Index: tree-vect-data-refs.c
===================================================================
--- tree-vect-data-refs.c	(revision 230429)
+++ tree-vect-data-refs.c	(working copy)
@@ -2545,6 +2545,8 @@ compare_tree (tree t1, tree t2)
   if (t2 == NULL)
     return 1;
 
+  STRIP_NOPS (t1);
+  STRIP_NOPS (t2);
 
   if (TREE_CODE (t1) != TREE_CODE (t2))
     return TREE_CODE (t1) < TREE_CODE (t2) ? -1 : 1;