A cygwin hosted cross compiler to aarch64-linux, compiling a C version of linpack with -Ofast, produces code that runs 17% slower than the code from a linux hosted compiler. The problem shows up in the vect dump, where the cygwin compiler makes different vectorization decisions than the linux compiler. That happens because tree-vect-data-refs.c calls qsort in vect_analyze_data_ref_accesses, and the newlib and glibc qsort routines sort the list differently. I can reproduce the same problem on linux by adding the newlib qsort sources to a gcc build. For an x86_64 target, I see about a 30% performance loss using the newlib qsort.

The qsort trouble turns out to be a problem in the qsort comparison function, dr_group_sort_cmp. It does this:

  if (!operand_equal_p (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb), 0))
    {
      cmp = compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb));
      if (cmp != 0)
        return cmp;
    }

operand_equal_p calls STRIP_NOPS, so it will consider two trees to be the same even if they have NOP_EXPR. However, compare_tree is not calling STRIP_NOPS, so it handles trees with NOP_EXPRs differently than trees without. The result is that depending on which array entry gets used as the qsort pivot point, you can get very different sorts. The newlib qsort happens to be accidentally choosing a bad pivot for this testcase. The glibc qsort happens to be accidentally choosing a good pivot for this testcase. This then triggers a cascading problem in vect_analyze_data_ref_accesses, which assumes that array entries that pass the operand_equal_p test for the base address will end up adjacent, and will only vectorize in that case.

For a contrived example, suppose we have four entries to sort: (plus Y 8), (mult A 4), (pointer_plus Z 16), and (nop (mult A 4)).

Suppose we choose the mult as the pivot point. The plus sorts before because tree_code plus is less than mult. The pointer_plus sorts after for the same reason. The nop sorts equal. So we end up with plus, mult, nop, pointer_plus. The mult and nop are then combined into the same vectorization group.

Now suppose we choose the pointer_plus as the pivot point. The plus and mult sort before. The nop sorts after. The final result is plus, mult, pointer_plus, nop. And we fail to vectorize as the mult and nop are not adjacent as they should be.

When I modify compare_tree to call STRIP_NOPS, this problem goes away. I get the same sort from both the newlib and glibc qsort functions, and I get the same linpack performance from a cygwin hosted compiler and a linux hosted compiler.

This patch was tested with an x86_64 bootstrap and make check.
There were no regressions. I've also done a SPEC CPU2000 run with and without the patch on aarch64-linux, and there is no performance change. And I've verified the fix by building linpack for aarch64-linux with a cygwin hosted cross compiler, an x86_64 hosted cross compiler, and an aarch64 native compiler.

Jim

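P.S. For anyone who wants to play with the contrived four-entry example, here is a rough standalone sketch of it. Everything in it is a toy stand-in and not GCC code: the toy_* types and functions, the ordering of the toy codes, and the single stable partition pass (which only models one pivot step, not a full newlib or glibc qsort) are all assumptions made for illustration. The strip argument selects between the unpatched behavior (0) and something like the patched compare_tree (1).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for tree codes (hypothetical, not GCC's real enum).  The
   order follows the example: plus < mult < pointer_plus < nop.  */
enum toy_code { TOY_PLUS, TOY_MULT, TOY_POINTER_PLUS, TOY_NOP };

struct toy_tree
{
  enum toy_code code;
  int id;                       /* stands in for the operands, e.g. "A 4" */
  const struct toy_tree *inner; /* wrapped tree when code == TOY_NOP */
};

static const struct toy_tree *
toy_strip_nops (const struct toy_tree *t)
{
  while (t->code == TOY_NOP)
    t = t->inner;
  return t;
}

/* Models operand_equal_p: always looks through nops.  */
static int
toy_operand_equal (const struct toy_tree *a, const struct toy_tree *b)
{
  a = toy_strip_nops (a);
  b = toy_strip_nops (b);
  return a->code == b->code && a->id == b->id;
}

/* Models compare_tree: strips nops only when STRIP is nonzero.  */
static int
toy_compare_tree (const struct toy_tree *a, const struct toy_tree *b,
                  int strip)
{
  if (strip)
    {
      a = toy_strip_nops (a);
      b = toy_strip_nops (b);
    }
  if (a->code != b->code)
    return a->code < b->code ? -1 : 1;
  return (a->id > b->id) - (a->id < b->id);
}

/* Models the base address test in dr_group_sort_cmp.  */
static int
toy_group_cmp (const struct toy_tree *a, const struct toy_tree *b, int strip)
{
  if (!toy_operand_equal (a, b))
    return toy_compare_tree (a, b, strip);
  return 0;
}

/* One stable three-way partition of IN around IN[PIVOT]: entries that
   compare less, then equal, then greater, each in input order.  */
static void
toy_partition_once (const struct toy_tree **in, size_t n, size_t pivot,
                    int strip, const struct toy_tree **out)
{
  size_t k = 0;
  assert (pivot < n);
  for (int want = -1; want <= 1; want++)
    for (size_t i = 0; i < n; i++)
      if (toy_group_cmp (in[i], in[pivot], strip) == want)
        out[k++] = in[i];
}

/* Builds plus, mult, pointer_plus, nop(mult), partitions once around the
   entry at index PIVOT, and returns 1 if mult and nop end up adjacent.  */
int
toy_example_adjacent (size_t pivot, int strip)
{
  struct toy_tree plus = { TOY_PLUS, 1, NULL };
  struct toy_tree mult = { TOY_MULT, 2, NULL };
  struct toy_tree pplus = { TOY_POINTER_PLUS, 3, NULL };
  struct toy_tree nop = { TOY_NOP, 0, &mult };
  const struct toy_tree *in[4] = { &plus, &mult, &pplus, &nop };
  const struct toy_tree *out[4];

  toy_partition_once (in, 4, pivot, strip, out);
  for (size_t i = 0; i + 1 < 4; i++)
    if ((out[i] == &mult && out[i + 1] == &nop)
        || (out[i] == &nop && out[i + 1] == &mult))
      return 1;
  return 0;
}
```

Calling toy_example_adjacent (1, 0) uses the mult as the pivot with the unpatched comparison, toy_example_adjacent (2, 0) uses the pointer_plus as the pivot, and passing 1 as the second argument repeats either case with nops stripped in the tree comparison.
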
2015-11-19  Jim Wilson  <jim.wil...@linaro.org>

	* tree-vect-data-refs.c (compare_tree): Call STRIP_NOPS.

Index: tree-vect-data-refs.c
===================================================================
--- tree-vect-data-refs.c	(revision 230429)
+++ tree-vect-data-refs.c	(working copy)
@@ -2545,6 +2545,8 @@ compare_tree (tree t1, tree t2)
   if (t2 == NULL)
     return 1;
 
+  STRIP_NOPS (t1);
+  STRIP_NOPS (t2);
   if (TREE_CODE (t1) != TREE_CODE (t2))
     return TREE_CODE (t1) < TREE_CODE (t2) ? -1 : 1;