vectorizable_slp_permutation_1 has two ways of generating the
permutations: one that looks for repeating patterns and one that
calculates the permutation index for every output element individually.
The former works for VLA and VLS whereas the latter only works for VLS.

There are two justifications for using the repeating code for VLS:
it gives more testing coverage, and it should reduce the analysis
overhead for common cases.  This PR kind-of demonstrates both:
the VLS coverage was showing a bug in the analysis shortcut.

The bug seems to go back to g:ab7e60cec1a6, which added the
repeating_p path.  It generated N copies of the permutation vector
in the repeating case, but didn't multiply the number of permutation
instructions for costing purposes by N.  So we seem to have been
undercounting ncopies>1 permutations all this time...

The problem became more visible with g:8157f3f2d211, which extended
the repeating code to handle more cases.

In the patch, I think noutputs is in practice always a multiple
of unpack_factor, but it seemed more future-proof to handle the
general case.

Tested on aarch64-linux-gnu & x86_64-linux-gnu.  Also tested by
making sure that g:8157f3f2d211 with this patch added didn't
change the imagick code for -O3 -mcpu=generic on aarch64
compared to g:8157f3f2d211~.  OK to install?

Richard


gcc/
        PR tree-optimization/117270
        * tree-vect-slp.cc (vectorizable_slp_permutation_1): Make nperms
        account for the number of times that each permutation will be used
        during transformation.
---
 gcc/tree-vect-slp.cc | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 115b7b73eee..ac1733004b6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -10884,9 +10884,15 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, 
gimple_stmt_iterator *gsi,
      vectors to check during analysis, but we need to generate NOUTPUTS
      vectors during transformation.  */
   unsigned total_nelts = olanes;
-  if (repeating_p && gsi)
-    total_nelts = (total_nelts / unpack_factor) * noutputs;
-  for (unsigned i = 0; i < total_nelts; ++i)
+  unsigned process_nelts = olanes;
+  if (repeating_p)
+    {
+      total_nelts = (total_nelts / unpack_factor) * noutputs;
+      if (gsi)
+       process_nelts = total_nelts;
+    }
+  unsigned last_ei = (total_nelts - 1) % process_nelts;
+  for (unsigned i = 0; i < process_nelts; ++i)
     {
       /* VI is the input vector index when generating code for REPEATING_P.  */
       unsigned vi = i / olanes * (pack_p ? 2 : 1);
@@ -10960,7 +10966,7 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, 
gimple_stmt_iterator *gsi,
            }
 
          if (!identity_p)
-           nperms++;
+           nperms += CEIL (total_nelts, process_nelts) - (ei > last_ei);
          if (gsi)
            {
              if (second_vec.first == -1U)
-- 
2.25.1

Reply via email to