https://gcc.gnu.org/g:e3d3d6d7d2c8ab73ff597f4c82514c3217256567

commit r16-869-ge3d3d6d7d2c8ab73ff597f4c82514c3217256567
Author: Jan Hubicka <hubi...@ucw.cz>
Date:   Sun May 25 14:33:17 2025 +0200

    Make i386 constructor vectorizer costs more realistic
    
    This patch attempts to make the vectorizer costs of vector constructions
    more realistic.  Currently we account one integer_to_sse cost for integer
    vector construction, but we over-estimate 256 and 512 bit vinserts by using
    addss instead of sse_op.  This is because in reality, especially on AMD
    machines, vectorization of constructors may get expensive due to quite
    large integer<->sse move costs.
    
    Estimating the real integer<->sse register traffic is quite hard, since
    some non-vector integer arithmetic can be done in SSE registers (for
    example, when there is no real arithmetic, just a memory load, or any code
    that can be converted by the scalar-to-vector RTL pass).
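
    As a hypothetical illustration (these functions are made up, not taken
    from the patch or the testsuite): in the first function the vector
    elements come straight from memory, so they can be loaded directly into
    SSE registers and no GPR->XMM move is needed; in the second the elements
    are produced by scalar integer arithmetic that lives in GPRs, so the
    constructor really does pay the integer_to_sse moves.

      /* Elements come straight from memory: the vector can be built with
         SSE loads, no GPR -> XMM traffic.  */
      void build_from_memory (long long *dst, const long long *a,
                              const long long *b)
      {
        dst[0] = a[0];
        dst[1] = b[0];
      }

      /* Elements are computed in general purpose registers first, so the
         V2DI constructor needs movq GPR -> XMM moves.  */
      void build_from_gprs (long long *dst, long long a, long long b)
      {
        dst[0] = a * 3 + 1;
        dst[1] = b * 5 + 7;
      }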
    
    I think to fix the situation we need to proceed with Richi's recent patch
    adding extra info to the cost hooks and pattern match what can eventually
    be STV converted.  Towards that we however also need to fix current STV
    limitations (such as the lack of an int->sse conversion) and make the cost
    model more meaningful.
    
    This patch removes the hack of using addss to "add extra cost" to 256 and
    512 bit constructors.  Instead I use the integer_to_sse cost in
    add_stmt_cost.  We already account 1 conversion for all constructors
    (regardless of size).  I made it 2 conversions for 256 bit and 3 for
    512 bit since that is closest to what we do now.
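
    As a rough standalone sketch of the resulting vec_construct formula
    (simplified: it leaves out the ix86_vec_cost scaling applied on targets
    that internally split wide vector operations, and the GPR->XMM
    conversions that are accounted separately in add_stmt_cost; the <= 128
    bit branch is the pre-existing one, only roughly paraphrased here):

      /* Illustrative sketch of the per-constructor cost after this change;
         n_elts is the number of elements, sse_op the cost of one SSE
         instruction.  */
      static int
      construct_cost_sketch (int n_elts, int vector_bits, int sse_op)
      {
        if (vector_bits == 256)
          /* n - 2 element inserts plus one vinserti128.  */
          return (n_elts - 2) * sse_op + sse_op;
        if (vector_bits == 512)
          /* n - 4 element inserts, two vinserti128 for the 256 bit halves
             and one vinserti64x4.  */
          return (n_elts - 4) * sse_op + 2 * sse_op + sse_op;
        /* 128 bit and smaller keep the pre-existing costing of roughly
           n - 1 element inserts.  */
        return (n_elts - 1) * sse_op;
      }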
    
    Current cost tables do not match reality for Zens:
      1) SSE loads (which are pushed down from 10 cycles to 3 cycles)
      2) SSE stores
      3) SSE->integer conversion cost (which is 3 cycles instead of 5)
    Similarly we do not have realistic values for Intel chips, especially
    the artificially increased SSE->integer costs.
    
    The reason is that changing those values regressed benchmarks.  This was
    mostly because these costs were accounted incorrectly in multiple spots
    and we kind of fine-tuned for SPEC.
    
    Another reason is that, at the time the tables were merged with the
    register allocator, increasing those costs led to IRA using integer
    registers to spill SSE values and vice versa, which does not work that
    well in practice.  I think one of the problems there is the missing model
    for memory renaming, which makes integer spilling significantly cheaper
    than modelled.
    
    In previous patches I fixed multiple issues in accounting loads and
    stores, and with this change I hope I will be able to make the tables
    more realistic and incrementally fix issues with individual benchmarks.
    
    I benchmarked the patch with -Ofast -march=native -flto on znver5 and
    skylake.  It seems within noise for skylake; for znver5 I got what seems
    an off-noise improvement for xalancbmk, 8.73->8.81 (rate).  The rest
    seems within noise too, however the change affects quite a few SLP
    decisions when the sequence is just loads followed by a vector store.
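
    A hypothetical kernel of the affected shape (a made-up example, not one
    of the benchmarks): strided scalar loads feeding one contiguous store,
    where SLP has to decide whether building the vector is worth the
    construct cost.

      /* Scalar loads followed by a contiguous vector store; SLP can either
         keep four scalar load/store pairs or build a V4SI from the loaded
         values and store it with one instruction.  */
      void gather_store (int *dst, const int *src, int stride)
      {
        dst[0] = src[0];
        dst[1] = src[stride];
        dst[2] = src[2 * stride];
        dst[3] = src[3 * stride];
      }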
    
    gcc/ChangeLog:
    
            * config/i386/i386.cc (ix86_builtin_vectorization_cost):
            Use sse_op instead of addss to cost vinserti128 and vinserti64x4;
            compute correct mode of vinserti128.
            (ix86_vector_costs::add_stmt_cost): For integer 256bit and 512bit
            vector constructions account more integer_to_sse moves.

Diff:
---
 gcc/config/i386/i386.cc | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1b7dbd425d69..d48654a729a1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25185,12 +25185,18 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          else if (GET_MODE_BITSIZE (mode) == 256)
            return ((n - 2) * ix86_cost->sse_op
-                   + ix86_vec_cost (mode, ix86_cost->addss));
+                   + ix86_vec_cost (mode, ix86_cost->sse_op));
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
-           return ((n - 4) * ix86_cost->sse_op
-                   + 3 * ix86_vec_cost (mode, ix86_cost->addss));
+           {
+             machine_mode half_mode
+               = mode_for_vector (GET_MODE_INNER (mode),
+                                  GET_MODE_NUNITS (mode) / 2).require ();
+             return ((n - 4) * ix86_cost->sse_op
+                     + 2 * ix86_vec_cost (half_mode, ix86_cost->sse_op)
+                     + ix86_vec_cost (mode, ix86_cost->sse_op));
+           }
          gcc_unreachable ();
        }
 
@@ -26048,7 +26054,22 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
              else
                {
                  m_num_gpr_needed[where]++;
-                 stmt_cost += COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
+
+                 int cost = COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
+
+                 /* For integer construction, the number of actual GPR -> XMM
+                    moves will be somewhere between 0 and n.
+                    We do not have very good idea about actual number, since
+                    the source may be a constant, memory or a chain of
+                    instructions that will be later converted by
+                    scalar-to-vector pass.  */
+                 if (kind == vec_construct
+                     && GET_MODE_BITSIZE (mode) == 256)
+                   cost *= 2;
+                 else if (kind == vec_construct
+                          && GET_MODE_BITSIZE (mode) == 512)
+                   cost *= 3;
+                 stmt_cost += cost;
                }
            }
        }
