https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298

--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> ---

> > I am OK with using addss cost of 3 for trunk&release branches and make this
> > more precise next stage1.
> 
> That's what we use now?  But I still don't understand why exactly 
> 538.imagick_r regresses.

We currently use a cost of 2 for znver5.
In r15-3441-g4292297a0f938f (which caused this regression) I changed

  COSTS_N_INSNS (3),                    /* cost of ADDSS/SD SUBSS/SD insns.  */

which was copied from znver4 tables to

  /* ADDSS has throughput 2 and latency 2
     (in some cases when source is another addition).  */
  COSTS_N_INSNS (2),                    /* cost of ADDSS/SD SUBSS/SD insns.  */

since ADDSS has a latency of 2 on znver5 (at least in some, but common,
cases).  Martin says he verified that this particular change causes the
regression, and moreover that the difference is:

-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;

This boils down to:

long test[4];
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
        test[0]=a;
        test[1]=b;
        test[2]=c;
        test[3]=d;
}

being vectorized with a cost of 2 (trunk) as

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdx, %xmm2
        vmovq   %rdi, %xmm3
        vpinsrq $1, %rcx, %xmm2, %xmm1
        vpinsrq $1, %rsi, %xmm3, %xmm0
        vinserti64x2    $0x1, %xmm1, %ymm0, %ymm0
        vmovdqa %ymm0, test(%rip)
        vzeroupper
        ret

while with a cost of 3 we get:

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdi, %xmm2
        vmovq   %rdx, %xmm3
        vpinsrq $1, %rsi, %xmm2, %xmm1
        vpinsrq $1, %rcx, %xmm3, %xmm0
        vmovdqa %xmm1, test(%rip)
        vmovdqa %xmm0, test+16(%rip)
        ret

I would not want to guess which one is better.  I can see that trunk's version
saves one move and a potential store->load stall.  In a stand-alone benchmark
(which does not consume the stored values) both variants have the same runtime.

The costing difference is:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 40 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 64

to:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 44 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 68
  Scalar cost: 64
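
The whole swing is the bump of the addss entry itself (a quick check, using
the usual COSTS_N_INSNS (1) == 4 scaling):

  addss entry:    COSTS_N_INSNS (2) = 8  ->  COSTS_N_INSNS (3) = 12
  vec_construct:  40 + (12 - 8) = 44   (one addss charged for the 256-bit
                                        construct, see the code quoted below)
  vector total:   64 + 4 = 68  >  scalar 64, so SLP no longer looks profitable.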

Counting latencies, I think vinserti64x2 is 1 cycle, while vpinsrq is an
integer->SSE move, which is slower and set to 4 cycles.
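
Taking those numbers at face value, the dependence chains up to the stores
come out roughly as (a back-of-the-envelope tally only; the two
vmovq/vpinsrq chains run in parallel and the stores themselves are ignored):

  trunk (cost 2):  vmovq (4) -> vpinsrq (4) -> vinserti64x2 (1)   ~9 cycles
  cost of 3:       vmovq (4) -> vpinsrq (4)                       ~8 cycles

so neither variant is clearly ahead on latency alone, which matches the
stand-alone measurement above.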
Overall it is wrong that we use the addss cost to estimate vec_construct:

      case vec_construct:
        {
          int n = TYPE_VECTOR_SUBPARTS (vectype);
          /* N - 1 element inserts into an SSE vector, the possible
             GPR -> XMM move is accounted for in add_stmt_cost.  */
          if (GET_MODE_BITSIZE (mode) <= 128)
            return (n - 1) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          else if (GET_MODE_BITSIZE (mode) == 256)
            return ((n - 2) * ix86_cost->sse_op
                    + ix86_vec_cost (mode, ix86_cost->addss));
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            return ((n - 4) * ix86_cost->sse_op
                    + 3 * ix86_vec_cost (mode, ix86_cost->addss));
          gcc_unreachable ();
        }

I think we may want to use ix86_cost->hard_register->integer_to_sse to cost
the construction in integer modes instead of addss?
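
A rough sketch of what I mean, keeping the structure of the code above; the
hard_register->integer_to_sse spelling is just the one from the previous
sentence and the integer/FP split is only one possible reading, so take it as
an illustration rather than a patch:

      case vec_construct:
        {
          int n = TYPE_VECTOR_SUBPARTS (vectype);
          /* Cost used for combining the halves; for integer element modes
             prefer the integer->SSE move cost over addss.  */
          int combine_cost
            = (GET_MODE_CLASS (GET_MODE_INNER (mode)) == MODE_INT
               ? ix86_cost->hard_register->integer_to_sse
               : ix86_vec_cost (mode, ix86_cost->addss));
          /* N - 1 element inserts into an SSE vector, the possible
             GPR -> XMM move is accounted for in add_stmt_cost.  */
          if (GET_MODE_BITSIZE (mode) <= 128)
            return (n - 1) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          else if (GET_MODE_BITSIZE (mode) == 256)
            return (n - 2) * ix86_cost->sse_op + combine_cost;
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            return (n - 4) * ix86_cost->sse_op + 3 * combine_cost;
          gcc_unreachable ();
        }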


Overall r15-3441-g4292297a0f938f is right for addss itself, but wrong for all
the other instructions we lump into the addss cost (conversions, packing, etc.).
It was a silly mistake to forget that addss is used in many other cases.
So I think it still makes sense to have at least one extra cost entry (for a
common simple FP operation that is more expensive than the common simple
integer operation we already have) and to name it so that it clearly
represents various kinds of typical FP operations (unless we want to have
costs for all of them).
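
For illustration only, with a made-up name, such an entry could sit next to
addss in struct processor_costs and take over the places where addss is
currently used as a stand-in:

  const int addss;            /* cost of ADDSS/SD SUBSS/SD insns only.  */
  const int fp_simple_op;     /* hypothetical: cost of a typical simple FP
                                 operation (conversions, packing, ...),
                                 today approximated by the addss entry.  */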
