https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #4)
> While with SLP vectorization, we end up with:
>    _9 = (int) _3;
>    _10 = BIT_FIELD_REF <_3, 32, 32>;
> -  MEM[(int &)&y] = _10;
> -  MEM[(int &)&y + 4] = _9;
> +  _11 = {_10, _9};
> +  MEM <vector(2) int> [(int &)&y] = _11;
>    _4 = MEM <long unsigned int> [(char * {ref-all})&y];
>    MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
> and aren't able to undo the vectorization during the RTL optimizations.
> I'm surprised costs suggest such vectorization is beneficial, constructing a
> vector just to store it into memory seems more expensive than just doing two
> stores, isn't it?

In general yes (esp. with the components in GPRs).  Of course the x86
vectorizer costing assigns 12 + 12 to the two scalar stores and
just 12 to the vector store, and the cost of the CTOR isn't even close
to the 12 that would be needed to break even.

We're doing

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;

so what we miss here is costing the GPR -> xmm moves that are required
where they are not "free" (IIRC there are some AVX GPR -> xmm insert
instructions?).

Then we generally want larger (vector) stores because they are more
likely to successfully store-to-load forward to later loads than
smaller stores are (the other way around for loads!).
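
For example (my own sketch, not from the PR), two narrow stores that are
reloaded by one wide load are the classic store-forwarding failure,
while a single wide store of the combined value forwards fine:

  typedef unsigned int __attribute__ ((may_alias)) uint_a;

  /* *p is written with two 4-byte stores and then read back with one
     8-byte load; the CPU cannot forward from two separate store-buffer
     entries to one wide load, so the load stalls on most x86 cores.  */
  unsigned long long
  narrow_stores_wide_load (unsigned long long *p, unsigned lo, unsigned hi)
  {
    uint_a *q = (uint_a *) p;
    q[0] = lo;
    q[1] = hi;
    return *p;
  }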

So, say, for two scalar SFmode stores, doing movlhps + movps to memory
is clearly beneficial compared to two movss.

Now for the testcase the IL before vectorization is

  _3 = MEM <long unsigned int> [(char * {ref-all})x_2(D)];
  _9 = BIT_FIELD_REF <_3, 32, 0>;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  y = _10;
  MEM[(int &)&y + 4] = _9;
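
(For reference, a C source along the following lines produces such IL;
this is my reconstruction, not necessarily the exact testcase of the
PR.)

  #include <string.h>

  void
  foo (int *x)
  {
    int y[2], tmp;
    memcpy (y, x, sizeof y);              /* the wide {ref-all} load   */
    tmp = y[0], y[0] = y[1], y[1] = tmp;
    memcpy (x, y, sizeof y);              /* the wide reload and store */
  }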

The vectorizer simply "reloads" _3 in a vector mode, swaps its two
elements and then vectorizes the store.  But it counts the
BIT_FIELD_REFs as part of the scalar cost here.

t.c:4:5: note: Cost model analysis:
0x3ef0a60 _10 1 times scalar_store costs 12 in body
0x3ef0a60 _9 1 times scalar_store costs 12 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 32> 1 times scalar_stmt costs 4 in body
0x3ef0a60 BIT_FIELD_REF <_3, 32, 0> 1 times scalar_stmt costs 4 in body
0x3ef0a60 <unknown> 1 times vec_perm costs 4 in body
0x3ef0a60 _10 1 times vector_store costs 12 in body
t.c:4:5: note: Cost model analysis for part in loop 0:
  Vector cost: 16
  Scalar cost: 32
t.c:4:5: note: Basic block will be vectorized using SLP

so the issue is really the vectorizer doesn't see the scalar code can
be implemented with a simple

  rolq    $32, (%rdi)

because that's not what the GIMPLE looks like (of course such GIMPLE
would have a wide load, a swap of the two halves - effectively a
rotate - and a wide store, exactly the same structure as the vector
code has).

That's

(define_insn "*<insn><mode>3_1"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r")
        (any_rotate:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0,rm")
          (match_operand:QI 2 "nonmemory_operand" "c<S>,<S>")))

and combine even tries

Trying 6, 9 -> 10:
    6: r87:DI=[r86:DI]
    9: r88:V2SI=vec_select(r87:DI#0,parallel)
      REG_DEAD r87:DI
   10: [r86:DI]=r88:V2SI
      REG_DEAD r88:V2SI
      REG_DEAD r86:DI
Failed to match this instruction:
(set (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
    (vec_select:V2SI (mem:V2SI (reg/v/f:DI 86 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (parallel [
                (const_int 1 [0x1])
                (const_int 0 [0])
            ])))

but simplification fails to consider doing this in DImode (as the
rotate).  So maybe a combine helper pattern would do the trick?
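
(For comparison, and again my own example rather than anything from the
PR: when the source spells the half-swap as an explicit rotate of the
64-bit value, GIMPLE sees the rotate directly and the rotate pattern
above should match without any combine help.)

  #include <string.h>

  void
  bar (unsigned long long *x)
  {
    unsigned long long v;
    memcpy (&v, x, sizeof v);
    v = (v << 32) | (v >> 32);       /* rotate by 32 == swap the halves */
    memcpy (x, &v, sizeof v);
  }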
