https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note that the rotate isn't something created by the bswap pass; it isn't really
a byteswap, just a swap of the two halves of the long long.
It comes from expansion and combine.  Expanding
  _9 = (int) _3;
  _10 = BIT_FIELD_REF <_3, 32, 32>;
  MEM[(int &)&y] = _10;
  MEM[(int &)&y + 4] = _9;
  _4 = MEM <long unsigned int> [(char * {ref-all})&y];
  MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
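For reference, the GIMPLE above corresponds to source of roughly this shape (a hypothetical reduction sketched from the dump, not the actual pr96166.c testcase; names are made up, and `uint32_t` stands in for the `int` stores):

```c
#include <stdint.h>
#include <string.h>

/* Store the two 32-bit halves of *x swapped through a temporary y,
   then copy y back out as one 64-bit access.  On little-endian this
   amounts to a rotate of *x by 32 bits. */
void store_halves_swapped(uint64_t *x)
{
    uint32_t y[2];
    uint64_t v;
    memcpy(&v, x, sizeof v);
    y[0] = (uint32_t)(v >> 32);  /* high half at offset 0 */
    y[1] = (uint32_t)v;          /* low half at offset 4 */
    memcpy(x, y, sizeof y);      /* the 64-bit copy to x */
}
```

The two `y[...]` assignments match the two `MEM[(int &)&y ...]` stores, and the final `memcpy` matches the `long unsigned int` load/store pair.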
results in
(insn 7 6 8 (parallel [
            (set (reg:DI 88)
                (ashiftrt:DI (reg:DI 82 [ _3 ])
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 8 7 9 (set (reg:DI 89)
        (zero_extend:DI (subreg:SI (reg:DI 88) 0))) "pr96166.c":4:5 -1
     (nil))

(insn 9 8 10 (set (reg:DI 91)
        (const_int -4294967296 [0xffffffff00000000])) "pr96166.c":4:5 -1
     (nil))

(insn 10 9 11 (parallel [
            (set (reg:DI 90)
                (and:DI (reg/v:DI 86 [ y ])
                    (reg:DI 91)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 11 10 12 (parallel [
            (set (reg:DI 92)
                (ior:DI (reg:DI 90)
                    (reg:DI 89)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":4:5 -1
     (nil))

(insn 12 11 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 92)) "pr96166.c":4:5 -1
     (nil))

(insn 13 12 14 (set (reg:DI 93)
        (zero_extend:DI (subreg:SI (reg:DI 82 [ _3 ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 14 13 15 (parallel [
            (set (reg:DI 94)
                (ashift:DI (reg:DI 93)
                    (const_int 32 [0x20])))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 15 14 16 (set (reg:DI 95)
        (zero_extend:DI (subreg:SI (reg/v:DI 86 [ y ]) 0))) "pr96166.c":5:5 -1
     (nil))

(insn 16 15 17 (parallel [
            (set (reg:DI 96)
                (ior:DI (reg:DI 95)
                    (reg:DI 94)))
            (clobber (reg:CC 17 flags))
        ]) "pr96166.c":5:5 -1
     (nil))

(insn 17 16 0 (set (reg/v:DI 86 [ y ])
        (reg:DI 96)) "pr96166.c":5:5 -1
     (nil))

(insn 18 17 0 (set (mem:DI (reg/v/f:DI 87 [ x ]) [0 MEM <long unsigned int> [(char * {ref-all})x_2(D)]+0 S8 A8])
        (reg/v:DI 86 [ y ])) "pr96166.c":13:19 -1
        (reg/v:DI 86 [ y ])) "pr96166.c":13:19 -1
     (nil))

(I must say I'm surprised y hasn't been forced into the stack even though it is
stored in parts) and then combine matches a rotate out of that.
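What combine matches out of the shift/and/ior sequence is the classic rotate idiom; written directly in C (a minimal sketch, not the PR's testcase), it is:

```c
#include <stdint.h>

/* (x << 32) | (x >> 32) swaps the two 32-bit halves of x, i.e. a
   rotate by 32; GCC can emit this as a single rorq on x86-64. */
uint64_t rot32(uint64_t x)
{
    return (x << 32) | (x >> 32);
}
```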
While with SLP vectorization, we end up with:
   _9 = (int) _3;
   _10 = BIT_FIELD_REF <_3, 32, 32>;
-  MEM[(int &)&y] = _10;
-  MEM[(int &)&y + 4] = _9;
+  _11 = {_10, _9};
+  MEM <vector(2) int> [(int &)&y] = _11;
   _4 = MEM <long unsigned int> [(char * {ref-all})&y];
   MEM <long unsigned int> [(char * {ref-all})x_2(D)] = _4;
and we aren't able to undo the vectorization during the RTL optimizations.
I'm surprised the costs suggest such vectorization is beneficial; constructing a
vector just to store it into memory seems more expensive than doing two scalar
stores, doesn't it?
