https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > Yes, timing suggests there is some SHL/LHS flush.
> > 
> > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> > bytes into place at once), which reduces the number of moves from
> > 16 to 8, and the number of merges from 15 to 7 (and reduces path
> > length by 1).  This sounds like a no-brainer win with that :-)
> 
> Good idea, it looks better on P9. One thing to double-check: currently
> there are no instructions like vmrgob and vmrgoh, so do the merges you
> mentioned, from vector bytes to vector short and from vector short to
> vector word, need an artificial control vector?

I improved the patch to support mtvsrdd; the asm for the char case now looks like:

0000000000000000 <test_char>:
   0:   00 00 4c 3c     addis   r2,r12,0
                        0: R_PPC64_REL16_HA     .TOC.
   4:   00 00 42 38     addi    r2,r2,0
                        4: R_PPC64_REL16_LO     .TOC.+0x4
   8:   e8 ff a1 fb     std     r29,-24(r1)
   c:   00 00 a2 3f     addis   r29,r2,0
                        c: R_PPC64_TOC16_HA     .rodata.cst16
  10:   f0 ff c1 fb     std     r30,-16(r1)
  14:   f8 ff e1 fb     std     r31,-8(r1)
  18:   67 1b 24 7c     mtvsrdd vs33,r4,r3
  1c:   67 3b 28 7d     mtvsrdd vs41,r8,r7
  20:   68 00 c1 8b     lbz     r30,104(r1)
  24:   78 00 e1 8b     lbz     r31,120(r1)
  28:   00 00 bd 3b     addi    r29,r29,0
                        28: R_PPC64_TOC16_LO    .rodata.cst16
  2c:   60 00 81 89     lbz     r12,96(r1)
  30:   70 00 61 89     lbz     r11,112(r1)
  34:   80 00 81 88     lbz     r4,128(r1)
  38:   88 00 61 88     lbz     r3,136(r1)
  3c:   90 00 01 89     lbz     r8,144(r1)
  40:   98 00 e1 88     lbz     r7,152(r1)
  44:   67 2b 46 7c     mtvsrdd vs34,r6,r5
  48:   67 4b aa 7d     mtvsrdd vs45,r10,r9
  4c:   09 00 9d f5     lxv     vs44,0(r29)
  50:   67 63 5e 7d     mtvsrdd vs42,r30,r12
  54:   67 5b 1f 7c     mtvsrdd vs32,r31,r11
  58:   e8 ff a1 eb     ld      r29,-24(r1)
  5c:   f0 ff c1 eb     ld      r30,-16(r1)
  60:   67 23 63 7d     mtvsrdd vs43,r3,r4
  64:   f8 ff e1 eb     ld      r31,-8(r1)
  68:   3b 0b 42 10     vpermr  v2,v2,v1,v12
  6c:   67 43 27 7c     mtvsrdd vs33,r7,r8
  70:   3b 4b ad 11     vpermr  v13,v13,v9,v12
  74:   3b 53 00 10     vpermr  v0,v0,v10,v12
  78:   3b 5b 21 10     vpermr  v1,v1,v11,v12
  7c:   97 11 4d f0     xxmrglw vs34,vs45,vs34
  80:   97 01 01 f0     xxmrglw vs32,vs33,vs32
  84:   57 13 40 f0     xxmrgld vs34,vs32,vs34
  88:   20 00 80 4e     blr
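
For reference, the listing above uses eight mtvsrdd moves and seven merge-type instructions (four vpermr, two xxmrglw, one xxmrgld). The source of test_char is not shown in this comment; the following is only a minimal sketch of what such a function might look like, assuming sixteen scalar char arguments (eight in r3..r10 and eight loaded from the stack, as in the listing) and a generic GCC vector type. The typedef name v16qi and the parameter names are hypothetical.

```c
/* Hypothetical reconstruction: the real test case is not part of this
   comment.  Uses the generic GCC vector extension, not <altivec.h>.  */
typedef signed char v16qi __attribute__ ((vector_size (16)));

/* Build a 16-byte vector from 16 scalar arguments (vector constructor).  */
v16qi
test_char (signed char c1, signed char c2, signed char c3, signed char c4,
           signed char c5, signed char c6, signed char c7, signed char c8,
           signed char c9, signed char c10, signed char c11, signed char c12,
           signed char c13, signed char c14, signed char c15, signed char c16)
{
  v16qi v = { c1, c2, c3, c4, c5, c6, c7, c8,
              c9, c10, c11, c12, c13, c14, c15, c16 };
  return v;
}
```

Compiling something like this with -O2 -mcpu=power9 should exercise the vector constructor expansion discussed here, though the exact code generated depends on the compiler version and on the patch.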

The three variants measured:
  1) mtvsrdd under TARGET_DIRECT_MOVE_128
  2) mtvsrd under TARGET_DIRECT_MOVE
  3) original

The execution times on Power9 are:
  1) 7.28s
  2) 7.41s
  3) 18.19s
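
To put the timings above in relative terms, the speedup over the original can be computed as the ratio of the two times. The helper below is only an illustrative sketch; the function name speedup is hypothetical and not part of the patch.

```c
/* Illustrative helper: ratio of baseline time to improved time.
   A result greater than 1.0 means the improved variant is faster.  */
double
speedup (double baseline_s, double improved_s)
{
  return baseline_s / improved_s;
}
```

With the numbers above, speedup (18.19, 7.28) is roughly 2.50 for variant 1 and speedup (18.19, 7.41) is roughly 2.45 for variant 2, so mtvsrdd is about 2% faster than mtvsrd on this test.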
