https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933
--- Comment #6 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > Yes, timing suggests there is some SHL/LHS flush.
> >
> > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> > bytes into place at one), which reduces the number of moves from
> > 16 to 8, and the number of merges from 15 to 7 (and reduces path
> > length by 1).  This sounds like a no-brainer win with that :-)
>
> Good idea, it looks better on P9. One thing to double confirm, currently
> there are no instructions like vmrgob and vmrgoh, so of the mergings you
> mentioned from vector bytes to vector short and vector short to vector word
> needs artificial control vector?

I improved the patch to support mtvsrdd; the asm for char now looks like:

0000000000000000 <test_char>:
   0:   00 00 4c 3c     addis   r2,r12,0
                        0: R_PPC64_REL16_HA     .TOC.
   4:   00 00 42 38     addi    r2,r2,0
                        4: R_PPC64_REL16_LO     .TOC.+0x4
   8:   e8 ff a1 fb     std     r29,-24(r1)
   c:   00 00 a2 3f     addis   r29,r2,0
                        c: R_PPC64_TOC16_HA     .rodata.cst16
  10:   f0 ff c1 fb     std     r30,-16(r1)
  14:   f8 ff e1 fb     std     r31,-8(r1)
  18:   67 1b 24 7c     mtvsrdd vs33,r4,r3
  1c:   67 3b 28 7d     mtvsrdd vs41,r8,r7
  20:   68 00 c1 8b     lbz     r30,104(r1)
  24:   78 00 e1 8b     lbz     r31,120(r1)
  28:   00 00 bd 3b     addi    r29,r29,0
                        28: R_PPC64_TOC16_LO    .rodata.cst16
  2c:   60 00 81 89     lbz     r12,96(r1)
  30:   70 00 61 89     lbz     r11,112(r1)
  34:   80 00 81 88     lbz     r4,128(r1)
  38:   88 00 61 88     lbz     r3,136(r1)
  3c:   90 00 01 89     lbz     r8,144(r1)
  40:   98 00 e1 88     lbz     r7,152(r1)
  44:   67 2b 46 7c     mtvsrdd vs34,r6,r5
  48:   67 4b aa 7d     mtvsrdd vs45,r10,r9
  4c:   09 00 9d f5     lxv     vs44,0(r29)
  50:   67 63 5e 7d     mtvsrdd vs42,r30,r12
  54:   67 5b 1f 7c     mtvsrdd vs32,r31,r11
  58:   e8 ff a1 eb     ld      r29,-24(r1)
  5c:   f0 ff c1 eb     ld      r30,-16(r1)
  60:   67 23 63 7d     mtvsrdd vs43,r3,r4
  64:   f8 ff e1 eb     ld      r31,-8(r1)
  68:   3b 0b 42 10     vpermr  v2,v2,v1,v12
  6c:   67 43 27 7c     mtvsrdd vs33,r7,r8
  70:   3b 4b ad 11     vpermr  v13,v13,v9,v12
  74:   3b 53 00 10     vpermr  v0,v0,v10,v12
  78:   3b 5b 21 10     vpermr  v1,v1,v11,v12
  7c:   97 11 4d f0     xxmrglw vs34,vs45,vs34
  80:   97 01 01 f0     xxmrglw vs32,vs33,vs32
  84:   57 13 40 f0     xxmrgld vs34,vs32,vs34
  88:   20 00 80 4e     blr

For:
  1) mtvsrdd under TARGET_DIRECT_MOVE_128
  2) mtvsrd under TARGET_DIRECT_MOVE
  3) original

the timing on Power9 is:
  1) 7.28s
  2) 7.41s
  3) 18.19s
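
For readers who do not have the test case at hand, here is a minimal C sketch of the kind of function that could produce a listing like test_char above: a 16-byte vector built from 16 scalar arguments. It is an assumption, not the PR's actual test source. It uses the GCC/Clang generic vector extension rather than the AltiVec `vector` keyword so that it compiles on any target; the typedef `v16qi` and the helpers `lane`, `moves_needed` and `merges_needed` are hypothetical names introduced only for illustration. The two small helpers restate the counting argument from comment #4: one scalar per move gives 16 moves and 15 merges, two scalars per move gives 8 moves and 7 merges.

```c
#include <assert.h>

/* GCC/Clang generic vector extension: sixteen signed bytes in one vector.
   (Assumed stand-in for "vector signed char" on Power.) */
typedef signed char v16qi __attribute__((vector_size(16)));

/* Hypothetical reconstruction of the test: build a vector from 16 scalars.
   On ppc64le the first eight arguments arrive in r3..r10 and the rest on
   the stack, which matches the lbz loads in the listing. */
v16qi test_char(signed char c0, signed char c1, signed char c2, signed char c3,
                signed char c4, signed char c5, signed char c6, signed char c7,
                signed char c8, signed char c9, signed char c10, signed char c11,
                signed char c12, signed char c13, signed char c14, signed char c15)
{
    return (v16qi){c0, c1, c2,  c3,  c4,  c5,  c6,  c7,
                   c8, c9, c10, c11, c12, c13, c14, c15};
}

/* Read one element of the vector (illustrative helper). */
signed char lane(v16qi v, int i)
{
    assert(i >= 0 && i < 16);
    return v[i];
}

/* GPR-to-VSR moves needed when each move carries per_move scalars:
   mtvsrd carries 1, mtvsrdd carries 2. */
int moves_needed(int elems, int per_move)
{
    return elems / per_move;
}

/* Combining `parts` partial vectors into one takes parts - 1 two-input
   merges, however the merge tree is arranged. */
int merges_needed(int parts)
{
    return parts - 1;
}
```

Counting instructions in the listing agrees with this model: there are 8 mtvsrdd moves, and 4 vpermr plus 2 xxmrglw plus 1 xxmrgld make 7 merges.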