On PowerPC, starting with ISA 2.07 (power8), moving a single precision value (SFmode) from a vector register to a GPR involves converting the scalar value in the register from double (DFmode) format to the 32-bit vector/storage format, doing the move to the GPR, and then shifting right 32 bits to get the value into the bottom 32 bits of the GPR for use as a scalar:
	xscvdpspn 0,1
	mfvsrd 3,0
	srdi 3,3,32

It turns out that the current processors, starting with ISA 2.06 (power7) through ISA 3.0 (power9), actually duplicate the 32-bit value produced by the XSCVDPSPN and XSCVDPSP instructions into the top 32 bits of the register and into the second 32-bit word.  This allows us to eliminate the shift instruction, since the value is already in the correct location for a 32-bit scalar.  ISA 3.0 is being updated to include this specification (and other fixes) so that future processors will also be able to eliminate the shift.  The new code is:

	xscvdpspn 0,1
	mfvsrwz 3,0

While I was working on the modifications, I noticed that if the user did a round from DFmode to SFmode and then tried to move it to a GPR, it would originally do:

	frsp 1,2
	xscvdpspn 0,1
	mfvsrd 3,0
	srdi 3,3,32

The XSCVDPSP instruction already handles values outside of the SFmode range (XSCVDPSPN does not), so I added a combiner pattern to combine the two instructions:

	xscvdpsp 0,1
	mfvsrwz 3,0

While I was looking at the code, I noticed that if we have a SImode value in a vector register and we want to sign extend it and leave the value in a GPR, on power8 the register allocator would decide to do a 32-bit integer store instruction and a sign extending load in the GPR to do the sign extension.  I added a splitter to convert this into a pair of MFVSRWZ and EXTSW instructions.

I built Spec 2006 with the changes, and I noticed the following changes in the code:

* Round DF->SF and move to GPR: namd, wrf;
* Eliminate 32-bit shift: gromacs, namd, povray, wrf;
* Use of MFVSRWZ/EXTSW: gromacs, povray, calculix, h264ref.

I have built these changes on the following machines with bootstrap and no regressions in the regression test:

* Big endian power7 (with both 32/64-bit targets);
* Little endian power8;
* Little endian power9 prototype.

Can I check these changes into GCC 8?  Can I back port these changes into the GCC 7 branch?
[gcc]
2017-09-19  Michael Meissner  <meiss...@linux.vnet.ibm.com>

	* config/rs6000/vsx.md (vsx_xscvspdp_scalar2): Move insn so it is
	next to vsx_xscvspdp.
	(vsx_xscvdpsp_scalar): Use 'ww' constraint instead of 'f' to allow
	SFmode values being in Altivec registers.
	(vsx_xscvdpspn): Eliminate unneeded alternative.  Use correct
	constraint ('ws') for DFmode.
	(vsx_xscvspdpn): Likewise.
	(vsx_xscvdpspn_scalar): Likewise.
	(peephole for optimizing move SF to GPR): Adjust code to eliminate
	needing to do the shift right 32-bits operation after XSCVDPSPN.
	* config/rs6000/rs6000.md (extendsi<mode>2): Add alternative to do
	sign extend from vector register to GPR via a split, preventing
	the register allocator from doing the move via store/load.
	(extendsi<mode>2 splitter): Likewise.
	(movsi_from_sf): Adjust code to eliminate doing a 32-bit shift
	right or vector extract after doing XSCVDPSPN.  Use MFVSRWZ
	instead of MFVSRD to move the value to a GPR register.
	(movdi_from_sf_zero_ext): Likewise.
	(movsi_from_df): Add optimization to merge a convert from DFmode
	to SFmode and moving the SFmode to a GPR to use XSCVDPSP instead
	of round and XSCVDPSPN.
	(reload_gpr_from_vsxsf): Use MFVSRWZ instead of MFVSRD to move the
	value to a GPR register.  Rename p8_mfvsrd_4_disf insn to
	p8_mfvsrwz_disf.
	(p8_mfvsrd_4_disf): Likewise.
	(p8_mfvsrwz_disf): Likewise.

[gcc/testsuite]
2017-09-19  Michael Meissner  <meiss...@linux.vnet.ibm.com>

	* gcc.target/powerpc/pr71977-1.c: Adjust scan-assembler codes to
	reflect that we don't generate a 32-bit shift right after
	XSCVDPSPN.
	* gcc.target/powerpc/direct-move-float1.c: Likewise.
	* gcc.target/powerpc/direct-move-float3.c: New test.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 252844)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -1781,6 +1781,15 @@ (define_insn "vsx_xscvspdp"
   "xscvspdp %x0,%x1"
   [(set_attr "type" "fp")])
 
+;; Same as vsx_xscvspdp, but use SF as the type
+(define_insn "vsx_xscvspdp_scalar2"
+  [(set (match_operand:SF 0 "vsx_register_operand" "=ww")
+	(unspec:SF [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
+		   UNSPEC_VSX_CVSPDP))]
+  "VECTOR_UNIT_VSX_P (V4SFmode)"
+  "xscvspdp %x0,%x1"
+  [(set_attr "type" "fp")])
+
 ;; Generate xvcvhpsp instruction
 (define_insn "vsx_xvcvhpsp"
   [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa")
@@ -1794,41 +1803,32 @@ (define_insn "vsx_xvcvhpsp"
 ;; format of scalars is actually DF.
 (define_insn "vsx_xscvdpsp_scalar"
   [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa")
-	(unspec:V4SF [(match_operand:SF 1 "vsx_register_operand" "f")]
+	(unspec:V4SF [(match_operand:SF 1 "vsx_register_operand" "ww")]
		     UNSPEC_VSX_CVSPDP))]
   "VECTOR_UNIT_VSX_P (V4SFmode)"
   "xscvdpsp %x0,%x1"
   [(set_attr "type" "fp")])
 
-;; Same as vsx_xscvspdp, but use SF as the type
-(define_insn "vsx_xscvspdp_scalar2"
-  [(set (match_operand:SF 0 "vsx_register_operand" "=ww")
-	(unspec:SF [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
-		   UNSPEC_VSX_CVSPDP))]
-  "VECTOR_UNIT_VSX_P (V4SFmode)"
-  "xscvspdp %x0,%x1"
-  [(set_attr "type" "fp")])
-
 ;; ISA 2.07 xscvdpspn/xscvspdpn that does not raise an error on signalling NaNs
 (define_insn "vsx_xscvdpspn"
-  [(set (match_operand:V4SF 0 "vsx_register_operand" "=ww,?ww")
-	(unspec:V4SF [(match_operand:DF 1 "vsx_register_operand" "wd,wa")]
+  [(set (match_operand:V4SF 0 "vsx_register_operand" "=ww")
+	(unspec:V4SF [(match_operand:DF 1 "vsx_register_operand" "ws")]
		     UNSPEC_VSX_CVDPSPN))]
   "TARGET_XSCVDPSPN"
   "xscvdpspn %x0,%x1"
   [(set_attr "type" "fp")])
 
 (define_insn "vsx_xscvspdpn"
-  [(set (match_operand:DF 0 "vsx_register_operand" "=ws,?ws")
-	(unspec:DF [(match_operand:V4SF 1 "vsx_register_operand" "wf,wa")]
+  [(set (match_operand:DF 0 "vsx_register_operand" "=ws")
+	(unspec:DF [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
		   UNSPEC_VSX_CVSPDPN))]
   "TARGET_XSCVSPDPN"
   "xscvspdpn %x0,%x1"
   [(set_attr "type" "fp")])
 
 (define_insn "vsx_xscvdpspn_scalar"
-  [(set (match_operand:V4SF 0 "vsx_register_operand" "=wf,?wa")
-	(unspec:V4SF [(match_operand:SF 1 "vsx_register_operand" "ww,ww")]
+  [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa")
+	(unspec:V4SF [(match_operand:SF 1 "vsx_register_operand" "ww")]
		     UNSPEC_VSX_CVDPSPN))]
   "TARGET_XSCVDPSPN"
   "xscvdpspn %x0,%x1"
@@ -4773,15 +4773,13 @@ (define_constants
 ;;
 ;;	(set (reg:DI reg3) (unspec:DI [(reg:V4SF reg2)] UNSPEC_P8V_RELOAD_FROM_VSX))
 ;;
-;;	(set (reg:DI reg3) (lshiftrt:DI (reg:DI reg3) (const_int 32)))
+;;	(set (reg:DI reg4) (and:DI (reg:DI reg3) (reg:DI reg3)))
 ;;
-;;	(set (reg:DI reg5) (and:DI (reg:DI reg3) (reg:DI reg4)))
+;;	(set (reg:DI reg5) (ashift:DI (reg:DI reg4) (const_int 32)))
 ;;
-;;	(set (reg:DI reg6) (ashift:DI (reg:DI reg5) (const_int 32)))
+;;	(set (reg:SF reg6) (unspec:SF [(reg:DI reg5)] UNSPEC_P8V_MTVSRD))
 ;;
-;;	(set (reg:SF reg7) (unspec:SF [(reg:DI reg6)] UNSPEC_P8V_MTVSRD))
-;;
-;;	(set (reg:SF reg7) (unspec:SF [(reg:SF reg7)] UNSPEC_VSX_CVSPDPN))
+;;	(set (reg:SF reg6) (unspec:SF [(reg:SF reg6)] UNSPEC_VSX_CVSPDPN))
 
 (define_peephole2
   [(match_scratch:DI SFBOOL_TMP_GPR "r")
@@ -4792,11 +4790,6 @@ (define_peephole2
 	(unspec:DI [(match_operand:V4SF SFBOOL_MFVSR_A "vsx_register_operand")]
 		   UNSPEC_P8V_RELOAD_FROM_VSX))
 
-   ;; SRDI
-   (set (match_dup SFBOOL_MFVSR_D)
-	(lshiftrt:DI (match_dup SFBOOL_MFVSR_D)
-		     (const_int 32)))
-
    ;; AND/IOR/XOR operation on int
    (set (match_operand:SI SFBOOL_BOOL_D "int_reg_operand")
 	(and_ior_xor:SI (match_operand:SI SFBOOL_BOOL_A1 "int_reg_operand")
@@ -4820,15 +4813,15 @@ (define_peephole2
    && (REG_P (operands[SFBOOL_BOOL_A2])
        || CONST_INT_P (operands[SFBOOL_BOOL_A2]))
    && (REGNO (operands[SFBOOL_BOOL_D]) == REGNO (operands[SFBOOL_MFVSR_D])
-       || peep2_reg_dead_p (3, operands[SFBOOL_MFVSR_D]))
+       || peep2_reg_dead_p (2, operands[SFBOOL_MFVSR_D]))
    && (REGNO (operands[SFBOOL_MFVSR_D]) == REGNO (operands[SFBOOL_BOOL_A1])
        || (REG_P (operands[SFBOOL_BOOL_A2])
 	   && REGNO (operands[SFBOOL_MFVSR_D]) == REGNO (operands[SFBOOL_BOOL_A2])))
    && REGNO (operands[SFBOOL_BOOL_D]) == REGNO (operands[SFBOOL_SHL_A])
    && (REGNO (operands[SFBOOL_SHL_D]) == REGNO (operands[SFBOOL_BOOL_D])
-       || peep2_reg_dead_p (4, operands[SFBOOL_BOOL_D]))
-   && peep2_reg_dead_p (5, operands[SFBOOL_SHL_D])"
+       || peep2_reg_dead_p (3, operands[SFBOOL_BOOL_D]))
+   && peep2_reg_dead_p (4, operands[SFBOOL_SHL_D])"
   [(set (match_dup SFBOOL_TMP_GPR)
 	(ashift:DI (match_dup SFBOOL_BOOL_A_DI)
 		   (const_int 32)))
Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 252844)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -986,8 +986,11 @@ (define_insn_and_split "*extendhi<mode>2
 (define_insn "extendsi<mode>2"
-  [(set (match_operand:EXTSI 0 "gpc_reg_operand" "=r,r,wl,wu,wj,wK,wH")
-	(sign_extend:EXTSI (match_operand:SI 1 "lwa_operand" "Y,r,Z,Z,r,wK,wH")))]
+  [(set (match_operand:EXTSI 0 "gpc_reg_operand"
+		"=r,  r,  wl,  wu,  wj,  wK,  wH,  wr")
+
+	(sign_extend:EXTSI (match_operand:SI 1 "lwa_operand"
+		"Y,   r,  Z,   Z,   r,   wK,  wH,  ?wIwH")))]
   ""
   "@
    lwa%U1%X1 %0,%1
@@ -996,10 +999,23 @@ (define_insn "extendsi<mode>2"
    lxsiwax %x0,%y1
    mtvsrwa %x0,%1
    vextsw2d %0,%1
+   #
    #"
-  [(set_attr "type" "load,exts,fpload,fpload,mffgpr,vecexts,vecperm")
+  [(set_attr "type" "load,exts,fpload,fpload,mffgpr,vecexts,vecperm,mftgpr")
   (set_attr "sign_extend" "yes")
-  (set_attr "length" "4,4,4,4,4,4,8")])
+  (set_attr "length" "4,4,4,4,4,4,8,8")])
+
+(define_split
+  [(set (match_operand:DI 0 "int_reg_operand")
+	(sign_extend:DI (match_operand:SI 1 "vsx_register_operand")))]
+  "TARGET_DIRECT_MOVE_64BIT && reload_completed"
+  [(set (match_dup 2)
+	(match_dup 1))
+   (set (match_dup 0)
+	(sign_extend:DI (match_dup 2)))]
+{
+  operands[2] = gen_rtx_REG (SImode, reg_or_subregno (operands[0]));
+})
 
 (define_split
   [(set (match_operand:DI 0 "altivec_register_operand")
@@ -6790,25 +6806,25 @@ (define_insn "*movsi_internal1_single"
 ;; needed.
 
 ;;	  MR          LWZ         LFIWZX      LXSIWZX     STW
-;;	  STFS        STXSSP      STXSSPX     VSX->GPR    MTVSRWZ
-;;	  VSX->VSX
+;;	  STFS        STXSSP      STXSSPX     VSX->GPR    VSX->VSX,
+;;	  MTVSRWZ
 
 (define_insn_and_split "movsi_from_sf"
   [(set (match_operand:SI 0 "nonimmediate_operand"
		"=r,         r,          ?*wI,       ?*wH,       m,
-		 m,          wY,         Z,          r,          wIwH,
-		 ?wK")
+		 m,          wY,         Z,          r,          ?*wIwH,
+		 wIwH")
 	(unspec:SI [(match_operand:SF 1 "input_operand"
		"r,          m,          Z,          Z,          r,
-		 f,          wb,         wu,         wIwH,       r,
-		 wK")]
+		 f,          wb,         wu,         wIwH,       wIwH,
+		 r")]
 		   UNSPEC_SI_FROM_SF))
    (clobber (match_scratch:V4SF 2
		"=X,         X,          X,          X,          X,
-		 X,          X,          X,          wa,         X,
-		 wa"))]
+		 X,          X,          X,          wa,         X,
+		 X"))]
   "TARGET_NO_SF_SUBREG
    && (register_operand (operands[0], SImode)
@@ -6823,10 +6839,10 @@ (define_insn_and_split "movsi_from_sf"
    stxssp %1,%0
    stxsspx %x1,%y0
    #
-   mtvsrwz %x0,%1
-   #"
+   xscvdpspn %x0,%x1
+   mtvsrwz %x0,%1"
   "&& reload_completed
-   && register_operand (operands[0], SImode)
+   && int_reg_operand (operands[0], SImode)
    && vsx_reg_sfsubreg_ok (operands[1], SFmode)"
   [(const_int 0)]
 {
@@ -6836,50 +6852,38 @@ (define_insn_and_split "movsi_from_sf"
   rtx op0_di = gen_rtx_REG (DImode, REGNO (op0));
 
   emit_insn (gen_vsx_xscvdpspn_scalar (op2, op1));
-
-  if (int_reg_operand (op0, SImode))
-    {
-      emit_insn (gen_p8_mfvsrd_4_disf (op0_di, op2));
-      emit_insn (gen_lshrdi3 (op0_di, op0_di, GEN_INT (32)));
-    }
-  else
-    {
-      rtx op1_v16qi = gen_rtx_REG (V16QImode, REGNO (op1));
-      rtx byte_off = VECTOR_ELT_ORDER_BIG ? const0_rtx : GEN_INT (12);
-      emit_insn (gen_vextract4b (op0_di, op1_v16qi, byte_off));
-    }
-
+  emit_insn (gen_p8_mfvsrwz_disf (op0_di, op2));
   DONE;
 }
   [(set_attr "type"
		"*,          load,       fpload,     fpload,     store,
-		 fpstore,    fpstore,    fpstore,    mftgpr,     mffgpr,
-		 veclogical")
+		 fpstore,    fpstore,    fpstore,    mftgpr,     fp,
+		 mffgpr")
   (set_attr "length"
		"4,          4,          4,          4,          4,
-		 4,          4,          4,          12,         4,
-		 8")])
+		 4,          4,          4,          8,          4,
+		 4")])
 
 ;; movsi_from_sf with zero extension
 ;;
 ;;	  RLDICL      LWZ         LFIWZX      LXSIWZX     VSX->GPR
-;;	  MTVSRWZ     VSX->VSX
+;;	  VSX->VSX    MTVSRWZ
 
 (define_insn_and_split "*movdi_from_sf_zero_ext"
   [(set (match_operand:DI 0 "gpc_reg_operand"
		"=r,         r,          ?*wI,       ?*wH,       r,
-		 wIwH,       ?wK")
+		 wK,         wIwH")
 	(zero_extend:DI
 	 (unspec:SI [(match_operand:SF 1 "input_operand"
		"r,          m,          Z,          Z,          wIwH,
-		 r,          wK")]
+		 wIwH,       r")]
 		    UNSPEC_SI_FROM_SF)))
    (clobber (match_scratch:V4SF 2
		"=X,         X,          X,          X,          wa,
-		 X,          wa"))]
+		 X,          X"))]
   "TARGET_DIRECT_MOVE_64BIT
    && (register_operand (operands[0], DImode)
@@ -6890,9 +6894,10 @@ (define_insn_and_split "*movdi_from_sf_z
    lfiwzx %0,%y1
    lxsiwzx %x0,%y1
    #
-   mtvsrwz %x0,%1
-   #"
+   #
+   mtvsrwz %x0,%1"
   "&& reload_completed
+   && register_operand (operands[0], DImode)
    && vsx_reg_sfsubreg_ok (operands[1], SFmode)"
   [(const_int 0)]
 {
@@ -6901,29 +6906,43 @@ (define_insn_and_split "*movdi_from_sf_z
   rtx op2 = operands[2];
 
   emit_insn (gen_vsx_xscvdpspn_scalar (op2, op1));
-
   if (int_reg_operand (op0, DImode))
-    {
-      emit_insn (gen_p8_mfvsrd_4_disf (op0, op2));
-      emit_insn (gen_lshrdi3 (op0, op0, GEN_INT (32)));
-    }
+    emit_insn (gen_p8_mfvsrwz_disf (op0, op2));
   else
     {
-      rtx op0_si = gen_rtx_REG (SImode, REGNO (op0));
-      rtx op1_v16qi = gen_rtx_REG (V16QImode, REGNO (op1));
-      rtx byte_off = VECTOR_ELT_ORDER_BIG ? const0_rtx : GEN_INT (12);
-      emit_insn (gen_vextract4b (op0_si, op1_v16qi, byte_off));
+      rtx op2_si = gen_rtx_REG (SImode, reg_or_subregno (op2));
+      emit_insn (gen_zero_extendsidi2 (op0, op2_si));
     }
   DONE;
 }
   [(set_attr "type"
		"*,          load,       fpload,     fpload,     mftgpr,
-		 mffgpr,     veclogical")
+		 vecexts,    mffgpr")
   (set_attr "length"
-		"4,          4,          4,          4,          12,
-		 4,          8")])
+		"4,          4,          4,          4,          8,
+		 8,          4")])
+
+;; Like movsi_from_sf, but combine a convert from DFmode to SFmode before
+;; moving it to SImode.  We can do a SFmode store without having to do the
+;; conversion explicitly.  If we are doing a register->register conversion, use
+;; XSCVDPSP instead of XSCVDPSPN, since the former handles cases where the
+;; input will not fit in a SFmode, and the latter assumes the value has already
+;; been rounded.
+(define_insn "*movsi_from_df"
+  [(set (match_operand:SI 0 "nonimmediate_operand" "=wa,m,wY,Z")
+	(unspec:SI [(float_truncate:SF
+		     (match_operand:DF 1 "gpc_reg_operand" "wa, f,wb,wa"))]
+		   UNSPEC_SI_FROM_SF))]
+
+  "TARGET_NO_SF_SUBREG"
+  "@
+   xscvdpsp %x0,%x1
+   stfs%U0%X0 %1,%0
+   stxssp %1,%0
+   stxsspx %x1,%y0"
+  [(set_attr "type" "fp,fpstore,fpstore,fpstore")])
 
 ;; Split a load of a large constant into the appropriate two-insn
 ;; sequence.
@@ -8437,19 +8456,20 @@ (define_insn_and_split "reload_gpr_from_
   rtx diop0 = simplify_gen_subreg (DImode, op0, SFmode, 0);
 
   emit_insn (gen_vsx_xscvdpspn_scalar (op2, op1));
-  emit_insn (gen_p8_mfvsrd_4_disf (diop0, op2));
-  emit_insn (gen_lshrdi3 (diop0, diop0, GEN_INT (32)));
+  emit_insn (gen_p8_mfvsrwz_disf (diop0, op2));
   DONE;
 }
   [(set_attr "length" "12")
   (set_attr "type" "three")])
 
-(define_insn "p8_mfvsrd_4_disf"
+;; XSCVDPSPN puts the 32-bit value in both the first and second words, so we do
+;; not need to do a shift to extract the value.
+(define_insn "p8_mfvsrwz_disf"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(unspec:DI [(match_operand:V4SF 1 "register_operand" "wa")]
 		   UNSPEC_P8V_RELOAD_FROM_VSX))]
   "TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
-  "mfvsrd %0,%x1"
+  "mfvsrwz %0,%x1"
   [(set_attr "type" "mftgpr")])
Index: gcc/testsuite/gcc.target/powerpc/pr71977-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr71977-1.c	(revision 252844)
+++ gcc/testsuite/gcc.target/powerpc/pr71977-1.c	(working copy)
@@ -23,9 +23,9 @@ mask_and_float_var (float f, uint32_t ma
   return u.value;
 }
 
-/* { dg-final { scan-assembler "\[ \t\]xxland " } } */
-/* { dg-final { scan-assembler-not "\[ \t\]and " } } */
-/* { dg-final { scan-assembler-not "\[ \t\]mfvsrd " } } */
-/* { dg-final { scan-assembler-not "\[ \t\]stxv" } } */
-/* { dg-final { scan-assembler-not "\[ \t\]lxv" } } */
-/* { dg-final { scan-assembler-not "\[ \t\]srdi " } } */
+/* { dg-final { scan-assembler {\mxxland\M} } } */
+/* { dg-final { scan-assembler-not {\mand\M} } } */
+/* { dg-final { scan-assembler-not {\mmfvsrwz\M} } } */
+/* { dg-final { scan-assembler-not {\mstxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\msrdi\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/direct-move-float1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/direct-move-float1.c	(revision 252844)
+++ gcc/testsuite/gcc.target/powerpc/direct-move-float1.c	(working copy)
@@ -5,7 +5,7 @@
 /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
 /* { dg-options "-mcpu=power8 -O2" } */
 /* { dg-final { scan-assembler "mtvsrd" } } */
-/* { dg-final { scan-assembler "mfvsrd" } } */
+/* { dg-final { scan-assembler "mfvsrwz" } } */
 /* { dg-final { scan-assembler "xscvdpspn" } } */
 /* { dg-final { scan-assembler "xscvspdpn" } } */
Index: gcc/testsuite/gcc.target/powerpc/direct-move-float3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/direct-move-float3.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/direct-move-float3.c	(revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile { target { powerpc*-*-linux* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } } */
+/* { dg-skip-if "" { powerpc*-*-*spe* } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mpower8-vector -O2" } */
+
+/* Test that we generate XSCVDPSP instead of FRSP and XSCVDPSPN when we combine
+   a round from double to float and moving the float value to a GPR.  */
+
+union u {
+  float f;
+  unsigned int ui;
+  int si;
+};
+
+unsigned int
+ui_d (double d)
+{
+  union u x;
+  x.f = d;
+  return x.ui;
+}
+
+/* { dg-final { scan-assembler {\mmfvsrwz\M} } } */
+/* { dg-final { scan-assembler {\mxscvdpsp\M} } } */
+/* { dg-final { scan-assembler-not {\mmtvsrd\M} } } */
+/* { dg-final { scan-assembler-not {\mxscvdpspn\M} } } */
+/* { dg-final { scan-assembler-not {\msrdi\M} } } */