This is an initial patch to work around the slowdown of sphinx3 on power7 VSX that first showed up in GCC 4.6 and is still present in the current GCC 4.7 trunk:

    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
The key part of the slowdown is in this inner loop in the vector_gautbl_eval_logs3 function in sphinx3 vector.c:

    {
      int32 i, r;
      float64 f;
      int32 end, veclen;
      float32 *m1, *m2, *v1, *v2;
      float64 dval1, dval2, diff1, diff2;

      /* ... */

      for (i = 0; i < veclen; i++) {
        diff1 = x[i] - m1[i];
        dval1 -= diff1 * diff1 * v1[i];
        diff2 = x[i] - m2[i];
        dval2 -= diff2 * diff2 * v2[i];
      }

      /* ... */
    }

In particular, GCC 4.6 and later vectorize this inner loop. Because the compiler doesn't know the alignment of the float pointers, it generates code to use unaligned vector loads unconditionally, which on the powerpc involves loading from an aligned pointer and then doing a vperm instruction to permute the bytes. Since the code first does the calculation in 32-bit floating point and then converts it to 64-bit floating point, the compiler does a vector convert of V4SF to V2DF in the loop. On the powerpc, this involves two more permutes, and then the vector conversion.

Thus in the inner loop, there are:

    4 vector loads
    4 vector permutes to do the unaligned load
    8 vector permutes to get things in the right registers for conversion
    4 vector conversions

This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that disables the vector float/int conversions to double. Overall this is a win:

GCC 4.6, 32-bit:
    12% improvement, 464.h264ref
     5% improvement, 450.soplex
     3% regression,  465.tonto
     2% improvement, 481.wrf
     9% improvement, 482.sphinx3

GCC 4.6, 64-bit:
     5% improvement, 456.hmmer
     6% improvement, 464.h264ref
    14% improvement, 482.sphinx3

GCC 4.7, 32-bit:
     2% improvement, 437.leslie3d
     9% improvement, 482.sphinx3

I haven't measured GCC 4.7 in 64-bit mode yet, but I can do so if desired.

While I don't think this is the only solution to 50031, it at least helps us. It is encouraging that GCC 4.7 doesn't have the regression in tonto.

I have bootstrapped and run make check on both the 4.6 and 4.7 compilers with no regressions. Is it ok to install in the 4.7 tree?
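For anyone who wants to look at the generated code without building sphinx3, here is a minimal standalone sketch of the loop. The function name gau_eval_pair, its signature, and the use of plain float/double in place of the sphinx3 float32/float64 typedefs are assumptions made for illustration; this is not the sphinx3 source.

```c
/* Hypothetical reduction of the inner loop described above.  The
   subtraction is done in 32-bit float and the result is then widened
   to double, which is what leads the vectorizer to emit V4SF -> V2DF
   conversions inside the loop.  */
void
gau_eval_pair (const float *x, const float *m1, const float *m2,
               const float *v1, const float *v2, int veclen,
               double *dval1_p, double *dval2_p)
{
  double dval1 = *dval1_p;
  double dval2 = *dval2_p;
  int i;

  for (i = 0; i < veclen; i++)
    {
      /* float - float, then converted to double on assignment.  */
      double diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];

      double diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }

  /* Write the accumulated values back to the caller.  */
  *dval1_p = dval1;
  *dval2_p = dval2;
}
```

Compiling this with -O3 -mcpu=power7 -S, once with and once without -mno-vector-convert-32bit-to-64bit (the latter needs a compiler with this patch applied), should make it possible to compare the permutes and conversions in the loop body.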

At present, I have made the default to generate the vectorized conversion, but it may make sense to flip the default.

Is this patch ok to apply? Given that it affects 4.6, did you want to see it in 4.6 as well?

[gcc]
2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* doc/invoke.texi (RS/6000 and PowerPC Options): Add
	-mvector-convert-32bit-to-64bit switch.

	* config/rs6000/vector.md (vec_unpacks_hi_v4sf): Add conditions on
	-mvector-convert-32bit-to-64bit switch.
	(vec_unpacks_lo_v4sf): Ditto.
	(vec_unpacks_float_hi_v4si): Ditto.
	(vec_unpacks_float_lo_v4si): Ditto.
	(vec_unpacku_float_hi_v4si): Ditto.
	(vec_unpacku_float_lo_v4si): Ditto.

	* config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
	switch to control whether the compiler does 32->64 bit conversions.

[gcc/testsuite]
2011-08-09  Michael Meissner  <meiss...@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* gcc.target/powerpc/vsx-vector-7.c: New test for
	-mvector-convert-32bit-to-64bit.
	* gcc.target/powerpc/vsx-vector-8.c: Ditto.

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meiss...@linux.vnet.ibm.com     fax +1 (978) 399-6899

Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 177467)
+++ gcc/doc/invoke.texi	(working copy)
@@ -813,7 +813,8 @@ See RS/6000 and PowerPC Options.
 -mrecip -mrecip=@var{opt} -mno-recip -mrecip-precision @gol
 -mno-recip-precision @gol
 -mveclibabi=@var{type} -mfriz -mno-friz @gol
--mpointers-to-nested-functions -mno-pointers-to-nested-functions}
+-mpointers-to-nested-functions -mno-pointers-to-nested-functions @gol
+-mvector-convert-32bit-to-64bit -mno-vector-convert-32bit-to-64bit}
 
 @emph{RX Options}
 @gccoptlist{-m64bit-doubles -m32bit-doubles -fpu -nofpu@gol
@@ -16426,6 +16427,13 @@ static chain value to be loaded in regis
 not be able to call through pointers to nested functions or pointers
 to functions compiled in other languages that use the static chain if
 you use the @option{-mno-pointers-to-nested-functions}.
+
+@item -mvector-convert-32bit-to-64bit
+@itemx -mno-vector-convert-32bit-to-64bit
+@opindex mvector-convert-32bit-to-64bit
+Generate (do not generate) code to use VSX vector instructions when
+converting 32-bit types to 64-bit types.  The default is
+@option{-mvector-convert-32bit-to-64bit}.
 
 @end table
 
 @node RX Options
Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 177467)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -797,7 +797,8 @@ (define_expand "vec_pack_ufix_trunc_v2df
 (define_expand "vec_unpacks_hi_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -809,7 +810,8 @@ (define_expand "vec_unpacks_hi_v4sf"
 (define_expand "vec_unpacks_lo_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -821,7 +823,8 @@ (define_expand "vec_unpacks_lo_v4sf"
 (define_expand "vec_unpacks_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -833,7 +836,8 @@ (define_expand "vec_unpacks_float_hi_v4s
 (define_expand "vec_unpacks_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -845,7 +849,8 @@ (define_expand "vec_unpacks_float_lo_v4s
 (define_expand "vec_unpacku_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -857,7 +862,8 @@ (define_expand "vec_unpacku_float_hi_v4s
 (define_expand "vec_unpacku_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
Index: gcc/config/rs6000/rs6000.opt
===================================================================
--- gcc/config/rs6000/rs6000.opt	(revision 177467)
+++ gcc/config/rs6000/rs6000.opt	(working copy)
@@ -203,6 +203,10 @@ mvsx-scalar-memory
 Target Undocumented Report Var(TARGET_VSX_SCALAR_MEMORY)
 ; If -mvsx, use VSX scalar memory reference instructions for scalar double (off by default)
 
+mvector-convert-32bit-to-64bit
+Target Report Var(TARGET_VECTOR_CONVERT_32BIT_TO_64BIT) Init(1)
+If -mvsx, enable conversion of 32-bit types to 64-bit types with vector instructions
+
 mvsx-align-128
 Target Undocumented Report Var(TARGET_VSX_ALIGN_128)
 ; If -mvsx, set alignment to 128 bits instead of 32/64