On Tue, Aug 9, 2011 at 2:07 PM, Michael Meissner <meiss...@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 in power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3 function in sphinx3 vector.c:
>
>   {
>     int32 i, r;
>     float64 f;
>     int32 end, veclen;
>     float32 *m1, *m2, *v1, *v2;
>     float64 dval1, dval2, diff1, diff2;
>
>     /* ... */
>
>     for (i = 0; i < veclen; i++) {
>       diff1 = x[i] - m1[i];
>       dval1 -= diff1 * diff1 * v1[i];
>       diff2 = x[i] - m2[i];
>       dval2 -= diff2 * diff2 * v2[i];
>     }
>
>     /* ... */
>
>   }
>
> In particular, GCC 4.6 and later vectorizes this inner loop.  Because it
> doesn't know the alignment of the float pointers, it generates code to use
> unaligned vector loads unconditionally, which on the powerpc involves using a
> load of an aligned pointer, and then doing a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes,
> and then the vector conversion.  Thus in the inner loop, there are:
>
>   4 vector loads
>   4 vector permutes to do the unaligned load
>   8 vector permutes to get things in the right registers for conversion
>   4 vector conversions
>
> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.
> Overall this is a win:
>
> GCC 4.6, 32-bit:
>   12% improvement, 464.h264ref
>    5% improvement, 450.soplex
>    3% regression,  465.tonto
>    2% improvement, 481.wrf
>    9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>    5% improvement, 456.hmmer
>    6% improvement, 464.h264ref
>   14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>    2% improvement, 437.leslie3d
>    9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
> desired.

Mike,

Your analysis pinpoints the problem, but the patch is a work-around.
Unlike -mpointers-to-nested-functions, this patch does not change the ABI
and the user should not be aware of it.

This is a problem in the cost model of the auto-vectorizer or instruction
selection.  GCC should not be generating this sequence for vectors that are
unaligned or whose alignment is unknown.

Introducing a new option that we need to maintain going forward is not the
correct solution.

Thanks, David
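
To experiment with the code generation outside of SPEC, the inner loop quoted above can be isolated into a small C function. The following is a minimal sketch, not the sphinx3 source: the typedefs are assumed stand-ins for the sphinx3 types, and the function name eval_two_gaussians and its parameter list are hypothetical. It keeps the property the analysis depends on: the subtraction is done on float32 operands and the result is widened to float64, and nothing tells the compiler how the float32 pointers are aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for the sphinx3 typedefs; not copied from its headers. */
typedef int32_t int32;
typedef float   float32;
typedef double  float64;

/*
 * Hypothetical stand-alone wrapper around the quoted loop.
 *
 * x[i] - m[i] is a subtraction of float32 operands; the result is widened
 * to float64 when assigned to diff1/diff2, and v[i] is widened before the
 * multiply.  Those are the float -> double conversions the vectorizer has
 * to implement.  The pointers carry no alignment information.
 */
void eval_two_gaussians(const float32 *x,
                        const float32 *m1, const float32 *v1,
                        const float32 *m2, const float32 *v2,
                        int32 veclen,
                        float64 *dval1, float64 *dval2)
{
    /* Accumulate in locals so the stores through dval1/dval2 happen once. */
    float64 d1 = *dval1;
    float64 d2 = *dval2;
    float64 diff1, diff2;
    int32 i;

    assert(veclen >= 0);

    for (i = 0; i < veclen; i++) {
        diff1 = x[i] - m1[i];
        d1 -= diff1 * diff1 * v1[i];
        diff2 = x[i] - m2[i];
        d2 -= diff2 * diff2 * v2[i];
    }

    *dval1 = d1;
    *dval2 = d2;
}
```

Compiling a file like this with something along the lines of -O3 -mcpu=power7 and reading the generated assembly should show whether the loop gets vectorized and how many permutes it contains. Whether this reduced form produces the same sequence as the full vector_gautbl_eval_logs3 function has not been verified.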