On Tue, Aug 9, 2011 at 2:07 PM, Michael Meissner
<meiss...@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3 function in sphinx3 vector.c:
>
> {
>  int32 i, r;
>  float64 f;
>  int32 end, veclen;
>  float32 *m1, *m2, *v1, *v2;
>  float64 dval1, dval2, diff1, diff2;
>
>    /* ... */
>
>    for (i = 0; i < veclen; i++) {
>      diff1 = x[i] - m1[i];
>      dval1 -= diff1 * diff1 * v1[i];
>      diff2 = x[i] - m2[i];
>      dval2 -= diff2 * diff2 * v2[i];
>    }
>
>    /* ... */
>
> }
>
> In particular, GCC 4.6 and later vectorize this inner loop.  Because the
> compiler doesn't know the alignment of the float pointers, it generates code
> to use unaligned vector loads unconditionally, which on the powerpc involves
> a load of an aligned pointer followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes,
> and then the vector conversion.  Thus in the inner loop, there are:
>
>    4 vector loads
>    4 vector permutes to do the unaligned load
>    8 vector permutes to get things in the right registers for conversion
>    4 vector conversions
>
> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.  Overall this is a win:
>
> GCC 4.6, 32-bit:
>    12% improvement, 464.h264ref
>     5% improvement, 450.soplex
>     3% regression,  465.tonto
>     2% improvement, 481.wrf
>     9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>     5% improvement, 456.hmmer
>     6% improvement, 464.h264ref
>    14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>      2% improvement, 437.leslie3d
>      9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
> desired.

Mike,

Your analysis pinpoints the problem, but the patch is a work-around.
Unlike -mpointers-to-nested-functions, this patch does not change the
ABI and the user should not be aware of it.  This is a problem in the
cost model of the auto-vectorizer or instruction selection.  GCC
should not be generating this sequence for vectors that are unaligned
or whose alignment is unknown.  Introducing a new option that we need
to maintain going forward is not the correct solution.

Thanks, David
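
For anyone who wants to look at the generated code outside of sphinx3, the
quoted loop can be lifted into a standalone function along the following
lines.  This is only a sketch: the function name gau_eval_pair, its signature,
and the typedefs are assumptions made for illustration and are not taken from
sphinx3 or from the patch.  The arithmetic follows the quoted loop: the
subtraction is done in 32-bit float and the result is then widened to 64-bit.

```c
#include <stddef.h>

/* Assumed stand-ins for the sphinx3 typedefs used in the quoted loop. */
typedef float  float32;
typedef double float64;

/*
 * Hypothetical standalone version of the inner loop.  The pointers carry no
 * alignment information, which is the situation described above.
 */
void gau_eval_pair(const float32 *x,
                   const float32 *m1, const float32 *v1,
                   const float32 *m2, const float32 *v2,
                   int veclen, float64 *dval1, float64 *dval2)
{
    float64 d1 = *dval1;
    float64 d2 = *dval2;
    int i;

    for (i = 0; i < veclen; i++) {
        /* float32 subtract, then conversion to float64 on assignment */
        float64 diff1 = x[i] - m1[i];
        d1 -= diff1 * diff1 * v1[i];
        float64 diff2 = x[i] - m2[i];
        d2 -= diff2 * diff2 * v2[i];
    }

    *dval1 = d1;
    *dval2 = d2;
}
```

Compiling a file like this with something along the lines of
"gcc -O3 -mcpu=power7 -S" should show whether the loop is vectorized and
whether the permute and conversion sequence described above is emitted; the
exact output will depend on the compiler version.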
