[Bug other/29096] New: faster _mm_cvtpi32x2_ps for xmmintrin.h

andrew dot mahone at gmail dot com Thu, 14 Sep 2006 23:39:38 -0700

_mm_cvtpi32x2_ps in xmmintrin.h could be made more efficient. The existing
version:
static __inline __m128 __attribute__((__always_inline__))
_mm_cvtpi32x2_ps(__m64 __A, __m64 __B)
{
  __v4sf __zero = (__v4sf) _mm_setzero_ps ();
  __v4sf __sfa = __builtin_ia32_cvtpi2ps (__zero, (__v2si)__A);
  __v4sf __sfb = __builtin_ia32_cvtpi2ps (__zero, (__v2si)__B);
  return (__m128) __builtin_ia32_movlhps (__sfa, __sfb);
}


generates unnecessary zeroes and copies of xmm registers. As cvtpi2ps and
movlhps both overwrite the target portion of the target register, it is not
necessary to create and copy a zero-filled register in this function. As
written, it also saves the value of __sfa before the movlhps instruction. I've
modified it to:
static __inline __m128 __attribute__((__always_inline__))
_mm_cvtpi32x2_ps(__m64 __A, __m64 __B)
{
  __v4sf __sfa = __builtin_ia32_cvtpi2ps (__sfa, (__v2si)__A);
  __v4sf __sfb = __builtin_ia32_cvtpi2ps (__sfa, (__v2si)__B);
  return (__m128) (__sfa = __builtin_ia32_movlhps (__sfa, __sfb));
}


and gotten about 25% shorter run-time in a trivial test app that converts an
array of long ints to floats.
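
To illustrate the data-flow argument above, here is a minimal scalar sketch in
portable C. It is not the xmmintrin.h code and uses no real intrinsics: the
type vec4 and the functions model_cvtpi2ps, model_movlhps, convert_x2 and
same_result are hypothetical names made up for this illustration. It models
only which lanes each instruction overwrites, and shows that the starting
contents of the registers never reach the result. It says nothing about
register allocation, timing, or whether reading an uninitialized variable is
acceptable in the real header.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for an XMM register: four float lanes. */
typedef struct { float lane[4]; } vec4;

/* Models cvtpi2ps: lanes 0-1 are replaced by the two converted integers,
   lanes 2-3 pass through unchanged from dst. */
static vec4 model_cvtpi2ps(vec4 dst, const int32_t src[2])
{
    dst.lane[0] = (float)src[0];
    dst.lane[1] = (float)src[1];
    return dst;
}

/* Models movlhps: lanes 2-3 of dst are replaced by lanes 0-1 of src,
   lanes 0-1 of dst pass through unchanged. */
static vec4 model_movlhps(vec4 dst, vec4 src)
{
    dst.lane[2] = src.lane[0];
    dst.lane[3] = src.lane[1];
    return dst;
}

/* The same three-step sequence as _mm_cvtpi32x2_ps, but with the starting
   register contents passed in so they can be varied. */
static vec4 convert_x2(vec4 initial, const int32_t a[2], const int32_t b[2])
{
    vec4 sfa = model_cvtpi2ps(initial, a);
    vec4 sfb = model_cvtpi2ps(initial, b);
    return model_movlhps(sfa, sfb);
}

/* Returns 1 when a zeroed start and an arbitrary start give the same result. */
static int same_result(const int32_t a[2], const int32_t b[2])
{
    vec4 zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    vec4 junk = {{1e30f, -7.5f, 123.0f, -1.0f}};
    vec4 r0 = convert_x2(zero, a, b);
    vec4 r1 = convert_x2(junk, a, b);
    return memcmp(r0.lane, r1.lane, sizeof r0.lane) == 0;
}
```

In this model every lane of the returned value comes from __A or __B: lanes
0-1 are written by the first conversion, and lanes 2-3 are written by the
movlhps step from the second conversion's low lanes. The upper lanes of both
intermediate values are discarded, which is why the zero fill does not affect
the output.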


-- 
           Summary: faster _mm_cvtpi32x2_ps for xmmintrin.h
           Product: gcc
           Version: 4.1.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: other
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: andrew dot mahone at gmail dot com
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29096
