_mm_cvtpi32x2_ps in xmmintrin.h could be made more efficient. The existing version:

static __inline __m128 __attribute__((__always_inline__))
_mm_cvtpi32x2_ps(__m64 __A, __m64 __B)
{
  __v4sf __zero = (__v4sf) _mm_setzero_ps ();
  __v4sf __sfa = __builtin_ia32_cvtpi2ps (__zero, (__v2si)__A);
  __v4sf __sfb = __builtin_ia32_cvtpi2ps (__zero, (__v2si)__B);
  return (__m128) __builtin_ia32_movlhps (__sfa, __sfb);
}

generates unnecessary zeroes and copies of xmm registers. As cvtpi2ps and movlhps both overwrite the target portion of the target register, it is not necessary to create and copy a zero-filled register in this function. As written, the existing version also saves the value of __sfa before the movlhps instruction. I've modified it to:

static __inline __m128 __attribute__((__always_inline__))
_mm_cvtpi32x2_ps(__m64 __A, __m64 __B)
{
  __v4sf __sfa = __builtin_ia32_cvtpi2ps (__sfa, (__v2si)__A);
  __v4sf __sfb = __builtin_ia32_cvtpi2ps (__sfa, (__v2si)__B);
  return (__m128) (__sfa = __builtin_ia32_movlhps (__sfa, __sfb));
}

and got about 25% shorter run time in a trivial test app that converts an array of long ints to floats.

--
Summary: faster _mm_cvtpi32x2_ps for xmmintrin.h
Product: gcc
Version: 4.1.1
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: other
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: andrew dot mahone at gmail dot com
GCC build triplet: i686-pc-linux-gnu
GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29096
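
The lane argument in the report (cvtpi2ps and movlhps each overwrite exactly the part of the register that the final result uses) can be checked with a small scalar model. The sketch below is not the GCC implementation and does not use the real builtins: `vec4`, `model_cvtpi2ps`, `model_movlhps`, `convert_with_seed` and `check_seed_independence` are hypothetical names, and the two model functions only mirror the documented lane behaviour of the instructions. It says nothing about how the compiler treats the uninitialized `__sfa` in the modified version; it only shows that the four result lanes do not depend on the starting register contents.

```c
#include <assert.h>

/* Scalar stand-in for a 4-lane SSE register; lane 0 is the lowest. */
typedef struct { float f[4]; } vec4;

/* Model of CVTPI2PS: lanes 0-1 are written from the two ints,
   lanes 2-3 are carried over unchanged from dst. */
vec4 model_cvtpi2ps(vec4 dst, const int src[2])
{
    dst.f[0] = (float)src[0];
    dst.f[1] = (float)src[1];
    return dst;
}

/* Model of MOVLHPS: lanes 2-3 of dst are written from lanes 0-1 of src,
   lanes 0-1 of dst are kept. */
vec4 model_movlhps(vec4 dst, vec4 src)
{
    dst.f[2] = src.f[0];
    dst.f[3] = src.f[1];
    return dst;
}

/* Hypothetical helper: the same three-step sequence as the intrinsic,
   started from an arbitrary "seed" register instead of zero. */
vec4 convert_with_seed(vec4 seed, const int a[2], const int b[2])
{
    vec4 sfa = model_cvtpi2ps(seed, a);
    vec4 sfb = model_cvtpi2ps(seed, b);
    return model_movlhps(sfa, sfb);
}

/* Asserts that a zero seed and a junk seed give the same four lanes. */
void check_seed_independence(void)
{
    const int a[2] = { 1, -2 };
    const int b[2] = { 3, 40000 };
    vec4 zero = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    vec4 junk = { { 123.5f, -7.0f, 9e9f, 0.25f } };
    vec4 r0 = convert_with_seed(zero, a, b);
    vec4 r1 = convert_with_seed(junk, a, b);
    for (int i = 0; i < 4; i++)
        assert(r0.f[i] == r1.f[i]);
}
```

Calling `check_seed_independence()` from a test program exercises the model. Every lane of the result is written either by a conversion (lanes 0-1) or by the final move (lanes 2-3), which is why the zero-filled register contributes nothing to the output.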