https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92188
            Bug ID: 92188
           Summary: Cannot merge memory write for _mm_cvtps_ph/_mm256_cvtps_ph and x86-64
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fredrik987 at gmail dot com
  Target Milestone: ---

Created attachment 47089
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47089&action=edit
Test code

For this code, the memory write is not merged into vcvtps2ph.

void test1(__m128i *x, const __m256 *y) {
    // Cannot merge memory write
    *x = _mm256_cvtps_ph(*y, _MM_FROUND_CUR_DIRECTION);
}
...
        vcvtps2ph       $4, %ymm0, %xmm0
        vmovaps         %xmm0, (%rdi)
...

A workaround is to change the output type to __v8hi:

void test2(__v8hi *x, const __m256 *y) {
    // Memory write merged
    *x = (__v8hi)_mm256_cvtps_ph(*y, _MM_FROUND_CUR_DIRECTION);
}
...
        vcvtps2ph       $4, %ymm0, (%rdi)
...

However, the same approach does not work for the 128-bit variant of vcvtps2ph.

void test4(__v4hi *x, const __m128 *y) {
    // Cannot merge memory write
    *x = (__v4hi)(((__v2di)_mm_cvtps_ph(*y, _MM_FROUND_CUR_DIRECTION))[0]);
}
...
        vcvtps2ph       $4, %xmm0, %xmm0
        vmovq           %xmm0, (%rdi)
...

The opposite problem exists for, e.g., _mm256_extracti128_si256, which
normally merges the memory write but does not do so when the output type
is __v8hi.

void test6(__v8hi *x, const __m256i *y) {
    // Cannot merge memory write
    *x = (__v8hi)_mm256_extracti128_si256(*y, 1);
}
...
        vextracti128    $0x1, %ymm0, %xmm0
        vmovaps         %xmm0, (%rdi)
...

It would be good if all variants behaved the same, with the memory write
merged.

I compile with "-O3 -march=core-avx2" (using Compiler Explorer).
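
For reference, here is a minimal self-contained sketch of the first two cases (test1 and test2), in case it helps with reproducing the difference locally. It is not the attached test code. The TARGET_AVX2_F16C macro and the same_result() helper are hypothetical additions for illustration: the macro only lets the file build without -march flags, and the helper checks that the __v8hi workaround stores the same bytes as the __m128i version, so the two functions should differ only in generated code.

```c
#include <immintrin.h>
#include <string.h>

/* Assumption (not in the report): per-function target attributes, so the
   file also builds when -march=core-avx2 is not on the command line. */
#define TARGET_AVX2_F16C __attribute__((target("avx2,f16c")))

/* From the report: conversion to a register plus a separate store. */
TARGET_AVX2_F16C
void test1(__m128i *x, const __m256 *y) {
    *x = _mm256_cvtps_ph(*y, _MM_FROUND_CUR_DIRECTION);
}

/* From the report: the workaround that uses __v8hi as the output type. */
TARGET_AVX2_F16C
void test2(__v8hi *x, const __m256 *y) {
    *x = (__v8hi)_mm256_cvtps_ph(*y, _MM_FROUND_CUR_DIRECTION);
}

/* Hypothetical helper: returns 1 if both variants store identical bytes.
   Must only be called on a CPU that supports AVX2 and F16C. */
TARGET_AVX2_F16C
int same_result(void) {
    __m256 in = _mm256_setr_ps(0.0f, 1.0f, -2.0f, 0.5f,
                               3.0f, 100.0f, -0.25f, 8.0f);
    __m128i a;
    __v8hi b;
    test1(&a, &in);
    test2(&b, &in);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

To compare the generated code, the file can be compiled with "gcc -O3 -march=core-avx2 -S" and the assembly for test1 and test2 inspected side by side.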