https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64749

            Bug ID: 64749
           Summary: "truncating" instructions generated instead of a load
                    one using SSE & AVX2 intrinsics
           Product: gcc
           Version: 4.8.4
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: adrien at guinet dot me

Created attachment 34553
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34553&action=edit
test case

The attached code compiles and runs fine (that is, the program's output is
correct) with GCC 4.9. When compiled with GCC 4.8, the output is
different and incorrect.

Indeed, when compiled with GCC 4.8, a spurious truncation is introduced at
the beginning of the loop (in f2). Here is the relevant assembly code (output of
GCC 4.8):

xor     eax, eax 
mov     rbp, rsp 
and     rsp, 0FFFFFFFFFFFFFFE0h
vbroadcastss ymm3, xmm6
add     rsp, 10h 
nop     dword ptr [rax]

loc_400970:
  vpmovzxwd ymm4, xmmword ptr [rdx+rax*4]
  vpmovzxwd ymm2, xmmword ptr [rcx+rax*4]
  vmovdqa [rsp-8+var_28], ymm4
; the truncation happens here: only the low 128 bits of ymm4 are reloaded
  vmovdqa xmm5, xmmword ptr [rsp-8+var_28]
  vpmulld ymm0, ymm4, ymm2
; here it uses xmm5, which thus no longer holds the correct value:
; it contains zero-extended 32-bit lanes, not the original 16-bit words.
; xmm5 and ymm4 should instead be set up like this (as GCC 4.9 does):
; vmovdqa xmm5, xmmword ptr [rdx+rax*4]
; vpmovzxwd ymm4, xmm5
  vpmulhuw xmm1, xmm5, xmmword ptr [r8+rax*4]
  vpmovzxwd ymm1, xmm1
  vpmulld ymm1, ymm1, ymm3
  vpsubd  ymm0, ymm0, ymm1
  vmovdqa xmmword ptr [rsi+rax*4], xmm0
  add     rax, 8
  cmp     rdi, rax 
  ja      short loc_400970

GCC 4.9 indeed behaves correctly and generates this assembly code:

vbroadcastss ymm3, dword ptr [rbp-14h]
xor     eax, eax
nop     dword ptr [rax+00h]
loc_4009A8:                             
  vmovdqa xmm0, xmmword ptr [rdx+rax*4] ; 128-bit load
  vpmulhuw xmm2, xmm0, xmmword ptr [r8+rax*4] ; correctly uses xmm0
  vpmovzxwd ymm2, xmm2 ; 16-bit -> 32-bit conversion here
  vpmulld ymm2, ymm2, ymm3
  vpmovzxwd ymm1, xmm0
  vpmovzxwd ymm0, xmmword ptr [rcx+rax*4]
  vpmulld ymm0, ymm1, ymm0
  vpsubd  ymm0, ymm0, ymm2
  vmovaps xmmword ptr [rsi+rax*4], xmm0
  add     rax, 8
  cmp     rdi, rax
  ja      short loc_4009A8

Thanks for any help about this!

P.S.: sorry, but I didn't manage to produce a shorter test case :/
