https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with 2 bytes we get
.L3:
movzwl (%rax), %edx
addq $3, %rax
movw %dx, 8(%rsp)
movq 8(%rsp), %rdx
imulq %rcx, %rdx
shrq $48, %rdx
addq %rdx, %rsi
cmpq %rdi, %rax
jne .L3
while with 3 bytes we see
.L3:
movzwl (%rax), %edx
addq $3, %rax
movw %dx, 8(%rsp)
movzbl -1(%rax), %edx
movb %dl, 10(%rsp)
movq 8(%rsp), %rdx
imulq %rcx, %rdx
shrq $48, %rdx
addq %rdx, %rsi
cmpq %rdi, %rax
jne .L3
whereas clang outputs
.LBB0_3: # =>This Inner Loop Header: Depth=1
movzwl (%r14,%rcx), %edx
movzbl 2(%r14,%rcx), %edi
shlq $16, %rdi
orq %rdx, %rdi
andq $-16777216, %rbx # imm = 0xFFFFFFFFFF000000
orq %rdi, %rbx
movq %rbx, %rdx
imulq %rax, %rdx
shrq $48, %rdx
addq %rdx, %rsi
addq $3, %rcx
cmpq $999999992, %rcx # imm = 0x3B9AC9F8
jb .LBB0_3
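For reference, the loop shape implied by all three dumps (a 64-bit
accumulator whose low bytes are refreshed by a small constant-size
memcpy, then a multiply/shift step) would be something like the
following. This is a hypothetical reconstruction from the asm, not
the actual testcase; the names hash3 and MULT are made up here:

#include <stdint.h>
#include <string.h>

uint64_t
hash3 (const unsigned char *p, const unsigned char *end)
{
  const uint64_t MULT = 0x9e3779b97f4a7c15ull; /* arbitrary odd constant */
  uint64_t v = 0, sum = 0;
  for (; p + 3 <= end; p += 3)
    {
      memcpy (&v, p, 3);       /* overwrites only the low 3 bytes of v */
      sum += (v * MULT) >> 48; /* imulq + shrq $48 + addq */
    }
  return sum;
}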
The clang sequence _looks_ slower. Are you sure performance isn't
dominated by the first init loop (both GCC and clang vectorize it)?
I notice we spill in the above loop for the bitfield insert where
clang uses register operations. We refuse to inline the memcpy at
the GIMPLE level and further refuse to optimize it to a
BIT_INSERT_EXPR, which would be a possibility.
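The register-only variant clang emits corresponds to doing the 3-byte
insert with shifts and masks instead of through a stack slot.
Roughly, as a hedged sketch (little-endian assumed, helper name made
up):

static inline uint64_t
insert_low3 (uint64_t acc, const unsigned char *p)
{
  /* clang merges the first two byte loads into a single movzwl.  */
  uint64_t lo = (uint64_t) p[0]
                | ((uint64_t) p[1] << 8)
                | ((uint64_t) p[2] << 16);
  /* andq $-16777216, %rbx / orq %rdi, %rbx in the dump above.  */
  return (acc & ~(uint64_t) 0xffffff) | lo;
}

Something along these lines is what a BIT_INSERT_EXPR lowering could
expand to, keeping the whole insert in registers.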