https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82438
            Bug ID: 82438
           Summary: Memory access not optimized for loops with known bounds
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: antoshkka at gmail dot com
  Target Milestone: ---

Reading and writing small data in loops with -O2 generates
"movzx ecx, BYTE PTR [rdi+rdx]" and "mov BYTE PTR [rax], 15" instead of
reading and writing using words/dwords.

Code

unsigned loop_read(unsigned char* a) {
    const unsigned size = 128;
    unsigned sum = 0;
    for (unsigned i = 0; i < size; ++i) {
        sum += a[i];
    }
    return sum;
}

generates assembly

loop_read(unsigned char*):
        lea     rcx, [rdi+128]
        xor     eax, eax
.L7:
        movzx   edx, BYTE PTR [rdi]
        add     rdi, 1
        add     eax, edx
        cmp     rdi, rcx
        jne     .L7
        rep ret

Reading words/dwords significantly reduces the iteration count. Clang reads
using dwords:

loop_read(unsigned char*):              # @loop_read(unsigned char*)
        pxor    xmm1, xmm1
        mov     rax, -128
        pxor    xmm0, xmm0
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        movd    xmm2, dword ptr [rdi + rax + 128] # xmm2 = mem[0],zero,zero,zero
        punpcklbw xmm2, xmm1            # xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
        punpcklwd xmm2, xmm1            # xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3]
        paddd   xmm0, xmm2
        add     rax, 4
        jne     .LBB1_1
        pshufd  xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        paddd   xmm1, xmm0
        pshufd  xmm0, xmm1, 229         # xmm0 = xmm1[1,1,2,3]
        paddd   xmm0, xmm1
        movd    eax, xmm0
        ret
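
For illustration only, here is a minimal source-level sketch of the dword-read transformation the report asks for. It is not taken from GCC or Clang. The names loop_read_bytes and loop_read_dwords are hypothetical, and the sketch assumes the buffer holds at least 128 bytes, as in the original loop. It only shows that loading four bytes at a time gives the same sum in a quarter of the iterations. Whether it is actually faster on a given target is not measured here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reference version: the loop from the report, one byte load per iteration.
unsigned loop_read_bytes(const unsigned char* a) {
    const unsigned size = 128;
    unsigned sum = 0;
    for (unsigned i = 0; i < size; ++i) {
        sum += a[i];
    }
    return sum;
}

// Hypothetical hand-written variant: one 32-bit load per four bytes,
// so the loop body runs 32 times instead of 128.
unsigned loop_read_dwords(const unsigned char* a) {
    assert(a != nullptr);
    const unsigned size = 128;
    static_assert(size % sizeof(std::uint32_t) == 0,
                  "size must be a multiple of the chunk width");
    unsigned sum = 0;
    for (unsigned i = 0; i < size; i += sizeof(std::uint32_t)) {
        std::uint32_t chunk;
        // memcpy keeps the load free of alignment and aliasing problems;
        // compilers typically lower it to a single dword load.
        std::memcpy(&chunk, a + i, sizeof chunk);
        // Byte order within the chunk is irrelevant because the four
        // bytes are simply added together.
        sum += (chunk & 0xFFu) + ((chunk >> 8) & 0xFFu) +
               ((chunk >> 16) & 0xFFu) + (chunk >> 24);
    }
    return sum;
}
```

Both functions should return the same value for any 128-byte input. The Clang output quoted above goes further by widening the bytes into SIMD lanes and accumulating four partial sums in parallel.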