https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110692
            Bug ID: 110692
           Summary: epilogues for loop which can be also vectorized with half
                    size can be improved.
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

For:

int a[99];
test()
{
  for (int i = 0 ; i < 99; i++)
    a[i]++;
}

we produce:

   0:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 8 <test+0x8>
   7:   00
   8:   b8 00 00 00 00          mov    $0x0,%eax
   d:   ba 00 00 00 00          mov    $0x0,%edx
  12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  18:   66 0f 6f 00             movdqa (%rax),%xmm0
  1c:   48 83 c0 10             add    $0x10,%rax
  20:   66 0f fe c1             paddd  %xmm1,%xmm0
  24:   0f 29 40 f0             movaps %xmm0,-0x10(%rax)
  28:   48 39 c2                cmp    %rax,%rdx
  2b:   75 eb                   jne    18 <test+0x18>
  2d:   f3 0f 7e 05 00 00 00    movq   0x0(%rip),%xmm0        # 35 <test+0x35>
  34:   00
  35:   83 05 00 00 00 00 01    addl   $0x1,0x0(%rip)        # 3c <test+0x3c>
  3c:   f3 0f 7e 0d 00 00 00    movq   0x0(%rip),%xmm1        # 44 <test+0x44>
  43:   00
  44:   66 0f fe c1             paddd  %xmm1,%xmm0
  48:   66 0f d6 05 00 00 00    movq   %xmm0,0x0(%rip)        # 50 <test+0x50>
  4f:   00
  50:   c3                      ret

which does the 4x vectorized loop followed by a 2x vectorized loopless
epilogue and a scalar increment of the one remaining element (the addl at
offset 35).
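
To make the split of the 99 iterations explicit, here is a small arithmetic sketch. The names split_trip_count and struct split are hypothetical and only illustrate the counting; they are not GCC internals. It assumes an even vectorization factor of at least 2.

```c
/* Illustrative only: hypothetical names, not GCC internals. */
struct split {
  int full_vectors;  /* iterations of the full-width main loop */
  int half_vectors;  /* half-width steps in the loopless epilogue */
  int scalars;       /* single elements left for scalar code */
};

/* Split a trip count n for vectorization factor vf (even, >= 2). */
struct split split_trip_count(int n, int vf)
{
  struct split s;
  int rem = n % vf;                 /* elements left after the main loop */
  s.full_vectors = n / vf;
  s.half_vectors = rem / (vf / 2);  /* 0 or 1 half-width step */
  s.scalars = rem % (vf / 2);       /* what remains after the half step */
  return s;
}
```

For n = 99 and vf = 4 this gives 24 full vectors, one half vector and one scalar element, matching the three parts of the code above.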

When the bound is unknown we do:

int a[99];
test(int n)
{
  for (int i = 0 ; i < n; i++)
    a[i]++;
}

   0:   85 ff                   test   %edi,%edi
   2:   7e 70                   jle    74 <test+0x74>
   4:   8d 47 ff                lea    -0x1(%rdi),%eax
   7:   83 f8 02                cmp    $0x2,%eax
   a:   76 6d                   jbe    79 <test+0x79>
   c:   89 fa                   mov    %edi,%edx
   e:   66 0f 6f 0d 00 00 00    movdqa 0x0(%rip),%xmm1        # 16 <test+0x16>
  15:   00
  16:   31 c0                   xor    %eax,%eax
  18:   c1 ea 02                shr    $0x2,%edx
  1b:   48 c1 e2 04             shl    $0x4,%rdx
  1f:   66 0f 6f 80 00 00 00    movdqa 0x0(%rax),%xmm0
  26:   00
  27:   48 83 c0 10             add    $0x10,%rax
  2b:   66 0f fe c1             paddd  %xmm1,%xmm0
  2f:   0f 29 80 00 00 00 00    movaps %xmm0,0x0(%rax)
  36:   48 39 d0                cmp    %rdx,%rax
  39:   75 e4                   jne    1f <test+0x1f>
  3b:   89 f8                   mov    %edi,%eax
  3d:   83 e0 fc                and    $0xfffffffc,%eax
  40:   40 f6 c7 03             test   $0x3,%dil
  44:   74 32                   je     78 <test+0x78>
  46:   48 63 d0                movslq %eax,%rdx
  49:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  50:   01
  51:   8d 50 01                lea    0x1(%rax),%edx
  54:   39 d7                   cmp    %edx,%edi
  56:   7e 1c                   jle    74 <test+0x74>
  58:   48 63 d2                movslq %edx,%rdx
  5b:   83 c0 02                add    $0x2,%eax
  5e:   83 04 95 00 00 00 00    addl   $0x1,0x0(,%rdx,4)
  65:   01
  66:   39 c7                   cmp    %eax,%edi
  68:   7e 0a                   jle    74 <test+0x74>
  6a:   48 98                   cltq
  6c:   83 04 85 00 00 00 00    addl   $0x1,0x0(,%rax,4)
  73:   01
  74:   c3                      ret
  75:   0f 1f 00                nopl   (%rax)
  78:   c3                      ret
  79:   31 c0                   xor    %eax,%eax
  7b:   eb c9                   jmp    46 <test+0x46>

The profitability threshold is 4. Producing the loopless epilogue just as in
the first case, with additional tests for block size, would work better. The
code looks quite bad for small trip counts since there is an extra jump down
to 79.
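
As a rough source-level picture of the suggested shape (a hand-written sketch, not compiler output): a 4-element main loop, then a loopless epilogue that tests for a remaining block of 2 and then a block of 1. The function name test_sketch, the macro N, and the assert precondition are assumptions added for illustration; each group of scalar increments stands in for one vector operation.

```c
#include <assert.h>

#define N 99
int a[N];

/* Hypothetical sketch of the suggested code shape; not GCC output. */
void test_sketch(int n)
{
  int i = 0;

  assert(n <= N);  /* precondition assumed for this sketch */

  /* Main loop: one 4-element (full vector) step per iteration. */
  for (; i + 4 <= n; i += 4) {
    a[i]++;
    a[i + 1]++;
    a[i + 2]++;
    a[i + 3]++;
  }

  /* Loopless epilogue: at most 3 elements remain here. */
  if (n - i >= 2) {   /* one 2-element (half vector) step */
    a[i]++;
    a[i + 1]++;
    i += 2;
  }
  if (n - i >= 1)     /* one scalar step for the last element */
    a[i]++;
}
```

With this shape a small trip count such as n = 3 skips the main loop and runs at most two straight-line tests, rather than jumping down to a separate scalar tail.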