https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
Bug ID: 119703
Summary: x86: spurious branches for inlined memset in ranges (40; 64) when requesting unrolled loops without simd
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: mjguzik at gmail dot com
Target Milestone: ---

13.3.0 runs into it, but I also tested on godbolt, which claims to have 15.0.1:
gcc (Compiler-Explorer-Build-gcc-ca4e6e6317ae0ceada8c46ef5db5ece165a6d1c4-binutils-2.42) 15.0.1 20250409 (experimental)
... and got the same result.

I have not verified memcpy; I suspect it might suffer from the same problem.

src:

void zero(char *buf) {
        __builtin_memset(buf, 0, SIZE);
}

compiled like so:

cc -O2 -DSIZE=48 -mno-sse -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign -c zero.c

objdump says:

0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   31 d2                   xor    %edx,%edx
   6:   31 c0                   xor    %eax,%eax
   8:   89 c1                   mov    %eax,%ecx
   a:   48 89 14 0f             mov    %rdx,(%rdi,%rcx,1)
   e:   48 89 54 0f 08          mov    %rdx,0x8(%rdi,%rcx,1)
  13:   48 89 54 0f 10          mov    %rdx,0x10(%rdi,%rcx,1)
  18:   48 89 54 0f 18          mov    %rdx,0x18(%rdi,%rcx,1)
  1d:   83 c0 20                add    $0x20,%eax
  20:   72 e6                   jb     8 <zero+0x8>
  22:   48 01 c7                add    %rax,%rdi
  25:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
  2c:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
  33:   00
  34:   c3                      ret

As you can see, it emits the body of a 32-byte loop, bumps eax, and branches on it, even though it knows eax is 0; in fact, it zeroed eax itself. Then it fills in the trailing 16 bytes.

I verified this does not happen with 40:

0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
   b:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
  12:   00
  13:   48 c7 47 10 00 00 00    movq   $0x0,0x10(%rdi)
  1a:   00
  1b:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
  22:   00
  23:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
  2a:   00
  2b:   c3                      ret

This looks fine. Note that 40 is the magic threshold: past it, gcc starts considering rep unless told otherwise.

Zeroing 41 bytes is also bad:

0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   31 d2                   xor    %edx,%edx
   6:   31 c0                   xor    %eax,%eax
   8:   89 c1                   mov    %eax,%ecx
   a:   48 89 14 0f             mov    %rdx,(%rdi,%rcx,1)
   e:   48 89 54 0f 08          mov    %rdx,0x8(%rdi,%rcx,1)
  13:   48 89 54 0f 10          mov    %rdx,0x10(%rdi,%rcx,1)
  18:   48 89 54 0f 18          mov    %rdx,0x18(%rdi,%rcx,1)
  1d:   83 c0 20                add    $0x20,%eax
  20:   72 e6                   jb     8 <zero+0x8>
  22:   48 01 c7                add    %rax,%rdi
  25:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
  2c:   c6 47 08 00             movb   $0x0,0x8(%rdi)
  30:   c3                      ret
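For reference, here is a hand-translation of the SIZE=48 output into C (my own rendering, not compiler output; the function name is made up). It makes the dead branch explicit: the counter starts at a compile-time-known 0, a single add of 0x20 cannot set the carry flag, so the jb at 0x20 is statically never taken and the loop body runs exactly once:

#include <stdint.h>
#include <string.h>

void zero48_as_emitted(char *buf)
{
        uint64_t zero = 0;                      /* rdx: xor %edx,%edx */
        uint32_t off = 0;                       /* eax: xor %eax,%eax */
        do {
                memcpy(buf + off +  0, &zero, 8);
                memcpy(buf + off +  8, &zero, 8);
                memcpy(buf + off + 16, &zero, 8);
                memcpy(buf + off + 24, &zero, 8);
                off += 0x20;
        } while (off < 0x20);                   /* jb tests the carry out of the
                                                   add; 0 + 0x20 never carries,
                                                   so this is always false */
        buf += off;
        memcpy(buf + 0, &zero, 8);              /* trailing 16 bytes */
        memcpy(buf + 8, &zero, 8);
}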
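And a sketch of what one would expect for SIZE=48 instead, simply extending the straight-line pattern gcc already produces for SIZE=40 above (illustrative source, not a proposed patch):

#include <stdint.h>
#include <string.h>

void zero48_expected(char *buf)
{
        uint64_t zero = 0;
        /* six movq $0x0,off(%rdi) stores, no counter, no branch */
        memcpy(buf +  0, &zero, 8);
        memcpy(buf +  8, &zero, 8);
        memcpy(buf + 16, &zero, 8);
        memcpy(buf + 24, &zero, 8);
        memcpy(buf + 32, &zero, 8);
        memcpy(buf + 40, &zero, 8);
}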