[Bug c/119703] New: x86: spurious branches for inlined memset in ranges (40; 64) when requesting unrolled loops without simd

mjguzik at gmail dot com via Gcc-bugs Wed, 09 Apr 2025 23:53:36 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703


            Bug ID: 119703
           Summary: x86: spurious branches for inlined memset in ranges
                    (40; 64) when requesting unrolled loops without simd
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjguzik at gmail dot com
  Target Milestone: ---

gcc 13.3.0 shows the problem. I also tested on godbolt, which reports 15.0.1:

gcc (Compiler-Explorer-Build-gcc-ca4e6e6317ae0ceada8c46ef5db5ece165a6d1c4-binutils-2.42) 15.0.1 20250409 (experimental)

... and got the same result.

I have not verified memcpy, but I suspect it suffers from the same problem.

src:
void zero(char *buf)
{
        __builtin_memset(buf, 0, SIZE);
}

compiled like so:

cc -O2 -DSIZE=48 -mno-sse \
   -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign -c zero.c

objdump says:
0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   31 d2                   xor    %edx,%edx
   6:   31 c0                   xor    %eax,%eax
   8:   89 c1                   mov    %eax,%ecx
   a:   48 89 14 0f             mov    %rdx,(%rdi,%rcx,1)
   e:   48 89 54 0f 08          mov    %rdx,0x8(%rdi,%rcx,1)
  13:   48 89 54 0f 10          mov    %rdx,0x10(%rdi,%rcx,1)
  18:   48 89 54 0f 18          mov    %rdx,0x18(%rdi,%rcx,1)
  1d:   83 c0 20                add    $0x20,%eax
  20:   72 e6                   jb     8 <zero+0x8>
  22:   48 01 c7                add    %rax,%rdi
  25:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
  2c:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
  33:   00
  34:   c3                      ret

As you can see, it emits the body of a 32-byte loop, bumps eax and branches on
the result. But it knows eax is 0 (in fact it zeroed it itself), so the add can
never carry and the jb is never taken. Then it fills in the trailing 16 bytes.
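
To make the redundancy concrete, here is a rough C model of the listing above. It is a hand translation, not compiler output, and the function names (zero48_as_emitted, zero48_expected, same_result) are made up for illustration. The "expected" variant is only what one might hope for by analogy with smaller sizes, not something gcc is known to produce.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hand-written model of the SIZE=48 listing (an approximation). */
void zero48_as_emitted(char *buf)
{
    uint64_t zero = 0;              /* xor %edx,%edx */
    uint32_t i = 0;                 /* xor %eax,%eax */
    int carry;

    do {
        uint32_t off = i;           /* mov %eax,%ecx */
        memcpy(buf + off,      &zero, 8);
        memcpy(buf + off + 8,  &zero, 8);
        memcpy(buf + off + 16, &zero, 8);
        memcpy(buf + off + 24, &zero, 8);
        uint32_t next = i + 0x20;   /* add $0x20,%eax */
        carry = next < i;           /* CF: set only on 32-bit wraparound */
        i = next;
    } while (carry);                /* jb 8: never taken, i started at 0 */

    buf += i;                       /* add %rax,%rdi */
    memcpy(buf,     &zero, 8);      /* trailing 16 bytes */
    memcpy(buf + 8, &zero, 8);
}

/* Straight-line version: six 8-byte stores, no counter, no branch. */
void zero48_expected(char *buf)
{
    uint64_t zero = 0;
    for (int k = 0; k < 6; k++)
        memcpy(buf + 8 * k, &zero, 8);
}

/* Returns 1 if both variants leave identical buffer contents. */
int same_result(void)
{
    char a[64], b[64];
    memset(a, 0xff, sizeof a);
    memset(b, 0xff, sizeof b);
    zero48_as_emitted(a);
    zero48_expected(b);
    assert((unsigned char)a[48] == 0xff);   /* nothing past 48 bytes is touched */
    return memcmp(a, b, sizeof a) == 0;
}
```

The loop body runs exactly once, so the counter, the mov into ecx and the jb contribute nothing beyond the stores themselves.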

I verified this does not happen with 40:
0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
   b:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
  12:   00
  13:   48 c7 47 10 00 00 00    movq   $0x0,0x10(%rdi)
  1a:   00
  1b:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
  22:   00
  23:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
  2a:   00
  2b:   c3                      ret

This looks fine.

Note that 40 is the magic threshold: past it, gcc starts looking at rep unless
told otherwise.

Zeroing 41 bytes is also bad:
0000000000000000 <zero>:
   0:   f3 0f 1e fa             endbr64
   4:   31 d2                   xor    %edx,%edx
   6:   31 c0                   xor    %eax,%eax
   8:   89 c1                   mov    %eax,%ecx
   a:   48 89 14 0f             mov    %rdx,(%rdi,%rcx,1)
   e:   48 89 54 0f 08          mov    %rdx,0x8(%rdi,%rcx,1)
  13:   48 89 54 0f 10          mov    %rdx,0x10(%rdi,%rcx,1)
  18:   48 89 54 0f 18          mov    %rdx,0x18(%rdi,%rcx,1)
  1d:   83 c0 20                add    $0x20,%eax
  20:   72 e6                   jb     8 <zero+0x8>
  22:   48 01 c7                add    %rax,%rdi
  25:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
  2c:   c6 47 08 00             movb   $0x0,0x8(%rdi)
  30:   c3                      ret
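
For anyone who wants to check memcpy, which was not verified here, an analogous test case might look like the sketch below. This is untested: the file name copy.c and the function name copy are made up, and it assumes -mmemcpy-strategy accepts the same strategy string as -mmemset-strategy did in the command used for zero.c.

```c
/*
 * Hypothetical memcpy counterpart of zero.c (not verified).
 * Assumed build line, mirroring the memset one:
 *   cc -O2 -DSIZE=48 -mno-sse \
 *      -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign -c copy.c
 */
#ifndef SIZE
#define SIZE 48   /* default so the file also builds without -DSIZE */
#endif

void copy(char *dst, const char *src)
{
    __builtin_memcpy(dst, src, SIZE);
}
```

Comparing the objdump output for SIZE=40, 41 and 48 would show whether the same never-taken jb appears.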
