On Mon, Jul 7, 2025 at 3:27 PM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Tue, Jun 24, 2025 at 2:11 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Mon, Jun 23, 2025 at 2:24 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > On Wed, Jun 18, 2025 at 3:17 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > 1. Don't generate the loop if the loop count is 1.
> > > > 2. For memset with vector on small size, use vector if the small size
> > > >    supports vector, otherwise use the scalar value.
> > > > 3. Duplicate the promoted scalar value for vector.
> > > > 4. Always expand vector-version of memset for vector_loop.
> > > > 5. Use misaligned prologue if alignment isn't needed.  When misaligned
> > > >    prologue is used, check if destination is actually aligned and update
> > > >    destination alignment if aligned.
> > > >
> > > > The included tests show that codegen of vector_loop/unrolled_loop for
> > > > memset/memcpy is significantly improved.  For
> > > >
> > > > ---
> > > > void
> > > > foo (void *p1, size_t len)
> > > > {
> > > >   __builtin_memset (p1, 0, len);
> > > > }
> > > > ---
> > > >
> > > > with
> > > >
> > > > -O2 -minline-all-stringops
> > > > -mmemset-strategy=vector_loop:256:noalign,libcall:-1:noalign
> > > > -march=x86-64
> > > >
> > > > we used to generate
> > > >
> > > > foo:
> > > > .LFB0:
> > > >         .cfi_startproc
> > > >         movq    %rdi, %rax
> > > >         pxor    %xmm0, %xmm0
> > > >         cmpq    $64, %rsi
> > > >         jnb     .L18
> > > > .L2:
> > > >         andl    $63, %esi
> > > >         je      .L1
> > > >         xorl    %edx, %edx
> > > >         testb   $1, %sil
> > > >         je      .L5
> > > >         movl    $1, %edx
> > > >         movb    $0, (%rax)
> > > >         cmpq    %rsi, %rdx
> > > >         jnb     .L19
> > > > .L5:
> > > >         movb    $0, (%rax,%rdx)
> > > >         movb    $0, 1(%rax,%rdx)
> > > >         addq    $2, %rdx
> > > >         cmpq    %rsi, %rdx
> > > >         jb      .L5
> Lili found that the regression of 527.cam4_r (PR120943) is caused by
> more instructions: the epilogue uses the movb instruction (which takes
> more iterations) instead of the original movq.
> The patch optimizes it with vector moves and solves the issue.
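For illustration only (this is not code from the patch): a minimal C sketch,
assuming SSE2 intrinsics and made-up helper names, of the difference between
a byte-at-a-time epilogue like the movb loop quoted above and a tail handled
with a single overlapping vector store, which is roughly what the improved
code generation does for the trailing bytes.

---
#include <emmintrin.h>
#include <stddef.h>

/* Byte-at-a-time tail, analogous to the movb loop quoted above:
   clear the last LEN % 16 bytes one byte per iteration.  */
static void
clear_tail_bytes (char *p, size_t len)
{
  for (size_t i = len & ~(size_t) 15; i < len; i++)
    p[i] = 0;
}

/* Vector tail: when LEN >= 16, a single misaligned 16-byte store at
   P + LEN - 16 covers the whole remainder.  It may overlap bytes the
   main loop already cleared, which is harmless for memset.  */
static void
clear_tail_vector (char *p, size_t len)
{
  __m128i zero = _mm_setzero_si128 ();

  if (len >= 16)
    _mm_storeu_si128 ((__m128i *) (p + len - 16), zero);
  else
    for (size_t i = 0; i < len; i++)
      p[i] = 0;
}
---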
With range information on len, the memset below is also inlined with
vector_loop, even without specifying
-mmemset-strategy=vector_loop:256:noalign,libcall:-1:noalign:

void
foo (void *p1, int len)
{
  if (len < 256)
    __builtin_memset (p1, 0, len);
}

> > > > .L1:
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L18:
> > > >         movq    %rsi, %rdx
> > > >         xorl    %eax, %eax
> > > >         andq    $-64, %rdx
> > > > .L3:
> > > >         movups  %xmm0, (%rdi,%rax)
> > > >         movups  %xmm0, 16(%rdi,%rax)
> > > >         movups  %xmm0, 32(%rdi,%rax)
> > > >         movups  %xmm0, 48(%rdi,%rax)
> > > >         addq    $64, %rax
> > > >         cmpq    %rdx, %rax
> > > >         jb      .L3
> > > >         addq    %rdi, %rax
> > > >         jmp     .L2
> > > > .L19:
> > > >         ret
> > > >         .cfi_endproc
> > > >
> > > > with very poor prologue/epilogue.  With this patch, we now generate:
> > > >
> > > > foo:
> > > > .LFB0:
> > > >         .cfi_startproc
> > > >         pxor    %xmm0, %xmm0
> > > >         cmpq    $64, %rsi
> > > >         jnb     .L2
> > > >         testb   $32, %sil
> > > >         jne     .L19
> > > >         testb   $16, %sil
> > > >         jne     .L20
> > > >         testb   $8, %sil
> > > >         jne     .L21
> > > >         testb   $4, %sil
> > > >         jne     .L22
> > > >         testq   %rsi, %rsi
> > > >         jne     .L23
> > > > .L1:
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L2:
> > > >         movups  %xmm0, -64(%rdi,%rsi)
> > > >         movups  %xmm0, -48(%rdi,%rsi)
> > > >         movups  %xmm0, -32(%rdi,%rsi)
> > > >         movups  %xmm0, -16(%rdi,%rsi)
> > > >         subq    $1, %rsi
> > > >         cmpq    $64, %rsi
> > > >         jb      .L1
> > > >         andq    $-64, %rsi
> > > >         xorl    %eax, %eax
> > > > .L9:
> > > >         movups  %xmm0, (%rdi,%rax)
> > > >         movups  %xmm0, 16(%rdi,%rax)
> > > >         movups  %xmm0, 32(%rdi,%rax)
> > > >         movups  %xmm0, 48(%rdi,%rax)
> > > >         addq    $64, %rax
> > > >         cmpq    %rsi, %rax
> > > >         jb      .L9
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L23:
> > > >         movb    $0, (%rdi)
> > > >         testb   $2, %sil
> > > >         je      .L1
> > > >         xorl    %eax, %eax
> > > >         movw    %ax, -2(%rdi,%rsi)
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L19:
> > > >         movups  %xmm0, (%rdi)
> > > >         movups  %xmm0, 16(%rdi)
> > > >         movups  %xmm0, -32(%rdi,%rsi)
> > > >         movups  %xmm0, -16(%rdi,%rsi)
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L20:
> > > >         movups  %xmm0, (%rdi)
> > > >         movups  %xmm0, -16(%rdi,%rsi)
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L21:
> > > >         movq    $0, (%rdi)
> > > >         movq    $0, -8(%rdi,%rsi)
> > > >         ret
> > > >         .p2align 4,,10
> > > >         .p2align 3
> > > > .L22:
> > > >         movl    $0, (%rdi)
> > > >         movl    $0, -4(%rdi,%rsi)
> > > >         ret
> > > >         .cfi_endproc
> > > >
> > >
> > > Here is the v2 patch with the memset improvements:
> > >
> > > 1. Always duplicate the promoted scalar value for vector_loop if not 0
> > >    nor -1.
> > > 2. Update setmem_epilogue_gen_val to use the RTL info from the previous
> > >    iteration.
> > >
> > > OK for master?
> > >
> >
> > Here is the v3 patch rebased against
> >
> > commit d073bb6cfc219d4b6c283a0b527ee88b42e640e0
> > Author: H.J. Lu <hjl.to...@gmail.com>
> > Date:   Thu Mar 18 18:43:10 2021 -0700
> >
> >     x86: Update memcpy/memset inline strategies for -mtune=generic
> >
> > OK for master?
> Ok for trunk.
> >
> > Thanks.
> >
> > --
> > H.J.
>
> --
> BR,
> Hongtao

--
BR,
Hongtao
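Illustration only, not the patch's actual RTL implementation: a minimal C
sketch, assuming SSE2 intrinsics and a made-up function name, of what
"duplicate the promoted scalar value for vector" in the v2 change list above
refers to; the memset byte is first promoted to a wider scalar and that
scalar is then broadcast across the vector register used by the store loop.

---
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: build the vector used by the vector_loop stores.  */
static __m128i
broadcast_memset_value (int c)
{
  /* Promote the byte to a 64-bit scalar: 0x0101010101010101 * (c & 0xff).  */
  uint64_t promoted = (uint64_t) (c & 0xff) * 0x0101010101010101ULL;

  /* Duplicate the promoted scalar into both 64-bit lanes of the vector.  */
  return _mm_set1_epi64x ((long long) promoted);
}
---

The 0 and -1 exceptions in the v2 change above exist presumably because an
all-zeros or all-ones vector can be materialized directly (e.g. pxor, as in
the generated code above), so no broadcast of a promoted scalar is needed.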