[AMD Official Use Only - AMD Internal Distribution Only]

Hi HJ,

> -----Original Message-----
> From: Kumar, Venkataramanan <[email protected]>
> Sent: Tuesday, November 4, 2025 3:24 PM
> To: Hongtao Liu <[email protected]>; H.J. Lu <[email protected]>
> Cc: GCC Patches <[email protected]>; Uros Bizjak <[email protected]>;
> Hongtao Liu <[email protected]>
> Subject: RE: [PATCH] x86-64: Inline memmove with overlapping unaligned loads
> and stores
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> Hi HJ,
>
> > -----Original Message-----
> > From: Hongtao Liu <[email protected]>
> > Sent: Monday, November 3, 2025 11:36 AM
> > To: H.J. Lu <[email protected]>
> > Cc: GCC Patches <[email protected]>; Uros Bizjak
> > <[email protected]>; Hongtao Liu <[email protected]>
> > Subject: Re: [PATCH] x86-64: Inline memmove with overlapping unaligned
> > loads and stores
> >
> > On Tue, Oct 28, 2025 at 11:21 AM Hongtao Liu <[email protected]> wrote:
> > >
> > > On Thu, Oct 23, 2025 at 10:15 AM H.J. Lu <[email protected]> wrote:
> > > >
> > > > > Inline memmove in 64-bit mode, since there are far fewer registers
> > > > > available in 32-bit mode:
> > > >
> > > > 1. Load all sources into registers and store them together to avoid
> > > >    possible address overlap between source and destination.
> > > > 2. For known size, first try to fully unroll with 8 registers.
> > > > 3. For size <= 2 * MOVE_MAX, load all sources into 2 registers first
> > > >    and then store them together.
> > > > 4. For size > 2 * MOVE_MAX and size <= 4 * MOVE_MAX, load all sources
> > > >    into 4 registers first and then store them together.
> > > > 5. For size > 4 * MOVE_MAX and size <= 8 * MOVE_MAX, load all sources
> > > >    into 8 registers first and then store them together.
> > > > 6. For size > 8 * MOVE_MAX,
> > > >    a. If address of destination > address of source, copy backward
> > > >       with a 4 * MOVE_MAX loop with unaligned loads and stores.  Load
> > > >       the first 4 * MOVE_MAX into 4 registers before the loop and
> > > >       store them after the loop to support overlapping addresses.
> > > >    b. Otherwise, copy forward with a 4 * MOVE_MAX loop with unaligned
> > > >       loads and stores.  Load the last 4 * MOVE_MAX into 4 registers
> > > >       before the loop and store them after the loop to support
> > > >       overlapping addresses.
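
For readers following the thread, steps 3 to 6 above can be sketched in
plain C.  This is only an illustration under stated assumptions, not the
code in the patch: CHUNK stands in for MOVE_MAX and is assumed to be 8
bytes, and the names load, store, move_ends, move_backward, move_forward
and sketch_memmove are hypothetical.  Step 2 (full unrolling for known
sizes) is not modelled, and sizes below CHUNK simply go through a small
temporary buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CHUNK stands in for MOVE_MAX; 8 bytes (one 64-bit GPR) is an assumption.  */
#define CHUNK 8

/* One unaligned load or store of CHUNK bytes, modelling one register.  */
static uint64_t load (const unsigned char *p)
{ uint64_t v; memcpy (&v, p, CHUNK); return v; }
static void store (unsigned char *p, uint64_t v)
{ memcpy (p, &v, CHUNK); }

/* Steps 3-5: for half * CHUNK <= n <= 2 * half * CHUNK, load `half' chunks
   from each end of the source (they may overlap), then store them all.  */
static void move_ends (unsigned char *d, const unsigned char *s, size_t n,
                       int half)
{
  uint64_t head[4], tail[4];
  for (int i = 0; i < half; i++)
    {
      head[i] = load (s + i * CHUNK);
      tail[i] = load (s + n - (i + 1) * CHUNK);
    }
  for (int i = 0; i < half; i++)
    {
      store (d + i * CHUNK, head[i]);
      store (d + n - (i + 1) * CHUNK, tail[i]);
    }
}

/* Step 6a: dst > src.  Save the first 4 chunks, copy from the end
   downwards, then store the saved chunks last.  */
static void move_backward (unsigned char *d, const unsigned char *s, size_t n)
{
  uint64_t head[4], t[4];
  for (int i = 0; i < 4; i++)
    head[i] = load (s + i * CHUNK);
  size_t end = n;
  while (end > 4 * CHUNK)
    {
      end -= 4 * CHUNK;
      for (int i = 0; i < 4; i++) t[i] = load (s + end + i * CHUNK);
      for (int i = 0; i < 4; i++) store (d + end + i * CHUNK, t[i]);
    }
  for (int i = 0; i < 4; i++)
    store (d + i * CHUNK, head[i]);
}

/* Step 6b: otherwise.  Save the last 4 chunks, copy from the start
   upwards, then store the saved chunks last.  */
static void move_forward (unsigned char *d, const unsigned char *s, size_t n)
{
  uint64_t tail[4], t[4];
  for (int i = 0; i < 4; i++)
    tail[i] = load (s + n - 4 * CHUNK + i * CHUNK);
  size_t off = 0;
  while (n - off > 4 * CHUNK)
    {
      for (int i = 0; i < 4; i++) t[i] = load (s + off + i * CHUNK);
      for (int i = 0; i < 4; i++) store (d + off + i * CHUNK, t[i]);
      off += 4 * CHUNK;
    }
  for (int i = 0; i < 4; i++)
    store (d + n - 4 * CHUNK + i * CHUNK, tail[i]);
}

void *sketch_memmove (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  if (n < CHUNK)
    {
      /* Not part of the quoted steps: tiny sizes via a temporary.  */
      unsigned char tmp[CHUNK];
      for (size_t i = 0; i < n; i++) tmp[i] = s[i];
      for (size_t i = 0; i < n; i++) d[i] = tmp[i];
    }
  else if (n <= 2 * CHUNK) move_ends (d, s, n, 1);
  else if (n <= 4 * CHUNK) move_ends (d, s, n, 2);
  else if (n <= 8 * CHUNK) move_ends (d, s, n, 4);
  else if ((uintptr_t) d > (uintptr_t) s) move_backward (d, s, n);
  else move_forward (d, s, n);
  return dst;
}
```

In every branch all loads that could be clobbered happen before the
stores that would clobber them, which is what makes overlapping source
and destination safe.  Calling sketch_memmove (buf + 3, buf, 100) on a
single buffer should therefore give the same bytes as memmove.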
> > > >
> > > > > Verified and benchmarked memmove implementations inlined with GPR,
> > > > > SSE2, AVX2 and AVX512 using the glibc memmove tests.  They are
> > > > > available at
> > > > >
> > > > > https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/test/memmove
> > > > >
> > > > > Their performance is comparable to that of the optimized memmove
> > > > > implementations in glibc on an Intel Core i7-1195G7.
> > > > I'll measure performance on SPEC and get back later; it could take
> > > > a couple of days.
> > No big performance impact on SPEC, and I checked the logic of
> > ix86_expand_movmem; it looks correct.
> >
> > So it LGTM.
>
> We will also measure the SPEC performance with Zen5 and get back to you.
>
> Regards,
> Venkat.

Sorry this took so long. We measured SPEC performance on Zen5 and did
not find any big performance impact.

Regards,
Suganya

>
> > > >
> > > > gcc/
> > > >
> > > > PR target/90262
> > > > * config/i386/i386-expand.cc (ix86_expand_unroll_movmem): New.
> > > > (ix86_expand_n_move_movmem): Likewise.
> > > > (ix86_expand_load_movmem): Likewise.
> > > > (ix86_expand_store_movmem): Likewise.
> > > > (ix86_expand_n_overlapping_move_movmem): Likewise.
> > > > (ix86_expand_less_move_movmem): Likewise.
> > > > (ix86_expand_movmem): Likewise.
> > > > * i386-protos.h (ix86_expand_movmem): Likewise.
> > > > * config/i386/i386.md (movmem<mode>): Likewise.
> > > >
> > > > gcc/testsuite/
> > > >
> > > > * gcc.target/i386/builtin-memmove-1a.c: New test.
> > > > * gcc.target/i386/builtin-memmove-1b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-1c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-1d.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-2a.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-2b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-2c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-2d.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-3a.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-3b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-3c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-4a.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-4b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-4c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-5a.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-5b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-5c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-6.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-7.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-8.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-9.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-10.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-11a.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-11b.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-11c.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-12.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-13.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-14.c: Likewise.
> > > > * gcc.target/i386/builtin-memmove-15.c: Likewise.
> > > >
> > > > OK for master?
> > > >
> > > > Thanks.
> > > >
> > > > --
> > > > H.J.
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao
