On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > >       PR target/102294
> > >       PR target/119596
> > >       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > >       (generic_memset): Likewise.
> > >       (generic_cost): Change CLEAR_RATIO to 17.
> > >       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > >       Add m_GENERIC.
> >
> > Looking through the PRs, they are primarily about CLEAR_RATIO
> > being lower than in clang, which makes us produce a slower (but smaller)
> > initialization sequence for blocks of certain sizes.
> > It seems the kernel is discussed there too (-mno-sse).
> >
> > Bumping it up for SSE makes sense provided that SSE codegen does not
> > suffer from the long $0 immediates. I would say it is OK also for
> > -mno-sse provided speedups are quite noticeable, but it would be really
> > nice to solve this incrementally.
> >
> > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > that Intel chips like stosb for small blocks, since they are not
> > optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> > blocks up to 128 bytes.
> >
> > How does the loop version compare to stosb for blocks in the range
> > 1...128 bytes on Intel hardware?
> >
> > For the case where we can prove the block size is small but do not
> > know the exact size, I think using a loop or unrolled code for blocks
> > up to, say, 128 bytes may work well for both.
> >
> > Honza
>
> My patch has a 256 byte threshold.  Are you suggesting changing it
> to 128 bytes?
>

256 bytes were selected since MOVE_RATIO and CLEAR_RATIO are 17, which
allows up to 16 moves of 16 bytes each (16 * 16 = 256 bytes).  To lower
the threshold to 128 bytes, MOVE_RATIO/CLEAR_RATIO would have to be
changed to 9 (8 * 16 = 128 bytes).  Do we want to do that?
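
As a rough sketch of that arithmetic: the helper names below are made up
for illustration and are not GCC code, and the 16-byte width per move is
an assumption (one SSE-sized move or clear).

```c
#include <stddef.h>

/* Hypothetical helpers, not part of GCC.  They only restate the
   ratio/threshold arithmetic, assuming that a ratio of N permits
   at most N - 1 inline moves or clears.  */

/* Largest block, in bytes, that is expanded inline for a given ratio.  */
static size_t
inline_threshold_bytes (unsigned ratio, size_t bytes_per_move)
{
  return ratio == 0 ? 0 : (size_t) (ratio - 1) * bytes_per_move;
}

/* Smallest ratio that still expands THRESHOLD bytes inline.  */
static unsigned
ratio_for_threshold (size_t threshold, size_t bytes_per_move)
{
  /* Round up to whole moves, then add one because the ratio is an
     exclusive bound on the number of moves.  */
  return (unsigned) ((threshold + bytes_per_move - 1) / bytes_per_move) + 1;
}
```

With 16-byte moves, a ratio of 17 corresponds to 256 bytes and a ratio
of 9 to 128 bytes.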


-- 
H.J.
