>       PR target/102294
>       PR target/119596
>       * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
>       (generic_memset): Likewise.
>       (generic_cost): Change CLEAR_RATIO to 17.
>       * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
>       Add m_GENERIC.

Looking through the PRs, they are primarily about CLEAR_RATIO
being lower than in clang, which makes us produce a slower (but smaller)
initialization sequence for blocks of certain sizes.
It seems the kernel is discussed there too (-mno-sse).
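
A minimal sketch of the kind of code affected. The struct, its 128-byte
size and the function names are made up for illustration and are not
taken from the PRs:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical block; 128 bytes is chosen only for illustration. */
struct block {
    unsigned char bytes[128];
};

/* Clears a block whose size is known at compile time.  How the
   compiler expands this is what CLEAR_RATIO influences. */
void clear_block(struct block *b)
{
    memset(b, 0, sizeof *b);
}

/* Sums all bytes, so a caller can check that the block is zero. */
unsigned block_sum(const struct block *b)
{
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof b->bytes; i++)
        sum += b->bytes[i];
    return sum;
}
```

Comparing the gcc -O2 -S output of clear_block with clang's should show
whether the clear is expanded inline as individual stores or goes
through rep stos or a memset call. Which one we pick depends on the
tuning in use.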

Bumping it up for SSE makes sense provided that SSE codegen does not
suffer from the long $0 immediates. I would say it is OK also for
-mno-sse provided speedups are quite noticeable, but it would be really
nice to solve this incrementally.

Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
that Intel chips like stosb for small blocks, since they are not
optimized for stosw/q.  Zen seems to prefer stosq over stosb for
blocks up to 128 bytes.

How does the loop version compare to stosb for blocks in the range
1...128 bytes on Intel hardware?

For the case where we can prove the block size to be small but do not
know the exact size, I think using a loop or an unrolled sequence for
blocks up to, say, 128 bytes may work well for both.
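
To make the comparison concrete, here is a rough sketch of the two
variants one would time against each other for sizes 1...128. The
function names are hypothetical, the timing harness is left out, and
the inline asm assumes GNU C on x86 (other targets fall back to the
loop). The loop variant should probably be built with
-fno-tree-loop-distribute-patterns, since otherwise the compiler may
turn it into a memset call:

```c
#include <stddef.h>

/* Variant 1: plain byte loop.  The caller is assumed to guarantee
   n <= 128, which is the "small but unknown size" case. */
void clear_loop(unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = 0;
}

/* Variant 2: rep stosb.  Stores AL to [DI], CX times; the ABI
   guarantees the direction flag is clear on function entry. */
void clear_stosb(unsigned char *p, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("rep stosb"
                      : "+D" (p), "+c" (n)  /* both are updated */
                      : "a" (0)             /* fill value in AL */
                      : "memory");
#else
    clear_loop(p, n);  /* non-x86 fallback, for portability only */
#endif
}
```

Both variants should behave identically; only the timings per block
size on Intel and Zen are interesting.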

Honza
