On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > PR target/102294
> > > PR target/119596
> > > * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > > (generic_memset): Likewise.
> > > (generic_cost): Change CLEAR_RATIO to 17.
> > > * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > > Add m_GENERIC.
> >
> > Looking through the PRs, they are primarily about CLEAR_RATIO
> > being lower than on clang, which makes us produce a slower (but
> > smaller) initialization sequence for blocks of certain sizes.
> > It seems the kernel is discussed there too (-mno-sse).
> >
> > Bumping it up for SSE makes sense provided that SSE codegen does not
> > suffer from the long $0 immediates. I would say it is OK also for
> > -mno-sse provided speedups are quite noticeable, but it would be really
> > nice to solve this incrementally.
> >
> > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > that Intel chips like stosb for small blocks, since they are not
> > optimized for stosw/q. Zen seems to prefer stosq over stosb for
> > blocks up to 128 bytes.
> >
> > How does the loop version compare to stosb for blocks in the range
> > 1...128 bytes on Intel hardware?
> >
> > For the case where we can prove the block size is small but do not
> > know the exact size, I think using a loop or an unrolled sequence for
> > blocks up to, say, 128 bytes may work well for both.
> >
> > Honza
>
> My patch has a 256 byte threshold. Are you suggesting changing it
> to 128 bytes?
>
256 bytes was selected because MOVE_RATIO and CLEAR_RATIO are 17,
which allows up to 16 moves of 16 bytes each (16 * 16 = 256 bytes).
To lower the threshold to 128 bytes, MOVE_RATIO/CLEAR_RATIO would
have to be changed to 9. Do we want to do that?

-- 
H.J.
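
For reference, the arithmetic behind the two numbers can be sketched in C.
This is only an illustration, not code from the patch: the helper names are
hypothetical, and it assumes that a ratio of N permits at most N - 1 inline
moves and that each move is 16 bytes wide (one SSE register).

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: largest block size in bytes that
   is still expanded inline, assuming a ratio of N permits at most N - 1
   moves of BYTES_PER_MOVE bytes each.  */
static unsigned int
inline_threshold_bytes (unsigned int ratio, unsigned int bytes_per_move)
{
  assert (ratio >= 1);
  return (ratio - 1) * bytes_per_move;
}

/* Hypothetical inverse: the ratio needed for a given byte threshold,
   under the same assumption.  THRESHOLD is expected to be a multiple
   of BYTES_PER_MOVE.  */
static unsigned int
ratio_for_threshold (unsigned int threshold, unsigned int bytes_per_move)
{
  assert (bytes_per_move != 0 && threshold % bytes_per_move == 0);
  return threshold / bytes_per_move + 1;
}
```

Under these assumptions, inline_threshold_bytes (17, 16) gives 256 and
ratio_for_threshold (128, 16) gives 9, matching the figures above.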