> 	PR target/102294
> 	PR target/119596
> 	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> 	(generic_memset): Likewise.
> 	(generic_cost): Change CLEAR_RATIO to 17.
> 	* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> 	Add m_GENERIC.

Looking through the PRs, they are primarily about CLEAR_RATIO being lower
than in clang, which makes us produce a slower (but smaller) initialization
sequence for blocks of certain sizes. It seems the kernel is discussed
there too (-mno-sse).

Bumping it up for SSE makes sense, provided that SSE codegen does not
suffer from the long $0 immediates. I would say it is also OK for -mno-sse,
provided the speedups are quite noticeable, but it would be really nice to
solve this incrementally.

Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is that
Intel chips like stosb for small blocks, since they are not optimized for
stosw/q. Zen seems to prefer stosq over stosb for blocks up to 128 bytes.

How does the loop version compare to stosb for blocks in the range 1...128
bytes on Intel hardware? For the case where we can prove the block size is
small but do not know the exact size, I think using a loop or an unrolled
loop for blocks up to, say, 128 bytes may work well for both.

Honza
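
To make the "loop" alternative concrete, here is a minimal C-level sketch of what clearing a small block of unknown size with word-sized stores looks like. It is only an illustration: clear_loop and check_clear are hypothetical names, and this is not the sequence the i386 expander actually emits. It may still serve as a starting point for timing against rep stosb over sizes 1...128.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not GCC code: clear N bytes at DST using 8-byte
   stores followed by a byte-sized tail, roughly the shape of a "loop"
   expansion for a block whose size is small but not known exactly.  */
static void
clear_loop (void *dst, size_t n)
{
  unsigned char *p = dst;
  const uint64_t zero = 0;

  while (n >= sizeof zero)
    {
      /* memcpy avoids alignment and aliasing problems; compilers
	 typically turn this into a single 8-byte store.  */
      memcpy (p, &zero, sizeof zero);
      p += sizeof zero;
      n -= sizeof zero;
    }

  /* Remaining 0..7 bytes.  */
  while (n--)
    *p++ = 0;
}

/* Return 1 if clear_loop zeroes exactly N bytes (N <= 128) and leaves
   the byte right after the block untouched, 0 otherwise.  */
static int
check_clear (size_t n)
{
  unsigned char buf[129];

  memset (buf, 0xff, sizeof buf);
  clear_loop (buf, n);

  for (size_t i = 0; i < n; i++)
    if (buf[i] != 0)
      return 0;
  return buf[n] == 0xff;
}
```

When benchmarking something like this, keep in mind that the optimizer may recognize such loops and replace them with a memset call, so the generated assembly should be checked before drawing conclusions.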