https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811
--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 55101
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55101&action=edit
hottest loop

The jpegxl build machinery adds -fno-vectorize and -fno-slp-vectorize to the
clang flags. Adding -fno-tree-vectorize -fno-tree-slp-vectorize makes the
GCC-generated code more similar. With this, most of the difference is caused by
FindBestPatchDictionary, or by FindTextLikePatches if that function is not
inlined.

  15.22%  cjxl  libjxl.so.0.7.0  [.] jxl::(anonymous namespace)::FindTextLikePatches
  10.19%  cjxl  libjxl.so.0.7.0  [.] jxl::FindBestPatchDictionary
   5.27%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::QuantizeBlockAC
   5.06%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::EstimateEntropy
   4.82%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::EstimateEntropy
   4.35%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::QuantizeBlockAC
   4.21%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.87%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels
   3.78%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::FindBestMultiplier
   3.27%  cjxl  libjxl.so.0.7.0  [.] jxl::N_AVX2::FindBestMultiplier

I think it is mostly register allocation not handling the internal loop quoted
above well. I am adding preprocessed sources.