Hi,
while looking for x86 tuning issues I noticed PR81614 about partial register stalls on Core. We currently support two schemes for out-of-order CPUs: partial register dependencies, where registers are always renamed as a whole and it is therefore important to always write the complete register at the beginning of a dependency chain, and partial register stalls, where registers are renamed by parts and it is important not to read the full-size register after a partial write.
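
To make the two schemes concrete, here is a minimal C sketch (not part of the patch). The function names are hypothetical, and the assembly in the comments is illustrative rather than actual GCC output; which sequence a compiler picks depends on its version and tuning flags.

```c
#include <stdint.h>

/* Widening a byte load to 32 bits.  Two plausible x86-64 sequences:

     movzbl (%rdi), %eax    # writes the whole register; no dependence
                            # on the old contents of %eax

     movb   (%rdi), %al     # writes only the low byte of %eax
     ...                    # a later instruction reads %eax

   If registers are renamed as a whole, the movb has to merge with the
   old %eax, so it depends on whatever wrote %eax last (partial register
   dependency).  If registers are renamed by parts, the movb is
   independent, but the later full-width read has to combine the parts
   and may stall (partial register stall).  */
uint32_t widen_byte(const uint8_t *p)
{
  return *p;
}

/* Replacing only the low byte of a 32-bit value is the kind of source
   that naturally maps to a partial register write followed by a
   full-width read.  */
uint32_t set_low_byte(uint32_t x, uint8_t b)
{
  return (x & ~UINT32_C(0xff)) | b;
}
```

Comparing the assembly for these functions under different -mtune= settings is one way to see which scheme a given target is tuned for.
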
Core renames partial registers, like PentiumPro+, but it is currently set to partial reg dependency. The PR log also claims that there was a change in Haswell that avoids partial register stalls completely (how?). This is per Agner Fog's manual, and I have verified that dropping partial register dependency on Haswell produces no regressions and slightly reduces code size.

I plan to experiment with switching pre-Haswell cores to partial register stalls, but I need to find a setup for benchmarking (Vladimir is still running the regular tester on Conroe), so I plan to do that incrementally. Because AMD chips are all partial reg dependency, we will probably need to find a way to avoid both problems on the code sequences mentioned in the PRs. This is another incremental step.

Bootstrapped/regtested x86_64-linux and benchmarked on Haswell on Spec2k, Spec2k6, C++ benchmarks, Polyhedron and my own microbenchmarks, which I developed for partial register stalls/dependencies back in the PPro/K7 times.

I have also noticed:

#define m_CORE_ALL (m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_HASWELL)
#define m_SKYLAKE_AVX512 (1U<<PROCESSOR_SKYLAKE_AVX512)

Notice that skylake-avx512 is thus not included in Core tuning, which seems wrong. However, because of

  {"skylake-avx512", PROCESSOR_HASWELL, CPU_HASWELL, PTA_SKYLAKE_AVX512},

I think PROCESSOR_SKYLAKE_AVX512 is never set. It is used though:

    case PROCESSOR_SKYLAKE_AVX512:
      def_or_undef (parse_in, "__skylake_avx512");
      def_or_undef (parse_in, "__skylake_avx512__");
      break;

How is this supposed to work?

I will commit the patch tonight if there are no complaints.

Honza

	* x86-tune.def (X86_TUNE_PARTIAL_REG_DEPENDENCY, X86_TUNE_MOVX):
	Disable for Haswell and newer revisions of Core.

Index: x86-tune.def
===================================================================
--- x86-tune.def	(revision 254073)
+++ x86-tune.def	(working copy)
@@ -48,7 +48,8 @@ DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
    over partial stores.  For example preffer MOVZBL or MOVQ to load 8bit
    value over movb.  */
 DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
-          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
+          m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
+          | m_BONNELL | m_SILVERMONT | m_INTEL
           | m_KNL | m_KNM | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY: This knob promotes all store
@@ -84,8 +85,9 @@ DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STAL
 /* X86_TUNE_MOVX: Enable to zero extend integer registers to avoid
    partial dependencies.  */
 DEF_TUNE (X86_TUNE_MOVX, "movx",
-          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
-          | m_KNL | m_KNM | m_INTEL | m_GEODE | m_AMD_MULTIPLE | m_GENERIC)
+          m_PPRO | m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
+          | m_BONNELL | m_SILVERMONT | m_KNL | m_KNM | m_INTEL
+          | m_GEODE | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
    full sized loads.  */