Re: feature request: a linker option to avoid merging variables from separate object files into shared cache lines
On Tue, Nov 5, 2024 at 11:18 AM Florian Weimer wrote:
>
> * David Brown via Gcc:
>
> > I would have thought it would be better as part of the compiler.  For
> > each compilation unit, you generate one or more data sections
> > depending on the variable initialisations, compiler options and target
> > (.bss, .data, .rodata, .sbss, etc.).  If the compiler has
> > "-align-object-data-section=64" in effect, then you could add ".align
> > 64" to the start and the end of each section.  As far as I can see,
> > that would give the effect the OP is looking for in a simple manner
> > that can also be different for different translation units in the same
> > program (which would be hard to do with a linker flag).  And if this
> > is a flag in gcc, then it will work for any ld-compatible linker
> > (gold, mold, etc.).
>
> I agree it's more of a compiler flag from an implementation perspective.
> There's another aspect that supports this: I don't think we want to do
> this for data sections that we use to implement vague linkage.  Bumping
> the alignment for them could be really wasteful and probably not what
> the programmer intends.

Huh, this fell to the wayside on my end due to $reallife.

Anyhow, I have no opinion on where this should be implemented; I'm merely interested in the feature showing up. I believe I made a good enough case to justify its existence, but I don't know if anyone can be bothered to follow through with it -- I for one have zero spare cycles for this one. That said, someone(tm) picking it up would be most welcome. :)

Cheers,
-- 
Mateusz Guzik
Re: [RFC] Enabling SVE with offloading to nvptx
On Mon, Nov 04, 2024 at 10:21:58AM +, Andrew Stubbs wrote:
> @@ -999,6 +1000,18 @@ omp_max_vf (void)
>        && OPTION_SET_P (flag_tree_loop_vectorize)))
>      return 1;
>
> +  if (ENABLE_OFFLOADING && offload)
> +    {
> +      for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;)
> +	{
> +	  if (startswith (c, "amdgcn"))
> +	    return 64;
> +	  else if ((c = strchr (c, ':')))
> +	    c++;
> +	}
> +      /* Otherwise, fall through to host VF.  */
> +    }

This assumes that the host can't have max_vf more than 64, because the offload code isn't compiled just for a single offload target, but perhaps for multiple targets and the host as well.  I think it should be max (64, omp_max_vf (false)) in that case.  Though SVE/RISC-V can complicate that by omp_max_vf returning a poly_uint64.  Maybe upper_bound is the function to call?

> --- a/gcc/omp-expand.cc
> +++ b/gcc/omp-expand.cc
> @@ -229,7 +229,15 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule, bool offload)
>    if (!simd_schedule || integer_zerop (chunk_size))
>      return chunk_size;
>
> -  poly_uint64 vf = omp_max_vf (offload);
> +  if (offload)
> +    {
> +      cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> +      tree vf = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_MAX_VF,
> +					      unsigned_type_node, 0);
> +      return fold_convert (TREE_TYPE (chunk_size), vf);

This is incorrect.  The code below that doesn't return the vf returned by omp_max_vf, but instead returns (chunk_size + vf - 1) & -vf.  So, the question is whether we can rely on omp_max_vf to always return a power of two; clearly we rely on that already in the existing code.  And either it should build that (tmp = IFN_GOMP_MAX_VF (), (chunk_size + tmp - 1) & -tmp) expression as trees, or IFN_GOMP_MAX_VF should take an argument, the chunk_size, and be folded to (chunk_size + vf - 1) & -vf later.

	Jakub
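The adjustment Jakub describes can be sketched as a standalone helper (the name `adjust_chunk_size` here is illustrative, not the GCC function): the returned value is not the VF itself but chunk_size rounded up to a multiple of it, and the bitmask form is only valid when vf is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of (chunk_size + vf - 1) & -vf: round chunk_size up to the next
   multiple of vf.  -vf is the all-ones mask with the low log2(vf) bits
   clear, so the AND is only a correct rounding when vf is a power of two
   -- the property the existing code already relies on.  */

static uint64_t
adjust_chunk_size (uint64_t chunk_size, uint64_t vf)
{
  /* The bitwise trick requires a power-of-two vf.  */
  assert (vf != 0 && (vf & (vf - 1)) == 0);
  return (chunk_size + vf - 1) & -vf;
}
```

For example, a chunk size of 100 with a VF of 64 is bumped to 128, while an already-aligned chunk size is returned unchanged.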
RFC: IPA/LTO: Ordering functions for locality
Hi all,

I'd like to continue the discussion on teaching GCC to optimise code layout for locality between callees and callers.  This is work that we've been doing at NVIDIA, primarily Prachi Godbole (CC'ed) and myself.  This is a follow-up to the discussion we had at GNU Cauldron at the IPA/LTO BoF [1].  We're pretty far along in some implementation aspects, but some implementation and evaluation areas have questions that we'd like advice on.

Goals and motivation:

For some CPUs it is beneficial to minimise the branch distance between frequently called functions.  This effect is more pronounced for large applications, sometimes composed of multiple APIs/modules where each module has deep call chains that form a sort of callgraph cluster.  The effect is more pronounced for multi-DSO applications, but solving this cross-DSO is out of scope for this work.  We see value in having GCC minimise the branching distances within even single applications.

Design:

To perform this the compiler needs to see as much of the callgraph as possible, so naturally this should be performed during LTO.  Profile data from PGO is needed to determine the hot caller/callee relationships, so we require that as well.  However, it is possible that static non-PGO heuristics could do a good-enough job in many cases, but we haven't experimented with them extensively thus far.

The optimisation performs two things:
1) It partitions the callgraph into clusters based on the caller/callee hotness and groups the functions within those clusters together.
2) For functions in the callgraph that cross cluster boundaries we perform cloning so that each clone can be grouped close to its cluster for locality.

Implementation:

The partitioning 1) is done at the LTO partitioning stage through a new option to -flto-partition.  We add -flto-partition=locality.
At the Cauldron Richi suggested that maybe we could have a separate, dedicated clustering/partitioning pass for this; I'd like to check whether that's indeed the direction we want to take.

The cloning 2) is done separately in an IPA pass we're calling "IPA locality cloning".  This is currently run after pass_ipa_inline and before pass_ipa_pure_const.  We found that trying to do both partitioning and cloning in the same pass hit all kinds of asserts about function summaries not being valid.

Remaining TODOs, issues:

* For testing we added a bootstrap-lto-locality configuration that enables this optimisation for GCC bootstrap.  Currently the code bootstraps successfully with LTO bootstrap and profiledbootstrap with the locality partitioning and cloning enabled.  This gives us confidence that nothing is catastrophically wrong with the code.

* The bulk of the work was developed against GCC 12 because the motivating use case is a large internal workload that only supported GCC 12.  We've rebased the work to GCC trunk and updated the code to bootstrap and test there, but we'd appreciate the usual code review to check that it uses the appropriate GCC 15 APIs.  We're open to ideas about integrating these optimisations with existing passes to avoid duplication where possible.

* Thanks to Honza for pointing out previous work in this area by Martin Liska [2] that proposes a -freorder-functions-algorithm=call-chain-clustering option.  This looks like good work that we'd be interested in seeing go in.  We haven't evaluated it yet ourselves, but one thing it's missing is the cloning from our approach.  Also, the patch seems to rely on a .text.sorted. section in the linker.  Is that because we're worried about the linker doing further reordering of functions that invalidates this optimisation?  Could we do this optimisation without the linker section?  We're currently viewing it as an orthogonal optimisation that should be pursued on its own, but are interested in other ideas.
* The size of the clusters depends on the microarchitecture, and I think we'd want to control it through something like a param that target code can set.  We currently have a number of params that we added that control various aggressiveness settings around cluster size and cloning.  We would want to have sensible defaults or deduce them from code analysis if possible.

* In the absence of PGO data we're interested in developing some static heuristics to guide this.  One area where we'd like advice is how to detect functions that have been instantiated from the same template, as we find that they are usually the kind of functions that we want to keep together.  We are exploring a few options here and if we find something that works we'll propose them.

* Our prototype gives measurable differences in the large internal app that motivated this work.  We will be performing more benchmarking on workloads that we can share with the community, but generally the idea of laying out code to maximise locality is now an established Good Thing (TM) in toolchains, given the research from Facebook that Martin quotes [2] and the invention of tools like BOLT.  So I'm hoping the motiva
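The partitioning step described above can be modelled as a tiny greedy sketch, not the actual IPA pass: every function starts in its own cluster, call edges are visited hottest-first, and the endpoint clusters are merged unless the merge would exceed a size budget (standing in for the microarchitecture-specific param). All names and the budget of 3 functions per cluster are invented for illustration.

```c
#include <assert.h>

/* Toy model of hotness-driven callgraph clustering.  Functions are ints
   0..NFUNCS-1; edges carry a profile count and are assumed pre-sorted by
   decreasing count.  MAX_CLUSTER_SIZE plays the role of the param that
   target code would set.  */

#define NFUNCS 6
#define MAX_CLUSTER_SIZE 3

struct call_edge { int caller, callee; long count; };

static int cluster[NFUNCS];  /* cluster id of each function */
static int csize[NFUNCS];    /* number of functions in each cluster id */

static void
merge_clusters (int a, int b)
{
  if (a == b || csize[a] + csize[b] > MAX_CLUSTER_SIZE)
    return;  /* Respect the locality size budget.  */
  for (int f = 0; f < NFUNCS; f++)
    if (cluster[f] == b)
      cluster[f] = a;
  csize[a] += csize[b];
  csize[b] = 0;
}

static void
cluster_by_hotness (const struct call_edge *edges, int nedges)
{
  for (int f = 0; f < NFUNCS; f++)
    {
      cluster[f] = f;  /* Each function starts alone.  */
      csize[f] = 1;
    }
  /* Hottest edges first: merge the clusters of their endpoints.  */
  for (int i = 0; i < nedges; i++)
    merge_clusters (cluster[edges[i].caller], cluster[edges[i].callee]);
}
```

With edges 0->1, 1->2, 3->4, 2->3, 4->5 sorted by hotness, this groups {0,1,2} and {3,4,5}; the 2->3 edge crosses the boundary, which is exactly the situation where the cloning step of the proposal would duplicate a function into each cluster.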
[PATCH] PR target/117449: Restrict vector rotate match and split to pre-reload
Hi all,

The vector rotate splitter has some logic to deal with post-reload splitting but not all cases in aarch64_emit_opt_vec_rotate are post-reload-safe.  In particular the ROTATE+XOR expansion for TARGET_SHA3 can create RTL that can later be simplified to a simple ROTATE post-reload, which would then match the insn again and try to split it.  So do a clean split pre-reload and avoid going down this path post-reload by restricting the insn_and_split to can_create_pseudo_p ().

Bootstrapped and tested on aarch64-none-linux.  Pushing to trunk.

Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov

gcc/

	PR target/117449
	* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
	Match only when can_create_pseudo_p ().
	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate):
	Assume can_create_pseudo_p ().

gcc/testsuite/

	PR target/117449
	* gcc.c-torture/compile/pr117449.c: New test.

Attachment: 0001-PR-target-117449-Restrict-vector-rotate-match-and-sp.patch
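For context, the TARGET_SHA3 expansion mentioned above relies on the identity behind the XAR instruction, which rotates the XOR of its operands right by an immediate: a plain rotate is just XAR with a zero second operand, and an XAR whose second operand becomes zero simplifies back to a ROTATE, which is how the splitter could re-match its own output post-reload. A one-lane scalar model of the identity (helper names `ror64`/`xar64` are invented; rotate amounts must stay in [1, 63] to avoid an undefined shift):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate x right by n bits, 1 <= n <= 63.  */
static uint64_t
ror64 (uint64_t x, unsigned n)
{
  return (x >> n) | (x << (64 - n));
}

/* Scalar model of one XAR lane: rotate (a ^ b) right by n.
   With b == 0 this degenerates to a plain rotate of a.  */
static uint64_t
xar64 (uint64_t a, uint64_t b, unsigned n)
{
  return ror64 (a ^ b, n);
}
```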
Re: feature request: a linker option to avoid merging variables from separate object files into shared cache lines
* David Brown via Gcc:

> I would have thought it would be better as part of the compiler.  For
> each compilation unit, you generate one or more data sections
> depending on the variable initialisations, compiler options and target
> (.bss, .data, .rodata, .sbss, etc.).  If the compiler has
> "-align-object-data-section=64" in effect, then you could add ".align
> 64" to the start and the end of each section.  As far as I can see,
> that would give the effect the OP is looking for in a simple manner
> that can also be different for different translation units in the same
> program (which would be hard to do with a linker flag).  And if this
> is a flag in gcc, then it will work for any ld-compatible linker
> (gold, mold, etc.).

I agree it's more of a compiler flag from an implementation perspective.
There's another aspect that supports this: I don't think we want to do
this for data sections that we use to implement vague linkage.  Bumping
the alignment for them could be really wasteful and probably not what
the programmer intends.

Thanks,
Florian
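The effect the proposed flag would automate per data section can be shown by hand for two variables: forcing each onto its own 64-byte boundary (a common cache-line size, taken from the quoted proposal) guarantees that writers of `a` and `b` from different threads never contend for the same cache line.

```c
#include <assert.h>
#include <stdint.h>

/* Hand-written illustration of what a per-section alignment flag would do
   automatically: each variable gets its own 64-byte cache line, so
   cross-object false sharing between them is impossible.  */

__attribute__ ((aligned (64))) int a;
__attribute__ ((aligned (64))) int b;
```

The quoted proposal achieves the same thing wholesale by emitting ".align 64" at the boundaries of each translation unit's data sections, so variables from different object files can never share a line, without annotating every variable.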