Re: feature request: a linker option to avoid merging variables from separate object files into shared cache lines

2024-11-05 Thread Mateusz Guzik via Gcc
On Tue, Nov 5, 2024 at 11:18 AM Florian Weimer  wrote:
>
> * David Brown via Gcc:
>
> > I would have thought it would be better as part of the compiler.  For
> > each compilation unit, you generate one or more data sections
> > depending on the variable initialisations, compiler options and target
> > (.bss, .data, .rodata, .sbss, etc.).  If the compiler has
> > "-align-object-data-section=64" in effect, then you could add ".align
> > 64" to the start and the end of each section.  As far as I can see,
> > that would give the effect the OP is looking for in a simple manner
> > that can also be different for different translation units in the same
> > program (which would be hard to do with a linker flag).  And if this
> > is a flag in gcc, then it will work for any ld-compatible linker
> > (gold, mold, etc.).
>
> I agree it's more of a compiler flag from an implementation perspective.
> There's another aspect that supports this: I don't think we want to do
> this for data sections that we use to implement vague linkage.  Bumping
> the alignment for them could be really wasteful and probably not what
> the programmer intends.
>

Huh, this fell by the wayside on my end due to $reallife.

Anyhow, I have no opinion on where this should be implemented; I'm merely
interested in the feature showing up.

I believe I made a good enough case to justify its existence, but I
don't know if anyone can be bothered to follow through with it -- I
for one have no spare cycles for this one. That said, someone(tm)
picking it up would be most welcome. :)

Cheers,
-- 
Mateusz Guzik 


Re: [RFC] Enabling SVE with offloading to nvptx

2024-11-05 Thread Jakub Jelinek via Gcc
On Mon, Nov 04, 2024 at 10:21:58AM +, Andrew Stubbs wrote:
> @@ -999,6 +1000,18 @@ omp_max_vf (void)
> && OPTION_SET_P (flag_tree_loop_vectorize)))
>  return 1;
>  
> +  if (ENABLE_OFFLOADING && offload)
> +{
> +  for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;)
> + {
> +   if (startswith (c, "amdgcn"))
> + return 64;
> +   else if ((c = strchr (c, ':')))
> + c++;
> + }
> +  /* Otherwise, fall through to host VF.  */
> +}

This assumes that the host can't have a max_vf of more than 64.
But the offload code isn't compiled just for a single offload target;
it may be compiled for multiple targets and for the host as well.
I think it should be max (64, omp_max_vf (false)) in that case.
Though SVE/RISC-V can complicate that by omp_max_vf returning a poly_uint64.
Maybe upper_bound is the function to call?

> --- a/gcc/omp-expand.cc
> +++ b/gcc/omp-expand.cc
> @@ -229,7 +229,15 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule, bool offload)
>if (!simd_schedule || integer_zerop (chunk_size))
>  return chunk_size;
>  
> -  poly_uint64 vf = omp_max_vf (offload);
> +  if (offload)
> +{
> +  cfun->curr_properties &= ~PROP_gimple_lomp_dev;
> +  tree vf = build_call_expr_internal_loc (UNKNOWN_LOCATION, IFN_GOMP_MAX_VF,
> +  unsigned_type_node, 0);
> +  return fold_convert (TREE_TYPE (chunk_size), vf);

This is incorrect.
The code below this point doesn't return the vf returned by omp_max_vf;
instead it returns (chunk_size + vf - 1) & -vf.

So, the question is whether we can rely on omp_max_vf always returning a
power of two; clearly the existing code already relies on that.

And, either it should build that
(tmp = IFN_GOMP_MAX_VF (), (chunk_size + tmp - 1) & -tmp)
expression as trees, or IFN_GOMP_MAX_VF should take an argument,
the chunk_size, and be folded to (chunk_size + vf - 1) & -vf
later.

Jakub



RFC: IPA/LTO: Ordering functions for locality

2024-11-05 Thread Kyrylo Tkachov via Gcc
Hi all,

I'd like to continue the discussion on teaching GCC to optimise code layout
for locality between callees and callers. This is work that we've been doing
at NVIDIA, primarily Prachi Godbole (CC'ed) and myself.
This is a follow-up to the discussion we had at GNU Cauldron at the IPA/LTO
BoF [1]. We're pretty far along in some implementation aspects, but some
implementation and evaluation areas have questions that we'd like advice on.

Goals and motivation:
For some CPUs it is beneficial to minimise the branch distance between
frequently called functions. This effect is more pronounced for large
applications, sometimes composed of multiple APIs/modules where each module
has deep call chains that form a sort of callgraph cluster. The effect is
even stronger for multi-DSO applications, but solving this cross-DSO is out
of scope for this work. We see value in having GCC minimise the branching
distances even within single applications.

Design:
To do this the compiler needs to see as much of the callgraph as possible,
so naturally it should be performed during LTO. Profile data from PGO is
needed to determine the hot caller/callee relationships, so we require that
as well. It is possible that static non-PGO heuristics could do a
good-enough job in many cases, but we haven't experimented with them
extensively thus far. The optimisation does two things:
1) It partitions the callgraph into clusters based on the caller/callee hotness
and groups the functions within those clusters together.
2) For functions in the callgraph that cross cluster boundaries we perform
cloning so that each clone can be placed close to its cluster for locality.

Implementation:
The partitioning in 1) is done at the LTO partitioning stage through a new
option to -flto-partition: we add -flto-partition=locality. At the Cauldron,
Richi suggested that maybe we could have a separate dedicated
clustering/partitioning pass for this; I'd like to check whether that's indeed
the direction we want to take.
The cloning in 2) is done separately in an IPA pass we're calling "IPA
locality cloning". This is currently run after pass_ipa_inline and before
pass_ipa_pure_const. We found that trying to do both partitioning and cloning
in the same pass hit all kinds of asserts about function summaries not being
valid.

Remaining TODOs, issues:
* For testing we added a bootstrap-lto-locality configuration that enables this
optimisation for GCC bootstrap. Currently the code bootstraps successfully
with LTO bootstrap and profiledbootstrap with the locality partitioning and
cloning enabled. This gives us confidence that nothing is catastrophically
wrong with the code.

* The bulk of the work was developed against GCC 12 because the motivating use
case is a large internal workload that only supported GCC 12. We've rebased
the work to GCC trunk and updated the code to bootstrap and test there, but
we'd appreciate the usual code review to confirm it uses the appropriate GCC
15 APIs.
We're open to ideas about integrating these optimisations with existing passes
to avoid duplication where possible.

* Thanks to Honza for pointing out previous work in this area by Martin Liska 
[2]
that proposes a -freorder-functions-algorithm=call-chain-clustering option.
This looks like good work that we'd be interested in seeing go in. We haven't
evaluated it yet ourselves but one thing it's missing is the cloning from our
approach. Also the patch seems to rely on a .text.sorted. section in the
linker. Is that because we're worried about the linker doing further
reordering of functions that invalidates this optimisation? Could we do this
optimisation without the linker section? We're currently viewing it as an
orthogonal optimisation that should be pursued on its own, but are interested
in other ideas.

* The size of the clusters depends on the microarchitecture and I think we'd
want to control its size through something like a param that target code can
set. We currently have a number of params that we added that control various
aggressiveness settings around cluster size and cloning. We would want to
have sensible defaults or deduce them from code analysis if possible.

* In the absence of PGO data we're interested in developing some static
heuristics to guide this. One area where we'd like advice is how to detect
functions that have been instantiated from the same template, as we find that
they are usually the kind of functions that we want to keep together.
We are exploring a few options here and if we find something that works we’ll
propose them.

* Our prototype gives measurable differences in the large internal app that
motivated this work. We will be performing more benchmarking on workloads that
we can share with the community, but generally the idea of laying out code to
maximise locality is now an established Good Thing (TM) in toolchains given the
research from Facebook that Martin quotes [2] and the invention of tools like
BOLT. So I'm hoping the motivation is clear.

[PATCH] PR target/117449: Restrict vector rotate match and split to pre-reload

2024-11-05 Thread Kyrylo Tkachov via Gcc
Hi all,

The vector rotate splitter has some logic to deal with post-reload splitting
but not all cases in aarch64_emit_opt_vec_rotate are post-reload-safe.
In particular the ROTATE+XOR expansion for TARGET_SHA3 can create RTL that
can later be simplified to a simple ROTATE post-reload, which would then
match the insn again and try to split it.
So do a clean split pre-reload and avoid going down this path post-reload
by restricting the insn_and_split to can_create_pseudo_p ().

Bootstrapped and tested on aarch64-none-linux.
Pushing to trunk.
Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov 
gcc/

PR target/117449
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
Match only when can_create_pseudo_p ().
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Assume
can_create_pseudo_p ().

gcc/testsuite/

PR target/117449
* gcc.c-torture/compile/pr117449.c: New test.



0001-PR-target-117449-Restrict-vector-rotate-match-and-sp.patch
Description: 0001-PR-target-117449-Restrict-vector-rotate-match-and-sp.patch


Re: feature request: a linker option to avoid merging variables from separate object files into shared cache lines

2024-11-05 Thread Florian Weimer via Gcc
* David Brown via Gcc:

> I would have thought it would be better as part of the compiler.  For
> each compilation unit, you generate one or more data sections
> depending on the variable initialisations, compiler options and target
> (.bss, .data, .rodata, .sbss, etc.).  If the compiler has
> "-align-object-data-section=64" in effect, then you could add ".align
> 64" to the start and the end of each section.  As far as I can see,
> that would give the effect the OP is looking for in a simple manner
> that can also be different for different translation units in the same
> program (which would be hard to do with a linker flag).  And if this
> is a flag in gcc, then it will work for any ld-compatible linker
> (gold, mold, etc.).

I agree it's more of a compiler flag from an implementation perspective.
There's another aspect that supports this: I don't think we want to do
this for data sections that we use to implement vague linkage.  Bumping
the alignment for them could be really wasteful and probably not what
the programmer intends.

Thanks,
Florian