RE: Should ARMv8-A generic tuning default to -moutline-atomics

2020-04-29 Thread Kyrylo Tkachov
Hi Florian,

> -Original Message-
> From: Gcc  On Behalf Of Florian Weimer via Gcc
> Sent: 29 April 2020 13:33
> To: gcc@gcc.gnu.org
> Cc: nmeye...@amzn.com
> Subject: Should ARMv8-A generic tuning default to -moutline-atomics
> 
> Distributions are receiving requests to build things with
> -moutline-atomics:
> 
>   
> 
> Should this be reflected in the GCC upstream defaults for ARMv8-A
> generic tuning?  It does not make much sense to me if every distribution
> has to override these flags, either in their build system or by patching
> GCC.

I don't think this is a "tuning" decision as such; it is a useful feature for 
deploying LSE in a backwards-compatible manner.
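Just to spell out what the option does at the source level (rough sketch):

#include <stdatomic.h>

/* With -moutline-atomics this is compiled to a call into a small libgcc
   helper that uses LSE atomics when the CPU supports them and falls back
   to a load/store-exclusive loop otherwise, so one binary covers both.  */
int
bump (atomic_int *counter)
{
  return atomic_fetch_add_explicit (counter, 1, memory_order_acq_rel);
}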
I would support  a GCC configure option that would allow distributions to 
default GCC to it.
WDYT?

Thanks,
Kyrill

> 
> Thanks,
> Florian



RE: Should ARMv8-A generic tuning default to -moutline-atomics

2020-04-30 Thread Kyrylo Tkachov
Hi Michael,

> -Original Message-
> From: Gcc  On Behalf Of Michael Matz
> Sent: 30 April 2020 12:10
> To: Florian Weimer 
> Cc: gcc@gcc.gnu.org; nmeye...@amzn.com
> Subject: Re: Should ARMv8-A generic tuning default to -moutline-atomics
> 
> Hello,
> 
> On Wed, 29 Apr 2020, Florian Weimer via Gcc wrote:
> 
> > Distributions are receiving requests to build things with
> > -moutline-atomics:
> >
> >   
> >
> > Should this be reflected in the GCC upstream defaults for ARMv8-A
> > generic tuning?  It does not make much sense to me if every distribution
> > has to override these flags, either in their build system or by patching
> > GCC.
> 
> Yep, same here.  It would be nicest if upstream would switch to
> outline-atomics by default on armv8-a :-)  (the problem with build system
> overrides is that some compilers don't understand the option, complicating
> the overrides; and patching the GCC package would create a deviation from
> upstream also for users)

Thanks for your input.
I've posted a couple of possible patches for this here:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544923.html
Kyrill

> 
> 
> Ciao,
> Michael.


RE: GCC 10.0.1 Status Report (2020-04-30)

2020-05-05 Thread Kyrylo Tkachov



> -Original Message-
> From: Gcc  On Behalf Of Jakub Jelinek via Gcc
> Sent: 30 April 2020 18:11
> To: gcc@gcc.gnu.org
> Subject: GCC 10.0.1 Status Report (2020-04-30)
> 
> Status
> ==
> 
> We have reached zero P1 regressions today and releases/gcc-10 branch has
> been created;  GCC 10.1-rc1 will be built and announced later tonight
> or tomorrow.
> The branch is now frozen for blocking regressions and documentation
> fixes only, all changes to the branch require a RM approval now.
> 
> If no show stoppers appear, we'd like to release 10.1 late next week,
> or soon after that, if any important issues are discovered and
> fixed, rc2 could be released next week.

Bootstrap and testing on aarch64-none-linux-gnu and arm-none-linux-gnueabihf 
successful.
I've also built defconfig and allyesconfig kernels for both successfully.
I've tried -fanalyzer on an aarch64 (arm64) kernel and it gave some interesting 
results although it did hit what I think is:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94689

I may commit https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545143.html to 
the GCC 10 branch, if possible, to fix the OOL atomics issue with glibc once I 
get a sanity check from Joseph and/or Florian.
Thanks,
Kyrill

> 
> 
> Quality Data
> 
> 
> Priority          #   Change from last report
> --------        ---   -----------------------
> P1                0   -  21
> P2              208   -  14
> P3               14   -   1
> P4              173   -   5
> P5               21   -   2
> --------        ---   -----------------------
> Total P1-P3     222   -  36
> Total           416   -  43
> 
> 
> Previous Report
> ===
> 
> https://gcc.gnu.org/pipermail/gcc/2020-April/000242.html



Re: Right way to represent flag-setting arithmetic instructions in MD files

2017-03-10 Thread Kyrylo Tkachov


On 10/03/17 10:38, Jakub Jelinek wrote:

On Fri, Mar 10, 2017 at 10:10:34AM +, Kyrill Tkachov wrote:

Hi all,

Some (many?) targets have instructions that perform an arithmetic operation and 
set the condition flags based on the result.
For example, on aarch64, we have instructions like ADDS, SUBS, ANDS etc.
In the machine description we represent them as a PARALLEL pattern of a COMPARE 
and the arithmetic operation.
For example, the ADDS instruction is represented as:

(define_insn "add3_compare0"
   [(set (reg:CC_NZ CC_REGNUM)
 (compare:CC_NZ
  (plus:GPI (match_operand:GPI 1 "register_operand" "%r,r,r")
(match_operand:GPI 2 "aarch64_plus_operand" "r,I,J"))
  (const_int 0)))
(set (match_operand:GPI 0 "register_operand" "=r,r,r")
 (plus:GPI (match_dup 1) (match_dup 2)))]

My understanding was that the order of the two in this pattern here doesn't 
matter because there is
an implicit PARALLEL around them, but I found that the compare-elimination pass 
(compare-elim.c)
assumes that the COMPARE set must be in the second position for it to do the 
transformations it wants.
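(For illustration, the ordering that compare-elim expects would be the same
pattern with the two sets swapped, roughly like the following sketch; this is
not an actual aarch64 pattern, just the swapped form of the one above:)

(define_insn "add<mode>3_compare0"
  [(set (match_operand:GPI 0 "register_operand" "=r,r,r")
        (plus:GPI (match_operand:GPI 1 "register_operand" "%r,r,r")
                  (match_operand:GPI 2 "aarch64_plus_operand" "r,I,J")))
   (set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ (plus:GPI (match_dup 1) (match_dup 2))
                       (const_int 0)))]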

Is there a recommended order for specifying the compare and the arithmetic 
operation in the MD files?
(in which case we should go through the aarch64 MD files and make sure the 
patterns are written the right
way round). Or is the compare-elimination pass just not robust enough? (In 
which case we should teach it
to look into both SETs of the pattern).

Please see http://gcc.gnu.org/ml/gcc-patches/2014-12/msg00584.html and
surrounding thread.



Thanks, that is helpful.
It seems to me that teaching cmpelim to handle either order would not be a very 
complicated task.
Would folks object to making such a change?

Kyrill


Jakub




cond_exec no-ops in RTL optimisations

2013-03-26 Thread Kyrylo Tkachov
Hi everyone,

While working with some splitters I noticed that the RTL optimisation
passes do not optimise away a no-op wrapped in a cond_exec.

So for example, if my splitter generates something like:
(cond_exec (lt:SI (reg:CC CC_REGNUM) (const_int 0))
   (set (match_dup 1)
(match_dup 2)))

and operand 1 and 2 are the same register (say r0), this persists through
all the optimisation passes and results on ARM in a redundant
 movlt r0, r0

I noticed that if I generate an unconditional SET it gets optimised away in
the cases when it's a no-op.

I can work around this by introducing a peephole2 that lifts the SET out of
the cond_exec like so:

(define_peephole2
  [(cond_exec (match_operator 0 "comparison_operator"
                [(reg:CC CC_REGNUM) (const_int 0)])
              (set (match_operand:SI 1 "register_operand" "")
                   (match_dup 1)))]
  ""
  [(set (match_dup 1) (match_dup 1))])

and the optimisers will catch it and remove it but this seems like a hack.
 What if it was a redundant ADD (with 0) or an AND (and r0, r0, r0)?
Doesn't seem right to add peepholes for each of those cases.
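(The AND case, for instance, would look something like this in the RTL stream,
with the hard register notation just for illustration:)

(cond_exec (lt:SI (reg:CC CC_REGNUM) (const_int 0))
           (set (reg:SI 0 r0)
                (and:SI (reg:SI 0 r0) (reg:SI 0 r0))))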

Is that something the RTL optimisers should be able to remove?

Are there any targets where a conditional no-op may not be removed?

Thanks,
Kyrill





RE: Recognizing loop pattern

2020-10-26 Thread Kyrylo Tkachov via Gcc



> -Original Message-
> From: Gcc  On Behalf Of Stefan Schulze
> Frielinghaus via Gcc
> Sent: 26 October 2020 09:58
> To: gcc@gcc.gnu.org
> Subject: Recognizing loop pattern
> 
> I'm trying to detect loops of the form
> 
>   while (*x != y)
> ++x;
> 
> which mimic the behaviour of function rawmemchr.  Note, the size of *x is
> not
> necessarily one byte.  Thus ultimately I would like to detect such loops and
> replace them with calls to builtins rawmemchr8, rawmemchr16,
> rawmemchr32 if
> they are implemented in the backend:
> 
>   x = __builtin_rawmemchr16(x, y);
> 
> I'm wondering whether there is a particular place in order to detect such
> loop
> patterns.  For example, in the loop distribution pass GCC recognizes loops
> which mimic the behavior of memset, memcpy, memmove and replaces
> them with
> calls to their corresponding builtins, respectively.  The pass and in
> particular partitioning of statements depends on whether a statement is used
> outside of a partition or not.  This works perfectly fine for loops which
> implement the mentioned mem* operations since their result is typically
> ignored.  However, the result of a rawmemchr function is/should never be
> ignored.  Therefore, such loops are currently recognized as having a
> reduction
> which makes an implementation into the loop distribution pass not straight
> forward to me.
> 
> Are there other places where you would detect such loops?  Any comments?

Not an expert on these, but GCC has some similar stuff in 
tree-scalar-evolution.c for detecting popcount loops etc.
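The kind of loop idiom recognized there is roughly the following (from memory,
so treat it just as an illustration):

int
popcount (unsigned long x)
{
  int c = 0;
  while (x)
    {
      x &= x - 1;   /* clears the lowest set bit each iteration */
      c++;
    }
  return c;
}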

Thanks,
Kyrill

> 
> Cheers,
> Stefan


RE: How about providing an interface to fusing instructions via scheduling

2021-09-03 Thread Kyrylo Tkachov via Gcc
Hi,

> -Original Message-
> From: Gcc  On Behalf
> Of gengqi via Gcc
> Sent: 03 September 2021 11:56
> To: gcc@gcc.gnu.org
> Subject: How about providing an interface to fusing instructions via
> scheduling
> 
> When I was adding pipeline to my backend, some instructions needed to be
> fused and I found that there was no suitable interface to implement my
> requirements.
> 
> 
> 
> My hope is that
> 
> 1. Do instruction scheduling and combine any two instructions, and
> sometimes
> the two instructions can be treated as 1 when they are issued
> 
> 2. The two instructions only work better when they are immediately adjacent
> to each other
> 
> 3. An instruction can only be fused once, i.e. if the current instruction
> has been fused with the previous one, the next one cannot be fused with the
> current one.
> 
> 
> 
> I have referred to numerous interfaces in the “GCC INTERNALS” which
> implement some of my requirements, but none of which completely covers
> my needs.

Indeed, there are a few places in GCC that help, but not a clean catch-all 
solution.

> 
> 
> 
> These interfaces are:
> 
> -  bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx_insn *prev, rtx_insn
> *curr)
> 
> The name of the interface looks a lot like what I need. But in reality I
> found that this interface only fuses instructions that are already adjacent
> to each other and does not do scheduling (not satisfy 1). And this interface
> may fuse 3 or more instructions (not satisfy 3).

Indeed, this interface ensures that instructions that are already adjacent are 
kept together, but doesn't bring them together from far away.

> 
> 
> 
> -  void TARGET_SCHED_FUSION_PRIORITY (rtx_insn *insn, int max_pri, int
> *fusion_pri, int *pri)
> 
> This interface is very powerful, but with only one insn being processed at a
> time, this interface does not seem to be suitable for context sensitive
> situations.
> 

This is likely more appropriate for your needs. You may want to look in the 
implementation of this (and related) hook in the aarch64 backend.
We use it there to bring certain loads and stores together with the intent to 
form special load/store-pair instructions.
The scheduler brings the insns together, but we rely on post-scheduling 
peepholes to actually combine the two into a single instruction.
Although there are a few cases where it misses opportunities, it works pretty 
well.
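Very roughly, the shape of such a hook is something like this (purely
illustrative, not the actual aarch64 code; the two helper functions are
made up):

static void
example_sched_fusion_priority (rtx_insn *insn, int max_pri,
                               int *fusion_pri, int *pri)
{
  /* By default the insn is not a fusion candidate.  */
  *fusion_pri = max_pri;
  *pri = max_pri;

  if (!example_fusible_load_p (insn))   /* made-up predicate */
    return;

  /* Every candidate gets the same fusion priority so the scheduler
     treats them as one group...  */
  *fusion_pri = max_pri - 1;

  /* ...and the ordinary priority orders candidates within the group,
     e.g. by base register and offset, so that loads which can form a
     pair end up next to each other in the ready list.  */
  *pri = max_pri - 1 - example_load_offset_rank (insn);   /* made up */
}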

Thanks,
Kyrill

> 
> 
> -  Use (define_bypass number out_insn_names in_insn_names [guard])
> 
> The “bypass” does not guarantee that the instruction being dispatched is
> immediately adjacent to (not satisfy 2). Moreover, bypass only handles
> instructions with true dependence.
> 
> 
> 
> -  int TARGET_SCHED_REORDER (FILE *file, int verbose, rtx_insn **ready,
> int *n_readyp, int clock) and TARGET_SCHED_REORDER2()
> 
> This interface allows free adjustment of ready instructions, but it is not
> easy to get the last scheduled instruction. The last scheduled instruction
> needs to be taken into account for fusion.
> 
> 
> 
> -  Use define_peephole2
> 
> Since the fused instructions are somehow identical to one instruction, it is
> thought that a peephole might be a good choice. But “define_peephole2”
> also does not schedule instructions.
> 
> 
> 
> In summary, I have not found an interface that does both scheduling and
> fusion. Maybe we should enhance one of the above interfaces, or maybe we
> should provide a new one. I think it is necessary and beneficial to have an
> interface that does both scheduling and fusion.



Ordering function layout for locality follow-up

2024-09-17 Thread Kyrylo Tkachov via Gcc
Hello,

Thanks to those who attended the IPA/LTO BoF at GNU Cauldron over the weekend 
and gave us feedback on teaching GCC to optimize for layout locality in the 
callgraph.
I’d like to follow up on the previous work in the area that Honza mentioned to 
see if we can reuse some of it or follow its best practices.
Could you give us some pointers on the previous attempts?
Also, I think richi suggested having a second WPA phase that does the layout 
separately from the partitioning.
Is that something we decided was a good idea?

Thanks again for the great discussion!
Kyrill

Re: C Standard Libraries

2024-10-15 Thread Kyrylo Tkachov via Gcc



> On 15 Oct 2024, at 18:09, Bryon Quackenbush via Gcc  wrote:
> 
> Does anyone know where in the GCC hierarchy that I can find implementation
> code for standard C library functions like fgetc / fputs, etc, or would
> that be outside the scope of GCC?  I've been hunting around on the source
> tree for the last few days and found the headers, but not the
> implementation.

These are implemented in the runtime C library (libc), which is outside the 
scope of GCC itself.
Popular libraries you can look at are glibc and musl.
Thanks,
Kyrill


> 
> Thanks for the help.
> 
> - Bryon



Re: Christophe Lyon as MVE reviewer for the AArch32 (arm) port.

2024-09-30 Thread Kyrylo Tkachov via Gcc


> On 26 Sep 2024, at 19:22, Ramana Radhakrishnan  
> wrote:
> 
> I am pleased to announce that the GCC Steering Committee has appointed
> Christophe Lyon as a MVE Reviewer for the AArch32 port.
> 
> Please join me in congratulating Christophe on his new role.
> 

Congratulations Christophe, well deserved!
Kyrill

> Christophe, please update your listings in the MAINTAINERS file.
> 
> Regards,
> Ramana



Re: [RFC][AArch64] Defining lrotm3 optabs for SVE modes for TARGET_SVE2?

2024-10-21 Thread Kyrylo Tkachov via Gcc


> On 18 Oct 2024, at 19:46, Richard Sandiford  wrote:
> 
> Kyrylo Tkachov  writes:
>> Hello,
>> 
>> I’ve been optimizing various code sequences relating to vector rotates 
>> recently.
>> I ended up proposing we expand the vector-rotate-by-immediate optab rotlm3 
>> for
>> the Advanced SIMD (Neon) modes here:
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665635.html
>> This expands to a ROTATE RTL code that can be later combined into more 
>> complex
>> instructions like XAR and for certain rotate amounts can be optimized in a 
>> single instruction.
>> If they fail to be optimized then a splitter breaks it down into an SHL + 
>> USRA pair.
>> 
>> For SVE, because we have predicates in the general case it’s not feasible to 
>> detect
>> these rotates at the RTL level, so I was hoping that GIMPLE could do it, and 
>> indeed
>> GIMPLE has many places where it can detect rotate idioms: forwprop1, bswap 
>> detection,
>> pattern matching in the vectorizer, match.pd for simple cases etc.
>> The vectorizer is probably a good place to do it (rather than asking the 
>> other places to deal
>> with VLA types) but I think it would need the target to affirm that it 
>> supports SVE vector rotates
>> through the lrotm3 optab, hence my question. 
>> 
>> Though some rotate amounts can be implemented with a single instruction 
>> (REVB, REVH, REVW),
>> the fallback expansion for TARGET_SVE2 would be a two-instruction LSL+USRA 
>> which is better than
>> what we currently emit in the motivating test case:
>> https://godbolt.org/z/o55or8hYv
>> We currently cannot combine the LSL+LSR+ORR sequence because the predicates 
>> get in the way during
>> combine (even though the instructions involved are actually unpredicated and 
>> the predicate would get
>> dropped later anyway).
>> It would also allow us to keep an RTL-level ROTATE long enough to combine it 
>> into the XAR and RAX
>> instructions from TARGET_SVE2_SHA3.
>> 
>> Finally, it would allow us to experiment with more optimal SVE-specific 
>> rotate sequences in the future.
>> For example, we could consider emitting high-throughput TBLs for rotates 
>> that are a multiple of 8.
>> 
>> I’m suggesting doing this for TARGET_SVE2 as we have the combined USRA 
>> instruction there,
>> but I wouldn’t object doing this for TARGET_SVE.
> 
> I think there are three cases here:
> 
> (1) Using permutes for rotates.  That part on its own could be a
>target-independent optimisation.  I imagine other targets without
>native rotate support would benefit.

It seems to me that this is something to be done at (generic) expand-time?
Or do you think it’s something the vectorizer should be doing during its 
detection of rotates?
I suppose it’s easiest for the vectorizer to generate the IR for it but 
expand-time may be a better place
to query the target given there are nuances in the selection (see below)...


> 
> (2) Encouraging the use of XAR.  I suppose the question here is:
>is XAR so good that can we consider using it instead of LSL/USRA
>even when the XOR part isn't needed?  That is, when XAR is available,
>one way of implementing the rotate optab would be to zero the
>destination register (hopefully free) and then use XAR itself as
>the rotate instruction.
> 
>If that's a win, then defining the optab like that sounds good.
> 
>If it's not a win, then we could end up being too aggressive about
>forming XAR in general, since XORs fold with other things too.

I think using XAR to implement rotates is a good idea, but a number of 
complexities come to mind (and these seem to apply to Advanced SIMD too) that 
make it a less universal solution than I’d hoped:
* Advanced SIMD XAR is only available for SHA3 sub-targets, we may need a 
separate code path for non-SHA3 codegen.
* Whether the XAR scheme is a win would depend on the CPU being targeted. From 
the optimization guides that I’ve checked the latency of XAR is always good (2 
cycles) but the throughput varies.
On Neoverse V2 it is 1, and so would lose out to a vector permute 
implementation (throughput 4) unless it’s also used to combine away a XOR. On 
Neoverse V3 however the throughput is the maximum 4
and so it would be the preferred way of doing vector rotates in general.
* XAR on Advanced SIMD only supports V2DImode operands. The SVE2 version of XAR 
supports all widths. That means for TARGET_SVE2 we can use the SVE XAR 
instruction even for Neon modes, but we’d still need a reasonable fallback for 
!TARGET_SVE2 rotates.
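(For reference, the identity relied on in (2) is just that XAR is an EOR
followed by a rotate right by immediate, so with a zeroed second operand it
degenerates into a plain rotate. A scalar model, purely for illustration:)

#include <stdint.h>

static inline uint64_t
rotr64 (uint64_t a, unsigned r)
{
  return (a >> r) | (a << ((64 - r) & 63));
}

static inline uint64_t
xar64 (uint64_t a, uint64_t b, unsigned r)
{
  return rotr64 (a ^ b, r);   /* XOR, then rotate right by r.  */
}

/* xar64 (x, 0, r) == rotr64 (x, r) for every x and r, so a zeroed second
   operand turns XAR into a plain rotate.  */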

Sorry, the above is a bit more Advanced SIMD-specific than I had original

RFC: IPA/LTO: Ordering functions for locality

2024-11-05 Thread Kyrylo Tkachov via Gcc
Hi all,

I'd like to continue the discussion on teaching GCC to optimise code layout
for locality between callees and callers. This is work that we've been doing
at NVIDIA, primarily Prachi Godbole (CC'ed) and myself.
This is a follow-up to the discussion we had at GNU Cauldron at the IPA/LTO
BoF [1]. We're pretty far along in some implementation aspects, but some
implementation and evaluation areas have questions that we'd like advice on.

Goals and motivation:
For some CPUs it is beneficial to minimise the branch distance between
frequently called functions. This effect is more pronounced for large
applications, sometimes composed of multiple APIs/modules where each module
has deep call chains that form a sort of callgraph cluster. The effect is more
pronounced for multi-DSO applications but solving this cross-DSO is out of
scope for this work. We see value in having GCC minimising the branching
distances within even single applications.

Design:
To perform this the compiler needs to see as much of the callgraph as possible
so naturally this should be performed during LTO. Profile data from PGO is
needed to determine the hot caller/calle relationships so we require that as
well. However, it is possible that static non-PGO heuristics could do a
good-enough job in many cases, but we haven't experimented with them
extensively thus far. The optimisation performs two things:
1) It partitions the callgraph into clusters based on the caller/callee hotness
and groups the functions within those clusters together.
2) For functions in the callgraph that cross cluster boundaries we perform
cloning so that each clone can be grouped close to its cluster for locality.
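As a tiny illustration of 2) (all names are made up):

/* helper() is hot from call chains in two different clusters.  */
static int helper (int x) { return x * 41 + 1; }

int cluster_a_entry (int x) { return helper (x) + 2; }   /* cluster A */
int cluster_b_entry (int x) { return helper (x) * 3; }   /* cluster B */

/* After locality cloning, conceptually, each cluster gets its own copy:
     cluster A layout: cluster_a_entry, helper.locality.0
     cluster B layout: cluster_b_entry, helper.locality.1
   so both hot calls stay short-range.  */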

Implementation:
The partitioning 1) is done at the LTO partitioning stage through a new option
to -flto-partition. We add -flto-partition=locality. At the Cauldron Richi
suggested that maybe we could have a separated dedicated
clustering/partitioning pass for this, I'd like to check whether that's indeed
the direction we want to take.
The cloning 2) is done separately in an IPA pass we're calling "IPA locality
cloning". This is currently run after pass_ipa_inline and before
pass_ipa_pure_const. We found that trying to do both partitioning and cloning
in the same pass hit all kinds of asserts about function summaries not being
valid.

Remaining TODOs, issues:
* For testing we added a bootstrap-lto-locality configuration that enables this
optimisation for GCC bootstrap. Currently the code bootstraps successfully
with LTO bootstrap and profiledbootstrap with the locality partitioning and
cloning enabled. This gives us confidence that nothing is catastrophically
wrong with the code.

* The bulk of the work was developed against GCC 12 because the motivating use
case is a large internal workload that only supported GCC 12. We've rebased
the work to GCC trunk and updated the code to bootstrap and test there, but
we'd appreciate the usual code review that it uses the appropriate GCC 15 APIs.
We're open to ideas about integrating these optimisations with existing passes
to avoid duplication where possible.

* Thanks to Honza for pointing out previous work in this area by Martin Liska 
[2]
that proposes a -freorder-functions-algorithm=call-chain-clustering option.
This looks like good work that we'd be interested in seeing go in. We haven't
evaluated it yet ourselves but one thing it's missing is the cloning from our
approach. Also the patch seems to rely on a .text.sorted. section in the
linker. Is that because we're worried about the linker doing further
reordering of functions that invalidates this optimisation? Could we do this
optimisation without the linker section? We're currently viewing it as an
orthogonal optimisation that should be pursued on its own, but are interested
in other ideas.

* The size of the clusters depends on the microarchitecture and I think we'd
want to control its size through something like a param that target code can
set. We currently have a number of params that we added that control various
aggressiveness settings around cluster size and cloning. We would want to
have sensible defaults or deduce them from code analysis if possible.

* In the absence of PGO data we're interested in developing some static
heuristics to guide this. One area where we'd like advice is how to detect
functions that have been instantiated from the same template, as we find that
they are usually the kind of functions that we want to keep together.
We are exploring a few options here and if we find something that works we’ll
propose them.

* Our prototype gives measurable differences in the large internal app that
motivated this work. We will be performing more benchmarking on workloads that
we can share with the community, but generally the idea of laying out code to
maximise locality is now an established Good Thing (TM) in toolchains given the
research from Facebook that Martin quotes [2] and the invention of tools like
BOLT. So I'm hoping the motiva

[PATCH] PR target/117449: Restrict vector rotate match and split to pre-reload

2024-11-05 Thread Kyrylo Tkachov via Gcc
Hi all,

The vector rotate splitter has some logic to deal with post-reload splitting
but not all cases in aarch64_emit_opt_vec_rotate are post-reload-safe.
In particular the ROTATE+XOR expansion for TARGET_SHA3 can create RTL that
can later be simplified to a simple ROTATE post-reload, which would then
match the insn again and try to split it.
So do a clean split pre-reload and avoid going down this path post-reload
by restricting the insn_and_split to can_create_pseudo_p ().

Bootstrapped and tested on aarch64-none-linux.
Pushing to trunk.
Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov 
gcc/

PR target/117449
* config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm<mode>):
Match only when can_create_pseudo_p ().
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Assume
can_create_pseudo_p ().

gcc/testsuite/

PR target/117449
* gcc.c-torture/compile/pr117449.c: New test.



0001-PR-target-117449-Restrict-vector-rotate-match-and-sp.patch


[RFC][AArch64] Defining lrotm3 optabs for SVE modes for TARGET_SVE2?

2024-10-17 Thread Kyrylo Tkachov via Gcc
Hello,

I’ve been optimizing various code sequences relating to vector rotates recently.
I ended up proposing we expand the vector-rotate-by-immediate optab rotlm3 for
the Advanced SIMD (Neon) modes here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665635.html
This expands to a ROTATE RTL code that can be later combined into more complex
instructions like XAR and for certain rotate amounts can be optimized in a 
single instruction.
If they fail to be optimized then a splitter breaks it down into an SHL + USRA 
pair.

For SVE, because we have predicates in the general case it’s not feasible to 
detect
these rotates at the RTL level, so I was hoping that GIMPLE could do it, and 
indeed
GIMPLE has many places where it can detect rotate idioms: forwprop1, bswap 
detection,
pattern matching in the vectorizer, match.pd for simple cases etc.
The vectorizer is probably a good place to do it (rather than asking the other 
places to deal
with VLA types) but I think it would need the target to affirm that it supports 
SVE vector rotates
through the lrotm3 optab, hence my question. 

Though some rotate amounts can be implemented with a single instruction (REVB, 
REVH, REVW),
the fallback expansion for TARGET_SVE2 would be a two-instruction LSL+USRA 
which is better than
what we currently emit in the motivating test case:
https://godbolt.org/z/o55or8hYv
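(The kind of loop in question is roughly the following; a sketch, not the
exact testcase:)

void
rotate_left_16 (unsigned int *x, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = (x[i] << 16) | (x[i] >> 16);   /* rotate each element by 16 */
}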
We currently cannot combine the LSL+LSR+ORR sequence because the predicates get 
in the way during
combine (even though the instructions involved are actually unpredicated and 
the predicate would get
dropped later anyway).
It would also allow us to keep an RTL-level ROTATE long enough to combine it 
into the XAR and RAX
instructions from TARGET_SVE2_SHA3.

Finally, it would allow us to experiment with more optimal SVE-specific rotate 
sequences in the future.
For example, we could consider emitting high-throughput TBLs for rotates that 
are a multiple of 8.

I’m suggesting doing this for TARGET_SVE2 as we have the combined USRA 
instruction there,
but I wouldn’t object doing this for TARGET_SVE.

Thanks,
Kyrill

Re: Review for gcc-15/changes.html

2025-05-01 Thread Kyrylo Tkachov via Gcc
Hi Heiko,

Thanks for doing this...

> On 30 Apr 2025, at 18:53, Richard Earnshaw (lists) via Gcc  
> wrote:
> 
> On 30/04/2025 17:23, Heiko Eißfeldt wrote:
>> Hi,
>> 
>> here is a patch for some mostly minor typos in 
>> https://gcc.gnu.org/gcc-15/changes.html.
>> My fixes might be wrong of course, so they are just suggestions.
>> 
>> Also, the linked page https://gcc.gnu.org/gcc-15/porting_to.html contains 
>> the now outdated
>> "Note: GCC 15 has not been released yet, so this document is a 
>> work-in-progress."
>> which is luckily not true anymore.
>> 
>> Greetings and thanks, Heiko
>> 
> 
> -The diagnostics code has seen a major refactor, it now supports the sarif
> +The diagnostics code has seen a major refactoring, it now supports the 
> sarif
> 
> I think 'has seen major refactoring' would be better.
> 
> -DerefMut. This makes gccrs more correct and allow 
> to handle
> +DerefMut. This makes gccrs more correct and allows 
> to handle
> complicated cases where the type-checker would previously fail.
> 
> allows handling
> 
> -  Although most variadic functions work, the implementation
> -of them is not yet complete.
> +  Although most variadic functions work, the implementations
> +of them are not yet complete.
> 
> Drop 'of them', then stick with the singular - 'the implementation is not yet 
> complete'
> 
> -  FEAT_LRCPC2 (+rcpc2), enabled by default for
> +  FEAT_RCPC2 (+rcpc2), enabled by default for
> 
> and
> 
> -FEAT_LRCPC3 instructions, when support for the instructions is
> +FEAT_RCPC3 instructions, when support for the instructions is
> 
> These are incorrect.  The features really are FEAT_LRCPC2/3.
> 
> Otherwise, I think these look generally like improvements.
> 

… If you’re fixing typos in this area there’s another one in the AArch64 
section:

The following architecture level is now supported by -march and related 
source-level constructs (GCC identifiers in parentheses):
• Armv9.5-A (arm9.5-a)

The identifier should be “armv9.5-a”

Thanks,
Kyrill


> R.
> 
> 



Re: aarch64 built-in SIMD types

2025-02-25 Thread Kyrylo Tkachov via Gcc
Hi Tom,

> On 24 Feb 2025, at 20:40, Tom Kacvinsky via Gcc  wrote:
> 
> Hi all,
> 
> I am trying to find where the aarch64 SIMD built in types are defined in
> GCC.
> For instance, __Int8x8_t.  I see some code in gcc/config/aarch64 for these,
> but
> then it goes deeper into internals of gcc that I don't quite follow.
> 
> Any help pointing to where I should look would be appreciated.
> 

The logic for defining them is in aarch64-builtins.cc. The information about 
them is tracked in aarch64_simd_type_info structs.
There’s some preprocessor machinery at work so it may not be obvious how it 
works from a first read.
There are user-level typedefs in arm_private_neon_types.h, which is later 
included by standard ACLE headers like arm_neon.h and arm_sve.h.
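Very roughly, the user-visible Advanced SIMD types end up as typedefs of the
built-in ones (simplified sketch, compiles only with an aarch64 compiler):

/* The compiler pre-defines the __Int8x8_t built-in type, and arm_neon.h
   exposes it under its ACLE name.  */
typedef __Int8x8_t int8x8_t;

int8x8_t
identity (int8x8_t x)
{
  return x;   /* passed and returned in a 64-bit SIMD register */
}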
What particular information are you looking for?

Thanks,
Kyrill


> Thanks,
> 
> Tom



Re: GCC used to store pointers in FP registers on aarch64

2025-02-24 Thread Kyrylo Tkachov via Gcc
Hi Attila,

> On 24 Feb 2025, at 10:46, Attila Szegedi via Gcc  wrote:
> 
> Hi folks,
> 
> I'm looking for a bit of a historic context for a fun GCC behavior we
> stumbled across. For... reasons we build some of our binaries using an
> older version of GCC (8.3.1, yes, we'll be upgrading soon, and no, this
> message is not about helping with an ancient version :-) )
> 
> We noticed that this version of GCC compiling on aarch64 will happily use
> FP registers to temporarily store/load pointers, so there'd be "fmov d9,
> x1" to store a pointer, and then later when it's used as a parameter to a
> function call we'll see "fmov x1, d9" etc. We noticed this while
> investigating some crashes that seemed to always occur in functions called
> with parameters loaded through this mechanism, on certain specific models
> of aarch64 CPUs. On the face of it, this doesn't seem a _too_ terrible idea
> – one'd think that a FP register should preserve the bit pattern so as long
> as the only operations are stores and loads, what's the harm, right? Hey,
> more free registers! Except, on some silicon, it's unfortunately strongly
> correlated with crashes further down the callee chain.
> 
> Further proving the theory is that after we did some judicious application
> of __attribute__((target("general-regs-only"))) to offending functions to
> discourage the compiler from the practice, the crashes were gone.
> Unfortunately, it sometimes required contorting the code to move any
> implied uses of FP out of the way (heck, an inlined std::map constructor
> requires FP operations 'cause of its load factor!)
> 
> I also noticed that a more modern version of GCC (e.g. 12.x) does not seem
> to emit such code anymore (thus also eliminating the problem.) Curiously, I
> couldn't wrangle a good enough Google search term to find anything about
> what brought about the change – a discussion, a blog post, anything. I
> wanted to know if the practice of stashing pointers in FP registers indeed
> proved to be dangerous and was thus deliberately abandoned, or is it maybe
> just a byproduct of some other change.
> 
> If someone knows more about this, I'd be very curious to hear about it.
> It'd be great to know that this was an explicitly eliminated behavior so we
> can rest assured that by using a newer version of GCC we will not get
> bitten by it again.
> 

I’d say it was just a side-effect of various optimization decisions. GCC may 
still decide to move things between the FP and GP regs instead of the stack; 
it's really a matter of CPU-specific costs.
I haven’t heard of such an issue like you describe before.
Generally, the base AArch64 ABI assumes the presence of FP+SIMD registers.
-mgeneral-regs-only and the general-regs-only attribute can be used if you know 
what you’re doing in a software stack that you control, but it’s probably just 
a workaround for what seems to be a hardware issue you’re facing.
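For reference, the attribute-based workaround you describe looks roughly like
this (illustrative only):

/* Keep this function to general-purpose registers only, so the compiler
   will not stash pointers (or anything else) in FP/SIMD registers here.
   Any floating-point or SIMD code in such a function becomes an error.  */
__attribute__ ((target ("general-regs-only")))
void *
forward (void *p, void *(*fn) (void *))
{
  return fn (p);
}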

Thanks,
Kyrill 

> Thanks,
>  Attila.