analyzer: Weekly update and questions on extending c++ support.

2023-07-31 Thread Benjamin Priour via Gcc
Hi David,

(subject line totally stolen inspired from Eric's)
Last week, as you know, had some special circumstances, so not much has
been done.

I've changed my in-progress patch about trimming the system headers event
to fit your suggestions from
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625245.html.
I had suggested that we keep events (1) (2) (11) and (12) in the reproducer
https://godbolt.org/z/sb9dM9Gqa
but after comparing it to your first suggestion of only keeping (11), I
don't know which one to keep.
Just (11) has the benefit of being really concise and still informative enough
in this case, mimicking the behavior of a raw pointer.
However having the call site and return site might be helpful for the user,
e.g. to understand what template arguments were deduced.

> I think we could still have events (11) and (12): what if for call and
> return events we consider the location of both the call site and the
> called function: we suppress if both are in a system header, but don't
> suppress if only one is a system header.  That way by default we'd show
> the call into a system header and show the return from the system
> header, but we'd suppress all the implementation details within the
> system header.
>
> Does that seem like it could work?


It works and is one of the two implementations I have.
Now we just have to decide which is better, and I'll submit the patch to
the mailing list.

Some updates, too, on the added support for placement new.

What do you think of "null-dereference" superseding
"use-of-uninitialized-value"?
Both can occur at the same statement (see
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625844.html).
If you think adding this supercedes_p to null-deref is acceptable, then
I'll fix PR 110830
before submitting the placement new patch.


Otherwise, I've been working on what we talked about last time, i.e. moving
the tests under gcc.dg/analyzer
to c-c++-common/analyzer. It is still in progress (about 100 tests done out
of 700+).

A question on that front:
What is the purpose of the test gcc.dg/analyzer/data-model-11.c?

int test (void)
{
  unsigned char *s = (unsigned char *) "abc";
  char *t = "xyz";
  return s[1] + t[1];
}

Is the non-constness of the strings being tested, or only their signedness?

Thanks
Benjamin.


Re: Calling convention for Intel APX extension

2023-07-31 Thread Michael Matz via Gcc
Hello,

On Sun, 30 Jul 2023, Thomas Koenig wrote:

> > I've recently submitted a patch that adds some attributes that basically
> > say "these-and-those regs aren't clobbered by this function" (I did them
> > for not clobbered xmm8-15).  Something similar could be used for the new
> > GPRs as well.  Then it would be a matter of ensuring that the interesting
> > functions are marked with that attributes (and then of course do the
> > necessary call-save/restore).
> 
> Interesting.
> 
> Taking this a bit further: The compiler knows which registers it used
> (and which ones might get clobbered by called functions) and could
> generate such information automatically and embed it in the assembly
> file, and the assembler could, in turn, put it into the object file.
> 
> A linker (or LTO) could then check this and elide save/restore pairs
> where they are not needed.

LTO with interprocedural register allocation (-fipa-ra) already does this.  

Doing it without LTO is possible to implement in the way you suggest, but 
is very hard to get effective: the problem is that saving/restoring of 
registers might be scheduled in non-trivial ways and getting rid of 
instruction bytes within function bodies at link time is fairly 
non-trivial: it needs excessive meta-information to be effective (e.g. all 
jumps that potentially cross the removed bytes must get relocations).

So you either limit the ways prologues and epilogues are emitted in order to
help the linker (thereby limiting the effectiveness of the unchanged xlogues),
or you emit more meta-info than the instruction bytes themselves, bloating
object files for dubious outcomes.

> It would probably be impossible for calls into shared libraries, since
> the saved registers might change from version to version.

The above scheme could be extended to also allow introducing stubs 
(wrappers) for shared lib functions, handled by the dynamic loader.  But 
then you would get hard problems to solve related to function addresses 
and their uniqueness.

> Still, potential gains could be substantial, and it could have an
> effect which could come close to inlining, while actually saving space
> instead of using extra.
> 
> Comments?

I think it would be an interesting experiment to implement such a scheme 
fully, just to see how effective it would be in practice.  But it's very 
non-trivial to do, and my guess is that it won't be super effective.  So it 
could be a typical research-paper topic :-)

At least outside of extreme cases like the SSE regs, where none are 
callee-saved, and which can be handled in a different way like the 
explicit attributes.


Ciao,
Michael.


[Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-07-31 Thread Changbin Du via Gcc
Hello, folks.
This is to discuss GCC's heuristic strategy for choosing between predicated
instructions and branches; probably something needs to be improved.

[The story]
Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
The program is nothing special, just random code I found on the internet. You
can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (My GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c

$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o 
huffman.pgo huffman.c

Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no ht, no turbo, performance mode):

                                  O2              O3              PGO
cycles                            2,581,832,749   8,638,401,568   9,394,200,585
                                  (1.07s)         (3.49s)         (3.80s)
instructions                      12,609,600,094  11,827,675,782  12,036,010,638
branches                          2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                     0.00%           7.94%           8.84%
cache-misses                      3,012,613       3,055,722       3,076,316
L1-icache-load-misses             11,416,391      12,112,703      11,896,077
icache_tag.stalls                 1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit              6,856           21,756          22,600
itlb_misses.walk_completed        14,430          4,454           15,084
baclears.any                      131,573         140,355         131,644
int_misc.clear_resteer_cycles     2,545,915       586,578,125     679,021,993
machine_clears.count              22,235          39,671          37,307
dsb2mite_switches.penalty_cycles  6,985,838       12,929,675      8,405,493
frontend_retired.any_dsb_miss     28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                     207,881,687     216,734,007     212,003,064


The above data shows:
  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2.
  o O3/PGO reduced the instruction count by 6.2% and 4.5% respectively. I
attribute this to aggressive inlining.
  o O3/PGO introduced very bad branch prediction. I will explain this later.
  o Code built with O3 has high iTLB misses but much lower sTLB misses. This is
beyond my expectation.
  o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
I don't know why. (The machine-clear subcategories are not measured yet.)
  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
  o The idq.mite_uops of O3/PGO increased only 4%, while idq.dsb_uops increased
2x. The DSB hit rate is good, so frontend fetching and decoding is not the
problem for O3/PGO.
  o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level-2 analysis: the main changes in the pipeline
slots are in the Bad Speculation and Frontend Bound categories. I doubt the
accuracy of tma_fetch_bandwidth given the frontend_retired.any_dsb_miss and
idq.mite_uops data above.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman 
test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

  tma_branch_mispredicts   0.0%    tma_core_bound        0.8%
  tma_heavy_operations    11.4%    tma_light_operations 76.8%
  tma_memory_bound         2.0%    tma_fetch_bandwidth   8.3%
  tma_fetch_latency        0.8%    tma_machine_clears    0.0%

   1.073381357 seconds time elapsed

   0.945233000 seconds user
   0.095719000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 
test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

  tma_branch_mispredicts  38.2%    tma_core_bound        6.6%
  tma_heavy_operations     3.5%    tma_light_operations 21.7%
  tma_memory_bound         0.9%    tma_fetch_bandwidth  20.9%
  tma_fetch_latency        7.5%    tma_machine_clears    0.8%

   3.501875873 seconds time elapsed

   3.378572000 seconds user
   0.084163000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo 
test.data
test.data.huf is 1.

Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-07-31 Thread Richard Biener via Gcc
On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc  wrote:
>
> Hello, folks.
> This is to discuss Gcc's heuristic strategy about Predicated Instructions and
> Branches. And probably something needs to be improved.
> [...]
Re: analyzer: Weekly update and questions on extending c++ support.

2023-07-31 Thread David Malcolm via Gcc
On Mon, 2023-07-31 at 14:19 +0200, Benjamin Priour wrote:
> Hi David,

Hi Benjamin

> 
> (subject line totally stolen inspired from Eric's)

:)

> Last week, as you know, had some special circumstances, so not much has
> been done.

FWIW you still seem to have got a lot done.

> 
> I've changed my in-progress patch about trimming the system headers
> event
> to fit your suggestions from
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625245.html.
> I had suggested that we keep events (1) (2) (11) and (12) in the
> reproducer
> https://godbolt.org/z/sb9dM9Gqa
> but after comparing it to your first suggestion of only keeping (11),
> I
> don't know which one to keep.
> Just (11) has the benefit of being really concise and still informative
> enough in this case, mimicking the behavior of a raw pointer.
> However having the call site and return site might be helpful for the
> user,
> e.g. to understand what template arguments were deduced.
> 
> > I think we could still have events (11) and (12): what if for call and
> > return events we consider the location of both the call site and
> > the
> > called function: we suppress if both are in a system header, but
> > don't
> > suppress if only one is a system header.  That way by default we'd
> > show
> > the call into a system header and show the return from the system
> > header, but we'd suppress all the implementation details within the
> > system header.
> > 
> > Does that seem like it could work?
> 
> 
> It works and is one of the two implementations I have.
> Now we just have to decide which is better, and I'll submit the patch to
> the mailing list.

Great.  Are patches still taking a long time to test?  If so, feel free
to post the patches before the testing is done (but acknowledge that in
the blurb of the patch).  Do you need an account on the compiler farm?

> 
> Some updates too on the added support of placement new
> 
> What do you think of "null-dereference" superseding
> "use-of-uninitialized-value"?
> Both can occur at the same statement (see
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625844.html)
> If you think adding this supercedes_p to null-deref is acceptable,
> then
> I'll fix PR 110830
> 
> before submitting the placement new patch.

I think I covered this in my reply to your other email.

> 
> 
> Otherwise, I've been working on what we talked about last time, i.e.
> moving
> the tests under gcc.dg/analyzer
> to c-c++-common/analyzer. It is still in progress (about 100 tests
> done out
> of 700+).

Excellent; thanks.

> 
> A question on that front:
> What is the purpose of test gcc.dg/analyzer/data-model-11.c ?
> 
> int test (void)
> {
>   unsigned char *s = (unsigned char *) "abc";
>   char *t = "xyz";
>   return s[1] + t[1];
> }
> 
> Is the non-constness of the strings being tested, or only their
> signedness?

"git log" tells me that testcase goes all the way back to the initial
commit of the analyzer.  I think that my concern with that test was the
signedness of the strings, not the constness, so feel free to add a
"const" to the type of s (and now I look at it, maybe make "t" be
explicitly signed char, rather than just char?)


Thanks
Dave



Add clang's invalid-noreturn warning flag

2023-07-31 Thread Julian Waters via Gcc
Please review a patch to add clang's invalid-noreturn flag to toggle
noreturn warnings. This patch keeps the old behaviour of always warning on
every noreturn violation but, unlike clang, also adds an extra layer of fine
tuning by turning invalid-noreturn into a warning with levels: level 1 warns
about noreturn functions that do return, level 2 warns about noreturn
functions that explicitly have return statements, and level 3, which is the
default to match the old behaviour, warns for both. I am under the
impression that behaviour-changing patches should be sent to this list
rather than gcc-patches; please do correct me if I am mistaken.

 gcc/c-family/c.opt |  8 
 gcc/c/c-typeck.cc  |  9 ++---
 gcc/c/gimple-parser.cc |  9 ++---
 gcc/cp/coroutines.cc   | 11 +++
 gcc/cp/typeck.cc   |  7 +--
 gcc/tree-cfg.cc|  5 -
 6 files changed, 36 insertions(+), 13 deletions(-)


0001-Add-the-invalid-noreturn-warning-to-match-clang.patch
Description: Binary data


Re: Add clang's invalid-noreturn warning flag

2023-07-31 Thread Andrew Pinski via Gcc
On Mon, Jul 31, 2023 at 8:18 PM Julian Waters via Gcc  wrote:
>
> Please review a patch to add clang's invalid-noreturn flag to toggle
> noreturn  warnings. This patch keeps the old behaviour of always warning on
> every noreturn violation, but unlike clang also adds an extra layer of fine
> tuning by turning invalid-noreturn into a warning with levels, where level
> 1 warns about noreturn functions that do return, level 2 warns about
> noreturn functions that explicitly have return statements, and level 3,
> which is the default to match old behaviour, warns for both instances. I am
> under the impression that behaviour changing patches should be sent to this
> list rather than gcc-patches, please do correct me if I am mistaken

No, all patches just go to gcc-patches@.
I see you didn't add any documentation to doc/invoke.texi at all.
You also missed a ChangeLog.
You are also missing testcases for this change.
Also, did you test your patch, and for which target?
Please read https://gcc.gnu.org/contribute.html for guidelines on
submitting a patch which should help you there.
https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html and
https://gcc.gnu.org/wiki/HowToPrepareATestcase should help on how to
write the testcases.

Thanks,
Andrew Pinski

