On Zen 2 and 3 with -flto the speedup seems to be about 12% for both -O2
and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.
Overall it seems a win, but there are a few noteworthy issues.
I also see a 6.69% regression on x64 with
> --- Comment #6 from Richard Biener ---
> Honza, -Og was supposed to not do so much work, I intended to disable IPA
> inlining but there's no knob for that. I wonder where to best put such
> guard? I set flag_inline_small_functions to zero for -Og but we still
> run inline_small_functions ().
> You cannot disable an IPA pass because then we will mishandle
> optimize attributes. I think you simply want to set
>
> flag_inline_small_functions = 0
> flag_inline_functions_called_once = 0
Actually I forgot, we have flag_no_inline, which makes
tree_inlinable_function_p return false for
>
> Sure - I just remember (falsely?) that we finally decided to do it :)
I do not recall this, but I may have forgotten :))
> If we don't run IPA inline we don't figure we failed to inline the
> always_inline either ;) And IPA inline can expose more indirect
> always-inlines we only discover a
> So nothing to see? I guess our unit growth limit doesn't trigger because it's
> a small (benchmark) unit?
Yep, unit growth limits do not apply to very small units. The ipa-cp
heuristics still, IMO, need work and should be based on relative speedups
rather than absolute ones for the cutoffs.
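To illustrate the difference between an absolute and a relative cutoff,
here is a minimal sketch. It is not GCC's ipa-cp code; the struct, the
function names and the thresholds are all hypothetical and only show why
the two kinds of cutoff can disagree.

```c
#include <stdbool.h>

/* Hypothetical cost estimate, NOT a GCC data structure. */
struct clone_estimate {
  double time_before;  /* estimated time of the original function */
  double time_after;   /* estimated time of the specialized clone */
};

/* Absolute cutoff: accept when the saved time exceeds a fixed amount.
   A large function with a tiny relative gain can still pass. */
static bool worth_cloning_absolute(struct clone_estimate e, double min_saved)
{
  return e.time_before - e.time_after >= min_saved;
}

/* Relative cutoff: accept when the saved fraction exceeds a ratio.
   A small function that gets twice as fast passes, a big one that
   gets 1% faster does not. */
static bool worth_cloning_relative(struct clone_estimate e, double min_ratio)
{
  if (e.time_before <= 0.0)
    return false;
  return (e.time_before - e.time_after) / e.time_before >= min_ratio;
}
```

With these made-up numbers, a function going from 1000 to 990 time units
passes an absolute cutoff of 8 but fails a relative cutoff of 25%, while a
function going from 10 to 5 does the opposite.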
So I assume that this is due to the new pass_waccess, which was added to
the early optimizations. I think this is not really an ipa component issue
but tree-optimize.
> > bool
> Since the pass issues a bunch other warnings (e.g., -Wstringop-overflow,
> -Wuse-after-free, etc.) the gate doesn't seem right. But since #pragma GCC
> diagnostic can re-enable warnings disabled by -w (or turn them into errors)
> any gate that considers the global option setting will
> > According to znver2_cost
> >
> > The cost of sse_to_integer is a little less than fp_store; maybe
> > increasing the sse_to_integer cost (to more than fp_store) can help RA
> > choose memory instead of GPR.
>
> That sounds reasonable - GPR<->xmm is cheaper than GPR -> stack -> xmm
> but GPR<->xmm s
> I would say so. It saves code size and also uop space unless the two
> can magically fuse to an immediate to %xmm move (I doubt that).
I made a simple benchmark
double a=10;
int
main()
{
long int i;
double sum,val1,val2,val3,val4;
for (i=0;i<10;i++)
{
#if
> See above comments from Iain, even if that pre-initialization is removed it is
> still miscompiled. And, the testcase fails not because of the padding bits not
> being zero, but because the address of self stored into one of the fields isn't
> there or modref thinks it can't be changed or
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103040
>
> --- Comment #15 from Iain Buclaw ---
> Got it. The difference between D and C++ is a matter of early inlining.
>
> The C++ example Jakub posted fails in the same way that D does if you compile
> with: -O1 -fno-inline
Great, I will take a
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
>
> Aldy Hernandez changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>          Depends on|                            |103058
>
> --- Comment #
>
> This PR is still open, at least for slowdown in the threader with LTO. The
> issue is ranger wide, so it may also cause slowdowns on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere since I was looking up the
revision range mostly to figure out i
Note that it still seems to me that the crossed_loop_header handling is
overly conservative. We have:
@@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths (vec &path)
   bool seen_latch = false;
   int loops_crossed = 0;
   bool crossed_latch = false;
+  bool crossed_loop_header = false;
The sanity check verifies that functions accessing a parameter indirectly
also read the parameter (otherwise the indirect reference cannot
happen). This patch moves the check earlier and removes some overactive
flag cleaning on function call boundaries, which introduces the nonsensical
situation. I g
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103211
>
> --- Comment #2 from Martin Liška ---
> The optimized dump differs for a couple of functions in the same way:
>
> diff -u good bad
> --- good 2021-11-12 17:42:36.995947103 +0100
> +++ bad 2021-11-12 17:41:56.728194961 +0100
> @@ -38,7 +38
> Happens with UBSAN compiler for:
>
> $ gcc gcc/testsuite/gcc.c-torture/execute/pr71494.c -O1 -flto
> ...
> /home/marxin/Programming/gcc/gcc/ipa-modref-tree.h:550:33: runtime error: load
> of value 255, which is not a valid value for type 'bool'
> #0 0x18acc38 in modref_tree::merge(modref_tr
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103230
>
> --- Comment #2 from Martin Liška ---
> > How do you build ubsan compiler?
>
> F="-O0 -g -fsanitize=undefined" ; make -j16 all-host -k CFLAGS="$F"
> CXXFLAGS="$F" LDFLAGS="$F"
>
> is the fastest approach.
Thanks, it is similar to what I
> [659] %
> [659] % gcctk -O0 -w small.c
> [660] %
> [660] % gcctk -O1 -w small.c
> [661] % gcctk -O1 -w small.c
> [662] % gcctk -O1 -w small.c
> gcctk: internal compiler error: Segmentation fault signal terminated program
> cc1
> Please submit a full bug report,
> with preprocessed source if app
Works for me even with the 3 warnings.
hubicka@lomikamen:/aux/hubicka/trunk/build-lto2/gcc$ cat >tt.c
__attribute__ ((noinline,const))
infinite (int p)
{
if (p)
while (1);
return p;
}
__attribute__ ((noinline))
static void
test(int p, int *a)
{
int v = infinite (p);
if (*a && v)
__
Aha, but here is a better example (it reproduces the same way).
In the former one I forgot the const attribute, which makes it invalid.
The testcase tests that ipa-sra is missing ECF_LOOPING_CONST_OR_PURE
check
static int
__attribute__ ((noinline))
infinite (int p)
{
if (p)
while (1);
return p;
}
__attr
> @@ -1,4 +1,3 @@
> -static int
> __attribute__ ((noinline,const))
> infinite (int p)
> {
Just for the record, it crashes with or without static int here for me :)
I ran across it because the code tracking must access in ipa-sra is IMO
conceptually wrong. I noticed that because ipa-modref solves
Needs -O2 -floop-unroll-and-jam --param early-inlining-insns=14
to fail, so I guess it may be an issue with unroll-and-jam.
> (The -fno-semantic-interposition thing is probably the biggest performance gap
> between gcc -fpic and clang -fpic.)
Yep, it is often confusing to users (who do not understand what ELF
interposition is) that clang and gcc disagree on default flags here.
Recently -Ofast was extended to imply -fno-
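For readers unfamiliar with ELF interposition, here is a minimal sketch of
the situation, assuming the code is built into a shared library with
-fpic. The function names are made up for illustration.

```c
/* Exported (default visibility) function. In a shared library another
   definition of get_scale could be interposed at dynamic link time,
   for example via LD_PRELOAD. */
int get_scale(void)
{
  return 2;
}

/* With GCC's default for -fpic the call below is treated as
   interposable, so get_scale is not inlined here. With
   -fno-semantic-interposition the compiler may assume the local
   definition is the one that is called and inline it. */
int scaled(int x)
{
  return x * get_scale();
}
```

Comparing the assembly of scaled at -O2 -fpic with and without
-fno-semantic-interposition should show the difference; declaring
get_scale static or with hidden visibility has a similar effect.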
This is a slightly modified patch I am testing. I added pre-computation of
the number of accesses, enabled the path for const functions (in case they
have a memory operand), initialized alias sets and clarified the logic
around every_* and global_memory_accesses
PR tree-optimization/103168
The patch passed testing on x86_64-linux.
>
> Well, I'm specifically speaking about:
> error: the control flow of function ‘BZ2_compressBlock’ does not match its
> profile data (counter ‘arcs’)
>
> this type of errors should not happen even in a multi-threaded programs.
There are some cases where I see even those on clang build - I am
>
> fixup_cfg already removes write-only stores so that seems fit for that
> purpose.
>
> Btw,
>
> static int x = 1;
>
> int main()
> {
> x = 1;
> }
>
> should ideally be handled as well as maybe the more common(?)
>
> static int x[128];
>
> int main()
> {
> memset (x, 0, 128*4);
> }
>
> Not seen on Haswell (but w/o PGO). Is this PGO specific? There's another
> large jump visible end of 2019.
It is between 2019-11-15 and 18, but the revisions do not exist in git;
perhaps they refer to the old git mirror. Martin will know better.
In that range there are many of Richard's vec
>
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
>
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput
> To me, all of these do the same thing and should generate the same code.
> As nobody else can see removeme, and we aren't leaking its address, shouldn't
> the compiler be able to deduce that all accesses to removeme are
> inconsequential and can be removed?
>
> My gcc 11.3 generates a condition
> > My guess is that the
> > BUILD_BUG();
> > line is the sole thing that is wrong, it should be just break;
> > as the memory_is_poisoned_n(addr, size); will handle all the sizes,
> > regardless if they are constants or not.
>
> Sure, I'm going to suggest such a change.
To me it looked like a pro
> > For this one it's PRE hoisting *b across the endless loop (PRE handles
> > calls as possibly not returning but not loops as possibly not
> > terminating...)
> > So it's a different bug.
>
> Btw, C++ requiring forward progress makes the testcase undefined.
In my understanding access to volatil
> > I guess PTA gets around by tracking points-to set also for non-pointer
> > types and consequently it also gives up on any such addition.
>
> It does. But note it does _not_ for POINTER_PLUS where it treats
> the offset operand as non-pointer.
>
> > I think it is ipa-prop.c::unadjusted_ptr_an
Looking at the prototype patch, why need to change also the splitters?
My original goal was to use splitters to expand to faster code sequences
while having patterns necessary for both variants. This makes it
possible to use optimize_insn_for_size/speed and make decisions using BB
profile, since
> Note GCC has not retuned its -Os heuristics for a long time because it has been
> decent enough for most folks and corner cases like this almost never come
> up.
There were quite a few changes to -Os heuristics :)
One of the bigger challenges is that we do see more and more C++ code built
with -Os wh
> I suspect this is most likely the profile updates changes ...
Quite possibly. The goal of this exercise is to figure out if there are
some bugs in the profile estimate, or whether passes somehow prefer a broken
profile, or if it is just bad luck.
Looking at sphinx and fatigue it seems that LRA really
> This heuristic wants to catch
>
>
> if (foo) abort ();
>
>
> and avoid sinking "too far" across a path with "similar enough"
> execution count (I think the original motivation was to fix some
> spilling / register pressure issue). The loop depth test
> should be !(bb_loop_depth (best_b
> > If I comment it out as above patch, then O3/PGO can get 16% and 12%
> > performance
> > improvement compared to O2 on x86.
> >
> > O2 O3 PGO
> > cycles 2,497,674,824 2,104,993,224 2,199,753,593
> > instructions1
> But adds a return with a value. And then the inliner inlines foo into foo2 but
> we still have the return with a value around ...
I guess ICF can special-case an unused return value, but why is this not
taken care of by ipa-sra?
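As a minimal sketch of the kind of case meant here (this is not the
testcase from the PR, and the names are hypothetical): a local function
whose return value is ignored by every caller is a candidate for having
the return value removed by ipa-sra.

```c
/* Hypothetical helper: every caller ignores the returned value, so an
   interprocedural pass could in principle turn this into a void
   function. noinline keeps the call visible to IPA passes. */
static int __attribute__((noinline)) store_value(int *slot, int v)
{
  *slot = v;
  return v;
}

/* The only caller discards the result. */
void set_slot(int *slot, int v)
{
  store_value(slot, v);
}
```

Whether the return value is actually dropped can be checked in the IPA
dumps produced by -fdump-ipa-all.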
> > Indeed it is quite a long-standing problem with clang not building with lifetime
> > DSE and strict aliasing. I wonder why this is not fixed on the clang side?
>
> Because the problems were not communicated? I knew that Firefox needed
> -flifetime-dse=1, but it's the first time I hear that any such pro
There is still a problem with loop bounds. I am testing a patch for that and
then we should (finally) be safe.
This patch attempts to add __builtin_operator_new/delete. So far they
are not optimized, which will need to be done by an extra flag of BUILT_IN_
code. Also, the decl.cc code can be refactored to be less cut&paste,
and I guess the has_builtin hack to return the proper value needs to be moved
to the C++ FE.
How
> Is the option supposed to be only about the standard global scope operator
> new/delete (_Znam etc.) or also user operator new/delete class methods? If the
> former, then I agree it is a global property (or at least a per shared
> library/binary property, one can arrange stuff with symbol vis
>
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.
If the inliner sees a path to the throw functions, it will not consider
_M_check_len early inlinable.
Perhaps we can __builtin_con
Just so it is somewhere, here is a testcase that we can't inline leaf
functions to always_inlines unless we do some tracking of what calls
were formerly indirect calls.
We really overloaded always_inline from the original semantics "drop
inlining heuristics" into "be sure that result is inlined" w
>
> why disallow caller->indirect_calls?
See testcase in comment #9
>
> > + return false;
> > + for (cgraph_edge *e2 = callee->callees; e2; e2 = e2->next_callee)
>
> I don't think this flies - it looks quadratic. Can we compute this
> in the inline summary once instead?
I guess I can
> Confirm. But option save/restore has been always implemented:
>
> .section .gnu.lto_.opts,"",@progbits
> .ascii "'-fno-openmp' '-fno-openacc' '-fno-pie' '-fcf-protection"
> .ascii "=none' '-mabi=lp64d' '-march=loongarch64' '-mfpu=64' '-m"
> .ascii "simd=lasx' '-mcmodel=nor
> different issue from the one that is raised in the PR. (Unless we think that
> -O2 and -O3 should always have the same inlining heuristics henceforward, but
> that seems unlikely.)
Yes, I think the point of -O3 is to let the compiler be more aggressive than
what seems desirable for your average dist
> > So I think all we can hope for is merging memcpy with the extra write of 0.
>
> That's not actually clear.
>
> It would be reasonable to assume that foo isn't likely to change the string
> and have the inlined destructor for a string that was initialized as a short
> string like here do somet