https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832
--- Comment #8 from Alexander Monakov ---
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #3 from Alexander Monakov ---
There's a forward dependency over 'c' (read of c[i] vs. write of c[i+1] with
'i' iterating forward), and the vectorized variant takes the hit on each
iteration. How is a slowdown even surprising?
For th
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832
--- Comment #10 from Alexander Monakov ---
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much. I
> still have a working Bulldozer machine and I can do some testing.
> I think in Buldoz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107719
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647
--- Comment #15 from Alexander Monakov ---
I'm confused about the first hunk in the attached patch:
--- a/gcc/tree-vect-slp-patterns.cc
+++ b/gcc/tree-vect-slp-patterns.cc
@@ -1035,8 +1035,10 @@ complex_mul_pattern::matches (complex_operation_t
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #21 from Alexander Monakov ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107772
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
--- Comment #24 from Alexander Monakov ---
(In reply to Peter Cordes from comment #23)
> But at least on Linux, I don't think there's a way for user-space to even
> ask for a page of WT or WP memory (or UC or WC). Only WB memory is easily
> ava
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
--- Comment #26 from Alexander Monakov ---
Sure, the right course of action seems to be to simply document that atomic
types and built-ins are meant to be used on "common" (writeback) memory, and no
guarantees can be given otherwise, because it
||amonakov at gcc dot gnu.org
--- Comment #3 from Alexander Monakov ---
LLVM does a better job at code layout, and massively wins on the amount of
executed branches (in particular unconditional jumps). With -fdisable-rtl-bbro
gcc achieves a similar performance.
||amonakov at gcc dot gnu.org
Resolution|--- |FIXED
--- Comment #3 from Alexander Monakov ---
Fixed for gcc-13.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905
--- Comment #5 from Alexander Monakov ---
Not sure what you don't like about the inputs, they appear quite reasonable.
Perhaps GCC's estimation of bb frequencies is off (with profile feedback we
achieve good performance).
Georgi: you'll likely
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107905
--- Comment #6 from Alexander Monakov ---
Let me add that Clang supports GCC's -fprofile-{generate,use} flags for
compatibility as well.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107879
--- Comment #10 from Alexander Monakov ---
If anyone is confused like I was, the commit actually includes a testcase, but
the addition is not mentioned in the Changelog. I was sure the server-side
receive hook was supposed to reject such incompl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107971
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832
--- Comment #11 from Alexander Monakov ---
Factoring out Lujiazui divider shrinks its tables by almost 20x:
3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008
--- Comment #9 from Alexander Monakov ---
I think this is tree-ldist placing memset(sameZ, 0, zPlaneCount) after the
loop, overwriting conditional 'sameZ[i] = true' assignments that happen in the
loop.
For the smaller testcase from comment #6,
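A minimal sketch of the failure mode described in this comment, not the testcase from the report: the function names, the predicate, and the input data are made up for illustration. The first function clears the flags and then sets some of them in the loop. The second runs the same statements with the memset moved after the loop, which is the ordering the comment attributes to tree-ldist, and it loses every 'true' stored by the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the reported code: clear all flags first,
   then conditionally set some of them inside the loop. */
void flags_clear_first(bool *sameZ, const int *z, size_t zPlaneCount)
{
    assert(sameZ != NULL && z != NULL);
    memset(sameZ, 0, zPlaneCount * sizeof *sameZ);
    for (size_t i = 0; i < zPlaneCount; i++)
        if (z[i] == z[0])
            sameZ[i] = true;
}

/* Same statements with the memset placed after the loop: every
   conditional 'sameZ[i] = true' store is overwritten with zero. */
void flags_clear_last(bool *sameZ, const int *z, size_t zPlaneCount)
{
    assert(sameZ != NULL && z != NULL);
    for (size_t i = 0; i < zPlaneCount; i++)
        if (z[i] == z[0])
            sameZ[i] = true;
    memset(sameZ, 0, zPlaneCount * sizeof *sameZ);
}

/* Count how many flags are set. */
size_t count_set(const bool *flags, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += flags[i] ? 1 : 0;
    return count;
}
```

Comparing count_set() on the two outputs for the same input makes the difference visible without inspecting the generated code.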
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108008
--- Comment #10 from Alexander Monakov ---
Looks similar to PR 107323, but needs explicit -ftree-loop-distribution to
trigger.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108076
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
Alexander Monakov changed:
What|Removed |Added
Status|RESOLVED|UNCONFIRMED
Resolution|INVA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
--- Comment #9 from Alexander Monakov ---
(In reply to Feng Xue from comment #8)
> In another angle, because gcc already model control flow and SSA web for
> setjmp/longjmp, explicit volatile specification is not really needed.
That covers GIM
-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
match.pd has multi-pattern matcher 'nop_atomic_bit_test_and_p'.
It expands to ~38 KLOC in gimple-match.cc and ~350 KB in the compiled binary.
There h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
--- Comment #12 from Alexander Monakov ---
Shouldn't there be another bug for the sched1 issue specifically? In absence of
abnormal control flow, extending lifetimes of pseudos across calls is still
likely to be a pessimization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
Alexander Monakov changed:
What|Removed |Added
Resolution|DUPLICATE |FIXED
--- Comment #14 from Alexande
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
Alexander Monakov changed:
What|Removed |Added
Resolution|FIXED |DUPLICATE
--- Comment #15 from Alex
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067
--- Comment #9 from Alexander Monakov ---
*** Bug 108117 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117
--- Comment #16 from Alexander Monakov ---
Draft patch for the sched1 issue:
https://inbox.sourceware.org/gcc-patches/cf62c3ec-0a9e-275e-5efa-2689ff1f0...@ispras.ru/T/#m95238afa0f92daa0ba7f8651741089e7cfc03481
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
It pretends that define_operator_list is commutative when its first member is
NOT commutative:
if (user_id *uid = dyn_cast (id))
{
int res = commutative_op (uid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108209
--- Comment #1 from Alexander Monakov ---
Keeping notes as I go...
Duplicated checks for 'op0' in lower_for are duplicated.
: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*
In the following example, STV is making a very unprofitable transformation on
trunk, but not on gcc-12:
#include
#include
struct b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108229
--- Comment #3 from Alexander Monakov ---
Thank you! I considered this unprofitable for these reasons:
1. As you said, the code grows in size, but the speed benefit is not clear.
2. The transform converts load+add operations in a loop, and the
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
For
unsigned short f(unsigned short x, unsigned short y)
{
return x * y;
}
unsigned short g(unsigned short x
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Target: powerpc64le-*-*
Created attachment 54202
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54202&action=edit
testcase
At le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108318
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Created attachment 52933
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52933&action=edit
testcase
Hit
: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-* i?86-*-*
Minimized from PR 105504.
Compile with -O2 -mtune=haswell -mavx (other
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504
--- Comment #5 from Alexander Monakov ---
The strange xmm0 spill issue may affect more code, so I reported an isolated
testcase: PR 105513 (regression vs. gcc-8, the complete testcase in this PR
also does not spill with gcc-8).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105513
--- Comment #7 from Alexander Monakov ---
The second sequence is 3 uops vs 1/2 (issued/executed) uops in first, and on
Haswell and Skylake it ties up port 5 for two cycles.
Unclear if you're microbenchmarking latency or throughput, but in any c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61810
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105700
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105700
--- Comment #5 from Alexander Monakov ---
(In reply to Artem S. Tashkinov from comment #4)
> > There should be a note in dmesg when a process segfaults outside of a
> > debugger. If you run wine without gdb, and winedevice.exe crashes, is there
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105688
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105863
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
Target Milestone: ---
In the following code, 'f' is not SLP-vectorized, but 'g' is. From a brief look
at slp2 dump, looks like
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101347
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91299
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053
Alexander Monakov changed:
What|Removed |Added
Last reconfirmed||2021-07-24
Resolution|INVALI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053
Alexander Monakov changed:
What|Removed |Added
Resolution|INVALID |---
Status|RESOLVED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053
--- Comment #13 from Alexander Monakov ---
Yes, I'm talking only about labels which are potential branch targets, of
course after the jumps have been DCE'd it is not really observable where the
label points to. Unfortunately after four years I do
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053
--- Comment #15 from Alexander Monakov ---
(In reply to Richard Biener from comment #14)
> I think the original asm goto case clearly remains and this is a difficult
> to handle case since the label address only appears as regular input and the
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113890
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66487
--- Comment #28 from Alexander Monakov ---
The bug is about the issue of lacking diagnostics, it should be fine to make
note of various approaches to remedy the problem in one bug report.
(in any case, all discussion of the Valgrind-based approa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261
--- Comment #3 from Alexander Monakov ---
The first attachment is empty (perhaps you made a non-recursive archive when
you meant to recursively zip a directory).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261
Alexander Monakov changed:
What|Removed |Added
CC||mkuvyrkov at gcc dot gnu.org
--- Co
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261
--- Comment #8 from Alexander Monakov ---
If we want to get rid of the compilation time regression sooner rather than
later, I can suggest limiting my change only to functions that call setjmp:
diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114261
--- Comment #10 from Alexander Monakov ---
Indeed, but OTOH according to bug 84402 comment 58 it caused a noticeable hit
on gimple-match.cc compilation:
733a1b777f16cd397b43a242d9c31761f66d3da8 13th January 2023
sched-deps: do not schedule pseu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108866
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114337
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762
--- Comment #14 from Alexander Monakov ---
That seems undesirable in light of comment #4, you'd risk creating a situation
when -fno-trapping-math is unpredictably slower when denormals appear in dirty
upper halves.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799
--- Comment #9 from Alexander Monakov ---
(In reply to Tom de Vries from comment #7)
> Can you elaborate on what you consider a correct approach?
I think this optimization is incorrect and should be active only under -Ofast.
I can offer two ar
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110823
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110799
--- Comment #16 from Alexander Monakov ---
In C11 and C++11 the issue of compiler-introduced racing loads is discussed as
follows (5.1.2.4 Multi-threaded executions and data races in C11):
28 NOTE 14 Transformations that introduce a speculative
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
Alexander Monakov changed:
What|Removed |Added
Status|NEW |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #8 from Alexander Monakov ---
Why? There's no bswap here, in particular mbedtls_put_unaligned_uint64 is a
straightforward wrapper for memcpy:
inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
{
memcpy(p, &x, sizeof(x
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #9 from Alexander Monakov ---
(In reply to Alexander Monakov from comment #2)
> Note that inline functions in mbedtls/library/alignment.h all miss the
> 'static' qualifier, which affects inlining decisions, and looks like a
> mistake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #10 from Alexander Monakov ---
Ah, the non-static inlines are intentional, the corresponding extern
declarations appear in library/platform_util.c. Sorry, I missed that file the
first time around.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110946
--- Comment #11 from Alexander Monakov ---
(In reply to Alexander Monakov from comment #8)
> inline void mbedtls_put_unaligned_uint64(void *p, uint64_t x)
> {
> memcpy(p, &x, sizeof(x));
> }
>
>
> We deciding to not inline this, while inli
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979
--- Comment #2 from Alexander Monakov ---
Yes, it is wrong-code to full extent. To demonstrate, you can initialize 'sum'
and the array to negative zeroes:
#define FLT double
#define N 20
__attribute__((noipa))
FLT
foo3 (FLT *a)
{
FLT sum =
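The reason negative zeroes make the problem observable can be sketched as follows. This is not the foo3 from the report; the helper names are hypothetical, and the default round-to-nearest mode is assumed. In that mode, -0.0 + -0.0 is -0.0, while +0.0 + -0.0 is +0.0, so a sum over negative zeroes keeps its sign only if no positive zero enters the reduction.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: add n elements to init, strictly in order. */
double sum_from(double init, const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = init;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Returns 1 if x is a zero with the sign bit set, i.e. -0.0. */
int is_negative_zero(double x)
{
    return x == 0.0 && signbit(x) != 0;
}
```

With an array of negative zeroes, sum_from(-0.0, a, n) yields -0.0, while sum_from(0.0, a, n) yields +0.0; the two compare equal with ==, so the sign has to be checked with signbit().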
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111009
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
|--- |INVALID
CC||amonakov at gcc dot gnu.org
--- Comment #1 from Alexander Monakov ---
0x7fe5ed65 is a quiet NaN, not signaling (it differs from the input 0x7fa5ed65
sNaN by the leading mantissa bit 0x0040).
IEEE-754 does not pin
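The bit arithmetic behind this can be checked with a short sketch. The function names are made up, and it assumes the binary32 layout with the quiet bit in the most significant mantissa position (0x00400000), which is the convention on x86 and most current targets; some legacy targets used the opposite convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float expected");

/* Reinterpret a float as its 32-bit pattern without violating aliasing rules. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* NaN: exponent field all ones and a nonzero mantissa. */
bool is_nan_bits(uint32_t bits)
{
    return (bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu) != 0;
}

/* Quiet NaN under the assumed convention: leading mantissa bit is set. */
bool is_quiet_nan_bits(uint32_t bits)
{
    return is_nan_bits(bits) && (bits & 0x00400000u) != 0;
}
```

Applied to the two patterns from the comment, 0x7fe5ed65 classifies as quiet and 0x7fa5ed65 as signaling, and their XOR is exactly 0x00400000.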
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43
--- Comment #6 from Alexander Monakov ---
Thanks.
i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architecture. Are you binding the
benchmark to some core in particular?
|UNCONFIRMED |RESOLVED
CC||amonakov at gcc dot gnu.org
--- Comment #1 from Alexander Monakov ---
'c' is called with 'd' pointing to 'long e[2]', so
return *(int *)(d + 1);
is an aliasing violation (dereferencin
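One way to read the same bytes without the aliasing violation is to copy them out with memcpy. This is only a sketch with a hypothetical function name, not a change proposed in the report, and the value it returns still depends on the target's endianness and type sizes.

```c
#include <assert.h>
#include <string.h>

static_assert(sizeof(int) <= sizeof(long), "int must fit inside one long");

/* Hypothetical replacement for 'return *(int *)(d + 1);': copy the first
   sizeof(int) bytes of d[1] into an int object instead of dereferencing
   a long object through an int lvalue. */
int load_int_after_first(const long *d)
{
    assert(d != NULL);
    int value;
    memcpy(&value, d + 1, sizeof value);
    return value;
}
```

Compilers typically turn such a fixed-size memcpy into a plain load, so the well-defined form need not cost anything at run time.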
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111210
--- Comment #4 from Alexander Monakov ---
The testcase is small enough to notice the issue by inspection.
Note that you get the "expected" answer with -fno-strict-aliasing, and as
explained in https://gcc.gnu.org/bugs/ it is one of the things y
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51446
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655
--- Comment #11 from Alexander Monakov ---
(In reply to Richard Biener from comment #10)
> And this conservatively has to apply to all FP divisions where we might infer
> "nonnegative" unless we can also infer !zerop?
Yes, I think the logic in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, eggert at cs dot ucla.edu,
rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111643
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
Alexander Monakov changed:
What|Removed |Added
CC||amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
--- Comment #3 from Alexander Monakov ---
Sorry, the second half of my comment is confusing. To clarify, ASan works fine
for TLS data (the compiler knows that TLS base is at fs:0; libsanitizer uses
some hacks to initialize shadow for TLS anyway,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694
--- Comment #7 from Alexander Monakov ---
No backport for gcc-13 planned?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768
--- Comment #5 from Alexander Monakov ---
I think it's similar to attempting -march=native under distcc, which is already
warned about on Gentoo wiki: https://wiki.gentoo.org/wiki/Distcc
The difference here is that Intel so far decided to make
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116458
--- Comment #3 from Alexander Monakov ---
David, thanks for Cc'ing me and for running Valgrind builds!
Richi, I'll check in more detail later today, I think we should unbreak
Valgrind builds ASAP by initializing padding under #ifdef
ENABLE_VALG
|1
Last reconfirmed||2024-08-22
Assignee|unassigned at gcc dot gnu.org |amonakov at gcc dot gnu.org
--- Comment #5 from Alexander Monakov ---
Turns out we already initialize padding, just in a different file, and I
completely
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116458
--- Comment #6 from Alexander Monakov ---
As for Valgrind false positive, it handles this SSSE3 code really well and
misses the key point by a very narrow margin. We have
found = m1 + (m2 << 16);
where both m1 and m2 hold 16-bit masks from p
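A small sketch of the expression above, assuming only what the comment states, namely that m1 and m2 are 16-bit masks; the function name is made up. Since the two operands occupy disjoint bit ranges, the addition produces no carries and gives the same value as a bitwise OR.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 16-bit masks into one 32-bit value, as in
   'found = m1 + (m2 << 16)'. */
uint32_t combine_masks(uint16_t m1, uint16_t m2)
{
    uint32_t lo = m1;
    uint32_t hi = (uint32_t)m2 << 16;
    uint32_t found = lo + hi;
    /* No bit is set in both operands, so '+' and '|' agree. */
    assert(found == (lo | hi));
    return found;
}
```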
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116458
--- Comment #8 from Alexander Monakov ---
Thanks for the reference, but it doesn't help. Something more subtle is going
on, because placing the shift-add combo in a separate function makes Valgrind
properly compute known bits even without the ma