https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88284
--- Comment #8 from Michael_S ---
(In reply to sandra from comment #7)
> While Intel has revived the "Altera" name, the Nios II processor is still
> listed as discontinued. I see they are offering ARM-based FPGA products
> again instead.
>
Arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88284
--- Comment #4 from Michael_S ---
Deprecation of Nios2 was pushed by Intel, which appears to have a love affair
with RISC-V. But now Altera has been spun off, and Intel is no longer involved
in the technical side of their business.
So, maybe, before purging all
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #21 from Michael_S ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
>
> This would be the optimal code, right?
>
> add4i
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #19 from Michael_S ---
(In reply to Mason from comment #18)
> Hello Michael_S,
>
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
>
> #in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #24 from Michael_S ---
(In reply to Michael_S from comment #22)
> (In reply to Michael_S from comment #8)
> > (In reply to Thomas Koenig from comment #6)
> > > And there will have to be a decision about 32-bit targets.
> > >
> >
> >
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #23 from Michael_S ---
(In reply to Jakub Jelinek from comment #19)
> So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86
> sfp-machine.h
> #define FP_INIT_ROUNDMODE \
> do {
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #22 from Michael_S ---
(In reply to Michael_S from comment #8)
> (In reply to Thomas Koenig from comment #6)
> > And there will have to be a decision about 32-bit targets.
> >
>
> IMHO, 32-bit targets should be left in their current
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #16 from Michael_S ---
(In reply to Jakub Jelinek from comment #15)
> libquadmath is not needed nor useful on aarch64-linux, because long double
> type there is already IEEE 754 quad.
That's good to know. Thank you.
If you are here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #12 from Michael_S ---
(In reply to Thomas Koenig from comment #10)
> What we would need for incorporation into gcc is to have several
> functions, which would then called depending on which floating point
> options are in force at t
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #11 from Michael_S ---
(In reply to Thomas Koenig from comment #9)
> Created attachment 54273 [details]
> matmul_r16.i
>
> Here is matmul_r16.i from a relatively recent trunk.
Thank you.
Unfortunately, I was not able to link it wit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #8 from Michael_S ---
(In reply to Thomas Koenig from comment #6)
> (In reply to Michael_S from comment #5)
> > Hi Thomas
> > Are you in or out?
>
> Depends a bit on what exactly you want to do, and if there is
> a chance that what
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #7 from Michael_S ---
Either here or my yahoo e-mail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #5 from Michael_S ---
Hi Thomas
Are you in or out?
If you are still in, I can use your help on several issues.
1. Torture.
See if the Invalid Operand exception is raised properly now. Also check whether
there are still remaining problems with NaN.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279
--- Comment #4 from Michael_S ---
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. the relying on __int128 rules out various arches (basically all 32-bit
> arches,
> ia32, powerpc 32-bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #22 from Michael_S ---
(In reply to Alexander Monakov from comment #21)
> (In reply to Michael_S from comment #19)
> > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > > 'unlaminated' (turned to 2 uops before r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #20 from Michael_S ---
(In reply to Richard Biener from comment #17)
> (In reply to Michael_S from comment #16)
> > On an unrelated note, why does the loop overhead use so many instructions?
> > Assuming that I am as misguided as gcc about load-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #19 from Michael_S ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements t
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #16 from Michael_S ---
On an unrelated note, why does the loop overhead use so many instructions?
Assuming that I am as misguided as gcc about load-op combining, I would write
it as:
sub %rax, %rdx
.L3:
vmovupd (%rdx,%rax), %ymm1
vmovupd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #14 from Michael_S ---
I tested a smaller test bench from Comment 3 with gcc trunk on godbolt.
The issue appears to be only partially fixed.
The -Ofast result is no longer the horror it was before, but it is still not as
good as -O3 or -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #15 from Michael_S ---
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On related note...
> > One of the historical good features of gcc relatively to other popular
> > compilers was absen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220
--- Comment #3 from Michael_S ---
-march=haswell is not very important.
I added it only because, in the absence of the BMI extension, the issue is
somewhat obscured by the need to keep the shift count in the CL register.
-O2 is also not important; -O3 is the same. An
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220
Bug ID: 106220
Summary: x86-64 optimizer forgets about shrd peephole
optimization pattern when faced with more than one in
close proximity
Product: gcc
Version:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #23 from Michael_S ---
(In reply to jos...@codesourcery.com from comment #22)
> On Mon, 13 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote:
>
> > > The function should be sqrtf128 (present in glibc 2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #21 from Michael_S ---
(In reply to jos...@codesourcery.com from comment #20)
> On Sat, 11 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote:
>
> > On MSYS2 _Float128 and __float128 appears to be mostly th
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #19 from Michael_S ---
(In reply to jos...@codesourcery.com from comment #18)
> libquadmath is essentially legacy code. People working directly in C
> should be using the C23 _Float128 interfaces and *f128 functions, as in
> curre
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #17 from Michael_S ---
(In reply to Jakub Jelinek from comment #15)
> From what I can see, it is mostly integral implementation and we already
> have one such in GCC, so the question is if we just shouldn't use it (most
> of the sou
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #16 from Michael_S ---
(In reply to Thomas Koenig from comment #14)
> @Michael: Now that gcc 12 is out of the door, I would suggest we try to get
> your code into the gcc tree for gcc 13.
>
> It should follow the gcc style guideline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #12 from Michael_S ---
On a related note...
One of the historically good features of gcc relative to other popular
compilers was the absence of auto-vectorization at -O2.
When did you decide to change that, and why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #11 from Michael_S ---
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have speci
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #6 from Michael_S ---
(In reply to Michael_S from comment #5)
>
> Even scalar-to-scalar or vector-to-vector moves that are hoisted at the renamer
> do not have a zero cost, because quite often the renamer itself constitutes
> the narrowes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #5 from Michael_S ---
(In reply to Richard Biener from comment #3)
> We are vectorizing the store it dst[] now at -O2 since that appears
> profitable:
>
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #4 from Michael_S ---
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a return.
> It is a cost model issue and it might not really be an issue in the final
> code just
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
Bug ID: 105617
Summary: Regression in code generation for _addcarry_u64()
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Compone
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468
--- Comment #4 from Michael_S ---
Created attachment 52925
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52925&action=edit
build script
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468
--- Comment #3 from Michael_S ---
Created attachment 52924
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52924&action=edit
Another test bench that shows lower impact on Zen3, but higher impact on some
Intel CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468
--- Comment #2 from Michael_S ---
Created attachment 52923
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52923&action=edit
test bench that shows lower impact on Zen3, but higher impact on some Intel
CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468
--- Comment #1 from Michael_S ---
Created attachment 52922
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52922&action=edit
test bench that demonstrates maximal impact on Zen3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468
Bug ID: 105468
Summary: Suboptimal code generation for access of function
parameters and return values of type __float128 on
x86-64 Windows target.
Product: gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #13 from Michael_S ---
It turned out that on all micro-architectures that I care about (and the
majority of those that I don't) double-precision floating-point division is
quite fast.
It's so fast that it easily beats my clever reci
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #12 from Michael_S ---
(In reply to Michael_S from comment #11)
> (In reply to Michael_S from comment #10)
> > BTW, the same ideas as in the code above could improve speed of division
> > operation (on modern 64-bit HW) by factor of
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #11 from Michael_S ---
(In reply to Michael_S from comment #10)
> BTW, the same ideas as in the code above could improve the speed of the division
> operation (on modern 64-bit HW) by a factor of 3 (on Intel) or 2 (on AMD).
Did it.
On Intel i
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #10 from Michael_S ---
BTW, the same ideas as in the code above could improve the speed of the division
operation (on modern 64-bit HW) by a factor of 3 (on Intel) or 2 (on AMD).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
--- Comment #9 from Michael_S ---
(In reply to Michael_S from comment #4)
> If you want quick fix for immediate shipment then you can take that:
>
> #include
> #include
>
> __float128 quick_and_dirty_sqrtq(__float128 x)
> {
> if (isnanq(x)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101
Michael_S changed:
What|Removed |Added
CC||already5chosen at yahoo dot com
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #10 from Michael_S ---
I lost track of what you're talking about a long time ago.
But that's OK.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #3 from Michael_S ---
(In reply to Richard Biener from comment #2)
> It's again reassociation making a mess out of the natural SLP opportunity
> (and thus SLP discovery fails miserably).
>
> One idea worth playing with would be to ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173
--- Comment #9 from Michael_S ---
Despite what I wrote above, I did take a look at the trunk on godbolt, with the
same old code from a year ago, because it was so easy. And indeed, the trunk
looks A LOT better.
But until it's released I wouldn't know if i
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173
--- Comment #8 from Michael_S ---
(In reply to Jakub Jelinek from comment #7)
> (In reply to Michael_S from comment #5)
> > I agree with regard to "other targets", first of all, aarch64, but x86_64
> > variant of gcc already provides requested fu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173
--- Comment #6 from Michael_S ---
(In reply to Marc Glisse from comment #1)
> We could start with the simpler:
>
> void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned
> b,unsigned c, unsigned d){
> *r=a+b;
> *s=c+d+(*r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173
Michael_S changed:
What|Removed |Added
CC||already5chosen at yahoo dot com
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
Bug ID: 97832
Summary: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7
times slower than -O3
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
--- Comment #9 from Michael_S ---
Hopefully, you did regression tests for all the main AoS<->SoA cases.
I.e.
typedef struct { double re, im; } dcmlx_t;
void soa2aos(double* restrict dstRe, double* restrict dstIm, const dcmlx_t
src[], int nq)
{
for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
--- Comment #6 from Michael_S ---
(In reply to Richard Biener from comment #4)
>
> while the lack of cross-lane shuffles in AVX2 requires a
>
> .L3:
> vmovupd (%rsi,%rax), %xmm5
> vmovupd 32(%rsi,%rax), %xmm6
> vinsertf1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
--- Comment #5 from Michael_S ---
(In reply to Richard Biener from comment #4)
> I have a fix that, with -mavx512f generates just
>
> .L3:
> vmovupd (%rcx,%rax), %zmm0
> vpermpd (%rsi,%rax), %zmm1, %zmm2
> vpermpd %zmm0,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
Bug ID: 97428
Summary: -O3 is great for basic AoSoA packing of complex
arrays, but horrible one step above the basic
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
--- Comment #2 from Michael_S ---
(In reply to Richard Biener from comment #1)
> All below for Part 2.
>
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about rea
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
Bug ID: 97343
Summary: AVX2 vectorizer generates extremely strange and slow
code for AoSoA complex dot product
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #15 from Michael_S ---
(In reply to Hongtao.liu from comment #14)
> > Still I don't understand why the compiler does not compare the cost of the
> > full loop body after combining to the cost before combining and does not come
> > to the conclusi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #13 from Michael_S ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "l
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #10 from Michael_S ---
(In reply to Hongtao.liu from comment #9)
> (In reply to Michael_S from comment #8)
> > What are values of gcc "loop" cost of the relevant instructions now?
> > 1. AVX256 Load
> > 2. FMA3 ymm,ymm,ymm
> > 3. AVX2