--- Additional Comments From tbptbp at gmail dot com 2005-05-09 08:46 ---
I'm going to ping this bug report because there's still some very bad interaction
with gcse in current gcc.
Just compile the 'packet_intersection.cpp' testcase with, e.g., g++-4.1-4120050501
for ia32 to convince yourself
--- Additional Comments From cvs-commit at gcc dot gnu dot org 2005-02-02 00:31 ---
Subject: Bug 19680
CVSROOT: /cvs/gcc
Module name: gcc
Changes by: [EMAIL PROTECTED] 2005-02-02 00:30:38
Modified files:
gcc: ChangeLog
gcc/config/i386: i
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 23:42 ---
d-19680-1 + d-19680-3 isn't as good, 14.9 fps, as some silly stack movements are
induced; e.g.:
40265f: 0f 29 04 24          movaps %xmm0,(%esp)
402663: 0f 57 c0             xorps  %xmm0,%xmm0
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 23:28 ---
Wow! We got a winner. 15.8 fps with -fno-gcse, inlining and only d-19680-3.
402680: 66 0f 6f d1 movdqa %xmm1,%xmm2
..
402688: 66 0f db 50 30 pand 0x30(%eax),%xmm2
40268d:
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 22:58 ---
In the previous test I'd used a crufty string of compilation options; I've removed
all that and settled on -O3 -march=k8 -mfpmath=sse -fno-gcse -fno-exceptions.
The second patch, hack sse simode inputs, is a small win or
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 22:21 ---
Oops, my bad. I thought pshufd mixed both operands à la shufps; I'm obviously not
familiar with the integer side of SSE.
And yes, the combination is a loss, albeit a small one, around 3%. But I'm timing
the whole thing
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 21:12 ---
(In reply to comment #21)
> 4010ce: 0f 29 6c 24 10 movaps %xmm5,0x10(%esp)
> 4010de: 0f 59 5c 24 10 mulps 0x10(%esp),%xmm3
> 4011a1: 0f 29 04 24 movaps %xmm
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 21:02 ---
(In reply to comment #22)
No, it isn't. Look at your functions again. The assembly that you
pasted is 100% perfect. You cannot improve on that in any way.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 20:35 ---
Hmm, there's something fishy with _mm_set1_epi32.
With your patches there's no stack copy anymore but, with the
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 testcase, I get:
00401080 :
401080: 66 0f 6e 4
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 20:18 ---
-fno-gcse is a godsend: instant speedup, and most of the silliness when inlining
is gone.
Now I've applied both your patches, and while they're promising they also
trigger their own nastiness; gcc is so fond of me
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 19:04 ---
I think you'll also want to try using -fno-gcse. The gcse pass is hoisting
values out of your loop (as it is supposed to), except that we don't have
enough registers to hold it all, so the values get spilled
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 14:14 ---
Yes, and I'm not asking for a GPR->SSE transfer. What I'm asking is why gcc
feels the urge to copy that memory reference to the stack before fooling around
with it.
The full sequence is:
401298: 8b 42 28
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 05:31 ---
If you're still looking at K8, moving through memory is two cycles faster than
moving directly between the general register file and the sse register file.
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 18:59 ---
Ah! It seems another temporary isn't eliminated, much like in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19274, this time with
_mm_set1_epi32.
40129b: 89 44 24 1c          mov    %eax,0x1c(%esp)
4012
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 18:40 ---
Yes, that's not a win per se, but even with those "unrolled" address computations
the encodings end up generally tighter, e.g.:
gcc:
40114d: c1 e1 04             shl    $0x4,%ecx
401150: 8d 41 30
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-30 18:04 ---
Ok, I see what Intel is doing. It's computing an index by 16 by doing
addl %ecx,%ecx
movl (%ebx, %ecx, 8), %eax
instead of
sall $4, %ecx
movl (%ebx, %ecx), %eax
which, considering the suckitude of t
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 13:37 ---
But I had to rewrite the hit_t structure in a way much closer to what's found in
the original source, to avoid the same useless cloning I noted earlier with gcc.
Something like:
union float4_t {
float f[4]
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-30 10:59 ---
Ah hah. This is a bit of "cleverness" in the backend. It turns out that
for K8, imul with an 8-bit immediate is vector decoded, and imul with a
register is direct decoded. In theory, splitting out the constant