[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-05-09 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-05-09 08:46 --- I'm going to ping this bug report because there's still some very bad interaction with gcse in current gcc. Just compile the 'packet_intersection.cpp' testcase with, e.g., g++-4.1-4120050501 for ia32 to convince yoursel

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-02-01 Thread cvs-commit at gcc dot gnu dot org
--- Additional Comments From cvs-commit at gcc dot gnu dot org 2005-02-02 00:31 --- Subject: Bug 19680 CVSROOT:/cvs/gcc Module name:gcc Changes by: [EMAIL PROTECTED] 2005-02-02 00:30:38 Modified files: gcc: ChangeLog gcc/config/i386: i

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 23:42 --- d-19680-1 + d-19680-3 isn't as good, 14.9fps, as some silly stack movements are induced; ie: 40265f: 0f 29 04 24 movaps %xmm0,(%esp) 402663: 0f 57 c0 xorps %xmm0,%xmm0

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 23:28 --- Wow! We got a winner. 15.8 fps with -fno-gcse, inlining and only d-19680-3. 402680: 66 0f 6f d1 movdqa %xmm1,%xmm2 .. 402688: 66 0f db 50 30 pand 0x30(%eax),%xmm2 40268d:

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 22:58 --- In previous test i've used a crufted string of compilation options; i've removed all that crap for -O3 -march=k8 -mfpmath=sse -fno-gcse -fno-exceptions. The second patch, hack sse simode inputs, is a small win or

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 22:21 --- Oops, my bad. Thought pshufd mixed both operands à la shufps; i'm obviously not familiar with the integer side of SSE. And yes the combination is a lose, albeit a small one around 3%. But i'm timing the whole thin

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 21:12 --- (In reply to comment #21) > 4010ce: 0f 29 6c 24 10 movaps %xmm5,0x10(%esp) > 4010de: 0f 59 5c 24 10 mulps 0x10(%esp),%xmm3 > 4011a1: 0f 29 04 24 movaps %xmm

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 21:02 --- (In reply to comment #22) No, it isn't. Look at your functions again. The assembly that you pasted is 100% perfect. You cannot improve on that in any way. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 20:35 --- Hmm, there's something fishy with _mm_set1_epi32. With your patches there's no stack copy anymore but, with http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 testcase, i get: 00401080 : 401080: 66 0f 6e 4

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 20:18 --- -fno-gcse is a godsend, instant speedup and most of the silliness when inlining is gone. Now i've applied both your patches, and while they're promising they also trigger their own nastiness; gcc is so fond of me

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 19:04 --- I think you'll also want to try using -fno-gcse. The gcse pass is hoisting values out of your loop (as it is supposed to), except that we don't have enough registers to hold it all, so the values get spilled

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-31 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-31 14:14 --- Yes, and i'm not asking for a GPR->SSE transfer. What i'm asking is why gcc feels the urge to copy that memory reference to the stack before fooling around with it. The full sequence is: 401298: 8b 42 28

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-31 05:31 --- If you're still looking at K8, moving through memory is two cycles faster than moving directly between the general register file and the sse register file. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=1968

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 18:59 --- Ah! Seems that another temporary isn't eliminated, much like http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19274, this time with _mm_set1_epi32. 40129b: 89 44 24 1c mov %eax,0x1c(%esp) 4012

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 18:40 --- Yes that's not a win per se but even with those "unrolled" addr computations its encodings end up generally tighter, ie: gcc: 40114d: c1 e1 04 shl $0x4,%ecx 401150: 8d 41 30

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-30 18:04 --- Ok, I see what Intel is doing. It's computing an index by 16 by doing addl %ecx,%ecx movl (%ebx, %ecx, 8), %eax instead of sall $4, %ecx movl (%ebx, %ecx), %eax which, considering the suckitude of t
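
The two instruction sequences quoted in the comment above compute the same byte offset for an index scaled by 16. A minimal C++ sketch of that arithmetic follows; the function names are hypothetical and only model the address math, not what either compiler actually emits or how the encodings compare in size or speed.

```cpp
#include <cassert>
#include <cstdint>

// Models "sall $4, %ecx ; movl (%ebx,%ecx), %eax":
// shift the index left by 4, then use it unscaled in the address.
// (Hypothetical name, for illustration only.)
constexpr std::uint32_t offset_shift(std::uint32_t index) {
    return index << 4;
}

// Models "addl %ecx,%ecx ; movl (%ebx,%ecx,8), %eax":
// double the index, then let the addressing mode scale it by 8.
// (Hypothetical name, for illustration only.)
constexpr std::uint32_t offset_add_scale(std::uint32_t index) {
    return (index + index) * 8;
}

// Hypothetical helper: asserts that both forms agree for every index
// below `limit`. Unsigned arithmetic wraps modulo 2^32 in both forms,
// so they also agree when the result overflows 32 bits.
inline void check_offsets(std::uint32_t limit) {
    for (std::uint32_t i = 0; i < limit; ++i) {
        assert(offset_shift(i) == offset_add_scale(i));
    }
}

// Compile-time spot check: index 3 gives a byte offset of 48 either way.
static_assert(offset_shift(3) == 48, "shift form");
static_assert(offset_add_scale(3) == 48, "add + scale form");
```

Calling `check_offsets(1000)` exercises the first thousand indices. The sketch only shows that the two forms are interchangeable; which one is preferable on a given CPU is the question the thread is discussing.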

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread tbptbp at gmail dot com
--- Additional Comments From tbptbp at gmail dot com 2005-01-30 13:37 --- But i had to rewrite the hit_t structure in a way closer to what's found in the original source to avoid the same useless cloning i noted earlier with gcc. Something like: union float4_t { float f[4]

[Bug rtl-optimization/19680] sub-optimial register allocation with sse

2005-01-30 Thread rth at gcc dot gnu dot org
--- Additional Comments From rth at gcc dot gnu dot org 2005-01-30 10:59 --- Ah hah. This is a bit of "cleverness" in the backend. It turns out that for K8, imul with an 8-bit immediate is vector decoded, and imul with a register is direct decoded. In theory, splitting out the constan