On 05/17/2011 11:31 AM, Loren Merritt wrote:
> Use 16 xmmregs instead of spills, and transpose in pass5.
> 125->104 cycles on penryn x86_64.
> (But take the numbers with some salt: it's sensitive to code alignment
> (and was before the patch too).)
> Doesn't touch avx; I don't know if the same strategy would help there.

I like this idea. Unfortunately, for AVX it doesn't help, since everything fits in the 8 bigger registers.

> I modified the x86_32 version too, but it doesn't get any speedup.
> Mine is more regular than the giant list of unstructured scalar math
> in PASS6_AND_PERMUTE; if this method can be applied to avx (and thus
> remove PASS6_AND_PERMUTE) then that's a simplification, but if it
> can't then the extra version is a complication and should be reverted.

I'll take a look at it later (I don't think this should block my patch or your first one), but I'm afraid that the lack of lane-crossing permutes might make this more expensive on AVX.
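
To make the lane concern concrete, here is a rough scalar C model (not code from either patch; the function names are made up for illustration). It assumes the documented behaviour of the 256-bit forms of vunpcklps/vunpckhps, which interleave within each 128-bit lane separately, and of vperm2f128 with immediate 0x20, which is one of the few AVX operations that moves whole 128-bit lanes around.

```c
/* Scalar model of 256-bit AVX float shuffles (illustrative names only). */

/* vunpcklps ymm: interleave the low two floats of a and b, per 128-bit lane. */
static void model_unpcklps_256(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 4;
        out[base + 0] = a[base + 0];
        out[base + 1] = b[base + 0];
        out[base + 2] = a[base + 1];
        out[base + 3] = b[base + 1];
    }
}

/* vunpckhps ymm: interleave the high two floats of a and b, per 128-bit lane. */
static void model_unpckhps_256(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int base = lane * 4;
        out[base + 0] = a[base + 2];
        out[base + 1] = b[base + 2];
        out[base + 2] = a[base + 3];
        out[base + 3] = b[base + 3];
    }
}

/* vperm2f128 with imm8 = 0x20: low lane of a, then low lane of b. */
static void model_perm2f128_0x20(const float a[8], const float b[8], float out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = a[i];
        out[4 + i] = b[i];
    }
}
```

With a = {a0..a7} and b = {b0..b7}, the unpcklps model gives {a0,b0,a1,b1, a4,b4,a5,b5} rather than a contiguous interleave. Getting {a0,b0,a1,b1,a2,b2,a3,b3} takes unpcklps, unpckhps and then a 128-bit lane permute of the two results, which is the kind of extra cost a transpose would have to absorb.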

> From 701f40aef4de4c001f619db20fecddaf8d1348af Mon Sep 17 00:00:00 2001
> From: Loren Merritt <[email protected]>
> Date: Tue, 17 May 2011 08:51:10 +0000
> Subject: [PATCH 1/2] s/xmm/m/

Squashed into my patch. I'm not really happy about the way I misuse the INIT_XMM macro, since it resets the permutations; suggestions are welcome.

-Vitor
