Re: [libav-devel] [PATCH 3/3] dct32: Add AVX implementation of 32-point DCT

Vitor Sessak Tue, 17 May 2011 13:28:02 -0700

On 05/17/2011 11:31 AM, Loren Merritt wrote:

Use 16 xmmregs instead of spills, and transpose in pass5.
125->104 cycles on penryn x86_64. (But take the numbers with some salt:
it's sensitive to code alignment (and was before the patch too).)
Doesn't touch avx; I don't know if the same strategy would help there.

I like this idea. Unfortunately, for AVX it doesn't help, since everything fits in the 8 bigger registers.

I modified the x86_32 version too, but it doesn't get any speedup. Mine
is more regular than the giant list of unstructured scalar math in
PASS6_AND_PERMUTE; if this method can be applied to avx (and thus remove
PASS6_AND_PERMUTE) then that's a simplification, but if it can't then
the extra version is a complication and should be reverted.

I'll take a look at it later (I don't think this should block mine or your first patch), but I'm afraid that the lack of lane-crossing permutes might make this more expensive in AVX.
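
To illustrate the lane-crossing concern, here is a rough scalar model in C of two AVX float instructions as I understand them: the 256-bit VSHUFPS, which shuffles each 128-bit lane independently, and VPERM2F128, which only moves whole 128-bit lanes. The function names model_vshufps256 and model_vperm2f128 are made up for this sketch and are not part of the patch or of any library. It only models data movement, not cost.

```c
#include <string.h>

/* Hypothetical scalar model of 256-bit VSHUFPS (8 floats, two 128-bit lanes).
 * The same imm8 is applied to each lane separately, so an element can never
 * move from the low lane to the high lane or back. */
static void model_vshufps256(float dst[8], const float a[8],
                             const float b[8], unsigned imm)
{
    float tmp[8];
    for (int lane = 0; lane < 2; lane++) {
        const float *la = a + 4 * lane;   /* this lane of source a */
        const float *lb = b + 4 * lane;   /* this lane of source b */
        tmp[4 * lane + 0] = la[(imm >> 0) & 3];
        tmp[4 * lane + 1] = la[(imm >> 2) & 3];
        tmp[4 * lane + 2] = lb[(imm >> 4) & 3];
        tmp[4 * lane + 3] = lb[(imm >> 6) & 3];
    }
    memcpy(dst, tmp, sizeof(tmp));
}

/* Hypothetical scalar model of VPERM2F128: each half of the result is one
 * whole 128-bit lane picked from {a.lo, a.hi, b.lo, b.hi}, or zero if bit 3
 * of that half's control nibble is set. */
static void model_vperm2f128(float dst[8], const float a[8],
                             const float b[8], unsigned imm)
{
    const float *src[4] = { a, a + 4, b, b + 4 };
    float tmp[8];
    for (int half = 0; half < 2; half++) {
        unsigned ctl = (imm >> (4 * half)) & 0xF;
        for (int i = 0; i < 4; i++)
            tmp[4 * half + i] = (ctl & 8) ? 0.0f : src[ctl & 3][i];
    }
    memcpy(dst, tmp, sizeof(tmp));
}
```

In this model, reversing all 8 floats of a register takes an in-lane shuffle (imm 0x1B with both sources equal) followed by a lane swap (imm 0x01), whereas the 4-float SSE case needs only the single shuffle. That extra step is the kind of overhead I'd expect a transpose to run into on AVX.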


From 701f40aef4de4c001f619db20fecddaf8d1348af Mon Sep 17 00:00:00 2001
From: Loren Merritt <[email protected]>
Date: Tue, 17 May 2011 08:51:10 +0000
Subject: [PATCH 1/2] s/xmm/m/

Squashed into my patch. I'm not really happy about the way I misuse the INIT_XMM macro, as it resets the permutations; suggestions are welcome.


-Vitor
