Hi,

On 17 June 2014 18:04, Peter Frühberger <[email protected]> wrote:
> We don't support broken wrappers, that are not maintained since
> several years. We support vdpau for amd oss and nvidia and use vaapi
> for intel. We had implemented XVBA for AMD a while back, but that code
> died of constant no support.

We (MythTV) haven't implemented XVBA, only VAAPI and VDPAU.
AMD OSS's VDPAU is actually pretty good now, almost as good as NVIDIA's.
With AMD's closed-source drivers, VAAPI is as good as it gets. On my AMD 6970, however, all you get is VC1 and H264 decoding.

>> We always get back to the problem I mentioned in my first email.
>> Unfortunately, there's not a generic solution that can be adapted.
>> If memory used is USWC, you must use SSE4, if not, you certainly don't
>> want to use SSE4 and a buffer
>
> Yes, I see that problem and I find all methods that we currently have
> quite suboptimal. If you see how for example nvidia does it with their
> glinterop, that even mesa implements. I think the proposed API changes
> here go into a similar direction. I hope that the lot of "sync",
> "locks" and so in there that I see in the patches won't make things
> too slow or even slow down multithreaded approaches (decoder + vpp +
> output in different threads), but we will see.

Here are my attempts and results so far:
https://github.com/MythTV/mythtv/blob/master/mythtv/libs/libmythtv/mythframe.cpp#L586

There are 4 primary routines implemented:

For a plain YV12 frame copy:
SSE_copyplane: very similar to the routine in Intel's whitepaper, with various optimisations added. It's a tad faster than their example (and obviously than XBMC's, seeing as it's the same code). It makes use of a 64-byte-aligned, 4kB buffer.

For deinterleaving the U/V channels in an NV12->YV12 conversion:
SSE_splitplanes (with buffer): as above, it makes use of the buffer.

Both of those routines copy using movntdqa, and work extremely well with USWC memory.
SSE_splitplanes (without buffer): an SSE3-optimised routine that deinterleaves the U/V channels and works directly between the source and destination frames, regardless of their memory alignment (16-byte aligned or not).

copyplane: a plain C implementation, using memcpy.

My findings are as follows (i7-4650U with HD5000).
Convert 2000 h264 frames, extract the image with either vaDeriveImage or vaGetImage, and measure the conversion from either NV12->YV12 or plain YV12->YV12 (within VLC playback).

NV12->YV12

If memory is USWC:
1- One call to SSE_copyplane + one to SSE_splitplanes (with buffer): 2.07ms per 1080 frame
2- One call to C copyplane + one to SSE_splitplanes (without buffer): 10.96ms per 1080 frame

If memory isn't USWC:
1- One call to SSE_copyplane + one to SSE_splitplanes (with buffer): 1.05ms per 1080 frame
2- One call to SSE_copyplane + one to SSE_splitplanes (without buffer): 0.97ms per 1080 frame
3- One call to C copyplane + one to SSE_splitplanes (without buffer): 0.96ms per 1080 frame

I can't give a USWC comparison for a simple YV12->YV12 frame copy, seeing as I can't get USWC-mapped memory in that case.

YV12->YV12

If memory isn't USWC:
1- Three calls to SSE_copyplane: 0.94ms per 1080 frame
2- Three calls to C copyplane: 0.94ms per 1080 frame

Running those tests made me realise I could gain some speed with an SSE_copyplane variant that doesn't use any buffer but still uses SSE4. I had written that routine before, but discarded it after comparing the original SSE_copyplane with the C version; I didn't think of comparing C and that routine...

With the new SSE copy routine I get:

Non-USWC memory:

NV12->YV12
4- One call to SSE_copyplane (without buffer) + one to SSE_splitplanes (without buffer): 0.80ms

YV12->YV12
3- Three calls to SSE_copyplane (without buffer): 0.68ms

Conclusion, if speed is the main concern:

Use YV12 whenever possible with vaGetImage.
If memory is USWC, use the SSE4 code, via a 4kB buffer.
If memory isn't USWC, use SSE4/movntdqa (if the line is aligned) or SSE2/movdqu if it isn't aligned; don't bother with a buffer.

So I'm still keen on getting a reliable way of knowing which type of memory we're using... though my method of simply checking the running speed first may well be the easiest approach.

_______________________________________________
Libva mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/libva
