On 22 July 2017 at 12:18, Ivan Kalvachev <[email protected]> wrote:
> This patch is ready for review and inclusion.
>
>
>
>+%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>+%macro lea_pic_base 2; reg, const_label
Capitalize macro names.
>+ %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
Remove this error
>+ %error "blendvps" emulation needs SSE
Definitely remove this error too.
>+ %error AVX1 does not support integer 256bit ymm operations
And this one is particularly pointless...
>+ %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
Same...
>+ %error "broadcastss" emulation needs SSE
Same...
>+ %error "pbroadcastd" emulation needs SSE2
Yep, the same...
>+ %error pick HSUMPS implementation
Again...
All of these are pointless and unnecessary. Always assume at least SSE2.
>+
>+; 256bit version seems slower than the 128bit AVX on all current CPU's
>+; So disable it for now and keep the code in case something changes in
future.
>+%if HAVE_AVX2_EXTERNAL > 0 && 0
>+INIT_YMM avx2
>+PVQ_FAST_SEARCH
>+%endif
Remove this altogether.
>+%if 1
>+ emu_pbroadcastd m1, xmm1
...
>+%else
>+ ; Ryzen is the only CPU at the moment that doesn't have
>+ ; write forwarding stall and where this code is faster
>+ ; with about 7 cycles on avarage.
>+ %{1}ps m5, mm_const_float_1
>+ movss [tmpY + r4q], xmm5 ; Y[max_idx] +-= 1.0;
Remove the %else and always use the first bit of code.
>+%if cpuflag(avx2) && 01
>+%elif cpuflag(avx) && 01
>+%if cpuflag(avx2) && 01
Remove the && 01 in these 3 places.
>+; vperm2f128 %1, %1, %1, 0 ; ymm, ymm, ymm ; 2-3c 1/1
Remove this commented out code.
>+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>+%if ARCH_X86_64 == 0 ;sbrdsp
You're using more than 11 xmm regs, so the entire code is always going to
need a 64 bit system.
So wrap the entire file, from top to bottom after the %include in
%if ARCH_X86_64
<everything>
%endif
>+SECTION_RODATA 64
>+
>+const_float_abs_mask: times 8 dd 0x7fffffff
>+const_align_abs_edge: times 8 dd 0
>+
>+const_float_0_5: times 8 dd 0.5
>+const_float_1: times 8 dd 1.0
>+const_float_sign_mask: times 8 dd 0x80000000
>+
>+const_int32_offsets:
>+ %rep 8
>+ dd $-const_int32_offsets
>+ %endrep
>+SECTION .text
This whole thing needs to go right at the very top after the %include. All
macros must then follow it.
Having read only sections in the middle of the file is very confusing and
not at all what all of our asm code follows.
_______________________________________________
ffmpeg-devel mailing list
[email protected]
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel