[FFmpeg-devel] [PATCH] feature/classify_neon (PR #20377)

2025-08-31 Thread george.zaguri via ffmpeg-devel
PR #20377 opened by george.zaguri
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20377
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20377.patch

NEON optimisations of alf_classify, with fixes to improve performance on Mac
and changes addressing the review comments on the patch.

RPi4:


Apple M2 (MacBook Air):
vvc_alf_classify_8x8_8_c:            2.6 ( 1.00x)
vvc_alf_classify_8x8_8_neon:         1.2 ( 2.06x)
vvc_alf_classify_8x8_10_c:           2.7 ( 1.00x)
vvc_alf_classify_8x8_10_neon:        1.1 ( 2.41x)
vvc_alf_classify_8x8_12_c:           2.8 ( 1.00x)
vvc_alf_classify_8x8_12_neon:        1.1 ( 2.48x)
vvc_alf_classify_16x16_8_c:          7.2 ( 1.00x)
vvc_alf_classify_16x16_8_neon:       3.4 ( 2.09x)
vvc_alf_classify_16x16_10_c:         4.3 ( 1.00x)
vvc_alf_classify_16x16_10_neon:      3.1 ( 1.38x)
vvc_alf_classify_16x16_12_c:         4.4 ( 1.00x)
vvc_alf_classify_16x16_12_neon:      3.2 ( 1.40x)
vvc_alf_classify_32x32_8_c:         13.6 ( 1.00x)
vvc_alf_classify_32x32_8_neon:      10.6 ( 1.29x)
vvc_alf_classify_32x32_10_c:        12.1 ( 1.00x)
vvc_alf_classify_32x32_10_neon:      9.6 ( 1.26x)
vvc_alf_classify_32x32_12_c:        12.3 ( 1.00x)
vvc_alf_classify_32x32_12_neon:      9.6 ( 1.28x)
vvc_alf_classify_64x64_8_c:         44.0 ( 1.00x)
vvc_alf_classify_64x64_8_neon:      38.6 ( 1.14x)
vvc_alf_classify_64x64_10_c:        41.0 ( 1.00x)
vvc_alf_classify_64x64_10_neon:     35.0 ( 1.17x)
vvc_alf_classify_64x64_12_c:        41.7 ( 1.00x)
vvc_alf_classify_64x64_12_neon:     34.9 ( 1.20x)
vvc_alf_classify_128x128_8_c:      157.8 ( 1.00x)
vvc_alf_classify_128x128_8_neon:   147.2 ( 1.07x)
vvc_alf_classify_128x128_10_c:     150.4 ( 1.00x)
vvc_alf_classify_128x128_10_neon:  131.6 ( 1.14x)
vvc_alf_classify_128x128_12_c:     150.0 ( 1.00x)
vvc_alf_classify_128x128_12_neon:  130.6 ( 1.15x)
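
For anyone who wants to compare runs across machines, the figures above can be parsed with a few lines of Python. This is only a rough sketch: the helper names (parse_bench, speedups) are hypothetical and not part of FFmpeg or checkasm, and it assumes each line has the "name_impl: time ( ratio)" shape shown above. Note that a ratio recomputed from the printed timings can differ slightly from the printed ratio (2.6 / 1.2 gives about 2.17 against the reported 2.06x), presumably because the printed timings are rounded.

```python
import re

# One benchmark line, e.g. "vvc_alf_classify_8x8_8_neon: 1.2 ( 2.06x)".
# The layout is assumed from the output quoted in this mail.
BENCH_LINE = re.compile(r"^(\w+)_(c|neon):\s*([\d.]+)\s*\(\s*([\d.]+)x\)$")


def parse_bench(text):
    """Return {test_name: {"c": time, "neon": time}} for matching lines."""
    results = {}
    for line in text.splitlines():
        m = BENCH_LINE.match(line.strip())
        if m:
            name, impl, time, _reported_ratio = m.groups()
            results.setdefault(name, {})[impl] = float(time)
    return results


def speedups(results):
    """C time divided by NEON time, for tests that have both timings."""
    return {
        name: r["c"] / r["neon"]
        for name, r in results.items()
        if "c" in r and "neon" in r
    }


sample = """\
vvc_alf_classify_8x8_8_c:2.6 ( 1.00x)
vvc_alf_classify_8x8_8_neon: 1.2 ( 2.06x)
vvc_alf_classify_128x128_12_c: 150.0 ( 1.00x)
vvc_alf_classify_128x128_12_neon:  130.6 ( 1.15x)
"""

for name, ratio in speedups(parse_bench(sample)).items():
    print(f"{name}: {ratio:.2f}x")
```

Tests that only have a C timing are skipped by speedups(), since no ratio can be formed for them.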


From 8b279086db3eb4d1c680be706756f57ca926e0b2 Mon Sep 17 00:00:00 2001
From: Georgii Zagoruiko 
Date: Tue, 8 Jul 2025 23:52:18 +0400
Subject: [PATCH 1/3] avcodec/aarch64/vvc: optimised alf_classify function
 8/10/12bit of vvc codec for aarch64

 - vvc_alf.alf_classify  [OK]
vvc_alf_classify_8x8_8_c:            1314.4 ( 1.00x)
vvc_alf_classify_8x8_8_neon:          794.3 ( 1.65x)
vvc_alf_classify_8x8_10_c:           1154.7 ( 1.00x)
vvc_alf_classify_8x8_10_neon:         770.0 ( 1.50x)
vvc_alf_classify_8x8_12_c:           1091.7 ( 1.00x)
vvc_alf_classify_8x8_12_neon:         770.7 ( 1.42x)
vvc_alf_classify_16x16_8_c:          3710.0 ( 1.00x)
vvc_alf_classify_16x16_8_neon:       2205.6 ( 1.68x)
vvc_alf_classify_16x16_10_c:         3306.2 ( 1.00x)
vvc_alf_classify_16x16_10_neon:      2087.9 ( 1.58x)
vvc_alf_classify_16x16_12_c:         3307.9 ( 1.00x)
vvc_alf_classify_16x16_12_neon:      2089.6 ( 1.58x)
vvc_alf_classify_32x32_8_c:         12770.2 ( 1.00x)
vvc_alf_classify_32x32_8_neon:       7124.6 ( 1.79x)
vvc_alf_classify_32x32_10_c:        11780.3 ( 1.00x)
vvc_alf_classify_32x32_10_neon:      6856.7 ( 1.72x)
vvc_alf_classify_32x32_12_c:        11779.2 ( 1.00x)
vvc_alf_classify_32x32_12_neon:      7002.8 ( 1.68x)
vvc_alf_classify_64x64_8_c:         49332.3 ( 1.00x)
vvc_alf_classify_64x64_8_neon:      26040.4 ( 1.89x)
vvc_alf_classify_64x64_10_c:        45353.7 ( 1.00x)
vvc_alf_classify_64x64_10_neon:     26251.5 ( 1.73x)
vvc_alf_classify_64x64_12_c:        44876.9 ( 1.00x)
vvc_alf_classify_64x64_12_neon:     26491.3 ( 1.69x)
vvc_alf_classify_128x128_8_c:      191953.5 ( 1.00x)
vvc_alf_classify_128x128_8_neon:    96166.3 ( 2.00x)
vvc_alf_classify_128x128_10_c:     177198.5 ( 1.00x)
vvc_alf_classify_128x128_10_neon:   96077.9 ( 1.84x)
vvc_alf_classify_128x128_12_c:     177461.1 ( 1.00x)
vvc_alf_classify_128x128_12_neon:   96184.4 ( 1.85x)
---
 libavcodec/aarch64/vvc/alf.S  | 278 ++
 libavcodec/aarch64/vvc/alf_template.c |  87 
 libavcodec/aarch64/vvc/dsp_init.c |   6 +
 3 files changed, 371 insertions(+)

diff --git a/li

[FFmpeg-devel] [PATCH] aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit (PR #20737)

2025-10-22 Thread george.zaguri via ffmpeg-devel
PR #20737 opened by george.zaguri
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20737
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20737.patch

RPi4:
put_luma_h_10_4x4_c:            261.8 ( 1.00x)
put_luma_h_10_8x8_c:           1051.5 ( 1.00x)
put_luma_h_10_8x8_neon:         231.5 ( 4.54x)
put_luma_h_10_16x16_c:         4131.0 ( 1.00x)
put_luma_h_10_16x16_neon:       848.6 ( 4.87x)
put_luma_h_10_32x32_c:        16469.5 ( 1.00x)
put_luma_h_10_32x32_neon:      3345.6 ( 4.92x)
put_luma_h_10_64x64_c:        66734.0 ( 1.00x)
put_luma_h_10_64x64_neon:     14586.9 ( 4.57x)
put_luma_h_10_128x128_c:     264228.9 ( 1.00x)
put_luma_h_10_128x128_neon:   52199.7 ( 5.06x)
put_luma_h_12_4x4_c:            262.1 ( 1.00x)
put_luma_h_12_8x8_c:           1051.3 ( 1.00x)
put_luma_h_12_8x8_neon:         230.9 ( 4.55x)
put_luma_h_12_16x16_c:         4124.4 ( 1.00x)
put_luma_h_12_16x16_neon:       848.0 ( 4.86x)
put_luma_h_12_32x32_c:        16446.9 ( 1.00x)
put_luma_h_12_32x32_neon:      3347.4 ( 4.91x)
put_luma_h_12_64x64_c:        66770.1 ( 1.00x)
put_luma_h_12_64x64_neon:     14360.2 ( 4.65x)
put_luma_h_12_128x128_c:     264419.5 ( 1.00x)
put_luma_h_12_128x128_neon:   52200.6 ( 5.07x)

M2 Air (with auto-vectorization feature):
put_luma_h_10_4x4_c:           0.3 ( 1.00x)
put_luma_h_10_8x8_c:           1.0 ( 1.00x)
put_luma_h_10_8x8_neon:        0.4 ( 2.58x)
put_luma_h_10_16x16_c:         3.0 ( 1.00x)
put_luma_h_10_16x16_neon:      1.5 ( 2.01x)
put_luma_h_10_32x32_c:         9.7 ( 1.00x)
put_luma_h_10_32x32_neon:      6.2 ( 1.57x)
put_luma_h_10_64x64_c:        36.6 ( 1.00x)
put_luma_h_10_64x64_neon:     23.9 ( 1.53x)
put_luma_h_10_128x128_c:     134.2 ( 1.00x)
put_luma_h_10_128x128_neon:   95.4 ( 1.41x)
put_luma_h_12_4x4_c:           0.3 ( 1.00x)
put_luma_h_12_8x8_c:           1.0 ( 1.00x)
put_luma_h_12_8x8_neon:        0.4 ( 2.57x)
put_luma_h_12_16x16_c:         3.0 ( 1.00x)
put_luma_h_12_16x16_neon:      1.5 ( 2.01x)
put_luma_h_12_32x32_c:         9.7 ( 1.00x)
put_luma_h_12_32x32_neon:      6.0 ( 1.63x)
put_luma_h_12_64x64_c:        36.5 ( 1.00x)
put_luma_h_12_64x64_neon:     23.9 ( 1.53x)
put_luma_h_12_128x128_c:     134.8 ( 1.00x)
put_luma_h_12_128x128_neon:   95.2 ( 1.42x)


From dba0d5709658f01e40496d1f1fc8a1832e21b708 Mon Sep 17 00:00:00 2001
From: Georgii Zagoruiko 
Date: Wed, 22 Oct 2025 19:22:23 +0100
Subject: [PATCH] aarch64/vvc: Optimisations of put_luma_h() functions for
 10/12-bit

RPi4:
put_luma_h_10_4x4_c:            261.8 ( 1.00x)
put_luma_h_10_8x8_c:           1051.5 ( 1.00x)
put_luma_h_10_8x8_neon:         231.5 ( 4.54x)
put_luma_h_10_16x16_c:         4131.0 ( 1.00x)
put_luma_h_10_16x16_neon:       848.6 ( 4.87x)
put_luma_h_10_32x32_c:        16469.5 ( 1.00x)
put_luma_h_10_32x32_neon:      3345.6 ( 4.92x)
put_luma_h_10_64x64_c:        66734.0 ( 1.00x)
put_luma_h_10_64x64_neon:     14586.9 ( 4.57x)
put_luma_h_10_128x128_c:     264228.9 ( 1.00x)
put_luma_h_10_128x128_neon:   52199.7 ( 5.06x)
put_luma_h_12_4x4_c:            262.1 ( 1.00x)
put_luma_h_12_8x8_c:           1051.3 ( 1.00x)
put_luma_h_12_8x8_neon:         230.9 ( 4.55x)
put_luma_h_12_16x16_c:         4124.4 ( 1.00x)
put_luma_h_12_16x16_neon:       848.0 ( 4.86x)
put_luma_h_12_32x32_c:        16446.9 ( 1.00x)
put_luma_h_12_32x32_neon:      3347.4 ( 4.91x)
put_luma_h_12_64x64_c:        66770.1 ( 1.00x)
put_luma_h_12_64x64_neon:     14360.2 ( 4.65x)
put_luma_h_12_128x128_c:     264419.5 ( 1.00x)
put_luma_h_12_