On 10 July 2012 14:57, Ramana Radhakrishnan wrote:
> On 6 July 2012 16:52, Mans Rullgard wrote:
>> I ran my usual set of benchmarks of libav compiled with the current gcc
>> releases (hand-written assembly disabled). The results are in this
>> spreadsheet:
>> https://docs.google.com/spreadsheet/ccc?key=0AguHvNGaLXy9dHExeWZ1YWZ1c0s2VnpJRkl2bVRPU2c
>>
>> First the good news: almost everything is faster with 4.6+ than with
>> linaro-4.5.
>>
>> The bad news is that some things have regressed since 4.6, even if not
>> all the way back to 4.5 levels. A few especially problematic pieces
>> stand out:
>>
>> - The mp3 test performs 5-15% worse. This regression is (mostly)
>> attributable to the ff_mpadsp_apply_window_fixed [1] function.
>> We have looked at this one before.
>>
>> - FLAC is 9% slower in upstream 4.7/4.8 compared to Linaro releases.
>> Here flac_lpc_16_c [2] and flac_decorrelate_indep_c_16 [3] are
>> mainly to blame.
>>
>
> Looking at this in the middle of the summit: for the vectorized case
> of flac_lpc_16_c, could you take a look with perf and say which part
> is hot?
>
> Is it the top-level nested loop over i and j, or is it the loop that
> does a summation when i < len?
>
> The non-vectorized case looks interesting because it might be
> fallout from sched-pressure.
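
For readers without the source at hand, the two regions being asked about can be sketched roughly as follows. This is a simplified stand-alone illustration, not the libav source: the name lpc_sketch is hypothetical, the signature and the outer loop header are copied from the annotate output, and the loop bodies are an assumption about how a two-samples-per-iteration LPC predictor is usually written.

```c
#include <stdint.h>

/* Hypothetical stand-in for flac_lpc_16_c; the loop bodies are assumed. */
static void lpc_sketch(int32_t *decoded, const int coeffs[32],
                       int pred_order, int qlevel, int len)
{
    int i, j;

    /* Region 1: nested loop over i and j, two output samples per pass. */
    for (i = pred_order; i < len - 1; i += 2) {
        int c;
        int d = decoded[i - pred_order];
        int s0 = 0, s1 = 0;

        for (j = pred_order - 1; j > 0; j--) {
            c   = coeffs[j];
            s0 += c * d;            /* prediction for sample i     */
            d   = decoded[i - j];
            s1 += c * d;            /* prediction for sample i + 1 */
        }
        c   = coeffs[0];
        s0 += c * d;
        d   = decoded[i] += s0 >> qlevel;
        s1 += c * d;                /* uses the freshly updated sample i */
        decoded[i + 1] += s1 >> qlevel;
    }

    /* Region 2: tail summation, reached only when one sample is left over. */
    if (i < len) {
        int sum = 0;

        for (j = 0; j < pred_order; j++)
            sum += coeffs[j] * decoded[i - j - 1];
        decoded[i] += sum >> qlevel;
    }
}
```

In this sketch the tail runs at most once per call, so for long blocks nearly all of the work is in the nested loop.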

Here's the perf annotate output for that function from 4.8 trunk with
vectorisation enabled:

Percent | Source code & Disassembly of avconv
:
:
:
: Disassembly of section .text:
:
: 002aa55c <flac_lpc_16_c>:
: #define SAMPLE_SIZE 32
: #include "flacdsp_template.c"
:
: static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32],
: int pred_order, int qlevel, int len)
: {
0.02 :2aa55c: push {r4, r5, r6, r7, r8, r9, sl, fp}
0.00 :2aa560: sub sp, sp, #80 ; 0x50
0.00 :2aa564: str r0, [sp, #68] ; 0x44
: int i, j;
:
: for (i = pred_order; i < len - 1; i += 2) {
0.00 :2aa568: ldr r0, [sp, #112] ; 0x70
: #define SAMPLE_SIZE 32
: #include "flacdsp_template.c"
:
: static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32],
: int pred_order, int qlevel, int len)
: {
0.00 :2aa56c: str r2, [sp, #60] ; 0x3c
0.00 :2aa570: str r1, [sp, #52] ; 0x34
: int i, j;
:
: for (i = pred_order; i < len - 1; i += 2) {
0.00 :2aa574: sub r0, r0, #1
: #define SAMPLE_SIZE 32
: #include "flacdsp_template.c"
:
: static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32],
: int pred_order, int qlevel, int len)
: {
0.00 :2aa578: str r3, [sp, #56] ; 0x38
: int i, j;
:
: for (i = pred_order; i < len - 1; i += 2) {
0.00 :2aa57c: cmp r2, r0
0.00 :2aa580: str r0, [sp, #72] ; 0x48
0.00 :2aa584: bge 2aa93c
:
: #undef SAMPLE_SIZE
: #define SAMPLE_SIZE 32
: #include "flacdsp_template.c"
:
: static void flac_lpc_16_c(int32_t *decoded, const int coeffs[32],
0.00 :2aa588: add r3, r2, #4
0.00 :2aa58c: mov r8, r2
0.00 :2aa590: lsl r3, r3, #2
0.00 :2aa594: sub r2, r2, #10
0.00 :2aa598: ldr sl, [sp, #68] ; 0x44
0.00 :2aa59c: bic r2, r2, #7
0.00 :2aa5a0: ldr ip, [sp, #68] ; 0x44
0.00 :2aa5a4: mov r0, r1
0.00 :2aa5a8: rsb r2, r2, r8
0.00 :2aa5ac: sub r1, r3, #16
0.00 :2aa5b0: sub r9, r8, #1
0.00 :2aa5b4: add sl, sl, #16
0.00 :2aa5b8: add r3, ip, r3
0.00 :2aa5bc: add r1, r0, r1
0.00 :2aa5c0: sub r2, r2, #9
0.00 :2aa5c4: str r9, [sp, #64] ; 0x40
0.00 :2aa5c8: str sl, [sp, #44] ; 0x2c
0.00 :2aa5cc: str r3, [sp, #36] ; 0x24
0.00 :2aa5d0: str r1, [sp, #76] ; 0x4c
0.00 :2aa5d4: str r2, [sp, #40] ; 0x28
0.00 :2aa5d8: str r8, [sp, #48] ; 0x30
:
: for (i = pred_order; i < len - 1; i += 2) {
: int c;
: int d = decoded[i-pred_order];
: