On 2020/11/16 23:56, Dave Martin wrote:
>> --8<--
>> ...
>> adler_A .req x10
>> adler_B .req x11
>>
>> .macro adler32_core
>> ld1b zX.h, p0/z, [x1] // load bytes
>> inch x1
>>
>> uaddv d0, p0, zX.h
>> mul zX.h, p0/m, zX.h, zJ.h // Sum [j=0 .. v-1] j*X[j+n]
>> mov x9, v0.d[0]
>> uaddv d1, p0, zX.h
>> add adler_A, adler_A, x9 // A[n+v] = An + Sum [j=0 .. v-1] X[j+n]
>> mov x9, v1.d[0]
>> madd adler_B, x7, adler_A, adler_B // Bn + v*A[n+v]
>> sub adler_B, adler_B, x9 // B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n]
>> .endm
> If this has best performance, I find that quite surprising. Those uaddv
> instructions will stop the vector lanes flowing independently inside the
> loop, so if an individual element load is slow arriving then everything
> will have to wait.
I don't know much about this issue. Do you mean that the uaddv
instructions used inside the loop have a large impact on performance?
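For reference, here is how the update in the macro's comments can be
derived (just a restatement, in the same notation as the code comments).
With A[n] = 1 + Sum [i=0 .. n-1] X[i] and B[n] = Sum [i=1 .. n] A[i]
(both mod 65521), processing v more bytes gives:

  A[n+v] = A[n] + Sum [j=0 .. v-1] X[n+j]
  B[n+v] = B[n] + Sum [j=1 .. v] A[n+j]
         = B[n] + v*A[n] + Sum [j=0 .. v-1] (v-j)*X[n+j]
         = B[n] + v*A[n+v] - Sum [j=0 .. v-1] j*X[n+j]

so each loop iteration only needs a plain sum and a j-weighted sum of
the loaded bytes, which is what the two uaddv reductions produce.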
>
> A decent hardware prefetcher may tend to hide that issue for sequential
> memory access, though: i.e., if the hardware does a decent job of
> fetching data before the actual loads are issued, the data may appear to
> arrive with minimal delay.
>
> The effect might be a lot worse for algorithms that have less
> predictable memory access patterns.
>
> Possibly you do win some additional performance due to processing twice
> as many elements at once, here.
I think so. Compared with loading the bytes directly into zX.h, loading
them into zX.b and then widening with uunpklo/uunpkhi performs better
(20% faster). This may be the reason. A rough sketch of that variant is
below.
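This is an untested sketch only, using the same register conventions as
the macro above; the macro name adler32_core_unpack and the scratch
register zY are my own placeholders, and the predicate setup for byte
lanes (p0 must govern .b elements here) is elided:

.macro adler32_core_unpack
	ld1b	zX.b, p0/z, [x1]	// load a full vector of bytes
	incb	x1			// advance by the bytes consumed

	uunpklo	zY.h, zX.b		// widen low half to 16-bit lanes
	uunpkhi	zX.h, zX.b		// widen high half to 16-bit lanes
	// ... then accumulate both halves as in adler32_core, with the
	// j-weights for the high half offset by the low half's length.
.endm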
--
Best regards,
Li Qiang