> 
> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
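
For readers following along, the consolidation described above can be
sketched roughly as below. This is only an illustration, not the code from
the patch: copy_sketch(), VEC_ALIGN_MASK, and the last_path bookkeeping are
hypothetical names, and plain memcpy() stands in for the real vectorized
paths. The point is only that sizes of up to 64 bytes return before the
alignment test is evaluated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed alignment mask for a 32-byte vector register; illustrative only. */
#define VEC_ALIGN_MASK 0x1F

/* Records which path the last call took, only to make the sketch observable. */
enum copy_path { PATH_SMALL, PATH_ALIGNED, PATH_UNALIGNED };
static enum copy_path last_path;

/* Hypothetical dispatcher; memcpy() stands in for the real copy routines. */
static inline void *
copy_sketch(void *restrict dst, const void *restrict src, size_t n)
{
	assert(dst != NULL && src != NULL);

	/* Small sizes are handled once, before any alignment test. */
	if (n <= 64) {
		last_path = PATH_SMALL;
		memcpy(dst, src, n);
		return dst;
	}

	/* Only larger copies pay for the alignment check. */
	if ((((uintptr_t)dst | (uintptr_t)src) & VEC_ALIGN_MASK) == 0)
		last_path = PATH_ALIGNED;
	else
		last_path = PATH_UNALIGNED;

	memcpy(dst, src, n);
	return dst;
}
```

With this shape, a 64-byte copy takes the same short path whether or not
its addresses are aligned.
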
> 
> Furthermore, the function for copying fewer than 16 bytes was replaced with
> a smarter implementation using fewer branches and potentially fewer
> load/store operations.
> This function was also extended to handle copying of up to 16 bytes,
> instead of up to 15 bytes.
> This small extension shortens the code path, and thus improves
> performance, for copying two pointers on 64-bit architectures and four
> pointers on 32-bit architectures.
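
One common way to get fewer branches and fewer load/store operations for
such small sizes is to copy the first and last chunk of a size class and
let the two overlap. The sketch below shows that general technique; it is
an assumption about the approach, not the function from the patch, and
copy_up_to_16() is a hypothetical name. Fixed-size memcpy() calls are used
for the unaligned loads and stores, which compilers typically turn into
single instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the DPDK function: copy n bytes, 0 <= n <= 16. */
static inline void
copy_up_to_16(void *restrict dst, const void *restrict src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(n <= 16);

	if (n >= 8) {
		/* 8..16 bytes: first 8 and last 8, overlapping when n < 16. */
		uint64_t head, tail;
		memcpy(&head, s, 8);
		memcpy(&tail, s + n - 8, 8);
		memcpy(d, &head, 8);
		memcpy(d + n - 8, &tail, 8);
	} else if (n >= 4) {
		/* 4..7 bytes: first 4 and last 4, overlapping when n < 8. */
		uint32_t head, tail;
		memcpy(&head, s, 4);
		memcpy(&tail, s + n - 4, 4);
		memcpy(d, &head, 4);
		memcpy(d + n - 4, &tail, 4);
	} else if (n >= 2) {
		/* 2..3 bytes: first 2 and last 2, overlapping when n == 3. */
		uint16_t head, tail;
		memcpy(&head, s, 2);
		memcpy(&tail, s + n - 2, 2);
		memcpy(d, &head, 2);
		memcpy(d + n - 2, &tail, 2);
	} else if (n == 1) {
		d[0] = s[0];
	}
	/* n == 0: nothing to copy. */
}
```

Including n == 16 in the first branch is what lets a copy of two 64-bit
pointers finish with two loads and two stores in this sketch.
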
> 
> Also, __rte_restrict was added to source and destination addresses.
> 
> And finally, the missing implementation of rte_mov48() was added.
> 
> Regarding performance, the memcpy performance test showed that
> cache-to-cache copying of up to 32 bytes now takes 2 cycles, versus
> approximately 6.5 cycles before this patch.
> Copying 64 bytes now takes 4 cycles, versus 7 cycles before.
> 
> Signed-off-by: Morten Brørup <[email protected]>
> ---

Acked-by: Konstantin Ananyev <[email protected]>

> --
> 2.43.0
