> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
>
> Furthermore, the function for copying less than 16 bytes was replaced with
> a smarter implementation using fewer branches and potentially fewer
> load/store operations.
> This function was also extended to handle copying of up to 16 bytes,
> instead of up to 15 bytes.
> This small extension reduces the code path, and thus improves the
> performance, for copying two pointers on 64-bit architectures and four
> pointers on 32-bit architectures.
>
> Also, __rte_restrict was added to source and destination addresses.
>
> And finally, the missing implementation of rte_mov48() was added.
>
> Regarding performance, the memcpy performance test showed cache-to-cache
> copying of up to 32 bytes now takes 2 cycles, versus ca. 6.5 cycles before
> this patch.
> Copying 64 bytes now takes 4 cycles, versus 7 cycles before.
>
> Signed-off-by: Morten Brørup <[email protected]>
> ---

Acked-by: Konstantin Ananyev <[email protected]>

> --
> 2.43.0
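
For readers following the thread, the "fewer branches and potentially fewer
load/store operations" idea for copies of up to 16 bytes can be sketched with
overlapping head/tail accesses. This is only an illustration, assuming a
non-overlapping source and destination; copy_upto_16() is a hypothetical name
and is not the code from the patch. Plain C99 restrict stands in for
__rte_restrict.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the DPDK function): copy n bytes, 0 <= n <= 16.
 * Each size class loads one chunk from the start and one chunk from the end
 * of the source, then stores both; the two chunks overlap when n is not an
 * exact multiple of the chunk size. Fixed-size memcpy() calls are used for
 * the unaligned loads/stores; compilers typically turn them into single
 * move instructions.
 */
static inline void
copy_upto_16(void *restrict dst, const void *restrict src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    if (n >= 8) {
        /* 8..16 bytes: two 8-byte chunks, overlapping unless n == 16. */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        /* 4..7 bytes: two 4-byte chunks. */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {
        /* 2..3 bytes: two 2-byte chunks. */
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        d[0] = s[0];
    }
    /* n == 0: nothing to copy. */
}
```

In this sketch, n == 16 (two pointers on a 64-bit architecture) is handled by
the first branch with two 8-byte loads and two 8-byte stores, which is the
kind of short code path the commit message describes for the extended
size range.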

