David Gibson <[email protected]> writes:

> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>>
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>>
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>>
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?
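To make the diagrams above concrete, here is a minimal sketch in plain C of
what "load 8 bytes at a time and manipulate" could look like for the
little-endian case. It is not the TCG code from the patch. It assumes the
manipulation is a swap of the two 32-bit words within each 64-bit load, which
is what the diagrams imply, and the helper names (load_le64, swap_words,
load_4words_le, check_diagram) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Read 8 bytes as a little-endian 64-bit value, whatever the host is. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Swap the two 32-bit words inside a 64-bit value. */
static uint64_t swap_words(uint64_t v)
{
    return (v >> 32) | (v << 32);
}

/*
 * Hypothetical helper: build the register image of a four-word vector
 * load from little-endian storage using two 8-byte loads.
 * Assumption: each 64-bit half only needs its 32-bit words swapped.
 */
static void load_4words_le(const uint8_t mem[16], uint64_t *hi, uint64_t *lo)
{
    *hi = swap_words(load_le64(mem));
    *lo = swap_words(load_le64(mem + 8));
}

/* Check the sketch against the byte patterns in the diagrams. */
void check_diagram(void)
{
    static const uint8_t le_storage[16] = {
        0x33, 0x22, 0x11, 0x00, 0x77, 0x66, 0x55, 0x44,
        0xBB, 0xAA, 0x99, 0x88, 0xFF, 0xEE, 0xDD, 0xCC,
    };
    uint64_t hi, lo;

    load_4words_le(le_storage, &hi, &lo);
    assert(hi == 0x0011223344556677ULL);
    assert(lo == 0x8899AABBCCDDEEFFULL);
}
```

Calling check_diagram() from a test program exercises the "Little-Endian
Storage" row and checks that both halves match the "Vector load results in"
row.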
Result of a million lxvw4x, mfvsrd/mfvsrld and print:

Without patch:
==============
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null

real    0m2.812s
user    0m2.792s
sys     0m0.020s
[tcg_test]$

With patch:
===========
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null

real    0m2.801s
user    0m2.783s
sys     0m0.018s
[tcg_test]$

Not much perceivable difference. Is there a better way to benchmark?

Regards
Nikunj
