On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic <[email protected]> wrote:
> Hi Siarhei,
>
> I implemented a new version of the (patch below)
> BILINEAR_INTERPOLATE_SINGLE_PIXEL macro where ANDI/EXT instructions,
> are substituted with load byte instructions (for better dual-issue
> instruction balancing) and got these results on my Malta board:
>
> Original:
> [ 0] image firefox-fishtank 2289.180 2290.567 0.05% 5/6
> Opt (ANDI/EXT)
> [ 0] image firefox-fishtank 1700.925 1708.314 0.22% 5/6
> Opt2 (load byte instructions)
> [ 0] image firefox-fishtank 1671.700 1672.006 0.03% 4/4
>
> There is performance improvement, but not impressive as I expected.

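For reference, the substitution described in the quote can be sketched in C as two ways of pulling one 8-bit channel out of a 32-bit pixel. This is only an illustration, not the pixman MIPS assembly: the names `channel_by_mask`, `channel_by_load` and `is_little_endian` are hypothetical, and channel 0 is assumed to be the least significant byte of the packed pixel. The shift-and-mask form roughly corresponds to ALU work of the ANDI/EXT kind, the byte-load form to an LBU-style load. A real build would normally pick the byte offset at compile time; the sketch uses a runtime probe only to stay portable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ALU-style extraction: shift and mask. The result does not depend on
 * the byte order of the machine. */
static uint8_t channel_by_mask(uint32_t pixel, unsigned channel)
{
    assert(channel < 4);
    return (uint8_t)((pixel >> (8 * channel)) & 0xff);
}

/* Runtime byte-order probe (assumption: used here instead of a
 * compile-time check). Returns 1 on little-endian, 0 on big-endian. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Load-style extraction: read a single byte from the pixel in memory.
 * The byte offset of a given channel depends on the byte order. */
static uint8_t channel_by_load(const uint32_t *pixel, unsigned channel)
{
    const uint8_t *bytes = (const uint8_t *)pixel;
    unsigned offset;

    assert(channel < 4);
    offset = is_little_endian() ? channel : 3 - channel;
    return bytes[offset];
}
```

Both functions return the same value for any pixel and channel; only the byte-load variant needs to know the byte order.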
This is interesting. Have you tried to check where the time is actually
spent and what the performance bottleneck is?

If the code from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro is put into an
unrolled loop so that the total number of iterations equals the CPU clock
frequency in Hz, then the run time in seconds reads directly as cycles per
iteration, and we get:

MIPS 74K and ANDI/EXT variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:
real    0m40.279s
user    0m40.260s
sys     0m0.006s

MIPS 74K and LBU variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:
real    0m26.479s
user    0m26.468s
sys     0m0.006s

That's ~40 cycles vs. ~26.5 cycles, a saving of roughly 13.5 cycles from
replacing 16 ALU instructions with 16 LS instructions. Ideally we would
want perfect dual issue here and a 16-cycle saving, but at least dual
issue works.

> And now code also becomes vulnerable to endianess of the target CPUs.
> Of course, this can be guarded with some #ifdef's where byte offset in a word
> is changed according to the endianess of the target CPU (since MIPS CPUs can
> be both LE and BE).
> Is this small improvement worth making this code vulnerable to endian issues?

If you are already satisfied with this level of performance, then it's
probably fine for now.

> I still need to add improvement for that packing/unpacking of the RGBA pixels
> after bilinear/before OVER operation, but I don't expect big improvement
> there (it is just a couple of instructions).

It's not just a couple of instructions. By combining the color channels in
a register, you also force the processor to finish the calculations for
all the needed data first. This is an extra data dependency, which may
inhibit instruction reordering.

But big improvements are not likely to happen without a clear
understanding of what is going on in the CPU pipeline and an accounting of
each cycle spent.

-- 
Best regards,
Siarhei Siamashka
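The timing method used above rests on a simple identity: if the loop body runs N times on a core clocked at f Hz and takes t seconds, each iteration costs t * f / N cycles, so choosing N = f makes the elapsed seconds read directly as cycles per iteration. A minimal C sketch of that arithmetic follows. The helper name `cycles_per_iteration` is hypothetical, and the 1 GHz clock in the usage note is an assumed value for illustration, not the actual clock of the test board.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average number of CPU cycles spent per loop
 * iteration, given the measured run time, the core clock frequency
 * and the number of iterations executed. */
static double cycles_per_iteration(double elapsed_seconds,
                                   double clock_hz,
                                   uint64_t iterations)
{
    assert(iterations > 0);
    return elapsed_seconds * clock_hz / (double)iterations;
}
```

With an assumed clock of 1e9 Hz and 1000000000 iterations, `cycles_per_iteration(40.260, 1e9, 1000000000ULL)` comes out at about 40.26 and `cycles_per_iteration(26.468, 1e9, 1000000000ULL)` at about 26.47, which is just the user times above read as cycle counts.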
