On Wed, Aug 10, 2011 at 10:08:54AM +0200, Richard Guenther wrote:
> Are the arrays all well-aligned in practice?  Thus, would versioning the loop
> for all-good-alignment help?

I suspect yes on 64-bit, but no on 32-bit, due to malloc not returning 128-bit
aligned memory in 32-bit.  It only returns memory that is aligned to double the
alignment of size_t.  Long doubles in powerpc are 128 bits, as are the vector
types.

I did a test, eliminating the vec_realign stuff under switch control.  This has
the effect of versioning the loop into a vector loop that is run when all
pointers are aligned, and a scalar loop that is run when they aren't all
aligned.

I ran SPEC 2006 in 32-bit, and I see the following differences (leaving out the
benchmarks whose results are close enough to the baseline):

Benchmark          % of baseline
=========          =============
400.perlbench       96.09%
429.mcf            104.50%
456.hmmer           95.85%
458.sjeng          104.23%
464.h264ref        112.18%
483.xalancbmk      102.35%
410.bwaves         107.02%
416.gamess          96.01%
433.milc            98.90%
434.zeusmp          94.92%
435.gromacs        105.55%
450.soplex         108.58%
453.povray         103.71%
454.calculix        97.54%
459.GemsFDTD        97.35%
465.tonto           97.79%
470.lbm             98.56%
481.wrf             87.11%
482.sphinx3        110.33%

I was hoping that doing the versioning for an aligned loop and unaligned loop
would eliminate the percentages under 100%.

Note, the powerpc VSX memory instructions for V4SF/V4SI types can run if the
pointer is not aligned to a 128-bit boundary, but there is a slowdown if they
get pointers that aren't aligned to a 64-bit boundary.  I'm doing a run right
now, with movmisalign enabled for V4SF/V4SI, and I am seeing some regressions
in the run.

> If we have 4 permutes and then 8 further ones - can we combine for example
> an unaligned load permute and the following permute for the sf->df conversion?

I don't think so.  The unaligned stuff is to load up a 128-bit value in a
register using a left half and a right half, and a mask.  The Altivec
instruction set has an instruction (lvsl) that computes the mask based on the
address, and the loads and stores ignore the bottom 4 bits of the address.

The unaligned loop looks something like:

	left = vector_load (addr & -16);
	mask = lvsl (addr);
	for (...)
	  {
	    addr += 16;
	    right = vector_load (addr & -16);
	    value = permute (left, right, mask);
	    /* ... */
	    left = right;
	  }

The two permutes for the conversion get the values into the correct place for
the conversion instruction, i.e. if you have a vector with the parts:

	+====+====+====+====+
	| A  | B  | C  | D  |
	+====+====+====+====+

The first permute (xxmrghw) in the conversion would create a vector:

	+====+====+====+====+
	| A  | A  | B  | B  |
	+====+====+====+====+

and the second (xxmrglw) would create:

	+====+====+====+====+
	| C  | C  | D  | D  |
	+====+====+====+====+

Note, the values are doubled, because the instruction takes 2 registers as
input, and we just give the same register for both inputs.

The xvcvspdp instruction then takes a vector of the form (ignoring the 2nd and
4th fields):

	+====+====+====+====+
	| X  | ?? | Y  | ?? |
	+====+====+====+====+

and converts it to double precision:

	+=========+=========+
	|    X    |    Y    |
	+=========+=========+

> Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
> Maybe we need more fine-grained costs?

The ppc has a cost model, but as I said in 50031, I think it needs to be
improved.

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meiss...@linux.vnet.ibm.com	fax +1 (978) 399-6899
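
Addendum: for readers without a PowerPC manual at hand, the unaligned-load loop described in this message can be modelled in portable C.  This is only a rough sketch under stated assumptions, not what GCC generates: the type vec16 and the functions vector_load, lvsl, permute and copy_unaligned are hypothetical stand-ins that imitate the behaviour described in the text (loads that ignore the bottom 4 address bits, a mask derived from the address, and a byte permute across two registers), with bytes numbered in memory order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte "vector register" used only for this model. */
typedef struct { uint8_t b[16]; } vec16;

/* Models an Altivec-style load: the bottom 4 address bits are ignored. */
static vec16 vector_load(const uint8_t *addr)
{
    vec16 v;
    const uint8_t *aligned =
        (const uint8_t *)((uintptr_t)addr & ~(uintptr_t)15);
    memcpy(v.b, aligned, 16);
    return v;
}

/* Models lvsl: mask byte i is (addr & 15) + i. */
static vec16 lvsl(const uint8_t *addr)
{
    vec16 m;
    unsigned sh = (unsigned)((uintptr_t)addr & 15);
    for (int i = 0; i < 16; i++)
        m.b[i] = (uint8_t)(sh + i);
    return m;
}

/* Models the permute: each mask byte indexes the 32 bytes of left:right. */
static vec16 permute(vec16 left, vec16 right, vec16 mask)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = mask.b[i] & 31;
        r.b[i] = idx < 16 ? left.b[idx] : right.b[idx - 16];
    }
    return r;
}

/* Copies nvec 16-byte vectors from a possibly unaligned src.
   Note: like the real scheme, this reads the whole aligned block that
   contains src + 16 * nvec, so the source buffer must cover that block. */
static void copy_unaligned(uint8_t *dst, const uint8_t *src, size_t nvec)
{
    vec16 left = vector_load(src);
    vec16 mask = lvsl(src);
    for (size_t i = 0; i < nvec; i++) {
        src += 16;
        vec16 right = vector_load(src);
        vec16 value = permute(left, right, mask);
        memcpy(dst + 16 * i, value.b, 16);
        left = right;
    }
}
```

The point of the model is that each iteration needs only one new load plus one permute, because the previous right half is reused as the next left half; the mask is computed once, outside the loop.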