Richard Earnshaw wrote:

> Hmm, this is going to cause bottlenecks on Cortex-A15: writing a Neon
> single-precision register and then reading it back as a double-precision
> value will cause scheduling problems.

Ok, that is a problem ...

> The awkward thing here is that the shift only uses the bottom 8 bits of
> the register, even though the instruction takes a 64-bit register, so we
> don't want to go to the trouble of sign-extending the value all the way
> out to 64-bits. We don't really care what the upper bits are set to.

Would a vdup.32 Dn, Rm (instead of the vmov) help here, or does this
likewise have performance issues?

> A solution to this is to have the set of the shifter register done as a
> lane-set operation rather than as a set of the lower register, but it
> probably needs some thought as to how to achieve this without creating
> other overheads.

What instruction are you referring to here? Loads from memory?

Bye,
Ulrich

--
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com