Steve White wrote:
 I was under the misconception that each of these SSE operations
was meant to be accomplished in a single clock cycle (although I knew there
are various other issues).


Current CPU architectures permit an SSE scalar or parallel multiply and an add instruction to be issued on each clock cycle. Completion takes longer: at least 4 cycles for add, and significantly more for multiply. The instruction timing tables quote both throughput (the number of cycles between issues) and latency (the number of cycles to complete an individual operation). An even more common misconception than yours is that the extra time taken to complete a multiply, compared with an add, would disappear with fused multiply-add instructions. SSE divide, as has been explained, is not pipelined. The best way to speed up a loop containing divides is vectorization, barring situations such as the one you brought up, where the divide may not actually be a necessary part of the algorithm.
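
To make the throughput/latency distinction concrete, here is a minimal sketch in plain C. The function names sum_single and sum_four, and the choice of four accumulators, are my own assumptions for illustration, not anything taken from the timing tables. In the first loop every add depends on the result of the previous one, so the loop is bound by add latency. In the second, the four adds in each iteration are independent, so a new one can be issued while earlier ones are still completing.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add must wait for the previous add to finish,
   so the loop advances at roughly one add per add-latency. */
float sum_single(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators (the count is an illustrative assumption):
   the adds within one iteration do not depend on each other, so they can
   overlap in the pipeline instead of running back to back. */
float sum_four(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions add the elements in a different order, so for general floating-point data the results can differ in the last bits. That is also why gcc will not make this transformation on its own unless reassociation is allowed, for example with -ffast-math. How much the second form gains, if anything, depends on the CPU and on the compiler options.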
