Re: speed of double-precision divide

Tim Prince Sun, 24 Jan 2010 14:13:21 -0800

Steve White wrote:

 I was under the misconception that each of these SSE operations
was meant to be accomplished in a single clock cycle (although I knew there
are various other issues.)

Current CPU architectures permit an SSE scalar or parallel multiply and add instruction to be issued on each clock cycle. Completion takes at least 4 cycles for add, significantly more for multiply. The instruction timing tables quote throughput (how many cycles between issue) and latency (number of cycles to complete an individual operation).

An even more common misconception than yours is that the extra time taken to complete multiply, compared with the time of add, would disappear with fused multiply-add instructions.

SSE divide, as has been explained, is not pipelined. The best way to speed up a loop with divide is with vectorization, barring situations such as the one you brought up where divide may not actually be a necessary part of the algorithm.
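As a rough illustration of the last point, here is a minimal sketch in C. The function names and the loop-invariant divisor d are assumptions made up for the example, not code from this thread. The first loop pays for a divide on every element and relies on the compiler vectorizing it; the second applies only when the divisor does not change inside the loop, so that a single divide can be done up front and the loop body becomes a pipelined multiply.

```c
#include <stddef.h>

/* Hypothetical example: one divide per element.
   A vectorizing compiler can use packed divides here, but each
   divide still occupies the non-pipelined divide unit. */
void scale_by_divide(double *out, const double *in, size_t n, double d)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] / d;
}

/* Hypothetical rewrite for a loop-invariant divisor: one divide
   before the loop, then one multiply per element.
   Note: in[i] * (1.0 / d) can differ from in[i] / d in the last
   bit unless 1.0 / d is exactly representable. */
void scale_by_reciprocal(double *out, const double *in, size_t n, double d)
{
    const double r = 1.0 / d;   /* the only divide */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * r;
}
```

Because the two forms are not bit-for-bit equivalent in general, a compiler will not make this substitution on its own under strict floating-point rules; with gcc it is permitted by options such as -freciprocal-math (implied by -ffast-math). Whether the rounding difference is acceptable depends on the algorithm.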
