Matthew Dillon <[EMAIL PROTECTED]> wrote:
>    The change in code flow used to be the expensive piece, but not any
>    more.  You typically see either a branch prediction cache (Intel)
>    offering a best-case 0-cycle latency, or a single-cycle latency
>    that is slot-fillable (MIPS).

In the case of an indirect branch, you also need to fetch the
destination address from memory.  This is presumably 1 cycle (if it's
cached).  It may be possible to pre-fetch the address, but this
requires a substantial amount of silicon for the interlocks.
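
As a rough illustration (just a sketch I'm making up here, the names
don't come from any real code), the extra fetch shows up as soon as the
call target lives behind a pointer:

#include <stdio.h>

static int add_one(int x) { return x + 1; }

/* Direct call: the target is a label encoded in the call instruction
   itself, so the front end never touches data memory to find it. */
static int call_direct(int x)
{
    return add_one(x);
}

/* Indirect call through a table of function pointers: the CPU has to
   load ops->fn from memory first (presumably one cycle if it hits the
   cache), and only then does it know where the branch goes. */
struct ops {
    int (*fn)(int);
};

static int call_indirect(const struct ops *ops, int x)
{
    return ops->fn(x);
}

int main(void)
{
    struct ops o = { add_one };
    printf("%d %d\n", call_direct(1), call_indirect(&o, 1));
    return 0;
}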

>    Since the jump portion of a subroutine call to a direct label is nothing
>    more than a deterministic branch, the branch prediction cache actually
>    operates in this case.  You do not quite get 0-cycle latency due to
>    the push/pop and potential arguments, but it is very fast.

I'm not sure there's any reason why you shouldn't get 0-cycle latency
there.  If you changed the semantics of a stack segment so that memory
addresses below the stack pointer were irrelevant, you could implement
a small, 0-cycle, on-chip stack (that overflowed into memory).  I don't
know whether this semantic change would be allowable (and whether the
associated silicon could be justified) for the IA-32.
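
For what it's worth, here is a toy fragment (again, only a sketch of my
own, not real code from anywhere) with the sort of sequence a compiler
might emit written out in the comment; those %esp accesses are exactly
the traffic such an on-chip stack would absorb:

/* Purely illustrative: with the usual IA-32 calling convention, a
 * trivial non-inlined call still touches memory at the stack pointer
 * several times.  A compiler might emit something roughly like:
 *
 *   caller:   movl  4(%esp), %eax   ; load y
 *             pushl %eax            ; argument written at (%esp)
 *             call  add_one         ; return address written at (%esp)
 *             addl  $4, %esp        ; argument discarded
 *             ret
 *
 *   add_one:  movl  4(%esp), %eax   ; argument read back from the stack
 *             incl  %eax
 *             ret                   ; return address read from (%esp)
 *
 * A small on-chip stack of the kind described above could absorb those
 * accesses instead of sending them to the data cache.
 */
static int add_one(int x)
{
    return x + 1;
}

int caller(int y)
{
    return add_one(y);
}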

Peter

