Matthew Dillon <[EMAIL PROTECTED]> wrote:
> The change in code flow used to be the expensive piece, but not any
> more. You typically either see a branch prediction cache (Intel)
> offering a best-case of 0-cycle latency, or a single-cycle latency
> that is slot-fillable (MIPS).
In the case of an indirect branch, you also need to fetch the
destination address from memory. This is presumably 1 cycle (if it's
cached). It may be possible to pre-fetch the address, but this
requires a substantial amount of silicon for the interlocks.
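
To make the direct/indirect distinction concrete, here is a rough sketch in C. The handler names and the table are hypothetical, invented purely for illustration. The point is only that the indirect call has to load its target from memory before it can branch, whereas the direct call's target is fixed in the instruction stream. An optimizing compiler may well inline or devirtualize either call, so treat this as a picture of the idea rather than a statement about generated code.

```c
#include <stddef.h>

/* Hypothetical handlers, for illustration only. */
static int add_one(int x) { return x + 1; }
static int twice(int x)   { return x * 2; }

/* Table of branch targets. It lives in memory, so an indirect call
 * must first load an entry from it before the jump can happen. */
typedef int (*handler_fn)(int);
static const handler_fn handlers[] = { add_one, twice };

/* Direct call: the destination is encoded in the call itself. */
int call_direct(int x)
{
    return add_one(x);
}

/* Indirect call: fetch handlers[idx] from memory, then branch to it.
 * The caller must pass idx < 2; no bounds check is done here. */
int call_indirect(size_t idx, int x)
{
    return handlers[idx](x);
}
```

For example, call_indirect(1, 3) goes through the table to twice(), while call_direct(3) always reaches add_one().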
> Since the jump portion of a subroutine call to a direct label is nothing
> more than a deterministic branch, the branch prediction cache actually
> operates in this case. You do not quite get 0-cycle latency due to
> the push/pop, and potential arguments, but it is very fast.
I'm not sure there's any reason why you shouldn't get 0-cycle latency
even with the push/pop. If you changed the
semantics of a stack segment so that memory addresses below the stack
pointer were irrelevant, you could implement a small, 0-cycle, on-chip
stack (that overflowed into memory). I don't know whether this
semantic change would be allowable (and whether the associated silicon
could be justified) for the IA-32.
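
As a software model of what I mean (a sketch only, not a description of any real CPU), the on-chip stack could behave like the code below. CHIP_SLOTS, MEM_SLOTS, the hw_stack type and the hw_push/hw_pop names are all made up for illustration. The newest entries stay in a small circular buffer; memory is touched only when that buffer overflows (a spill) or runs dry (a fill). Because nothing below the stack pointer matters, a popped entry is simply dropped and never written back.

```c
#include <stddef.h>
#include <stdint.h>

#define CHIP_SLOTS 4    /* hypothetical on-chip capacity */
#define MEM_SLOTS  64   /* hypothetical size of the backing memory */

typedef struct {
    uint32_t chip[CHIP_SLOTS]; /* circular buffer of the newest entries */
    size_t   head;             /* index of the oldest on-chip entry */
    size_t   count;            /* number of entries currently on chip */
    uint32_t mem[MEM_SLOTS];   /* stands in for ordinary stack memory */
    size_t   mem_count;        /* number of entries spilled to memory */
    size_t   spills, fills;    /* counters for memory traffic */
} hw_stack;

/* Push a value. Returns 0 on success, -1 if the backing memory is full. */
int hw_push(hw_stack *s, uint32_t v)
{
    if (s->count == CHIP_SLOTS) {
        if (s->mem_count == MEM_SLOTS)
            return -1;
        /* Overflow: spill the oldest on-chip entry to memory. */
        s->mem[s->mem_count++] = s->chip[s->head];
        s->head = (s->head + 1) % CHIP_SLOTS;
        s->count--;
        s->spills++;
    }
    s->chip[(s->head + s->count) % CHIP_SLOTS] = v;
    s->count++;
    return 0;
}

/* Pop a value into *out. Returns 0 on success, -1 if the stack is empty. */
int hw_pop(hw_stack *s, uint32_t *out)
{
    if (s->count == 0) {
        if (s->mem_count == 0)
            return -1;
        /* On-chip part is empty: fetch the newest spilled entry. */
        *out = s->mem[--s->mem_count];
        s->fills++;
        return 0;
    }
    /* Common case: served on chip, no memory access and no write-back. */
    s->count--;
    *out = s->chip[(s->head + s->count) % CHIP_SLOTS];
    return 0;
}
```

Starting from a zero-initialized hw_stack, any call depth of CHIP_SLOTS or less never touches mem[] at all, which is the case that would be 0-cycle in hardware.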
Peter