On Thu, Jun 4, 2015 at 11:17 PM, Andi Kleen <a...@linux.intel.com> wrote:
>> Rather than just a sequence of NOP's, should the first NOP be a
>> unconditional branch to the beginning of the real function? I don't
>> know if this applies to AArch64 cpus, but I believe some cpus can handle
>> such branches already in the decode unit, thus avoiding any extra cycles
>> for skipping the NOPs.
>
> nops are very cheap. Typically they are already discard in the frontend.
> It's unlikely all of this is worth it.

Maybe on Intel's chips, but on ThunderX, for example, they are not discarded. We can issue two at a time, though. So what is a few cycles of overhead per function when an icache miss costs much more?

Thanks,
Andrew