https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110724
--- Comment #3 from Javier Martinez <javier.martinez.bugzilla at gmail dot com> ---
The generic tuning of 16:11:8 looks reasonable to me; I do not argue against it.

From Agner Fog’s Optimizing subroutines in assembly language:

> Most microprocessors fetch code in aligned 16-byte or 32-byte blocks.
> If an important subroutine entry or jump label happens to be near the
> end of a 16-byte block then the microprocessor will only get a few
> useful bytes of code when fetching that block of code. It may have
> to fetch the next 16 bytes too before it can decode the first instructions
> after the label. This can be avoided by aligning important subroutine
> entries and loop entries by 16. Aligning by 8 will assure that at least 8
> bytes of code can be loaded with the first instruction fetch, which may
> be sufficient if the instructions are small.

This looks like the reason behind the alignment. That section of the book goes
on to explain the cost of alignment (execution of nops) on labels that are
reachable by means other than branching, which I presume led to the :m and :m2
tuning values, the distinction between -falign-labels and -falign-jumps, and
the reason padding is removed when my label is reachable by fall-through with
[[unlikely]].

All this is fine. My thesis is that this alignment strategy is always
unnecessary in one specific circumstance: when the branch target is itself an
unconditional branch of size 1, as in:

.L1:
        ret

A one-byte ret can never cross a block boundary, and the instructions following
the ret must not execute, so there is no front-end stall to avoid.
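
To make the boundary argument concrete, here is a minimal sketch in C++ of the simplified fetch model from the quoted passage. It assumes only that code is fetched in aligned blocks of a fixed size. The helper names (bytes_left_in_block, crosses_block, one_byte_never_crosses) are hypothetical and are not part of GCC, and the model does not try to capture the front end of any particular microarchitecture.

```cpp
#include <cstdint>

// Bytes available from `addr` up to the end of its aligned fetch block.
// `block` is the assumed fetch-block size in bytes (16 or 32 in the quote).
constexpr std::uint64_t bytes_left_in_block(std::uint64_t addr,
                                            std::uint64_t block) {
    return block - (addr % block);
}

// True if an instruction of `size` bytes starting at `addr` straddles a
// fetch-block boundary, i.e. a second fetch is needed before it can decode.
constexpr bool crosses_block(std::uint64_t addr, std::uint64_t size,
                             std::uint64_t block) {
    return size > bytes_left_in_block(addr, block);
}

// Exhaustively check every offset within one block for a 1-byte instruction.
constexpr bool one_byte_never_crosses(std::uint64_t block) {
    for (std::uint64_t off = 0; off < block; ++off)
        if (crosses_block(off, 1, block)) return false;
    return true;
}

// A 1-byte instruction such as ret fits in whatever is left of the block,
// even at the very last byte.
static_assert(one_byte_never_crosses(16), "1-byte insn fits in 16-byte block");
static_assert(one_byte_never_crosses(32), "1-byte insn fits in 32-byte block");

// A 2-byte instruction placed at the last byte of a block does straddle.
static_assert(crosses_block(15, 2, 16), "2-byte insn at offset 15 straddles");

// Aligning a label by 8 leaves at least 8 bytes in a 16-byte block.
static_assert(bytes_left_in_block(8, 16) == 8, "8 bytes left at offset 8");
```

Under this model the padding in front of .L1 buys nothing: the first fetch always contains the whole ret, and nothing after it needs to be decoded.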