------- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 -------
I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD OpteronTM
processors, but it is nonoptimal for Intel processors. Because:
1. AMD limitation for 16-bytes page (memory range XXXXXXX0 - XXXXXXXF), but
Intel limitation for 16-bytes chunk (memory range XXXXXXXX - XXXXXXXX+10h)
2. AMD - maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns),
Intel - maximum of _FOUR_ branches!
Quotation from Software Optimization Guide for AMD64 Processors
6.1 Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.
The AMD AthlonTM 64 and AMD OpteronTM processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte
boundary cannot dispatch with any other instructions when it is predicted
taken. Perform this alignment by rearranging code; it is not beneficial to
align branches using padding sequences.
The following branches are limited to three per 16-byte window:
jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32
Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
call label1 ; 1st branch in 16-byte code window
jc label3 ; 2nd branch in 16-byte code window
call label2 ; 3rd branch in 16-byte code window
jnz label4 ; 4th branch in 16-byte code window
; Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942