https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118356

--- Comment #2 from Javier Mora <cousteaulecommandant at gmail dot com> ---
It has been brought to my attention that there are -falign-labels,
-falign-loops, and -falign-jumps options, and that perhaps -labels should be
left as is, and only -loops and -jumps should be touched.

I (possibly incorrectly) interpreted from the documentation that labels were
classified as either "can only be reached by jumping" (-jumps) or "can also be
reached without jumping" (-loops), and that -labels was the union of the two
sets.  But perhaps the correct interpretation is that there are actually three
categories (always jumped to, often jumped to, rarely/never jumped to), that
-loops refers strictly to the second and not the third, and that -labels
refers to all three, with the compiler following some magical algorithm to
determine whether a target is "often" or "rarely" jumped to.  If that's the
case, then it might make no sense to force the alignment of the third
category, and doing so might be detrimental to performance depending on how it
is done: inserting NOPs would cost extra cycles, whereas expanding compressed
instructions to force alignment wouldn't have an impact on the clock cycle
count.
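
To make the alignment cost concrete, here is a minimal sketch of how many
filler bytes are needed to move a label up to the next aligned address.  The
helper name `padding_to_align` is hypothetical and this is not how GCC
implements it; it only assumes that the alignment is a power of two and that
instructions are 2 or 4 bytes long, so a label can land 2 bytes past a 4-byte
boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed to move addr up to the next multiple of align.
   Hypothetical helper for illustration only; align must be a power of two. */
static inline uint32_t padding_to_align(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* (align - misalignment) modulo align, so an aligned addr needs 0 bytes. */
    return (align - (addr & (align - 1))) & (align - 1);
}
```

For a label at 0x102 and 4-byte alignment this gives 2 bytes, which could be
filled either by one 2-byte NOP or by widening one earlier compressed
instruction to its 4-byte form.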

So, depending on that, it might make more sense to change only what
-falign-loops=0 and -falign-jumps=0 default to, making them behave as 4, and to
leave -falign-labels=0 behaving as 1.  (Or 2; I don't know how it's internally
implemented.)

FWIW, here's some code for experimentation.  I haven't tested this specific
example, but if I'm not mistaken it should take about 26*limit clock cycles to
run on a CVE4 platform with `-O2 -falign-loops=4`, but 28*limit clock cycles
with `-O2 -falign-loops=1` (two of the four loops take 1 clock cycle longer per
iteration due to misalignment), resulting in a performance loss of roughly 7%.

```c
void test_align(int limit) {
    for (int i=0; i<limit; i++) asm volatile ("nop");
    for (int i=0; i<limit; i++) asm volatile ("nop\n\tnop");
    for (int i=0; i<limit; i++) asm volatile ("nop");
    for (int i=0; i<limit; i++) asm volatile ("nop\n\tnop");
}
```
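
The "roughly 7%" figure follows from the two untested estimates above (26 and
28 cycles per iteration of `limit`).  A small sketch of the arithmetic, with
hypothetical helper names, showing both ways of expressing the loss:

```c
#include <assert.h>

/* Extra run time of the slow case relative to the fast one, in percent. */
static double extra_time_percent(double fast_cycles, double slow_cycles)
{
    assert(fast_cycles > 0.0 && slow_cycles >= fast_cycles);
    return 100.0 * (slow_cycles - fast_cycles) / fast_cycles;
}

/* Throughput lost by the slow case relative to the fast one, in percent. */
static double throughput_loss_percent(double fast_cycles, double slow_cycles)
{
    assert(fast_cycles > 0.0 && slow_cycles >= fast_cycles);
    return 100.0 * (slow_cycles - fast_cycles) / slow_cycles;
}
```

With 26 and 28 as inputs, the first function gives about 7.7% (more time
taken) and the second about 7.1% (less work per cycle); either way the
estimate is in the 7% range.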
