https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118356

Palmer Dabbelt <palmer at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2025-01-08
           Keywords|                            |missed-optimization
                 CC|                            |palmer at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Palmer Dabbelt <palmer at gcc dot gnu.org> ---
(In reply to Javier Mora from comment #0)
> Some RISC-V implementations, including the CORE-V CVE4 family [1], allow
> having instructions aligned to 2- or 4-byte boundaries, but introduce an
> extra clock cycle penalty if the target of a branch instruction is a 4-byte
> instruction that is not aligned to a 4-byte boundary (but not if the target
> instruction is aligned to a 4-byte boundary, or if it's a 2-byte
> instruction).
> 
> In those cases, forcing alignment of branch targets to 4 bytes (which can be
> achieved by providing `-falign-labels=4`) can provide a great improvement on
> certain programs.  For example, a tight `for` loop may take 9 clock cycles
> to run if the branch target is aligned but 10 if it's not, resulting in a
> 10% performance loss.  (What's worse, this performance loss will only kick
> in arbitrarily, and can appear or disappear even if I change a completely
> different part of the code, which drove me crazy when I was trying to
> measure the performance of a function affected by this issue; enabling
> `-falign-labels=4` also has the advantage of removing this uncertainty.)
> 
> Here, my expectation would be that enabling a certain optimization level
> (such as `-O2`) enabled this particular optimization.  In fact, the
> documentation [2] states that `-O2` enables the `-falign-labels` flag, but
> without specifying an alignment.  It later states that `-falign-labels`
> without a value or with `=0` will "use a machine-dependent default which is
> very likely to be ‘1’, meaning no alignment".
>
> Now, I don't quite understand why `-O2` would want to enable an optimization
> option whose default behavior is to do nothing, but my guess is that this is
> so that specific targets where setting `-falign-labels=X` can provide an
> advantage (as is the case with RISC-V) use `X` as the default value rather
> than 1.

The GCC docs say

    -O2
    Optimize even more. GCC performs nearly all supported optimizations that
    do not involve a space-speed tradeoff.

Aligning labels inserts padding, which trades space for speed, so this is
something I'd expect to be off for "-O2" and on for "-O3".

> What do you think?  Would it make sense for RISC-V targets to make
> `-falign-labels=4` the default alignment value when `-falign-labels` is used
> without providing an explicit value, so that this forced alignment will
> happen when `-O2` or `-O3` are used?

This seems like a good candidate for a microarchitecture-specific tuning
parameter.  The exact cost here is going to be very implementation-dependent
and the mapping to GCC codegen is a little clunky, so it might be hard to get
this all to be a net performance win on real code.
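
As a rough illustration only, the penalty rule described in comment #0 can be
modelled as below.  This is a sketch under stated assumptions: the function
names are hypothetical, the one-cycle penalty and the 9-cycle loop are taken
from the report, and nothing here comes from GCC or the CV32E40P sources.

```python
# Toy model of the branch-target penalty described in comment #0.
# All names are hypothetical; the 1-cycle penalty is the reporter's figure.

def has_misaligned_penalty(target_addr: int, insn_bytes: int) -> bool:
    """True if a branch to target_addr pays the extra cycle.

    Per the report, only a 4-byte target instruction that is not on a
    4-byte boundary is penalized; 2-byte targets never are.
    """
    return insn_bytes == 4 and target_addr % 4 != 0


def padding_bytes(target_addr: int, align: int = 4) -> int:
    """Bytes of padding needed to move a label up to the next boundary."""
    return (-target_addr) % align


def loop_cycles(base_cycles: int, target_addr: int, insn_bytes: int,
                penalty: int = 1) -> int:
    """Cycles per iteration for a loop whose back-edge targets target_addr."""
    extra = penalty if has_misaligned_penalty(target_addr, insn_bytes) else 0
    return base_cycles + extra
```

With the numbers from the report, `loop_cycles(9, 0x102, 4)` models the
unaligned case and `padding_bytes(0x102)` the space it would cost to fix, which
is the space-speed tradeoff in question.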

We might also need that alignment-based decompression stuff in binutils,
depending on how implementations handle the extra NOPs (which IIRC we never
merged because nobody cared enough to figure out if the code actually worked).

> [1]:
> https://docs.openhwgroup.org/projects/cv32e40p-user-manual/en/latest/pipeline.html#cycle-counts-per-instruction-type
> [2]: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
