https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119628

            Bug ID: 119628
           Summary: Need better mechanisms to manage register saves in
                    callee for tail calls
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ak at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64

Created attachment 60997
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60997&action=edit
toy byte code interpreter

To compile the test case use

One use case for the musttail feature is to write threaded interpreters with
individual small functions, each implementing a byte code and calling the next
function in the byte code program using musttail.
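
A minimal sketch of that style, not the attached test case: every name here
(MUSTTAIL, struct insn, op_add, op_mul, op_halt, run_threaded, threaded_demo)
is made up for illustration. The MUSTTAIL macro falls back to a plain call on
compilers that do not advertise the attribute, so the sketch still builds there
but without the guaranteed tail call.

```c
#include <assert.h>
#include <stddef.h>

/* Use musttail only where the compiler advertises it; otherwise fall back
   to an ordinary call so the sketch still builds. */
#ifdef __has_attribute
# if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
# endif
#endif
#ifndef MUSTTAIL
# define MUSTTAIL
#endif

struct insn;
typedef long (*handler_fn)(const struct insn *pc, long acc);

/* One byte code instruction: its handler plus an immediate operand. */
struct insn {
    handler_fn handler;
    long operand;
};

/* Each handler does its work, then tail calls the next handler. */
static long op_add(const struct insn *pc, long acc)
{
    acc += pc->operand;
    pc++;
    MUSTTAIL return pc->handler(pc, acc);
}

static long op_mul(const struct insn *pc, long acc)
{
    acc *= pc->operand;
    pc++;
    MUSTTAIL return pc->handler(pc, acc);
}

/* Ends the chain by returning the accumulator to the original caller. */
static long op_halt(const struct insn *pc, long acc)
{
    (void)pc;
    return acc;
}

/* Entry point: an ordinary (non tail) call into the first handler. */
long run_threaded(const struct insn *prog)
{
    assert(prog != NULL);
    return prog->handler(prog, 0);
}

/* Hypothetical demo program computing (0 + 4) * 5. */
long threaded_demo(void)
{
    static const struct insn prog[] = {
        { op_add, 4 }, { op_mul, 5 }, { op_halt, 0 },
    };
    return run_threaded(prog);
}
```

All handlers share one signature, which keeps the tail calls valid for
compilers that require caller and callee prototypes to match.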

This is a replacement for an older code style that put all these byte code
handlers into a large function and called them using indirect goto.
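
For comparison, the older style can be sketched roughly as below, using the
GNU C labels-as-values extension. This is an illustration with made-up names
(goto_insn, run_goto, goto_demo), not code from the attachment.

```c
#include <assert.h>
#include <stddef.h>

enum goto_opcode { GOTO_OP_ADD, GOTO_OP_MUL, GOTO_OP_HALT };

struct goto_insn {
    enum goto_opcode op;
    long operand;
};

/* All handlers live in one function; dispatch is an indirect goto through
   a table of label addresses (GNU C extension). */
long run_goto(const struct goto_insn *pc)
{
    static void *const dispatch[] = { &&do_add, &&do_mul, &&do_halt };
    long acc = 0;

    assert(pc != NULL);
    goto *dispatch[pc->op];

do_add:
    acc += pc->operand;
    pc++;
    goto *dispatch[pc->op];

do_mul:
    acc *= pc->operand;
    pc++;
    goto *dispatch[pc->op];

do_halt:
    return acc;
}

/* Hypothetical demo program computing (0 + 4) * 5. */
long goto_demo(void)
{
    static const struct goto_insn prog[] = {
        { GOTO_OP_ADD, 4 }, { GOTO_OP_MUL, 5 }, { GOTO_OP_HALT, 0 },
    };
    return run_goto(prog);
}
```

Here the register allocator sees every handler at once, so there are no
per-handler prologues, at the cost of one very large function.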

See the attached test case as an example.

This works fine for small functions that fit into the scratch (caller-saved)
registers of the x86-64 ABI. But when you have more complex functions that need
more registers, the individual functions start saving/restoring the registers
that are supposed to be callee saved (this is simulated using inline asm in the
test case, thanks to Andrew Pinski for that trick).

You can see this in the test case: if you make SAVE_REGS/DONT_SAVE_REGS empty,
there are lots of extra push/pops on each opcode.

Now this can be changed by modifying the calling convention, as is done in the
unmodified test case. The original caller of the byte code can save all
registers and the tail called byte code functions none. LLVM has
preserve_none/most/all for this, and they are used in the field for exactly
this purpose.
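
A rough sketch of the idea using the LLVM-style preserve_none attribute where
the compiler advertises it. HANDLER_CC and the cc_* names are hypothetical;
on compilers without preserve_none the macro expands to nothing, so the code
still builds but the handlers keep the default convention. This does not
reproduce how the attached test case defines SAVE_REGS/DONT_SAVE_REGS.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __has_attribute
# if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
# endif
# if __has_attribute(preserve_none)
/* Handlers promise to preserve no registers, so they need no push/pop. */
#  define HANDLER_CC __attribute__((preserve_none))
# endif
#endif
#ifndef MUSTTAIL
# define MUSTTAIL
#endif
#ifndef HANDLER_CC
# define HANDLER_CC   /* assumption: fall back to the default convention */
#endif

struct cc_insn;
/* The convention is part of the pointer type, so indirect calls use it too. */
typedef long (HANDLER_CC *cc_handler_fn)(const struct cc_insn *pc, long acc);

struct cc_insn {
    cc_handler_fn handler;
    long operand;
};

static long HANDLER_CC cc_op_add(const struct cc_insn *pc, long acc)
{
    acc += pc->operand;
    pc++;
    MUSTTAIL return pc->handler(pc, acc);
}

static long HANDLER_CC cc_op_halt(const struct cc_insn *pc, long acc)
{
    (void)pc;
    return acc;
}

/* The entry point keeps the normal convention: it is the one place that
   saves whatever it needs around the whole chain of handlers. */
long cc_run(const struct cc_insn *prog)
{
    assert(prog != NULL);
    return prog->handler(prog, 0);
}

/* Hypothetical demo program computing 0 + 4 + 5. */
long cc_demo(void)
{
    static const struct cc_insn prog[] = {
        { cc_op_add, 4 }, { cc_op_add, 5 }, { cc_op_halt, 0 },
    };
    return cc_run(prog);
}
```

The saves happen once in cc_run instead of once per opcode, which is the
effect the modified calling convention is meant to achieve.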

When the tail called functions are not called through pointers, gcc has -fipa-ra
for static functions, which should take care of it. But unfortunately this only
works for direct calls, because for indirect calls the IPA cgraph RTL mechanism
doesn't work.
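
To illustrate the distinction (a toy sketch with made-up names, not taken from
the test case): in direct_call the callee is a known static function, so its
register usage can be taken into account at the call site, while in
indirect_call the target is only known at run time and the full ABI clobber
set has to be assumed.

```c
#include <assert.h>
#include <stddef.h>

/* noinline keeps the call visible so register allocation matters. */
static __attribute__((noinline)) long twice_plus_one(long x)
{
    return 2 * x + 1;
}

/* Direct call: x is live across a call whose callee is known, so the
   registers that callee actually clobbers can be used to decide where
   x lives. */
long direct_call(long x)
{
    return twice_plus_one(x) + x;
}

/* Indirect call: the callee is unknown here, so x must survive in a
   callee-saved register or on the stack. */
long indirect_call(long (*fn)(long), long x)
{
    assert(fn != NULL);
    return fn(x) + x;
}

/* Same computation routed through a function pointer. */
long indirect_demo(long x)
{
    return indirect_call(twice_plus_one, x);
}
```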

gcc has no_callee_saved_registers/no_caller_saved_registers, which were
originally developed for a different use case (fast interrupt handlers in an
OS) but can modify the callee register saving. The main drawback of them is
that they require -mgeneral-regs-only (as they were designed for an OS), which
makes it impossible to use floating point in the interpreter code. While this
works for the toy example, it's probably a showstopper for real interpreters.

Another problem with them is that they don't affect the caller, unlike the LLVM
attributes. Luckily for the tail call case the shrink wrapping code takes care
of this, although it's a problem if the byte code functions are called non-tail
for some reason (e.g. in the first function of the interpreter), as well as for
other use cases (e.g. using them to optimize calls to general cold functions).

gcc should:
- Support no_callee_saved_registers/no_caller_saved_registers without
-mgeneral-regs-only (there might already be bugs for this, but I'm filing it
separately to track this particular use case)
- Figure out how -fipa-ra can be made to work for indirect calls (maybe with
some type based analysis)
- Make the attributes affect the caller
- Decide whether we need an equivalent of preserve_most
- Once no_callee/caller_saved_registers work similarly to clang's attributes,
perhaps they should be aliased for compatibility.
