The lazy binding code of aarch64 currently only preserves q0-q7 of the fp registers, but for an SVE call [AAPCS64+SVE] it should preserve p0-p3 and z0-z23, and for an AdvSIMD vector call [VABI64] it should preserve q0-q23. (Vector calls are extensions of the base PCS [AAPCS64].)

A possible fix is to save and restore the additional register state in the lazy binding entry code. This was discussed in
https://sourceware.org/ml/libc-alpha/2018-08/msg00017.html
and the main objections were:

(1) Linux may optimize the kernel entry code for processes that don't use SVE, so lazy binding should avoid accessing SVE registers.

(2) If this is fixed in the dynamic linker, vector calls will not be backward compatible with old glibc.

(3) The saved SVE register state can be large (> 8K), so binaries that work today may run out of stack space on an SVE system during lazy binding (which can happen e.g. in a signal handler on a tiny stack).

The proposed solution was to force bind-now semantics for vector functions, e.g. by not calling them via the PLT. This turned out to be harder than I expected.

I no longer think (1) and (2) are critically important, but (3) is a correctness issue that is hard to argue away: fixing it would require larger stack allocations to accommodate the worst-case stack size increase, but the stack allocation is not always under the control of glibc, so glibc cannot provide strict guarantees.

Some approaches to make symbols "bind now" were discussed at
https://groups.google.com/forum/#!topic/generic-abi/Bfb2CwX-u4M

The ABI change draft is below the notes. It requires marking symbols in the ELF symbol table that follow the vector PCS (or other variant PCS conventions). This is most relevant to dynamic linkers with lazy binding support and to ELF linkers targeting AArch64, but assemblers will need to be updated too.

Note 1: the dynamic linker may have to run user code during lazy binding because of ifunc resolvers, so it cannot avoid clobbering fp regs.

Note 2: the tlsdesc entry is also affected by (3), so either the initial DTV setup should avoid clobbering fp regs or the SVE register state should not be callee-preserved by the tlsdesc call ABI. The latter was chosen, which is backward compatible with old dynamic linkers, but TLS access from SVE code is now as expensive as an extern call: the caller has to spill.

Note 3: signal frame and SVE register spills in code using SVE can also lead to variable stack usage (AT_MINSIGSTKSZ was introduced to address the former issue on Linux), so it is a valid approach to just increase the minimum stack size limits on aarch64 compared to other targets (this is less invasive, but does not fix old binaries).

Note 4: the proposal requires marking symbols in asm and ELF objects, so it is not compatible with existing tooling (old as or ld cannot create valid vector function symbol references or definitions) and it is only effective with a new dynamic linker.

Note 5: -fno-plt style code generation for vector function calls might have worked too, but on aarch64 it requires compiler and linker changes to avoid the PLT in position-dependent code when that is emitted for the sake of pointer equality. It also requires tightening the ABI to ensure the static linker does not introduce a PLT when processing certain static relocations. This approach would generate suboptimal statically linked code (the no-plt code is hard to relax into direct calls on aarch64), and it would be fragile (it is easy to accidentally introduce a PLT) and hard to diagnose.

Note 6: the proposed solution applies to both SVE calls and AdvSIMD vector calls, even though some issues only apply to SVE.

Note 7: a separate dynamic linker entry point for variant PCS calls may be introduced (this requires further ELF changes for a PLT0-like stub), or the dynamic linker may decide to always preserve all registers, or to always bind symbols at load time.
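
To put objection (3) in numbers, here is a rough sketch of how large a register save area could get. It assumes the architectural SVE vector length range of 128 to 2048 bits, Z registers of one vector length each and P registers of one eighth of that. The helper names (sve_save_area_bytes, advsimd_save_area_bytes) are made up for illustration, and no real dynamic linker is claimed to lay out its save area this way.

```c
#include <assert.h>
#include <stddef.h>

/* Architectural limits for the SVE vector length, in bytes (128..2048 bits). */
#define SVE_VL_MIN_BYTES 16
#define SVE_VL_MAX_BYTES 256

/* Hypothetical helper: bytes needed to spill nz Z registers and np P
   registers at a given vector length (ignores alignment padding). */
static size_t sve_save_area_bytes(size_t vl_bytes, unsigned nz, unsigned np)
{
    /* SVE vector lengths are multiples of 128 bits within the limits. */
    assert(vl_bytes >= SVE_VL_MIN_BYTES && vl_bytes <= SVE_VL_MAX_BYTES);
    assert(vl_bytes % 16 == 0);

    size_t z_bytes = vl_bytes;      /* one Z register holds VL bits   */
    size_t p_bytes = vl_bytes / 8;  /* one P register holds VL/8 bits */
    return nz * z_bytes + np * p_bytes;
}

/* Hypothetical helper: bytes needed to spill nq AdvSIMD Q registers,
   which are always 128 bits (16 bytes) each. */
static size_t advsimd_save_area_bytes(unsigned nq)
{
    return (size_t)nq * 16;
}
```

Under these assumptions, saving only z0-z23 and p0-p3 at the maximum vector length takes 24*256 + 4*32 = 6272 bytes, and saving all of z0-z31 and p0-p15 takes 32*256 + 16*32 = 8704 bytes, compared with 8*16 = 128 bytes for the q0-q7 that are preserved today.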

AAELF64: in the Symbol Table section add

st_other Values

The st_other member of a symbol table entry specifies the symbol's visibility in the lowest 2 bits. The top 6 bits are unused in the generic ELF ABI [SCO-ELF], and while there are no values reserved for processor-specific semantics, many other architectures have used these bits.

The defined processor-specific st_other flag values are listed in Table 4-5-1.

Table 4-5-1, Processor specific st_other flags
+------------------------+------+---------------------+
|Name                    | Mask | Comment             |
+------------------------+------+---------------------+
|STO_AARCH64_VARIANT_PCS | 0x80 | The function        |
|                        |      | associated with the |
|                        |      | symbol may follow a |
|                        |      | variant procedure   |
|                        |      | call standard with  |
|                        |      | different register  |
|                        |      | usage convention.   |
+------------------------+------+---------------------+

A symbol table entry that is marked with the STO_AARCH64_VARIANT_PCS flag set in its st_other field may be associated with a function that follows a variant procedure call standard with different register usage convention from the one defined in the base procedure call standard for the list of argument, caller-saved and callee-saved registers [AAPCS64]. The rules in the Call and Jump relocations section still apply to such functions, and if a subroutine is called via a symbol reference that is marked with STO_AARCH64_VARIANT_PCS then code that runs between the calling routine and called subroutine must preserve the contents of all registers except IP0, IP1 and the condition code flags [AAPCS64].

Static linkers must preserve the marking and propagate it to the dynamic symbol table if any reference or definition of the symbol is marked with STO_AARCH64_VARIANT_PCS, and add a DT_AARCH64_VARIANT_PCS dynamic tag if required by the Dynamic Section section.
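
As an illustration of the st_other layout and the propagation rule above, a minimal C sketch follows. It assumes a Linux-style <elf.h> that provides Elf64_Sym and ELF64_ST_VISIBILITY; the flag value is taken from Table 4-5-1 and defined locally in case the header does not have it yet. The function names are hypothetical, and visibility merging (which a real static linker also has to do) is deliberately left out.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>

/* Value from Table 4-5-1; older headers may not define it. */
#ifndef STO_AARCH64_VARIANT_PCS
#define STO_AARCH64_VARIANT_PCS 0x80
#endif

/* The flag must not overlap the two visibility bits of st_other. */
static_assert((STO_AARCH64_VARIANT_PCS & 0x03) == 0,
              "flag overlaps the visibility bits");

/* True if the symbol may follow a variant procedure call standard. */
static bool sym_is_variant_pcs(const Elf64_Sym *sym)
{
    return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
}

/* Symbol visibility, stored in the lowest 2 bits of st_other. */
static unsigned sym_visibility(const Elf64_Sym *sym)
{
    return ELF64_ST_VISIBILITY(sym->st_other);
}

/* Hypothetical static linker step: the output symbol keeps its own
   st_other bits and additionally gets the marking if the incoming
   reference or definition carries it. */
static unsigned char propagate_variant_pcs(unsigned char out_st_other,
                                           unsigned char in_st_other)
{
    return out_st_other | (in_st_other & STO_AARCH64_VARIANT_PCS);
}
```

Because the marking is a single bit, "any reference or definition is marked" reduces to OR-ing the flag across all occurrences of the symbol.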

NOTE: In particular, when a call is made via the PLT entry of a symbol marked with STO_AARCH64_VARIANT_PCS, a dynamic linker cannot assume that the call follows the register usage convention of the base procedure call standard.

An example of a function that follows a variant procedure call standard with different register usage convention is one that takes parameters in scalable vector or predicate registers.

AAELF64: in the Dynamic Section section add

Table 5-4, AArch64 specific dynamic array tags
+-----------------------+------------+-------+------------+---------------+
|Name                   | Value      | d_un  | Executable | Shared Object |
+-----------------------+------------+-------+------------+---------------+
|DT_AARCH64_VARIANT_PCS | 0x70000005 | d_val | Platform   | Platform      |
|                       |            |       | specific   | specific      |
+-----------------------+------------+-------+------------+---------------+

DT_AARCH64_VARIANT_PCS must be present if there are R_<CLS>_JUMP_SLOT relocations that reference symbols marked with the STO_AARCH64_VARIANT_PCS flag set in their st_other field.

VABI64: after the Vector Procedure Call Standard section add

Dynamic linking for AAVPCS

On ELF platforms with dynamic linking support, symbol definitions and references must be marked with the STO_AARCH64_VARIANT_PCS flag set in their st_other field if the following holds:

1. the symbol is visible outside of its defining component (executable file or shared object), and
2. the symbol is associated with a function following the AAVPCS convention.

For more information on STO_AARCH64_VARIANT_PCS, see AAELF64.

NOTE: Marking all function symbol definitions and references is a valid way of implementing this requirement.
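
Coming back to the DT_AARCH64_VARIANT_PCS tag: one way a dynamic linker could use it is as a cheap per-module check before looking at individual symbols, binding only the marked JUMP_SLOT entries at load time (one of the options in Note 7). The C sketch below illustrates that idea only. It assumes a Linux-style <elf.h>, defines the new constants locally when the header lacks them, and uses hypothetical function names; it is not taken from any existing dynamic linker.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/* Values from the draft tables; older headers may not define them. */
#ifndef DT_AARCH64_VARIANT_PCS
#define DT_AARCH64_VARIANT_PCS 0x70000005
#endif
#ifndef STO_AARCH64_VARIANT_PCS
#define STO_AARCH64_VARIANT_PCS 0x80
#endif
#ifndef R_AARCH64_JUMP_SLOT
#define R_AARCH64_JUMP_SLOT 1026
#endif

static_assert(DT_AARCH64_VARIANT_PCS == 0x70000005,
              "header disagrees with Table 5-4");

/* Scan a DT_NULL-terminated dynamic array for the new tag. */
static bool has_variant_pcs_tag(const Elf64_Dyn *dyn)
{
    for (; dyn->d_tag != DT_NULL; dyn++)
        if (dyn->d_tag == DT_AARCH64_VARIANT_PCS)
            return true;
    return false;
}

/* Count the JUMP_SLOT relocations that would have to be bound at load
   time because their symbol is marked. Without the tag the per-symbol
   check is skipped, since the tag must be present whenever such
   relocations exist. */
static size_t count_bind_now_slots(const Elf64_Rela *rela, size_t nrela,
                                   const Elf64_Sym *symtab, bool has_tag)
{
    if (!has_tag)
        return 0;

    size_t n = 0;
    for (size_t i = 0; i < nrela; i++) {
        if (ELF64_R_TYPE(rela[i].r_info) != R_AARCH64_JUMP_SLOT)
            continue;
        const Elf64_Sym *sym = &symtab[ELF64_R_SYM(rela[i].r_info)];
        if (sym->st_other & STO_AARCH64_VARIANT_PCS)
            n++;
    }
    return n;
}
```

A real implementation would resolve each counted slot instead of counting it; the point of the tag is that modules without variant PCS calls pay nothing beyond the dynamic array scan.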

[AAELF64]: ELF for the Arm 64-bit Architecture (AArch64)
https://developer.arm.com/docs/ihi0056/latest

[VABI64]: Vector Function ABI Specification for AArch64
https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-compiler-for-hpc/vector-function-abi

[AAPCS64]: Procedure Call Standard for the Arm 64-bit Architecture (AArch64)
https://developer.arm.com/docs/ihi0055/latest

[AAPCS64+SVE]: Procedure Call Standard for the ARM 64-bit Architecture (AArch64) with SVE support
https://developer.arm.com/docs/100986/latest

[SCO-ELF]: System V Application Binary Interface
http://www.sco.com/developers/gabi/