https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285
            Bug ID: 95285
           Summary: AArch64:aarch64 medium code model proposal
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bule1 at huawei dot com
  Target Milestone: ---

Created attachment 48584
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48584&action=edit
proposed patch

I would like to propose an implementation of the medium code model for AArch64. A prototype is attached; it passes bootstrap and the regression tests.

-mcmodel=medium is a code model that x86 supports but AArch64 currently lacks. It describes the situation where small data is addressed as in the small code model while large data is addressed as in the large code model. The official description of the medium code model is on page 34 of the x86-64 ABI document:
https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

The key difference between x86 and AArch64 is that x86 can use lea + movabs instructions to implement a dynamically relocatable large code model. Currently, the large code model on AArch64 locates a symbol with an ldr instruction, which only works with static linking. The small code model, however, uses adrp + ldr instructions, which work with dynamic linking. Therefore, the medium code model cannot be implemented simply by setting a threshold: a dynamically relocatable large code model is needed first.

I ran into this problem when compiling CESM, climate forecasting software that is widely used in the HPC field. In some configurations that manipulate large arrays, a large code model with dynamic relocation is needed. The following test case is abstracted from CESM for this scenario.
program main
  common/baz/a,b,c
  real a,b,c
  b = 1.0
  call foo()
  print*, b
end

subroutine foo()
  common/baz/a,b,c
  real a,b,c

  integer, parameter :: nx = 1024
  integer, parameter :: ny = 1024
  integer, parameter :: nz = 1024
  integer, parameter :: nf = 1
  real :: bar(nf,nx*ny*nz)
  real :: bar1(nf,nx*ny*nz)
  bar = 0.0
  bar1 = 0.0
  b = bar(1,1024*1024*100)
  b = bar1(1,1)

  return
end

Compiling with -mcmodel=small -fPIC gives the following error because of the access to the bar1 array:

test.f90:(.text+0x28): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
test.f90:(.text+0x6c): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'

Compiling with -mcmodel=large -fPIC gives an unsupported error:

f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’

As discussed at the beginning, to tackle this problem we first have to solve the problem that the large code model is static-only. My solution is to use the R_AARCH64_MOVW_PREL_Gx group relocations together with instructions that compute the current PC value.

Before change (mcmodel=small):

adrp x0, bar1.2782
add x0, x0, :lo12:bar1.2782

After change (mcmodel=medium, proposed):

movz x0, :prel_g3:bar1.2782
movk x0, :prel_g2_nc:bar1.2782
movk x0, :prel_g1_nc:bar1.2782
movk x0, :prel_g0_nc:bar1.2782
adr x1, .
sub x1, x1, 0x4
add x0, x0, x1

The first four mov instructions (one movz and three movk) compute the 64-bit offset between bar1 and the last movk instruction, which fulfils the requirement of the large code model (64-bit relocation). The adr + sub pair computes the PC address of the last movk instruction. Adding the offset to that PC address gives the address of bar1, so bar1 can be located dynamically.

Because this sequence is costly, a threshold is used to classify data by size, as on x86. The default value of the threshold is 65536, which is the maximum relocation capability of the small code model. This implementation also requires a change to the linker in binutils so that all four mov instructions compute the same PC offset, namely the one relative to the last movk instruction.
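The arithmetic of the proposed sequence can be checked with a small model. The following Python sketch is only an illustration and is not taken from the attached patch: the function names, the example addresses, and the assumption that the seven instructions sit at consecutive 4-byte offsets from a hypothetical seq_base are all mine.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit unsigned values


def movk(reg, imm16, shift):
    """Model movk: replace one 16-bit field of reg, keep the other bits."""
    return (reg & ~(0xFFFF << shift) & MASK64) | (imm16 << shift)


def materialize(sym, seq_base):
    """Model the proposed 7-instruction sequence placed at seq_base.

    sym and seq_base are hypothetical run-time addresses.
    """
    last_mov = seq_base + 12                 # address of the fourth mov
    off = (sym - last_mov) & MASK64          # 64-bit PC-relative offset
    g0, g1, g2, g3 = [(off >> (16 * g)) & 0xFFFF for g in range(4)]

    x0 = g3 << 48                            # movz x0, :prel_g3:sym
    x0 = movk(x0, g2, 32)                    # movk x0, :prel_g2_nc:sym
    x0 = movk(x0, g1, 16)                    # movk x0, :prel_g1_nc:sym
    x0 = movk(x0, g0, 0)                     # movk x0, :prel_g0_nc:sym
    x1 = seq_base + 16                       # adr  x1, .  (address of the adr)
    x1 = (x1 - 4) & MASK64                   # sub  x1, x1, 0x4 -> last movk
    return (x0 + x1) & MASK64                # add  x0, x0, x1


if __name__ == "__main__":
    # A symbol far above the code, and one below it (negative offset).
    print(hex(materialize(0x1234_5678_9000, 0x40_0000)))
    print(hex(materialize(0x1000, 0x7FFF_0000_0000)))
```

Because the offset is reduced modulo 2^64, the same sequence works whether the symbol lies above or below the code, and the result does not depend on where the image is loaded as long as code and data move together.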
The advantage of this implementation is that it uses existing relocation types to prototype a medium code model. It also has drawbacks. First, the four mov instructions and the adr instruction must be emitted in exactly this order, with no other instruction inserted into the sequence; otherwise the computed symbol address will be wrong. This might impede instruction scheduling. Second, the linker needs a corresponding change so that every mov instruction computes the same PC offset. For example, in my implementation, the first movz instruction needs 12 added to the result of ":prel_g3:bar1.2782" to make up the PC offset.

I haven't figured out a suitable solution for these problems yet. Suggestions regarding these issues are very welcome.
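The need for a common PC offset can be illustrated with a boundary case. The Python sketch below is my own model, not the prototype's linker change: it assumes each PC-relative group is computed as symbol minus place, where the place is the address of the instruction carrying the relocation, and the addresses are made up.

```python
MASK64 = (1 << 64) - 1


def group(value, g):
    """Return 16-bit group g (0 = lowest) of a 64-bit two's-complement value."""
    return ((value & MASK64) >> (16 * g)) & 0xFFFF


def assemble(sym, seq_base, shared_reference):
    """Rebuild the offset from the groups held by movz/movk/movk/movk.

    The movz (G3) sits at seq_base, followed by the movk instructions for
    G2, G1 and G0 at +4, +8 and +12.  With shared_reference=True every
    group is taken relative to the last movk; with False each group uses
    its own instruction address.
    """
    last_mov = seq_base + 12
    result = 0
    for i, g in enumerate((3, 2, 1, 0)):
        place = seq_base + 4 * i
        if shared_reference:
            place += last_mov - place        # shift by 12, 8, 4, 0 bytes
        result |= group(sym - place, g) << (16 * g)
    return result


if __name__ == "__main__":
    seq_base = 0x1000
    sym = seq_base + 12 + 0xFFFC             # true offset from last movk: 0xFFFC
    print(hex(assemble(sym, seq_base, shared_reference=True)))
    print(hex(assemble(sym, seq_base, shared_reference=False)))
```

In this case the low group comes from an offset of 0xFFFC while the next group comes from an offset 4 bytes larger, which has already carried into bit 16, so the uncorrected groups describe an address 0x10000 too high. Shifting each place to the last movk, by 12 bytes for the movz, removes the mismatch.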