[Bug target/100799] Stackoverflow in optimized code on PPC

jskumari at gcc dot gnu.org via Gcc-bugs Mon, 17 Oct 2022 01:18:11 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799


--- Comment #17 from Surya Kumari Jangala <jskumari at gcc dot gnu.org> ---
I analysed the reduced test case specified in comment 15. In the .s file, the
callee decrements r1 by 224, ie, callee’s frame size is 224. But there is an
instruction in the callee that accesses into the caller’s frame at (r1+272).
At first glance this looks odd, even incorrect, but after further analysis, I
am not sure if this is incorrect.
If we look at the RTL dumps, the offset 272 is introduced in ‘reload’. ‘Insn 4’
stores into (r1+272). 

‘Insn 4’ after vregs:

(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
        (reg:DI 5 5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
        (nil)))


‘Insn 4’ after IRA:

(insn 4 214 237 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
        (reg:DI 262)) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_DEAD (reg:DI 262)
        (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                    (const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
            (nil))))

‘Insn 4’ after reload:

(insn 4 214 19 2 (set (mem/f/c:DI (plus:DI (reg/f:DI 1 1)
                (const_int 272 [0x110])) [3 arrayD.2714+0 S8 A64])
        (reg:DI 5 5 [262])) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
        (nil)))


As we can see, during vregs phase, we are moving r5 to r177 and r177 is equiv
to (ap+48). ‘ap’ (r99) is the base register for access to arguments of the
function.

In the gcc code:
#define ARG_POINTER_REGNUM 99

During vregs phase, not just r5, but all registers from r3-r10 are moved to
pseudo registers and these pseudo regs are equivalent to (ap+’offset’) with
‘offset’ starting from 32 for r3 and going on till 88 for r10. Note that ap
points to the beginning of the callee frame, hence to access the parameter save
area of the caller’s frame, 32 needs to be added to ap.

During LRA, in curr_insn_transform(), we make equivalence substitution and
change r177 to r1+272. (272 because r177 is equivalent to ap+48, and ap equals
r1+224, so ap+48 = r1+272). 

The argument registers r3-r10 are saved as they need to be reused to pass
parameters to functions called from the callee. But not all parameter registers
are spilled to the stack. For example, r6 is saved in r24. We can see this
after the “final” phase:

(insn 5 289 19 (set (reg/v/f:DI 24 %r24 [orig:178 ldaD.2715 ] [178])
        (reg:DI 6 %r6 [263])) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 56 [0x38])) [6 ldaD.2715+0 S8 A64])
        (nil)))

I guess r5 had to be spilled to stack because there were no free registers.

Also, note that there is a load from (r1+272) in the reduced test case. This
shows that the value in r5 is needed, and hence it has to be saved somewhere.

I ran the test case with the options: -mcpu=power8 -O2 -fPIC

If -fPIC option is removed, we do not see any access to the caller’s frame in
the generated assembly. But it does have instructions that save the parameter
registers into other registers. I suppose the parameter registers did not have
to be saved on stack (ie, in the caller’s parameter save area) because there
were enough registers available. That is, perhaps there is lesser register
pressure without -fPIC.

After vregs:
(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
        (reg:DI 5 %r5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])

After reload:
(insn 4 214 19 2 (set (reg/v/f:DI 17 %r17 [orig:177 arrayD.2714 ] [177])
        (reg:DI 5 %r5 [262])) "bug.f":1:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
        (nil)))


To summarise, the reduced testcase seems to be correctly compiled. So I shifted
my focus to the original fortran file dgebal.f in the openBLAS library.


In dgebal.f too we have some instructions accessing the caller’s parameter save
area. These are the interesting snippets of instructions from the assembly
code: 

   // The original contents of r23 are spilled.
std %r23,-192(%r1)
   // r3 is saved in r23
mr %r23,%r3
   // frame is allocated
stdu %r1,-400(%r1)

  // restore r3 contents before making call to lsame_. There are several calls
to lsame_ and 
  // each time, r3 is restored.
mr %r3,%r23
bl lsame_

   // save r23 to the stack because we are running out of registers and we need
a free reg.
   // Note that we are saving to the caller’s frame into the parameter save
area. And we 
   // are saving to (400+32) which is the
   // location that r3 would have been spilled. This is correct because r23
holds the contents of r3.
std %r23,432(%r1)
   // Use r23
li %r23,1
cmpwi %cr0,%r23,2

   // Load back r23 as we need to pass parameter to lsame_
ld %r23,432(%r1)
mr %r3,%r23
bl lsame_

   // Epilogue: restore r1 and the original contents of r23.
addi %r1,%r1,400
ld %r23,-192(%r1)
blr

The snippets of assembly code above are for r3 being saved in r23. There are
other parameter registers too being saved like for example, r10 is copied to
r30 which is then later spilled into the caller’s parameter save area at
(r1+488). 488=400+32+56 = 400+32+8*7, and this is the location for r10.

In the rtl dump, after vregs phase, we can see registers r3 to r10 being saved
to pseudo registers.

After vregs phase:
(r3 saved to pseudo r303 which is equiv to ap+32)

(insn 2 43 3 2 (set (reg/v/f:DI 303 [ jobD.2712 ])
        (reg:DI 3 %r3 [ jobD.2712 ])) "dgebal.f":2:23 675 {*movdi_internal64}
     (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
                (const_int 32 [0x20])) [4 jobD.2712+0 S8 A64])

After reload: 
(r303 is assigned to r23 and it is spilled at r1+432).

(insn 1620 931 1627 22 (set (mem/f/c:DI (plus:DI (reg/f:DI 1 %r1)
                (const_int 432 [0x1b0])) [4 jobD.2712+0 S8 A64])
        (reg/v/f:DI 23 %r23 [orig:303 jobD.2712 ] [303])) 675
{*movdi_internal64}
     (nil))


>From the description of the bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799#c0), the issue is
occurring when FlexiBLAS is used with OpenBLAS.  And the issue is that when the
fortran routine DGEEV calls another fortran routine DGEBAL, the second
parameter (’N’) gets corrupted when control returns back to DGEEV. (DGEBAL and
DGEEV are routines in openBLAS). When FlexiBLAS is used, any call from an
openBLAS routine to another openBLAS routine goes thru flexiBLAS. So the call
to DGEBAL goes thru flexiblas first. FlexiBLAS is a wrapper library and it
contains a wrapper for dgebal. FlexiBLAS is written in C while openBLAS is a
fortran library. There is a wrapper for DGEBAL in flexiblas which reroutes the
call to DGEBAL in openBLAS. My suspicion is that the wrapper routine written in
C does not allocate the optional parameter save area. I tried compiling the
wrapper routine for dgebal with -O2 -fPIC and with these options, the frame
size is only 32; the parameter save area is not being allocated. And I think
this is resulting in corrupting contents of DGEEV’s stack when the fortran
routine DGEBAL writes into the caller’s parameter save area. I am not sure with
what options flexiBLAS is built, but I suspect we do not allocate parameter
save area irrespective of the options used. 

I wonder if saving the parameter registers r3-r10 to the parameter save area of
caller’s frame is specific to Fortran. In C, looks like these registers are
being saved in the callee frame itself.

[Bug target/100799] Stackoverflow in optimized code on PPC

Reply via email to