I'll switch to replying on PR61538. I had not read the whole ticket previously and, although I may have found a problem, it seems it may not be the cause of this failure.
The generated code differences after the patches seem significant but I may not get a chance to look at the differences in detail for a little while.

Matthew

> -----Original Message-----
> From: Joshua Kinard [mailto:ku...@gentoo.org]
> Sent: 28 July 2014 10:40
> To: Matthew Fortune; gcc@gcc.gnu.org
> Subject: Re: Help w/ PR61538?
>
> On 07/28/2014 04:41, Matthew Fortune wrote:
> > Hi Joshua,
> >
> > I know very little about this area but I'll try and offer some advice
> > anyway...
> >
>
> You know more than I do :)
>
> >
> >> On 07/05/2014 23:43, Joshua Kinard wrote:
> >>> Hi,
> >>>
> >>> I filed PR61538 about two weeks ago, regarding gcc-4.8.x and up not
> >>> compiling a g++/pthreads-linked app correctly on SGI R1x000-based
> >>> systems (Octane, Onyx2), running Linux. Running the
> >>> subsequently-compiled application simply hangs in a futex syscall
> >>> until terminated via Ctrl+C. I suspect it's a double-locking bug of
> >>> some design, as evidenced by strace showing two consecutive
> >>> syscall()'s w/ 0x108e passed as the syscall # (4238 or futex on o32
> >>> MIPS), but I am stumped as to what else I can do to debug it and help
> >>> fix it.
> >>>
> >> [snip]
> >>> Full details:
> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538
> >>
> >> So I've spent the last few weeks bisecting the gcc tree, and I've
> >> narrowed down the set of commits that appear to have introduced this
> >> problem:
> >>
> >> 1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0 * config/mips/mips.c
> >> (mips_emit_pre_atomic_barrier_p,)
> >
> > This is the prime candidate for introducing the issue.
>
> This is my guess, too. However, it appears to tie in w/ the fourth commit
> because the new mips_emit_{pre,post}_atomic_barrier_p functions added in
> commit 39a8c5ea are removed by commit 30c3c442 a mere ~7 minutes later
> (which I find really odd). Commit 974f0a74 is really the only one that
> seems innocent, but I suspect the other three are linked. If mkuvyrkov is
> still around, perhaps he could explain better?
>
> >
> >> 2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c * config/mips/constraints.md
> >> (ZR): New constraint.
> >
> > Unlikely
> >
> >> 3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953 * config/mips/mips.c
> >> (mips_process_sync_loop): Emit cmp result only if
> >
> > Possible but unlikely still
> >
> >> 4. 30c3c4427521f96fb58b6e1debb86da4f113f06f * emit-rtl.c
> >> (need_atomic_barrier_p): New function.
> >
> > Seems unlikely
> >
> >>
> >> There's a build failure somewhere in the middle of there that is
> >> blocking me from figuring out which specific one is the cause, but they
> >> all appear to be related anyways. All four were added on 2012-06-20.
> >>
> >> When I took a git checkout from 2012-06-26 and reverted those four
> >> commits, I was able to compile glibc-2.19 and get a working "sln"
> >> binary. I am unable to easily test the C++ side because I built the
> >> checkouts in my $HOME, and it's too risky to try and shoehorn one of
> >> them in as the system compiler. However, I think the C++ issue is also
> >> fixed by reverting the four, as that also involved hanging in Linux
> >> futex syscalls.
> >
> > Here is a wild guess at the problem... I think the workaround for R10000
> > to use branch likely instead of delay slot branches is ending up
> > annulling an instruction that is required for certain atomic operations.
> > This is an entirely untested theory (and patch) but can you see if this
> > fixes the issue you are seeing:
>
> Well, the branch-likely thing really only affects a specific revision of the
> R10000 processors. Later R10000 revisions (3.1+?) and R12000-R16000
> shouldn't be affected. I've been playing with disabling that specific
> workaround on my Octane's kernel and haven't seen any ill effects yet.
> Though, I haven't tried rebuilding the userland w/ -mno-fix-r10000 just yet.
>
> If you want, you can take a look at some of the additional info in the
> corresponding Gentoo bug that tracks PR61538:
>
> https://bugs.gentoo.org/show_bug.cgi?id=516548
>
> I have a gdb run (comment #5) of the several instructions in
> __lll_lock_wait_private, including register values, as each instruction
> executes. The hang happens after taking the futex syscall, t0-t3 get set to
> 0x0, and the following "ll v0,0(s0)" is what hangs. In gcc-4.7 and earlier,
> that 'll' is actually "li v0,2", though control never passes into
> __lll_lock_wait_private in the first place.
>
> There's also a PNG attached to that bug of the disassembled asm in WinMerge
> that shows what insns actually changed. Someone who understands MIPS asm
> ordering might be able to make something of that.
>
>
> > @@ -13014,7 +13023,8 @@ mips_process_sync_loop (rtx insn, rtx *operands)
> >        mips_multi_copy_insn (tmp3_insn);
> >        mips_multi_set_operand (mips_multi_last_index (), 0, newval);
> >      }
> > -  else if (!(required_oldval && cmp))
> > +  else if (!(required_oldval && cmp)
> > +           || mips_branch_likely)
> >      mips_multi_add_insn ("nop", NULL);
> >
> >    /* CMP = 1 -- either standalone or in a delay slot. */
> >
> > I suspect I can weave that in more naturally but can you tell me if that
> > fixes the problem first.
>
> Testing a fix takes about 7.5hrs to rebuild, plus another 3.5 to rebuild
> glibc. So I am a bit hesitant to task the machine to do that w/o having a
> better idea if that solves it or not. Technically, shouldn't passing
> -mno-fix-r10000 have a similar effect by causing branch-likely insns to not
> get emitted at all?
>
> Thanks!,
>
> --
> Joshua Kinard
> Gentoo/MIPS
> ku...@gentoo.org
> 4096R/D25D95E3 2011-03-28
>
> "The past tempts us, the present confuses us, the future frightens us. And
> our lives slip away, moment by moment, lost in that vast, terrible
> in-between."
>
> --Emperor Turhan, Centauri Republic