Help w/ PR61538?

2014-07-05 Thread Joshua Kinard
Hi,

I filed PR61538 about two weeks ago, regarding gcc-4.8.x and up not
compiling a g++/pthreads-linked app correctly on SGI R1x000-based systems
(Octane, Onyx2), running Linux.  Running the subsequently-compiled
application simply hangs in a futex syscall until terminated via Ctrl+C.  I
suspect it's a double-locking bug of some design, as evidenced by strace
showing two consecutive syscall()'s w/ 0x108e passed as the syscall # (4238
or futex on o32 MIPS), but I am stumped as to what else I can do to debug it
and help fix it.
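
As a quick sanity check on the number seen in strace: 0x108e is 4238 in decimal, and o32 syscall numbers on Linux/MIPS start at 4000, which puts futex at offset 238 in the o32 table. A minimal Python sketch of that arithmetic follows; the constant and function names are made up for illustration.

```python
# o32 syscall numbers on Linux/MIPS are offset from a base of 4000.
O32_SYSCALL_BASE = 4000
# Offset of futex within the o32 table (4238 - 4000).
FUTEX_OFFSET = 238


def o32_syscall_number(offset):
    """Return the absolute o32 syscall number for a table offset."""
    return O32_SYSCALL_BASE + offset


nr = o32_syscall_number(FUTEX_OFFSET)
print(nr, hex(nr))  # prints: 4238 0x108e
```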

I haven't fully determined if the flaw originates in gcc, glibc, or even the
kernel.  I picked gcc for now because gcc-4.7.x and earlier do not exhibit
the problem.  So for now, I am stuck on using gcc-4.7.x on these systems
until the problem is located and fixed.

Full details:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538

Thanks!,

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic


Re: Help w/ PR61538?

2014-07-27 Thread Joshua Kinard
On 07/05/2014 23:43, Joshua Kinard wrote:
> Hi,
> 
> I filed PR61538 about two weeks ago, regarding gcc-4.8.x and up not
> compiling a g++/pthreads-linked app correctly on SGI R1x000-based systems
> (Octane, Onyx2), running Linux.  Running the subsequently-compiled
> application simply hangs in a futex syscall until terminated via Ctrl+C.  I
> suspect it's a double-locking bug of some design, as evidenced by strace
> showing two consecutive syscall()'s w/ 0x108e passed as the syscall # (4238
> or futex on o32 MIPS), but I am stumped as to what else I can do to debug it
> and help fix it.
> 
[snip]
> Full details:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538

So I've spent the last few weeks bisecting the gcc tree, and I've narrowed
down the set of commits that appear to have introduced this problem:

1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0  * config/mips/mips.c
(mips_emit_pre_atomic_barrier_p,)
2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c  * config/mips/constraints.md
(ZR): New constraint.
3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953  * config/mips/mips.c
(mips_process_sync_loop): Emit cmp result only if
4. 30c3c4427521f96fb58b6e1debb86da4f113f06f  * emit-rtl.c
(need_atomic_barrier_p): New function.

There's a build failure somewhere in the middle of there that is blocking me
from figuring out which specific one is the cause, but they all appear to be
related anyways.  All four were added on 2012-06-20.

When I took a git checkout from 2012-06-26 and reverted those four commits,
I was able to compile glibc-2.19 and get a working "sln" binary.  I am
unable to easily test the C++ side because I built the checkouts in my
$HOME, and it's too risky to try and shoehorn one of them in as the system
compiler.  However, I think the C++ issue is also fixed by reverting the
four, as that also involved hanging in Linux futex syscalls.

Obviously, reverting these four commits is not an option for gcc
releases, as over the last two years, a lot of code has been added that uses
some of the new bits (like the ZR constraint).  So do any of the gcc MIPS
people have an idea what in these four commits could possibly be breaking
R1x000-series CPUs on SGI systems under gcc-4.8 and gcc-4.9, so a proper
patch can be made?

Thanks!

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
4096R/D25D95E3 2011-03-28


Re: Help w/ PR61538?

2014-07-28 Thread Joshua Kinard
On 07/28/2014 04:41, Matthew Fortune wrote:
> Hi Joshua,
> 
> I know very little about this area but I'll try and offer some advice 
> anyway...
> 

You know more than I do :)


>> On 07/05/2014 23:43, Joshua Kinard wrote:
>>> Hi,
>>>
>>> I filed PR61538 about two weeks ago, regarding gcc-4.8.x and up not
>>> compiling a g++/pthreads-linked app correctly on SGI R1x000-based systems
>>> (Octane, Onyx2), running Linux.  Running the subsequently-compiled
>>> application simply hangs in a futex syscall until terminated via Ctrl+C.  I
>>> suspect it's a double-locking bug of some design, as evidenced by strace
>>> showing two consecutive syscall()'s w/ 0x108e passed as the syscall # (4238
>>> or futex on o32 MIPS), but I am stumped as to what else I can do to debug it
>>> and help fix it.
>>>
>> [snip]
>>> Full details:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538
>>
>> So I've spent the last few weeks bisecting the gcc tree, and I've narrowed
>> down the set of commits that appear to have introduced this problem:
>>
>> 1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0  * config/mips/mips.c
>> (mips_emit_pre_atomic_barrier_p,)
> 
> This is the prime candidate for introducing the issue.

This is my guess, too.  However, it appears to tie in w/ the fourth commit
because the new mips_emit_{pre,post}_atomic_barrier_p functions added in
commit 39a8c5ea are removed by commit 30c3c442 a mere ~7 minutes later
(which I find really odd).  Commit 974f0a74 is really the only one that
seems innocent, but I suspect the other three are linked.  If mkuvyrkov is
still around, perhaps he could explain better?


>> 2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c  * config/mips/constraints.md
>> (ZR): New constraint.
> 
> Unlikely
> 
>> 3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953  * config/mips/mips.c
>> (mips_process_sync_loop): Emit cmp result only if
> 
> Possible but unlikely still
> 
>> 4. 30c3c4427521f96fb58b6e1debb86da4f113f06f  * emit-rtl.c
>> (need_atomic_barrier_p): New function.
> 
> Seems unlikely
> 
>>
>> There's a build failure somewhere in the middle of there that is blocking me
>> from figuring out which specific one is the cause, but they all appear to be
>> related anyways.  All four were added on 2012-06-20.
>>
>> When I took a git checkout from 2012-06-26 and reverted those four commits,
>> I was able to compile glibc-2.19 and get a working "sln" binary.  I am
>> unable to easily test the C++ side because I built the checkouts in my
>> $HOME, and it's too risky to try and shoehorn one of them in as the system
>> compiler.  However, I think the C++ issue is also fixed by reverting the
>> four, as that also involved hanging in Linux futex syscalls.
> 
> Here is a wild guess at the problem... I think the workaround for R10000 to
> use branch likely instead of delay slot branches is ending up annulling
> an instruction that is required for certain atomic operations. This is an
> entirely untested theory (and patch) but can you see if this fixes the issue
> you are seeing:

Well, the branch-likely thing really only affects a specific revision of the
R10000 processors.  Later R10000 revisions (3.1+?) and R12000-R16000
shouldn't be affected.  I've been playing with disabling that specific
workaround on my Octane's kernel and haven't seen any ill effects yet.
Though, I haven't tried rebuilding the userland w/ -mno-fix-r10000 just yet.

If you want, you can take a look at some of the additional info in the
corresponding Gentoo bug that tracks PR61538:

https://bugs.gentoo.org/show_bug.cgi?id=516548

I have a gdb run (comment #5) stepping through several instructions in
__lll_lock_wait_private, including register values as each instruction
executes.  The hang happens after the futex syscall: t0-t3 get set to 0x0,
and the following "ll v0,0(s0)" is what hangs.  In gcc-4.7 and earlier,
that 'll' is actually "li v0,2", though control never passes into
__lll_lock_wait_private in the first place.

There's also a PNG attached to that bug of the disassembled asm in WinMerge
that shows which insns actually changed.  Someone who understands MIPS asm
ordering might be able to make something of that.


> @@ -13014,7 +13023,8 @@ mips_process_sync_loop (rtx insn, rtx *operands)
>        mips_multi_copy_insn (tmp3_insn);
>        mips_multi_set_operand (mips_multi_last_index (), 0, newval);
>      }
> -  else if (!(required_oldval && cmp))
> +  else if (!(required_oldval && cmp)
> +           || mips_branch_likely)
>      mips_multi_add_in

Re: Help w/ PR61538?

2014-08-06 Thread Joshua Kinard
On 07/28/2014 17:38, Matthew Fortune wrote:
> I'll switch to replying on PR61538. I had not read all the ticket
> previously and although I may have found a problem it seems it may not
> be the cause of this failure.
> 
> The generated code differences after the patches seem significant but
> I may not get chance to look at the differences in detail for a little
> while.


For my own information, what's the cutoff date for fixes to regressions like
this to make it into gcc-4.9.1?

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
4096R/D25D95E3 2011-03-28


Re: GCC 4.8.4 Status Report (2014-12-05)

2014-12-11 Thread Joshua Kinard
On 12/05/2014 04:18, Jakub Jelinek wrote:
> Status
> ======
> 
> It is time for another 4.8 release, I'd like to create 4.8.4 release
> candidate at the end of the next week and if all goes well, 4.8.4 release
> a week after that.  If you have any safe fixes you'd like to be backported,
> please do so soon, and if there are any known issues on the branch, please
> make sure they are reported in bugzilla and let us RMs know about those.
> 
> 
> Quality Data
> ============
> 
> Priority          #   Change from last report
> --------        ---   -----------------------
> P1                0
> P2               95   +   3
> P3               45   +   2
> --------        ---   -----------------------
> Total           140   +   5
> 
> 
> Previous Report
> ===============
> 
> https://gcc.gnu.org/ml/gcc/2014-05/msg00263.html

PR61538 could use a look by some of the MIPS folks.  I don't think it affects
newer MIPS chips, but it'll definitely cause problems for anyone running old
R10K/R12K/R14K SGI gear (Origin/Onyx2/Octane).  gcc-4.7.4 is the last working
version on those platforms under Linux.  Last version I checked was a gcc-4.9.2
git checkout, and it's still affected.

gcc bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538

Gentoo tracking bug:
https://bugs.gentoo.org/show_bug.cgi?id=516548


Thanks!,

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
4096R/D25D95E3 2011-03-28


Odd gcc-6.3.0 code generation on mips64 platform causing kernel Oops

2017-01-23 Thread Joshua Kinard
Hi,

I am trying to use gcc-6.3.0 to cross-compile a kernel for an old mips64
platform, an SGI Onyx2 ("IP27"), however, it looks like a large number of
functions within the compiled code are getting a common instruction emitted at
the top of the function that breaks this particular machine.

Doing a disassembly of the kernel binary, this is what the beginning of several
functions looks like:

a80000000001c400 :
a80000000001c400:   ffa0bff0    sd      zero,-16400(sp)
a80000000001c404:   67bdfff0    daddiu  sp,sp,-16

a80000000001cea0 :
a80000000001cea0:   ffa0bfe0    sd      zero,-16416(sp)
a80000000001cea4:   67bdffe0    daddiu  sp,sp,-32

a80000000001c5b0 :
a80000000001c5b0:   ffa0bf90    sd      zero,-16496(sp)
a80000000001c5b4:   3c05a800    lui     a1,0xa800
a80000000001c5b8:   3c020074    lui     v0,0x74
a80000000001c5bc:   64a50000    daddiu  a1,a1,0
a80000000001c5c0:   64424840    daddiu  v0,v0,18496
a80000000001c5c4:   0005283c    dsll32  a1,a1,0x0
a80000000001c5c8:   67bdff90    daddiu  sp,sp,-112

If I compare this output against the disassembly of the same kernel tree built
with gcc-5.4.0, I see this:

a80000000001c400 :
a80000000001c400:   67bdfff0    daddiu  sp,sp,-16
a80000000001c404:   3c02007b    lui     v0,0x7b

a80000000001cec0 :
a80000000001cec0:   67bdffe0    daddiu  sp,sp,-32
a80000000001cec4:   ffb00000    sd      s0,0(sp)

a80000000001c5a0 :
a80000000001c5a0:   3c05a800    lui     a1,0xa800
a80000000001c5a4:   3c020075    lui     v0,0x75
a80000000001c5a8:   64a50000    daddiu  a1,a1,0
a80000000001c5ac:   64423f40    daddiu  v0,v0,16192
a80000000001c5b0:   0005283c    dsll32  a1,a1,0x0
a80000000001c5b4:   67bdff90    daddiu  sp,sp,-112
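
The extra leading word in the gcc-6.3.0 listing can be decoded by hand from the standard MIPS I-type layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit signed immediate). A minimal Python sketch follows; the helper names and the small opcode and register tables are illustrative only and cover just what appears in these listings.

```python
# Opcodes and register numbers limited to those seen in the listings.
OPCODES = {0x3F: "sd", 0x19: "daddiu", 0x0F: "lui"}
REGS = {0: "zero", 2: "v0", 5: "a1", 16: "s0", 29: "sp"}


def decode_itype(word):
    """Split a 32-bit I-type word into (opcode, rs, rt, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


def reg(n):
    """Name a register, falling back to $N for ones not in the table."""
    return REGS.get(n, "$%d" % n)


def disassemble(word):
    """Render the few I-type instructions handled here in objdump style."""
    opcode, rs, rt, imm = decode_itype(word)
    name = OPCODES.get(opcode)
    if name == "sd":
        return "sd %s,%d(%s)" % (reg(rt), imm, reg(rs))
    if name == "daddiu":
        return "daddiu %s,%s,%d" % (reg(rt), reg(rs), imm)
    if name == "lui":
        return "lui %s,0x%x" % (reg(rt), imm & 0xFFFF)
    return ".word 0x%08x" % word   # anything else is left undecoded


for w in (0xFFA0BFF0, 0x67BDFFF0, 0x3C05A800):
    print("%08x  %s" % (w, disassemble(w)))
```

For 0xffa0bff0 this gives opcode 0x3f (sd), base register 29 (sp), source register 0 (zero), and immediate -16400, i.e. a store of zero well below the current stack pointer.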


I am not sure what this lone store-doubleword instruction is exactly doing, nor
can I locate where in the gcc MIPS code it is being generated from.  On the
IP27 platform, it breaks the '_raw_spin_lock_irq' function, which is a
hard-coded block of assembly code in the kernel at
arch/mips/include/asm/spinlock.h, taking the 'else' branch of the if clause:

https://git.linux-mips.org/cgit/ralf/linux.git/tree/arch/mips/include/asm/spinlock.h

This "sd" instruction triggers an attempted NULL pointer dereference when
processing the load-linked instruction at the top of the assembly, which
crashes the kernel early in the boot process (after checking for the 'daddi'
bug).

I have another SGI platform, an Octane ("IP30"), that is architecturally similar
to an IP27, and it is unaffected by the presence of this instruction.  IP27 is
a NUMA system, however, while IP30 is not.  I suspect this contributes to
the issue.

However, I need to know what this "sd" instruction's purpose at the beginning
of each function is, and where in gcc's source it's located so I can see if
this is something fixable within the Linux kernel in the IP27-specific code, or
if it's a code-generation bug.

Thanks!,

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943


Re: Odd gcc-6.3.0 code generation on mips64 platform causing kernel Oops

2017-01-23 Thread Joshua Kinard
On 01/23/2017 10:34, Andrew Haley wrote:
> On 23/01/17 15:26, Joshua Kinard wrote:
>> I am not sure what this lone store-doubleword instruction is exactly doing, 
>> nor
>> can I locate where in the gcc MIPS code it is being generated from. 
> 
> It's a stack probe, making sure that there is enough stack space.  Its
> only purpose is to provide a SEGV if there is not enough kernel stack.
> 
> Look for `-fstack-check' as a GCC argument.
> 
> 
> Andrew.

Okay, that explains that.  I rebuilt the affected kernel with
'-fno-stack-check', and that particular platform boots now (it has other
issues, but it at least gets past early init).
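
The stack-probe explanation lines up with the offsets in the gcc-6.3.0 listing earlier in the thread: in each of the three functions, the probed address sits exactly 16384 bytes (16 KiB) below the bottom of the new frame. A minimal Python sketch of that arithmetic, using only the numbers from the listing, is below. The names are made up for illustration, and which gcc constant this margin comes from is not established here.

```python
# (probe offset from sp, frame size) for the three functions in the listing:
#   sd zero,-16400(sp) / daddiu sp,sp,-16
#   sd zero,-16416(sp) / daddiu sp,sp,-32
#   sd zero,-16496(sp) / daddiu sp,sp,-112
probes = [(-16400, 16), (-16416, 32), (-16496, 112)]


def probe_margin(probe_offset, frame_size):
    """Distance from the bottom of the new frame down to the probed address."""
    return -probe_offset - frame_size


margins = [probe_margin(p, f) for p, f in probes]
print(margins)  # prints: [16384, 16384, 16384]
```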

So now the question is why stack-probing kills this machine on generic MIPS
code that its smaller cousin is seemingly unaffected by.  I do know that IP27
has a different set of memory initialization routines in the MIPS code, so is
it possible that, at the point that _raw_spin_lock_irq is called and the
stack-probe happens, that there isn't any stack space available because the
IP27-specific memory init hasn't yet completed?

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943


Re: Odd gcc-6.3.0 code generation on mips64 platform causing kernel Oops

2017-01-23 Thread Joshua Kinard
On 01/23/2017 11:24, Andrew Haley wrote:
> On 23/01/17 16:11, Joshua Kinard wrote:
>> So now the question is why stack-probing kills this machine on generic MIPS
>> code that its smaller cousin is seemingly unaffected by.  I do know that IP27
>> has a different set of memory initialization routines in the MIPS code, so is
>> it possible that, at the point that _raw_spin_lock_irq is called and the
>> stack-probe happens, that there isn't any stack space available because the
>> IP27-specific memory init hasn't yet completed?
> 
> I'm sorry, but that really is a question for the kernel people.
> 
> Andrew.

Roger, I will take that up with them then.  Thanks for the pointer!

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943