Re: [lldb-dev] LLDB hang loading Linux core files from live processes (Bug 26322)

Greg Clayton via lldb-dev Fri, 11 Nov 2016 10:07:34 -0800

I think both are valid fixes. Threads in core files can have a non-zero signal. 
See comments below.


> On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev 
> <lldb-dev@lists.llvm.org> wrote:
> 
> Hi Jim 
> 
> I was afraid someone would say that, but I've done some digging and found a 
> difference between the core files generated by gcore and those generated by a 
> crash or abort. 
> 
> Most of the core files have one SIGINFO structure in the core, which I think 
> belongs to the preceding thread (the one that caught the signal). 
> In the core files generated by gcore all of the threads have a SIGINFO 
> structure following their PRSTATUS structure. In the non-gcore files the 
> value of info.si_signo in the PRSTATUS structure is a signal number. In the 
> gcore file this is actually 0 but the SIGINFO structure following PRSTATUS 
> has an si_signo value of 19. 
> 
> Looking at it with eu-readelf shows: 
> 
>   CORE                 336  PRSTATUS 
>     info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0 
>     sigpend: <> 
>     sighold: <> 
> ... lots of registers... 
>   CORE                 128  SIGINFO 
>     si_signo: 19, si_errno: 0, si_code: 0 
>     sender PID: 0, sender UID: 0 
> 
> I think gcore is being clever. It's including the "real" signal number the 
> running thread had received at the time the core was taken (info.si_signo is 
> 0) but also the signal it had used to interrupt the thread and gather its 
> state. The value in PRSTATUS info.si_signo is the signal number that ends up 
> in m_signo in ThreadElfCore and ultimately is looked for in the set of 
> signals lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in 
> that set since there isn't a signal 0. I think gcore is doing all this so 
> that it preserves the real signal state the process had before gcore attached 
> to it, I guess in case you are trying to debug something to do with signals 
> and need to see that state. (That's a bit of a guess mind you.) 
> 
> I can think of three solutions: 
> 
> - Read the signal information from the SIGINFO block for a thread if it's 
> present. Core files generated by abort or a crash only seem to have a SIGINFO 
> for one thread, which looks like the one that received/triggered the signal 
> in the first place. This means adding something to parse that block out of 
> the elf core as well as PRSTATUS, and overriding the state from PRSTATUS if 
> we see it. SIGINFO always seems to come after PRSTATUS, and probably has to, 
> as PRSTATUS contains the pid and identifies that there is a new thread in the 
> core. So if SIGINFO is found, that signal number will just replace the first 
> one.

You want to figure out which one is the accurate signal and use that. It doesn't 
matter how you do this, but this will be up to the ProcessELFCore or 
ThreadELFCore classes.
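
As a rough sketch of the first option (SIGINFO overrides PRSTATUS when present): the struct and function names below are made up for illustration and are not the actual lldb types, and whether SIGINFO really is the more accurate source is the assumption being made here.

```cpp
// Hypothetical, simplified stand-ins for the two per-thread ELF core notes.
// Real notes carry far more state; only the signal number matters here.
struct PrStatusNote {
  int si_signo; // info.si_signo from PRSTATUS
};

struct SigInfoNote {
  int si_signo; // si_signo from the SIGINFO note that follows PRSTATUS
};

// Pick the signal to report for a thread. Assumption: a non-zero SIGINFO
// value, when the note exists, wins over the PRSTATUS value.
// Pass nullptr for siginfo when the thread has no SIGINFO note.
int EffectiveSignal(const PrStatusNote &prstatus, const SigInfoNote *siginfo) {
  if (siginfo != nullptr && siginfo->si_signo != 0)
    return siginfo->si_signo;
  return prstatus.si_signo;
}
```

With the gcore values quoted above (PRSTATUS 0, SIGINFO 19) this picks 19; a thread with no SIGINFO note keeps whatever PRSTATUS says.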
> 
> - Never allow a thread's signal number to be 0 when it comes from an elf core 
> dump. (This is probably as much of a band aid as the first solution.)

Threads should be able to have no signal. If you have 10 threads and thread 6 
crashes with SIGABRT, but all other threads were just running, I would expect 
all threads except for thread 6 to have 0 signal values, or no stop reason. If 
you end up with 10 threads and all have no signal information, I would say that 
you can just give the first thread a SIGSTOP to be safe.
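
The fallback described above could look something like this. It is only a sketch: CoreThread and EnsureStopSignal are hypothetical names, not existing lldb code, and the thread list is assumed to be already parsed out of the core.

```cpp
#include <csignal>
#include <vector>

// Hypothetical, minimal view of a thread loaded from a core file.
struct CoreThread {
  int tid;
  int signo; // 0 means "no signal / no stop reason"
};

// If no thread in the core carries a signal, give the first thread SIGSTOP
// so that at least one thread has a stop reason. Threads are otherwise left
// alone, so a crashing thread keeps its signal and the rest stay at 0.
void EnsureStopSignal(std::vector<CoreThread> &threads) {
  for (const CoreThread &thread : threads) {
    if (thread.signo != 0)
      return; // some thread already has a signal, nothing to do
  }
  if (!threads.empty())
    threads.front().signo = SIGSTOP;
}
```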

> 
> - Stick with the first solution of saying that we can never resume a core 
> file. The only thing in this solution's favour is that it means the "real" 
> thread state that gcore tried to preserve is known to lldb. Once the core 
> file is loaded typing continue does result in an error message telling you 
> that you can't resume from a core file. 

The suggested fix can be done in a cleaner way: have ProcessELFCore and 
ProcessMachCore override "Error Process::WillResume()" to just return an error:

// A core file is a static snapshot, so refuse any request to resume it.
Error ProcessELFCore::WillResume() 
{
    return Error("can't resume a process in a core file");
}
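
To show why that override is enough to stop the resume attempt, here is a self-contained toy model. SketchError, SketchProcess and SketchCoreProcess are invented for this example and only mimic the shape of the lldb classes; the real Process::Resume path does much more.

```cpp
#include <string>

// Toy stand-in for an error object: an empty message means success.
struct SketchError {
  std::string message;
  bool Fail() const { return !message.empty(); }
};

// Toy stand-in for a process base class with a WillResume() hook.
class SketchProcess {
public:
  virtual ~SketchProcess() = default;

  // Default: resuming is allowed.
  virtual SketchError WillResume() { return SketchError(); }

  // Ask the hook first and bail out with its error instead of resuming.
  SketchError Resume() {
    SketchError error = WillResume();
    if (error.Fail())
      return error;
    ++m_resume_count; // stands in for actually resuming the process
    return SketchError();
  }

  int ResumeCount() const { return m_resume_count; }

private:
  int m_resume_count = 0;
};

// A core-file process always refuses to resume.
class SketchCoreProcess : public SketchProcess {
public:
  SketchError WillResume() override {
    return SketchError{"can't resume a process in a core file"};
  }
};
```

The point of the design is that the caller never needs to know it is dealing with a core file; it only has to honour the error coming back from WillResume().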

So I think the correct fix is all three of the above.

Greg

> 
> I'll have a go at prototyping the solution to read the SIGINFO structure but 
> I'd appreciate any thoughts on which is the "correct" fix. 
> 
> Thanks, 
> 
> 
> Howard Hellyer 
> IBM Runtime Technologies, IBM Systems         
> 
> 
> 
> 
> 
> From:        Jim Ingham <jing...@apple.com> 
> To:        Howard Hellyer/UK/IBM@IBMGB 
> Cc:        lldb-dev@lists.llvm.org 
> Date:        10/11/2016 18:48 
> Subject:        Re: [lldb-dev] LLDB hang loading Linux core files from live 
> processes (Bug 26322) 
> Sent by:        jing...@apple.com 
> 
> 
> 
> I think that approach is kind of a bandaid.  
> 
> Core files can't resume, so it would be better to figure out why telling a 
> core file which can't resume to resume caused us to go into a tailspin.  
> That should just fall out of WillResume returning false or some other better 
> general signal.  Special-casing core files seems a bit of a hack.
> 
> That being said, if nobody has time to make a better solution, a bandaid is 
> better than bleeding...
> 
> Jim
> 
> 
> 
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 
> 741598. 
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> _______________________________________________
> lldb-dev mailing list
> lldb-dev@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev

