I think both are valid fixes. Threads in core files can have a non-zero signal. See comments below.
> On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev <lldb-dev@lists.llvm.org> wrote:
>
> Hi Jim
>
> I was afraid someone would say that but I've done some digging and found a
> difference between the core files generated by gcore and those generated by
> a crash or abort.
>
> Most of the core files have one SIGINFO structure in the core, I think it
> belongs to the preceding thread (the one that caught the signal).
> In the core files generated by gcore all of the threads have a SIGINFO
> structure following their PRSTATUS structure. In the non-gcore files the
> value of info.si_signo in the PRSTATUS structure is a signal number. In the
> gcore file this is actually 0 but the SIGINFO structure following PRSTATUS
> has an si_signo value of 19.
>
> Looking at it with eu-readelf shows:
>
>   CORE                  336  PRSTATUS
>     info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0
>     sigpend: <>
>     sighold: <>
>     ... lots of registers...
>   CORE                  128  SIGINFO
>     si_signo: 19, si_errno: 0, si_code: 0
>     sender PID: 0, sender UID: 0
>
> I think gcore is being clever. It's including the "real" signal number the
> running thread had received at the time the core was taken (info.si_signo is
> 0) but also the signal it had used to interrupt the thread and gather its
> state. The value in PRSTATUS info.si_signo is the signal number that ends up
> in m_signo in ThreadElfCore and ultimately is looked for in the set of
> signals lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in
> that set since there isn't a signal 0. I think gcore is doing all this so
> that it preserves the real signal state the process had before gcore attached
> to it, I guess in case you are trying to debug something to do with signals
> and need to see that state. (That's a bit of a guess mind you.)
>
> I can think of three solutions:
>
> - Read the signal information from the SIGINFO block for a thread if it's
>   present.
>   Core files generated by abort or a crash only seem to have a SIGINFO
>   for one thread which looks like it's the one that received/triggered the
>   signal in the first place. This means adding something to parse that block
>   out of the elf core as well as PRSTATUS and override the state from
>   PRSTATUS if we see it. SIGINFO always seems to come after PRSTATUS and
>   probably has to as PRSTATUS contains the pid and identifies that there is
>   a new thread in the core so if SIGINFO is found that signal number will
>   just replace the first one.

You want to figure out which one is the accurate signal and use that. It doesn't matter how you do this, but it will be up to the ProcessELFCore or ThreadELFCore classes.

> - Never allow a thread's signal number to be 0 when it comes from an elf core
>   dump. (This is probably as much of a band aid as the first solution.)

Threads should be able to have no signal. If you have 10 threads and thread 6 crashes with SIGABRT, but all other threads were just running, I would expect all threads except thread 6 to have a signal value of 0, i.e. no stop reason. If you end up with 10 threads and none of them has any signal information, I would say that you can just give the first thread a SIGSTOP to be safe.

> - Stick with the first solution of saying that we can never resume a core
>   file. The only thing in this solution's favour is that it means the "real"
>   thread state that gcore tried to preserve is known to lldb. Once the core
>   file is loaded typing continue does result in an error message telling you
>   that you can't resume from a core file.

The suggested fix can be done in a cleaner way: have ProcessELFCore and ProcessMachCore override "Error Process::WillResume()" to just return an error:

    Error ProcessELFCore::WillResume() {
      return Error("can't resume a process in a core file");
    }

So I think the correct fix is all three of the above.
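
To make the thread-signal part concrete, here is a rough sketch of how the choice could be made. It is only an illustration under stated assumptions, not lldb code: ThreadNotes, ResolveThreadSignals and kLinuxSigStop are made-up names, the rule "prefer a non-zero SIGINFO value over PRSTATUS" is one possible reading of the first solution, and 19 is SIGSTOP only on common Linux targets such as x86.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical per-thread view of the notes found in an ELF core file.
struct ThreadNotes {
  int prstatus_signo = 0;           // info.si_signo from the PRSTATUS note
  std::optional<int> siginfo_signo; // si_signo from a SIGINFO note, if present
};

// Assumption: SIGSTOP is 19 on common Linux targets such as x86.
constexpr int kLinuxSigStop = 19;

// Pick one signal number per thread:
//  - use the SIGINFO value when the note exists and is non-zero,
//  - otherwise fall back to the PRSTATUS value (0 means "no signal"),
//  - if no thread ends up with a signal, give the first thread SIGSTOP.
std::vector<int> ResolveThreadSignals(const std::vector<ThreadNotes> &threads) {
  std::vector<int> signals;
  signals.reserve(threads.size());
  for (const ThreadNotes &t : threads) {
    const bool use_siginfo =
        t.siginfo_signo.has_value() && *t.siginfo_signo != 0;
    signals.push_back(use_siginfo ? *t.siginfo_signo : t.prstatus_signo);
  }
  const bool all_zero = std::all_of(signals.begin(), signals.end(),
                                    [](int s) { return s == 0; });
  if (!signals.empty() && all_zero)
    signals.front() = kLinuxSigStop;
  return signals;
}
```

With this rule a gcore file (PRSTATUS 0, SIGINFO 19 on every thread) gives every thread signal 19, a crash core keeps 0 on the threads that were just running, and a core with no signal information at all still ends up with one stopped thread.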
Greg

> I'll have a go at prototyping the solution to read the SIGINFO structure but
> I'd appreciate any thoughts on which is the "correct" fix.
>
> Thanks,
>
> Howard Hellyer
> IBM Runtime Technologies, IBM Systems
>
> From: Jim Ingham <jing...@apple.com>
> To: Howard Hellyer/UK/IBM@IBMGB
> Cc: lldb-dev@lists.llvm.org
> Date: 10/11/2016 18:48
> Subject: Re: [lldb-dev] LLDB hang loading Linux core files from live
> processes (Bug 26322)
> Sent by: jing...@apple.com
>
> I think that approach is kind of a bandaid.
>
> Core files can't resume, so it would be better to figure out why telling a
> core file which can't resume to resume caused us to go into a tail spin.
> That should just fall out of WillResume returning false or some other better
> general signal. Special-casing core files seems a bit of a hack.
>
> That being said, if nobody has time to make a better solution, a bandaid is
> better than bleeding...
>
> Jim
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev