> You want to figure out which one is the accurate signal and use that.
> Doesn't matter how you do this, but this will be up to the
> ProcessElfCore or ThreadElfCore classes.

I'm going to do a little more research (books and Google) to see if I can get an answer on this one. I'm actually having trouble finding core files (at least in my own collection) where threads have different signals in info.si_signo in PRSTATUS. For the ones I've checked that crashed or received a signal, all the threads have the same value in info.si_signo. Typically just one thread (the thread that triggered or received the signal) has a SIGINFO note. (My collection of cores is a bit random so that's not a comprehensive survey by any means.)

I'm getting the impression that the value in PRSTATUS may be for the whole process, with any thread that actually received a signal having a SIGINFO note containing that information, but I'm not totally sure either way yet. I haven't found anything that documents that behaviour. (If anyone knows of a good reference please let me know!) It would explain why all the threads in a core created by gcore have a SIGINFO note, as each one will be stopped in turn. It would also mean that for the non-gcore created cores I've got (from crashes and kills) only one thread would have a non-zero signal, which sounds correct. Currently for those core files running "thread list" shows all threads as having stopped on the same signal, with only one thread in a position where that signal makes sense. Switching to not use info.si_signo is a slightly bigger change though!

> > - Never allow a thread's signal number to be 0 when it comes from
> > an elf core dump. (This is probably as much of a band aid as the
> > first solution.)
>
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just
> running, I would expect all threads except for thread 6 to have 0
> signal values, or no stop reason. If you end up with 10 threads and
> all have no signal information, I would say that you can just give
> the first thread a SIGSTOP to be safe.

I checked this with one of the gcore files by just setting the first thread's signal and leaving the others to pick up 0 as they used to. That works. Putting in a check that makes sure that at least one thread has some kind of signal seems reasonable. I'll add that as a fallback sanity check.

> The suggested fix can be done in a cleaner way: Have ProcessElfCore and
> ProcessMachCore override "Error Process::WillResume()" to just return an error:
>
> Error ProcessElfCore::WillResume()
> {
>     return Error("can't resume a process in a core file");
> }

I think that's called too late. It's not called until the decision has been made to resume the process. Also the base implementation already returns an error and I don't think either ProcessElfCore or ProcessMachCore overrides it.

> So I think the correct fix is all three of the above.

I think it's close and discussing the problem is actually helping a lot, thanks for the help. I'll grab the bug and put up a patch - hopefully tomorrow.

Thanks,

Howard Hellyer
IBM Runtime Technologies, IBM Systems

Greg Clayton <gclay...@apple.com> wrote on 11/11/2016 18:07:03:

> From: Greg Clayton <gclay...@apple.com>
> To: Howard Hellyer/UK/IBM@IBMGB
> Cc: Jim Ingham <jing...@apple.com>, lldb-dev@lists.llvm.org
> Date: 11/11/2016 18:07
> Subject: Re: [lldb-dev] LLDB hang loading Linux core files from live
> processes (Bug 26322)
>
> I think both are valid fixes. Threads in core files can have a
> non-zero signal. See comments below.
>
> > On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev
> > <lldb-d...@lists.llvm.org> wrote:
> >
> > Hi Jim
> >
> > I was afraid someone would say that but I've done some digging and
> > found a difference in the core files I get generated by gcore to
> > those generated by a crash or abort.
> >
> > Most of the core files have one SIGINFO structure in the core, I
> > think it belongs to the preceding thread (the one that caught the signal).
> > In the core files generated by gcore all of the threads have a
> > SIGINFO structure following their PRSTATUS structure. In the
> > non-gcore files the value of info.si_signo in the PRSTATUS structure is
> > a signal number. In the gcore file this is actually 0 but the
> > SIGINFO structure following PRSTATUS has an si_signo value of 19.
> >
> > Looking at it with eu-readelf shows:
> >
> >   CORE 336 PRSTATUS
> >     info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0
> >     sigpend: <>
> >     sighold: <>
> >     ... lots of registers...
> >   CORE 128 SIGINFO
> >     si_signo: 19, si_errno: 0, si_code: 0
> >     sender PID: 0, sender UID: 0
> >
> > I think gcore is being clever. It's including the "real" signal
> > number the running thread had received at the time the core was
> > taken (info.si_signo is 0) but also the signal it had used to
> > interrupt the thread and gather its state. The value in PRSTATUS
> > info.si_signo is the signal number that ends up in m_signo in
> > ThreadElfCore and ultimately is looked for in the set of signals
> > lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in
> > that set since there isn't a signal 0. I think gcore is doing all
> > this so that it preserves the real signal state the process had
> > before gcore attached to it, I guess in case you are trying to debug
> > something to do with signals and need to see that state. (That's a
> > bit of a guess mind you.)
> >
> > I can think of three solutions:
> >
> > - Read the signal information from the SIGINFO block for a thread
> > if it's present. Core files generated by abort or a crash only seem
> > to have a SIGINFO for one thread which looks like it's the one that
> > received/triggered the signal in the first place.
> > This means adding something to parse that block out of the elf core
> > as well as PRSTATUS and override the state from PRSTATUS if we see it.
> > SIGINFO always seems to come after PRSTATUS and probably has to as
> > PRSTATUS contains the pid and identifies that there is a new thread
> > in the core so if SIGINFO is found that signal number will just
> > replace the first one.
>
> You want to figure out which one is the accurate signal and use that.
> Doesn't matter how you do this, but this will be up to the
> ProcessElfCore or ThreadElfCore classes.
>
> > - Never allow a thread's signal number to be 0 when it comes from
> > an elf core dump. (This is probably as much of a band aid as the
> > first solution.)
>
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just
> running, I would expect all threads except for thread 6 to have 0
> signal values, or no stop reason. If you end up with 10 threads and
> all have no signal information, I would say that you can just give
> the first thread a SIGSTOP to be safe.
>
> > - Stick with the first solution of saying that we can never resume
> > a core file. The only thing in this solution's favour is that it
> > means the "real" thread state that gcore tried to preserve is known
> > to lldb. Once the core file is loaded typing continue does result in
> > an error message telling you that you can't resume from a core file.
>
> The suggested fix can be done in a cleaner way: Have ProcessElfCore and
> ProcessMachCore override "Error Process::WillResume()" to just return an error:
>
> Error ProcessElfCore::WillResume()
> {
>     return Error("can't resume a process in a core file");
> }
>
> So I think the correct fix is all three of the above.
>
> Greg
>
> > I'll have a go at prototyping the solution to read the SIGINFO
> > structure but I'd appreciate any thoughts on which is the "correct" fix.
> >
> > Thanks,
> >
> > Howard Hellyer
> > IBM Runtime Technologies, IBM Systems
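
For reference, the signal selection discussed in this thread (prefer the SIGINFO note when a thread has one, otherwise use PRSTATUS, and fall back to SIGSTOP on the first thread if nothing has a signal) could be sketched roughly as below. This is a standalone C++17 illustration only, not LLDB code: CoreThreadNotes, SelectSignal, ResolveThreadSignals and kLinuxSigStop are hypothetical names invented for the sketch, and the value 19 is assumed to be SIGSTOP as in the Linux eu-readelf output quoted above.

```cpp
#include <optional>
#include <vector>

// Assumption: 19 is SIGSTOP on the Linux target that produced the core.
constexpr int kLinuxSigStop = 19;

// Hypothetical per-thread record of what was parsed from the core's notes.
struct CoreThreadNotes {
    int prstatus_signo = 0;            // info.si_signo from the PRSTATUS note
    std::optional<int> siginfo_signo;  // si_signo from a SIGINFO note, if present
};

// Prefer the SIGINFO value when the thread has one, otherwise use PRSTATUS.
int SelectSignal(const CoreThreadNotes &thread) {
    return thread.siginfo_signo.value_or(thread.prstatus_signo);
}

// Resolve one signal per thread. Threads may legitimately end up with 0
// (no signal). If no thread at all has a signal, give the first thread
// SIGSTOP as a fallback so that at least one thread has a stop reason.
std::vector<int> ResolveThreadSignals(const std::vector<CoreThreadNotes> &threads) {
    std::vector<int> signals;
    signals.reserve(threads.size());
    bool any_signal = false;
    for (const CoreThreadNotes &thread : threads) {
        const int signo = SelectSignal(thread);
        any_signal = any_signal || signo != 0;
        signals.push_back(signo);
    }
    if (!any_signal && !signals.empty())
        signals.front() = kLinuxSigStop;
    return signals;
}
```

With a gcore-style input (PRSTATUS 0, SIGINFO 19 on every thread) each thread resolves to 19; with a crash-style input only the thread carrying a signal is non-zero; with no signal information anywhere only the first thread gets the fallback. Whether PRSTATUS should be consulted at all is still the open question raised above, so the value_or fallback is just one of the options under discussion.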