Nice catch. I think this backtrace should be enough to bring the problem to the attention of the kernel developers. Have you tried contacting them already? I can help you with that if you want.
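As context for the workaround reported in the quoted thread below: transparent huge pages can be inspected and toggled through sysfs. This is an editorial sketch, not part of the original mails; the sysfs path is the standard one on Ubuntu 16.04, and changing the setting needs root.

```shell
# Show the current THP policy; the word in [brackets] is the active one.
THP=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP" ]; then
    cat "$THP"                 # e.g. "always [madvise] never"
else
    echo "THP sysfs knob not present on this kernel"
fi

# Disable THP until the next reboot (requires root):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Or disable it permanently with the kernel boot parameter:
#   transparent_hugepage=never
```

Note this only works around the symptom; the kernel call trace below still points into do_huge_pmd_wp_page.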
On 6 January 2017 at 23:34, Eugene Birukov <eugen...@hotmail.com> wrote:
> Hi,
>
> It seems the problem is related to transparent huge pages. Disabling them
> helps.
>
> The LLDB spins in this code:
>
> [11428.634820] Call Trace:
> [11428.634821] [<ffffffff811ec7de>] do_huge_pmd_wp_page+0x62e/0xb80
> [11428.634822] [<ffffffff811b0f75>] handle_mm_fault+0x705/0xfe0
> [11428.634823] [<ffffffff811aa34d>] ? follow_page_mask+0x37d/0x7a0
> [11428.634824] [<ffffffff811aa90b>] __get_user_pages+0x19b/0x660
> [11428.634825] [<ffffffff811ab222>] get_user_pages+0x52/0x60
> [11428.634825] [<ffffffff811ac00b>] __access_remote_vm+0xcb/0x1f0
> [11428.634826] [<ffffffff811b2360>] access_process_vm+0x50/0x70
> [11428.634827] [<ffffffff8109500a>] ptrace_request+0x2da/0x620
> [11428.634828] [<ffffffff810c24dc>] ? wait_task_inactive+0x16c/0x1f0
> [11428.634829] [<ffffffff81038cab>] arch_ptrace+0x29b/0x300
> [11428.634830] [<ffffffff81094c6e>] SyS_ptrace+0xbe/0x110
> [11428.634831] [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
>
> Eugene
>
> Sent from Outlook <http://aka.ms/weboutlook>
>
> ------------------------------
> *From:* Pavel Labath <lab...@google.com>
> *Sent:* Tuesday, December 13, 2016 1:39 AM
> *To:* Eugene Birukov
> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
> CPU on Linux Ubuntu 16.04
>
> Yeah, we support a couple of communication transports. For example, you
> can switch to using unix sockets instead. It's easiest if you just connect
> the server and client manually (for automatically starting up the server
> you would have to edit the source, I think.)
>
> $ bin/lldb-server gdbserver unix-abstract:///tmp/X -- /bin/ls
>
> $ bin/lldb
> (lldb) process connect unix-abstract-connect:///tmp/X
> Process 21731 stopped
> * thread #1, name = 'ls', stop reason = signal SIGSTOP
>     frame #0: 0x00007f4b341bb2d0
> ->  0x7f4b341bb2d0: movq   %rsp, %rdi
>     0x7f4b341bb2d3: callq  0x7f4b341bea40
>     0x7f4b341bb2d8: movq   %rax, %r12
>     0x7f4b341bb2db: movl   0x221b17(%rip), %eax
>
> Good luck. Let me know what you find out.
> pl
>
> On 12 December 2016 at 18:11, Eugene Birukov <eugen...@hotmail.com> wrote:
>
>> OK, thanks!
>>
>> I'll work on getting a reliable repro on something I can attach kgdb to.
>>
>> Meanwhile - are there any options in the lldb-server communication protocol I
>> could play with? Say, using a pipe instead of tcp/ip or vice versa?
>>
>> ------------------------------
>> *From:* Pavel Labath <lab...@google.com>
>> *Sent:* Friday, December 9, 2016 2:56 AM
>> *To:* Eugene Birukov
>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
>> CPU on Linux Ubuntu 16.04
>>
>> It looks like debugging this remotely will not lead anywhere. I don't
>> think I have enough knowledge about the kernel to help you with that. If I
>> were able to reproduce it myself, maybe I'd get somewhere by random
>> experiments, but doing it over email is too slow. :)
>>
>> I suggest you ask some of the kernel developers about it. They should be
>> able to direct you where to look further, and may be aware of kernel
>> changes between the two versions that would affect this.
>>
>> cheers,
>> pl
>>
>> On 8 December 2016 at 17:12, Eugene Birukov <eugen...@hotmail.com> wrote:
>>
>>> *> From your description it sounds like the un-predictability comes from
>>> the tracee side*
>>>
>>> Not exactly. If the problem is there, it repros 100%. Things that affect
>>> the repro - make it go away or shift it to a different address:
>>>
>>> 1. Delays in the client.
>>> If I put ::sleep(0) right after sending the
>>> write memory request (not sure - are we writing into a pipe or a socket?) - it
>>> might kill the repro.
>>> 2. Changes in the VM. Copying all the bits into a would-be-identical VM hurts
>>> the repro.
>>> 3. Changes in the target application. I assume this is mainly the layout
>>> of the different VMAs that shift after a significant code change, because the
>>> target is fully stopped and it is hard to believe that the contents of the memory
>>> matter... Just in case - I tried to write different values and change the
>>> write size to 8 (not to trigger the recursion in lldb-server).
>>>
>>> Things that do not matter:
>>>
>>> - Delays in the server. As soon as we are about to call ptrace, no
>>> amount of delay caused by stepping, etc. matters; the damage is already
>>> there.
>>>
>>> So, it seems that the problem is really not in ptrace but in the
>>> communication channel between the client and the server. Ptrace looks
>>> like just an innocent bystander caught in a gang crossfire on the street.
>>>
>>> BTW, I am on a short vacation, will be back on Monday 12/12.
>>>
>>> ------------------------------
>>> *From:* Pavel Labath <lab...@google.com>
>>> *Sent:* Thursday, December 8, 2016 3:19 AM
>>> *To:* Eugene Birukov
>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
>>> CPU on Linux Ubuntu 16.04
>>>
>>> Thanks for the info. I agree with your assessment, it definitely looks
>>> like a kernel bug.
>>>
>>> I don't know how to trigger a kernel core dump. I have done a bit of
>>> kernel debugging/development, but it mainly consisted of adding a bunch of
>>> printf's and watching the dmesg output. I think we should contact the linux
>>> kernel mailing list (LKML.org) with this problem. I've had pretty good
>>> success with this in the past. The one contact I have there is Oleg
>>> Nesterov <o...@redhat.com>.
>>> He's quite a ptrace expert, and was able to
>>> fix my bug in a day when I contacted him with a not-too-dissimilar problem
>>> to yours (although this does sound more complicated; I suspect some
>>> interaction between ptrace and the virtual memory manager).
>>>
>>> Before we do that, we may want to reduce the scope of the problem a bit.
>>> From your description it sounds like the un-predictability comes from the tracee
>>> side, and once it gets into a bad state, all tracer writes to the affected
>>> area will hang. In that case, it should be possible to take lldb out of the
>>> equation, and instead write a trivial app, which takes a pid and an
>>> address, and does a PTRACE_ATTACH + a bunch of PTRACE_POKEDATA to the given
>>> address. I think this would be enough to get kernel folks interested, as it
>>> clearly demonstrates it's not lldb being funny, and they may direct us how
>>> to debug this further. I have adapted one of my test apps to do this
>>> (attached). Could you try this out on your system?
>>>
>>> pl
>>>
>>> On 8 December 2016 at 00:57, Eugene Birukov <eugen...@hotmail.com> wrote:
>>>
>>>> Some additional info.
>>>>
>>>> I stepped through individual instructions… It hangs in the syscall:
>>>>
>>>> 37  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee21b <ptrace+43>: 48 63 70 08     movslq 0x8(%rax),%rsi
>>>> (gdb)
>>>> 38  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee21f <ptrace+47>: 48 8b 50 10     mov    0x10(%rax),%rdx
>>>> (gdb)
>>>> 45  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee223 <ptrace+51>: 89 ff           mov    %edi,%edi
>>>> (gdb)
>>>> 36  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee225 <ptrace+53>: 48 89 44 24 c8  mov    %rax,-0x38(%rsp)
>>>> (gdb)
>>>> 45  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee22a <ptrace+58>: 4c 0f 47 50 18  cmova  0x18(%rax),%r10
>>>>    0x00007fcac11ee22f <ptrace+63>: b8 65 00 00 00  mov    $0x65,%eax
>>>> (gdb)
>>>> 0x00007fcac11ee22f  45  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>    0x00007fcac11ee22a <ptrace+58>: 4c 0f 47 50 18  cmova  0x18(%rax),%r10
>>>> => 0x00007fcac11ee22f <ptrace+63>: b8 65 00 00 00  mov    $0x65,%eax
>>>> (gdb)
>>>> 38  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee234 <ptrace+68>: c7 44 24 b8 18 00 00 00  movl $0x18,-0x48(%rsp)
>>>> (gdb)
>>>> 45  in ../sysdeps/unix/sysv/linux/ptrace.c
>>>> => 0x00007fcac11ee23c <ptrace+76>: 0f 05           syscall
>>>>    0x00007fcac11ee23e <ptrace+78>: 48 3d 00 f0 ff ff  cmp  $0xfffffffffffff000,%rax
>>>>    0x00007fcac11ee244 <ptrace+84>: 77 2a           ja   0x7fcac11ee270 <ptrace+128>
>>>> (gdb)
>>>>
>>>> Also, user vs. kernel times are about a 1:1000 ratio according to
>>>> /proc/<pid>/stat:
>>>>
>>>> # cat /proc/21809/stat
>>>> 21809 (lldb-server) R 21780 21809 21809 0 -1 4194304 1175 1709 0 0 18
>>>> 18398 4 0 20 0 1 0 1602761 48611328 2102 18446744073709551615 4194304
>>>> 11143636 140733729492144 140733729487496 140187105837630 262144 65537 4096
>>>> 65537 0 0 0 17 1 0 0 0 0 0 13244688 13945048 29155328 140733729499314
>>>> 140733729499429 140733729499429 140733729501128 0
>>>>
>>>> Eugene
>>>>
>>>> ------------------------------
>>>> *From:* Eugene Birukov <eugen...@hotmail.com>
>>>> *Sent:* Wednesday, December 7, 2016 1:15 PM
>>>> *To:* Pavel Labath; Eugene Birukov
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>> And yes - I suspect a kernel bug, because we did not see it on 15.10, but
>>>> on 16.04 I have several people reporting it. In one case it would repro on
>>>> 3 out of 4 kind-of-"identical" machines.
>>>>
>>>> I do not have a simple repro, but I can find a victim VM that we do not
>>>> care much about. Is there a way to cause a Linux kernel core dump there?
>>>>
>>>> ------------------------------
>>>> *From:* lldb-dev <lldb-dev-boun...@lists.llvm.org> on behalf of Eugene
>>>> Birukov via lldb-dev <lldb-dev@lists.llvm.org>
>>>> *Sent:* Wednesday, December 7, 2016 11:36 AM
>>>> *To:* Pavel Labath
>>>> *Cc:* LLDB
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>> *> could you get a backtrace of lldb-server when it is in the "stuck"
>>>> state (just attach with lldb/gdb after it hangs and do "bt")?*
>>>>
>>>> You wish ☹ The lldb-server does not react to any signals
>>>> including SIGSTOP, so gdb just hangs forever.
>>>>
>>>> *> If you can get me reasonably detailed repro steps, I can try to
>>>> investigate*
>>>>
>>>> Unfortunately, I do not have a repro myself. It happens semi-randomly on
>>>> some machines, and I need to borrow a machine with the problem. Here are
>>>> some details from my records:
>>>>
>>>> - It is a pretty big piece of RX memory; /proc/<pid>/maps shows this:
>>>>   409701000-40b49c000 r-xp 00000000 00:00 0
>>>> - Writing into some locations within that VMA works.
>>>> - When it repros, it is pretty consistent, but a change in the
>>>>   target may shift it - i.e. make it not repro, or fail at a different address.
>>>>
>>>> *> are you able to still reproduce the bug with logging enabled?*
>>>>
>>>> Yes. Here are a few last lines from the log:
>>>>
>>>> 1481139040.768961000 0x7fff253c9780 Communication::Write (src = 0x7fff253c8f48, src_len = 7) connection = 0x24a6bd0
>>>> 1481139040.768963000 0x24a6bd0 ConnectionFileDescriptor::Write (src = 0x7fff253c8f48, src_len = 7)
>>>> 1481139040.768973000 0x24a6cc0 Socket::Write() (socket = 6, src = 0x7fff253c8f48, src_len = 7, flags = 0) => 7 (error = (null))
>>>> 1481139040.768976000 0x24a6bd0 ConnectionFileDescriptor::Write(fd = 6, src = 0x7fff253c8f48, src_len = 7) => 7 (error = (null))
>>>> 1481139040.768979000 0x7fff253c9780 Communication::Read (dst = 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
>>>> 1481139040.768982000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable (timeout_usec = 0)
>>>> 1481139040.768984000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)...
>>>> 1481139040.768986000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 0, error = (null)
>>>> 1481139090.788317000 0x7fff253c9780 Communication::Read (dst = 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
>>>> 1481139090.788356000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable (timeout_usec = 0)
>>>> 1481139090.788364000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)...
>>>> 1481139090.788368000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable() ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 1, error = (null)
>>>> 1481139090.788378000 0x24a6cc0 Socket::Read() (socket = 6, src = 0x7fff253c7140, src_len = 25, flags = 0) => 25 (error = (null))
>>>> 1481139090.788382000 0x24a6bd0 ConnectionFileDescriptor::Read() fd = 6, dst = 0x7fff253c7140, dst_len = 8192) => 25, error = (null)
>>>> 1481139090.788395000 NativeProcessLinux::WriteMemory(0x409d5a7d0, 0x25271d0, 4)
>>>> 1481139090.788409000 NativeProcessLinux::ReadMemory using process_vm_readv to read 8 bytes from inferior address 0x409d5a7d0: Success
>>>> 1481139090.788414000 PTRACE_POKEDATA [1][0][0][0][57][41][54][41]
>>>>
>>>> Thanks,
>>>> Eugene
>>>>
>>>> ------------------------------
>>>> *From:* Pavel Labath <lab...@google.com>
>>>> *Sent:* Wednesday, December 7, 2016 2:34 AM
>>>> *To:* Eugene Birukov
>>>> *Cc:* LLDB
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>> Hello Eugene,
>>>>
>>>> this sounds troubling, and I'd like to get to the bottom of this.
>>>> If you can get me a bit more information about this, I believe we can
>>>> figure it out:
>>>>
>>>> - could you get a backtrace of lldb-server when it is in the "stuck"
>>>> state (just attach with lldb/gdb after it hangs and do "bt")? I want
>>>> to see where it is spinning, as I don't see any obvious infinite
>>>> loop there.
>>>>
>>>> - are you able to still reproduce the bug with logging enabled? If so,
>>>> I'd like to see the log file to understand this better. (You can
>>>> enable logging by starting lldb-server with: --log-file XXX.log
>>>> --log-channels "lldb all:linux all". If you're starting it via the lldb
>>>> client, you can set the LLDB_DEBUGSERVER_LOG_FILE and
>>>> LLDB_SERVER_LOG_CHANNELS environment vars to achieve this.)
>>>>
>>>> - If you can get me reasonably detailed repro steps, I can try to
>>>> investigate (I am fine with the first step being "install ubuntu 16.04
>>>> in virtualbox").
>>>>
>>>> On 6 December 2016 at 23:41, Eugene Birukov via lldb-dev
>>>> <lldb-dev@lists.llvm.org> wrote:
>>>> > Hi,
>>>> > 1. I believe that lldb-server spins inside ptrace. I put a breakpoint on the
>>>> > highlighted line, and it does not hit. If I put a breakpoint on the line before,
>>>> > it hits, but lldb-server hangs.
>>>>
>>>> Do you mean actually inside the ptrace(2) syscall? Your description
>>>> would certainly fit that, but that sounds scary, as it would mean a
>>>> kernel bug. If that's the case, then we have to start looking in the
>>>> kernel. I have some experience with that, but if we can boil this down
>>>> to a simple use case, we can also ask the kernel ptrace folks for
>>>> help.
>>>>
>>>> > 2. It seems that the hang is caused by the client trying to read the response
>>>> > too fast. I mean, if I step through the client code it works - i.e. there is
>>>> > a significant delay between the client writing into the pipe and issuing
>>>> > ::select to wait for the response.
>>>>
>>>> I am not sure how this fits in with the item above.
>>>> I find it hard to
>>>> believe that the presence of select(2) in one process would affect the
>>>> outcome of ptrace() in another, unless we are actually encountering a
>>>> kernel scheduler bug, which I find unlikely. Hopefully we can get more
>>>> insight here with the additional logging information.
>>>>
>>>> > Any advice on how to deal with the situation, other than putting random
>>>> > sleeps in random places?
>>>>
>>>> Inserting sleeps in various places is a valid (albeit very slow)
>>>> strategy for debugging races. If you can push the sleep down, you will
>>>> eventually reach the place where it will be obvious what is racing
>>>> (or, at least, which component is to blame). Hopefully we can do
>>>> something smarter though.
>>>>
>>>> If you are suspecting a kernel bug, I'd recommend recreating it in a
>>>> simple standalone application (fork, trace the child, write its
>>>> memory), as it is then easy to ask for help.
>>>>
>>>> pl
_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev