Hi Yaroslav,

Thanks a lot for your help!
Do I understand correctly that we are in fact missing a call to
SmartScheduleStopTimer()?

Best regards,

// Ola

On Fri, May 09, 2008 at 04:24:21PM -0400, Yaroslav Halchenko wrote:
> Hi Ola
>
> It is me again and I am emailing just to have record of my possibly fruitless
> findings... actually by the time I finished I resolved it for myself!!!! So it
> might indeed be informative!
>
> So from the beginning:
> So -- it hung again and I am trying to debug it once again.
>
> backtrace is
> (gdb) bt
> #0  0x00002ae840181ee2 in __libc_fork () from /usr/lib/debug/libc.so.6
> #1  0x000000000043cd90 in Popen ()
> #2  0x000000000043e884 in LoadAuthorization ()
> #3  0x000000000043ea76 in CheckAuthorization ()
> #4  0x0000000000439a25 in ClientAuthorized ()
> #5  0x000000000041e396 in ProcEstablishConnection ()
> #6  0x0000000000424672 in Dispatch ()
> #7  0x000000000040b145 in main ()
>
> though it is weird since it hung right in the middle of working and I didn't
> try to authenticate (maybe someone else???)...
> in the .log file I have the line inserted by us:
> Popen called with command='cat /home/yoh/.Xauthority' type='r' as arguments.
> and mtime on log file right around the point when it hung so I guess it is
> the right one
>
> symbols pointed to ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
>
> I have downloaded sources and unpacked them, but fork.c pretty much is an
> include of ../fork.c (and also I had to ln -s sysv to sysdeps) and gdb is
> silly (or I am) not to look there...
>
> so now I get
> (gdb) l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
> has 31 lines
>
> when I go to that fork.c manually into __libc_fork I see 2 possible
> causes for infinite loops:
>
>   while ((runp = __fork_handlers) != NULL)
>     {
>       unsigned int oldval = runp->refcntr;
>
>       if (oldval == 0)
>         /* This means some other thread removed the list just after
>            the pointer has been loaded.  Try again.  Either the list
>            is empty or we can retry it.  */
>         continue;
>
>       /* Bump the reference counter.  */
>       if (atomic_compare_and_exchange_bool_acq (&__fork_handlers->refcntr,
>                                                 oldval + 1, oldval))
>         /* The value changed, try again.  */
>         continue;
>
> 1. so if oldval stays 0, I am doomed
> 2. atomic_compare_and_exchange_bool_acq ... not sure
>
> unfortunately I can't print out any of the variables (like oldval)
>
> nexti goes through SmartScheduleTimer() which I have no clue what it is
> about.... ha -- actually it is from vncserver!
> then it escapes:
>
> 0x000000000043c545 in SmartScheduleTimer ()
> 0x00002ae84011f110 in __restore_rt () from /usr/lib/debug/libc.so.6
> 0x00002ae84011f117 in __restore_rt () from /usr/lib/debug/libc.so.6
> 0x00002ae840181ee0 in __libc_fork () from /usr/lib/debug/libc.so.6
>
> unfortunately I am still out of luck in printing anything
> (gdb) l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
> has 31 lines.
> (gdb) p oldval
> No symbol "oldval" in current context.
>
> ok -- if I go in full through the 'loop' with nexti I get
>
> So I guess I am out of luck on the first condition (although who knows what
> tricks optimization did for me)... ok and here is the source of that
> SmartScheduleTimer
>
> void
> SmartScheduleTimer (int sig)
> {
>     int olderrno = errno;
>
>     SmartScheduleTime += SmartScheduleInterval;
>     if (SmartScheduleIdle)
>     {
>         SmartScheduleStopTimer ();
>     }
>     errno = olderrno;
> }
>
> I wonder how it interacts with that WaitForSomething, and that beast is
> filled up with #ifdefs so it is barely comprehensible, but that is the only
> place which could trigger SmartScheduleIdle (or maybe I missed some other) and
> I am not sure how scheduling and switching is done so I am not clear how it
> could ever be reset.
> And my knowledge and brain is somewhat far from comprehending
> sysdeps/unix/sysv/linux/x86_64/sigaction.c and __restore_rt
>
> but ok -- let's see a bit more
> SmartScheduleTimer
>
> 0x000000000043c52c <SmartScheduleTimer+44>:  test   %esi,%esi
> 0x000000000043c52e <SmartScheduleTimer+46>:  je     0x43c535 <SmartScheduleTimer+53>
> 0x000000000043c530 <SmartScheduleTimer+48>:  callq  0x43c3f0 <SmartScheduleStopTimer>
> 0x000000000043c535 <SmartScheduleTimer+53>:  mov    %ebp,(%rbx)
> 0x000000000043c537 <SmartScheduleTimer+55>:  mov    0x8(%rsp),%rbx
> and we step around 530
> 0x000000000043c52e in SmartScheduleTimer ()
> 0x000000000043c535 in SmartScheduleTimer ()
> so for sure we are not calling SmartScheduleStopTimer ;)
> let's do it manually:
>
> (gdb) call SmartScheduleStopTimer
> +call SmartScheduleStopTimer
> $1 = {<text variable, no debug info>} 0x43c3f0 <SmartScheduleStopTimer>
>
> *(gdb) call SmartScheduleStopTimer()
> +call SmartScheduleStopTimer()
> Reading in symbols for ../sysdeps/x86_64/elf/start.S...done.
> $2 = 0
> (gdb) nexti
> +nexti
> Detaching after fork from child process 23194.
> 0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
>
> ha -- some effect... let's see
>
> (gdb) c
> +c
> Continuing.
>
> Program received signal SIGPIPE, Broken pipe.
> 0x00002ae8401ac3e2 in __write_nocancel () from /usr/lib/debug/libc.so.6
>
> (gdb) c
> +c
> Continuing.
>
> but we are still on the hook -- 100% CPU and in the same fashion after I
> press Ctrl-C
>
> ok -- doing the same call to SmartScheduleStopTimer and then doing
> stepping which might be informative:
>
> (gdb) call SmartScheduleStopTimer()
> +call SmartScheduleStopTimer()
> $3 = 0
> (gdb) nexti
> +nexti
> Detaching after fork from child process 23893.
> 0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
> (gdb) n
> +n
> Single stepping until exit from function __libc_fork,
> which has no line number information.
> Reading in symbols for genops.c...done.
> Reading in symbols for malloc.c...done.
> 0x000000000043cd90 in Popen ()
> (gdb) n
> +n
> Single stepping until exit from function Popen,
> which has no line number information.
> 0x000000000043e884 in LoadAuthorization ()
> (gdb) l
> +l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
> has 31 lines.
> (gdb) n
> +n
> Single stepping until exit from function LoadAuthorization,
> which has no line number information.
>
> 0x000000000043ea76 in CheckAuthorization ()
> (gdb)
> +n
> Single stepping until exit from function CheckAuthorization,
> which has no line number information.
> 0x0000000000439a25 in ClientAuthorized ()
> (gdb)
> +n
> Single stepping until exit from function ClientAuthorized,
> which has no line number information.
> 0x000000000041e396 in ProcEstablishConnection ()
> (gdb)
> +n
> Single stepping until exit from function ProcEstablishConnection,
> which has no line number information.
> 0x000000000041e0d0 in SendConnSetup ()
> (gdb)
> +n
> Single stepping until exit from function SendConnSetup,
> which has no line number information.
> 0x0000000000424672 in Dispatch ()
> (gdb)
> +n
> Single stepping until exit from function Dispatch,
> which has no line number information.
>
> BOY -- now my VNC is reacting!!!! sluggish but working... let's try to detach
>
> Quit
> (gdb) detach
> +detach
> Detaching from program: /usr/bin/Xvnc4, process 2394
>
> and I am again in the working VNC!!!! uff ;-))))
>
> On Mon, 28 Apr 2008, Ola Lundqvist wrote:
>
> > Hi again
>
> > On Mon, Apr 28, 2008 at 03:28:06PM -0400, Yaroslav Halchenko wrote:
> > > > I'm not perfectly sure, but what I suspect is the problem is that the
> > > > number of open files, open sockets, number of processes or something
> > > > similar has reached its limit.
> > > > The reason is that you get ERESTARTNOINTR.
> > > thanks for sharing the knowledge ;-) I guess I just need to figure out
> > > how to monitor all the resources from a single point...
> > > ::)
>
> > > > Have you seen this on several systems or just one?
> > > unfortunately I use VNC primarily on that only box, thus I didn't see it
> > > anywhere else. If only we could figure out the loop where it gets to
> > > 100% I guess I could figure out what rejection it gets (ie what
> > > resource is the problem)
>
> > To me it seems more like you have a really problematic libc or kernel. Because
> > I see from the information that you have provided that you can get this
> > problem in quite a few situations.
>
> > Are you sure that you do not have a broken installation like buggy kernel
> > or libc?
>
> > I mean it should not really hang in fork...
>
> > Best regards,
>
> > // Ola
>
> > > > Best regards,
>
> > > > // Ola
>
> > > > > Sorry for being so anal... stalled once again today. From gdb now it
> > > > > is at fork and
> > > > > never actually exits it :-/ If someone could build it with
>
> > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > (gdb) bt
> > > > > #0  0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > #1  0x000000000043cd90 in Popen ()
> > > > > #2  0x000000000043e884 in LoadAuthorization ()
> > > > > #3  0x000000000043ea76 in CheckAuthorization ()
> > > > > #4  0x0000000000439a25 in ClientAuthorized ()
> > > > > #5  0x000000000041e396 in ProcEstablishConnection ()
> > > > > #6  0x0000000000424672 in Dispatch ()
> > > > > #7  0x000000000040b145 in main ()
> > > > > (gdb) finish
> > > > > Run till exit from #0 0x00002b68df98cee2 in fork () from
> > > > > /lib/libc.so.6
>
> > > > > Program received signal SIGINT, Interrupt.
> > > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > (gdb) bt
> > > > > #0  0x00002b68df98cee2 in fork () from /lib/libc.so.6
>
> > > > > strace was busy with
> > > > > 14892 rt_sigreturn(0xe) = 56
> > > > > 14892 clone(child_stack=0,
> > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > >   child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe) = 56
> > > > > 14892 clone(child_stack=0,
> > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > >   child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe) = 56
> > > > > 14892 clone(child_stack=0,
> > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > >   child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe) = 56
> > > > > 14892 clone(child_stack=0,
> > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > >   child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe) = 56
> > > > > 14892 clone(child_stack=0,
> > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > >   child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
>
> > > > > It would be so great if there were a vnc4server-dbg ;-)))
>
> > > > > BTW -- last line in .log was due to our inserted debug line
> > > > > Popen called with command='cat /home/yoh/.Xauthority' type='r' as
> > > > > arguments
>
> > > > > but I am not sure if that wasn't from original login moment earlier
> > > > > in the morning
>
> > > > > On Mon, 21 Apr 2008, Ola Lundqvist wrote:
> > > > > > > stracing was showing lots of gettimeofday or whatever that
> > > > > > > syscall is. Today it was different:
> > > > > > > 21162 rt_sigreturn(0xe) = 56
> > > > > > > 21162 clone(child_stack=0,
> > > > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > > > >   child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > > 21162 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > > 21162 rt_sigreturn(0xe) = 56
> > > > > > > 21162 clone(child_stack=0,
> > > > > > >   flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > > > >   child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > > ...
> > > > > > Hmm. To me it looks that we are out of resources...
>
> > > > > --
> > > > > Yaroslav Halchenko
> > > > > Research Assistant, Psychology Department, Rutgers-Newark
> > > > > Student Ph.D. @ CS Dept. NJIT
> > > > > Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
> > > > > 101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
> > > > > WWW: http://www.linkedin.com/in/yarik

-- 
 --- Inguza Technology AB --- MSc in Information Technology ----
/  [EMAIL PROTECTED]                    Annebergsslingan 37        \
|  [EMAIL PROTECTED]                    654 65 KARLSTAD            |
|  http://inguza.com/                Mobile: +46 (0)70-332 1551    |
\  gpg/f.p.: 7090 A92B 18FE 7994 0C36 4FE4 18A1 B1CF 0FE5 3DD9     /
 ---------------------------------------------------------------