Hi Yaroslav

Thanks a lot for your help!

Do I understand correctly that we are in fact missing a call to
SmartScheduleStopTimer ()?
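
For discussion, here is a rough sketch of what pausing the timer around the
fork could look like. It assumes the smart scheduler timer is an ITIMER_REAL
interval timer delivering SIGALRM (which the strace output suggests but I have
not verified in the source), and fork_with_timer_paused is a made-up helper
name, not an existing function in Xvnc:

```c
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Xvnc): call fork() with ITIMER_REAL
 * disarmed, so that no SIGALRM can arrive and force the underlying
 * clone() to be restarted while the fork is in progress. */
static pid_t
fork_with_timer_paused(void)
{
    struct itimerval off, saved;
    pid_t pid;
    int olderrno;

    /* An all-zero value disarms the timer; the previous setting
     * (remaining time and interval) is returned in 'saved'. */
    memset(&off, 0, sizeof(off));
    if (setitimer(ITIMER_REAL, &off, &saved) < 0)
        return -1;

    pid = fork();

    /* Interval timers are not inherited by the child, so only the
     * parent (on success or on failure of fork) re-arms the timer. */
    if (pid != 0) {
        olderrno = errno;
        setitimer(ITIMER_REAL, &saved, NULL);
        errno = olderrno;
    }
    return pid;
}
```

Popen () would then call this helper where it calls fork () today. Blocking
SIGALRM with sigprocmask around the fork might work as well, but the child
would then have to unblock it again, so disarming the timer seemed simpler for
a sketch.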

Best regards,

// Ola

On Fri, May 09, 2008 at 04:24:21PM -0400, Yaroslav Halchenko wrote:
> Hi Ola
> 
> It is me again. I am emailing just to have a record of my possibly fruitless
> findings... actually, by the time I finished I had resolved it for myself!!!!
> So it might indeed be informative!
> 
> So, from the beginning:
> it hung again and I am trying to debug it once again.
> 
> backtrace is
> (gdb) bt
> #0  0x00002ae840181ee2 in __libc_fork () from /usr/lib/debug/libc.so.6
> #1  0x000000000043cd90 in Popen ()
> #2  0x000000000043e884 in LoadAuthorization ()
> #3  0x000000000043ea76 in CheckAuthorization ()
> #4  0x0000000000439a25 in ClientAuthorized ()
> #5  0x000000000041e396 in ProcEstablishConnection ()
> #6  0x0000000000424672 in Dispatch ()
> #7  0x000000000040b145 in main ()
> 
> though it is weird, since it hung right in the middle of work and I didn't
> try to authenticate (maybe someone else did???)...
> In the .log file I have the line inserted by us:
> Popen called with command='cat /home/yoh/.Xauthority' type='r' as arguments.
> and the mtime of the log file is right around the point when it hung, so I
> guess it is the right one
> 
> symbols  pointed to ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
> 
> I have downloaded the sources and unpacked them, but that fork.c is pretty
> much just an include of ../fork.c (and I also had to ln -s sysv to sysdeps),
> and gdb is too silly (or I am) to look there...
> 
> so now I get
> (gdb) l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c 
> has 31 lines
> 
> when I go manually to __libc_fork in that fork.c, I see 2 possible
> causes of an infinite loop:
> 
> 
>   while ((runp = __fork_handlers) != NULL)
>     {
>       unsigned int oldval = runp->refcntr;
> 
>       if (oldval == 0)
>       /* This means some other thread removed the list just after
>        the pointer has been loaded.  Try again.  Either the list
>        is empty or we can retry it.  */
>               continue;
> 
>       /* Bump the reference counter.  */
>       if (atomic_compare_and_exchange_bool_acq (&__fork_handlers->refcntr,
>                         oldval + 1, oldval))
>               /* The value changed, try again.  */
>               continue;
> 
> 
> 1. so if oldval stays 0, I am doomed
> 2. atomic_compare_and_exchange_bool_acq ... not sure
> 
> unfortunately I can't print out any of the variables (like oldval)
> 
> nexti goes through SmartScheduleTimer(), and I have no clue what it is
> about.... ha -- actually it is from vncserver!
> then it escapes:
> 
> 0x000000000043c545 in SmartScheduleTimer ()
> 0x00002ae84011f110 in __restore_rt () from /usr/lib/debug/libc.so.6
> 0x00002ae84011f117 in __restore_rt () from /usr/lib/debug/libc.so.6
> 0x00002ae840181ee0 in __libc_fork () from /usr/lib/debug/libc.so.6
> 
> unfortunately I am still out of luck in printing anything
> (gdb) l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c 
> has 31 lines.
> (gdb) p oldval
> No symbol "oldval" in current context.
> 
> ok -- if I go in full through the 'loop' with nexti I get
> 
> So I guess I am out of luck on the first condition (although who knows what
> tricks optimization played on me)... ok, and here is the source of that
> SmartScheduleTimer
> 
> void
> SmartScheduleTimer (int sig)
> {
>     int olderrno = errno;
> 
>     SmartScheduleTime += SmartScheduleInterval;
>     if (SmartScheduleIdle)
>     {
>         SmartScheduleStopTimer ();
>     }
>     errno = olderrno;
> }
> 
> I wonder how it interacts with WaitForSomething. That beast is filled with
> #ifdefs, so it is barely comprehensible, but it is the only place that could
> trigger SmartScheduleIdle (or maybe I missed some other), and I am not sure
> how scheduling and switching are done, so it is not clear to me how it could
> ever be reset.
> And my knowledge and brain are somewhat far from comprehending
> sysdeps/unix/sysv/linux/x86_64/sigaction.c and __restore_rt
> 
> but ok -- let's see a bit more of
> SmartScheduleTimer
> 
> 0x000000000043c52c <SmartScheduleTimer+44>:     test   %esi,%esi
> 0x000000000043c52e <SmartScheduleTimer+46>:     je     0x43c535 <SmartScheduleTimer+53>
> 0x000000000043c530 <SmartScheduleTimer+48>:     callq  0x43c3f0 <SmartScheduleStopTimer>
> 0x000000000043c535 <SmartScheduleTimer+53>:     mov    %ebp,(%rbx)
> 0x000000000043c537 <SmartScheduleTimer+55>:     mov    0x8(%rsp),%rbx
> and we step around 530:
> 0x000000000043c52e in SmartScheduleTimer ()
> 0x000000000043c535 in SmartScheduleTimer ()
> so for sure we are not calling SmartScheduleStopTimer ;)
> let's do it manually:
> 
> (gdb) call SmartScheduleStopTimer
> +call SmartScheduleStopTimer
> $1 = {<text variable, no debug info>} 0x43c3f0 <SmartScheduleStopTimer>
> 
> *(gdb) call SmartScheduleStopTimer()
> +call SmartScheduleStopTimer()
> Reading in symbols for ../sysdeps/x86_64/elf/start.S...done.
> $2 = 0
> (gdb) nexti
> +nexti
> Detaching after fork from child process 23194.
> 0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
> 
> ha -- some effect... let's see
> 
> (gdb) c
> +c
> Continuing.
> 
> Program received signal SIGPIPE, Broken pipe.
> 0x00002ae8401ac3e2 in __write_nocancel () from /usr/lib/debug/libc.so.6
> 
> (gdb) c
> +c
> Continuing.
> 
> but we are still on the hook -- 100% CPU, and in the same state after I
> press Ctrl-C
> 
> ok -- doing the same call to SmartScheduleStopTimer and then
> stepping, which might be informative:
> 
> (gdb)  call SmartScheduleStopTimer()
> +call SmartScheduleStopTimer()
> $3 = 0
> (gdb) nexti
> +nexti
> Detaching after fork from child process 23893.
> 0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
> (gdb) n
> +n
> Single stepping until exit from function __libc_fork, 
> which has no line number information.
> Reading in symbols for genops.c...done.
> Reading in symbols for malloc.c...done.
> 0x000000000043cd90 in Popen ()
> (gdb) n
> +n
> Single stepping until exit from function Popen, 
> which has no line number information.
> 0x000000000043e884 in LoadAuthorization ()
> (gdb) l
> +l
> Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c 
> has 31 lines.
> (gdb) n
> +n
> Single stepping until exit from function LoadAuthorization, 
> which has no line number information.
> 
> 0x000000000043ea76 in CheckAuthorization ()
> (gdb) 
> +n
> Single stepping until exit from function CheckAuthorization, 
> which has no line number information.
> 0x0000000000439a25 in ClientAuthorized ()
> (gdb) 
> +n
> Single stepping until exit from function ClientAuthorized, 
> which has no line number information.
> 0x000000000041e396 in ProcEstablishConnection ()
> (gdb) 
> +n
> Single stepping until exit from function ProcEstablishConnection, 
> which has no line number information.
> 0x000000000041e0d0 in SendConnSetup ()
> (gdb) 
> +n
> Single stepping until exit from function SendConnSetup, 
> which has no line number information.
> 0x0000000000424672 in Dispatch ()
> (gdb) 
> +n
> Single stepping until exit from function Dispatch, 
> which has no line number information.
> 
> 
> BOY -- now my VNC is reacting!!!! sluggish but working... let's try to detach
> 
> 
> Quit
> (gdb) detach
> +detach
> Detaching from program: /usr/bin/Xvnc4, process 2394
> 
> 
> 
> and I am again in the working VNC!!!! uff ;-))))
> 
> 
> 
> 
> 
> On Mon, 28 Apr 2008, Ola Lundqvist wrote:
> 
> > Hi again
> 
> > On Mon, Apr 28, 2008 at 03:28:06PM -0400, Yaroslav Halchenko wrote:
> > > > > I'm not perfectly sure, but what I suspect is the problem is that the
> > > > > number of open files, open sockets, processes or something similar has
> > > > > reached its limit.
> > > > The reason is that you get ERESTARTNOINTR.
> > > thanks for sharing the knowledge ;-) I guess I just need to figure out
> > > how to monitor all the resources from a single point...
> 
> > ::)
> 
> > > > Have you seen this on several systems or just one?
> > > > unfortunately I use VNC primarily on that one box, thus I didn't see it
> > > > anywhere else. If only we could figure out the loop where it gets to
> > > > 100%, I guess I could figure out what rejection it gets (i.e. what
> > > > resource is the problem)
> 
> > To me it seems more like you have a really problematic libc or kernel,
> > because I see from the information you have provided that you can get this
> > problem in quite a few situations.
> 
> > Are you sure that you do not have a broken installation like buggy kernel
> > or libc?
> 
> > I mean it should not really hang in fork...
> 
> > Best regards,
> 
> > // Ola
> 
> > > > Best regards,
> 
> > > > // Ola
> 
> > > > > > Sorry for being so anal... stalled once again today. From gdb, it is
> > > > > > now at fork and never actually exits it :-/ If someone could build it with 
> 
> > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > (gdb) bt
> > > > > #0  0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > #1  0x000000000043cd90 in Popen ()
> > > > > #2  0x000000000043e884 in LoadAuthorization ()
> > > > > #3  0x000000000043ea76 in CheckAuthorization ()
> > > > > #4  0x0000000000439a25 in ClientAuthorized ()
> > > > > #5  0x000000000041e396 in ProcEstablishConnection ()
> > > > > #6  0x0000000000424672 in Dispatch ()
> > > > > #7  0x000000000040b145 in main ()
> > > > > (gdb) finish
> > > > > > Run till exit from #0  0x00002b68df98cee2 in fork () from /lib/libc.so.6
> 
> > > > > Program received signal SIGINT, Interrupt.
> > > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > > (gdb) bt
> > > > > #0  0x00002b68df98cee2 in fork () from /lib/libc.so.6
> 
> > > > > strace was busy with
> > > > > > 14892 rt_sigreturn(0xe)                 = 56
> > > > > > 14892 clone(child_stack=0, 
> > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > 14892 rt_sigreturn(0xe)                 = 56
> > > > > > 14892 clone(child_stack=0, 
> > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > 14892 rt_sigreturn(0xe)                 = 56
> > > > > > 14892 clone(child_stack=0, 
> > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe)                 = 56
> > > > > 14892 clone(child_stack=0, 
> > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > 14892 rt_sigreturn(0xe)                 = 56
> > > > > 14892 clone(child_stack=0, 
> > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> 
> 
> > > > > > It would be so great if there were a vnc4server-dbg ;-)))
> 
> > > > > BTW -- last line in .log was due to our inserted debug line
> > > > > Popen called with command='cat /home/yoh/.Xauthority' type='r' as 
> > > > > arguments
> 
> > > > > > but I am not sure whether that wasn't from the original login
> > > > > > earlier in the morning
> 
> 
> > > > > On Mon, 21 Apr 2008, Ola Lundqvist wrote:
> > > > > > > > stracing was showing lots of gettimeofday or whatever that
> > > > > > > > syscall is. Today it was different:
> > > > > > > 21162 rt_sigreturn(0xe)                 = 56
> > > > > > > 21162 clone(child_stack=0, 
> > > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > > > child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > > 21162 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > > 21162 rt_sigreturn(0xe)                 = 56
> > > > > > > 21162 clone(child_stack=0, 
> > > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
> > > > > > > child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > > ...
> > > > > > > Hmm. To me it looks like we are out of resources...
> 
> > > > > -- 
> > > > > Yaroslav Halchenko
> > > > > Research Assistant, Psychology Department, Rutgers-Newark
> > > > > Student  Ph.D. @ CS Dept. NJIT
> > > > > Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
> > > > >         101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
> > > > > WWW:     http://www.linkedin.com/in/yarik        

-- 
 --- Inguza Technology AB --- MSc in Information Technology ----
/  [EMAIL PROTECTED]                    Annebergsslingan 37        \
|  [EMAIL PROTECTED]                   654 65 KARLSTAD            |
|  http://inguza.com/                Mobile: +46 (0)70-332 1551 |
\  gpg/f.p.: 7090 A92B 18FE 7994 0C36 4FE4 18A1 B1CF 0FE5 3DD9  /
 ---------------------------------------------------------------


