>>>>> On Tue, 23 Aug 2005 14:44:45 +0200, Kern Sibbald <[EMAIL PROTECTED]> said:
Kern> On Tuesday 23 August 2005 13:35, Martin Simmons wrote:
>> >>>>> On Tue, 23 Aug 2005 12:30:45 +0200, Kern Sibbald <[EMAIL PROTECTED]>
>> >>>>> said:
>>
Kern> Hello Volker,
>>
Kern> I've now found the time to look over your debug output below. My
Kern> analysis leads me to believe that what is shown is "impossible". That
Kern> is, the code flow as created in the source code cannot possibly do
Kern> what is indicated in the dump. What is shown in the dump is that the
Kern> subroutine get_next_jcr is recursively called with the same argument
Kern> (not possible). This will almost surely lead to a blocked situation.
>>
Kern> How could this happen? Bad compiler code, an interrupt that happens
Kern> and restarts the stack at the wrong point, memory error (I doubt), ...
>>
>> I doubt that is really happening -- much more likely is that gdb can't
>> understand the stack. Look at the other threads and you'll see that
>> jobq_server appears to call jobq_server!
>>
>> In all these cases, the extra "call" happens where there is a real call to
>> something like pthread_mutex_lock. The pthread library is probably
>> compiled with too much optimization and/or insufficient debug info for gdb
>> to understand the stack inside there.
Kern> Yes, that is the first thing I thought of, but forgot to put it on
Kern> the list.
Kern> However, if that is the case, I cannot explain the hang.
It looks to me like a deadlock caused by get_next_jcr() locking the mutex in
the jcr. I see that the latest code just locks the jcr chain instead, so
hopefully that fixes it.
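
A minimal sketch in C of the kind of lock-order conflict described above. Everything in it is an assumption made for illustration: chain_lock, entry_lock and the three functions are hypothetical stand-ins, not Bacula's actual code, and plain mutexes are used where the Director uses its own rwlock. One thread holds a per-entry mutex (as a job thread might) while a second thread takes the chain lock and then asks for the same entry mutex. pthread_mutex_trylock is used so the conflict is reported as EBUSY instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins, not Bacula's real data structures. */
static pthread_mutex_t chain_lock = PTHREAD_MUTEX_INITIALIZER; /* guards the list  */
static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER; /* guards one entry */

/* Old-style walker: takes the chain lock, then also wants the entry lock.
 * trylock stands in for lock so the sketch returns instead of blocking.
 * Stores 0 or an errno value through arg. */
static void *walk_with_entry_lock(void *arg)
{
    int *result = arg;
    pthread_mutex_lock(&chain_lock);
    *result = pthread_mutex_trylock(&entry_lock);
    if (*result == 0) {
        pthread_mutex_unlock(&entry_lock);
    }
    pthread_mutex_unlock(&chain_lock);
    return NULL;
}

/* New-style walker: the chain lock alone protects the traversal. */
static void *walk_with_chain_lock_only(void *arg)
{
    int *result = arg;
    *result = pthread_mutex_lock(&chain_lock);
    if (*result == 0) {
        pthread_mutex_unlock(&chain_lock);
    }
    return NULL;
}

/* Run one walker while the calling thread holds the entry lock.
 * Returns what the walker observed, or -1 if no thread could be started. */
static int run_walker_while_entry_held(void *(*walker)(void *))
{
    pthread_t tid;
    int result = -1;

    pthread_mutex_lock(&entry_lock);
    if (pthread_create(&tid, NULL, walker, &result) != 0) {
        pthread_mutex_unlock(&entry_lock);
        return -1;
    }
    pthread_join(tid, NULL);
    pthread_mutex_unlock(&entry_lock);
    return result;
}
```

Calling run_walker_while_entry_held(walk_with_entry_lock) returns EBUSY, which is the point where a blocking pthread_mutex_lock would wait. If the thread holding the entry lock then also waited for the chain lock, neither thread could proceed. run_walker_while_entry_held(walk_with_chain_lock_only) returns 0 because the walker never touches the entry lock.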
__Martin
>>
>> __Martin
>>
>> >> From what I see there is very little I can do.
>>
Kern> I've marked the place in the dump below where it is going wrong --
Kern> Thread 3, stack levels 8 and 9.
>>
Kern> On Friday 29 July 2005 23:31, Volker Sauer wrote:
>> >> On Fr, 29 Jul 2005, Kern Sibbald <[EMAIL PROTECTED]> wrote:
>> >> > What I see from this is that everything in the Director is normal.
>> >> > It thinks that something like 5 jobs are running. The threads are
>> >> > all waiting on input from one of the other daemons, and there is no
>> >> > mutex deadlock situation. So, if everything is locked up, I suspect
>> >> > the problem is in one of the other daemons.
>> >> >
>> >> > I recommend when it is in this state to do a "status" on all the
>> >> > Clients and on the SD and see if there is anything interesting going
>> >> > on. Perhaps that will tell us the right place to point the debugger.
>> >>
>> >> Again, the director locked. This time it locked up at the first job
>> >> (Client Conc. Jobs = 1) and I was *not* able to connect with bconsole.
>> >> Therefore I couldn't get the status from sd or the clients.
>> >>
>> >> This is what gdb of bacula-dir says:
>> >>
>> >>
>> >> (gdb) run -s -f -c /etc/bacula/bacula-dir.conf
>> >> The program being debugged has been started already.
>> >> Start it from the beginning? (y or n) y
>> >> Starting program: /usr/sbin/bacula-dir -s -f -c
>> >> /etc/bacula/bacula-dir.conf
>> >> [Thread debugging using libthread_db enabled]
>> >> [New Thread 1078020896 (LWP 29834)]
>> >> [New Thread 1086450608 (LWP 29837)]
>> >> [New Thread 1094839216 (LWP 29838)]
>> >> [New Thread 1103227824 (LWP 29857)]
>> >> backup-dir: dird.c:438 Director's configuration file reread.
>> >> [Thread 1103227824 (LWP 29857) exited]
>> >>
>> >> [New Thread 1103227824 (LWP 30275)]
>> >> backup-dir: dird.c:438 Director's configuration file reread.
>> >> [Thread 1103227824 (LWP 30275) exited]
>> >> [New Thread 1103227824 (LWP 30574)]
>> >> [New Thread 1111620528 (LWP 30575)]
>> >> [New Thread 1120074672 (LWP 30577)]
>> >> [New Thread 1128463280 (LWP 30578)]
>> >> [New Thread 1136851888 (LWP 30580)]
>> >> [New Thread 1145240496 (LWP 30581)]
>> >> [New Thread 1153629104 (LWP 30582)]
>> >> [New Thread 1162017712 (LWP 30644)]
>> >>
>> >> Program received signal SIGINT, Interrupt.
>> >> [Switching to Thread 1078020896 (LWP 29834)]
>> >> 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> (gdb) thread apply all bt
>> >>
>> >> Thread 13 (Thread 1162017712 (LWP 30644)):
>> >> #0 0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> >> /lib/tls/libpthread.so.0
>> >> #1 0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
>> >> #2 0x0808c8d2 in lock_jcr_chain () at jcr.c:544
>> >> #3 0x0808bd56 in new_jcr (size=1162017184,
>> >> daemon_free_jcr=0xfffffffc) at jcr.c:218
>> >> #4 0x0807458c in new_control_jcr (base_name=0xfffffffc <Address
>> >> 0xfffffffc out of bounds>, job_type=-4)
>> >> at ua_server.c:90
>> >> #5 0x0807468e in handle_UA_client_request (arg=0x80e9d60) at
>> >> ua_server.c:122
>> >> #6 0x0809e4db in workq_server (arg=0x80c5920) at workq.c:347
>> >> #7 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #8 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 12 (Thread 1153629104 (LWP 30582)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x080c5b80 in jobs ()
>> >> #3 0x00000000 in ?? ()
>> >> #4 0x00000001 in ?? ()
>> >> #5 0x00000001 in ?? ()
>> >> #6 0x00000000 in ?? ()
>> >> #7 0x44c2fad8 in ?? ()
>> >> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 11 (Thread 1145240496 (LWP 30581)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x080c5b80 in jobs ()
>> >> #3 0x00000000 in ?? ()
>> >> #4 0x00000001 in ?? ()
>> >> #5 0x00000001 in ?? ()
>> >> #6 0x00000000 in ?? ()
>> >> #7 0x4442fad8 in ?? ()
>> >> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 10 (Thread 1136851888 (LWP 30580)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x080c5b80 in jobs ()
>> >> #3 0x00000000 in ?? ()
>> >> #4 0x00000001 in ?? ()
>> >> #5 0x00000001 in ?? ()
>> >> #6 0x00000000 in ?? ()
>> >> #7 0x43c2fad8 in ?? ()
>> >> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 9 (Thread 1128463280 (LWP 30578)):
>> >> #0 0x401a4295 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> >> /lib/tls/libpthread.so.0
>> >> #1 0x080959fc in rwl_writelock (rwl=0x80c5b80) at rwlock.c:231
>> >> #2 0x0808c8d2 in lock_jcr_chain () at jcr.c:544
>> >> #3 0x0805bea4 in jobq_server (arg=0x80c57a0) at jobq.c:582
>> >> #4 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #5 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 8 (Thread 1120074672 (LWP 30577)):
>> >> #0 0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
>> >> #1 0x08084d4c in read_nbytes (bsock=0x80e1140, ptr=0x42c2f82c "@",
>> >> nbytes=4) at bnet.c:72
>> >> #2 0x08085067 in bnet_recv (bsock=0x80e1140) at bnet.c:175
>> >> #3 0x08055d88 in bget_dirmsg (bs=0x80e1140) at getmsg.c:79
>> >> #4 0x0805e508 in msg_thread (arg=0x80dcc48) at msgchan.c:235
>> >> #5 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #6 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 7 (Thread 1111620528 (LWP 30575)):
>> >> #0 0x401a66a1 in __read_nocancel () from /lib/tls/libpthread.so.0
>> >> #1 0x08084d4c in read_nbytes (bsock=0x80e5f20,
>> >> ptr=0x4241f08c "9Q\b\bHÌ\r\b _\016\bXòAB\210]\005\b
>> >> [EMAIL PROTECTED]<@[EMAIL PROTECTED]
>> >> ", nbytes=4) at bnet.c:72
>> >> #2 0x08085067 in bnet_recv (bsock=0x80e5f20) at bnet.c:175
>> >> #3 0x08055d88 in bget_dirmsg (bs=0x80e5f20) at getmsg.c:79
>> >> #4 0x0804daf8 in wait_for_job_termination (jcr=0x80dcc48) at
>> >> backup.c:243
>> >> #5 0x0804da23 in do_backup (jcr=0x80dcc48) at backup.c:207
>> >> #6 0x08058946 in job_thread (arg=0x80dcc48) at job.c:215
>> >> #7 0x0805c08a in jobq_server (arg=0x80c57a0) at jobq.c:444
>> >> #8 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #9 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 6 (Thread 1103227824 (LWP 30574)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x080c5b80 in jobs ()
>> >> #3 0x080c70b8 in ?? ()
>> >> #4 0x00000001 in ?? ()
>> >> #5 0x00000001 in ?? ()
>> >> #6 0x00000000 in ?? ()
>> >> #7 0x41c1ead8 in ?? ()
>> >> #8 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #9 0x0805b982 in jobq_server (arg=0x80c57a0) at jobq.c:675
>> >> #10 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #11 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 3 (Thread 1094839216 (LWP 29838)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x080c5b80 in jobs ()
>> >> #3 0x00000000 in ?? ()
>> >> #4 0x00000000 in ?? ()
>> >> #5 0x080e8f50 in ?? ()
>> >> #6 0x080e8f60 in ?? ()
>> >> #7 0x4141ea58 in ?? ()
>> >> #8 0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
>> >> #9 0x0808c9a8 in get_next_jcr (prev_jcr=0x80c5b80) at jcr.c:581
>>
Kern> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kern> Recursive call -- not in source code.
>>
>> >> #10 0x080590c8 in job_monitor_watchdog (self=0x80c5b80) at job.c:386
>> >> #11 0x0809dad6 in watchdog_thread (arg=0x0) at watchdog.c:257
>> >> #12 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #13 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 2 (Thread 1086450608 (LWP 29837)):
>> >> #0 0x4036ca27 in select () from /lib/tls/libc.so.6
>> >> #1 0x080877e0 in bnet_thread_server (addrs=0x40c1eb90,
>> >> max_clients=-514, client_wq=0x80c5920,
>> >> handle_client_request=0xfffffdfe) at bnet_server.c:154
>> >> #2 0x08074569 in connect_thread (arg=0xfffffdfe) at ua_server.c:79
>> >> #3 0x401a1b63 in start_thread () from /lib/tls/libpthread.so.0
>> >> #4 0x4037318a in clone () from /lib/tls/libc.so.6
>> >>
>> >> Thread 1 (Thread 1078020896 (LWP 29834)):
>> >> #0 0x401a6436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> >> #1 0x401a3893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
>> >> #2 0x00000006 in ?? ()
>> >> #3 0x00000069 in ?? ()
>> >> #4 0x00000005 in ?? ()
>> >> #5 0x000000d1 in ?? ()
>> >> #6 0xffffffff in ?? ()
>> >> #7 0x080e8f50 in ?? ()
>> >> #8 0xbffff958 in ?? ()
>> >> #9 0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0x0) at jobq.c:240
>> >> #10 0x0805afdb in jobq_add (jq=0x80c57a0, jcr=0xffffffff) at jobq.c:240
>> >> #11 0x080585fb in run_job (jcr=0x80e8f50) at job.c:140
>> >> #12 0x0804b376 in main (argc=135171920, argv=0x80a0a58) at dird.c:241
>> >>
>> >> I could run bacula-sd and bacula-fd on the client paris (at which
>> >> usually the jobs stop) under the gdb, too (now, that I have the debug
>> >> binaries available).
>> >>
>> >> Regards
>> >> Volker
>>
Kern> --
Kern> Best regards,
>>
Kern> Kern
>>
Kern> (">
Kern> /\
Kern> V_V
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users