Package: condor
Version: 7.8.2~dfsg.1-2~nd60+1
Severity: important

I am seeing a transient failure in condor job dependencies. The sentinel jobs created by condor_qsub release dependencies early in some cases. I added logging to the sentinel job scripts and can see the count of "hold_jids" returned by the following line drop abruptly to zero:

    condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l

The sentinel job then releases the held job and exits. As the count of hold_jids is in reality non-zero, those jobs proceed to normal completion, but out of sequence.

The current condor configuration is the neurodebian defaults with these modifications:

DAEMON_LIST = SCHEDD, MASTER
CONDOR_ADMIN=<user>@<domain>
RESERVED_MEMORY = 2048
FILESYSTEM_DOMAIN = vabirc
UID_DOMAIN = <domain>
CONDOR_HOST = <host>.<domain>
COLLECTOR_NAME="VA BIRC Condor Cluster"
ALLOW_WRITE = < list of ipaddresses>
CONDOR_DEVELOPERS = condor-ad...@cs.wisc.edu
CONDOR_DEVELOPERS_COLLECTOR = condor.cs.wisc.edu
BIND_ALL_INTERFACES = FALSE
NETWORK_INTERFACE = <ipaddress>
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
CONTINUE = TRUE
PREEMPT = FALSE
KILL = FALSE
MAIL_FROM=please-do-not-reply@<domain>
SEC_DAEMON_AUTHENTICATION = required
SEC_DAEMON_AUTHENTICATION_METHODS = password
SEC_CLIENT_AUTHENTICATION_METHODS = password,fs,gsi
SEC_PASSWORD_FILE = /var/lib/condor/cred_dir/condor_credential
ALLOW_DAEMON = condor_pool@*
JOB_RENICE_INCREMENT=10
DAEMON_LIST = STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, MASTER

Repeated tests of condor_q within the condor_qsub sentinel jobs allow detection of the transient and continuation of the holds.

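Note that the counting pipeline cannot tell a failed query from an empty queue: in a shell pipeline only the status of the last command (wc) is visible, so the exit status of condor_q is discarded. Below is a minimal sketch of keeping the two cases apart. It assumes that condor_q exits non-zero when the transient occurs, which my logs do not confirm, and count_matching is a hypothetical helper name, not part of condor_qsub.

```shell
#!/bin/sh
# count_matching PATTERN CMD [ARGS...]
# Runs CMD once. If CMD exits 0, prints the number of output lines that
# contain PATTERN (as a fixed string). If CMD fails, prints nothing and
# returns 2, so the caller can retry instead of reading "0 jobs left".
count_matching() {
    pattern=$1
    shift
    # Capture the output first so that CMD's own exit status is not
    # hidden behind grep/wc as it would be in a pipeline.
    output=$("$@") || return 2
    # grep -c prints 0 (and exits 1) when nothing matches; the explicit
    # return keeps "no match" from looking like an error.
    printf '%s\n' "$output" | grep -c -F -- "$pattern"
    return 0
}

# Hypothetical use inside the sentinel loop (unescaped form):
#   if n=$(count_matching "$USER" condor_q -long -attributes Owner $hold_jids); then
#       echo "jobs still queued: $n"
#   else
#       echo "condor_q failed, will retry"
#   fi
```

If condor_q turns out to exit 0 during the transient, this check does not help and only repeated polling remains.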
This patch provides some relief and logging of the problem:

diff --git a/condor_qsub b/condor_qsub
index 078bd0c..fd4d2de 100755
--- a/condor_qsub
+++ b/condor_qsub
@@ -157,9 +157,22 @@ clean_error() {
 }
 
 # as long as there are relevant job in the queue wait and try again
-while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 ]; do
+counter=3
+while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 -o \$counter -ge 1 ]; do
+    job_count=\`condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l\`
+    echo "dep_job: \$dep_job, hold_jids: \$hold_jids, hold_job_count: \$job_count" >> /tmp/sentinel_\$dep_job.log
+    if [ \$job_count -eq 0 ]
+    then
+        counter=\$((counter-1))
+        condor_q \$hold_jids >> /tmp/sentinel_\$dep_job.log
+    else
+        counter=3
+    fi
     sleep 5
 done
 
+echo "exited holds loop. " >> /tmp/sentinel_\$dep_job.log
+condor_q \$dep_job >> /tmp/sentinel_\$dep_job.log
+
 # now check whether all deps have exited properly (i.e. not with code 100)
 for jid in \$hold_jids; do

A test of 6 simultaneous bedpostx runs was carried out on a condor pool of 64 slots spread over 8 nodes. These 6 runs generated 514 jobs, of which 12 were sentinel jobs. The sentinel jobs tested condor_q every 5 seconds, recording 13070 tests. In 18 of these tests condor_q failed to respond correctly. At most two consecutive failures were found. The patch above was able to correct the problem in each case.

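The idea behind the patch above (do not trust a single zero reading; release only after several zero readings in a row) can be sketched on its own as follows. This is an illustration under stated assumptions, not the code in condor_qsub: wait_until_empty is a hypothetical name, the count command is passed in as arguments, and unlike the patch it queries only once per poll.

```shell
#!/bin/sh
# wait_until_empty REQUIRED INTERVAL CMD [ARGS...]
# CMD must print a job count on stdout. Returns once CMD has printed 0
# on REQUIRED consecutive polls; any non-zero reading resets the streak.
wait_until_empty() {
    required=$1
    interval=$2
    shift 2
    zeros=0
    while [ "$zeros" -lt "$required" ]; do
        count=$("$@")
        if [ "$count" -eq 0 ]; then
            zeros=$((zeros + 1))
        else
            zeros=0
        fi
        # Skip the sleep after the final confirming poll.
        if [ "$zeros" -lt "$required" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Hypothetical use, with a wrapper that prints the count from the
# pipeline used in the sentinel script:
#   count_holds() { condor_q -long -attributes Owner $hold_jids | grep "$USER" | wc -l; }
#   wait_until_empty 3 5 count_holds
```

With REQUIRED set to 3 this tolerates the two consecutive failures seen in the test, at the cost of delaying every legitimate release by two extra polling intervals.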
-- System Information:
Debian Release: 6.0.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages condor depends on:
ii  adduser            3.112+nmu2            add and remove users and groups
ii  debconf [debconf-2 1.5.36.1              Debian configuration management sy
ii  libc6              2.11.3-3              Embedded GNU C Library: Shared lib
ii  libclassad3        7.8.2~dfsg.1-2~nd60+1 Condor classads expression languag
ii  libcomerr2         1.41.12-4stable1      common error description library
ii  libcurl3           7.21.0-2.1+squeeze2   Multi-protocol file transfer libra
ii  libdate-manip-perl 6.11-1                module for manipulating dates
ii  libexpat1          2.0.1-7+squeeze1      XML parsing C library - runtime li
ii  libgcc1            1:4.4.5-8             GCC support library
ii  libglobus-callout0 0.7-6                 Globus Toolkit - Globus Callout Li
ii  libglobus-common0  11.5-2                Globus Toolkit - Common Library
ii  libglobus-ftp-cont 2.11-2                Globus Toolkit - GridFTP Control L
ii  libglobus-gass-tra 4.3-2                 Globus Toolkit - Globus Gass Trans
ii  libglobus-gram-cli 10.4-1                Globus Toolkit - GRAM Client Libra
ii  libglobus-gram-pro 9.7-2                 Globus Toolkit - GRAM Protocol Lib
ii  libglobus-gsi-call 2.7-1                 Globus Toolkit - Globus GSI Callba
ii  libglobus-gsi-cert 6.6-1                 Globus Toolkit - Globus GSI Cert U
ii  libglobus-gsi-cred 3.5-1                 Globus Toolkit - Globus GSI Creden
ii  libglobus-gsi-open 0.14-6                Globus Toolkit - Globus OpenSSL Er
ii  libglobus-gsi-prox 4.5-1                 Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-prox 2.3-1                 Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-sysc 3.1-2                 Globus Toolkit - Globus GSI System
ii  libglobus-gss-assi 5.9-1                 Globus Toolkit - GSSAPI Assist lib
ii  libglobus-gssapi-e 2.5-7                 Globus Toolkit - GSSAPI Error Libr
ii  libglobus-gssapi-g 7.5-2                 Globus Toolkit - GSSAPI library
ii  libglobus-io3      6.3-8                 Globus Toolkit - uniform I/O inter
ii  libglobus-openssl- 1.3-1                 Globus Toolkit - Globus OpenSSL Mo
ii  libglobus-rsl2     7.2-2                 Globus Toolkit - Resource Specific
ii  libglobus-xio0     2.8-3                 Globus Toolkit - Globus XIO Framew
ii  libgssapi-krb5-2   1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - k
ii  libk5crypto3       1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - C
ii  libkrb5-3          1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries
ii  libkrb5support0    1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - S
ii  libldap-2.4-2      2.4.23-7.2            OpenLDAP libraries
ii  libltdl7           2.2.6b-2              A system independent dlopen wrappe
ii  libpcre3           8.02-1.1              Perl 5 Compatible Regular Expressi
ii  libssl0.9.8        0.9.8o-4squeeze13     SSL shared libraries
ii  libstdc++6         4.4.5-8               The GNU Standard C++ Library v3
ii  libuuid1           2.17.2-9              Universally Unique ID library
ii  libvirt0           0.8.3-5+squeeze2      library for interfacing with diffe
ii  libxml2            2.7.8.dfsg-2+squeeze5 GNOME XML library
ii  neurodebian-popula 0.28~nd60+1           Helper for NeuroDebian popularity
ii  perl               5.10.1-17squeeze3     Larry Wall's Practical Extraction
ii  python             2.6.6-3+squeeze7      interactive high-level object-orie
ii  zlib1g             1:1.2.3.4.dfsg-3      compression library - runtime

Versions of packages condor recommends:
ii  dmtcp              1.2.5-1~nd60+1        Checkpoint/Restart functionality f

Versions of packages condor suggests:
pn  coop-computing-tools  <none>             (no description available)

-- debconf information excluded