Package: condor
Version: 7.8.2~dfsg.1-2~nd60+1
Severity: important

I am seeing a transient failure in condor job dependencies.  The sentinel jobs
created by condor_qsub release dependencies early in some cases.  I added
logging to the sentinel job scripts and can see the count of "hold_jids"
returned by the following line drop abruptly to zero:

  condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l

The sentinel job then releases the held job and exits.  As the count of
hold_jids is in reality non-zero, those jobs proceed to normal completion, but
out of sequence.

The current condor configuration is neurodebian defaults with these
modifications:

DAEMON_LIST = SCHEDD, MASTER
CONDOR_ADMIN=<user>@<domain>
RESERVED_MEMORY = 2048
FILESYSTEM_DOMAIN = vabirc
UID_DOMAIN = <domain>
CONDOR_HOST = <host>.<domain>
COLLECTOR_NAME="VA BIRC Condor Cluster"
ALLOW_WRITE = < list of ipaddresses>
CONDOR_DEVELOPERS = condor-ad...@cs.wisc.edu
CONDOR_DEVELOPERS_COLLECTOR = condor.cs.wisc.edu
BIND_ALL_INTERFACES = FALSE
NETWORK_INTERFACE =  <ipaddress>
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
CONTINUE = TRUE
PREEMPT = FALSE
KILL = FALSE
MAIL_FROM=please-do-not-reply@<domain>
SEC_DAEMON_AUTHENTICATION = required
SEC_DAEMON_AUTHENTICATION_METHODS = password
SEC_CLIENT_AUTHENTICATION_METHODS = password,fs,gsi
SEC_PASSWORD_FILE = /var/lib/condor/cred_dir/condor_credential
ALLOW_DAEMON = condor_pool@*
JOB_RENICE_INCREMENT=10
DAEMON_LIST = STARTD, SCHEDD, COLLECTOR, NEGOTIATOR, MASTER

Repeating the condor_q test within the condor_qsub sentinel jobs allows the
transient to be detected and the holds to be kept in place.  This patch
provides some relief and logs the problem:

diff --git a/condor_qsub b/condor_qsub
index 078bd0c..fd4d2de 100755
--- a/condor_qsub
+++ b/condor_qsub
@@ -157,9 +157,22 @@ clean_error() {
 }

 # as long as there are relevant job in the queue wait and try again
-while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 ]; do
+counter=3
+while [ \$(condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l) -ge 1 -o \$counter -ge 1 ]; do
+        job_count=\`condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l\`
+        echo "dep_job: \$dep_job, hold_jids: \$hold_jids, hold_job_count: \$job_count"  >> /tmp/sentinel_\$dep_job.log
+        if [ \$job_count -eq 0 ]
+        then
+          counter=\$((counter-1))
+          condor_q \$hold_jids  >> /tmp/sentinel_\$dep_job.log
+        else
+          counter=3
+        fi
        sleep 5
 done
+echo "exited holds loop.  " >> /tmp/sentinel_\$dep_job.log
+condor_q \$dep_job  >> /tmp/sentinel_\$dep_job.log
+

 # now check whether all deps have exited properly (i.e. not with code 100)
 for jid in \$hold_jids; do


A test of 6 simultaneous bedpostx runs was performed on a condor pool composed
of 64 slots spread over 8 nodes.  These 6 runs generated 514 jobs, of which 12
were sentinel jobs.  The sentinel jobs tested condor_q every 5 seconds, for a
total of 13070 recorded tests.  In 18 of these tests (roughly 0.14%) condor_q
failed to respond correctly.  At most two consecutive failures were observed.
The patch above was able to correct the problem in each case.
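
For reference, the idea behind the patch (only believe an empty queue after
several consecutive zero readings) can be sketched as a standalone POSIX sh
function.  This is a minimal sketch, not the patch itself: the function name
wait_for_empty and the count_cmd hook are hypothetical and do not exist in
condor_qsub.  count_cmd stands for any command that prints the number of
dependency jobs still queued, such as the condor_q pipeline quoted above.

```shell
#!/bin/sh
# Sketch: treat the queue as empty only after several consecutive zero
# readings, so one bad answer from the query cannot release the hold.
# wait_for_empty and count_cmd are hypothetical names for illustration.
wait_for_empty() {
    count_cmd=$1        # command that prints the current job count
    required=${2:-3}    # consecutive zero readings needed to stop
    interval=${3:-5}    # seconds to sleep between polls
    zeros=0
    while [ "$zeros" -lt "$required" ]; do
        # strip whitespace so padded "wc -l" output still compares cleanly
        count=$($count_cmd | tr -d '[:space:]')
        if [ "$count" = 0 ]; then
            zeros=$((zeros + 1))
        else
            # non-zero, empty, or garbled output: assume jobs remain
            zeros=0
        fi
        if [ "$zeros" -lt "$required" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

With the default of three readings, the at-most-two consecutive failures seen
in the test above would not be enough to release a hold.  Empty or unparsable
output is deliberately counted as "jobs still present", which errs on the side
of holding longer.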



-- System Information:
Debian Release: 6.0.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages condor depends on:
ii  adduser            3.112+nmu2            add and remove users and groups
ii  debconf [debconf-2 1.5.36.1              Debian configuration management sy
ii  libc6              2.11.3-3              Embedded GNU C Library: Shared lib
ii  libclassad3        7.8.2~dfsg.1-2~nd60+1 Condor classads expression languag
ii  libcomerr2         1.41.12-4stable1      common error description library
ii  libcurl3           7.21.0-2.1+squeeze2   Multi-protocol file transfer libra
ii  libdate-manip-perl 6.11-1                module for manipulating dates
ii  libexpat1          2.0.1-7+squeeze1      XML parsing C library - runtime li
ii  libgcc1            1:4.4.5-8             GCC support library
ii  libglobus-callout0 0.7-6                 Globus Toolkit - Globus Callout Li
ii  libglobus-common0  11.5-2                Globus Toolkit - Common Library
ii  libglobus-ftp-cont 2.11-2                Globus Toolkit - GridFTP Control L
ii  libglobus-gass-tra 4.3-2                 Globus Toolkit - Globus Gass Trans
ii  libglobus-gram-cli 10.4-1                Globus Toolkit - GRAM Client Libra
ii  libglobus-gram-pro 9.7-2                 Globus Toolkit - GRAM Protocol Lib
ii  libglobus-gsi-call 2.7-1                 Globus Toolkit - Globus GSI Callba
ii  libglobus-gsi-cert 6.6-1                 Globus Toolkit - Globus GSI Cert U
ii  libglobus-gsi-cred 3.5-1                 Globus Toolkit - Globus GSI Creden
ii  libglobus-gsi-open 0.14-6                Globus Toolkit - Globus OpenSSL Er
ii  libglobus-gsi-prox 4.5-1                 Globus Toolkit - Globus GSI Proxy 
ii  libglobus-gsi-prox 2.3-1                 Globus Toolkit - Globus GSI Proxy 
ii  libglobus-gsi-sysc 3.1-2                 Globus Toolkit - Globus GSI System
ii  libglobus-gss-assi 5.9-1                 Globus Toolkit - GSSAPI Assist lib
ii  libglobus-gssapi-e 2.5-7                 Globus Toolkit - GSSAPI Error Libr
ii  libglobus-gssapi-g 7.5-2                 Globus Toolkit - GSSAPI library
ii  libglobus-io3      6.3-8                 Globus Toolkit - uniform I/O inter
ii  libglobus-openssl- 1.3-1                 Globus Toolkit - Globus OpenSSL Mo
ii  libglobus-rsl2     7.2-2                 Globus Toolkit - Resource Specific
ii  libglobus-xio0     2.8-3                 Globus Toolkit - Globus XIO Framew
ii  libgssapi-krb5-2   1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - k
ii  libk5crypto3       1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - C
ii  libkrb5-3          1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries
ii  libkrb5support0    1.8.3+dfsg-4squeeze6  MIT Kerberos runtime libraries - S
ii  libldap-2.4-2      2.4.23-7.2            OpenLDAP libraries
ii  libltdl7           2.2.6b-2              A system independent dlopen wrappe
ii  libpcre3           8.02-1.1              Perl 5 Compatible Regular Expressi
ii  libssl0.9.8        0.9.8o-4squeeze13     SSL shared libraries
ii  libstdc++6         4.4.5-8               The GNU Standard C++ Library v3
ii  libuuid1           2.17.2-9              Universally Unique ID library
ii  libvirt0           0.8.3-5+squeeze2      library for interfacing with diffe
ii  libxml2            2.7.8.dfsg-2+squeeze5 GNOME XML library
ii  neurodebian-popula 0.28~nd60+1           Helper for NeuroDebian popularity 
ii  perl               5.10.1-17squeeze3     Larry Wall's Practical Extraction 
ii  python             2.6.6-3+squeeze7      interactive high-level object-orie
ii  zlib1g             1:1.2.3.4.dfsg-3      compression library - runtime

Versions of packages condor recommends:
ii  dmtcp                     1.2.5-1~nd60+1 Checkpoint/Restart functionality f

Versions of packages condor suggests:
pn  coop-computing-tools          <none>     (no description available)

-- debconf information excluded

