On Sun, 2010-09-12 at 22:30 +0200, Vincent Danjean wrote:
> On 12/09/2010 12:43, Arthur de Jong wrote:
> > First of all, it is recommended to use an IP address for your LDAP
> > server or at least something that can be locally resolved.
> > Otherwise, if your DNS server is unavailable your LDAP server will
> > also be unavailable.
> 
> It is a workaround and perhaps a better configuration. But, as all my
> servers are VMs on one host, it is very rare that one works but
> not the others...

Another workaround is to put the LDAP server in /etc/hosts. That is
slightly nicer, and it also speeds things up when DNS is unavailable
because the OpenLDAP library does address-to-hostname lookups in some
cases.
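
As a rough illustration of that workaround, the Python sketch below
parses hosts-format text and checks whether a name can be resolved from
it alone. The address 192.0.2.10 and the name ldap.example.com are
placeholders, not values from this thread, and parse_hosts() and
locally_resolvable() are hypothetical helpers, not part of nslcd or
OpenLDAP.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-format text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address listed for a name.
            table.setdefault(name.lower(), address)
    return table


# Placeholder entries (assumption): substitute the real LDAP server here.
SAMPLE = """
127.0.0.1   localhost
192.0.2.10  ldap.example.com ldap   # LDAP server pinned locally
"""


def locally_resolvable(name, hosts_text=SAMPLE):
    """True if the name has an entry in the given hosts-format text."""
    return name.lower() in parse_hosts(hosts_text)
```

To check a real system, the contents of /etc/hosts could be passed as
hosts_text instead of SAMPLE.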

> You miss my point: I got both kinds of answers several times, in
> any order. Getting the answer once does not mean that the next
> one will be ok (and even when I got the answer, it took *very* long
> (several seconds)). And I suspect (not tested) that if all nslcd
> threads try to answer when resolv.conf is wrong, then I would never
> get any answer once resolv.conf is good.

I think I understand now: you get failures mixed with successes right
after resolv.conf becomes valid. Also, once resolv.conf is correct,
about one in five requests keeps returning failures.

I can reproduce this in my test environment, which is good. The bad part
is that I haven't narrowed it down yet. It could be a bug in OpenLDAP,
which may be caching some hostname information somewhere, but I think it
is more likely a bug in glibc, which doesn't correctly reload
/etc/resolv.conf in threaded applications.

From what I gathered with strace, some tests are in place to reload
/etc/resolv.conf if the file has changed (at least a stat() is done).
The file is loaded twice the first time, and twice again after it has
been modified, but I don't think the thread that originally failed
reloads it once another thread has already reloaded it. My guess is that
the cached contents of /etc/resolv.conf are thread-local but the
timestamp that is compared against the stat() result is global.
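
To make that guess concrete, here is a small Python model of it. It is
only a sketch of the hypothesis, not of glibc's actual code: FakeFile,
SharedState, ResolverThread and demo() are invented names, the
"threads" are plain objects called in a fixed order so the result is
deterministic, and a counter stands in for the file's mtime.

```python
class FakeFile:
    """Stand-in for /etc/resolv.conf: contents plus a change counter."""

    def __init__(self, contents):
        self.contents = contents
        self.mtime = 0

    def write(self, contents):
        self.contents = contents
        self.mtime += 1


class SharedState:
    """Assumed global: the last file timestamp any thread has seen."""

    def __init__(self):
        self.seen_mtime = None


class ResolverThread:
    """Assumed per-thread: a private cached copy of the file."""

    def __init__(self, shared, conf):
        self.shared = shared
        self.conf = conf
        self.cached = None

    def lookup(self):
        # Reload only if this thread has no copy yet, or if the *global*
        # stamp differs from the file's current timestamp.
        if self.cached is None or self.shared.seen_mtime != self.conf.mtime:
            self.cached = self.conf.contents
            self.shared.seen_mtime = self.conf.mtime
        return self.cached


def demo():
    conf = FakeFile("bad")
    shared = SharedState()
    a = ResolverThread(shared, conf)
    b = ResolverThread(shared, conf)
    results = [a.lookup(), b.lookup()]  # both load the bad file
    conf.write("good")                  # the file gets fixed
    results.append(a.lookup())          # a reloads and updates the stamp
    results.append(b.lookup())          # b sees a matching stamp, no reload
    return results
```

In this model demo() returns ["bad", "bad", "good", "bad"]: the second
thread keeps its stale copy because the first thread already updated the
shared timestamp. That would match a mix of successes and failures
depending on which thread handles a request.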

Anyway, I will try to write a test application to reproduce this and
file a bug report against glibc if the above is confirmed.

> The problem is that there is *no* timeout for bad DNS resolution: I got
> "no available LDAP server found, sleeping 1 seconds" for about
> 10 hours (i.e. between my restart of the host that triggered the problem
> and my wakeup next week when I tried to log in and looked at the logs
> sent by logcheck: cron is regularly using nslcd).

The problem is that if a bad thread is triggered, the retry mechanism
will kick in and prevent other threads from even trying a request. If
more than a couple of requests came in with the bad /etc/resolv.conf,
this would eventually leave most threads dysfunctional.

Anyway, thanks for the clarification.

-- 
-- arthur - adej...@debian.org - http://people.debian.org/~adejong --
