First, a disclaimer: my test is not by the book and I know that - feel free
to yell at me. The production machine I'd like to run nagios on is still on
3.6-stable. What I did was not the nicest thing in the world: I created
the _nagios user and group using the info from PLIST, ripped those two lines
out of PLIST, and then the port built OK on 3.6.

What I noticed is that sometimes (approx. 15% of attempts) the plugins do
not return any output, causing alerts.

[02-08-2005 13:20:53] HOST ALERT: Lili;DOWN;SOFT;1;(No output!)

[02-08-2005 12:17:58] SERVICE ALERT: Lili;Ping;CRITICAL;SOFT;1;(Return code
of 139 is out of bounds)

Lili is localhost here, so it is obviously not down.
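
The 139 in the second alert may be a useful hint. Plugins are expected to
return 0 to 3, and by the usual shell convention a status above 128 means
the process was killed by signal (status - 128), so 139 would point at
signal 11. A minimal sketch of that decoding, assuming the plugin is run
through a shell; the function name is made up for illustration:

```python
import signal

# Standard plugin return codes; anything else is "out of bounds".
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code):
    """Describe a plugin return code (hypothetical helper, not part of Nagios)."""
    if code in PLUGIN_STATES:
        return PLUGIN_STATES[code]
    if code > 128:
        # Assumption: shell convention of 128 + signal number.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "out of bounds, possibly killed by %s" % name
    return "out of bounds"


print(describe_return_code(139))
```

If that reading is right, the plugin (or the shell running it) is crashing
rather than timing out, which would also explain the empty output.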

I searched around and found this
(http://nagios.sourceforge.net/docs/2_0/whatsnew.html):

"There are a few known issues with the Nagios 2.0 code at the moment.
Hopefully some of these will be fixed before 2.0 is released as stable...

   1. FreeBSD and threads. On FreeBSD there's a native user-level
implementation of threads called 'pthread' and there's also an optional
ports collection 'linuxthreads' that uses kernel hooks. Some folks from
Yahoo! have reported that using the pthread library causes Nagios to pause
under heavy I/O load, causing some service check results to be lost.
Switching to linuxthreads seems to help this problem, but not fix it. The
lock happens in liblthread's __pthread_acquire() - it can't ever acquire the
spinlock. It happens when the main thread forks to execute an active check.
On the second fork to create the grandchild, the grandchild is created by
fork, but never returns from liblthread's fork wrapper, because it's stuck
in __pthread_acquire(). Maybe some FreeBSD users can help out with this
problem."


My question is: is anybody else experiencing similar problems? Is this
caused by running nagios on a box that is not -current? Could this be the
same pthread issue as on FreeBSD?

I'm setting up a -current box at the moment to replicate the test on it. 
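
To compare the failure rate between the two boxes outside of nagios, a
plugin can be run many times by hand and the results tallied. A rough
sketch, not a definitive test; sample_plugin is a made-up helper, and the
plugin command line is whatever you pass as arguments:

```python
import subprocess
import sys
from collections import Counter


def sample_plugin(cmd, attempts=100, timeout=30):
    """Run cmd repeatedly; tally return codes and runs with empty stdout."""
    codes = Counter()
    empty = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # The child is killed on timeout; count it as a run with no output.
            codes["timeout"] += 1
            empty += 1
            continue
        # Without a shell in between, death by signal N shows up as -N
        # here (for example -11), not as 128 + N.
        codes[result.returncode] += 1
        if not result.stdout.strip():
            empty += 1
    return codes, empty


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sample.py <plugin> [plugin arguments...]
    codes, empty = sample_plugin(sys.argv[1:])
    print("return codes:", dict(codes))
    print("runs with no output:", empty)
```

If the plugin fails at a similar rate when run this way, the problem is
probably in the plugin itself; if it only fails under nagios, the fork and
threads path looks more suspicious.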

Regards, Mitja
