We rolled out the skiplist backend on Saturday and it could have gone a 
little better. For some reason, the performance of fdatasync() under 
Solaris 2.7 seems to be terrible under high load. Or maybe we're just 
assuming the operating system is smarter than it actually is (or maybe we 
aren't as smart as we should be). :)

The result is that the write lock is held too long, and when the lock is 
released a ton of processes waiting for a read lock swarm into action. 
According to top, we've had 6000+ processes on the machine and over 500 
reported as runnable. Of course, this shouldn't be that much different 
from what happens with the flat file.
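
For anyone who wants to see how much of the lock hold time is the sync itself, here is a rough sketch in Python (not the backend's actual code). The function name timed_locked_append is made up for illustration, the lock is a plain fcntl advisory lock rather than whatever the backend really takes, and it falls back to fsync() where fdatasync() isn't available.

```python
import fcntl
import os
import tempfile
import time

# Assumption: use fdatasync() where the platform has it, else fsync().
_sync = getattr(os, "fdatasync", os.fsync)


def timed_locked_append(path, payload):
    """Append payload under an exclusive lock.

    Returns (held_s, sync_s): how long the lock was held in total and
    how much of that time was spent inside the sync call.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)      # writers block readers from here
        locked_at = time.monotonic()
        os.write(fd, payload)
        sync_start = time.monotonic()
        _sync(fd)                           # the call suspected of being slow
        sync_s = time.monotonic() - sync_start
        fcntl.lockf(fd, fcntl.LOCK_UN)
        held_s = time.monotonic() - locked_at
    finally:
        os.close(fd)                        # also drops the lock on error
    return held_s, sync_s


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        held, synced = timed_locked_append(os.path.join(d, "t.db"), b"x" * 512)
        print("lock held %.6fs, of which sync %.6fs" % (held, synced))
```

If the sync time dominates the hold time under load, that would point at fdatasync() rather than at our own code.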

There is also a bug that we can't reproduce that will result in the skiplist 
getting into a loop. "Luckily" this has only happened with seen state and 
only happens to one user every four to eight hours.

Our plans are to look at making things more efficient, possibly by 
separating the log from the data so that there are two files instead of 
one. We're also looking at adding debugging code to try to find out how it 
gets into the loop or, at worst, adding a hack that detects the loop and 
breaks out of it automatically.
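
To give an idea of what the loop-detection hack might look like, here is a minimal sketch in Python. It is not the backend's code: it models the bottom level of the skiplist as a dict mapping each record offset to the next one, and the names walk_level0, forward, head and end (with 0 as the end marker) are all assumptions made for the example.

```python
def walk_level0(forward, head, end=0):
    """Walk the bottom-level forward pointers starting after head.

    forward maps a record offset to the offset of the next record.
    Returns (offsets, loop_at): the offsets visited in order, and the
    offset at which a loop was detected, or None if the walk reached end.
    """
    seen = set()
    offsets = []
    cur = forward[head]
    while cur != end:
        if cur in seen:
            # We have been here before: stop instead of spinning forever.
            return offsets, cur
        seen.add(cur)
        offsets.append(cur)
        cur = forward[cur]
    return offsets, None


if __name__ == "__main__":
    good = {48: 100, 100: 200, 200: 300, 300: 0}
    bad = {48: 100, 100: 200, 200: 100}   # 200 points back to 100
    print(walk_level0(good, 48))
    print(walk_level0(bad, 48))
```

Remembering every offset costs memory proportional to the file; a cheaper variant would simply give up after more steps than there can be records in the file.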

So, right now, I wouldn't recommend switching your production system over 
to it.

Walter


