Sure, sounds good. I created https://issues.apache.org/jira/browse/SOLR-1855 for a script to monitor slave replication health, and attached our current implementation. Improvements would be welcome...
Shawn

On Mon, Mar 29, 2010 at 12:10 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Shawn,
>
> I was working on something very similar... Let's perhaps also create a
> Jira issue for this monitoring?
>
> Thanks,
>
> Jason
>
> On Fri, Mar 26, 2010 at 6:59 AM, Shawn Smith <ssmit...@gmail.com> wrote:
>> We're using Solr 1.4 Java replication, which seems to be working
>> nicely. While writing production monitors to check that replication
>> is healthy, I think we've run into a bug in the status reporting of
>> the "../solr/replication?command=details" command. (I know it's
>> experimental...)
>>
>> Our monitor parses the replication?command=details XML and checks that
>> replication lag is reasonable by diffing the indexVersion of the
>> master and slave indices to make sure it's within a reasonable time
>> range.
>>
>> Our monitor also compares the first elements of the
>> "indexReplicatedAtList" and "replicationFailedAtList" lists to see if
>> the last replication attempt failed. This is where we're having a
>> problem with the monitor throwing false errors. It looks like there's
>> a bug that causes successful replications to be considered failures.
>> The bug is triggered immediately after a slave restarts when the slave
>> is already in sync with the master. Each no-op replication attempt
>> after restart is considered a failure until something on the master
>> changes and replication has to actually do work.
>>
>> From the code, it looks like "SnapPuller.successfulInstall" starts out
>> false on restart. If the slave starts out in sync with the master,
>> then each no-op replication poll leaves "successfulInstall" set to
>> false, which makes SnapPuller.logReplicationTimeAndConfFiles log the
>> poll as a failure. SnapPuller.successfulInstall stays false until the
>> first time replication actually has to do something, at which point it
>> gets set to true, and then everything is OK.
>>
>> Thanks,
>> Shawn
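
For anyone following along, the two checks described in the quoted message could be sketched roughly as below. This is not the script attached to SOLR-1855, only a minimal illustration. It assumes the details XML puts the slave's own indexVersion directly under <lst name="details"> and the master's under "masterDetails", that the most recent entry comes first in both lists, and that a poll logged as failed appears with the same timestamp string in both lists. The function names, the 10 minute threshold, and the ignore_in_sync_failures option are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_LAG_MS = 10 * 60 * 1000  # hypothetical threshold (10 minutes)


def first_text(root, path):
    """Text of the first element matching path, or None if absent."""
    node = root.find(path)
    return node.text if node is not None else None


def check_replication(details_xml, max_lag_ms=MAX_LAG_MS,
                      ignore_in_sync_failures=False):
    """Return a list of problem descriptions; an empty list means healthy."""
    root = ET.fromstring(details_xml)
    problems = []

    # Assumed layout: slave indexVersion directly under "details",
    # master indexVersion somewhere under "masterDetails".
    slave = first_text(
        root, ".//lst[@name='details']/long[@name='indexVersion']")
    master = first_text(
        root, ".//lst[@name='masterDetails']//long[@name='indexVersion']")
    in_sync = False
    if slave is None or master is None:
        problems.append("indexVersion missing from details output")
    else:
        # Treat the version difference as milliseconds of lag.
        lag = int(master) - int(slave)
        in_sync = lag == 0
        if lag > max_lag_ms:
            problems.append("slave lags master by %d ms" % lag)

    # Compare the first (assumed newest) entry of each list.
    replicated = first_text(root, ".//arr[@name='indexReplicatedAtList']/str")
    failed = first_text(root, ".//arr[@name='replicationFailedAtList']/str")
    if failed is not None and failed == replicated:
        # Optionally skip the no-op "failures" seen after a slave restart.
        if not (ignore_in_sync_failures and in_sync):
            problems.append(
                "last replication attempt at %s was logged as failed" % failed)
    return problems
```

Passing the response body of replication?command=details to check_replication() returns the list of problems found. With ignore_in_sync_failures=True, a logged failure is ignored whenever the master and slave versions are equal, which would sidestep the false errors after a restart at the cost of hiding any real failure that happens while the indices are in sync.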