We're using Solr 1.4 Java replication, which seems to be working
nicely.  While writing production monitors to check that replication
is healthy, I think we've run into a bug in the status reporting of
the "../solr/replication?command=details" command.  (I know it's
experimental...)

Our monitor parses the replication?command=details XML and checks
replication lag by diffing the indexVersion of the master and slave
indices, making sure the difference falls within an acceptable time
range.
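
A minimal sketch of that lag check, for illustration only.  It assumes
the two indexVersion values can be treated as millisecond timestamps,
as the monitor does.  The function name and the max_lag_ms threshold
are hypothetical, not part of Solr.

```python
def replication_lag_ok(master_version: int, slave_version: int,
                       max_lag_ms: int) -> bool:
    """Return True when the slave trails the master by at most max_lag_ms.

    Assumption: both values are indexVersion numbers read from the
    replication details output and are comparable as milliseconds.
    """
    lag = master_version - slave_version
    # A negative lag (slave "ahead" of master) is treated as unhealthy,
    # since it should not happen in a normal master/slave setup.
    return 0 <= lag <= max_lag_ms
```

For example, with a 15-minute threshold (900000 ms), a slave that is
10 minutes behind passes and one that is an hour behind does not.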

Our monitor also compares the first elements of the
"indexReplicatedAtList" and "replicationFailedAtList" lists to see if
the last replication attempt failed.  This is where we're having a
problem with the monitor throwing false errors.  It looks like there's
a bug that causes successful replications to be considered failures.
The bug is triggered immediately after a slave restarts when the slave
is already in sync with the master.  Each no-op replication attempt
after restart is considered a failure until something on the master
changes and replication has to actually do work.
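
Roughly, the comparison described above looks like the sketch below.
The XML layout (an <arr> element holding <str> entries, newest first)
and the sample timestamps are assumptions made up for illustration;
only the two list names come from the details output.  The helper
names are hypothetical, and treating equal first entries as "last
attempt failed" is one plausible reading of the check, not a
documented contract.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real details response contains many more fields.
SAMPLE = """<response>
  <lst name="details">
    <lst name="slave">
      <arr name="indexReplicatedAtList">
        <str>Tue Feb 16 10:05:00 PST 2010</str>
        <str>Tue Feb 16 10:00:00 PST 2010</str>
      </arr>
      <arr name="replicationFailedAtList">
        <str>Tue Feb 16 10:05:00 PST 2010</str>
      </arr>
    </lst>
  </lst>
</response>"""


def first_entry(root, list_name):
    """Return the first <str> under the named <arr>, or None if missing/empty."""
    for arr in root.iter("arr"):
        if arr.get("name") == list_name:
            first = arr.find("str")
            return first.text if first is not None else None
    return None


def last_attempt_flagged_failed(details_xml):
    """True when the newest replication attempt is also the newest failure."""
    root = ET.fromstring(details_xml)
    replicated = first_entry(root, "indexReplicatedAtList")
    failed = first_entry(root, "replicationFailedAtList")
    return failed is not None and failed == replicated
```

With the sample above, last_attempt_flagged_failed(SAMPLE) reports a
failure because both lists start with the same timestamp.  This is the
condition our monitor keeps hitting after a slave restart.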

From the code, it looks like "SnapPuller.successfulInstall" starts out
false on restart.  If the slave starts out in sync with the master,
then each no-op replication poll leaves "successfulInstall" set to
false, which makes SnapPuller.logReplicationTimeAndConfFiles log the
poll as a failure.  SnapPuller.successfulInstall stays false until the
first time replication actually has to do something, at which point it
gets set to true, and then everything is OK.
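
To make the suspected sequence concrete, here is a toy model of the
behavior as I read it.  This is not the Solr source: the class, its
fields, and the "ok"/"failed" log entries are hypothetical stand-ins,
and genuine replication failures are not modeled at all.

```python
class ToySnapPuller:
    """Toy model of the flag behavior described above (not real Solr code)."""

    def __init__(self, local_version):
        self.local_version = local_version
        # As described: the flag starts out false after a restart.
        self.successful_install = False
        self.log = []

    def poll(self, master_version):
        if master_version != self.local_version:
            # Replication has real work to do; assume it succeeds.
            self.local_version = master_version
            self.successful_install = True
        # A no-op poll leaves the flag untouched, so a freshly restarted,
        # in-sync slave keeps logging "failed".
        self.log.append("ok" if self.successful_install else "failed")


puller = ToySnapPuller(local_version=5)
for master_version in (5, 5, 6, 6):
    puller.poll(master_version)
print(puller.log)
```

In this model the two in-sync polls after restart are logged as
failures, and the polls from the first real replication onward are
logged as successes, which matches what our monitor sees.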

Thanks,
Shawn
