We're using Solr 1.4 Java replication, which seems to be working nicely. While writing production monitors to check that replication is healthy, I think we've run into a bug in the status reporting of the "../solr/replication?command=details" command. (I know it's experimental...)
Our monitor parses the replication?command=details XML and checks that replication lag is reasonable by diffing the indexVersion of the master and slave indices to make sure the difference is within a reasonable time range. It also compares the first elements of the "indexReplicatedAtList" and "replicationFailedAtList" lists to see whether the last replication attempt failed.

This is where we're having a problem: the monitor throws false errors. It looks like there's a bug that causes successful replications to be reported as failures. The bug is triggered immediately after a slave restarts when the slave is already in sync with the master. Each no-op replication attempt after the restart is considered a failure until something on the master changes and replication actually has to do work.

From the code, it looks like "SnapPuller.successfulInstall" starts out false on restart. If the slave starts out in sync with the master, then each no-op replication poll leaves "successfulInstall" set to false, which makes SnapPuller.logReplicationTimeAndConfFiles log the poll as a failure. SnapPuller.successfulInstall stays false until the first time replication actually has to do something, at which point it gets set to true, and then everything is OK.

Thanks,
Shawn
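
P.S. To make the sequence above easier to follow, here is a toy model of the behavior as I understand it. It is only a sketch and not the actual SnapPuller source: the class name ReplicationFlagModel, the poll method, and the "failed"/"replicated" strings are made up for illustration. The only thing taken from Solr is the idea of a successfulInstall flag that starts out false and is set to true the first time a poll actually installs something.

```java
/**
 * Toy model of the reported behavior. Hypothetical names throughout;
 * this is NOT the real SnapPuller code.
 */
class ReplicationFlagModel {
    // Mirrors the reported starting state after a slave restart.
    private boolean successfulInstall = false;

    /**
     * Simulates one replication poll.
     *
     * @param masterHasNewIndex true if the poll had real work to do
     * @return how the poll would be logged in this model
     */
    String poll(boolean masterHasNewIndex) {
        if (masterHasNewIndex) {
            // Only a poll that actually installs files sets the flag.
            successfulInstall = true;
        }
        // A no-op poll never touches the flag, so while it is still
        // false the poll is logged as a failure.
        return successfulInstall ? "replicated" : "failed";
    }

    public static void main(String[] args) {
        ReplicationFlagModel slave = new ReplicationFlagModel();
        System.out.println(slave.poll(false)); // no-op poll right after restart
        System.out.println(slave.poll(false)); // another no-op poll
        System.out.println(slave.poll(true));  // master changed, real work done
        System.out.println(slave.poll(false)); // no-op poll after a real install
    }
}
```

In this model the first two polls print "failed" and the last two print "replicated", which matches what our monitor sees: false failures after a restart until the master changes.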