Sure, sounds good.

I created https://issues.apache.org/jira/browse/SOLR-1855 for a script
to monitor slave replication health, and attached our current
implementation.  Improvements would be welcome...

Shawn

On Mon, Mar 29, 2010 at 12:10 PM, Jason Rutherglen
<jason.rutherg...@gmail.com> wrote:
> Shawn,
>
> I was working on something very similar... Let's perhaps also create a
> Jira issue for this monitoring?
>
> Thanks,
>
> Jason
>
> On Fri, Mar 26, 2010 at 6:59 AM, Shawn Smith <ssmit...@gmail.com> wrote:
>> We're using Solr 1.4 Java replication, which seems to be working
>> nicely.  While writing production monitors to check that replication
>> is healthy, I think we've run into a bug in the status reporting of
>> the "../solr/replication?command=details" command.  (I know it's
>> experimental...)
>>
>> Our monitor parses the replication?command=details XML and checks that
>> replication lag is reasonable by diffing the indexVersion of the
>> master and slave indices to make sure it's within a reasonable time
>> range.
>>
>> Our monitor also compares the first elements of
>> "indexReplicatedAtList" and "replicationFailedAtList" lists to see if
>> the last replication attempt failed.  This is where we're having a
>> problem with the monitor throwing false errors.  It looks like there's
>> a bug that causes successful replications to be considered failures.
>> The bug is triggered immediately after a slave restarts when the slave
>> is already in sync with the master.  Each no-op replication attempt
>> after restart is considered a failure until something on the master
>> changes and replication has to actually do work.
>>
>> From the code, it looks like "SnapPuller.successfulInstall" starts out
>> false on restart.  If the slave starts out in sync with the master,
>> then each no-op replication poll leaves "successfulInstall" set to
>> false, which makes SnapPuller.logReplicationTimeAndConfFiles log the
>> poll as a failure.  SnapPuller.successfulInstall stays false until the
>> first time replication actually has to do something, at which point it
>> gets set to true, and then everything is OK.
>>
>> Thanks,
>> Shawn
>>
>

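To make the failure mode in the quoted report easier to see, here is a toy model of the behavior as described there. It is not Solr code. ToyPuller and its fields are hypothetical stand-ins for SnapPuller, its successfulInstall flag, and the two timestamp lists, and the model only encodes what the report states: the flag starts false after a restart, no-op polls leave it untouched, and the first real replication sets it to true.

```python
class ToyPuller:
    """Toy stand-in for the slave-side poller described in the report."""

    def __init__(self):
        # Models a fresh restart: the flag starts out false.
        self.successful_install = False
        self.replicated_at = []  # newest first
        self.failed_at = []      # newest first

    def poll(self, now, master_version, slave_version):
        """Record one poll. Only a version mismatch counts as real work."""
        if master_version != slave_version:
            # Real replication happened, so the flag flips to true.
            self.successful_install = True
        # A no-op poll leaves the flag as it was, so after a restart
        # it is still false and the poll is logged as a failure.
        self.replicated_at.insert(0, now)
        if not self.successful_install:
            self.failed_at.insert(0, now)

    def last_attempt_failed(self):
        """Monitor-style check: compare the heads of the two lists."""
        return bool(self.failed_at) and self.failed_at[0] == self.replicated_at[0]


puller = ToyPuller()
puller.poll(now=1, master_version=7, slave_version=7)  # in sync, no-op
print(puller.last_attempt_failed())
puller.poll(now=2, master_version=8, slave_version=7)  # master changed
print(puller.last_attempt_failed())
```

In this model the first call reports a failure even though nothing went wrong, and the report stops being a failure only once the master changes, which matches the false alarms the monitor was raising.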