FileNotFoundException on slave after replication - script bug?

Jim Murphy Wed, 22 Oct 2008 07:01:37 -0700

We're seeing strange behavior on one of our slave nodes after replication. 
When the new searcher is created we see FileNotFoundExceptions in the log
and the index is strangely invalid/corrupted.


We may have identified the root cause but wanted to run it by the community. 
We figure there is a bug in the snappuller shell script, line 181:

snap_name=`ssh -o StrictHostKeyChecking=no ${master_host} "ls ${master_data_dir}|grep 'snapshot\.'|grep -v wip|sort -r|head -1"`

This line determines the directory name of the latest snapshot to download
to the slave from the master.  The problem is that it can also grab the
temporary work directory of a snapshot in progress.  Those temporary
directories are prefixed with "temp" and, as far as I can tell, should never
get pulled from the master, so it's easy to disambiguate.  It seems that this
temp directory, if it exists, will be the newest one, so if present it will be
the one replicated: FAIL.
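
The selection can be reproduced locally without ssh. This is only a sketch:
the directory names are made up, and it assumes the in-progress directory is
named like temp-snapshot.<timestamp>. Under that assumption the name still
matches 'snapshot\.', and because "t" sorts after "s", sort -r puts it first.

```shell
# A scratch directory stands in for ${master_data_dir} on the master.
# All directory names below are hypothetical examples.
master_data_dir=$(mktemp -d)
mkdir "${master_data_dir}/index" \
      "${master_data_dir}/snapshot.20081022065000" \
      "${master_data_dir}/snapshot.20081022070000" \
      "${master_data_dir}/temp-snapshot.20081022070100"

# Same pipeline as the snappuller line, run locally instead of over ssh.
snap_name=$(ls "${master_data_dir}" | grep 'snapshot\.' | grep -v wip | sort -r | head -1)

# Prints the temp directory rather than the latest completed snapshot.
echo "${snap_name}"

rm -rf "${master_data_dir}"
```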

We've tweaked the line to exclude the "temp" directories:

snap_name=`ssh -o StrictHostKeyChecking=no ${master_host} "ls ${master_data_dir}|grep 'snapshot\.'|grep -v wip|grep -v temp|sort -r|head -1"`
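
Note that grep -v temp drops any name containing "temp" anywhere, not only
names that start with it. One possible alternative, offered as a sketch and
not as the official fix, is to anchor the existing pattern so that only names
beginning with "snapshot." are considered. The directory names are again
hypothetical.

```shell
# A scratch directory stands in for ${master_data_dir} on the master.
master_data_dir=$(mktemp -d)
mkdir "${master_data_dir}/index" \
      "${master_data_dir}/snapshot.20081022065000" \
      "${master_data_dir}/snapshot.20081022070000" \
      "${master_data_dir}/temp-snapshot.20081022070100"

# '^snapshot\.' only matches names that begin with "snapshot.",
# so the temp-prefixed directory never reaches sort/head.
snap_name=$(ls "${master_data_dir}" | grep '^snapshot\.' | grep -v wip | sort -r | head -1)

# Prints the newest completed snapshot.
echo "${snap_name}"

rm -rf "${master_data_dir}"
```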

This has fixed our local issue.  We can submit a patch, but wanted a quick
sanity check because I'm surprised it's not seen much more commonly.

Jim

-- 
View this message in context: 
http://www.nabble.com/FileNotFoundException-on-slave-after-replication---script-bug--tp20111313p20111313.html
Sent from the Solr - User mailing list archive at Nabble.com.
