Hi,

The file descriptor count is always quite low. At the moment, after heavy usage for a few days, file descriptor counts are between 100-150 and I don't have any errors in the logs. My worry at the moment is around all the CLOSE_WAIT connections I am seeing. This is particularly true on the boxes marked as leaders; the replicas have a few, but nowhere near as many.
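In case it's useful for tracking this over time: the figure the "File Descriptor Count" bar on the dashboard reports can also be read straight from the JVM over JMX, so it can be logged alongside the CLOSE_WAIT counts (the CLOSE_WAIT numbers themselves come from netstat/lsof at the OS level). Below is a minimal sketch - it assumes a HotSpot/OpenJDK JVM on a Unix-like OS, it only reports the counts of the JVM it runs in (so it would need to run inside, or attach to, the Solr process to show Solr's numbers), and the class name is just illustrative:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdCountCheck {
    public static void main(String[] args) {
        // The platform MXBean; on Unix HotSpot/OpenJDK it also exposes FD counts.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            // Open vs. maximum file descriptors for this JVM process.
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("FD counts not available on this JVM/platform");
        }
    }
}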
Thanks for the response.

-----Original Message-----
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com]
Sent: 05 December 2012 17:57
To: solr-user@lucene.apache.org
Subject: Re: FW: Replication error and Shard Inconsistencies..

Not sure, but maybe you are running out of file descriptors? On each Solr instance, look at the "Dashboard" admin page; there is a bar with "File Descriptor Count". However, if that were the case, I would expect to see lots of errors in the Solr logs...

André

On 12/05/2012 06:41 PM, Annette Newton wrote:
> Sorry to bombard you - final update of the day...
>
> One thing that I have noticed is that we have a lot of connections between the Solr boxes sitting in CLOSE_WAIT, and they hang around for ages.
>
> -----Original Message-----
> From: Annette Newton [mailto:annette.new...@servicetick.com]
> Sent: 05 December 2012 13:55
> To: solr-user@lucene.apache.org
> Subject: FW: Replication error and Shard Inconsistencies..
>
> Update:
>
> I did a full restart of the Solr Cloud setup: stopped all the instances, cleared down ZooKeeper and started them up individually. I then removed the index from one of the replicas, restarted Solr and it replicated OK. So I'm wondering whether this is something that happens over a period of time.
>
> Also, just to let you know, I changed the schema a couple of times and reloaded the cores on all instances prior to the problem. I don't know if this could have contributed to the problem.
>
> Thanks.
>
> -----Original Message-----
> From: Annette Newton [mailto:annette.new...@servicetick.com]
> Sent: 05 December 2012 09:04
> To: solr-user@lucene.apache.org
> Subject: RE: Replication error and Shard Inconsistencies..
>
> Hi Mark,
>
> Thanks so much for the reply.
>
> We are using the release version of 4.0.
>
> It's very strange: replication appears to be underway, but no files are being copied across. I have attached both the log from the new node that I tried to bring up and the schema and config we are using.
>
> I think it's probably something weird with our config, so I'm going to play around with it today. If I make any progress I'll send an update.
>
> Thanks again.
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: 05 December 2012 00:04
> To: solr-user@lucene.apache.org
> Subject: Re: Replication error and Shard Inconsistencies..
>
> Hey Annette,
>
> Are you using Solr 4.0 final? A version of 4x or 5x?
>
> Do you have the logs for when the replica tried to catch up to the leader?
>
> Stopping and starting the node is actually a fine thing to do. Perhaps you can try it again and capture the logs.
>
> If a node is not listed as live but is in the clusterstate, that is fine. It shouldn't be consulted. To remove it, you either have to unload it with the core admin API or you can manually delete its registered state under the node states node that the Overseer looks at.
>
> Also, it would be useful to see the logs of the new node coming up. There should be info about what happens when it tries to replicate.
>
> It almost sounds like replication is just not working for your setup at all and that you have to tweak some configuration. You shouldn't see these nodes as active then, though - so we should get to the bottom of this.
>
> - Mark
>
> On Dec 4, 2012, at 4:37 AM, Annette Newton <annette.new...@servicetick.com> wrote:
>
>> Hi all,
>>
>> I have a quite weird issue with Solr Cloud.
>> I have a 4 shard, 2 replica setup. Yesterday one of the nodes lost communication with the cloud, which resulted in it trying to run replication; this failed, which has left me with a shard (Shard 4) that has 2,833,940 documents on the leader and 409,837 on the follower - obviously a big discrepancy, and this leads to queries returning differing results depending on which of these nodes the data comes from. There is no indication of a problem on the admin site other than the big discrepancy in the number of documents. They are all marked as active etc.
>>
>> So I thought that I would force replication to happen again by stopping and starting Solr (probably the wrong thing to do), but this resulted in no change. So I turned off that node and replaced it with a new one. In ZooKeeper, live nodes doesn't list that machine, but it is still being shown as active in clusterstate.json - I have attached images showing this. This means the new node hasn't replaced the old node but is now a replica on Shard 1! Also, that node doesn't appear to have replicated Shard 1's data anyway; it never got marked as replicating or anything.
>>
>> How do I clear the ZooKeeper state without taking down the entire Solr Cloud setup? How do I force a node to replicate from the others in the shard?
>>
>> Thanks in advance.
>>
>> Annette Newton
>>
>>
>> <LiveNodes.zip>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
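Following up on Mark's suggestion above about removing the stale entry by unloading it with the core admin API: here is a rough SolrJ 4.0 sketch of what I understand that to mean. The base URL and core name are made up (substitute whatever clusterstate.json shows for the dead node), and I haven't verified that this actually clears the registered state, so treat it as a starting point rather than a recipe:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadStaleCore {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Point at the node that owns the stale core (hypothetical host/port).
        HttpSolrServer server = new HttpSolrServer("http://old-node:8983/solr");

        // Unload the stale core; per Mark's note the Overseer should then be able
        // to drop it from the cluster state. The core name here is made up.
        CoreAdminRequest.unloadCore("collection1_shard4_replica2", server);

        // A core can also be reloaded after a schema change without a restart:
        // CoreAdminRequest.reloadCore("collection1_shard4_replica2", server);

        server.shutdown();
    }
}

The commented-out reload call uses the same request class, which might be relevant given the schema changes and core reloads mentioned earlier in the thread.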