1. Do the GC and Solr logs help explain why the Solr replica server
continues to be in the recovering state? Our assumption is that the ZK
transaction log reading we did on Sept 17 at 16:00 hrs might have caused
the issue. Is that correct?
2. Can this state cause slowness for Solr read queries?
3. Is there any way to get notified/emailed if any replica on the servers
gets into recovery mode? (A rough monitoring sketch follows below.)
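
For #3, one thing we are considering is to poll the Collections API
CLUSTERSTATUS action from a cron job and alert when any replica is not
"active". A minimal sketch of what we have in mind (the Solr URL and the
alerting part are placeholders for our setup):

# Rough sketch: poll CLUSTERSTATUS and report replicas that are not "active".
# SOLR_URL is a placeholder -- point it at one of your Solr nodes.
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"

def non_active_replicas():
    url = SOLR_URL + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    bad = []
    for coll, cdata in data["cluster"]["collections"].items():
        for shard, sdata in cdata["shards"].items():
            for rname, rdata in sdata["replicas"].items():
                if rdata.get("state") != "active":
                    bad.append((coll, shard, rname, rdata.get("state"),
                                rdata.get("node_name")))
    return bad

if __name__ == "__main__":
    for coll, shard, rname, state, node in non_active_replicas():
        # Replace this print with an email/pager call of your choice.
        print("replica not active: collection=%s shard=%s replica=%s "
              "state=%s node=%s" % (coll, shard, rname, state, node))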


On Wed, Oct 3, 2018 at 5:26 PM Ganesh Sethuraman <ganeshmail...@gmail.com>
wrote:

>
>
>
> On Tue, Oct 2, 2018 at 11:46 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 10/2/2018 8:55 PM, Ganesh Sethuraman wrote:
>> > We are using 2 node SolrCloud 7.2.1 cluster with external 3 node ZK
>> > ensemble in AWS. There are about 60 collections at any point in time. We
>> > have per JVM max heap of 8GB.
>>
>> Let's focus for right now on a single Solr machine, rather than the
>> whole cluster.  How many shard replicas (cores) are on one server?  How
>> much disk space does all the index data take? How many documents
>> (maxDoc, which includes deleted docs) are in all those cores?  What is
>> the total amount of RAM on the server? Is there any other software
>> besides Solr running on each server?
>>
> We have 471 replicas (cores) on each server, across about 60
> collections, each with 8 shards and 2 replicas. A couple of them have
> just 2 shards and are small. Note that only about 30 of them are
> actively used; old collections are periodically deleted.
> 470 GB of index data per node.
> Max docs per collection is about 300M, but the average per collection
> is about 50M docs.
> 256 GB RAM (24 vCPUs) on each of the two AWS instances.
> No other software running on the box.
>
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
>
>>
>> > But as stated in the problem above, we have a few collection replicas
>> > in the recovering and down state. In the past we have seen them come
>> > back to normal by restarting the Solr server, but we want to understand
>> > whether there is any way to get this back to normal (all synced up with
>> > ZooKeeper) through the command line/admin. Another question: being in
>> > this state, can it cause data issues? How do we check that
>> > (distrib=false on collection count?)?
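
On the distrib=false check above, what we had in mind is roughly the
following: query each replica (core) of a shard directly with
distrib=false and compare numFound across replicas. A minimal sketch
(the base URLs and core names are placeholders; the real ones can be
taken from CLUSTERSTATUS):

# Rough sketch: compare per-replica document counts for one shard using
# distrib=false. Small, temporary differences can occur right after
# updates, before commits propagate, so persistent gaps are what matter.
import json
import urllib.request

# Placeholder replica cores of one shard of one collection.
REPLICA_CORES = [
    "http://host1:8983/solr/mycoll_shard1_replica_n1",
    "http://host2:8983/solr/mycoll_shard1_replica_n2",
]

def core_doc_count(core_url):
    url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["response"]["numFound"]

counts = {url: core_doc_count(url) for url in REPLICA_CORES}
print(counts)
if len(set(counts.values())) > 1:
    print("WARNING: replica counts differ for this shard")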
>>
>> As long as you have at least one replica operational on every shard, you
>> should be OK.  But if you only have one replica operational, then you're
>> in a precarious state, where one additional problem could result in
>> something being unavailable.
>>
> Thanks for the info.
>
>> If all is well, SolrCloud should not have replicas stay in down or
>> recovering state for very long, unless they're really large, in which
>> case it can take a while to copy the data from the leader.  If that
>> state persists for a long time, there's probably something going wrong
>> with your Solr install.  Usually restarting Solr is the only way to
>> recover persistently down replicas.  If it happens again after restart,
>> then the root problem has not been dealt with, and you're going to need
>> to figure it out.
>>
> OK. Based on the point above, it looks like restarting is the only
> option; there is no other way to sync with ZK. Thanks for that.
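
Separately, we also came across the CoreAdmin REQUESTRECOVERY action,
which as we understand it asks a single core to re-run recovery against
its leader without restarting the whole node. We may try it before a
full restart, understanding that it probably won't help if the
underlying problem is still there. A rough sketch (host and core name
are placeholders):

# Rough sketch: ask one core to re-run recovery via the CoreAdmin API
# (action=REQUESTRECOVERY). The core name is a placeholder; real names
# can be read from CLUSTERSTATUS or the Cores screen in the admin UI.
import urllib.request

SOLR_URL = "http://localhost:8983/solr"
CORE = "mycoll_shard1_replica_n1"  # placeholder

url = SOLR_URL + "/admin/cores?action=REQUESTRECOVERY&core=" + CORE
with urllib.request.urlopen(url) as resp:
    print(resp.status, resp.read().decode("utf-8"))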
>
>> The log snippet you shared only covers a timespan of less than one
>> second, so it's not very helpful in making any kind of determination.
>> The "session expired" message sounds like what happens when the
>> zkClientTimeout value is exceeded.  Internally, this value defaults to
>> 15 seconds, and typical example configs set it to 30 seconds ... so when
>> the session expires, it means there's a SERIOUS problem.  For computer
>> software, 15 or 30 seconds is a relative eternity.  A properly running
>> system should NEVER exceed that timeout.
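
For reference while we check this on our side: we believe the timeout
being discussed is the zkClientTimeout setting in the <solrcloud>
section of solr.xml (the line below is what we understand the stock
config ships with; please correct us if ours should differ):

<!-- from the <solrcloud> section of solr.xml; 30000 ms (30 s) is the
     value we believe ships in the stock configuration -->
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>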
>>
> I don't think we have a memory issue (the GC log for a busy day is
> posted below). Solr went out of sync with ZK because of the manual ZK
> transaction log parsing/checking we did on the server on Sept 17 at
> 16:00 UTC (as you can see in the log), which resulted in a ZK timeout.
> Since then Solr has not returned to normal. Is there a possibility of
> Solr query (real-time GET) response time increasing due to the Solr
> servers being in the recovering/down state?
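
To check whether /get (real-time GET) latency is actually affected while
replicas are recovering, we are timing a handful of requests roughly
like the sketch below (the collection name and document ids are
placeholders):

# Rough sketch: time a few real-time GET (/get) requests to see whether
# latency looks elevated. Collection name and ids are placeholders.
import time
import urllib.request

SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "mycoll"      # placeholder
DOC_IDS = ["1", "2", "3"]  # placeholder ids that exist in the collection

for doc_id in DOC_IDS:
    url = "%s/%s/get?id=%s&wt=json" % (SOLR_URL, COLLECTION, doc_id)
    start = time.time()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    print("id=%s took %.1f ms" % (doc_id, (time.time() - start) * 1000.0))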
>
> Here is the full Solr Log file (Note that it is in INFO mode):
> https://raw.githubusercontent.com/ganeshmailbox/har/master/SolrLogFile
> Here is the GC Log:
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMTAvMy8tLTAxX3NvbHJfZ2MubG9nLjUtLTIxLTE5LTU3
>
>
>> Can you share your solr log when the problem happens, covering a
>> timespan of at least a few minutes (and ideally much longer), as well as
>> a gc log from a time when Solr was up for a long time?  Hopefully the
>> solr.log and gc log will cover the same timeframe.  You'll need to use a
>> file sharing site for the GC log, since it's likely to be a large file.
>> I would suggest compressing it.  If the solr.log is small enough, you
>> could use a paste website for that, but if it's large, you'll need to
>> use a file sharing site.  Attachments to list email are almost never
>> preserved.
>>
>> Thanks,
>> Shawn
>>
>>
