Thanks, Erick, for your attention!
My comments are below. Since I suspect the problem resides in ZooKeeper,
I'll collect more information from the ZK logs and Solr logs and be back
soon.
bq. I've noticed that some replicas stop receiving updates from the
leader without any visible signs from the cluster status.
Hmm, yes, this isn't expected at all. What are you seeing that causes
you to say this? You'd have to be monitoring the log for update
messages to the replicas that aren't leaders, or the like. If anyone is
going to have a prayer of reproducing this, we'll need more info on
exactly what you're seeing and how you're measuring it.
Meanwhile, I have log level WARN... I'll decrease it to INFO and see. Thanks!
Have you changed any configurations in your replicas at all? We'd need
the exact steps you performed if so.
The command to create the replicas was like this (implicit sharding and a
custom core name):

http://mysolr07:8983/solr/admin/collections?action=ADDREPLICA
  &collection=rpk94
  &shard=rpk94_1_0
  &property.name=rpk94_1_0_07
  &type=tlog
  &node=mysolr07:8983_solr
On a quick test I didn't see this, but if it were that easy to
reproduce I'd expect it to have shown up before.
Yesterday I tried to reproduce it by changing the leader with the
REBALANCELEADERS command.
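Since REBALANCELEADERS only promotes replicas that carry the
preferredLeader property, the sequence was along these lines (the shard
and replica names here are illustrative, taken from the log below):

http://mysolr07:8983/solr/admin/collections?action=ADDREPLICAPROP
  &collection=rpk94
  &shard=rpk94_1_117
  &replica=core_node73
  &property=preferredLeader
  &property.value=true

http://mysolr07:8983/solr/admin/collections?action=REBALANCELEADERS
  &collection=rpk94
  &maxAtOnce=1
  &maxWaitSeconds=60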
It ended up with no leader at all for the shard, and I could not get a
leader elected for a long time:
There was a problem trying to register as the
leader:org.apache.solr.common.SolrException: Could not register as the
leader because creating the ephemeral registration node in ZooKeeper
failed
...
Deleting duplicate registration:
/collections/rpk94/leader_elect/rpk94_1_117/election/2983181187899523085-core_node73-n_0000000022
...
Index fetch failed :org.apache.solr.common.SolrException: No
registered leader was found after waiting for 4000ms , collection:
rpk94 slice: rpk94_1_117
...
Even deleting all replicas for the shard and recreating a replica on the
same node with the same name did not help: still no leader for that shard.
I had to delete the collection, wait till morning, and then it was
recreated successfully.
I suppose some stale znodes had been removed from ZK by morning.
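Next time I'll inspect the election znodes directly to look for leftover
ephemeral entries, e.g. with the bin/solr ZK tool (the path is from the
log above; the ZK address is a placeholder):

bin/solr zk ls -r /collections/rpk94/leader_elect -z zkhost:2181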
NOTE: just looking at the cloud graph and having a node be active is
not _necessarily_ sufficient for the node to be up to date. It
_should_ be sufficient if (and only if) the node was shut down
gracefully, but a "kill -9" or similar doesn't give the replicas on
the node the opportunity to change the state. The "live_nodes" znode
in ZooKeeper must also contain the node the replica resides on.
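A quick way to verify that is to list the znode, e.g. with the bin/solr
ZK tool (the ZK address is a placeholder):

bin/solr zk ls /live_nodes -z zkhost:2181

Each live node should show up there as host:port_solr.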
The node was live and the cluster was healthy.
If you see this state again, you could try pinging the node directly;
does it respond? Your URL should look something like:
http://host:port/solr/collection_shard1_replica_t1/query?q=*:*&distrib=false
Yes, sure, I did. The 'ill' replica responded, and its document count
differed from the leader's.
The "distrib=false" is important as it won't forward the query to any
other replica. If what you're reporting is really happening, that node
should respond with a document count different from other nodes.
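For example, a quick comparison could look like this (the hosts and core
names here are illustrative, following the naming pattern above):

curl 'http://mysolr07:8983/solr/rpk94_1_0_07/query?q=*:*&distrib=false&rows=0'
curl 'http://mysolr05:8983/solr/rpk94_1_0_05/query?q=*:*&distrib=false&rows=0'

If the numFound values still differ after the follower has had time to
catch up, that replica really is behind.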
NOTE: there's a delay between the time the leader indexes a doc and when
it's visible on the follower. Are you sure you're waiting for
leader_commit_interval+polling_interval+autowarm_time before concluding
that there's a problem? I'm a bit suspicious that checking the versions
is leading you to conclude that your indexes are out of sync when really
they're just catching up normally. If it's at all possible to turn off
indexing for a few minutes when this happens, and everything just gets
better, then it's not really a problem.
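One way to check where each replica stands is the replication handler's
indexversion command, e.g.:

http://host:port/solr/collection_shard1_replica_t1/replication?command=indexversion

It reports the replica's indexversion and generation, which you can
compare between the leader and the followers once the intervals above
have elapsed.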
Sure, but the problem was on many shards (though not all) and persisted
for a long time.
If we prove out that this is really happening as you think, then a
JIRA (with steps to reproduce) is _definitely_ in order.
Best,
Erick
On Wed, Oct 24, 2018 at 2:07 AM Vadim Ivanov
<vadim.iva...@spb.ntk-intourist.ru> wrote:
Hi All !
I'm testing Solr 7.5 with TLOG replicas on SolrCloud with 5 nodes.
My collection has shards, and every shard has 3 TLOG replicas on
different nodes.
I've noticed that some replicas stop receiving updates from the leader
without any visible signs from the cluster status.
(all replicas are active and green in the Admin UI CLOUD graph). But the
indexversion of the 'ill' replica does not increase along with the
leader's.
It seems dangerous, because that 'ill' replica could become the leader
after a restart of the nodes, and I have already experienced data loss.
I didn't notice any meaningful records in the Solr log, except that the
problem probably occurs when the leader changes.
Meanwhile, I monitor the indexversion of all replicas in the cluster via
mbeans and recreate 'ill' replicas when the difference from the leader's
indexversion is more than one.
Any suggestions?
--
Best regards, Vadim