Thanks Eric,
I'll add that we have configured commits to be issued only by the loading
program. We have turned off soft commits and searcher-opening autoCommits in
solrconfig.xml. This way, when we upload, we move from the old list of docs to
the new list in one atomic operation (delete, add, then commit).
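For illustration only (the field names, list ids, and document shape below are my invention, not our actual loader): Solr's JSON update syntax lets a single request body carry the delete, the add, and the commit, which is what makes the swap look atomic from the client side:

```python
import json

# Hedged sketch of the swap: one update body that deletes the old list,
# adds a replacement doc, and commits -- all names here are hypothetical.
def build_swap_body(old_list_query: str, new_doc: dict) -> str:
    body = {
        "delete": {"query": old_list_query},  # drop the previous list
        "add": {"doc": new_doc},              # add a new-list document
        "commit": {},                         # the single explicit commit
    }
    return json.dumps(body)

# Example payload for POSTing to a collection's /update endpoint.
payload = build_swap_body("list_id:old", {"id": "doc1", "list_id": "new"})
parsed = json.loads(payload)
```

For many documents the real loader would batch adds rather than send one doc per body; the point is only that delete, add, and commit can travel together.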
I'll also add: this index holds 500,000,000 docs, and under heavy uploading the
nodes go into recovery. I presume that's down to the commits being too far
apart, causing the replica nodes to falter. The heavy upload happens in a small
window of time, and to get around this issue I remove the replicas during that
period and add them back afterwards. The new recovery-mode issue looks like it
was down to heavy upload, but outside the designated period. So the most likely
scenario is that I've created the issue with my own tweaking; I hope you can
point me in the right direction.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
Regards
Russell Taylor
-----Original Message-----
From: Erick Erickson [mailto:[email protected]]
Sent: 22 May 2019 16:45
To: [email protected]
Subject: Re: CloudSolrClient (any version). Find the node your query has
connected to.
WARNING - External email from lucene.apache.org
OK, now we’re cooking with oil.
First, nodes in recovery shouldn’t make any difference to a query. They should
not serve any part of a query so I think/hope that’s a red herring. At worst a
node in recovery should pass the query on to another replica that is _not_
recovering.
When you’re looking at this, be aware that as long as _Solr_ is up and running
on a node, it’ll accept queries. For simplicity let's say Solr1 hosts _only_
collection1_shard1_replica1 (cs1r1).
Now you fire a query at Solr1. It has the topology from ZooKeeper as well as
its own internal knowledge of hosted replicas. For a top-level query it should
send sub-queries out only to healthy replicas, bypassing its own recovering
replica.
Let’s say you fire the query at Solr2 instead. First, if there’s been time to
propagate the down state of cs1r1 to ZooKeeper and Solr2 has that state, it
shouldn’t even send a subrequest to cs1r1.
Now let’s say Solr2 hasn’t gotten the message yet and does send a subrequest to
cs1r1. cs1r1 should know its state is recovering and either return an error to
Solr2 (which will pick a new replica to send that subrequest to) or forward it
on to another healthy replica itself; I’m not quite sure which. In any case the
request should _not_ be serviced from cs1r1.
If you do prove that a node that is really in recovery is serving requests,
that’s a fairly serious bug and we’d need to know lots of details.
Second, even if you did have the URL Solr sends the query to, it wouldn’t help.
Once a Solr node receives a query, it does its _own_ round robin, sending a
subrequest to one replica of each shard, gets the replies back, then goes back
out to the same replicas for the final documents. So you still wouldn’t know
which replicas served the query.
The fact that you say things come back into sync after a commit points to
autocommit times. I’m assuming you have an autocommit setting that opens a new
searcher (<openSearcher>true</openSearcher> in the autoCommit section, or any
positive maxTime in the autoSoftCommit section of solrconfig.xml). These commit
points will fire at different wall-clock times on different replicas, resulting
in replicas temporarily having different searchable documents. BTW, the same
thing applies if you send “commitWithin” in a SolrJ CloudSolrClient.add call…
Anyway, if you fire a query at a specific replica and add &distrib=false, the
replica will bring back only documents from that replica. We’re talking the
replica core here, so part of the URL will be the complete replica name, like
"…./solr/collection1_shard1_replica_n1/query?q=*:*&distrib=false"
A very quick test: when you have a replica in recovery, stop indexing and
either wait for your autocommit interval to expire (one that opens a new
searcher) or issue a commit to the collection. My bet/hope is that your counts
will be just fine. You can use the &distrib=false parameter to query each
replica of the relevant shard directly…
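To make that concrete, here is a hedged sketch of the URLs involved; the host, collection, and replica core names are invented placeholders, not anything from your cluster:

```python
from urllib.parse import urlencode

# Hypothetical node and replica cores -- substitute your own.
base = "http://solr1:8983/solr"
replicas = ["collection1_shard1_replica_n1", "collection1_shard1_replica_n2"]

# 1) Collection-level commit so every replica opens a searcher on the
#    same set of documents.
commit_url = f"{base}/collection1/update?" + urlencode({"commit": "true"})

# 2) Per-replica count queries with distrib=false, so each core reports
#    only its own documents instead of fanning the query out.
count_urls = [
    f"{base}/{core}/query?" + urlencode({"q": "*:*", "distrib": "false", "rows": 0})
    for core in replicas
]
# Matching numFound across these URLs (after the commit) means the
# replicas of the shard are in sync.
```

You could hit these with curl or any HTTP client; the only point is that the core name, not the collection name, goes in the path when distrib=false.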
Best,
Erick
> On May 22, 2019, at 8:09 AM, Russell Taylor <[email protected]> wrote:
>
> Hi Erick,
> Every time any of the replica nodes goes into recovery mode we start seeing
> queries that return incorrect counts. I'm being told ZooKeeper will give me
> the correct node (not one in recovery), but I want to prove it, since the
> query issue only comes up when one of the nodes is in recovery mode.
> The application loading the data shows the correct counts, and after
> committing we check the results and they look correct.
>
> If I can get the URL, I can prove that the problem is due to the query
> hitting a node in recovery mode.
>
> I hope that explains the problem, thanks for your time.
>
> Regards
>
> Russell Taylor
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:[email protected]]
> Sent: 22 May 2019 15:50
> To: [email protected]
> Subject: Re: CloudSolrClient (any version). Find the node your query has
> connected to.
>
>
> Why do you want to know? You’ve asked how to do X without telling us what
> problem Y you’re trying to solve (the XY problem), and frequently that leads
> to a lot of wasted time…
>
> Under the covers, CloudSolrClient uses a pretty simple round-robin load
> balancer to pick the Solr node to send the query to, so “it depends”…
>
>> On May 22, 2019, at 5:51 AM, Jörn Franke <[email protected]> wrote:
>>
>> You have to provide the addresses of the zookeeper ensemble - it will figure
>> it out on its own based on information in Zookeeper.
>>
>>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <[email protected]>:
>>>
>>> Hi,
>>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this
>>> collection on our 6-node cluster) the query has connected to?
>>> I'm hoping to get the full URL if possible.
>>>
>>>
>>> Regards
>>>
>>> Russell Taylor
>>>
>>>
>>>
>>> ________________________________
>>>
>>> This message may contain confidential information and is intended for
>>> specific recipients unless explicitly noted otherwise. If you have reason
>>> to believe you are not an intended recipient of this message, please delete
>>> it and notify the sender. This message may not represent the opinion of
>>> Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and
>>> does not constitute a contract or guarantee. Unencrypted electronic mail is
>>> not secure and the recipient of this message is expected to provide
>>> safeguards from viruses and pursue alternate means of communication where
>>> privacy or a binding message is desired.
>
>