Re: Replication and soft commits for NRT searches

Erick Erickson Thu, 15 Oct 2015 09:33:59 -0700

bq: the background for my question is that one of the requirements for
our injection tool is that it should report that a new document has
been successfully enrolled to the cluster only if it is available on
all replicas


Frankly, this is the tail wagging the dog. SolrCloud is designed to
guarantee eventual consistency, and you're trying to force it to
satisfy other criteria; it'll be a difficult fit.

My guess is that either you're only in development at this point or
that your query rate is very low, because setting soft commits to 1
second is going to be a problem in production if indexing is happening
consistently and you have to serve a significant query rate.

Really, I recommend you revisit this requirement and see if it
can be relaxed. For example, you could keep track of the number of
unique IDs you expect to be in Solr and periodically (when not
indexing and after the (longer) soft commit interval has expired)
query each replica (&distrib=false) or get replica stats and check
that each replica for each shard has the same number of live documents
and that the sum across the shards is what you expect.
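
A minimal sketch of that periodic check, in Python. Everything named
here is an assumption for illustration: the function names, the shape of
the counts dictionary, and the per-core URL you would pass in (something
like http://host:8983/solr/<core name>). It counts what each replica can
currently see with q=*:* and distrib=false, so it is only meaningful once
indexing has stopped and the soft commit interval has passed.

```python
import json
import urllib.parse
import urllib.request


def live_doc_count(core_url, timeout=10.0):
    """Ask one replica core for its own visible document count.

    distrib=false keeps the query on this core instead of fanning out.
    """
    params = urllib.parse.urlencode(
        {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    )
    url = f"{core_url}/select?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def check_consistency(counts_by_shard, expected_total):
    """Return a list of problems; an empty list means the counts agree.

    counts_by_shard maps shard name -> {replica name: document count}.
    """
    problems = []
    total = 0
    for shard, counts in sorted(counts_by_shard.items()):
        distinct = set(counts.values())
        if len(distinct) > 1:
            problems.append(f"{shard}: replicas disagree: {counts}")
        # Sum the largest count per shard so one lagging replica
        # does not also trigger a misleading total mismatch.
        total += max(distinct) if distinct else 0
    if total != expected_total:
        problems.append(f"expected {expected_total} docs, found {total}")
    return problems
```

You would fill counts_by_shard by calling live_doc_count once per replica
core and then alert on any non-empty result from check_consistency.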

Best,
Erick

On Wed, Oct 14, 2015 at 11:26 PM, MOIS Martin (MORPHO)
<martin.m...@morpho.com> wrote:
> Hello,
>
> the background for my question is that one of the requirements for our 
> injection tool is that it should report that a new document has been 
> successfully enrolled to the cluster only if it is available on all replicas. 
> The automated integration test for this feature will submit a document to the 
> cluster and afterwards check if it can be found with an appropriate query 
> (that is why I have configured autoSoftCommit/maxDocs=1).
>
> In this context the question arose: what happens if the update request
> returns rf=1 and I submit a query to a cluster with a replication factor of
> two directly after the update (maybe to the replica due to load balancing)?
> Will the automated integration test fail sometimes and sometimes not? Will I
> have to wait artificially between the update and the query and, if yes, how
> long? And how can I implement the requirement that our injection tool should
> report success only if the document has been replicated to all replicas?
>
> Best Regards,
> Martin Mois
>
>>bq: If a timeout between shard leader and replica can
>>lead to a smaller rf value (because replication has
>>timed out), is it possible to increase this timeout in the configuration?
>>
>>Why do you care? If it timed out, then the follower will
>>no longer be active and will not serve queries. The Cloud view
>>should show it in "down", "recovery" or the like. Before it
>>goes back to the "active" state, it will synchronize from
>>the leader automatically without you having to do anything and
>>any docs that were indexed to the leader will be faithfully
>>reflected on the follower  _before_ the recovering
>>follower serves any new queries. So practically it makes no
>>difference whether there was an update timeout or not.
>>
>>This is feeling a lot like an "XY" problem. You're asking detailed
>>questions about "X" (in this case timeouts, what rf means and the like)
>>without telling us what the problem you're concerned about is ("Y").
>>
>>So please back up and tell us what your higher level concern is.
>>Do you have any evidence of Bad Things Happening?
>>
>>And do, please, change your commit intervals to not happen after
>>every doc. That's a Really Bad Practice in Solr.
>>
>>Best,
>>Erick
>>
>>On Tue, Oct 13, 2015 at 11:58 PM, MOIS Martin (MORPHO)
>><martin.m...@morpho.com> wrote:
>>> Hello,
>>>
>>> thank you for the detailed answer.
>>>
>>> If a timeout between shard leader and replica can lead to a smaller rf
>>> value (because replication has timed out), is it possible to increase this
>>> timeout in the configuration?
>>>
>>> Best Regards,
>>> Martin Mois
>>>
>>> Comments inline:
>>>
>>> On Mon, Oct 12, 2015 at 1:31 PM, MOIS Martin (MORPHO)
>>> <martin.m...@morpho.com> wrote:
>>>> Hello,
>>>>
>>>> I am running Solr 5.2.1 in a cluster with 6 nodes. My collections have
>>>> been created with replicationFactor=2, i.e. I have one replica for each
>>>> shard. Beyond that I am using autoCommit/maxDocs=10000 and
>>>> autoSoftCommit/maxDocs=1 in order to achieve near-realtime search
>>>> behavior.
>>>>
>>>> As far as I understand from section "Write Side Fault Tolerance" in the
>>>> documentation
>>>> (https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance),
>>>> I cannot enforce that an update gets replicated to all replicas; I can
>>>> only get the achieved replication factor by requesting the return value
>>>> rf.
>>>>
>>>> My question is now, what exactly does rf=2 mean? Does it only mean that
>>>> the replica has written the update to its transaction log? Or has the
>>>> replica also performed the soft commit as configured with
>>>> autoSoftCommit/maxDocs=1? The answer is important for me: if the update
>>>> were only written to the transaction log, I could not search for it
>>>> reliably, as the replica may not have added it to the searchable index.
>>>
>>> rf=2 means that the update was successfully replicated to and
>>> acknowledged by two replicas (including the leader). The rf only deals
>>> with the durability of the update and has no relation to visibility of
>>> the update to searchers. The auto(soft)commit settings are applied
>>> asynchronously and do not block an update request.
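
As an illustration of reading rf from an update response, here is a
hedged Python sketch. It assumes the update is sent with the min_rf
request parameter and that the achieved value comes back as "rf" in the
JSON responseHeader; the function names and the collection URL are
hypothetical. As stated above, a high rf only says the update was
acknowledged, not that it is visible to searchers.

```python
import json
import urllib.request


def send_update(collection_url, docs, min_rf, timeout=10.0):
    """POST a list of documents to /update, asking Solr to report rf."""
    request = urllib.request.Request(
        f"{collection_url}/update?min_rf={min_rf}&wt=json",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def achieved_rf(update_response):
    """Return rf from a parsed update response, or None if it is absent."""
    return update_response.get("responseHeader", {}).get("rf")


def needs_followup(update_response, replication_factor):
    """True when the update was not acknowledged by every replica.

    A missing rf is treated as "unknown" and therefore also flagged.
    """
    rf = achieved_rf(update_response)
    return rf is None or rf < replication_factor
```

A caller could log or retry the document IDs for which needs_followup
returns True, rather than blocking until every replica has caught up.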
>>>
>>>>
>>>> My second question is, does rf=1 mean that the update was definitely not
>>>> successful on the replica or could it also represent a timeout of the
>>>> replication request from the shard leader? If it could also represent a
>>>> timeout, then there would be a small chance that the replication was
>>>> successful despite the timeout.
>>>
>>> Well, rf=1 implies that the update was only applied on the leader's
>>> index + tlog and either replicas weren't available or returned an
>>> error or the request timed out. So yes, you are right that it can
>>> represent a timeout and as such there is a chance that the replication
>>> was indeed successful despite the timeout.
>>>
>>>>
>>>> Is there a way to retrieve the replication factor for a specific document
>>>> after the update in order to check if replication was successful in the
>>>> meantime?
>>>>
>>>
>>> No, there is no way to do that.
>>>
>>>> Thanks in advance.
>>>>
>>>> Best Regards,
>>>> Martin Mois
>>>> #
>>>> " This e-mail and any attached documents may contain confidential or
>>>> proprietary information. If you are not the intended recipient, you are
>>>> notified that any dissemination, copying of this e-mail and any
>>>> attachments thereto or use of their contents by any means whatsoever is
>>>> strictly prohibited. If you have received this e-mail in error, please
>>>> advise the sender immediately and delete this e-mail and all attached
>>>> documents from your computer system."
>>>> #
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>>
