Re: Out of sync deletions causing differing IDF

Erick Erickson Thu, 04 Aug 2016 09:28:38 -0700

Upayavira:

bq: I would have expected that, because the data is being indexed
concurrently across replicas, that the pattern of delete/merge would be
similar across replicas.


Except for the pesky timing issue. The autocommit timer on each replica starts
when that replica receives an update request, so the timers won't expire at
the same wall-clock time on all servers. The replicas thus may not end up with
the same docs in the same segments, and since merging is what purges deleted
docs, the deleted-doc counts drift apart too. It would be _really nice_ if the
segments did line up, because then we wouldn't have to fall back to full
replication so often for recovery.
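
To make the timing point concrete, here is a toy sketch in Python. It is not
Solr code: the function name, the arrival times and the 10-unit window are all
made up for illustration, and it assumes the simplest possible model, in which
the timer starts on the first uncommitted doc and everything received before
it expires lands in one segment.

```python
def commit_batches(arrivals, max_time):
    """Group docs into commit batches (hypothetical stand-in for segments).

    arrivals: list of (arrival_time, doc_id) as seen by ONE replica.
    The timer starts when the first uncommitted doc arrives and expires
    max_time later; docs arriving at or after expiry start a new batch.
    """
    batches = []
    current = []
    deadline = None
    for t, doc in sorted(arrivals):
        if deadline is not None and t >= deadline:
            # Timer expired before this doc arrived: close the batch.
            batches.append(current)
            current = []
            deadline = None
        if deadline is None:
            # First uncommitted doc starts a fresh timer.
            deadline = t + max_time
        current.append(doc)
    if current:
        batches.append(current)
    return batches


# Same four docs; replica B just sees each one slightly later (made-up times).
replica_a = [(0, "d1"), (4, "d2"), (9, "d3"), (12, "d4")]
replica_b = [(1, "d1"), (5, "d2"), (11, "d3"), (13, "d4")]

print(commit_batches(replica_a, max_time=10))  # [['d1', 'd2', 'd3'], ['d4']]
print(commit_batches(replica_b, max_time=10))  # [['d1', 'd2'], ['d3', 'd4']]
```

Both replicas hold the same docs, but d3 lands in a different batch on each,
so later merges (and the deletes they purge) need not happen at the same time.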

I think there's a JIRA out there for trying to coordinate all the commits across
replicas in a shard, but I can't find it on a quick look.

Would distributed IDF help here?
https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
really old, it's in 5.0+)
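
For reference, a rough sketch of why the doc count fed into IDF matters. It
uses the classic TF-IDF form, idf = 1 + ln(N / (df + 1)); which count a given
Solr/Lucene version actually passes as N depends on the version and the
similarity in use, so treat that as an assumption. All numbers are made up,
and docFreq is held fixed for simplicity (in a real index it also counts
deleted docs until they are merged away).

```python
import math


def classic_idf(doc_freq, doc_count):
    """Classic TF-IDF style idf: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical shard: both replicas hold the same live docs...
num_docs = 6_000_000
doc_freq = 1_000

# ...but carry different numbers of not-yet-merged deletes.
max_doc_a = num_docs + 200_000
max_doc_b = num_docs + 2_000_000

# With N = maxDoc the two replicas disagree on the idf of the same term.
idf_a = classic_idf(doc_freq, max_doc_a)
idf_b = classic_idf(doc_freq, max_doc_b)
print(idf_a, idf_b, idf_b - idf_a)

# With N = numDocs (live docs only) both replicas compute the same value.
print(classic_idf(doc_freq, num_docs))
```

The gap between the two maxDoc-based values grows with the difference in
pending deletes, which matches the symptom described in the original post.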

Best,
Erick

On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Hello - your similarity should rely on numDocs instead; that solves the problem.
> I believe it is already fixed in trunk, but I am not sure.
> Markus
>
> -----Original message-----
>> From:Upayavira <upayav...@odoko.co.uk>
>> Sent: Thursday 4th August 2016 13:59
>> To: solr-user@lucene.apache.org
>> Subject: Out of sync deletions causing differing IDF
>>
>> We have a system that sees a reasonable number of changes on a daily
>> basis (maybe 60m docs, and around 1m updates per day). Using SolrCloud,
>> the data is split into 10 shards and those shards are replicated.
>>
>> What we are finding is that the number of deletions is causing differing
>> maxDocs across the different replicas, and that is causing significantly
>> different IDF values between replicas of the same shard, giving
>> different scores and thus different orders depending upon which replica
>> we hit.
>>
>> I would have expected that, because the data is being indexed
>> concurrently across replicas, that the pattern of delete/merge would be
>> similar across replicas, but that doesn't seem to be the case in
>> practice.
>>
>> We could, of course, optimise the index to merge down to a single
>> segment. This would clear all deletes out, but would leave us in a worse
>> place for the future, as now most of our deletes would be concentrated
>> into a single large segment.
>>
>> Has anyone seen this sort of thing before, and does anyone have
>> suggested strategies as to how to encourage IDF values into a similar
>> range across replicas?
>>
>> Upayavira
>>
