Out of sync deletions causing differing IDF

Upayavira Thu, 04 Aug 2016 05:00:07 -0700

We have a system that sees a reasonable number of changes on a daily
basis (roughly 60m docs, and around 1m updates per day). We use
SolrCloud; the data is split into 10 shards, and those shards are
replicated.

What we are finding is that the number of deletions is causing differing
maxDocs across the different replicas, and that is causing significantly
different IDF values between replicas of the same shard, giving
different scores and thus different orders depending upon which replica
we hit.
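
To illustrate the effect, here is a minimal Python sketch. It assumes
the classic Lucene TF-IDF form idf = 1 + ln(maxDoc / (docFreq + 1)); the
exact formula depends on the Solr version and the configured similarity,
and all the document counts below are made up for illustration.
Deleted-but-unmerged documents still count towards both maxDoc and
docFreq, so two replicas holding the same live documents can compute
different values.

```python
import math


def classic_idf(doc_freq: int, max_doc: int) -> float:
    """Classic TF-IDF style idf (assumed form; varies by version/similarity)."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Hypothetical shard: both replicas hold the same live documents.
live_docs = 6_000_000      # live docs in the shard
live_matches = 50_000      # live docs containing the query term

# Replica A has merged most of its deletes away; replica B has not.
# The deleted counts are invented purely to show the effect.
replica_a = classic_idf(doc_freq=live_matches + 1_000,
                        max_doc=live_docs + 200_000)
replica_b = classic_idf(doc_freq=live_matches + 20_000,
                        max_doc=live_docs + 1_500_000)

print(f"replica A idf: {replica_a:.4f}")
print(f"replica B idf: {replica_b:.4f}")
print(f"difference:    {abs(replica_a - replica_b):.4f}")
```

The same query term gets a different weight on each replica, which is
enough to reorder documents whose scores are close.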

I would have expected that, because the data is being indexed
concurrently across replicas, the pattern of deletes and merges would
be similar on each replica, but that doesn't seem to be the case in
practice.

We could, of course, optimise the index to merge down to a single
segment. This would clear all deletes out, but would leave us in a worse
place for the future, as from then on most of our deletes would be
concentrated in a single large segment.

Has anyone seen this sort of thing before, and does anyone have
suggestions for how to keep IDF values in a similar range across
replicas?

Upayavira
