Upayavira: bq: I would have expected that, because the data is being indexed concurrently across replicas, that the pattern of delete/merge would be similar across replicas.
Except for the pesky timing issue. The timers start for autocommit when a request is received. So the time the autocommit timer expires won't be the same wall-clock time on all servers, and thus the replicas may not have the same docs in the same segments.

It would be _really nice_ if they did, because then we wouldn't have to fall back to full replication so often for recovery. I think there's a JIRA out there for trying to coordinate all the commits across replicas in a shard, but I can't find it on a quick look.

Would distributed IDF help here? https://issues.apache.org/jira/browse/SOLR-1632 (even though this is really old, it's in 5.0+)

Best,
Erick

On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello - your similarity should rely on numDocs instead, it solves the problem.
> I believe it is already fixed in trunk, but I am not sure.
> Markus
>
> -----Original message-----
>> From:Upayavira <upayav...@odoko.co.uk>
>> Sent: Thursday 4th August 2016 13:59
>> To: solr-user@lucene.apache.org
>> Subject: Out of sync deletions causing differing IDF
>>
>> We have a system that has a reasonable number of changes going on on a
>> daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
>> Cloud, the data is split into 10 shards and those shards are replicated.
>>
>> What we are finding is that the number of deletions is causing differing
>> maxDoc values across the different replicas, and that is causing
>> significantly different IDF values between replicas of the same shard,
>> giving different scores and thus different orders depending upon which
>> replica we hit.
>>
>> I would have expected that, because the data is being indexed
>> concurrently across replicas, that the pattern of delete/merge would be
>> similar across replicas, but that doesn't seem to be the case in
>> practice.
>>
>> We could, of course, optimise the index to merge down to a single
>> segment. This would clear all deletes out, but would leave us in a worse
>> place for the future, as now most of our deletes would be concentrated
>> into a single large segment.
>>
>> Has anyone seen this sort of thing before, and does anyone have
>> suggested strategies as to how to encourage IDF values into a similar
>> range across replicas?
>>
>> Upayavira
>>
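
To make the effect discussed in this thread concrete, here is a rough sketch in Python. It assumes the classic TF-IDF form of IDF, 1 + ln(N / (docFreq + 1)); the similarity actually configured on a given Solr install may differ. The per-shard live doc count follows from the 60m docs over 10 shards mentioned above, while the deleted-doc counts and the docFreq are made-up numbers for illustration. docFreq is held fixed for simplicity, although in practice it may also count deleted documents that have not yet been merged away.

```python
import math


def classic_idf(doc_count: int, doc_freq: int) -> float:
    """Classic TF-IDF style IDF; an assumed formula, not necessarily the configured one."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Two replicas of the same shard: same live docs, different unmerged deletes.
live_docs = 6_000_000   # 60m docs / 10 shards, per the original post
doc_freq = 50_000       # hypothetical document frequency for one term

replica_a = {"numDocs": live_docs, "maxDoc": live_docs + 200_000}  # hypothetical deletes
replica_b = {"numDocs": live_docs, "maxDoc": live_docs + 900_000}  # hypothetical deletes

# IDF computed from maxDoc (counts deleted docs) drifts between replicas.
idf_max_a = classic_idf(replica_a["maxDoc"], doc_freq)
idf_max_b = classic_idf(replica_b["maxDoc"], doc_freq)

# IDF computed from numDocs (live docs only) agrees across replicas.
idf_num_a = classic_idf(replica_a["numDocs"], doc_freq)
idf_num_b = classic_idf(replica_b["numDocs"], doc_freq)

print(f"maxDoc-based : A={idf_max_a:.4f}  B={idf_max_b:.4f}")
print(f"numDocs-based: A={idf_num_a:.4f}  B={idf_num_b:.4f}")
```

With these assumptions the maxDoc-based values differ between the two replicas while the numDocs-based values match, which is the behaviour Markus's suggestion relies on.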