Hi Michael,

The evidence is how Lucene works, and the fact that I add the same docs over and over again in tests. If I index 500k docs into an index that already holds the same 500k docs, I write a delete flag on the old 500k and add the new 500k, which leads to a million docs (maxDoc). You're correct: only by merging segments (or optimize/forceMerge) can I reduce (or stabilize) maxDoc on all replicas.
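
A rough sketch of that bookkeeping, in Python rather than Lucene's Java. The Segment class, the per-term counts and the partly merged replica are made up for illustration. The idf formula is the classic Lucene TF-IDF one, 1 + ln(maxDoc / (docFreq + 1)), which I am assuming is the similarity in use; docCount is not modelled here.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    max_doc: int    # every doc slot in the segment, deleted ones included
    deleted: int    # docs flagged as deleted but not yet merged away
    doc_freq: int   # docs containing the term, deleted ones included


def index_stats(segments):
    """Return (maxDoc, numDocs, docFreq) summed over all segments."""
    max_doc = sum(s.max_doc for s in segments)
    num_docs = max_doc - sum(s.deleted for s in segments)
    doc_freq = sum(s.doc_freq for s in segments)
    return max_doc, num_docs, doc_freq


def classic_idf(doc_freq, max_doc):
    """Classic Lucene TF-IDF idf: 1 + ln(maxDoc / (docFreq + 1))."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))


# Three replicas holding the same 500k live docs (hypothetical numbers).
replicas = {
    # Old segment fully merged away.
    "merged": [Segment(500_000, 0, 1_000)],
    # Old 500k only flagged as deleted, new 500k added next to them.
    "unmerged": [Segment(500_000, 500_000, 1_000), Segment(500_000, 0, 1_000)],
    # Part of the old segment already reclaimed by a merge.
    "partly merged": [Segment(250_000, 250_000, 700), Segment(500_000, 0, 1_000)],
}

for name, segments in replicas.items():
    max_doc, num_docs, doc_freq = index_stats(segments)
    print(f"{name}: maxDoc={max_doc} numDocs={num_docs} docFreq={doc_freq} "
          f"idf={classic_idf(doc_freq, max_doc):.4f}")
```

All three replicas report the same numDocs, but maxDoc and docFreq differ, so the idf for the same term differs between them until the deleted docs are merged away.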

Old-school replication has an advantage here, as identical segments are replicated. In SolrCloud only docs are pushed to the replicas. The problem now is that replicas don't merge at the same time, leading to differences in maxDoc, docCount and docFreq. We need a way, and I think many SolrCloud users are going to need this as well, to make sure replicas don't deviate too much from each other, because if they do, documents are certainly going to jump positions.

Many thanks for sharing your thoughts,
Markus

-----Original message-----
> From:Michael Ryan <mr...@moreover.com>
> Sent: Wed 23-Jan-2013 23:50
> To: solr-user@lucene.apache.org
> Subject: RE: Issues with docFreq/docCount on SolrCloud
>
> Are you able to see any evidence that some of the 500k docs are being added
> twice? Check the maxDocs on the Solr admin page. I vaguely recall there being
> some issue with docs in SolrCloud being added multiple times (which under the
> covers is really add, delete, add). I think that could cause the docCount to
> be different across "identical" indexes. That would also explain why a
> forceMerge fixes it, as the deleted documents are then fully removed.
>
> -Michael
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, January 23, 2013 5:38 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Issues with docFreq/docCount on SolrCloud
>
> Hi again,
>
> I've tried various settings for TieredMergePolicy to make sure the docFreq,
> maxDoc and docCount don't deviate too much. We also ran tests after
> increasing reclaimDeletesWeight from 2.0 to 8.0 and with slightly more
> frequent merging. In these tests we reindexed the same 500k docs each time
> in different cores with various settings at the same time.
>
> We still see documents in distributed queries being scored slightly
> differently, leading to documents jumping positions in the resultset, which
> is obviously unacceptable.
>
> To clarify, these documents don't jump positions because they have the
> same score and are sorted by Lucene docID; it's the actual score that is
> different. Also, the index doesn't change when we fire queries and it's not a
> problem of lacking distributed IDF. It is, of course, acceptable for
> documents to jump position on a frequently changing index, that's the way it
> works. But not for multiple replicas of a static index.
>
> Is there anyone around here with suggestions, hints or anything?
>
> The next thing we might try is to route the same user to the same replica of
> a shard by overriding the http shard handler, but I'm not sure this is a
> proper solution. This, at least, might prevent users from seeing documents
> jumping positions in the same result set.
>
> Thanks,
> Markus
>
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Mon 21-Jan-2013 20:31
> > To: solr-user@lucene.apache.org
> > Subject: Issues with docFreq/docCount on SolrCloud
> >
> > Hi,
> >
> > We have a few trunk clusters running with two replicas for each shard. We
> > sometimes see results jumping positions for identical queries. We've
> > tracked it down to differences in docFreq and docCount between the leader
> > and replicas. The only way to force all cores in the shard to be
> > consistent is to optimize or forceMerge the segments.
> >
> > Is there anyone here who can give advice on this issue? For obvious reasons
> > we don't want to optimize 50GB of data on some regular basis, but we do
> > want to make sure the variations in docFreq/docCount do not lead to
> > results jumping positions in the resultset for identical queries.
> >
> > Like most of you, we already have small issues due to the lack of
> > distributed IDF; having this problem as well makes SolrCloud less
> > predictable and harder to debug.
> >
> > Thanks,
> > Markus
> >
>