Thanks, but I am not going to be brave this time :) I tried reclaimDeletesWeight on an ordinary index some time ago and it was very aggressive at even slightly higher values than the default. I think setting this weight in this situation would be analogous to a forceMerge every time, which makes sense.

Thanks,
Markus
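For reference, a rough sketch of what Erick's experiment below could look like in solrconfig.xml on Solr 6.x. This is untested: the parameter names are handed to TieredMergePolicy's setters via the reflection Erick describes, the values are purely illustrative, and the mechanism is undocumented, so use at your own risk.

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- Illustrative value; the assumed Lucene 6.x default is 2.0. Higher
           values bias merge selection toward segments with many deleted
           docs, which, as noted above, can get aggressive quickly. -->
      <double name="reclaimDeletesWeight">3.0</double>
      <!-- Optional: lower segmentsPerTier (default 10) so merges are
           triggered with fewer segments per tier. -->
      <int name="segmentsPerTier">2</int>
    </mergePolicyFactory>
  </indexConfig>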
-----Original message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Thursday 13th April 2017 17:07
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: maxDoc ten times greater than numDoc
>
> If you want to be brave....
>
> Through a clever bit of reflection, the parameters that
> TieredMergePolicy uses to decide which segments to reclaim are settable
> in solrconfig.xml (undocumented, so use at your own risk). You could
> try bumping
>
> reclaimDeletesWeight
>
> in your TieredMergePolicy configuration if you wanted to experiment.
>
> There's no good reason not to set your segments per tier; it won't hurt.
>
> But as you say, you have a solution, so this is just for curiosity's sake.
>
> Best,
> Erick
>
> On Thu, Apr 13, 2017 at 4:42 AM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
> > Maybe not every entry got deleted, and a leftover one was holding up
> > the segment, e.g. an abandoned child or parent record. If, for example,
> > the parent record has a date field and the child does not, then
> > deleting with a date-based query may trigger this. I think there was a
> > bug about abandoned children or something.
> >
> > This is pure speculation, of course.
> >
> > Regards,
> >    Alex.
> > ----
> > http://www.solr-start.com/ - Resources for Solr users, new and experienced
> >
> >
> > On 13 April 2017 at 12:54, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> I forced a merge yesterday and went back to one segment.
> >>
> >> One indexer program reindexes (most or all) entries every 20 minutes
> >> or so. There is nothing custom at that particular point. There is no
> >> autoCommit; the indexer program is responsible for a hard commit, and
> >> it is the single source of reindexed data.
> >>
> >> After one cycle we had two segments, 50% deleted, as expected. This
> >> was stable for many hours and many cycles. For some reason, I now have
> >> 2/3 deletes and three segments, and this situation is also stable. So
> >> the merges do happen, but sometimes they don't, and when they don't,
> >> the size increases (now three segments, 55 MB). But it appears that
> >> the number of segments never decreases, and that is what bothers me.
> >>
> >> I was about to set segmentsPerTier to two, but then I realized I can
> >> also delete everything prior to indexing, as opposed to deleting only
> >> items older than the set I am about to reindex. This strategy works
> >> fine with other reindexing programs; they don't suffer this problem.
> >>
> >> So it is not solved, but it is not a problem anymore. Thanks all anyway :)
> >> Markus
> >>
> >> -----Original message-----
> >>> From: Erick Erickson <erickerick...@gmail.com>
> >>> Sent: Wednesday 12th April 2017 17:51
> >>> To: solr-user <solr-user@lucene.apache.org>
> >>> Subject: Re: maxDoc ten times greater than numDoc
> >>>
> >>> Yes, this is very strange. My bet: you have something
> >>> custom (a setting, indexing code, whatever) that
> >>> is getting in the way.
> >>>
> >>> Second possibility (really stretching here): your
> >>> merge settings require 10 segments to exist
> >>> before merging, and somehow not all the docs in the
> >>> segments are replaced. So until you get to the 10th
> >>> re-index (and assuming a single segment is
> >>> produced per re-index), the older segments aren't
> >>> merged. If that were the case, I'd expect to see the
> >>> number of deleted docs drop back periodically and
> >>> then build up again. A real shot in the dark.
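As an aside, the forced merge Markus mentions above (and the once-a-day optimize Shawn suggests further down) can be issued through the XML update format; a minimal sketch, assuming the default /update handler:

  <!-- POST to /solr/<collection>/update with Content-Type: text/xml.
       Merges the index down to one segment, expunging deleted docs. -->
  <optimize maxSegments="1"/>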
> >>> One way to test this would be to specify a "segmentsPerTier" of,
> >>> say, 2 rather than the default 10; see:
> >>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
> >>> If this were the case, I'd expect that with a setting of 2 your
> >>> index might have 50% deleted docs; that would at least tell us
> >>> whether we're on the right track.
> >>>
> >>> Take a look at your index on disk. If you're seeing gaps in the
> >>> segment numbering, you are getting merges; it may just be that
> >>> they're not happening very often.
> >>>
> >>> And I take it you have no custom code here and you are
> >>> doing commits? (Hard commits are all that matter
> >>> for merging; it doesn't matter whether openSearcher
> >>> is set to true or false.)
> >>>
> >>> I just tried the "techproducts" example as follows:
> >>> 1> indexed all the sample files with the bin/solr -e techproducts example
> >>> 2> started re-indexing the sample docs one at a time with post.jar
> >>>
> >>> It took a while, but eventually the original segments got merged
> >>> away, so I doubt it's any weirdness with a small index.
> >>>
> >>> Speaking of small indexes, why are you sharding with only
> >>> 8K docs? Sharding will probably slow things down for such
> >>> a small index. This isn't germane to your question, though.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>
> >>> On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>> > On 4/12/2017 5:11 AM, Markus Jelsma wrote:
> >>> >> One of our 2-shard collections is rather small and gets all its
> >>> >> entries reindexed every 20 minutes or so. Now I just noticed
> >>> >> maxDoc is ten times greater than numDoc; the merger is never
> >>> >> scheduled, but settings are default. We just overwrite the
> >>> >> existing entries, all of them.
> >>> >>
> >>> >> Here are the stats:
> >>> >>
> >>> >> Last Modified: 12 minutes ago
> >>> >> Num Docs: 8336
> >>> >> Max Doc: 82362
> >>> >> Heap Memory Usage: -1
> >>> >> Deleted Docs: 74026
> >>> >> Version: 3125
> >>> >> Segment Count: 10
> >>> >
> >>> > This discrepancy would typically mean that when you reindex, you're
> >>> > indexing MOST of the documents, but not ALL of them, so at least one
> >>> > document is still not deleted in each older segment. When segments
> >>> > have all their documents deleted, they are automatically removed by
> >>> > Lucene, but if there's even one document NOT deleted, the segment
> >>> > will remain until it is merged.
> >>> >
> >>> > There's no information here about how large this core is, but
> >>> > unless the documents are REALLY enormous, I'm betting that an
> >>> > optimize would happen quickly. With a document count this low and
> >>> > an indexing pattern that results in such a large maxDoc, this might
> >>> > be a good time to go against general advice and perform an optimize
> >>> > at least once a day.
> >>> >
> >>> > An alternate idea that would not require optimizes: if the intent
> >>> > is to completely rebuild the index, you might want to consider
> >>> > issuing a "delete all docs by query" before beginning the indexing
> >>> > process. This would ensure that none of the previous documents
> >>> > remain. As long as you don't do a commit that opens a new searcher
> >>> > before the indexing is complete, clients won't ever know that
> >>> > everything was deleted.
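A minimal sketch of that delete-everything approach, using the XML update format: post this to the collection's /update handler before reindexing, and do not pass commit=true, so no searcher-opening commit happens until the indexer's usual hard commit at the end.

  <!-- POST to /solr/<collection>/update with Content-Type: text/xml.
       Without a commit that opens a new searcher, the deletes stay
       invisible to clients until reindexing completes. -->
  <delete>
    <query>*:*</query>
  </delete>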
> >>> >
> >>> >> This is the config:
> >>> >>
> >>> >> <luceneMatchVersion>6.5.0</luceneMatchVersion>
> >>> >> <dataDir>${solr.data.dir:}</dataDir>
> >>> >> <directoryFactory name="DirectoryFactory"
> >>> >>     class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
> >>> >> <codecFactory class="solr.SchemaCodecFactory"/>
> >>> >> <schemaFactory class="ClassicIndexSchemaFactory"/>
> >>> >>
> >>> >> <indexConfig>
> >>> >>   <lockType>${solr.lock.type:native}</lockType>
> >>> >>   <infoStream>false</infoStream>
> >>> >> </indexConfig>
> >>> >>
> >>> >> <jmx />
> >>> >>
> >>> >> <updateHandler class="solr.DirectUpdateHandler2">
> >>> >>   <updateLog>
> >>> >>     <str name="dir">${solr.ulog.dir:}</str>
> >>> >>   </updateLog>
> >>> >> </updateHandler>
> >>> >
> >>> > Side issue: this config is missing autoCommit. You really should
> >>> > have autoCommit with openSearcher set to false and a maxTime in the
> >>> > neighborhood of 60000. It goes inside the updateHandler section.
> >>> > This won't change the maxDoc issue, but because of the other
> >>> > problems it can prevent, it is strongly recommended. It can be
> >>> > omitted if you are confident that your indexing code is correctly
> >>> > managing hard commits.
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>> >
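For completeness, the autoCommit stanza Shawn recommends would look roughly like this inside the existing updateHandler section (a sketch using his suggested values, not a tested drop-in config):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- Hard commit at most every 60 seconds; keeps the transaction
           log bounded without exposing a new searcher. -->
      <maxTime>60000</maxTime>
      <!-- Leave visibility to the indexer's own explicit commits. -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>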