Thanks, but I am not going to be brave this time :) I tried reclaimDeletesWeight on an ordinary index some time ago and it was very aggressive at even slightly higher values than the default. I think setting this weight in this situation would be analogous to a forceMerge every time, which makes sense.

Thanks,
Markus
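For reference, a rough sketch of what Erick's experiment below could look like in solrconfig.xml on Solr 6.x. This is untested: the parameter names are handed to TieredMergePolicy's setters via the reflection Erick describes, the values are purely illustrative, and the mechanism is undocumented, so use at your own risk.

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- Illustrative value; the assumed Lucene 6.x default is 2.0. Higher
           values bias merge selection toward segments with many deleted
           docs, which, as noted above, can get aggressive quickly. -->
      <double name="reclaimDeletesWeight">3.0</double>
      <!-- Optional: lower segmentsPerTier (default 10) so merges are
           triggered with fewer segments per tier. -->
      <int name="segmentsPerTier">2</int>
    </mergePolicyFactory>
  </indexConfig>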
-----Original message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Thursday 13th April 2017 17:07
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: maxDoc ten times greater than numDoc
>
> If you want to be brave....
>
> Through a clever bit of reflection, the parameters that
> TieredMergePolicy uses to decide which segments to reclaim are settable
> in solrconfig.xml (undocumented, so use at your own risk). You could
> try bumping
>
> reclaimDeletesWeight
>
> in your TieredMergePolicy configuration if you wanted to experiment.
>
> There's no good reason not to set your segments per tier; it won't hurt.
>
> But as you say, you have a solution, so this is just for curiosity's sake.
>
> Best,
> Erick
>
> On Thu, Apr 13, 2017 at 4:42 AM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
> > Maybe not every entry got deleted, and a leftover one was holding up
> > the segment, e.g. an abandoned child or parent record. If, for example,
> > the parent record has a date field and the child does not, then
> > deleting with a date-based query may trigger this. I think there was a
> > bug about abandoned children or something.
> >
> > This is pure speculation, of course.
> >
> > Regards,
> >    Alex.
> > ----
> > http://www.solr-start.com/ - Resources for Solr users, new and experienced
> >
> >
> > On 13 April 2017 at 12:54, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> I forced a merge yesterday and went back to one segment.
> >>
> >> One indexer program reindexes (most or all) entries every 20 minutes
> >> or so. There is nothing custom at that particular point. There is no
> >> autoCommit; the indexer program is responsible for a hard commit, and
> >> it is the single source of reindexed data.
> >>
> >> After one cycle we had two segments, 50% deleted, as expected. This
> >> was stable for many hours and many cycles. For some reason, I now have
> >> 2/3 deletes and three segments, and this situation is also stable. So
> >> the merges do happen, but sometimes they don't, and when they don't,
> >> the size increases (now three segments, 55 MB). But it appears that
> >> the number of segments never decreases, and that is what bothers me.
> >>
> >> I was about to set segmentsPerTier to two, but then I realized I can
> >> also delete everything prior to indexing, as opposed to deleting only
> >> items older than the set I am about to reindex. This strategy works
> >> fine with other reindexing programs; they don't suffer this problem.
> >>
> >> So it is not solved, but it is not a problem anymore. Thanks all anyway :)
> >> Markus
> >>
> >> -----Original message-----
> >>> From: Erick Erickson <erickerick...@gmail.com>
> >>> Sent: Wednesday 12th April 2017 17:51
> >>> To: solr-user <solr-user@lucene.apache.org>
> >>> Subject: Re: maxDoc ten times greater than numDoc
> >>>
> >>> Yes, this is very strange. My bet: you have something
> >>> custom (a setting, indexing code, whatever) that
> >>> is getting in the way.
> >>>
> >>> Second possibility (really stretching here): your
> >>> merge settings require 10 segments to exist
> >>> before merging, and somehow not all the docs in the
> >>> segments are replaced. So until you get to the 10th
> >>> re-index (and assuming a single segment is
> >>> produced per re-index), the older segments aren't
> >>> merged. If that were the case, I'd expect to see the
> >>> number of deleted docs drop back periodically and
> >>> then build up again. A real shot in the dark.
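As an aside, the forced merge Markus mentions above (and the once-a-day optimize Shawn suggests further down) can be issued through the XML update format; a minimal sketch, assuming the default /update handler:

  <!-- POST to /solr/<collection>/update with Content-Type: text/xml.
       Merges the index down to one segment, expunging deleted docs. -->
  <optimize maxSegments="1"/>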
> >>> One way to test this would be to specify a "segmentsPerTier" of,
> >>> say, 2 rather than the default 10; see:
> >>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
> >>> If this were the case, I'd expect that with a setting of 2 your
> >>> index might have 50% deleted docs; that would at least tell us
> >>> whether we're on the right track.
> >>>
> >>> Take a look at your index on disk. If you're seeing gaps in the
> >>> segment numbering, you are getting merges; it may just be that
> >>> they're not happening very often.
> >>>
> >>> And I take it you have no custom code here and you are
> >>> doing commits? (Hard commits are all that matter
> >>> for merging; it doesn't matter whether openSearcher
> >>> is set to true or false.)
> >>>
> >>> I just tried the "techproducts" example as follows:
> >>> 1> indexed all the sample files with the bin/solr -e techproducts example
> >>> 2> started re-indexing the sample docs one at a time with post.jar
> >>>
> >>> It took a while, but eventually the original segments got merged
> >>> away, so I doubt it's any weirdness with a small index.
> >>>
> >>> Speaking of small indexes, why are you sharding with only
> >>> 8K docs? Sharding will probably slow things down for such
> >>> a small index. This isn't germane to your question, though.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>
> >>> On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>> > On 4/12/2017 5:11 AM, Markus Jelsma wrote:
> >>> >> One of our 2-shard collections is rather small and gets all its
> >>> >> entries reindexed every 20 minutes or so. Now I just noticed
> >>> >> maxDoc is ten times greater than numDoc; the merger is never
> >>> >> scheduled, but settings are default. We just overwrite the
> >>> >> existing entries, all of them.
> >>> >>
> >>> >> Here are the stats:
> >>> >>
> >>> >> Last Modified: 12 minutes ago
> >>> >> Num Docs: 8336
> >>> >> Max Doc: 82362
> >>> >> Heap Memory Usage: -1
> >>> >> Deleted Docs: 74026
> >>> >> Version: 3125
> >>> >> Segment Count: 10
> >>> >
> >>> > This discrepancy would typically mean that when you reindex, you're
> >>> > indexing MOST of the documents, but not ALL of them, so at least one
> >>> > document is still not deleted in each older segment. When segments
> >>> > have all their documents deleted, they are automatically removed by
> >>> > Lucene, but if there's even one document NOT deleted, the segment
> >>> > will remain until it is merged.
> >>> >
> >>> > There's no information here about how large this core is, but
> >>> > unless the documents are REALLY enormous, I'm betting that an
> >>> > optimize would happen quickly. With a document count this low and
> >>> > an indexing pattern that results in such a large maxDoc, this might
> >>> > be a good time to go against general advice and perform an optimize
> >>> > at least once a day.
> >>> >
> >>> > An alternate idea that would not require optimizes: if the intent
> >>> > is to completely rebuild the index, you might want to consider
> >>> > issuing a "delete all docs by query" before beginning the indexing
> >>> > process. This would ensure that none of the previous documents
> >>> > remain. As long as you don't do a commit that opens a new searcher
> >>> > before the indexing is complete, clients won't ever know that
> >>> > everything was deleted.
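A minimal sketch of that delete-everything approach, using the XML update format: post this to the collection's /update handler before reindexing, and do not pass commit=true, so no searcher-opening commit happens until the indexer's usual hard commit at the end.

  <!-- POST to /solr/<collection>/update with Content-Type: text/xml.
       Without a commit that opens a new searcher, the deletes stay
       invisible to clients until reindexing completes. -->
  <delete>
    <query>*:*</query>
  </delete>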
> >>> >
> >>> >> This is the config:
> >>> >>
> >>> >> <luceneMatchVersion>6.5.0</luceneMatchVersion>
> >>> >> <dataDir>${solr.data.dir:}</dataDir>
> >>> >> <directoryFactory name="DirectoryFactory"
> >>> >>     class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
> >>> >> <codecFactory class="solr.SchemaCodecFactory"/>
> >>> >> <schemaFactory class="ClassicIndexSchemaFactory"/>
> >>> >>
> >>> >> <indexConfig>
> >>> >>   <lockType>${solr.lock.type:native}</lockType>
> >>> >>   <infoStream>false</infoStream>
> >>> >> </indexConfig>
> >>> >>
> >>> >> <jmx />
> >>> >>
> >>> >> <updateHandler class="solr.DirectUpdateHandler2">
> >>> >>   <updateLog>
> >>> >>     <str name="dir">${solr.ulog.dir:}</str>
> >>> >>   </updateLog>
> >>> >> </updateHandler>
> >>> >
> >>> > Side issue: this config is missing autoCommit. You really should
> >>> > have autoCommit with openSearcher set to false and a maxTime in the
> >>> > neighborhood of 60000. It goes inside the updateHandler section.
> >>> > This won't change the maxDoc issue, but because of the other
> >>> > problems it can prevent, it is strongly recommended. It can be
> >>> > omitted if you are confident that your indexing code is correctly
> >>> > managing hard commits.
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>> >
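For completeness, the autoCommit stanza Shawn recommends would look roughly like this inside the existing updateHandler section (a sketch using his suggested values, not a tested drop-in config):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- Hard commit at most every 60 seconds; keeps the transaction
           log bounded without exposing a new searcher. -->
      <maxTime>60000</maxTime>
      <!-- Leave visibility to the indexer's own explicit commits. -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>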