Maybe not every entry got deleted, and a leftover entry was holding up the segment — e.g. an abandoned child or parent record. If, for example, the parent record has a date field and the child does not, then deleting with a date-based query may trigger this. I think there was a bug about abandoned child documents or something along those lines.
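A sketch of that scenario as Solr XML update messages (the field names here are hypothetical, purely for illustration):

```xml
<!-- hypothetical block-join documents: the parent carries a date
     field, the nested child does not -->
<add>
  <doc>
    <field name="id">parent-1</field>
    <field name="last_indexed">2017-04-13T00:00:00Z</field>
    <doc>
      <field name="id">child-1</field>
    </doc>
  </doc>
</add>

<!-- a date-based cleanup query matches only the parent; the child
     survives as an abandoned document and keeps its segment alive -->
<delete>
  <query>last_indexed:[* TO NOW-1DAY]</query>
</delete>
```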
This is pure speculation of course.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced

On 13 April 2017 at 12:54, Markus Jelsma <markus.jel...@openindex.io> wrote:
> I have forced a merge yesterday and went back to one segment.
>
> One indexer program reindexes (most or all) every 20 minutes or so. There is
> nothing custom at that particular point. There is no autoCommit; the indexer
> program is responsible for a hard commit, it is the single source of
> reindexed data.
>
> After one cycle we had two segments, 50 % deleted, as expected. This was
> stable for many hours and many cycles. For some reason, I now have 2/3
> deletes and three segments, and now this situation is stable. So the merges
> do happen, but sometimes they don't. When they don't, the size increases
> (now three segments, 55 MB). But it appears that the number of segments
> never decreases, and that is what bothers me.
>
> I was about to set segmentsPerTier to two, but then I realized I can also
> delete everything prior to indexing, as opposed to deleting only items older
> than the set I am already about to reindex. This strategy works fine with
> other reindexing programs; they don't suffer this problem.
>
> So, it is not solved, but not a problem anymore. Thanks all anyway :)
> Markus
>
> -----Original message-----
>> From: Erick Erickson <erickerick...@gmail.com>
>> Sent: Wednesday 12th April 2017 17:51
>> To: solr-user <solr-user@lucene.apache.org>
>> Subject: Re: maxDoc ten times greater than numDoc
>>
>> Yes, this is very strange. My bet: you have something
>> custom (a setting, indexing code, whatever) that
>> is getting in the way.
>>
>> Second possibility (really stretching here): your
>> merge settings require 10 segments to exist
>> before merging, and somehow not all the docs in the
>> segments are replaced.
>> So until you get to the 10th
>> re-index (and assuming a single segment is
>> produced per re-index), the older segments aren't
>> merged. If that were the case, I'd expect to see the
>> number of deleted docs drop back periodically and
>> then build up again. A real shot in the dark. One way
>> to test this would be to specify a "segmentsPerTier" of, say,
>> 2 rather than the default 10; see:
>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
>> If this were the case, I'd expect that with a setting of 2
>> your index might have 50% deleted docs; that would at
>> least tell us whether we're on the right track.
>>
>> Take a look at your index on disk. If you're seeing gaps
>> in the numbering, you are getting merging; it may just
>> not be happening very often.
>>
>> And I take it you have no custom code here and you are
>> doing commits? (Hard commits are all that matters
>> for merging; it doesn't matter whether openSearcher
>> is set to true or false.)
>>
>> I just tried the "techproducts" example as follows:
>> 1> indexed all the sample files with the bin/solr -e techproducts example
>> 2> started re-indexing the sample docs one at a time with post.jar
>>
>> It took a while, but eventually the original segments got merged away, so
>> I doubt it's any weirdness with a small index.
>>
>> Speaking of small indexes, why are you sharding with only
>> 8K docs? Sharding will probably slow things down for such
>> a small index. This isn't germane to your question, though.
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Apr 12, 2017 at 5:56 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > On 4/12/2017 5:11 AM, Markus Jelsma wrote:
>> >> One of our 2-shard collections is rather small and gets all its entries
>> >> reindexed every 20 minutes or so. Now I just noticed maxDoc is ten times
>> >> greater than numDoc; the merger is never scheduled, but settings are
>> >> default. We just overwrite the existing entries, all of them.
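Erick's segmentsPerTier suggestion would go in the indexConfig section of solrconfig.xml; a sketch, assuming the Solr 6.x mergePolicyFactory syntax:

```xml
<indexConfig>
  <lockType>${solr.lock.type:native}</lockType>
  <!-- merge once 2 segments exist per tier, instead of the default 10 -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="segmentsPerTier">2</int>
    <int name="maxMergeAtOnce">2</int>
  </mergePolicyFactory>
</indexConfig>
```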
>> >>
>> >> Here are the stats:
>> >>
>> >> Last Modified: 12 minutes ago
>> >> Num Docs: 8336
>> >> Max Doc: 82362
>> >> Heap Memory Usage: -1
>> >> Deleted Docs: 74026
>> >> Version: 3125
>> >> Segment Count: 10
>> >
>> > This discrepancy would typically mean that when you reindex, you're
>> > indexing MOST of the documents, but not ALL of them, so at least one
>> > document is still not deleted in each older segment. When segments have
>> > all their documents deleted, they are automatically removed by Lucene,
>> > but if there's even one document NOT deleted, the segment will remain
>> > until it is merged.
>> >
>> > There's no information here about how large this core is, but unless the
>> > documents are REALLY enormous, I'm betting that an optimize would happen
>> > quickly. With a document count this low and an indexing pattern that
>> > results in such a large maxDoc, this might be a good time to go against
>> > general advice and perform an optimize at least once a day.
>> >
>> > An alternate idea that would not require optimizes: If the intent is to
>> > completely rebuild the index, you might want to consider issuing a
>> > "delete all docs by query" before beginning the indexing process. This
>> > would ensure that none of the previous documents remain. As long as you
>> > don't do a commit that opens a new searcher before the indexing is
>> > complete, clients won't ever know that everything was deleted.
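Shawn's delete-everything-first approach amounts to posting something like this to the collection's /update handler before the reindex begins (a sketch):

```xml
<!-- delete everything; do NOT issue a commit that opens a new searcher
     until the reindex is complete, so clients never see an empty index -->
<delete>
  <query>*:*</query>
</delete>
```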
>> >
>> >> This is the config:
>> >>
>> >> <luceneMatchVersion>6.5.0</luceneMatchVersion>
>> >> <dataDir>${solr.data.dir:}</dataDir>
>> >> <directoryFactory name="DirectoryFactory"
>> >>                   class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>> >> <codecFactory class="solr.SchemaCodecFactory"/>
>> >> <schemaFactory class="ClassicIndexSchemaFactory"/>
>> >>
>> >> <indexConfig>
>> >>   <lockType>${solr.lock.type:native}</lockType>
>> >>   <infoStream>false</infoStream>
>> >> </indexConfig>
>> >>
>> >> <jmx />
>> >>
>> >> <updateHandler class="solr.DirectUpdateHandler2">
>> >>   <updateLog>
>> >>     <str name="dir">${solr.ulog.dir:}</str>
>> >>   </updateLog>
>> >> </updateHandler>
>> >
>> > Side issue: This config is missing autoCommit. You really should have
>> > autoCommit with openSearcher set to false and a maxTime in the
>> > neighborhood of 60000. It goes inside the updateHandler section. This
>> > won't change the maxDoc issue, but because of the other problems it can
>> > prevent, it is strongly recommended. It can be omitted if you are
>> > confident that your indexing code is correctly managing hard commits.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
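With Shawn's suggested autoCommit added, the updateHandler section would look something like this (a sketch using his suggested 60-second maxTime):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit at most every 60 seconds, without opening a new
         searcher: changes become durable but not yet visible -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
```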