Wei:

Right. You should count on the _entire_ index being replicated from the leader, 
but only after the optimize is done. Pre-7.5, this would be a single segment; 
7.5+ it would be a bunch of 5G files unless you specified that the optimize 
create some number of segments.

But unless you
1> have an unreasonable number of deleted docs in your index
or
2> can demonstrate improved speed after optimize (and are willing to do it 
regularly)

I wouldn’t bother.
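
If you want to check where you stand before deciding, the Luke request handler 
reports live and deleted doc counts per core. A minimal sketch (host, port, and 
collection name here are placeholders, not from this thread):

  http://localhost:8983/solr/<collection>/admin/luke?numTerms=0&wt=json

The "index" section of the response includes numDocs, maxDoc, and deletedDocs; 
deletedDocs/maxDoc is the percentage to compare against whatever you decide 
“reasonable” means.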

Aroop:

Well, optimizing is really never recommended if you can help it ;). By “help 
it” here I mean the number of deleted documents is a “reasonable” percentage of 
your index, where _you_ define what “reasonable” means. Another bit that came 
along with Solr 7.5 is that the percentage of deleted documents should be 
smaller than it was pre-7.5 in some cases.

It was relatively easy, for instance, to have indexes approaching 50% deleted 
documents pre 7.5. Things had to happen “just right” for that case, but it was 
possible.

When bulk indexing, for instance, if what you’re doing is replacing all the 
docs, you should have a minuscule number of deleted docs and I wouldn’t bother.

As always, if you can demonstrate that an optimized index returns searches 
enough faster to matter in your particular situation, then the cost may be 
worth it. And it makes the most sense in situations where you can optimize 
regularly.

Best,
Erick

> On Mar 12, 2019, at 10:51 PM, Aroop Ganguly <aroop_gang...@apple.com.INVALID> 
> wrote:
> 
> Hi Erick
> 
> A related question: 
> 
> Is optimize then ill-advised for a bulk indexer post Solr 7.5? 
> Especially in a situation where an index is being modified over many days?
> 
> Thanks
> Aroop
> 
>> On Mar 12, 2019, at 9:30 PM, Wei <weiwan...@gmail.com> wrote:
>> 
>> Thanks Erick, it's very helpful.  So for bulk indexing in a Tlog or
>> Tlog/Pull cloud,  when we optimize at the end of updates, segments on the
>> leader replica will change rapidly and the follower replicas will be
>> continuously pulling from the leader, effectively downloading the whole
>> index.  Is there a more efficient way?
>> 
>> On Mon, Mar 11, 2019 at 9:59 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>>> do _not_ turn off hard commits, even when bulk indexing. Set
>>> openSearcher to false in your config. This is for two reasons:
>>> 1> the only time the transaction log is rolled over is when a hard commit
>>> happens. If you turn off commits it’ll grow to a very large size.
>>> 2> If, for any reason, the node restarts, it’ll replay the transaction log
>>> from the last hard commit point, potentially taking hours if you haven’t
>>> committed.
>>> 
>>> And you should probably open  a new searcher occasionally, even while bulk
>>> indexing. For Real Time Get there are some internal structures that grow in
>>> proportion to the docs indexed since the last searcher was opened.
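>>> 
>>> For reference, the kind of autoCommit block meant here looks like the 
>>> following (the maxTime value is illustrative, not a recommendation from 
>>> this thread):
>>> 
>>>   <autoCommit>
>>>     <maxTime>60000</maxTime>
>>>     <openSearcher>false</openSearcher>
>>>   </autoCommit>
>>> 
>>> That rolls the transaction log on each hard commit without paying the 
>>> cost of opening a new searcher every time.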
>>> 
>>> And for your other questions:
>>> <1> I believe so, try it and look at your solr log.
>>> 
>>> <2> Yes. Have you looked at Mike’s video (the third one down) here:
>>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html?
>>> TieredMergePolicy is the third video. The merge policy combines like-sized
>>> segments. It’s wasteful to rewrite, say, a 19G segment just to add a 1G so
>>> having multiple segments < 20G is perfectly normal.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Mar 10, 2019, at 10:36 PM, Wei <weiwan...@gmail.com> wrote:
>>>> 
>>>> A side question, for heavy bulk indexing, what's the recommended setting
>>>> for auto commit? As there is no query needed during the bulk indexing
>>>> process, I have auto soft commit disabled. Is there any side effect if I
>>>> also disable auto commit?
>>>> 
>>>> On Sun, Mar 10, 2019 at 10:22 PM Wei <weiwan...@gmail.com> wrote:
>>>> 
>>>>> Thanks Erick.
>>>>> 
>>>>> 1> TLOG replicas shouldn’t optimize on the follower. They should
>>> optimize
>>>>> on the leader then replicate the entire index to the follower.
>>>>> 
>>>>> Does that mean the follower will ignore the optimize request? Or shall I
>>>>> send the optimize request only to one of the leaders?
>>>>> 
>>>>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>>>>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>>>>> numSegments on the optimize command.
>>>>> 
>>>>> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
>>>>> solrconfig.xml I used these settings:
>>>>> 
>>>>> <mergePolicyFactory
>>> class="org.apache.solr.index.TieredMergePolicyFactory">
>>>>>     <int name="maxMergeAtOnceExplicit">100</int>
>>>>>     <int name="maxMergeAtOnce">10</int>
>>>>>     <int name="segmentsPerTier">10</int>
>>>>>     <double name="maxMergedSegmentMB">20480</double>
>>>>> </mergePolicyFactory>
>>>>> 
>>>>> But in the end I see multiple segments much smaller than the 20GB limit.
>>>>> In 7.6, is it required to explicitly set the number of segments to 1? E.g.,
>>>>> shall I use
>>>>> 
>>>>> /update?optimize=true&waitSearcher=false&maxSegments=1
>>>>> 
>>>>> Best,
>>>>> Wei
>>>>> 
>>>>> 
>>>>> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> This is very odd for at least two reasons:
>>>>>> 
>>>>>> 1> TLOG replicas shouldn’t optimize on the follower. They should
>>> optimize
>>>>>> on the leader then replicate the entire index to the follower.
>>>>>> 
>>>>>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>>>>>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>>>>>> numSegments on the optimize command.
>>>>>> 
>>>>>> So if you can reliably reproduce this, it’s probably worth a JIRA…
>>>>>> 
>>>>>>> On Mar 8, 2019, at 11:21 AM, Wei <weiwan...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Recently I encountered a strange issue with optimize in Solr 7.6. The
>>>>>> cloud
>>>>>>> is created with 4 shards with 2 Tlog replicas per shard. After batch
>>>>>> index
>>>>>>> update I issue an optimize command to a randomly picked replica in the
>>>>>>> cloud.  After a while when I check,  all the non-leader Tlog replicas
>>>>>>> finished optimization to a single segment, however all the leader
>>>>>> replicas
>>>>>>> still have multiple segments.  Previously, in the all-NRT replica
>>>>>> cloud, I
>>>>>>> see optimization is triggered on all nodes.  Is the optimization
>>> process
>>>>>>> different with Tlog/Pull replicas?
>>>>>>> 
>>>>>>> Best,
>>>>>>> Wei
>>>>>> 
>>>>>> 
>>> 
>>> 
> 
