Thanks Erick! Great details as always :)
> On Mar 13, 2019, at 8:48 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Wei:
>
> Right. You should count on the _entire_ index being replicated from the
> leader, but only after the optimize is done. Pre 7.5, this would be a single
> segment, 7.5+ it would be a bunch of 5G files unless you specified that the
> optimize create some number of segments.
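>
> For example, something along these lines (the collection name here is just
> a placeholder) asks the optimize for a specific segment count:
>
> /solr/your_collection/update?optimize=true&maxSegments=2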
>
> But unless you
> 1> have an unreasonable number of deleted docs in your index
> or
> 2> can demonstrate improved speed after optimize (and are willing to do it
> regularly)
>
> I wouldn’t bother.
>
> Aroop:
>
> Well, optimizing is really never recommended if you can help it ;). By “help
> it” here I mean the number of deleted documents is a “reasonable” percentage
> of your index, where _you_ define what “reasonable” means. Another change
> that came along with Solr 7.5 is that the percentage of deleted documents
> should be smaller than it was pre-7.5 in some cases.
>
> It was relatively easy, for instance, to have indexes approaching 50% deleted
> documents pre 7.5. Things had to happen “just right” for that case, but it
> was possible.
>
> When bulk indexing, for instance, if what you’re doing is replacing all the
> docs, you should have a minuscule number of deleted docs and I wouldn’t bother.
>
> As always, if you can demonstrate that an optimized index returns searches
> enough faster to matter in your particular situation, then the cost may be
> worth it. And it makes the most sense in situations where you can optimize
> regularly.
>
> Best,
> Erick
>
>> On Mar 12, 2019, at 10:51 PM, Aroop Ganguly
>> <aroop_gang...@apple.com.INVALID> wrote:
>>
>> Hi Erick
>>
>> A related question:
>>
>> Is optimize then ill-advised for bulk indexing post Solr 7.5?
>> Especially in a situation where an index is being modified over many days?
>>
>> Thanks
>> Aroop
>>
>>> On Mar 12, 2019, at 9:30 PM, Wei <weiwan...@gmail.com> wrote:
>>>
>>> Thanks Erick, it's very helpful. So for bulk indexing in a Tlog or
>>> Tlog/Pull cloud, when we optimize at the end of updates, segments on the
>>> leader replica will change rapidly and the follower replicas will be
>>> continuously pulling from the leader, effectively downloading the whole
>>> index. Is there a more efficient way?
>>>
>>> On Mon, Mar 11, 2019 at 9:59 AM Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>
>>>> Do _not_ turn off hard commits, even when bulk indexing. Set
>>>> openSearcher to false in your config (see the sketch below the list).
>>>> This is for two reasons:
>>>> 1> the only time the transaction log is rolled over is when a hard commit
>>>> happens. If you turn off commits it’ll grow to a very large size.
>>>> 2> If, for any reason, the node restarts, it’ll replay the transaction log
>>>> from the last hard commit point, potentially taking hours if you haven’t
>>>> committed.
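>>>>
>>>> As a rough sketch, that part of solrconfig.xml would look something like
>>>> this (the 60-second interval is just an illustrative value, tune to taste):
>>>>
>>>> <autoCommit>
>>>>   <maxTime>60000</maxTime>
>>>>   <openSearcher>false</openSearcher>
>>>> </autoCommit>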
>>>>
>>>> And you should probably open a new searcher occasionally, even while bulk
>>>> indexing. For Real Time Get there are some internal structures that grow in
>>>> proportion to the docs indexed since the last searcher was opened.
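>>>>
>>>> If you rely on autoSoftCommit for that, a long interval is enough; again
>>>> just a sketch, with the 10-minute value purely illustrative:
>>>>
>>>> <autoSoftCommit>
>>>>   <maxTime>600000</maxTime>
>>>> </autoSoftCommit>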
>>>>
>>>> And for your other questions:
>>>> <1> I believe so; try it and look at your Solr log.
>>>>
>>>> <2> Yes. Have you looked at Mike’s TieredMergePolicy video (the third one
>>>> down) here?
>>>> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>>>> The merge policy combines like-sized segments. It’s wasteful to rewrite,
>>>> say, a 19G segment just to add a 1G one, so having multiple segments < 20G
>>>> is perfectly normal.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Mar 10, 2019, at 10:36 PM, Wei <weiwan...@gmail.com> wrote:
>>>>>
>>>>> A side question, for heavy bulk indexing, what's the recommended setting
>>>>> for auto commit? As there is no query needed during the bulk indexing
>>>>> process, I have auto soft commit disabled. Is there any side effect if I
>>>>> also disable auto commit?
>>>>>
>>>>> On Sun, Mar 10, 2019 at 10:22 PM Wei <weiwan...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Erick.
>>>>>>
>>>>>> 1> TLOG replicas shouldn’t optimize on the follower. They should
>>>>>> optimize on the leader then replicate the entire index to the follower.
>>>>>>
>>>>>> Does that mean the follower will ignore the optimize request? Or shall I
>>>>>> send the optimize request only to one of the leaders?
>>>>>>
>>>>>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>>>>>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>>>>>> numSegments on the optimize command.
>>>>>>
>>>>>> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
>>>>>> solrconfig.xml I used these settings:
>>>>>>
>>>>>> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>>>>>   <int name="maxMergeAtOnceExplicit">100</int>
>>>>>>   <int name="maxMergeAtOnce">10</int>
>>>>>>   <int name="segmentsPerTier">10</int>
>>>>>>   <double name="maxMergedSegmentMB">20480</double>
>>>>>> </mergePolicyFactory>
>>>>>>
>>>>>> But in the end I see multiple segments much smaller than the 20GB limit.
>>>>>> In 7.6 is it required to explicitly set the number of segments to 1? E.g.,
>>>>>> shall I use
>>>>>>
>>>>>> /update?optimize=true&waitSearcher=false&maxSegments=1
>>>>>>
>>>>>> Best,
>>>>>> Wei
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson <erickerick...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is very odd for at least two reasons:
>>>>>>>
>>>>>>> 1> TLOG replicas shouldn’t optimize on the follower. They should
>>>>>>> optimize on the leader then replicate the entire index to the follower.
>>>>>>>
>>>>>>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>>>>>>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>>>>>>> numSegments on the optimize command.
>>>>>>>
>>>>>>> So if you can reliably reproduce this, it’s probably worth a JIRA…
>>>>>>>
>>>>>>>> On Mar 8, 2019, at 11:21 AM, Wei <weiwan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Recently I encountered a strange issue with optimize in Solr 7.6. The
>>>>>>>> cloud is created with 4 shards and 2 Tlog replicas per shard. After a
>>>>>>>> batch index update I issue an optimize command to a randomly picked
>>>>>>>> replica in the cloud. After a while when I check, all the non-leader
>>>>>>>> Tlog replicas have finished optimization to a single segment, however
>>>>>>>> all the leader replicas still have multiple segments. Previously, in
>>>>>>>> the all-NRT replica cloud, I saw optimization triggered on all nodes.
>>>>>>>> Is the optimization process different with Tlog/Pull replicas?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Wei
>>>>>>>
>>>>>>>
>>>>
>>>>
>>
>