Thanks Erick, that's very helpful. So for bulk indexing in a TLOG or TLOG/PULL cloud, when we optimize at the end of the updates, the segments on the leader replica will change rapidly and the follower replicas will be continuously pulling from the leader, effectively downloading the whole index. Is there a more efficient way?
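
To be concrete, the optimize at the end of the batch is sent as a single request of the form discussed further down in this thread; the host and collection name below are just placeholders:

curl 'http://localhost:8983/solr/<collection>/update?optimize=true&waitSearcher=false&maxSegments=1'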
On Mon, Mar 11, 2019 at 9:59 AM Erick Erickson <erickerick...@gmail.com> wrote:

> Do _not_ turn off hard commits, even when bulk indexing. Set openSearcher
> to false in your config. This is for two reasons:
> 1> The only time the transaction log is rolled over is when a hard commit
> happens. If you turn off commits it'll grow to a very large size.
> 2> If, for any reason, the node restarts, it'll replay the transaction log
> from the last hard commit point, potentially taking hours if you haven't
> committed.
>
> And you should probably open a new searcher occasionally, even while bulk
> indexing. For Real Time Get there are some internal structures that grow in
> proportion to the docs indexed since the last searcher was opened.
>
> And for your other questions:
> <1> I believe so; try it and look at your Solr log.
>
> <2> Yes. Have you looked at Mike's video (the third one down) here:
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html?
> TieredMergePolicy is the third video. The merge policy combines like-sized
> segments. It's wasteful to rewrite, say, a 19G segment just to add a 1G one,
> so having multiple segments < 20G is perfectly normal.
>
> Best,
> Erick
>
> > On Mar 10, 2019, at 10:36 PM, Wei <weiwan...@gmail.com> wrote:
> >
> > A side question: for heavy bulk indexing, what's the recommended setting
> > for auto commit? As there is no query needed during the bulk indexing
> > process, I have auto soft commit disabled. Is there any side effect if I
> > also disable auto commit?
> >
> > On Sun, Mar 10, 2019 at 10:22 PM Wei <weiwan...@gmail.com> wrote:
> >
> >> Thanks Erick.
> >>
> >> 1> TLOG replicas shouldn't optimize on the follower. They should optimize
> >> on the leader then replicate the entire index to the follower.
> >>
> >> Does that mean the follower will ignore the optimize request? Or shall I
> >> send the optimize request only to one of the leaders?
> >>
> >> 2> As of Solr 7.5, optimize should not optimize to a single segment
> >> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> >> numSegments on the optimize command.
> >>
> >> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
> >> solrconfig.xml I used these settings:
> >>
> >> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> >>   <int name="maxMergeAtOnceExplicit">100</int>
> >>   <int name="maxMergeAtOnce">10</int>
> >>   <int name="segmentsPerTier">10</int>
> >>   <double name="maxMergedSegmentMB">20480</double>
> >> </mergePolicyFactory>
> >>
> >> But in the end I see multiple segments much smaller than the 20GB limit.
> >> In 7.6 is it required to explicitly set the number of segments to 1? E.g.,
> >> shall I use
> >>
> >> /update?optimize=true&waitSearcher=false&maxSegments=1
> >>
> >> Best,
> >> Wei
> >>
> >> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >>
> >>> This is very odd for at least two reasons:
> >>>
> >>> 1> TLOG replicas shouldn't optimize on the follower. They should optimize
> >>> on the leader then replicate the entire index to the follower.
> >>>
> >>> 2> As of Solr 7.5, optimize should not optimize to a single segment
> >>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> >>> numSegments on the optimize command.
> >>>
> >>> So if you can reliably reproduce this, it's probably worth a JIRA...
> >>>
> >>>> On Mar 8, 2019, at 11:21 AM, Wei <weiwan...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Recently I encountered a strange issue with optimize in Solr 7.6. The
> >>>> cloud is created with 4 shards with 2 TLOG replicas per shard. After
> >>>> batch index update I issue an optimize command to a randomly picked
> >>>> replica in the cloud. After a while when I check, all the non-leader
> >>>> TLOG replicas have finished optimization to a single segment, however
> >>>> all the leader replicas still have multiple segments. Previously, in the
> >>>> all-NRT-replica cloud, I saw optimization triggered on all nodes. Is the
> >>>> optimization process different with TLOG/PULL replicas?
> >>>>
> >>>> Best,
> >>>> Wei
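
P.S. For reference, the "keep hard commits, openSearcher=false" setup Erick describes above corresponds to roughly this in solrconfig.xml; the 60-second interval is only an illustrative value, and the autoSoftCommit of -1 simply reflects soft commits being disabled during the bulk load:

<autoCommit>
  <maxTime>60000</maxTime>            <!-- illustrative interval, tune as needed -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime>               <!-- soft commits disabled while bulk indexing -->
</autoSoftCommit>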