Yeah, trying to have something that satisfies all use cases is a bear. I know of one installation where the indexing rate was so huge that they couldn't afford to have any merging (80B docs/day) so in that situation any heuristics built into Solr would be wrong.
Here's an alternate approach to having buttons where you have to attend to it each day: http://localhost:8983/solr/admin/cores?action=STATUS returns each core and the number of docs, maxdocs, and deleted docs. One could set up a cron job that runs every night at 3:00 am that then sends the optimize command to any core with greater than X% deleted docs, where X is your locally-determined threshold. That would be less work actually than having to attend to it every day.

FWIW

On Sat, Apr 21, 2018 at 10:55 AM, Joe Doupnik <j...@netlab1.net> wrote:
> A good find Erick, and one which brings into focus the real problem at hand. That overload case would happen if there were an Optimise button or if the curl equivalent command were issued, and is not a reason to avoid either/both.
> So, what could be done to avoid such awkward difficulties?
> Well, an obvious suggestion, without knowing the details, is might the system be able to estimate internal conditions sufficiently to issue a warning and decline an Optimise. Certainly average system managers are not about to decode and monitor Java VM nuances.
> Discussion about automating removals based on sizes of this and that seem, from this distance, to be musings yet to face the real world. In the meanwhile we need to control matters, hence the button request.
> The resource consumption issue is inherent in such systems, and we in the field have very little information to help make choices. I know, it's not a simple affair, and too many buzz words fly about. Thus the engineers close to the code might have a ponder about the above predictive capability and about the overall resource consumption process which might permit the system to adapt to progressively larger loads over time.
> In my own situation I feed material into Solr a file at a time, give a small pause, repeat, get to 100 entries and wait a bit longer, and so on every file, hundred files, thousand files.
> This works well to reduce resource peaks and uncompleted operations, and it lets the system run in the background all day if necessary without disturbing main activities. My longest run was over a full day, 660+K documents which worked just fine and did not upset other activities in the machine.
> Thanks,
> Joe D.
>
> On 21/04/2018 17:54, Erick Erickson wrote:
>>
>> Joe:
>>
>> Serendipity strikes, The thread titled "JVM Heap Memory Increase (SOLR CLOUD)" is a perfect example of why the optimize button is so "fraught".
>>
>> Best,
>> Erick
>>
>> On Sat, Apr 21, 2018 at 9:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> Joe:
>>>
>>> Thanks for moving the conversation over here that we were having on the blog post. I think the wider audience will benefit from this going forward.
>>>
>>> bq: ...apparent inability to remove piles of deleted docs
>>>
>>> do note that deleted docs are removed during normal indexing when segments are merged, they're not permanently retained in the index. Part of the thinking behind SOLR-7733 is exactly that once you press the very tempting optimize button, you can get into a situation where your one huge segment does _not_ have the deleted docs removed until the "live" document space is < 2.5G. Thus if you have a 100G segment after optimize, it'll look like deleted docs are never removed until at least 97.5% of the docs are deleted. The default max segment size is 5G, and the current algorithm doesn't consider segments eligible for merging until 50% of that maximum number consists of "live" docs.
>>>
>>> The optimize functionality in the admin UI was removed as part of SOLR-7733 from the screen that comes up when you select a core, but the "core admin" screen still has the optimize button that comes and goes depending on whether there are any deleted documents or not. This page is only visible in standalone mode.
>>>
>>> Unfortunately SOLR-7733 removed the functionality that actually sent the optimize command from the javascript, so pressing the optimize button does nothing. This is indeed a bug, see: SOLR-12253 which will remove the button from the core admin screen in stand-alone mode.
>>>
>>> Optimize (aka forceMerge) is pretty actively discouraged because it is:
>>> 1> very expensive
>>> 2> has significant "gotchas" (we chatted in comments in the blog post about the gotchas).
>>>
>>> So we made a decision to make it more of an 'expert' option, requiring users to issue a curl/Browser URL command like "....solr/core_or_collection/update?optimize=true" if this functionality is really desirable in their situation. Docs will be updated too, they're lagging a bit.
>>>
>>> Coming probably in Solr 7.4 is a new parameter (tentatively) for TieredMergePolicy (TMP) that puts a soft ceiling on the percentage of deleted docs in an index. The current version of this patch (LUCENE-7976) sets this threshold at 20% at the expense of about 10% more I/O in my tests from the current TMP implementation. Under discussion is how low to allow this to be, we're thinking 10% as a floor, and what the default should be. The current TMP caps the percentage deleted docs at close to 50%.
>>>
>>> The thinking behind not allowing the percent deleted documents to be too low is that that would trigger its own massive I/O issues, rewriting "live" documents over and over and over. For NRT indexes, that's almost certainly a horrible tradeoff. For more static indexes, the "expert" API command is still available.
>>>
>>> Best,
>>> Erick
>>>
>>> On Sat, Apr 21, 2018 at 5:08 AM, Joe Doupnik <j...@netlab1.net> wrote:
>>>>
>>>> In Solr v7.3.0 the ability to remove "deleted" docs from a core by use of what until then was the Optimise button on the admin GUI has been changed in an ungood way.
>>>> That is, in the V7.3.0 Changes list, item SOLR-7733 (quote remove "optimize" from the UI, end quote). The result of that is an apparent inability to remove piles of deleted docs, which amongst other things means wasting disk space. That is a marked step backward and is unhelpful for use of Solr in the field. As other comments in the now closed 7733 ticket explain, this is a user item which has impact on their site, and it ought to be an inherent feature of Solr. Consider a file system where complete deletes are forbidden, or your kitchen where taking out the rubbish is denied. Hand waving about obscure auto-sizing notions will not suffice. Thus may I urge that the Optimise button and operation be returned to use, as it was until Solr v7.3.0.
>>>> Thanks,
>>>> Joe D.
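
For the nightly cron idea at the top of this thread, here is a rough sketch in Python of what such a job might look like. It is only an illustration under some assumptions: that the STATUS response is JSON with a "status" map whose per-core "index" section carries "maxDoc" and "deletedDocs", that the base URL is the localhost one quoted above, and that 20% is a placeholder for X. The function names, the THRESHOLD_PCT constant and the --run flag are all made up for this example. The optimize call uses the update?optimize=true URL form mentioned in the quoted message.

```python
import json
import sys
import urllib.parse
import urllib.request

# Assumptions for illustration only: adjust to your installation.
SOLR_URL = "http://localhost:8983/solr"
THRESHOLD_PCT = 20.0  # hypothetical value for "X"; pick your own threshold


def deleted_pct(index_info):
    """Percentage of deleted docs in a core, from its STATUS "index" section."""
    max_doc = index_info.get("maxDoc", 0)
    deleted = index_info.get("deletedDocs", 0)
    return 100.0 * deleted / max_doc if max_doc else 0.0


def cores_to_optimize(status_response, threshold_pct):
    """Names of cores whose deleted-doc percentage exceeds the threshold."""
    names = []
    for name, info in status_response.get("status", {}).items():
        if deleted_pct(info.get("index", {})) > threshold_pct:
            names.append(name)
    return names


def fetch_status(base_url):
    """Ask the core admin handler for the status of every core."""
    url = base_url + "/admin/cores?action=STATUS&wt=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def optimize(base_url, core):
    """Send the optimize command to one core (expensive, may run a long time)."""
    url = "%s/%s/update?optimize=true" % (base_url, urllib.parse.quote(core))
    with urllib.request.urlopen(url) as resp:
        resp.read()


def main():
    status = fetch_status(SOLR_URL)
    for core in cores_to_optimize(status, THRESHOLD_PCT):
        print("optimizing", core)
        optimize(SOLR_URL, core)


if __name__ == "__main__":
    # Require an explicit flag so the script never hits Solr by accident.
    if "--run" in sys.argv[1:]:
        main()
    else:
        print("pass --run to contact Solr and optimize cores over the threshold")
```

A crontab entry along the lines of "0 3 * * * python3 /path/to/solr_optimize.py --run" (the path is a placeholder) would run it at 3:00 am. Keep in mind the caveats from the quoted discussion: optimize is expensive, and one very large segment can hold on to deleted docs for a long time afterwards, so the threshold is worth choosing conservatively.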