A good find, Erick, and one which brings the real problem at hand into focus. That overload case would happen whether it were triggered by an Optimise button or by the equivalent curl command, so it is not a reason to avoid either one.
    So, what could be done to avoid such awkward difficulties?
    Well, an obvious suggestion, without knowing the details, is to ask whether the system might estimate internal conditions well enough to issue a warning and decline an Optimise. Certainly the average system manager is not about to decode and monitor Java VM nuances.
    Discussion about automating removals based on the sizes of this and that seems, from this distance, to be musing yet to face the real world. In the meantime we need to control matters, hence the button request.
    The resource consumption issue is inherent in such systems, and we in the field have very little information to help make choices. I know it's not a simple affair, and too many buzzwords fly about. Thus the engineers close to the code might ponder the predictive capability above, and the overall resource consumption process, which might permit the system to adapt to progressively larger loads over time.
    In my own situation I feed material into Solr one file at a time, give a small pause, repeat, pause a bit longer at 100 entries, and so on for every file, hundred files, thousand files. This works well to reduce resource peaks and uncompleted operations, and it lets the system run in the background all day if necessary without disturbing main activities. My longest run lasted over a full day, 660K+ documents, and it worked just fine without upsetting other activities on the machine.
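    In rough outline the feeding loop looks something like the sketch below. The URL, core name, file locations and pause lengths are placeholders rather than my exact setup, and the extracting handler is just one way to post a file:

    #!/bin/sh
    # Pace the feed: one file at a time, short pause after each,
    # longer pause after every hundred. All names are placeholders.
    CORE="http://localhost:8983/solr/mycore"
    COUNT=0
    for f in /data/docs/*; do
        # Post one file via the extracting handler; the file name is
        # used as the id here, assuming names make valid ids.
        curl -s "$CORE/update/extract?literal.id=$(basename "$f")" \
             -F "myfile=@$f" > /dev/null
        COUNT=$((COUNT + 1))
        sleep 1                                # small pause per file
        if [ $((COUNT % 100)) -eq 0 ]; then
            sleep 10                           # longer pause per 100 files
        fi
    done
    curl -s "$CORE/update?commit=true" > /dev/null   # one commit at the end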
    Thanks,
    Joe D.



On 21/04/2018 17:54, Erick Erickson wrote:
Joe:

Serendipity strikes. The thread titled "JVM Heap Memory Increase (SOLR
CLOUD)" is a perfect example of why the optimize button is so
"fraught".

Best,
Erick

On Sat, Apr 21, 2018 at 9:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:
Joe:

Thanks for moving the conversation over here that we were having on
the blog post. I think the wider audience will benefit from this going
forward.

bq: ...apparent inability to remove piles of deleted docs

Do note that deleted docs are removed during normal indexing when
segments are merged; they're not permanently retained in the index.
Part of the thinking behind SOLR-7733 is exactly that once you press
the very tempting optimize button, you can get into a situation where
your one huge segment does _not_ have the deleted docs removed until
the "live" document space is < 2.5G. Thus if you have a 100G segment
after optimize, it'll look like deleted docs are never removed until
at least 97.5% of the docs are deleted. The default max segment size
is 5G, and the current algorithm doesn't consider segments eligible
for merging until 50% of that maximum number consists of "live" docs.
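If you want to watch this behaviour yourself, both the Luke handler
and the segments info handler report deleted-doc counts; host and
core name below are illustrative:

    # Totals for the core: numDocs, maxDoc and deletedDocs.
    curl "http://localhost:8983/solr/mycore/admin/luke?numTerms=0&wt=json"

    # Per-segment view: each segment's doc count and delCount.
    curl "http://localhost:8983/solr/mycore/admin/segments?wt=json"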

The optimize functionality in the admin UI was removed as part of
SOLR-7733 from the screen that comes up when you select a core, but
the "core admin" screen still has the optimize button that comes and
goes depending on whether there are any deleted documents or not. This
page is only visible in standalone mode.

Unfortunately SOLR-7733 removed the functionality that actually sent
the optimize command from the javascript, so pressing the optimize
button does nothing. This is indeed a bug; see SOLR-12253, which will
remove the button from the core admin screen in stand-alone mode.

Optimize (aka forceMerge) is pretty actively discouraged because it:
1> is very expensive
2> has significant "gotchas" (we chatted about the gotchas in comments
on the blog post).

So we made a decision to make it more of an 'expert' option, requiring
users to issue a curl/Browser URL command like
"....solr/core_or_collection/update?optimize=true" if this
functionality is really desirable in their situation. Docs will be
updated too; they're lagging a bit.
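For example, something like this (host, port and core name are
illustrative):

    # Expert option: force-merge the whole index. Expensive; mind the
    # gotchas discussed above before running this on a large index.
    curl "http://localhost:8983/solr/core_or_collection/update?optimize=true"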

Coming probably in Solr 7.4 is a new parameter (tentatively) for
TieredMergePolicy (TMP) that puts a soft ceiling on the percentage of
deleted docs in an index. The current version of this patch
(LUCENE-7976) sets this threshold at 20%, at the expense of about 10%
more I/O in my tests compared to the current TMP implementation. Under
discussion are how low to allow this value to be (we're thinking 10%
as a floor) and what the default should be. The current TMP caps the
percentage of deleted docs at close to 50%.
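If it lands, I'd expect it to be configurable via the merge policy
factory in solrconfig.xml, along these lines; since the parameter is
still tentative, the name below is a guess, illustrative only:

    <!-- Hypothetical sketch: the LUCENE-7976 parameter name is not
         final, so treat "deletesPctAllowed" as a placeholder. -->
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <double name="deletesPctAllowed">20.0</double>
    </mergePolicyFactory>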

The thinking behind not allowing the percentage of deleted documents
to go too low is that it would trigger its own massive I/O issues,
rewriting "live" documents over and over and over. For NRT indexes,
that's almost certainly a horrible tradeoff. For more static indexes,
the "expert" API command is still available.

Best,
Erick

On Sat, Apr 21, 2018 at 5:08 AM, Joe Doupnik <j...@netlab1.net> wrote:
     In Solr v7.3.0 the ability to remove "deleted" docs from a core, by use
of what until then was the Optimise button on the admin GUI, has been changed
in an ungood way. That is, in the v7.3.0 Changes list, item SOLR-7733 (quote:
remove "optimize" from the UI, end quote). The result of that is an apparent
inability to remove piles of deleted docs, which amongst other things means
wasting disk space. That is a marked step backward and is unhelpful for use
of Solr in the field. As other comments in the now-closed SOLR-7733 ticket
explain, this is a user item which has impact on their sites, and it ought to
be an inherent feature of Solr. Consider a file system where complete
deletes are forbidden, or your kitchen where taking out the rubbish is
denied. Hand waving about obscure auto-sizing notions will not suffice. Thus
may I urge that the Optimise button and operation be returned to use, as
they were until Solr v7.3.0.
     Thanks,
     Joe D.
