On 11/27/2018 7:47 AM, Christopher Schultz wrote:
I've got a single-core Solr instance with something like 1M small
documents in it. It contains user information for fast-lookups, and it
gets updated any time relevant user-info changes.
Here's the basic info from the Core Dashboard:
<snip>
I'm wondering how often it makes sense to "optimize" my index, because
there is plenty of turnover of existing documents. That is, plenty of
existing users update their info and therefore the Lucene index is
being updated as well -- causing a document-delete and document-add
operation to occur. My understanding is that leaves a lot of dead
space over time, and I'm assuming that it might even slow things down
as the ratio of useful data to total data is reduced.
The percentage of deleted documents here is fairly low, about 7.6
percent. Doing an optimize with a deleted percentage that low may not
be worthwhile.
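
For reference, the deleted percentage can be worked out from the "Max Doc" and "Num Docs" values on the Core Dashboard. The sketch below is only an illustration: the function name is made up, and the numbers are hypothetical values chosen to give roughly the same ratio, not the ones from the snipped dashboard output.

```python
def deleted_percentage(max_doc: int, num_docs: int) -> float:
    """Percentage of document slots in the index held by deleted documents.

    max_doc  -- "Max Doc" from the dashboard (live + deleted documents)
    num_docs -- "Num Docs" from the dashboard (live documents only)
    """
    if max_doc <= 0:
        return 0.0  # empty index: nothing is deleted
    return 100.0 * (max_doc - num_docs) / max_doc


# Hypothetical numbers, NOT taken from the original dashboard output.
print(round(deleted_percentage(1_000_000, 924_000), 1))  # prints 7.6
```

Tracking this number over time is one way to decide "how often [in deletes]" rather than on a fixed clock schedule.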
On the other hand, it *would* improve performance by a little bit to
optimize. For the index with the stats you mentioned, you'd be going
from 15 segments to one segment. And with an index size of under 300
MB, the optimize operation would complete pretty quickly - likely a few
minutes, maybe even less than one minute.
Presumably, optimizing more often will reduce the time to perform a
single optimization operation, yes?
No, not really. It depends on what documents are in the index, not so
much on whether an optimization was done previously. Subsequent
optimizes will take about as long as the previous optimize did.
Anyhow, I'd like to know a few things:
1. Is manually-triggered optimization even worth doing at all?
Maybe. See how long it takes, how much impact it has on performance
while it's happening, and see if you can get an estimate of how much
extra performance you get from it once it's done. If the impact is low
and/or the benefit is high, then by all means, optimize regularly.
2. If so, how often? Or, maybe not "how often [in hours/days/months]"
but maybe "how often [in deletes, etc.]"?
For an index that size, I would say you should aim for an interval
between once an hour and once every 24 hours. Set up this timing based
on what kind of impact the optimize operation has on performance while
it's occurring. Might be best to do it once a day at a low activity
time, perhaps 03:00. With indexes slightly bigger than that, I was
doing an optimize once an hour. And for the bigger indexes, once a day.
3. During the optimization operation, can clients still issue (read)
queries? If so, will they wait until the optimization operation has
completed?
Yes. And as long as you don't use deleteByQuery, you can even update
the index while it's optimizing. The deleteByQuery operation will cause
problems, especially when the index gets large. With your small index
size, you might not even notice the problems that mixing optimize and
deleteByQuery will cause. Replacing deleteByQuery with a standard query
to retrieve ID values and then doing a deleteById will get rid of the
problems that DBQ causes with optimize.
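
A minimal sketch of that replacement, assuming a unique key field named "id" and Solr's JSON request and response formats. The helper names, the core name, and the example query are all hypothetical; the code only builds the query URL and the delete payload, and leaves the actual HTTP calls to whatever client you already use.

```python
import json
from urllib.parse import urlencode


def id_query_url(base_url, core, query, rows=1000, id_field="id"):
    """URL for a standard query that returns only the ID field."""
    params = urlencode({"q": query, "fl": id_field, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def ids_from_response(response_text, id_field="id"):
    """Pull the ID values out of a JSON query response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc[id_field] for doc in docs]


def delete_by_id_body(ids):
    """JSON body for a delete-by-ID update request."""
    return json.dumps({"delete": list(ids)})


# Hypothetical core name and query, for illustration only.
print(id_query_url("http://localhost:8983/solr", "users", "status:inactive"))
print(delete_by_id_body(["user-17", "user-42"]))
```

The body from delete_by_id_body would be POSTed to the core's /update handler with a Content-Type of application/json. If the query can match more documents than "rows", the query step has to be repeated (paged) until every ID has been collected.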
5. Is it possible to abort an optimization operation if it's taking
too long, and simply discard the new data -- basically, fall-back to
the previously-existing index data?
I am not aware of a way to abort an optimize. I suppose there might be
one ... but in general it doesn't sound like a good idea to me, even if
it's possible.
6. What's a good way to trigger an optimization operation? I didn't
see anything directly in the web UI, but there is an "optimize" method
in the Solr/J client. If I can fire-off a fire-and-forget "optimize"
request via e.g. curl or similar tool rather than writing a Java
client, that would be slightly more convenient for me.
Removal of the optimize button from the admin UI was completely
intentional. It's such a tempting button ... there's a tendency for
people to say to themselves "of COURSE I want to optimize my index, and
make that indicator green!" But optimizing a 50GB index will quite
literally take HOURS ... and will dramatically impact overall
performance for that whole time. So we have removed the temptation. We
haven't removed the ability to optimize, just the button in the UI.
You can use the optimize method in the SolrJ client if your setup is
already using SolrJ. Doing the optimize with something like curl is
typically a little bit easier, and won't present a problem. Either way,
I would arrange for it to happen in the background -- a separate thread
in a SolrJ program, or the & character on the commandline or in a script
when using something like curl. Setting the "wait" options on the
optimize request to false didn't seem to actually lead to an immediate
return on the request and background operation on the server. I've been
wondering if I should file a bug on that problem, if I can reproduce it
with the latest Solr.
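
As a rough sketch of the "separate thread" approach, here is one way to send the optimize request in the background from Python instead of SolrJ. It assumes the update handler accepts optimize=true and maxSegments as URL parameters; the base URL, the core name, and the function names are hypothetical.

```python
import threading
import urllib.request
from urllib.parse import urlencode


def optimize_url(base_url, core, max_segments=1):
    """URL that asks the core's update handler to optimize the index."""
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url}/{core}/update?{params}"


def optimize_in_background(url, opener=urllib.request.urlopen):
    """Send the optimize request on a separate thread and return the thread.

    The request itself still blocks until Solr answers; only the caller is
    freed up. The thread is a daemon, so a short-lived script should join()
    it (or use curl with & instead), otherwise it may exit before the
    request has even been sent.
    """
    def run():
        with opener(url) as response:
            response.read()  # drain the response; its content is ignored here

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


# Hypothetical host and core name, for illustration only.
print(optimize_url("http://localhost:8983/solr", "users"))
```

The same URL can be handed to curl, with & at the end of the command line to put it in the background.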
If deleteByQuery is an essential part of your indexing process, then it
would be prudent to avoid indexing while an optimize is underway. If
you do a deleteByQuery during an optimize, then all indexing from that
point on will wait until the optimize is done. On a big index, that
could be hours.
Thanks,
Shawn