On 11/27/2018 7:47 AM, Christopher Schultz wrote:
I've got a single-core Solr instance with something like 1M small
documents in it. It contains user information for fast-lookups, and it
gets updated any time relevant user-info changes.

Here's the basic info from the Core Dashboard:

<snip>

I'm wondering how often it makes sense to "optimize" my index, because
there is plenty of turnover of existing documents. That is, plenty of
existing users update their info and therefore the Lucene index is
being updated as well -- causing a document-delete and document-add
operation to occur. My understanding is that leaves a lot of dead
space over time, and I'm assuming that it might even slow things down
as the ratio of useful data to total data is reduced.

The percentage of deleted documents here is fairly low. About 7.6 percent.  Doing an optimize with a deleted percentage that low may not be worthwhile.

On the other hand, it *would* improve performance by a little bit to optimize.  For the index with the stats you mentioned, you'd be going from 15 segments to one segment.  And with an index size of under 300 MB, the optimize operation would complete pretty quickly - likely a few minutes, maybe even less than one minute.
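For reference, here is a rough sketch of how a deleted percentage like that can be worked out from the document counts on the Core Dashboard. It assumes the percentage is taken relative to maxDoc (live plus deleted documents). The function name is made up for illustration, and the sample numbers are invented, not the ones from the snipped stats.

```python
def deleted_percentage(num_docs: int, max_doc: int) -> float:
    """Percentage of documents in the index that are marked deleted.

    Assumes max_doc counts live plus deleted documents and num_docs
    counts only live ones (an assumption about how the figure is derived).
    """
    if max_doc <= 0:
        return 0.0
    deleted = max_doc - num_docs
    return 100.0 * deleted / max_doc


# Invented example numbers, not the ones from the original post.
print(round(deleted_percentage(num_docs=924_000, max_doc=1_000_000), 1))
```

Tracking this number over time is one way to decide by deletes rather than by the clock.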

Presumably, optimizing more often will reduce the time to perform a
single optimization operation, yes?

No, not really.  It depends on what documents are in the index, not so much on whether an optimization was done previously.  Subsequent optimizes will take about as long as the previous optimize did.

Anyhow, I'd like to know a few things:

1. Is manually-triggered optimization even worth doing at all?

Maybe.  See how long it takes, how much impact it has on performance while it's happening, and see if you can get an estimate of how much extra performance you get from it once it's done.  If the impact is low and/or the benefit is high, then by all means, optimize regularly.

2. If so, how often? Or, maybe not "how often [in hours/days/months]"
but maybe "how often [in deletes, etc.]"?

For an index that size, I would say you should aim for an interval between once an hour and once every 24 hours.  Set up this timing based on what kind of impact the optimize operation has on performance while it's occurring.  Might be best to do it once a day at a low activity time, perhaps 03:00.  With indexes slightly bigger than that, I was doing an optimize once an hour. And for the bigger indexes, once a day.
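If the optimize is going to be driven from a long-running script rather than a system scheduler, the "once a day at 03:00" timing could be sketched roughly like this. The function name is hypothetical, and it only computes the delay; it does not talk to Solr.

```python
from datetime import datetime, timedelta


def seconds_until_next(hour: int, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:00 local time."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


# Example: how long to sleep before the next 03:00 run.
print(seconds_until_next(3, datetime.now()))
```

A script would sleep for that many seconds, send the optimize request, and repeat.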

3. During the optimization operation, can clients still issue (read)
queries? If so, will they wait until the optimization operation has
completed?

Yes.  And as long as you don't use deleteByQuery, you can even update the index while it's optimizing.  The deleteByQuery operation will cause problems, especially when the index gets large.  With your small index size, you might not even notice the problems that mixing optimize and deleteByQuery will cause. Replacing deleteByQuery with a standard query to retrieve ID values and then doing a deleteById will get rid of the problems that DBQ causes with optimize.
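A minimal sketch of that query-then-deleteById replacement, using only the Python standard library. The core URL, the "users" core name, and the assumption that the uniqueKey field is called "id" are all placeholders; adjust them for the real schema. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical core URL; adjust host, port, and core name.
CORE_URL = "http://localhost:8983/solr/users"


def build_id_query_url(core_url: str, query: str, rows: int = 1000) -> str:
    """URL for a standard query that returns only the id field."""
    params = urlencode({"q": query, "fl": "id", "rows": rows, "wt": "json"})
    return f"{core_url}/select?{params}"


def extract_ids(select_response: dict) -> list:
    """Pull the id values out of a parsed JSON select response."""
    return [doc["id"] for doc in select_response["response"]["docs"]]


def build_delete_by_id_body(ids: list) -> bytes:
    """JSON update body that deletes the given ids (no query involved)."""
    return json.dumps({"delete": ids}).encode("utf-8")


def delete_matching(core_url: str, query: str, rows: int = 1000) -> int:
    """Query for matching ids, then delete them by id. Returns the count."""
    with urlopen(build_id_query_url(core_url, query, rows)) as resp:
        ids = extract_ids(json.load(resp))
    if ids:
        req = Request(
            f"{core_url}/update?commit=true",
            data=build_delete_by_id_body(ids),
            headers={"Content-Type": "application/json"},
        )
        urlopen(req).close()
    return len(ids)
```

Each call handles at most `rows` matches, so for a larger result set it would need to be repeated until it returns 0.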

5. Is it possible to abort an optimization operation if it's taking
too long, and simply discard the new data -- basically, fall-back to
the previously-existing index data?

I am not aware of a way to abort an optimize.  I suppose there might be one ... but in general it doesn't sound like a good idea to me, even if it's possible.

6. What's a good way to trigger an optimization operation? I didn't
see anything directly in the web UI, but there is an "optimize" method
in the Solr/J client. If I can fire-off a fire-and-forget "optimize"
request via e.g. curl or similar tool rather than writing a Java
client, that would be slightly more convenient for me.

Removal of the optimize button from the admin UI was completely intentional.  It's such a tempting button ... there's a tendency for people to say to themselves "of COURSE I want to optimize my index, and make that indicator green!"  But optimizing a 50GB index will quite literally take HOURS ... and will dramatically impact overall performance for that whole time.  So we have removed the temptation.  We haven't removed the ability to optimize, just the button in the UI.

You can use the optimize method in the SolrJ client if your setup is already using SolrJ.  Doing the optimize with something like curl is typically a little bit easier, and won't present a problem.  Either way, I would arrange for it to happen in the background -- a separate thread in a SolrJ program, or the & character on the command line or in a script when using something like curl.  Setting the "wait" options on the optimize request to false didn't seem to actually lead to an immediate return on the request and background operation on the server.  Been wondering if I should file a bug on that problem, if I can reproduce it with the latest Solr.
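As a rough illustration of the background approach without SolrJ, here is a sketch that sends the optimize request to the update handler from a separate thread. The core URL passed in is a placeholder, the function names are made up, and the `fetch` parameter is a hypothetical hook so the HTTP call can be swapped out or tested. It does not rely on the "wait" options returning early.

```python
import threading
from urllib.parse import urlencode
from urllib.request import urlopen


def build_optimize_url(core_url: str, max_segments: int = 1) -> str:
    """URL that asks the core's update handler to optimize."""
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{core_url}/update?{params}"


def optimize_in_background(core_url: str, fetch=None) -> threading.Thread:
    """Send the optimize request from a separate thread and return it.

    By default the request uses urlopen with no timeout, because the
    request may not return until the optimize has finished.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url) as resp:
                resp.read()
    thread = threading.Thread(target=fetch, args=(build_optimize_url(core_url),))
    thread.start()
    return thread
```

Calling `optimize_in_background("http://localhost:8983/solr/users")` returns right away; the thread is not a daemon, so the process stays alive until the request comes back.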

If deleteByQuery is an essential part of your indexing process, then it would be prudent to avoid indexing while an optimize is underway.  If you do a deleteByQuery during an optimize, then all indexing from that point on will wait until the optimize is done.  On a big index, that could be hours.

Thanks,
Shawn
