On 11/27/2018 10:04 AM, Christopher Schultz wrote:
So, it's pretty much like GC promotion: the number of live objects is really the only thing that matters?

That's probably a better analogy than most anything else I could come up with.

Lucene must completely reconstruct all of the index data from the documents that haven't been marked as deleted.  The fastest I've ever seen an optimize proceed is about 30 megabytes per second, even on RAID10 disk subsystems that are capable of far faster sustained transfer rates.  The operation also hits the CPU hard and generates a lot of garbage, in addition to the I/O load.

I was thinking once per day. AFAIK, this index hasn't been optimized since it 
was first built, which was a few months ago.

For an index that small, I wouldn't expect a once-per-day optimization to have much impact on overall operation ... at the transfer rates I mentioned above, an index that size should only take seconds, or at worst a couple of minutes, to rewrite.  Even for big indexes, if you can do the operation when traffic on your system is very low, users might never even notice.

We aren't explicitly deleting anything, ever. The only deletes
occurring should be when we perform an update() on a document, and
Solr/Lucene automatically deletes the existing document with the same id.

If you do not use deleteByQuery, then ongoing index updates and segment merging (which is what an optimize is) will not interfere with each other, as long as you're using version 4.0 or later.  3.6 and earlier were not able to readily mix merging with ongoing indexing operations.
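
To make the distinction concrete, here's roughly what the two delete styles look like when sent to Solr's JSON update handler.  The host, core name, and field values below are placeholders, not anything from your setup.  The first form is effectively what happens behind the scenes when you update() a document with an existing id; the second is the deleteByQuery style that can get in the way of merging:

    # Delete by id -- the kind of delete an update/replace implies:
    curl -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycore/update' \
      -d '{"delete": {"id": "SOME_ID"}}'

    # deleteByQuery -- the style to avoid if you want merging and
    # ongoing indexing to coexist smoothly:
    curl -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycore/update' \
      -d '{"delete": {"query": "category:obsolete"}}'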

I'd want to schedule this thing with cron, so curl is better for me.
"nohup optimize &" is fine with me, especially if it will give me
stats on how long the optimization actually took.

If you want to know how long it takes, it's probably better to throw the whole script into the background rather than the curl itself.  But you're headed in the right general direction.  Just a few details to think about.
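
As a rough sketch of what that might look like (the host, core name, and log path here are assumptions, not details from your setup), the script below times the optimize request itself.  The request doesn't normally return until the optimize is finished, so the elapsed time of the curl is the elapsed time of the operation:

    #!/bin/sh
    # Hypothetical URL and paths -- adjust to match your installation.
    URL='http://localhost:8983/solr/mycore/update?optimize=true'
    START=$(date +%s)
    curl -s "$URL" > /dev/null
    END=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') optimize took $((END - START)) seconds" \
      >> /var/log/solr-optimize.log

Putting that whole script into crontab (say, a 2 AM daily entry), instead of nohup-ing just the curl, means the duration it logs covers the entire operation.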

I have dev and test environments so I have plenty of places to
play-around. I can even load my production index into dev to see how
long the whole 1M document index will take to optimize, though the
number of segments in the index will be different, unless I just
straight-up copy the index files from the disk. I probably won't do
that because I'd prefer not to take down the index long enough to take
a copy.

If you're dealing with the small index, I wouldn't expect copying the index data while the machine is online to be problematic -- the I/O load would be small.  But if you're running on Windows, I wouldn't be 100% sure that you could copy index data that's in use -- Windows does odd things with file locking that aren't a problem on most other operating systems.

You skipped question 4 which was "can I update my index during an
optimization", but you did mention in your answer to question 3 ("can
I still query during optimize?") that I "should" be able to update the
index (e.g. add/update). Can you clarify why you said "should" instead
of "can"?

I did skip it, because it was answered with question 3, as you noticed.

If that language is there in my reply, I am not really surprised.  Saying "should" rather than "can" is just part of a general "cover my ass" stance that I adopt whenever I'm answering questions like this.  I don't feel comfortable making absolute declarations that something will work, unless I'm completely clear on every aspect of the situation.  It's fairly rare that my understanding of a user's situation is detailed enough to be 100% certain I'm offering the right answer.

Think of it as saying "as long as everything that I understand about your situation is as stated and the not-stated parts are as I'm expecting them to be, you're good.  But if there's something about your situation that I do not know, you might have issues."

Reading ahead to the other replies on the thread...

Optimizing is a tradeoff.  You spend a lot of resources squishing the whole index down to one segment, hoping for a performance boost.  With big indexes, the cost is probably too high to do it frequently.  And in general, once you go down the optimize road, you must keep doing it, or you can run into severe problems with a very large deleted-document percentage.  Those issues are described in part I of the blog post that Walter mentioned, which you can find here:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Part II of that post, which Walter linked, covers the improvements made in version 7.5.0 to work around those problems.

Because of the issues potentially created by doing optimizes, the general advice you'll get here is that unless you actually KNOW that you'll see a significant performance increase, you shouldn't do optimizes.  My stance was already stated in my last reply ... if the impact is low and/or the benefit is high, go ahead and do it.  I'm betting that with an index size below 300MB, your impact from optimize will be VERY low.  I have no idea what the benefit will be ... I can only say that performance WILL increase, even if it's only a little bit.  A lot of work has gone into the last few major Lucene releases to reduce the performance impact of having many segments.  In really old Lucene releases, merging to one segment would result in a VERY significant performance increase ... which I think is how the procedure ended up with the dubious name of "optimize".

Some details about the optimizing strategy I used at $LAST_JOB:  Most of the indexes had six large cold shards and one very small hot shard.  The hot shard was usually less than half a million docs and about 500MB.  No SolrCloud.  Optimizing the hot shard was done once an hour, and would complete within a couple of minutes.  One of the large shards was optimized each night, usually between 2 and 3 AM, and would take 2-3 hours to complete. So it would take six days for all of the large shards to see an optimize.
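
Purely as an illustration of how that kind of rotation can be wired up (the script names, paths, and times below are made up, not what was actually deployed), a crontab for it might look something like this:

    # Hot shard: optimize at the top of every hour (finishes in minutes).
    0 * * * * /usr/local/bin/optimize-hot-shard.sh
    # Cold shards: one per night at 2 AM, Monday through Saturday,
    # choosing the shard by day of week.  Percent signs must be
    # escaped in crontab command fields.
    0 2 * * 1-6 /usr/local/bin/optimize-cold-shard.sh $(date +\%u)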

Thanks,
Shawn
