On 11/27/2018 10:04 AM, Christopher Schultz wrote:
So, it's pretty much like GC promotion: the number of live objects is really the only thing that matters?

That's probably a better analogy than most anything else I could come up with.

Lucene must completely reconstruct all of the index data from the documents that haven't been marked as deleted.  The fastest I've ever seen an optimize proceed is about 30 megabytes per second, even on RAID10 disk subsystems that are capable of far faster sustained transfer rates.  The operation also hits the CPU hard and generates a lot of garbage, in addition to the I/O load.

I was thinking once per day. AFAIK, this index hasn't been optimized since it 
was first built, which was a few months ago.

For an index that small, I wouldn't expect a once-per-day optimization to have much impact on overall operation ... at the transfer rates I mentioned above, an index that size should only take seconds, or at worst a couple of minutes, to rewrite.  Even for big indexes, if you can do the operation when traffic on your system is very low, users might never even notice.

We aren't explicitly deleting anything, ever. The only deletes
occurring should be when we perform an update() on a document, and
Solr/Lucene automatically deletes the existing document with the same id.

If you do not use deleteByQuery, then ongoing index updates and segment merging (which is what an optimize is) will not interfere with each other, as long as you're using version 4.0 or later.  3.6 and earlier were not able to readily mix merging with ongoing indexing operations.
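
To make the distinction concrete, here's roughly what the two delete styles look like when sent to Solr's JSON update handler.  The host, core name, and field values below are placeholders, not anything from your setup.  The first form is effectively what happens behind the scenes when you update() a document with an existing id; the second is the deleteByQuery style that can get in the way of merging:

    # Delete by id -- the kind of delete an update/replace implies:
    curl -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycore/update' \
      -d '{"delete": {"id": "SOME_ID"}}'

    # deleteByQuery -- the style to avoid if you want merging and
    # ongoing indexing to coexist smoothly:
    curl -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycore/update' \
      -d '{"delete": {"query": "category:obsolete"}}'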

I'd want to schedule this thing with cron, so curl is better for me.
"nohup optimize &" is fine with me, especially if it will give me
stats on how long the optimization actually took.

If you want to know how long it takes, it's probably better to throw the whole script into the background rather than the curl itself.  But you're headed in the right general direction.  Just a few details to think about.
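
As a rough sketch of what that might look like (the host, core name, and log path here are assumptions, not details from your setup), the script below times the optimize request itself.  The request doesn't normally return until the optimize is finished, so the elapsed time of the curl is the elapsed time of the operation:

    #!/bin/sh
    # Hypothetical URL and paths -- adjust to match your installation.
    URL='http://localhost:8983/solr/mycore/update?optimize=true'
    START=$(date +%s)
    curl -s "$URL" > /dev/null
    END=$(date +%s)
    echo "$(date '+%Y-%m-%d %H:%M:%S') optimize took $((END - START)) seconds" \
      >> /var/log/solr-optimize.log

Putting that whole script into crontab (say, a 2 AM daily entry), instead of nohup-ing just the curl, means the duration it logs covers the entire operation.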

I have dev and test environments so I have plenty of places to
play-around. I can even load my production index into dev to see how
long the whole 1M document index will take to optimize, though the
number of segments in the index will be different, unless I just
straight-up copy the index files from the disk. I probably won't do
that because I'd prefer not to take down the index long enough to take
a copy.

If you're dealing with the small index, I wouldn't expect copying the index data while the machine is online to be problematic -- the I/O load would be small.  But if you're running on Windows, I wouldn't be 100% sure that you could copy index data that's in use -- Windows does odd things with file locking that aren't a problem on most other operating systems.

You skipped question 4 which was "can I update my index during an
optimization", but you did mention in your answer to question 3 ("can
I still query during optimize?") that I "should" be able to update the
index (e.g. add/update). Can you clarify why you said "should" instead
of "can"?

I did skip it, because it was answered with question 3, as you noticed.

If that language is there in my reply, I am not really surprised.  Saying "should" rather than "can" is just part of a general "cover my ass" stance that I adopt whenever I'm answering questions like this.  I don't feel comfortable making absolute declarations that something will work, unless I'm completely clear on every aspect of the situation.  It's fairly rare that my understanding of a user's situation is detailed enough to be 100% certain I'm offering the right answer.

Think of it as saying "as long as everything that I understand about your situation is as stated and the not-stated parts are as I'm expecting them to be, you're good.  But if there's something about your situation that I do not know, you might have issues."

Reading ahead to the other replies on the thread...

Optimizing is a tradeoff.  You spend a lot of resources squishing the whole index down to one segment, hoping for a performance boost.  With big indexes, the cost is probably too high to do it frequently.  And in general, once you go down the optimize road, you must keep doing it, or you can run into severe problems with a very large deleted-document percentage.  Those issues are described in part I of the blog post that Walter mentioned, which you can find here:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Part II of that post, which Walter linked, covers the improvements made in version 7.5.0 to work around those problems.

Because of the issues potentially created by doing optimizes, the general advice you'll get here is that unless you actually KNOW that you'll see a significant performance increase, you shouldn't do optimizes.  My stance was already stated in my last reply ... if the impact is low and/or the benefit is high, go ahead and do it.  I'm betting that with an index size below 300MB, your impact from optimize will be VERY low.  I have no idea what the benefit will be ... I can only say that performance WILL increase, even if it's only a little bit.  A lot of work has gone into the last few major Lucene releases to reduce the performance impact of having many segments.  In really old Lucene releases, merging to one segment would result in a VERY significant performance increase ... which I think is how the procedure ended up with the dubious name of "optimize".

Some details about the optimizing strategy I used at $LAST_JOB:  Most of the indexes had six large cold shards and one very small hot shard.  The hot shard was usually less than half a million docs and about 500MB.  No SolrCloud.  Optimizing the hot shard was done once an hour, and would complete within a couple of minutes.  One of the large shards was optimized each night, usually between 2 and 3 AM, and would take 2-3 hours to complete. So it would take six days for all of the large shards to see an optimize.
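
Purely as an illustration of how that kind of rotation can be wired up (the script names, paths, and times below are made up, not what was actually deployed), a crontab for it might look something like this:

    # Hot shard: optimize at the top of every hour (finishes in minutes).
    0 * * * * /usr/local/bin/optimize-hot-shard.sh
    # Cold shards: one per night at 2 AM, Monday through Saturday,
    # choosing the shard by day of week.  Percent signs must be
    # escaped in crontab command fields.
    0 2 * * 1-6 /usr/local/bin/optimize-cold-shard.sh $(date +\%u)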

Thanks,
Shawn
