On 11/27/2018 10:04 AM, Christopher Schultz wrote:
> So, it's pretty much like GC promotion: the number of live objects is
> really the only thing that matters?
That's probably a better analogy than most anything else I could come up
with.
Lucene must completely reconstruct all of the index data from the
documents that haven't been marked as deleted. The fastest I've ever
seen an optimize proceed is about 30 megabytes per second, even on
RAID10 disk subsystems that are capable of far faster sustained transfer
rates. The operation strongly impacts CPU and garbage generation, in
addition to the I/O impact.
> I was thinking once per day. AFAIK, this index hasn't been optimized
> since it was first built, which was a few months ago.
For an index that small, I wouldn't expect a once-per-day optimization
to have much impact on overall operation. Even for big indexes, if you
can do the operation when traffic on your system is very low, users
might never even notice.
> We aren't explicitly deleting anything, ever. The only deletes
> occurring should be when we perform an update() on a document, and
> Solr/Lucene automatically deletes the existing document with the same id.
If you do not use deleteByQuery, then ongoing index updates and segment
merging (which is what an optimize is) will not interfere with each
other, as long as you're using version 4.0 or later. 3.6 and earlier
were not able to readily mix merging with ongoing indexing operations.
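To make the distinction concrete, here's a rough sketch -- host and core
name are made up, so adjust for your setup -- of an update that
implicitly deletes the old version of a document by its uniqueKey,
versus an explicit deleteByQuery, which is the kind of delete that can
get tangled up with merges:

  # Overwrite a document by its uniqueKey -- Solr deletes the old
  # version for you, and this coexists fine with a running optimize:
  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/mycore/update?commit=true' \
    --data-binary '[{"id":"doc42","title":"updated title"}]'

  # An explicit deleteByQuery, by contrast, is the type of delete that
  # can interfere with segment merging:
  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/mycore/update?commit=true' \
    --data-binary '{"delete":{"query":"title:obsolete"}}'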
> I'd want to schedule this thing with cron, so curl is better for me.
> "nohup optimize &" is fine with me, especially if it will give me
> stats on how long the optimization actually took.
If you want to know how long it takes, it's probably better to throw the
whole script into the background rather than the curl itself. But
you're headed in the right general direction. Just a few details to
think about.
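Something like this is the general shape I'd aim for -- just a sketch,
with made-up host, port, core name, and paths:

  #!/bin/sh
  # optimize-index.sh -- sketch only; URL and core name are placeholders.
  CORE_URL="http://localhost:8983/solr/mycore"

  echo "optimize started: $(date)"
  # The request doesn't return until the optimize finishes, so the time
  # between these two echo lines is how long the optimize took.
  curl -s "${CORE_URL}/update?optimize=true&maxSegments=1&waitSearcher=true"
  echo "optimize finished: $(date)"

From cron, redirect the output somewhere so you keep the timing info:

  0 3 * * * /path/to/optimize-index.sh >> /var/log/solr-optimize.log 2>&1

If you run it by hand, "nohup /path/to/optimize-index.sh &" backgrounds
the whole script rather than just the curl, which is what I meant above.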
> I have dev and test environments so I have plenty of places to
> play-around. I can even load my production index into dev to see how
> long the whole 1M document index will take to optimize, though the
> number of segments in the index will be different, unless I just
> straight-up copy the index files from the disk. I probably won't do
> that because I'd prefer not to take-down the index long enough to take
> a copy.
If you're dealing with the small index, I wouldn't expect copying the
index data while the machine is online to be problematic -- the I/O load
would be small. But if you're running on Windows, I wouldn't be 100%
sure that you could copy index data that's in use -- Windows does odd
things with file locking that aren't a problem on most other operating
systems.
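If you do want a copy without taking anything down, and the replication
handler is available on the core, its backup command will snapshot a
consistent set of index files for you. Host, core name, and path here
are placeholders, and newer versions may restrict which locations are
allowed:

  curl -s "http://localhost:8983/solr/mycore/replication?command=backup&location=/tmp/solr-backup&name=mycore-snapshot"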
> You skipped question 4 which was "can I update my index during an
> optimization", but you did mention in your answer to question 3 ("can
> I still query during optimize?") that I "should" be able to update the
> index (e.g. add/update). Can you clarify why you said "should" instead
> of "can"?
I did skip it, because it was answered with question 3, as you noticed.
If that language is there in my reply, I am not really surprised.
Saying "should" rather than "can" is just part of a general "cover my
ass" stance that I adopt whenever I'm answering questions like this. I
don't feel comfortable making absolute declarations that something will
work, unless I'm completely clear on every aspect of the situation.
It's fairly rare that my understanding of a user's situation is detailed
enough to be 100% certain I'm offering the right answer.
Think of it as saying "as long as everything that I understand about
your situation is as stated and the not-stated parts are as I'm
expecting them to be, you're good. But if there's something about your
situation that I do not know, you might have issues."
Reading ahead to the other replies on the thread...
Optimizing is a tradeoff. You spend a lot of resources squishing the
whole index down to one segment, hoping for a performance boost. With
big indexes, the cost is probably too high to do it frequently. And in
general, once you go down the optimize road once, you must keep doing
it, or you can run into severe problems with a very large deleted
document percentage. Those issues are described in part I of the blog
post that Walter mentioned, which you can find here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
Part II of that post, which Walter linked, covers the improvements made
in version 7.5.0 to work around those problems.
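If you want to see how much deleted-document baggage the index is
actually carrying before deciding, the Luke handler will tell you --
host and core name are made up here:

  # deletedDocs is maxDoc minus numDocs; a high ratio is what part I of
  # that blog post warns about.
  curl -s "http://localhost:8983/solr/mycore/admin/luke?numTerms=0&wt=json&indent=on" \
    | grep -E '"numDocs"|"maxDoc"|"deletedDocs"'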
Because of the issues potentially created by doing optimizes, the
general advice you'll get here is that unless you actually KNOW that
you'll see a significant performance increase, you shouldn't do
optimizes.  My stance was already stated in my last reply ... if the
impact is low and/or the benefit is high, go ahead and do it. I'm
betting that with an index size below 300MB that your impact from
optimize will be VERY low. I have no idea what the benefit will be ...
I can only say that performance WILL increase, even if it's only a
little bit. A lot of work has gone into the last few major Lucene
releases to reduce the performance impact of many segments. In really
old Lucene releases, merging to one segment would result in a VERY
significant performance increase ... which I think is how the procedure
ended up with the dubious name of "optimize".
Some details about the optimizing strategy I used at $LAST_JOB: Most of
the indexes had six large cold shards and one very small hot shard. The
hot shard was usually less than half a million docs and about 500MB. No
SolrCloud. Optimizing the hot shard was done once an hour, and would
complete within a couple of minutes. One of the large shards was
optimized each night, usually between 2 and 3 AM, and would take 2-3
hours to complete. So it would take six days for all of the large shards
to see an optimize.
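In crontab terms, that schedule looked roughly like this -- core names
and the script path are made up, and the nightly script just picked the
next cold shard in rotation:

  # hourly optimize of the small hot shard
  0 * * * *  curl -s "http://localhost:8983/solr/hot/update?optimize=true&waitSearcher=true" > /dev/null
  # nightly optimize of one of the six cold shards, in rotation
  0 2 * * *  /path/to/optimize-next-cold-shard.sh >> /var/log/solr-optimize.log 2>&1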
Thanks,
Shawn