And do note one implication of the link Shawn gave you. Now that
you've optimized, you probably have one huge segment. It _will not_ be
considered for merging unless and until it has less than 2.5G of
"live" documents. So you may see your percentage of deleted documents
grow quite a bit larger than you're used to before merging kicks in.
Solr 7.5 will rewrite that segment (a singleton merge) over time as
deletes accumulate, or you can optimize/forceMerge again and it will
gradually shrink (assuming you do not merge down to 1 segment).
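A rough sketch of doing that from a script, assuming a core named
"mycore" on localhost:8983 (both placeholders) and that your Solr
accepts optimize/maxSegments as request parameters on the update
handler -- double-check the parameter names against the reference
guide for your version:

import requests  # assumes the requests library is available

SOLR_CORE_URL = "http://localhost:8983/solr/mycore"  # placeholder core name

# Ask Solr to forceMerge down to at most 2 segments rather than 1,
# so you don't recreate a single max-sized segment.
resp = requests.get(
    SOLR_CORE_URL + "/update",
    params={"optimize": "true", "maxSegments": 2, "waitSearcher": "true"},
    timeout=None,  # an optimize can run for a long time
)
resp.raise_for_status()
print(resp.text)
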
Oh, and the admin UI segments view is misleading prior to Solr 7.5.
The shading of each segment bar is _supposed_ to be proportional to
the number of deleted docs, with light gray being live docs and dark
gray being deleted, but the calculation was off. Hover over a segment
and you'll see the raw numbers and see what I mean.
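If you'd rather pull those numbers from a script than hover in the UI,
a sketch along these lines (same placeholder core/URL as above; the
"size" and "delCount" field names are my assumptions about the
segments handler's JSON output) will print them per segment:

import requests  # assumes the requests library is available

SOLR_CORE_URL = "http://localhost:8983/solr/mycore"  # placeholder core name

resp = requests.get(SOLR_CORE_URL + "/admin/segments", params={"wt": "json"})
resp.raise_for_status()
# One entry per segment; "size" is the total doc count and "delCount"
# the deleted docs, if I have the field names right.
for name, seg in resp.json().get("segments", {}).items():
    print(f"{name}: {seg.get('size', 0)} docs, {seg.get('delCount', 0)} deleted")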

Best,
Erick

On Tue, Nov 27, 2018 at 2:11 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 11/27/2018 10:04 AM, Christopher Schultz wrote:
> > So, it's pretty much like GC promotion: the number of live objects is
> > really the only thing that matters?
>
> That's probably a better analogy than most anything else I could come up
> with.
>
> Lucene must completely reconstruct all of the index data from the
> documents that haven't been marked as deleted.  The fastest I've ever
> seen an optimize proceed is about 30 megabytes per second, even on
> RAID10 disk subsystems that are capable of far faster sustained transfer
> rates.  The operation strongly impacts CPU and garbage generation, in
> addition to the I/O impact.
>
> > I was thinking once per day. AFAIK, this index hasn't been optimized
> > since it was first built, which was a few months ago.
>
> For an index that small, I wouldn't expect a once-per-day optimization
> to have much impact on overall operation.  Even for big indexes, if you
> can do the operation when traffic on your system is very low, users
> might never even notice.
>
> > We aren't explicitly deleting anything, ever. The only deletes
> > occurring should be when we perform an update() on a document, and
> > Solr/Lucene automatically deletes the existing document with the same id
>
> If you do not use deleteByQuery, then ongoing index updates and segment
> merging (which is what an optimize is) will not interfere with each
> other, as long as you're using version 4.0 or later.  3.6 and earlier
> were not able to readily mix merging with ongoing indexing operations.
>
> > I'd want to schedule this thing with cron, so curl is better for me.
> > "nohup optimize &" is fine with me, especially if it will give me
> > stats on how long the optimization actually took.
>
> If you want to know how long it takes, it's probably better to throw the
> whole script into the background rather than the curl itself.  But
> you're headed in the right general direction.  Just a few details to
> think about.
>
> > I have dev and test environments so I have plenty of places to
> > play-around. I can even load my production index into dev to see how
> > long the whole 1M document index will take to optimize, though the
> > number of segments in the index will be different, unless I just
> > straight-up copy the index files from the disk. I probably won't do
> > that because I'd prefer not to take-down the index long enough to take
> > a copy.
>
> If you're dealing with the small index, I wouldn't expect copying the
> index data while the machine is online to be problematic -- the I/O load
> would be small.  But if you're running on Windows, I wouldn't be 100%
> sure that you could copy index data that's in use -- Windows does odd
> things with file locking that aren't a problem on most other operating
> systems.
>
> > You skipped question 4 which was "can I update my index during an
> > optimization", but you did mention in your answer to question 3 ("can
> > I still query during optimize?") that I "should" be able to update the
> > index (e.g. add/update). Can you clarify why you said "should" instead
> > of "can"?
>
> I did skip it, because it was answered with question 3, as you noticed.
>
> If that language is there in my reply, I am not really surprised.
> Saying "should" rather than "can" is just part of a general "cover my
> ass" stance that I adopt whenever I'm answering questions like this.  I
> don't feel comfortable making absolute declarations that something will
> work, unless I'm completely clear on every aspect of the situation.
> It's fairly rare that my understanding of a user's situation is detailed
> enough to be 100% certain I'm offering the right answer.
>
> Think of it as saying "as long as everything that I understand about
> your situation is as stated and the not-stated parts are as I'm
> expecting them to be, you're good.  But if there's something about your
> situation that I do not know, you might have issues."
>
> Reading ahead to the other replies on the thread...
>
> Optimizing is a tradeoff.  You spend a lot of resources squishing the
> whole index down to one segment, hoping for a performance boost.  With
> big indexes, the cost is probably too high to do it frequently.  And in
> general, once you go down the optimize road once, you must keep doing
> it, or you can run into severe problems with a very large deleted
> document percentage. Those issues are described in part I of the blog
> post that Walter mentioned, which you can find here:
>
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>
> Part II of that post, which Walter linked, covers the improvements made
> in version 7.5.0 to work around those problems.
>
> Because of the issues that optimizing can potentially create, the
> general advice you'll get here is that unless you actually KNOW
> you'll see a significant performance increase, you shouldn't
> optimize.  My stance was already stated in my last reply ... if the
> impact is low and/or the benefit is high, go ahead and do it.  I'm
> betting that with an index size below 300MB, the impact of an
> optimize will be VERY low.  I have no idea what the benefit will be ...
> I can only say that performance WILL increase, even if it's only a
> little bit.  A lot of work has gone into the last few major Lucene
> releases to reduce the performance impact of many segments.  In really
> old Lucene releases, merging to one segment would result in a VERY
> significant performance increase ... which I think is how the procedure
> ended up with the dubious name of "optimize".
>
> Some details about the optimizing strategy I used at $LAST_JOB:  Most of
> the indexes had six large cold shards and one very small hot shard.  The
> hot shard was usually less than half a million docs and about 500MB.  No
> SolrCloud.  Optimizing the hot shard was done once an hour, and would
> complete within a couple of minutes.  One of the large shards was
> optimized each night, usually between 2 and 3 AM, and would take 2-3
> hours to complete. So it would take six days for all of the large shards
> to see an optimize.
>
> Thanks,
> Shawn
>
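
For Shawn's point above about timing the whole optimize from a cron'd
script: a minimal sketch, using the same placeholder core/URL as the
sketches up top, that runs the optimize and logs how long it took (the
request blocks until the optimize finishes, which is why the elapsed
time is meaningful):

import time
import requests  # assumes the requests library is available

SOLR_CORE_URL = "http://localhost:8983/solr/mycore"  # placeholder core name

start = time.monotonic()
resp = requests.get(
    SOLR_CORE_URL + "/update",
    params={"optimize": "true", "waitSearcher": "true", "wt": "json"},
    timeout=None,  # blocks for as long as the optimize runs
)
resp.raise_for_status()
print(f"optimize finished in {time.monotonic() - start:.1f} seconds")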
