Erick,
On 11/27/18 20:47, Erick Erickson wrote:
> And do note one implication of the link Shawn gave you. Now that
> you've optimized, you probably have one huge segment. It _will not_
> be merged unless and until it has < 2.5G "live" documents. So you
> may see your percentage of deleted documents get quite a bit larger
> than you've seen before merging kicks in. Solr 7.5 will rewrite
> this segment (singleton merge) over time as deletes accumulate, or
> you can optimize/forceMerge and it'll gradually shrink (assuming
> you do not merge down to 1 segment).

Ack. It sounds like I shouldn't worry too much about "optimization" at
all. If I find that I have a performance problem (hah! I'm comparing
the performance to a relational table-scan, which was intolerably
long), I can investigate whether or not optimization will help me.

> Oh, and the admin UI segments view is misleading prior to Solr
> 7.5. Hover over each one and you'll see the number of deleted docs.
> It's _supposed_ to be proportional to the number of deleted docs,
> with light gray being live docs and dark gray being deleted, but
> the calculation was off. If you hover over you'll see the raw
> numbers and see what I mean.

Thanks for this clarification. I'm using 7.4.0, so I think that's what
was confusing me. I'm fairly certain to upgrade to 7.5 in the next few
weeks. For me, it's basically an untar/stop/ln/start operation as long
as testing goes well.

-chris

> On Tue, Nov 27, 2018 at 2:11 PM Shawn Heisey <apa...@elyograg.org>
> wrote:
>>
>> On 11/27/2018 10:04 AM, Christopher Schultz wrote:
>>> So, it's pretty much like GC promotion: the number of live
>>> objects is really the only thing that matters?
>>
>> That's probably a better analogy than most anything else I could
>> come up with.
>>
>> Lucene must completely reconstruct all of the index data from
>> the documents that haven't been marked as deleted.
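To put numbers on the growing deleted-document percentage Erick describes: the core stats report maxDoc (all documents, including deleted-but-not-yet-merged ones) and numDocs (live documents), and the deleted fraction falls out directly. A quick sketch of the arithmetic, with made-up values:

```shell
# maxDoc counts all documents including deleted ones; numDocs counts
# only live documents. The values below are invented for illustration.
max_doc=1000000
num_docs=750000

deleted=$((max_doc - num_docs))
pct=$((100 * deleted / max_doc))
echo "deleted: $deleted of $max_doc ($pct%)"
```

With one huge post-optimize segment that merging won't touch, that percentage can climb well past what you'd see under normal merge behavior.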
>> The fastest
>> I've ever seen an optimize proceed is about 30 megabytes per
>> second, even on RAID10 disk subsystems that are capable of far
>> faster sustained transfer rates. The operation strongly impacts
>> CPU and garbage generation, in addition to the I/O impact.
>>
>>> I was thinking once per day. AFAIK, this index hasn't been
>>> optimized since it was first built, which was a few months ago.
>>
>> For an index that small, I wouldn't expect a once-per-day
>> optimization to have much impact on overall operation. Even for
>> big indexes, if you can do the operation when traffic on your
>> system is very low, users might never even notice.
>>
>>> We aren't explicitly deleting anything, ever. The only deletes
>>> occurring should be when we perform an update() on a document,
>>> and Solr/Lucene automatically deletes the existing document
>>> with the same id.
>>
>> If you do not use deleteByQuery, then ongoing index updates and
>> segment merging (which is what an optimize is) will not interfere
>> with each other, as long as you're using version 4.0 or later.
>> 3.6 and earlier were not able to readily mix merging with ongoing
>> indexing operations.
>>
>>> I'd want to schedule this thing with cron, so curl is better
>>> for me. "nohup optimize &" is fine with me, especially if it
>>> will give me stats on how long the optimization actually took.
>>
>> If you want to know how long it takes, it's probably better to
>> throw the whole script into the background rather than the curl
>> itself. But you're headed in the right general direction. Just
>> a few details to think about.
>>
>>> I have dev and test environments, so I have plenty of places to
>>> play around. I can even load my production index into dev to
>>> see how long the whole 1M-document index will take to optimize,
>>> though the number of segments in the index will be different,
>>> unless I just straight-up copy the index files from the disk.
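Shawn's suggestion of timing the whole script rather than the curl itself can be sketched roughly as below. The host and core name ("mycore") are placeholders; adjust them for your install:

```shell
#!/bin/sh
# Cron-friendly optimize wrapper. URL and core name are hypothetical;
# point SOLR_URL at your own core's update handler.
SOLR_URL="http://localhost:8983/solr/mycore/update"

start=$(date +%s)
# The request does not return until the merge finishes, so the
# wall-clock delta below reflects the full optimize duration.
curl -s "$SOLR_URL?optimize=true" > /dev/null
end=$(date +%s)

elapsed=$((end - start))
echo "optimize finished in $elapsed seconds"
```

Running the script itself under nohup (or straight from cron) and logging its output gives you the per-run duration stats without having to babysit the curl.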
>>> I
>>> probably won't do that, because I'd prefer not to take down the
>>> index long enough to take a copy.
>>
>> If you're dealing with the small index, I wouldn't expect copying
>> the index data while the machine is online to be problematic --
>> the I/O load would be small. But if you're running on Windows, I
>> wouldn't be 100% sure that you could copy index data that's in
>> use -- Windows does odd things with file locking that aren't a
>> problem on most other operating systems.
>>
>>> You skipped question 4, which was "can I update my index during
>>> an optimization", but you did mention in your answer to
>>> question 3 ("can I still query during optimize?") that I
>>> "should" be able to update the index (e.g. add/update). Can you
>>> clarify why you said "should" instead of "can"?
>>
>> I did skip it, because it was answered with question 3, as you
>> noticed.
>>
>> If that language is there in my reply, I am not really
>> surprised. Saying "should" rather than "can" is just part of a
>> general "cover my ass" stance that I adopt whenever I'm answering
>> questions like this. I don't feel comfortable making absolute
>> declarations that something will work unless I'm completely
>> clear on every aspect of the situation. It's fairly rare that my
>> understanding of a user's situation is detailed enough to be 100%
>> certain I'm offering the right answer.
>>
>> Think of it as saying "as long as everything that I understand
>> about your situation is as stated, and the not-stated parts are as
>> I'm expecting them to be, you're good. But if there's something
>> about your situation that I do not know, you might have issues."
>>
>> Reading ahead to the other replies on the thread...
>>
>> Optimizing is a tradeoff. You spend a lot of resources squishing
>> the whole index down to one segment, hoping for a performance
>> boost. With big indexes, the cost is probably too high to do it
>> frequently.
>> And in general, once you go down the optimize road
>> once, you must keep doing it, or you can run into severe problems
>> with a very large deleted-document percentage. Those issues are
>> described in part I of the blog post that Walter mentioned, which
>> you can find here:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Part II of that post, which Walter linked, covers the improvements
>> made in version 7.5.0 to work around those problems.
>>
>> Because of the issues potentially created by doing optimizes,
>> the general advice you'll get here is that unless you actually
>> KNOW that you'll see a significant performance increase, you
>> shouldn't do optimizes. My stance was already stated in my last
>> reply ... if the impact is low and/or the benefit is high, go
>> ahead and do it. I'm betting that with an index size below 300MB,
>> your impact from optimize will be VERY low. I have no idea
>> what the benefit will be ... I can only say that performance WILL
>> increase, even if it's only a little bit. A lot of work has gone
>> into the last few major Lucene releases to reduce the performance
>> impact of many segments. In really old Lucene releases, merging
>> to one segment would result in a VERY significant performance
>> increase ... which I think is how the procedure ended up with the
>> dubious name of "optimize".
>>
>> Some details about the optimizing strategy I used at $LAST_JOB:
>> Most of the indexes had six large cold shards and one very small
>> hot shard. The hot shard was usually less than half a million
>> docs and about 500MB. No SolrCloud. Optimizing the hot shard
>> was done once an hour, and would complete within a couple of
>> minutes. One of the large shards was optimized each night,
>> usually between 2 and 3 AM, and would take 2-3 hours to complete.
>> So it would take six days for all of the large shards to see an
>> optimize.
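Shawn's hot/cold schedule maps naturally onto cron. A hypothetical crontab sketch (the script name and path are invented for illustration; each invocation would issue the optimize request against the appropriate shard, in the style of the curl commands discussed earlier in the thread):

```
# Optimize the small hot shard at the top of every hour.
0 * * * *  /usr/local/bin/optimize-shard.sh hot
# Optimize one of the six large cold shards each night at 2 AM;
# the script rotates through them, so each cold shard is
# optimized once every six days.
0 2 * * *  /usr/local/bin/optimize-shard.sh next-cold
```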
>>
>> Thanks,
>> Shawn