Erick,
On 11/27/18 20:47, Erick Erickson wrote:
> And do note one implication of the link Shawn gave you. Now that
> you've optimized, you probably have one huge segment. It _will not_
> be merged unless and until it has < 2.5G "live" documents. So you
> may see your percentage of deleted documents get quite a bit larger
> than you've seen before merging kicks in. Solr 7.5 will rewrite
> this segment (singleton merge) over time as deletes accumulate, or
> you can optimize/forceMerge and it'll gradually shrink (assuming
> you do not merge down to 1 segment).

Ack. It sounds like I shouldn't worry too much about "optimization" at
all. If I find that I have a performance problem (hah! I'm comparing
the performance to a relational table-scan, which was intolerably
long), I can investigate whether or not optimization will help me.

> Oh, and the admin UI segments view is misleading prior to Solr
> 7.5. Hover over each one and you'll see the number of deleted docs.
> It's _supposed_ to be proportional to the number of deleted docs,
> with light gray being live docs and dark gray being deleted, but
> the calculation was off. If you hover over you'll see the raw
> numbers and see what I mean.

Thanks for this clarification. I'm using 7.4.0, so I think that's what
was confusing me. I'm fairly certain to upgrade to 7.5 in the next few
weeks. For me, it's basically an untar/stop/ln/start operation as long
as testing goes well.

-chris

> On Tue, Nov 27, 2018 at 2:11 PM Shawn Heisey <apa...@elyograg.org>
> wrote:
>>
>> On 11/27/2018 10:04 AM, Christopher Schultz wrote:
>>> So, it's pretty much like GC promotion: the number of live
>>> objects is really the only thing that matters?
>>
>> That's probably a better analogy than most anything else I could
>> come up with.
>>
>> Lucene must completely reconstruct all of the index data from
>> the documents that haven't been marked as deleted.
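To put numbers on the growing deleted-document percentage Erick describes: the core stats report maxDoc (all documents, including deleted-but-not-yet-merged ones) and numDocs (live documents), and the deleted fraction falls out directly. A quick sketch of the arithmetic, with made-up values:

```shell
# maxDoc counts all documents including deleted ones; numDocs counts
# only live documents. The values below are invented for illustration.
max_doc=1000000
num_docs=750000

deleted=$((max_doc - num_docs))
pct=$((100 * deleted / max_doc))
echo "deleted: $deleted of $max_doc ($pct%)"
```

With one huge post-optimize segment that merging won't touch, that percentage can climb well past what you'd see under normal merge behavior.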
>> The fastest
>> I've ever seen an optimize proceed is about 30 megabytes per
>> second, even on RAID10 disk subsystems that are capable of far
>> faster sustained transfer rates. The operation strongly impacts
>> CPU and garbage generation, in addition to the I/O impact.
>>
>>> I was thinking once per day. AFAIK, this index hasn't been
>>> optimized since it was first built, which was a few months ago.
>>
>> For an index that small, I wouldn't expect a once-per-day
>> optimization to have much impact on overall operation. Even for
>> big indexes, if you can do the operation when traffic on your
>> system is very low, users might never even notice.
>>
>>> We aren't explicitly deleting anything, ever. The only deletes
>>> occurring should be when we perform an update() on a document,
>>> and Solr/Lucene automatically deletes the existing document
>>> with the same id.
>>
>> If you do not use deleteByQuery, then ongoing index updates and
>> segment merging (which is what an optimize is) will not interfere
>> with each other, as long as you're using version 4.0 or later.
>> 3.6 and earlier were not able to readily mix merging with ongoing
>> indexing operations.
>>
>>> I'd want to schedule this thing with cron, so curl is better
>>> for me. "nohup optimize &" is fine with me, especially if it
>>> will give me stats on how long the optimization actually took.
>>
>> If you want to know how long it takes, it's probably better to
>> throw the whole script into the background rather than the curl
>> itself. But you're headed in the right general direction. Just
>> a few details to think about.
>>
>>> I have dev and test environments, so I have plenty of places to
>>> play around. I can even load my production index into dev to
>>> see how long the whole 1M-document index will take to optimize,
>>> though the number of segments in the index will be different,
>>> unless I just straight-up copy the index files from the disk.
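Shawn's suggestion of timing the whole script rather than the curl itself can be sketched roughly as below. The host and core name ("mycore") are placeholders; adjust them for your install:

```shell
#!/bin/sh
# Cron-friendly optimize wrapper. URL and core name are hypothetical;
# point SOLR_URL at your own core's update handler.
SOLR_URL="http://localhost:8983/solr/mycore/update"

start=$(date +%s)
# The request does not return until the merge finishes, so the
# wall-clock delta below reflects the full optimize duration.
curl -s "$SOLR_URL?optimize=true" > /dev/null
end=$(date +%s)

elapsed=$((end - start))
echo "optimize finished in $elapsed seconds"
```

Running the script itself under nohup (or straight from cron) and logging its output gives you the per-run duration stats without having to babysit the curl.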
>>> I
>>> probably won't do that, because I'd prefer not to take down the
>>> index long enough to take a copy.
>>
>> If you're dealing with the small index, I wouldn't expect copying
>> the index data while the machine is online to be problematic --
>> the I/O load would be small. But if you're running on Windows, I
>> wouldn't be 100% sure that you could copy index data that's in
>> use -- Windows does odd things with file locking that aren't a
>> problem on most other operating systems.
>>
>>> You skipped question 4, which was "can I update my index during
>>> an optimization", but you did mention in your answer to
>>> question 3 ("can I still query during optimize?") that I
>>> "should" be able to update the index (e.g. add/update). Can you
>>> clarify why you said "should" instead of "can"?
>>
>> I did skip it, because it was answered with question 3, as you
>> noticed.
>>
>> If that language is there in my reply, I am not really
>> surprised. Saying "should" rather than "can" is just part of a
>> general "cover my ass" stance that I adopt whenever I'm answering
>> questions like this. I don't feel comfortable making absolute
>> declarations that something will work unless I'm completely
>> clear on every aspect of the situation. It's fairly rare that my
>> understanding of a user's situation is detailed enough to be 100%
>> certain I'm offering the right answer.
>>
>> Think of it as saying "as long as everything that I understand
>> about your situation is as stated, and the not-stated parts are as
>> I'm expecting them to be, you're good. But if there's something
>> about your situation that I do not know, you might have issues."
>>
>> Reading ahead to the other replies on the thread...
>>
>> Optimizing is a tradeoff. You spend a lot of resources squishing
>> the whole index down to one segment, hoping for a performance
>> boost. With big indexes, the cost is probably too high to do it
>> frequently.
>> And in general, once you go down the optimize road
>> once, you must keep doing it, or you can run into severe problems
>> with a very large deleted-document percentage. Those issues are
>> described in part I of the blog post that Walter mentioned, which
>> you can find here:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Part II of that post, which Walter linked, covers the improvements
>> made in version 7.5.0 to work around those problems.
>>
>> Because of the issues potentially created by doing optimizes,
>> the general advice you'll get here is that unless you actually
>> KNOW that you'll see a significant performance increase, you
>> shouldn't do optimizes. My stance was already stated in my last
>> reply ... if the impact is low and/or the benefit is high, go
>> ahead and do it. I'm betting that with an index size below 300MB,
>> your impact from optimize will be VERY low. I have no idea
>> what the benefit will be ... I can only say that performance WILL
>> increase, even if it's only a little bit. A lot of work has gone
>> into the last few major Lucene releases to reduce the performance
>> impact of many segments. In really old Lucene releases, merging
>> to one segment would result in a VERY significant performance
>> increase ... which I think is how the procedure ended up with the
>> dubious name of "optimize".
>>
>> Some details about the optimizing strategy I used at $LAST_JOB:
>> Most of the indexes had six large cold shards and one very small
>> hot shard. The hot shard was usually less than half a million
>> docs and about 500MB. No SolrCloud. Optimizing the hot shard
>> was done once an hour, and would complete within a couple of
>> minutes. One of the large shards was optimized each night,
>> usually between 2 and 3 AM, and would take 2-3 hours to complete.
>> So it would take six days for all of the large shards to see an
>> optimize.
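Shawn's hot/cold schedule maps naturally onto cron. A hypothetical crontab sketch (the script name and path are invented for illustration; each invocation would issue the optimize request against the appropriate shard, in the style of the curl commands discussed earlier in the thread):

```
# Optimize the small hot shard at the top of every hour.
0 * * * *  /usr/local/bin/optimize-shard.sh hot
# Optimize one of the six large cold shards each night at 2 AM;
# the script rotates through them, so each cold shard is
# optimized once every six days.
0 2 * * *  /usr/local/bin/optimize-shard.sh next-cold
```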
>>
>> Thanks,
>> Shawn