-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Walter,
On 11/27/18 12:31, Walter Underwood wrote:
> Optimize is just forcing a full merge. Solr does merges
> automatically in the background.

Understood.

> It has been automatically doing merges for the months you’ve been
> using it. Let it continue. Don’t bother with optimize.

Fair enough.

> It was a huge mistake to name that function “optimize”. Ultraseek
> had a button labeled “Merge”.

I understand that "optimize" makes it sound like, without performing
that operation, the index is "not optimized", which sounds bad. I'm
not hung up on the terminology.

In my live index, I can see a total of 20 segments. 7 of them are "all
gray" and the other 13 are at various levels of "dark grayness". I
haven't been able to find a reference for what those colors mean, but
they don't seem to be correlated with any data I can see on each
segment.

When I have run an "optimize" operation on a test index, I can see a
single segment which is shown all in "light gray", whatever that means.

Other than wasting my time, are there any negative consequences for
periodically "optimizing" (or merging) the index?

Thanks,
- -chris

>> On Nov 27, 2018, at 9:04 AM, Christopher Schultz
>> <ch...@christopherschultz.net> wrote:
>>
> Shawn,
>
> On 11/27/18 11:01, Shawn Heisey wrote:
>>>> On 11/27/2018 7:47 AM, Christopher Schultz wrote:
>>>>> I've got a single-core Solr instance with something like 1M
>>>>> small documents in it. It contains user information for
>>>>> fast-lookups, and it gets updated any time relevant
>>>>> user-info changes.
>>>>>
>>>>> Here's the basic info from the Core Dashboard:
>>>>
>>>> <snip>
>>>>
>>>>> I'm wondering how often it makes sense to "optimize" my
>>>>> index, because there is plenty of turnover of existing
>>>>> documents. That is, plenty of existing users update their
>>>>> info and therefore the Lucene index is being updated as
>>>>> well -- causing a document-delete and document-add
>>>>> operation to occur.
>>>>> My understanding is that leaves a lot
>>>>> of dead space over time, and I'm assuming that it might
>>>>> even slow things down as the ratio of useful data to total
>>>>> data is reduced.
>>>>
>>>> The percentage of deleted documents here is fairly low. About
>>>> 7.6 percent. Doing an optimize with deleted percentage that
>>>> low may not be worthwhile.
>>>>
>>>> On the other hand, it *would* improve performance by a little
>>>> bit to optimize. For the index with the stats you mentioned,
>>>> you'd be going from 15 segments to one segment. And with an
>>>> index size of under 300 MB, the optimize operation would
>>>> complete pretty quickly - likely a few minutes, maybe even
>>>> less than one minute.
>
> Okay. What I really don't want to do is interrupt normal
> operation.
>
>>>>> Presumably, optimizing more often will reduce the time to
>>>>> perform a single optimization operation, yes?
>>>>
>>>> No, not really. It depends on what documents are in the
>>>> index, not so much on whether an optimization was done
>>>> previously. Subsequent optimizes will take about as long as
>>>> the previous optimize did.
>
> So, it's pretty much like GC promotion: the number of live objects
> is really the only thing that matters?
>
>>>>> Anyhow, I'd like to know a few things:
>>>>>
>>>>> 1. Is manually-triggered optimization even worth doing at
>>>>> all?
>>>>
>>>> Maybe. See how long it takes, how much impact it has on
>>>> performance while it's happening, and see if you can get an
>>>> estimate of how much extra performance you get from it once
>>>> it's done. If the impact is low and/or the benefit is high,
>>>> then by all means, optimize regularly.
>>>>
>>>>> 2. If so, how often? Or, maybe not "how often [in
>>>>> hours/days/months]" but maybe "how often [in deletes,
>>>>> etc.]"?
>>>>
>>>> For an index that size, I would say you should aim for an
>>>> interval between once an hour and once every 24 hours.
>>>> Set up this timing based on what kind of impact the optimize
>>>> operation has on performance while it's occurring. Might be
>>>> best to do it once a day at a low activity time, perhaps
>>>> 03:00. With indexes slightly bigger than that, I was doing
>>>> an optimize once an hour. And for the bigger indexes, once a
>>>> day.
>
> I was thinking once per day. AFAIK, this index hasn't been
> optimized since it was first built which was a few months ago.
>
>>>>> 3. During the optimization operation, can clients still
>>>>> issue (read) queries? If so, will they wait until the
>>>>> optimization operation has completed?
>>>>
>>>> Yes. And as long as you don't use deleteByQuery, you can
>>>> even update the index while it's optimizing. The
>>>> deleteByQuery operation will cause problems, especially when
>>>> the index gets large. With your small index size, you might
>>>> not even notice the problems that mixing optimize and
>>>> deleteByQuery will cause. Replacing deleteByQuery with a
>>>> standard query to retrieve ID values and then doing a
>>>> deleteById will get rid of the problems that DBQ causes with
>>>> optimize.
>
> We aren't explicitly deleting anything, ever. The only deletes
> occurring should be when we perform an update() on a document, and
> Solr/Lucene automatically deletes the existing document with the
> same id.
>
>>>>> 5. Is it possible to abort an optimization operation if
>>>>> it's taking too long, and simply discard the new data --
>>>>> basically, fall-back to the previously-existing index
>>>>> data?
>>>>
>>>> I am not aware of a way to abort an optimize. I suppose
>>>> there might be one ... but in general it doesn't sound like a
>>>> good idea to me, even if it's possible.
>>>>
>>>>> 6. What's a good way to trigger an optimization operation?
>>>>> I didn't see anything directly in the web UI, but there is
>>>>> an "optimize" method in the Solr/J client. If I can
>>>>> fire-off a fire-and-forget "optimize" request via e.g.
>>>>> curl or similar tool rather than writing a Java client, that
>>>>> would be slightly more convenient for me.
>>>>
>>>> Removal of the optimize button from the admin UI was
>>>> completely intentional. It's such a tempting button ...
>>>> there's a tendency for people to say to themselves "of COURSE
>>>> I want to optimize my index, and make that indicator green!"
>>>> But optimizing a 50GB index will quite literally take HOURS
>>>> ... and will dramatically impact overall performance for that
>>>> whole time. So we have removed the temptation. We haven't
>>>> removed the ability to optimize, just the button in the UI.
>
> Ack.
>
>>>> You can use the optimize method in the SolrJ client if your
>>>> setup is already using SolrJ. Doing the optimize with
>>>> something like curl is typically a little bit easier, and
>>>> won't present a problem. Either way, I would arrange for it
>>>> to happen in the background -- a separate thread in a SolrJ
>>>> program, or the & character on the commandline or in a script
>>>> when using something like curl. Setting the "wait" options
>>>> on the optimize request to false didn't seem to actually lead
>>>> to an immediate return on the request and background
>>>> operation on the server. Been wondering if I should file a
>>>> bug on that problem, if I can reproduce it with latest Solr.
>
> I'd want to schedule this thing with cron, so curl is better for
> me. "nohup optimize &" is fine with me, especially if it will give
> me stats on how long the optimization actually took.
>
> I have dev and test environments so I have plenty of places to
> play-around. I can even load my production index into dev to see
> how long the whole 1M document index will take to optimize, though
> the number of segments in the index will be different, unless I
> just straight-up copy the index files from the disk. I probably
> won't do that because I'd prefer not to take-down the index long
> enough to take a copy.
>
>>>> If deleteByQuery is an essential part of your indexing
>>>> process, then it would be prudent to avoid indexing while an
>>>> optimize is underway. If you do a deleteByQuery during an
>>>> optimize, then all indexing from that point on will wait
>>>> until the optimize is done. On a big index, that could be
>>>> hours.
>
> I'm assuming this isn't going to take very long to optimize, but
> we'll see.
>
> You skipped question 4 which was "can I update my index during an
> optimization", but you did mention in your answer to question 3
> ("can I still query during optimize?") that I "should" be able to
> update the index (e.g. add/update). Can you clarify why you said
> "should" instead of "can"?
>
> Thanks again,
> -chris
>
>
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlv9gkQACgkQHPApP6U8
pFiVyw/9HeeDa4Y9jNxmTMyIT6I9/oIjqQ8/ZO4ZMF9bdFcoTtSczMSNQOqUyEl7
VgFDokD2qChuTcnWdXu8iHxPTe7sp5b3FevLC34rLzIhOVEHnFXSiemj4l5zocTk
9/gIJ2vIJVJZokav4vRuYn1TY0yXzipgLfusTtp/llT3I7VxlbxMzaIspnoQrfca
HSpHe7jqvh1DGtpXvEI42/3YJjW1D7fNB4aC1APDWQASjIFMByqcd1jFr5EkgPBI
q8/4UH4kk/I++uh9ZbfIKHrFYawTvuA1pgtIKUr2foZXeFNt8gEk1B+nmf0MwLtn
Yhlf4Ash3/79H51rfESXyEJKF8AEZHxero3vp1dbFpg91MMbRf+zfXA0AWM3N4Fy
elcm9kJO5vZGTkyPSy1UpouPX6FUP6bzHccFnIKvOIQ7I5XRvEjmOUXGAZrEd3V9
T0fAte6DTzS3SIRTdAAkvwiJLT5T9iirj2WsztjLwBdTkxdu6UKjnkWg/qo23dB8
SHNkf2T1o9H6QjFwvRSFe13DqrH2j3028kG79pxFZMZREPxlTv2rYbjTcjK1hRts
p826qzryxzT/2QFGkWTmSFVc0z1DF+I754Kk+1H0aKi3yZUfjDRWKh4hEOa9L2FS
o1dq1bLappX5aOWQsm06tBqfJkdLANQirL71HdPDNysQ7CxA2lo=
=w9Hg
-----END PGP SIGNATURE-----
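
For the cron-scheduled case discussed in the thread, here is a minimal
sketch in Python of sending an optimize request to Solr's update handler
and timing it. The host, port, and core name ("users") are assumptions
for illustration, as are the helper names build_optimize_url and
run_optimize. Because the thread reports that setting the "wait" options
to false did not seem to make the request return immediately, the sketch
simply blocks until Solr answers and reports the elapsed time; run it in
the background from cron if that matters.

```python
import time
import urllib.parse
import urllib.request


def build_optimize_url(base_url, core, wait_searcher=True, max_segments=None):
    """Build the update-handler URL that asks Solr to optimize (force-merge).

    base_url and core are placeholders; adjust them for the real instance.
    """
    params = {"optimize": "true", "waitSearcher": str(wait_searcher).lower()}
    if max_segments is not None:
        # Optional: merge down to at most this many segments.
        params["maxSegments"] = str(max_segments)
    query = urllib.parse.urlencode(params)
    return "%s/%s/update?%s" % (base_url.rstrip("/"), core, query)


def run_optimize(url, timeout=3600):
    """Send the request, wait for the response, return (http_status, seconds)."""
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # drain the body so the connection completes cleanly
        status = response.status
    return status, time.monotonic() - started
```

A cron job could call run_optimize(build_optimize_url("http://localhost:8983/solr", "users"))
and log the returned duration, which gives the "how long did it actually
take" number asked about above. Trying it against a dev or test core
first seems prudent.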