-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Walter,

On 11/27/18 12:31, Walter Underwood wrote:
> Optimize is just forcing a full merge. Solr does merges
> automatically in the background.
Understood.

> It has been automatically doing merges for the months you’ve been 
> using it. Let it continue. Don’t bother with optimize.
Fair enough.

> It was a huge mistake to name that function “optimize”. Ultraseek
> had a button labeled “Merge”.
I understand that "optimize" makes it sound as if, without performing
that operation, the index is "not optimized", which sounds bad.
I'm not hung up on the terminology.

In my live index, I can see a total of 20 segments. 7 of them are "all
gray" and the other 13 are at various levels of "dark grayness". I
haven't been able to find a reference for what those colors mean, but
they don't seem to be correlated with any data I can see on each
segment.

When I have run an "optimize" operation on a test index, I can see a
single segment which is shown all in "light gray", whatever that means.

Other than wasting my time, are there any negative consequences for
periodically "optimizing" (or merging) the index?

Thanks,
- -chris

>> On Nov 27, 2018, at 9:04 AM, Christopher Schultz
>> <ch...@christopherschultz.net> wrote:
>> 
> Shawn,
> 
> On 11/27/18 11:01, Shawn Heisey wrote:
>>>> On 11/27/2018 7:47 AM, Christopher Schultz wrote:
>>>>> I've got a single-core Solr instance with something like 1M
>>>>> small documents in it. It contains user information for
>>>>> fast-lookups, and it gets updated any time relevant
>>>>> user-info changes.
>>>>> 
>>>>> Here's the basic info from the Core Dashboard:
>>>> 
>>>> <snip>
>>>> 
>>>>> I'm wondering how often it makes sense to "optimize" my
>>>>> index, because there is plenty of turnover of existing
>>>>> documents. That is, plenty of existing users update their
>>>>> info and therefore the Lucene index is being updated as
>>>>> well -- causing a document-delete and document-add
>>>>> operation to occur. My understanding is that leaves a lot
>>>>> of dead space over time, and I'm assuming that it might
>>>>> even slow things down as the ratio of useful data to total
>>>>> data is reduced.
>>>> 
>>>> The percentage of deleted documents here is fairly low. About
>>>> 7.6 percent.  Doing an optimize with deleted percentage that
>>>> low may not be worthwhile.
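
For reference, a deleted-document percentage like the one quoted above is presumably the share of the index's document slots (maxDoc) that no longer hold a live document (numDocs). A minimal sketch of that arithmetic follows; the actual dashboard stats were snipped from this thread, so the numbers in the usage note are hypothetical and chosen only to reproduce a figure of about 7.6 percent.

```python
def deleted_percentage(max_doc, num_docs):
    """Percentage of document slots occupied by deleted documents.

    max_doc  -- total slots in the index, including deleted documents
    num_docs -- live (searchable) documents
    """
    if max_doc <= 0:
        return 0.0  # empty index: nothing to report
    return 100.0 * (max_doc - num_docs) / max_doc
```

With hypothetical values, deleted_percentage(1_000_000, 924_000) comes out at roughly 7.6.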
>>>> 
>>>> On the other hand, it *would* improve performance by a little
>>>> bit to optimize.  For the index with the stats you mentioned,
>>>> you'd be going from 15 segments to one segment.  And with an
>>>> index size of under 300 MB, the optimize operation would
>>>> complete pretty quickly - likely a few minutes, maybe even
>>>> less than one minute.
> Okay. What I really don't want to do is interrupt normal
> operation.
> 
>>>>> Presumably, optimizing more often will reduce the time to 
>>>>> perform a single optimization operation, yes?
>>>> 
>>>> No, not really.  It depends on what documents are in the
>>>> index, not so much on whether an optimization was done
>>>> previously. Subsequent optimizes will take about as long as
>>>> the previous optimize did.
> 
> So, it's pretty much like GC promotion: the number of live objects
> is really the only thing that matters?
> 
>>>>> Anyhow, I'd like to know a few things:
>>>>> 
>>>>> 1. Is manually-triggered optimization even worth doing at
>>>>> all?
>>>> 
>>>> Maybe.  See how long it takes, how much impact it has on 
>>>> performance while it's happening, and see if you can get an 
>>>> estimate of how much extra performance you get from it once
>>>> it's done.  If the impact is low and/or the benefit is high,
>>>> then by all means, optimize regularly.
>>>> 
>>>>> 2. If so, how often? Or, maybe not "how often [in 
>>>>> hours/days/months]" but maybe "how often [in deletes,
>>>>> etc.]"?
>>>> 
>>>> For an index that size, I would say you should aim for an
>>>> interval between once an hour and once every 24 hours.  Set
>>>> up this timing based on what kind of impact the optimize
>>>> operation has on performance while it's occurring.  Might be
>>>> best to do it once a day at a low activity time, perhaps
>>>> 03:00.  With indexes slightly bigger than that, I was doing
>>>> an optimize once an hour. And for the bigger indexes, once a
>>>> day.
> 
> I was thinking once per day. AFAIK, this index hasn't been
> optimized since it was first built which was a few months ago.
> 
>>>>> 3. During the optimization operation, can clients still
>>>>> issue (read) queries? If so, will they wait until the
>>>>> optimization operation has completed?
>>>> 
>>>> Yes.  And as long as you don't use deleteByQuery, you can
>>>> even update the index while it's optimizing.  The
>>>> deleteByQuery operation will cause problems, especially when
>>>> the index gets large.  With your small index size, you might
>>>> not even notice the problems that mixing optimize and
>>>> deleteByQuery will cause. Replacing deleteByQuery with a
>>>> standard query to retrieve ID values and then doing a
>>>> deleteById will get rid of the problems that DBQ causes with
>>>> optimize.
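
The replacement described above (query for the IDs first, then delete by ID) could be sketched roughly as follows. This is only an illustration under assumptions: the uniqueKey field is assumed to be named "id", the host, core name, and example query are hypothetical, and the helpers only build the request URL and the JSON body rather than talking to a server.

```python
import json
import urllib.parse


def build_id_query_url(base_url, core, query, rows=1000):
    """URL for a standard /select query that returns only the id field."""
    params = urllib.parse.urlencode(
        {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    )
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def extract_ids(select_response):
    """Pull the id values out of a parsed JSON /select response."""
    return [doc["id"] for doc in select_response["response"]["docs"]]


def build_delete_by_id_body(ids):
    """JSON body for the update handler that deletes the given ids."""
    return json.dumps({"delete": list(ids)})
```

The body from build_delete_by_id_body would be POSTed to the core's /update handler with a JSON content type, followed by a commit. For large result sets the query would need paging, which this sketch leaves out.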
> 
> We aren't explicitly deleting anything, ever. The only deletes 
> occurring should be when we perform an update() on a document, and 
> Solr/Lucene automatically deletes the existing document with the
> same id.
> 
>>>>> 5. Is it possible to abort an optimization operation if
>>>>> it's taking too long, and simply discard the new data --
>>>>> basically, fall-back to the previously-existing index
>>>>> data?
>>>> 
>>>> I am not aware of a way to abort an optimize.  I suppose
>>>> there might be one ... but in general it doesn't sound like a
>>>> good idea to me, even if it's possible.
>>>> 
>>>>> 6. What's a good way to trigger an optimization operation?
>>>>> I didn't see anything directly in the web UI, but there is
>>>>> an "optimize" method in the Solr/J client. If I can
>>>>> fire-off a fire-and-forget "optimize" request via e.g. curl
>>>>> or similar tool rather than writing a Java client, that
>>>>> would be slightly more convenient for me.
>>>> 
>>>> Removal of the optimize button from the admin UI was
>>>> completely intentional.  It's such a tempting button ...
>>>> there's a tendency for people to say to themselves "of COURSE
>>>> I want to optimize my index, and make that indicator green!"
>>>> But optimizing a 50GB index will quite literally take HOURS
>>>> ... and will dramatically impact overall performance for that
>>>> whole time.  So we have removed the temptation.  We haven't
>>>> removed the ability to optimize, just the button in the UI.
> 
> Ack.
> 
>>>> You can use the optimize method in the SolrJ client if your
>>>> setup is already using SolrJ.  Doing the optimize with
>>>> something like curl is typically a little bit easier, and
>>>> won't present a problem. Either way, I would arrange for it
>>>> to happen in the background -- a separate thread in a SolrJ
>>>> program, or the & character on the commandline or in a script
>>>> when using something like curl.  Setting the "wait" options
>>>> on the optimize request to false didn't seem to actually lead
>>>> to an immediate return on the request and background
>>>> operation on the server.  Been wondering if I should file a
>>>> bug on that problem, if I can reproduce it with latest Solr.
> 
> I'd want to schedule this thing with cron, so curl is better for
> me. "nohup optimize &" is fine with me, especially if it will give
> me stats on how long the optimization actually took.
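
For a cron-driven job that also reports how long the optimize took, a small script is one alternative to a bare curl call. The sketch below assumes the standard update handler accepts optimize=true as a request parameter, along with maxSegments and waitSearcher; the host, port, and core name in the usage note are hypothetical. It blocks until Solr answers, so it should be run in the background or from cron as discussed above.

```python
import time
import urllib.parse
import urllib.request


def build_optimize_url(base_url, core, max_segments=1, wait_searcher=True):
    """Build the update-handler URL that requests an optimize (forced merge)."""
    params = urllib.parse.urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": "true" if wait_searcher else "false",
    })
    return f"{base_url.rstrip('/')}/{core}/update?{params}"


def run_optimize(base_url, core, timeout=3600):
    """Send the optimize request; return (HTTP status, elapsed seconds)."""
    url = build_optimize_url(base_url, core)
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # drain the body so the request fully completes
        status = response.status
    return status, time.monotonic() - started
```

Calling run_optimize("http://localhost:8983/solr", "users") from a cron-launched script and logging the returned elapsed time would give the timing stats. Try it on a dev or test copy first to see how long the merge takes.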
> 
> I have dev and test environments so I have plenty of places to 
> play-around. I can even load my production index into dev to see
> how long the whole 1M document index will take to optimize, though
> the number of segments in the index will be different, unless I
> just straight-up copy the index files from the disk. I probably
> won't do that because I'd prefer not to take-down the index long
> enough to take a copy.
> 
>>>> If deleteByQuery is an essential part of your indexing
>>>> process, then it would be prudent to avoid indexing while an
>>>> optimize is underway.  If you do a deleteByQuery during an
>>>> optimize, then all indexing from that point on will wait
>>>> until the optimize is done. On a big index, that could be
>>>> hours.
> 
> I'm assuming this isn't going to take very long to optimize, but
> we'll see.
> 
> You skipped question 4 which was "can I update my index during an 
> optimization", but you did mention in your answer to question 3
> ("can I still query during optimize?") that I "should" be able to
> update the index (e.g. add/update). Can you clarify why you said
> "should" instead of "can"?
> 
> Thanks again, -chris
> 
> 
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlv9gkQACgkQHPApP6U8
pFiVyw/9HeeDa4Y9jNxmTMyIT6I9/oIjqQ8/ZO4ZMF9bdFcoTtSczMSNQOqUyEl7
VgFDokD2qChuTcnWdXu8iHxPTe7sp5b3FevLC34rLzIhOVEHnFXSiemj4l5zocTk
9/gIJ2vIJVJZokav4vRuYn1TY0yXzipgLfusTtp/llT3I7VxlbxMzaIspnoQrfca
HSpHe7jqvh1DGtpXvEI42/3YJjW1D7fNB4aC1APDWQASjIFMByqcd1jFr5EkgPBI
q8/4UH4kk/I++uh9ZbfIKHrFYawTvuA1pgtIKUr2foZXeFNt8gEk1B+nmf0MwLtn
Yhlf4Ash3/79H51rfESXyEJKF8AEZHxero3vp1dbFpg91MMbRf+zfXA0AWM3N4Fy
elcm9kJO5vZGTkyPSy1UpouPX6FUP6bzHccFnIKvOIQ7I5XRvEjmOUXGAZrEd3V9
T0fAte6DTzS3SIRTdAAkvwiJLT5T9iirj2WsztjLwBdTkxdu6UKjnkWg/qo23dB8
SHNkf2T1o9H6QjFwvRSFe13DqrH2j3028kG79pxFZMZREPxlTv2rYbjTcjK1hRts
p826qzryxzT/2QFGkWTmSFVc0z1DF+I754Kk+1H0aKi3yZUfjDRWKh4hEOa9L2FS
o1dq1bLappX5aOWQsm06tBqfJkdLANQirL71HdPDNysQ7CxA2lo=
=w9Hg
-----END PGP SIGNATURE-----
