Rahul:

bq: we don't want the index sizes to grow too large and auto optimize to kick in

That's not quite what's going on. There is no "auto optimize". What there is, is background merging that takes _some_ segments and merges them together. Very occasionally this amounts to a full optimize, if "some" happens to mean all the segments.

bq: recovery takes a bit more time when it is not optimized

I'd be interested in formal measurements here. A recovery that copies the _entire_ index down from the leader shouldn't really differ much between an optimized and a non-optimized index, but all things are possible. If the recovery is a "peer sync" it shouldn't matter at all.

If you're continually adding documents that _replace_ older documents, optimizing will recover any "holes" left by the old updated docs. An update is really a mark-as-deleted for the old version and a re-index of the new. Since segments are write-once, the old data is left there until the segment is merged.

Now, one of the bits of information that goes into deciding whether to merge a segment or not is its size. Another is the percentage of deleted docs. When you optimize, you get one huge segment. You then have to update a lot of docs before that segment has a large enough percentage of deleted documents to be merged, and until then the deleted docs waste space and memory.

So it's a tradeoff. But if you're getting satisfactory performance from what you have now, there's no reason to change.

Here's a wonderful video about the process. You want the third one down (TieredMergePolicy), as that's the default.

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

On Sun, Dec 20, 2015 at 8:26 PM, Rahul Ramesh <rr.ii...@gmail.com> wrote:

> Hi Erick,
> We index several million documents per day and we optimize every day
> when the relative load is low. The reason we optimize is that we don't want
> the index sizes to grow too large and auto optimize to kick in. When auto
> optimize kicks in, it results in unpredictable performance as it is CPU and
> IO intensive.
>
> In older Solr (4.2), when the segment size grew too large, insertion used
> to fail. Have we seen this problem in SolrCloud?
>
> Also, we have observed that recovery takes a bit more time when the index
> is not optimized. We don't have any quantitative measurement for this; it's
> just an observation. Is this observation correct?
>
> If we optimize every day, the indexes will not be skewed, right?
>
> Please let me know if my understanding is correct.
>
> Regards,
> Rahul
>
> On Mon, Dec 21, 2015 at 9:54 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> You'll probably have to shard before you get to the TB range. At that
>> point, all the optimization is done individually on each shard so it
>> really doesn't matter how many shards you have.
>>
>> Just issuing
>> http://solr:port/solr/collection/update?optimize=true
>>
>> is sufficient; that'll forward the optimize command to all the shards
>> in the collection.
>>
>> Best,
>> Erick
>>
>> On Sun, Dec 20, 2015 at 8:19 PM, Zheng Lin Edwin Yeo
>> <edwinye...@gmail.com> wrote:
>> > Thanks for your information Erick.
>> >
>> > We have yet to decide how often we will update the index to include new
>> > documents that come in. Let's say we update the index once a day; then
>> > when the index is updated, we do the optimization (this will be done at
>> > night when there are not many users using the system).
>> > But my index size will probably grow quite big (it could potentially go
>> > above 1TB in the future), so does that have to be taken into
>> > consideration too?
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 21 December 2015 at 12:12, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Much depends on how often the index is updated. If your index only
>> >> changes, say, once a day then it's probably a good idea. If you're
>> >> constantly updating your index, then I'd recommend that you do _not_
>> >> optimize.
>> >>
>> >> Optimizing will create one large segment. That segment will be
>> >> unlikely to be merged for quite a while, since it is so large relative
>> >> to the other segments, resulting in significant wasted space. So if
>> >> you're regularly indexing documents that _replace_ existing documents,
>> >> this will skew your index.
>> >>
>> >> Bottom line:
>> >> If you have a relatively static index that you can build and then use
>> >> for an extended time (as in 12 hours plus) it can be worth the time to
>> >> optimize. Otherwise I wouldn't bother.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Sun, Dec 20, 2015 at 7:57 PM, Zheng Lin Edwin Yeo
>> >> <edwinye...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I would like to find out: would it be good to write a script to
>> >> > auto-optimize the indexes at a certain time every day? Is there
>> >> > any advantage to doing so?
>> >> >
>> >> > I found that optimization can reduce the index size by quite a
>> >> > significant amount, and allows searches of the index to run faster.
>> >> > But will there be an advantage if we do the optimization every day?
>> >> >
>> >> > I'm using Solr 5.3.0
>> >> >
>> >> > Regards,
>> >> > Edwin
>> >>
>>
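
To put rough numbers on the point made at the top of the thread (one huge optimized segment needs far more updates before it holds a large percentage of deleted documents), here is a minimal Python sketch. The 20% figure and the segment sizes are illustrative assumptions, not values taken from TieredMergePolicy, and deletes_needed is a hypothetical helper, not a Solr or Lucene API.

```python
def deletes_needed(segment_docs: int, deleted_pct: int) -> int:
    """Docs that must be marked deleted before `deleted_pct` percent
    of a segment holding `segment_docs` documents is deleted.

    Integer arithmetic with rounding up, so no floating-point surprises.
    """
    return (segment_docs * deleted_pct + 99) // 100


# Hypothetical numbers, for illustration only.
ASSUMED_DELETED_PCT = 20          # assumed "large percentage" of deleted docs
ordinary_segment = 100_000        # a typical mid-sized segment (assumed)
optimized_segment = 10_000_000    # the single segment left after an optimize (assumed)

for label, size in (("ordinary", ordinary_segment),
                    ("optimized", optimized_segment)):
    needed = deletes_needed(size, ASSUMED_DELETED_PCT)
    print(f"{label:>9} segment of {size:,} docs: "
          f"{needed:,} replaced docs before {ASSUMED_DELETED_PCT}% is deleted")
```

Under these assumed numbers the ordinary segment reaches the mark after 20,000 replaced documents, while the optimized segment needs 2,000,000, and the superseded copies stay on disk in the meantime. That is the tradeoff described above for indexes that are updated continually.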