Well, historically during the really old days, optimize made a major difference. As Lucene evolved that difference was smaller, and in recent _anecdotal_ reports it's up to 10% improvement in query processing with the usual caveats that there are a ton of variables here, including and especially how frequently the index is updated so since yours is rarely updated take that number with a huge grain of salt. It's an open question whether a 10% increase (assuming that's what you get) is noticeable. I mean if your query latency is 250 ms, who's going to notice? If it's 10 seconds, that's a different story....
There's a subtext here about whether your query load is enough that a 10% improvement can ripple, but that's only under pretty high query loads. Best, Erick On Fri, Mar 3, 2017 at 9:03 AM, Caruana, Matthew <mcaru...@icij.org> wrote: > We index rarely and in bulk as we’re an organisation that deals in enabling > access to leaked documents for journalists. > > The indexes are mostly static for 99% of the year. We only optimise after > reindexing due to schema changes or when > we have a new leak. > > Our workflow is to index on a staging server, optimise then trigger > replication to a production instance of Solr. We cannot > index straight to production as extracting text from documents is expensive > (lots of EC2 machines running Extract<https://github.com/ICIJ/extract>) and > we need > to really hammer the Solr server with updates (up to 250 concurrent update > request at some times). > > I’ve never done benchmark tests, but it’s an interesting question. I always > worked on the assumption that if the optimise > operation exists then there must be a reason. Also something tells me that > having your index spread over 70 files must be bad. > > The OOM error is certainly due to something else as it happens when we try > indexing text extracted from multi-gigabyte > archives. > > On 3 Mar 2017, at 17:45, Erick Erickson > <erickerick...@gmail.com<mailto:erickerick...@gmail.com>> wrote: > > Matthew: > > What load testing have you done on optimized .vs. unoptimized indexes? > Is there enough of a performance gain to be worth the trouble? Toke's > indexes are pretty static, and in his situation it's worth the effort. > Before spending a lot of cycles on making optimization > work/understanding the ins and outs I'd really recommend you see if > any performance gain is worth it ;)... > > And as I mentioned earlier, optimizing is unlikely to be related to > OOMs during indexing. You never know of course.... > > Best, > Erick > > On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew > <mcaru...@icij.org<mailto:mcaru...@icij.org>> wrote: > Thank you, you’re right - only one of the four cores is hitting 100%. This is > the correct answer. The bottleneck is CPU exacerbated by an absence of > parallelisation. > > On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk<mailto:t...@kb.dk>> wrote: > > On Thu, 2017-03-02 at 15:39 +0000, Caruana, Matthew wrote: > Thank you. The question remains however, if this is such a hefty > operation then why is it walking to the destination instead of > running, so to speak? > > We only do optimize on an old Solr 4.10 setup, but for that we have > plenty of experience. At least for single-shard, and at least for most > of the work, optimize is a single-threaded process: It takes us ~8 > hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near > 100% and the other ones not doing anything. > > The machine load number is a bit fuzzy, but if you do a top doing > optimization, my guess is that you will see the same thing as we do: > Only 1 CPU-core working. > -- > Toke Eskildsen, Royal Danish Library > >