Re: What is the bottleneck for an optimise operation?

Erick Erickson Fri, 03 Mar 2017 09:24:56 -0800

Well, historically during the really old days, optimize made a major
difference. As Lucene evolved that difference was smaller, and in
recent _anecdotal_ reports it's up to 10% improvement in query
processing with the usual caveats that there are a ton of variables
here, including and especially how frequently the index is updated
so since yours is rarely updated take that number with a huge grain
of salt. It's an open question whether a 10% increase (assuming that's
what you get) is noticeable. I mean if your query latency is 250 ms,
who's going to notice? If it's 10 seconds, that's a different story....


There's a subtext here about whether your query load is enough that
a 10% improvement can ripple, but that's only under pretty high query
loads.

Best,
Erick

On Fri, Mar 3, 2017 at 9:03 AM, Caruana, Matthew <mcaru...@icij.org> wrote:
> We index rarely and in bulk as we’re an organisation that deals in enabling 
> access to leaked documents for journalists.
>
> The indexes are mostly static for 99% of the year. We only optimise after 
> reindexing due to schema changes or when
> we have a new leak.
>
> Our workflow is to index on a staging server, optimise then trigger 
> replication to a production instance of Solr. We cannot
> index straight to production as extracting text from documents is expensive 
> (lots of EC2 machines running Extract<https://github.com/ICIJ/extract>) and 
> we need
> to really hammer the Solr server with updates (up to 250 concurrent update 
> request at some times).
>
> I’ve never done benchmark tests, but it’s an interesting question. I always 
> worked on the assumption that if the optimise
> operation exists then there must be a reason. Also something tells me that 
> having your index spread over 70 files must be bad.
>
> The OOM error is certainly due to something else as it happens when we try 
> indexing text extracted from multi-gigabyte
> archives.
>
> On 3 Mar 2017, at 17:45, Erick Erickson 
> <erickerick...@gmail.com<mailto:erickerick...@gmail.com>> wrote:
>
> Matthew:
>
> What load testing have you done on optimized .vs. unoptimized indexes?
> Is there enough of a performance gain to be worth the trouble? Toke's
> indexes are pretty static, and in his situation it's worth the effort.
> Before spending a lot of cycles on making optimization
> work/understanding the ins and outs I'd really recommend you see if
> any performance gain is worth it ;)...
>
> And as I mentioned earlier, optimizing is unlikely to be related to
> OOMs during indexing. You never know of course....
>
> Best,
> Erick
>
> On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew 
> <mcaru...@icij.org<mailto:mcaru...@icij.org>> wrote:
> Thank you, you’re right - only one of the four cores is hitting 100%. This is 
> the correct answer. The bottleneck is CPU exacerbated by an absence of 
> parallelisation.
>
> On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk<mailto:t...@kb.dk>> wrote:
>
> On Thu, 2017-03-02 at 15:39 +0000, Caruana, Matthew wrote:
> Thank you. The question remains however, if this is such a hefty
> operation then why is it walking to the destination instead of
> running, so to speak?
>
> We only do optimize on an old Solr 4.10 setup, but for that we have
> plenty of experience. At least for single-shard, and at least for most
> of the work, optimize is a single-threaded process: It takes us ~8
> hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
> 100% and the other ones not doing anything.
>
> The machine load number is a bit fuzzy, but if you do a top doing
> optimization, my guess is that you will see the same thing as we do:
> Only 1 CPU-core working.
> --
> Toke Eskildsen, Royal Danish Library
>
>

Re: What is the bottleneck for an optimise operation?

Reply via email to