Hi Matthew
OCR is something that can be parallelized outside of Solr/Tika. Run one
OCR task per CPU core and you can keep all cores running at 100%. Write the
OCR output to a staging area on the filesystem.
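A rough sketch of that pattern in Java (the directory paths and the Tesseract
command are placeholders; this assumes the Tesseract CLI is installed and the
inputs are page images):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class ParallelOcr {
        public static void main(String[] args) throws Exception {
            Path inputDir = Paths.get("/data/scans");      // placeholder: directory of page images
            Path stagingDir = Paths.get("/data/ocr-out");  // placeholder: staging area for OCR text
            Files.createDirectories(stagingDir);

            // One OCR task per CPU core keeps every core busy without touching Solr.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            List<Path> images;
            try (Stream<Path> files = Files.list(inputDir)) {
                images = files.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            for (Path image : images) {
                pool.submit(() -> ocr(image, stagingDir));
            }
            pool.shutdown();
            pool.awaitTermination(7, TimeUnit.DAYS);
        }

        // Runs the Tesseract CLI on one image; Tesseract appends ".txt" to the output base,
        // so the text lands in the staging area ready to be indexed later.
        static void ocr(Path image, Path stagingDir) {
            String outBase = stagingDir.resolve(image.getFileName().toString()).toString();
            try {
                new ProcessBuilder("tesseract", image.toString(), outBase)
                        .inheritIO()
                        .start()
                        .waitFor();
            } catch (IOException e) {
                System.err.println("OCR failed for " + image + ": " + e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Once the text is staged, indexing into Solr becomes a fast, purely I/O-bound pass
that is independent of the OCR throughput.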
cheers -- Rick
On 2017-03-03 03:00 AM, Caruana, Matthew wrote:
This is the current config:

<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <writeLockTimeout>10000</writeLockTimeout>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
</indexConfig>
We index in bulk, so after indexing about 4 million documents over a week (OCR
takes a long time) we normally end up with about 60-70 segments with this
configuration.
On 3 Mar 2017, at 02:42, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
What do you have for merge configuration in solrconfig.xml? You should
be able to tune it to roughly whatever you want without
doing the grand optimize:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
Regards,
Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced
On 2 March 2017 at 16:37, Caruana, Matthew <mcaru...@icij.org> wrote:
Yes, we already do it outside Solr. See https://github.com/ICIJ/extract, which
we developed for this purpose. My guess is that the documents are very large,
as you say.
Optimising was always an attempt to bring the number of segments down from 60+.
Not sure how else to do that.
On 2 Mar 2017, at 7:42 pm, Michael Joyner <mich...@newsrx.com> wrote:
You can solve the disk space and time issues by specifying multiple segments to
optimize down to instead of a single segment.
When we reindex, we have to optimize or we end up with hundreds of segments and
terrible performance.
We optimize down to around 16 segments and it doesn't do the 3x disk space
thing, and it usually runs in a decent amount of time. (We have >50 million
articles in one of our Solr indexes.)
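For reference, a partial optimize like that can be issued through SolrJ's
maxSegments argument; a minimal sketch, assuming a standalone core at a
placeholder URL (the URL and the segment count are just examples):

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class PartialOptimize {
        public static void main(String[] args) throws SolrServerException, IOException {
            // Placeholder URL: point this at your own core or collection.
            String url = "http://localhost:8983/solr/mycore";
            try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                // optimize(waitFlush, waitSearcher, maxSegments):
                // merge down to at most 16 segments instead of forcing a single one.
                client.optimize(true, true, 16);
            }
        }
    }

The same request can be sent as an XML update message containing
<optimize maxSegments="16"/>.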
On 03/02/2017 10:20 AM, David Hastings wrote:
Agreed, and the fact that it takes three times the space is part of the reason it
takes so long: that 190 GB index ends up writing another 380 GB before it
compresses down and deletes the two leftover copies. It's a pretty hefty
operation.
On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:
The optimize operation is no longer recommended for Solr, as the
background merges have become a lot smarter.
It is an extremely expensive operation that can require up to three times
the amount of disk space while it runs.
This is not to say yours isn't a valid question; I am leaving that to
others to answer.
Regards,
Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced
On 2 March 2017 at 10:04, Caruana, Matthew <mcaru...@icij.org> wrote:
I’m currently performing an optimise operation on a ~190GB index with
about 4 million documents. The process has been running for hours.
This is surprising, because the machine is an EC2 r4.xlarge with four
cores and 30GB of RAM, 24GB of which is allocated to the JVM.
The load average has been steady at about 1.3. Memory usage is 25% or
less the whole time. iostat reports ~6% util.
What gives?
Running Solr 6.4.1.