Hi Rick,

We already do this with 30 eight-core machines running seven jobs each, working
off a shared queue. See https://github.com/ICIJ/extract, which has been in
production for almost two years. It was originally developed to OCR almost
ten million PDFs and TIFFs from the Panama Papers.
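
For anyone curious, the shape of it is just a pool of OCR workers draining a
shared queue. Here is a minimal single-machine sketch using Tika's Tesseract
integration (to be clear, this is not extract's actual code, and the
in-memory BlockingQueue stands in for the queue we share across machines):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Stream;

    public class OcrWorkers {

        private static final int JOBS = 7; // jobs per machine

        public static void main(String[] args) throws Exception {
            // Stand-in for the shared queue: in production the queue is
            // shared across machines, not held in memory on one of them.
            final BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
            try (Stream<Path> paths = Files.list(Paths.get(args[0]))) {
                paths.filter(Files::isRegularFile).forEach(queue::add);
            }

            ExecutorService pool = Executors.newFixedThreadPool(JOBS);
            for (int i = 0; i < JOBS; i++) {
                pool.submit(() -> {
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext context = new ParseContext();
                    // Tika shells out to Tesseract for image types.
                    // (OCR of images embedded in PDFs also needs
                    // PDFParserConfig#setExtractInlineImages.)
                    context.set(TesseractOCRConfig.class, new TesseractOCRConfig());

                    Path path;
                    while ((path = queue.poll()) != null) {
                        try (InputStream in = Files.newInputStream(path)) {
                            // No write limit; OCR output can be large.
                            BodyContentHandler text = new BodyContentHandler(-1);
                            parser.parse(in, text, new Metadata(), context);
                            // Hand text.toString() to the indexer here.
                        } catch (Exception e) {
                            System.err.println(path + ": " + e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }

extract layers a lot on top of this (distributed queueing, reporting, output
straight to Solr), but the queue-plus-workers shape is the same.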

Matthew

> On 5 Mar 2017, at 3:42 pm, Rick Leir <rl...@leirtech.com> wrote:
> 
> Hi Matthew
> 
> OCR is something which can be parallelized outside of Solr/Tika. Do one OCR 
> task per core, and you can have all cores running at 100%. Write the OCR 
> output to a staging area in the filesystem.
> 
> cheers -- Rick
> 
> 
>> On 2017-03-03 03:00 AM, Caruana, Matthew wrote:
>> This is the current config:
>> 
>>         <indexConfig>
>>                 <ramBufferSizeMB>100</ramBufferSizeMB>
>>                 <writeLockTimeout>10000</writeLockTimeout>
>>                 <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
>>                 <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>                         <int name="maxMergeAtOnce">10</int>
>>                         <int name="segmentsPerTier">10</int>
>>                 </mergePolicyFactory>
>>         </indexConfig>
>> 
>> We index in bulk, so after indexing about 4 million documents over a week
>> (OCR takes a long time) we normally end up with about 60-70 segments with
>> this configuration.
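>> 
>> If we wanted the merge policy to keep that number down on its own, my
>> understanding is that lowering segmentsPerTier is the main knob, e.g.
>> (illustrative values only, not tested):
>> 
>>         <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>                 <int name="maxMergeAtOnce">10</int>
>>                 <int name="segmentsPerTier">4</int>
>>         </mergePolicyFactory>
>> 
>> though presumably at the cost of more merge I/O while indexing.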
>> 
>>> On 3 Mar 2017, at 02:42, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>> 
>>> What do you have for merge configuration in solrconfig.xml? You should
>>> be able to tune it to - approximately - whatever you want without
>>> doing the grand optimize:
>>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
>>> 
>>> Regards,
>>>   Alex.
>>> ----
>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>> 
>>> 
>>>> On 2 March 2017 at 16:37, Caruana, Matthew <mcaru...@icij.org> wrote:
>>>> Yes, we already do it outside Solr. See https://github.com/ICIJ/extract,
>>>> which we developed for this purpose. My guess is that the documents are
>>>> very large, as you say.
>>>> 
>>>> Optimising was always just an attempt to bring the number of segments
>>>> down from 60+. I'm not sure how else to do that.
>>>> 
>>>>> On 2 Mar 2017, at 7:42 pm, Michael Joyner <mich...@newsrx.com> wrote:
>>>>> 
>>>>> You can solve the disk space and time issues by specifying multiple 
>>>>> segments to optimize down to instead of a single segment.
>>>>> 
>>>>> When we reindex, we have to optimize or we end up with hundreds of
>>>>> segments and horrible performance.
>>>>> 
>>>>> We optimize down to 16 segments or so; that avoids the 3x disk space
>>>>> problem and usually finishes in a decent amount of time. (We have over
>>>>> 50 million articles in one of our Solr indexes.)
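>>>>>
>>>>> (For reference, that is the maxSegments parameter on the optimize
>>>>> command; e.g. posting the following to the update handler, a sketch
>>>>> only, pick your own segment count:
>>>>>
>>>>>         <optimize maxSegments="16"/>
>>>>>
>>>>> or the equivalent optimize=true&maxSegments=16 request parameters.)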
>>>>> 
>>>>> 
>>>>>> On 03/02/2017 10:20 AM, David Hastings wrote:
>>>>>> Agreed, and the fact that it takes three times the space is part of the
>>>>>> reason it takes so long: that 190GB index ends up writing another 380GB
>>>>>> before it compresses down and deletes the leftover files. It's a pretty
>>>>>> hefty operation.
>>>>>> 
>>>>>> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>> 
>>>>>>> The optimize operation is no longer recommended for Solr, as the
>>>>>>> background merges have got a lot smarter.
>>>>>>> 
>>>>>>> It is an extremely expensive operation that can require up to three
>>>>>>> times the amount of disk space during processing.
>>>>>>> 
>>>>>>> This is not to say yours isn't a valid question; that part I am
>>>>>>> leaving to others to respond to.
>>>>>>> 
>>>>>>> Regards,
>>>>>>>   Alex.
>>>>>>> ----
>>>>>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>>>>>> 
>>>>>>> 
>>>>>>>> On 2 March 2017 at 10:04, Caruana, Matthew <mcaru...@icij.org> wrote:
>>>>>>>> I’m currently performing an optimise operation on a ~190GB index with
>>>>>>>> about 4 million documents. The process has been running for hours.
>>>>>>>> 
>>>>>>>> This is surprising, because the machine is an EC2 r4.xlarge with four
>>>>>>>> cores and 30GB of RAM, 24GB of which is allocated to the JVM.
>>>>>>>> 
>>>>>>>> The load average has been steady at about 1.3. Memory usage is 25% or
>>>>>>>> less the whole time. iostat reports ~6% util.
>>>>>>>> 
>>>>>>>> What gives?
>>>>>>>> 
>>>>>>>> Running Solr 6.4.1.
> 
