Hi Rick,

We already do this with 30 eight-core machines running seven jobs each, working off a shared queue. See https://github.com/ICIJ/extract, which has been in production for almost two years. It was originally developed to OCR almost ten million PDFs and TIFFs from the Panama Papers.
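For anyone curious, the fan-out pattern is roughly the following (a minimal sketch, not the actual extract code; `ocr_one` is a placeholder for whatever OCR command you use — extract itself drives OCR through Tika):

```shell
# Rough sketch of the shared-queue fan-out: feed a list of document
# paths to one worker per core. `ocr_one` is a placeholder command.
ocr_one() {
  # Replace the echo with a real OCR invocation, e.g. tesseract on a
  # page image; write results to a staging area as Rick suggests below.
  echo "ocr: $1"
}
export -f ocr_one

# xargs acts as the shared queue; -P sets the worker count (one per core).
printf '%s\n' a.pdf b.pdf c.pdf d.pdf \
  | xargs -P "$(nproc)" -n 1 -I{} bash -c 'ocr_one "$1"' _ {}
```

Across machines the same idea holds, with the local pipe replaced by a shared queue (extract uses Redis for this).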
Matthew

> On 5 Mar 2017, at 3:42 pm, Rick Leir <rl...@leirtech.com> wrote:
>
> Hi Matthew
>
> OCR is something which can be parallelized outside of Solr/Tika. Do one OCR
> task per core, and you can have all cores running at 100%. Write the OCR
> output to a staging area in the filesystem.
>
> cheers -- Rick
>
>> On 2017-03-03 03:00 AM, Caruana, Matthew wrote:
>> This is the current config:
>>
>> <indexConfig>
>>   <ramBufferSizeMB>100</ramBufferSizeMB>
>>   <writeLockTimeout>10000</writeLockTimeout>
>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
>>   <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>     <int name="maxMergeAtOnce">10</int>
>>     <int name="segmentsPerTier">10</int>
>>   </mergePolicyFactory>
>> </indexConfig>
>>
>> We index in bulk, so after indexing about 4 million documents over a week
>> (OCR takes long) we normally end up with about 60-70 segments with this
>> configuration.
>>
>>> On 3 Mar 2017, at 02:42, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>
>>> What do you have for merge configuration in solrconfig.xml? You should
>>> be able to tune it to - approximately - whatever you want without
>>> doing the grand optimize:
>>> https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
>>>
>>> Regards,
>>>   Alex.
>>> ----
>>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>>
>>>> On 2 March 2017 at 16:37, Caruana, Matthew <mcaru...@icij.org> wrote:
>>>> Yes, we already do it outside Solr. See https://github.com/ICIJ/extract,
>>>> which we developed for this purpose. My guess is that the documents are
>>>> very large, as you say.
>>>>
>>>> Optimising was always an attempt to bring down the number of segments
>>>> from 60+. Not sure how else to do that.
>>>>
>>>>> On 2 Mar 2017, at 7:42 pm, Michael Joyner <mich...@newsrx.com> wrote:
>>>>>
>>>>> You can solve the disk space and time issues by specifying multiple
>>>>> segments to optimize down to, instead of a single segment.
>>>>>
>>>>> When we reindex, we have to optimize or we end up with hundreds of
>>>>> segments and very poor performance.
>>>>>
>>>>> We optimize down to about 16 segments; that avoids the 3x disk space
>>>>> problem and usually runs in a reasonable amount of time. (We have >50
>>>>> million articles in one of our Solr indexes.)
>>>>>
>>>>>> On 03/02/2017 10:20 AM, David Hastings wrote:
>>>>>> Agreed, and the fact that it takes three times the space is part of
>>>>>> the reason it takes so long: that 190 GB index ends up writing another
>>>>>> 380 GB before it compresses down and deletes the leftover files. It's
>>>>>> a pretty hefty operation.
>>>>>>
>>>>>> On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch
>>>>>> <arafa...@gmail.com> wrote:
>>>>>>
>>>>>>> The optimize operation is no longer recommended for Solr, as the
>>>>>>> background merges have become a lot smarter.
>>>>>>>
>>>>>>> It is an extremely expensive operation that can require up to three
>>>>>>> times the amount of disk space during processing.
>>>>>>>
>>>>>>> This is not to say yours isn't a valid question, which I am leaving
>>>>>>> to others to respond to.
>>>>>>>
>>>>>>> Regards,
>>>>>>>   Alex.
>>>>>>> ----
>>>>>>> http://www.solr-start.com/ - Resources for Solr users, new and
>>>>>>> experienced
>>>>>>>
>>>>>>>> On 2 March 2017 at 10:04, Caruana, Matthew <mcaru...@icij.org> wrote:
>>>>>>>> I’m currently performing an optimise operation on a ~190 GB index
>>>>>>>> with about 4 million documents. The process has been running for
>>>>>>>> hours.
>>>>>>>>
>>>>>>>> This is surprising, because the machine is an EC2 r4.xlarge with
>>>>>>>> four cores and 30 GB of RAM, 24 GB of which is allocated to the JVM.
>>>>>>>>
>>>>>>>> The load average has been steady at about 1.3. Memory usage is 25%
>>>>>>>> or less the whole time. iostat reports ~6% util.
>>>>>>>>
>>>>>>>> What gives?
>>>>>>>>
>>>>>>>> Running Solr 6.4.1.
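For the record, the partial optimize Michael describes maps to the `maxSegments` parameter on Solr's update handler. Something like the following (host, port and core name are placeholders for your own setup):

```shell
# Merge down to at most 16 segments instead of 1; this avoids most of
# the ~3x transient disk cost of a full optimize.
# "localhost:8983" and "collection1" are example values.
curl "http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=16"
```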