Converting from PDF to text is embarrassingly parallel. You can throw as many 
machines at it as you want. This is a great time to use a cloud computing 
service. Need 1000 machines? No problem.
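
A minimal sketch of that fan-out on one machine, in Python. It assumes Tika App is run as a subprocess with its --text option; the jar path "tika-app.jar" and the helper names tika_extract and convert_all are hypothetical, and whether OCR actually runs depends on Tesseract being installed where Tika can find it. Each PDF is converted independently, so the same list can be split across as many machines as you like.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tika_extract(pdf_path, tika_jar="tika-app.jar"):
    """Extract plain text from one PDF by running Tika App.

    The jar path is an assumption; adjust it to your installation.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def convert_all(pdf_paths, out_dir, extract=tika_extract, workers=4):
    """Convert PDFs to .txt files in out_dir, several at a time.

    Threads are enough here because each worker just waits on an
    external process. Output names use the PDF's stem, so this sketch
    assumes file names are unique across the input list.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def convert_one(pdf_path):
        target = out_dir / (Path(pdf_path).stem + ".txt")
        if target.exists():
            # Already converted on an earlier run; skip the expensive OCR.
            return target
        tmp = target.with_suffix(".tmp")
        tmp.write_text(extract(pdf_path), encoding="utf-8")
        tmp.replace(target)  # rename only after a complete write
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, pdf_paths))
```

The resulting text files can then be indexed into Solr as a separate, much cheaper step. If the PDFs split evenly, a job that would take 100 weeks on one machine takes roughly 100/N weeks on N machines, so about 50 machines brings it back to the original 2 weeks.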

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 28, 2017, at 2:52 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> 
> Hi,
> 
> Do you have any suggestions for coping with the expensive process of
> indexing documents that require OCR?
> 
> In my current situation, indexing takes about 2 weeks to complete. If
> the average indexing speed becomes, say, 50 times slower, it will take
> 100 weeks to index the same number of documents, which is not viable. I
> have several terabytes of PDF documents to index for the actual data, and
> many of them are scanned images, which require OCR.
> 
> Regards,
> Edwin
> 
> 
> On 28 March 2017 at 13:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> 
>> Yes, the sample documents are not very big. Also, the samples are a
>> mixture of documents that consist of inline images and documents that
>> are searchable (text extractable without OCR).
>> 
>> I suppose only the documents that require OCR slow down the indexing?
>> That would explain why the overall average is only 10 times slower.
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 28 March 2017 at 12:06, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> 
>>> Only by 10? You must have quite small documents. OCR is an extremely
>>> expensive process. Indexing is trivial by comparison. For the quite large
>>> documents I am working with, OCR can be 100 times slower than indexing a
>>> PDF that is searchable (text extractable without OCR).
>>> 
>>> -----Original Message-----
>>> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
>>> Sent: Tuesday, 28 March 2017 4:13 p.m.
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing speed reduced significantly with OCR
>>> 
>>> Hi,
>>> 
>>> Is the indexing speed of Solr reduced significantly when we use
>>> Tesseract OCR to extract text from scanned inline images in PDFs?
>>> 
>>> I found that after I implemented the solution to extract text from those
>>> scanned images, indexing became about 10 times slower.
>>> 
>>> I'm using Solr 6.4.2, and Tika App 1.1.4.
>>> 
>>> Regards,
>>> Edwin
>>> 
>> 
>> 
