Converting from PDF to text is embarrassingly parallel: each document is converted independently of every other, so you can throw as many machines at it as you want. This is a great time to use a cloud computing service. Need 1000 machines? No problem.
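
As a rough illustration of what "embarrassingly parallel" means here, the following is a minimal sketch rather than a tested production setup. It assumes a local Tika App jar at a hypothetical path (TIKA_JAR) and uses Tika's --text option. The helper names convert_all and weeks_needed are made up for this example. Each PDF is converted on its own, so the same pattern scales from a few workers on one box to separate machines that each take a slice of the file list.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

TIKA_JAR = "tika-app.jar"  # assumption: path to a local Tika App jar


def pdf_to_text(pdf_path):
    """Extract text from one PDF by shelling out to Tika App (assumed setup)."""
    result = subprocess.run(
        ["java", "-jar", TIKA_JAR, "--text", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def convert_all(pdf_paths, out_dir, convert=pdf_to_text, workers=8):
    """Convert PDFs independently and write one .txt file per input.

    Threads are enough here because the heavy work runs in child
    processes; inputs that share a file stem would overwrite each other.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def one(pdf_path):
        target = out_dir / (Path(pdf_path).stem + ".txt")
        target.write_text(convert(pdf_path), encoding="utf-8")
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in the same order as the inputs
        return list(pool.map(one, pdf_paths))


def weeks_needed(single_machine_weeks, machines):
    """Ideal wall-clock estimate, ignoring startup and transfer overhead."""
    return single_machine_weeks / machines
```

Under that ideal estimate, the 100 weeks mentioned in the quoted message comes down to about two weeks on 50 machines (weeks_needed(100, 50)). The extracted .txt files can then be indexed into Solr as ordinary text, which is the cheap part.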

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Mar 28, 2017, at 2:52 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> Hi,
>
> Do you have suggestions that we can do to cope with the expensive process
> of indexing documents which requires OCR.
>
> For my current situation, the indexing takes about 2 weeks to complete. If
> the average indexing speed is say to be 50 times slower, it means it will
> require 100 weeks to index the same amount of documents, which is not
> viable. I have several terabytes of PDF documents to index for the actual
> data, and many of them are scanned image, which requires OCR.
>
> Regards,
> Edwin
>
>
> On 28 March 2017 at 13:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Yes, the sample document sizes are not very big. And also, the sample
>> documents have a mixture of documents that consists of inline images, and
>> also documents which are searchable (text extractable without OCR)
>>
>> I suppose only those documents which requires OCR will slow down the
>> indexing? Which is why the total average is only slowing down by 10 times.
>>
>> Regards,
>> Edwin
>>
>>
>> On 28 March 2017 at 12:06, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>>
>>> Only by 10? You must have quite small documents. OCR is extremely
>>> expensive process. Indexing is trivial by comparison. For quite large
>>> documents I am working with OCR can be 100 times slower than indexing a PDF
>>> that is searchable (text extractable without OCR).
>>>
>>> -----Original Message-----
>>> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
>>> Sent: Tuesday, 28 March 2017 4:13 p.m.
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing speed reduced significantly with OCR
>>>
>>> Hi,
>>>
>>> Does the indexing speed of Solr reduced significantly when we are using
>>> Tesseract OCR to extract scanned inline images from PDF?
>>>
>>> I found that after I implement the solution to extract those scanned
>>> images from PDF, the indexing speed is now slower by almost more than 10
>>> times.
>>>
>>> I'm using Solr 6.4.2, and Tika App 1.1.4.
>>>
>>> Regards,
>>> Edwin
>>> Notice: This email and any attachments are confidential and may not be
>>> used, published or redistributed without the prior written consent of the
>>> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
>>> received in error please destroy and immediately notify GNS Science. Do not
>>> copy or disclose the contents.
>>>
>>
>>