Re: Indexing speed reduced significantly with OCR

Zheng Lin Edwin Yeo Wed, 29 Mar 2017 20:54:12 -0700

Thanks for your reply.

From what I see, getting more hardware to do the OCR is inevitable?


Even if we run the OCR outside of the Solr indexing stream, it will still
take a long time to process if it runs on just one machine. And we still
need to wait for the OCR to finish converting before we can index into
Solr.

Regards,
Edwin


On 29 March 2017 at 04:40, Phil Scadden <p.scad...@gns.cri.nz> wrote:

> Well I haven’t had to deal with a problem that size, but it seems to me
> that you have little alternative except to throw more computer hardware at
> it. For the job I did, I OCRed to convert PDF to searchable PDF outside the
> indexing workflow. I used the pdftotext utility to extract text from each
> PDF. If the extracted text was <1% of the document size, I assumed it needed
> to be OCRed; otherwise I didn’t bother. You could look at a more
> sophisticated method to determine whether OCR was necessary. Doing it
> outside the indexing stream means you can use different hardware for OCR.
> Converting to searchable PDF means you do it only once - a reindex doesn’t
> need to do OCR.
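
For reference, the check Phil describes (run pdftotext, then OCR only when
the extracted text is under 1% of the file size) could look roughly like the
sketch below. The function names and the default threshold parameter are my
own assumptions, not anything from his actual workflow. It also assumes the
pdftotext binary is on the PATH and relies on its convention that "-" as the
output file writes the text to stdout.

```python
import os
import subprocess


def needs_ocr(text_bytes, pdf_bytes, threshold=0.01):
    """Return True when extracted text is under `threshold` of the PDF size.

    The 1% default mirrors the rule of thumb in the quoted message; it is a
    heuristic, not a guarantee that the PDF is a scanned image.
    """
    if pdf_bytes <= 0:
        # Empty or unreadable file: nothing was extracted, so flag it.
        return True
    return (text_bytes / pdf_bytes) < threshold


def extracted_text_size(pdf_path):
    """Run pdftotext and return the size in bytes of the extracted text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    # Strip whitespace so form feeds and blank pages do not count as text.
    return len(result.stdout.strip())


def pdf_needs_ocr(pdf_path, threshold=0.01):
    """Hypothetical helper combining the two steps for one file on disk."""
    return needs_ocr(
        extracted_text_size(pdf_path),
        os.path.getsize(pdf_path),
        threshold,
    )
```

Only the files for which pdf_needs_ocr() returns True would then be handed
to the OCR step, so PDFs that already have a text layer skip it entirely.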
