Well, I haven’t had to deal with a problem that size, but it seems to me that
you have little alternative except to throw more computer hardware at it. For
the job I did, I ran OCR to convert PDFs to searchable PDFs outside the indexing
workflow. I used the pdftotext utility to extract text from each PDF. If the
extracted text was less than 1% of the document size, I assumed the file needed
OCR; otherwise I didn’t bother. You could look at a more sophisticated method to
determine whether OCR is necessary. Doing it outside the indexing stream means
you can use different hardware for OCR. Converting to searchable PDF means you
do the OCR only once: a reindex doesn’t need to repeat it.
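
A rough sketch of that 1% check in Python, in case it helps. It assumes pdftotext is on the PATH, and it measures both the extracted text and the PDF in bytes, which is my guess at a reasonable way to compare them rather than a record of exactly what I ran. The names text_is_sparse and needs_ocr are made up for illustration.

```python
import os
import subprocess

OCR_THRESHOLD = 0.01  # the 1% cutoff described above


def text_is_sparse(text_bytes: int, pdf_bytes: int,
                   threshold: float = OCR_THRESHOLD) -> bool:
    """Return True when the extracted text is under `threshold` of the PDF size."""
    if pdf_bytes <= 0:
        return False  # empty file: nothing to OCR
    return text_bytes / pdf_bytes < threshold


def needs_ocr(pdf_path: str, threshold: float = OCR_THRESHOLD) -> bool:
    """Guess whether a PDF lacks a text layer by comparing text size to file size."""
    # Passing "-" as the output file makes pdftotext write the text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    extracted = result.stdout.strip()  # drop form feeds and trailing whitespace
    return text_is_sparse(len(extracted), os.path.getsize(pdf_path), threshold)
```

Anything for which needs_ocr returns True would go to the separate OCR machines; the rest can go straight to indexing. The threshold is a parameter because image-heavy PDFs that already have a text layer may need a lower cutoff.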