Well, I haven’t had to deal with a problem that size, but it seems to me that 
you have little alternative except to throw more computer hardware at it. For 
the job I did, I ran OCR to convert PDFs to searchable PDFs outside the 
indexing workflow. I used the pdftotext utility to extract text from each PDF. 
If the extracted text was <1% of the document size, I assumed the document 
needed OCR; otherwise I didn’t bother. You could look at a more sophisticated 
method to determine whether OCR is necessary. Doing it outside the indexing 
stream means you can use different hardware for OCR. Converting to searchable 
PDF means you do it only once - a reindex doesn’t need to redo the OCR.
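
For what it's worth, the 1% check could be sketched roughly like this in Python. This is not the script I actually used, just an illustration of the idea. It assumes pdftotext is on the PATH and that "document size" means the PDF file size in bytes; the function names and the OCR_THRESHOLD constant are made up for the example.

```python
import subprocess
from pathlib import Path

# Assumption: "1% of document size" is measured against the PDF file size in bytes.
OCR_THRESHOLD = 0.01


def needs_ocr(text_size: int, pdf_size: int, threshold: float = OCR_THRESHOLD) -> bool:
    """Return True when the extracted text is smaller than `threshold` of the PDF size."""
    if pdf_size <= 0:
        # Assumption: an empty file is not worth sending to OCR.
        return False
    return text_size < threshold * pdf_size


def extracted_text_size(pdf_path: Path) -> int:
    """Run `pdftotext <file> -` and return the byte count of the text it prints."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    # Strip whitespace so pages of blank lines and form feeds do not count as text.
    return len(result.stdout.strip())


def pdf_needs_ocr(pdf_path: Path) -> bool:
    """Apply the heuristic to one PDF on disk."""
    return needs_ocr(extracted_text_size(pdf_path), pdf_path.stat().st_size)
```

You would call pdf_needs_ocr() on each file before indexing and send only the ones that return True to whatever OCR tool you use. The threshold is a rule of thumb and probably needs tuning for documents with a lot of embedded images.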
