RE: Indexing speed reduced significantly with OCR

Phil Scadden Mon, 27 Mar 2017 21:06:27 -0700

Only by 10? You must have quite small documents. OCR is an extremely expensive 
process; indexing is trivial by comparison. For the quite large documents I am 
working with, OCR can be 100 times slower than indexing a PDF that is already 
searchable (text extractable without OCR).
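
To put rough numbers on that, here is a back-of-envelope sketch in Python. The 
function name, its parameters, and the figures in the usage note are all 
hypothetical and chosen only for illustration; they are not measurements from 
Solr, Tika, or Tesseract. It estimates total batch time when only some of the 
documents need OCR.

```python
def estimated_batch_seconds(n_docs, ocr_fraction, text_seconds_per_doc, ocr_slowdown):
    """Rough total processing time for a batch of PDFs.

    n_docs: number of documents in the batch
    ocr_fraction: share of documents that need OCR (0.0 to 1.0)
    text_seconds_per_doc: assumed time for a searchable PDF (no OCR)
    ocr_slowdown: assumed factor by which OCR is slower (e.g. 10 or 100)
    """
    if not 0.0 <= ocr_fraction <= 1.0:
        raise ValueError("ocr_fraction must be between 0.0 and 1.0")
    ocr_docs = n_docs * ocr_fraction
    text_docs = n_docs - ocr_docs
    # Searchable PDFs cost the base time; scanned ones cost base time * slowdown.
    return (text_docs * text_seconds_per_doc
            + ocr_docs * text_seconds_per_doc * ocr_slowdown)
```

With these made-up inputs, calling estimated_batch_seconds(1000, 0.1, 0.5, 100) 
shows that even if only a tenth of the documents are scanned, the OCR share 
dominates the total, so it can pay to skip OCR for PDFs that already have 
extractable text.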

-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Tuesday, 28 March 2017 4:13 p.m.
To: solr-user@lucene.apache.org
Subject: Indexing speed reduced significantly with OCR

Hi,

Is the indexing speed of Solr reduced significantly when we are using 
Tesseract OCR to extract scanned inline images from PDFs?

I found that after I implemented the solution to extract those scanned images 
from PDFs, indexing became slower by a factor of roughly 10 or more.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin
