..@gmail.com]
> Sent: Thursday, 30 March 2017 4:53 p.m.
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
>
> Thanks for your reply.
>
> From what I see, getting more hardware to do the OCR is inevitable?
>
> Even if we run the OCR o
Yes, that would seem an accurate assessment of the problem.
-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Thursday, 30 March 2017 4:53 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR
Thanks for your reply
om]
> Sent: Thursday, March 30, 2017 7:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
>
> The workflow is
> -/ OCR new documents
> -/ check quality and tune until you get good output text
> -/ keep the output text in the
> Note that the OCRing is a separate task from Solr indexing, and is best done
> on separate machines.
+1
-----Original Message-----
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed r
The workflow is
-/ OCR new documents
-/ check quality and tune until you get good output text
-/ keep the output text in the file system
-/ index and re-index to Solr as necessary from the file system
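
A minimal sketch of the workflow above, under stated assumptions: the `ocr` callable stands in for whatever OCR tool is actually used, the directory layout is hypothetical, and the `id` and `text` field names are illustrative rather than taken from any real schema. The output text is kept on the file system, so re-indexing never repeats the OCR step.

```python
import json
from pathlib import Path
from typing import Callable, List


def ocr_new_documents(src_dir: Path, text_dir: Path,
                      ocr: Callable[[Path], str]) -> List[Path]:
    """OCR only the sources that have no saved text yet; return files written."""
    text_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue
        out = text_dir / (src.name + ".txt")
        if out.exists():
            continue  # already OCRed on an earlier run, keep the saved text
        out.write_text(ocr(src), encoding="utf-8")
        written.append(out)
    return written


def build_solr_docs(text_dir: Path) -> str:
    """Build a JSON array of documents from the saved text files.

    The field names "id" and "text" are assumptions for illustration.
    """
    docs = [
        {"id": p.stem, "text": p.read_text(encoding="utf-8")}
        for p in sorted(text_dir.glob("*.txt"))
    ]
    return json.dumps(docs)
```

The JSON string returned by `build_solr_docs` would then be posted to the collection's update handler by whatever client is already in use; that step is left out here.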
Note that the OCRing is a separate task from Solr indexing, and is best done on
separate mach
Thanks for your reply.
From what I see, getting more hardware to do the OCR is inevitable?
Even if we run the OCR outside of Solr indexing stream, it will still take
a long time to process it if it is on just one machine. And we still need
to wait for the OCR to finish converting before we can r
Well, I haven't had to deal with a problem of that size, but it seems to me
that you have little alternative except to throw more computer hardware at it.
For the job I did, I OCRed to convert PDF to searchable PDF outside the indexing
workflow. I used the pdftotext utility to extract text from the PDFs. If t
Converting from PDF to text is embarrassingly parallel. You can throw as many
machines at it as you want. This is a great time to use a cloud computing
service. Need 1000 machines? No problem.
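
A rough sketch of what that looks like on a single machine, assuming the `pdftotext` command-line utility is installed and the PDFs are already searchable (it extracts existing text, it does not OCR). Each file is converted independently, so the same function can run unchanged on as many machines as there are slices of the file list. The function names and the `workers` parameter are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def pdftotext_convert(pdf: Path) -> Path:
    """Run the external pdftotext tool on one PDF; return the text file path."""
    out = pdf.with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf), str(out)], check=True)
    return out


def convert_all(pdfs: Iterable[Path],
                convert: Callable[[Path], Path] = pdftotext_convert,
                workers: int = 4) -> List[Path]:
    """Convert every PDF independently; results come back in input order."""
    # Threads are enough here: the real work happens in external processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert, pdfs))
```

No file depends on any other, which is what makes the job embarrassingly parallel: splitting the list across N machines needs no coordination beyond deciding who gets which files.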
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Mar 28,
Hi,
Do you have any suggestions for coping with the expensive process of
indexing documents that require OCR?
In my current situation, the indexing takes about 2 weeks to complete. If
the average indexing speed is, say, 50 times slower, it will
require 100 weeks to index t
Yes, the sample document sizes are not very big. Also, the samples are a
mixture of documents that consist of inline images and documents that are
searchable (text extractable without OCR).
I suppose only those documents that require OCR will slow down the
indexing? Which is
Only by 10? You must have quite small documents. OCR is an extremely expensive
process. Indexing is trivial by comparison. For the quite large documents I am
working with, OCR can be 100 times slower than indexing a PDF that is searchable
(text extractable without OCR).
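
For a back-of-the-envelope estimate of a mixed corpus, something like the sketch below may help. It assumes only the documents needing OCR pay the slowdown and that the work splits evenly across machines, which is optimistic. The function and parameter names are made up for illustration.

```python
def estimated_weeks(baseline_weeks: float, ocr_fraction: float,
                    ocr_slowdown: float, machines: int = 1) -> float:
    """Estimate total processing time for a corpus where some documents need OCR.

    baseline_weeks: time to index everything if no document needed OCR
    ocr_fraction:   share of documents that need OCR, between 0 and 1
    ocr_slowdown:   how many times slower an OCR document is to process
    machines:       number of machines sharing the work (assumed ideal scaling)
    """
    if not 0.0 <= ocr_fraction <= 1.0:
        raise ValueError("ocr_fraction must be between 0 and 1")
    if machines < 1:
        raise ValueError("machines must be at least 1")
    work = baseline_weeks * ((1.0 - ocr_fraction) + ocr_fraction * ocr_slowdown)
    return work / machines
```

With the figures mentioned in this thread, `estimated_weeks(2, 1.0, 50)` gives 100 weeks on one machine, and the same job spread over 10 machines comes to 10 weeks.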
-----Original Message-----
From: