Do you intend to use this solution in production? If so, running Tika
and Tesseract on the same server as Solr may not be a good choice.
Both are heavy consumers of processing resources and can degrade the main
service of the solution, which in your case is Solr.
I had the same situation here: the Tika/Tesseract combination on the
production server did not scale once I had many text documents and
images.
An alternative is to use one microservice for text preprocessing and another
one for OCR. You can take some ideas from open-ocr:
https://github.com/tleyden/open-ocr
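As a rough sketch of that split (this is not the open-ocr API, just an illustration), a small worker outside Solr could run the Tesseract command line in its own process and then send only the extracted text to Solr's JSON update handler. The collection name "documents", the field name "content_txt", and the helper function names are assumptions made for the example; adjust them to your schema.

```python
import json
import subprocess
import urllib.request


def tesseract_command(image_path, lang="eng"):
    """Build the CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng", timeout=120):
    """Run Tesseract in a separate process, away from the Solr JVM."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def solr_update_request(solr_url, collection, doc_id, text):
    """Prepare a POST to Solr's update handler with one JSON document.

    'content_txt' is a hypothetical field name, not taken from the thread.
    """
    url = f"{solr_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_image(image_path, doc_id,
                solr_url="http://localhost:8983/solr",
                collection="documents"):
    """OCR one image and index the extracted text; returns the HTTP status."""
    text = ocr_image(image_path)
    request = solr_update_request(solr_url, collection, doc_id, text)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

With this layout, a call such as index_image("test.jpg", "doc-1") would do the OCR work on the worker machine, so Solr only receives plain text and the worker can be scaled on its own.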
I have a separate Kubernetes cluster just for this, to extract and OCR
text from binary documents. This lets me scale extraction and OCR
separately from the main service.
Marco Reis
Software Engineer
http://marcoreis.net
+55 61 981194620
On Fri, 17 Jan 2020 at 07:17, Jörn Franke wrote:
> Have you checked this?
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
>
> > On 17.01.2020 at 10:54, Retro wrote:
> >
> > Hello, can you please advise me, how to configure Solr so that embedded
> > Tika is able to use Tesseract to do the ocr of images? I have installed
> > the following software -
> > SOLR - 7.4.0
> > Tesseract - 4.1.1-rc2-20-g01fb
> > TIKA - TIKA 1.18
> > Tesseract is installed in to the following directory:
> > /usr/share/tesseract/4/tessdata/
> > echo $TESSDATA_PREFIX - > /usr/share/tesseract/4/tessdata/
> > tesseract -v
> > tesseract 4.1.1-rc2-20-g01fb
> > leptonica-1.76.0
> >
> > Command “tesseract test.jpg test.txt” produces accurate txt file with
> > OCRed content from test.jpg
> > Current setup allows us to index attachments such as structured text
> > files (txt, word, pdf, etc), but does not react in any way to attachments
> > like png, jpg. Nor does it work if uploaded directly to SOLR using its
> > web interface.
> >
> > Necessary modifications were made to the following files:
> > solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
> > PDFparser.properties.
> >
> > Would appreciate if someone helped me with this configuration.
> >
> >
> >
> > --
> > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>