Re: regarding Extracting text from Images

Jörn Franke Fri, 17 Jan 2020 02:17:47 -0800

Have you checked this?

https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
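
Before changing more configuration, it may be worth ruling out two simple things: whether the tesseract binary is visible on the PATH of the environment that runs Solr, and what the extract handler actually returns for an image. The sketch below is only an illustration under assumptions: the helper names, the core name "mycore", the base URL and the handler path /update/extract are placeholders, not taken from your setup. It checks the PATH and builds a Solr Cell URL with extractOnly=true, which returns the extracted text without indexing it.

```python
import shutil
from urllib.parse import urlencode


def tesseract_on_path(binary="tesseract"):
    """Return the full path of the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def extract_only_url(solr_base, core, handler="/update/extract"):
    """Build a Solr Cell URL that returns extracted text without indexing it.

    solr_base, core and handler are assumptions; adjust them to the
    handler path registered in your own solrconfig.xml.
    """
    params = urlencode(
        {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    )
    return f"{solr_base.rstrip('/')}/{core}{handler}?{params}"


if __name__ == "__main__":
    # Run this as the same user, and with the same environment, as Solr.
    print("tesseract:", tesseract_on_path() or "not found on PATH")
    # Hypothetical base URL and core name, for illustration only.
    print(extract_only_url("http://localhost:8983/solr", "mycore"))
```

Posting test.jpg to the printed URL (for example with curl -F) should show whether any OCR text comes back. An empty body would suggest that the OCR step is not being invoked at all, rather than an indexing problem.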


> On 17.01.2020 at 10:54, Retro <holste...@mail.ru.invalid> wrote:
> 
> Hello, can you please advise me how to configure Solr so that the embedded
> Tika is able to use Tesseract to OCR images? I have installed the
> following software:
> Solr      - 7.4.0
> Tesseract - 4.1.1-rc2-20-g01fb
> Tika      - 1.18
> Tesseract is installed in the following directory:
> /usr/share/tesseract/4/tessdata/
> echo $TESSDATA_PREFIX - > /usr/share/tesseract/4/tessdata/
> tesseract -v
> tesseract 4.1.1-rc2-20-g01fb
> leptonica-1.76.0
> 
> The command “tesseract test.jpg test.txt” produces an accurate txt file with
> the OCRed content from test.jpg.
> The current setup allows us to index attachments such as structured text files
> (txt, word, pdf, etc.), but does not react in any way to attachments like
> png or jpg. Nor does it work if they are uploaded directly to Solr using its
> web interface.
> 
> Necessary modifications were made to the following files:
> solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
> PDFparser.properties.
> 
> I would appreciate it if someone could help me with this configuration.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
