Re: regarding Extracting text from Images

Retro Fri, 17 Jan 2020 01:54:35 -0800

Hello, can you please advise me how to configure Solr so that the embedded
Tika can use Tesseract to OCR images? I have installed the following
software:
Solr      - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
Tika      - 1.18
Tesseract is installed in the following directory:
/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0
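
Since the values above come from an interactive shell, one thing that may be worth ruling out is that the Solr process sees a different environment. The following is a minimal sketch of such a check, not part of the setup described here: the function name `tesseract_visibility` is hypothetical, and it only reports where `tesseract` resolves on a given PATH and what TESSDATA_PREFIX is set to.

```python
import os
import shutil


def tesseract_visibility(env=None):
    """Report how Tesseract would be found in the given environment.

    `env` is a mapping of environment variables; it defaults to the
    environment of the current process.
    """
    env = os.environ if env is None else env
    return {
        # Full path of the tesseract executable, or None if it is not on PATH.
        "tesseract_on_path": shutil.which("tesseract", path=env.get("PATH", "")),
        # Value of TESSDATA_PREFIX, or None if it is unset.
        "TESSDATA_PREFIX": env.get("TESSDATA_PREFIX"),
    }
```

To be meaningful, `tesseract_visibility()` would have to be run as the same user, and with the same environment, that starts Solr.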


The command “tesseract test.jpg  test.txt” produces an accurate text file
with the OCRed content of test.jpg.
The current setup allows us to index attachments such as structured text
files (txt, Word, PDF, etc.), but it does not react in any way to image
attachments such as png or jpg. Nor does it work when an image is uploaded
directly to Solr through its web interface.
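
The behaviour above can also be reproduced outside the web interface by posting the image to Solr's extracting request handler with `extractOnly=true`, which returns what Tika extracted without indexing anything. The sketch below assumes a core named `mycore` on `http://localhost:8983` and a handler registered at `/update/extract` in solrconfig.xml; those values and both function names are placeholders, not taken from the setup described here.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, core, extract_only=True):
    """Build the URL of the /update/extract handler for the given core."""
    params = {"extractOnly": "true" if extract_only else "false", "wt": "json"}
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/solr/{core}/update/extract?{query}"


def extract_image(path, base_url="http://localhost:8983", core="mycore",
                  content_type="image/jpeg"):
    """POST the raw file to Solr and return the response body as text."""
    with open(path, "rb") as fh:
        data = fh.read()
    request = urllib.request.Request(
        build_extract_url(base_url, core),
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_image("test.jpg")` against a running instance would show whether the response contains any OCRed text or only image metadata.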

Necessary modifications were made to the following files:
solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
PDFparser.properties.

I would appreciate it if someone could help me with this configuration.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
