Hello,

Can you please advise me on how to configure Solr so that the embedded Tika is able to use Tesseract to OCR images?

I have installed the following software:

Solr - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
Tika - 1.18

Tesseract is installed in the following directory: /usr/share/tesseract/4/tessdata/

echo $TESSDATA_PREFIX
-> /usr/share/tesseract/4/tessdata/

tesseract -v
tesseract 4.1.1-rc2-20-g01fb
 leptonica-1.76.0

The command “tesseract test.jpg test.txt” produces an accurate text file with the OCRed content of test.jpg.

The current setup allows us to index attachments such as structured text files (txt, Word, PDF, etc.), but it does not react in any way to image attachments such as png or jpg. Nor does it work if the images are uploaded directly to Solr through its web interface.

The necessary modifications were made to the following files: solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml; PDFparser.properties.

I would appreciate it if someone could help me with this configuration.

--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html