Have you checked this? https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
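
Since tesseract works from your shell but Solr does not react to images, one thing that may be worth ruling out is whether the Solr process sees the same environment as your login shell. The sketch below is only a diagnostic, not a Solr or Tika API: it assumes Tika locates the `tesseract` executable through PATH unless a path is configured explicitly, and the function name `check_tesseract_env` is made up for illustration. Run it as the user that runs Solr.

```python
import os
import shutil


def check_tesseract_env(env=None):
    """Report whether tesseract is reachable in the given environment.

    `env` is a mapping like os.environ; it defaults to the current process
    environment. This is a hypothetical helper, not part of Solr or Tika.
    """
    env = os.environ if env is None else env
    # Look up the executable on the PATH from `env`, not the caller's PATH.
    exe = shutil.which("tesseract", path=env.get("PATH", ""))
    tessdata = env.get("TESSDATA_PREFIX")
    return {
        "tesseract_on_path": exe is not None,
        "tesseract_executable": exe,
        "tessdata_prefix": tessdata,
        "tessdata_dir_exists": bool(tessdata) and os.path.isdir(tessdata),
    }


if __name__ == "__main__":
    for key, value in check_tesseract_env().items():
        print(f"{key}: {value}")
```

If `tesseract_on_path` is False for the Solr user, that would point at the service environment rather than at solrconfig.xml.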
> On 17.01.2020 at 10:54, Retro <holste...@mail.ru.invalid> wrote:
>
> Hello, can you please advise me how to configure Solr so that embedded Tika
> is able to use Tesseract to do the OCR of images? I have installed the
> following software -
> SOLR - 7.4.0
> Tesseract - 4.1.1-rc2-20-g01fb
> TIKA - TIKA 1.18
> Tesseract is installed in the following directory:
> /usr/share/tesseract/4/tessdata/
> echo $TESSDATA_PREFIX -
> /usr/share/tesseract/4/tessdata/
> tesseract -v
> tesseract 4.1.1-rc2-20-g01fb
> leptonica-1.76.0
>
> Command “tesseract test.jpg test.txt” produces an accurate txt file with
> OCRed content from test.jpg
> The current setup allows us to index attachments such as structured text files
> (txt, word, pdf, etc), but does not react in any way to attachments like
> png, jpg. Nor does it work if they are uploaded directly to SOLR using its web interface.
>
> Necessary modifications were made to the following files:
> solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
> PDFparser.properties.
>
> I would appreciate it if someone helped me with this configuration.
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html