Do you intend to use this solution in production? If so, running Tika and Tesseract on the same server as Solr may not be a good choice. Both are heavy CPU and memory consumers, and they compete for resources with the main service, which in your case is Solr. I had the same situation here: the embedded Tika/Tesseract combination on the production server did not scale, since I have many text documents and images. An alternative is to use one microservice for text extraction and another for OCR. You can take some ideas from https://github.com/tleyden/open-ocr . I have a separate Kubernetes cluster just for this, to extract and OCR text from binary documents, so extraction and OCR now scale independently of Solr.
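To make the idea concrete, here is a minimal sketch of a client that sends a file to a separate extraction service and then indexes only the resulting text in Solr. It is not my production code. It assumes a standalone Tika Server (PUT to /tika, default port 9998) running in a container that also has Tesseract installed, and a Solr core that accepts JSON at /update. The host names tika-ocr and solr, the core name docs, and the field content_txt are placeholders I made up for the example.

```python
import json
import urllib.request

# Hypothetical endpoints: adjust host names, core name and ports to your deployment.
TIKA_URL = "http://tika-ocr:9998/tika"
SOLR_UPDATE_URL = "http://solr:8983/solr/docs/update?commit=true"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for the plain text of a file."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(
    doc_id: str, text: str, solr_url: str = SOLR_UPDATE_URL
) -> urllib.request.Request:
    """Build a POST request that sends one already-extracted document to Solr."""
    # "content_txt" is an assumed field name; use whatever your schema defines.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(doc_id: str, path: str) -> None:
    """Extract text on the separate service, then index it in Solr."""
    with open(path, "rb") as f:
        data = f.read()
    # The heavy parsing and OCR work happens on the extraction service.
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8")
    # Solr only receives plain text, so it never runs Tika or Tesseract itself.
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        resp.read()
```

Calling extract_and_index("doc-1", "test.jpg") would then do the whole round trip. The point of the split is that the extraction service can be replicated or restarted without touching the Solr nodes.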
Marco Reis
Software Engineer
http://marcoreis.net
+55 61 981194620

On Fri, 17 Jan 2020 at 07:17, Jörn Franke <jornfra...@gmail.com> wrote:

> Have you checked this?
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
>
> > On 17.01.2020 at 10:54, Retro <holste...@mail.ru.invalid> wrote:
> >
> > Hello, can you please advise me, how to configure Solr so that embedded Tika
> > is able to use Tesseract to do the ocr of images? I have installed the
> > following software -
> > SOLR - 7.4.0
> > Tesseract - 4.1.1-rc2-20-g01fb
> > TIKA - TIKA 1.18
> > Tesseract is installed in to the following directory:
> > /usr/share/tesseract/4/tessdata/
> > echo $TESSDATA_PREFIX -
> > /usr/share/tesseract/4/tessdata/
> > tesseract -v
> > tesseract 4.1.1-rc2-20-g01fb
> > leptonica-1.76.0
> >
> > Command “tesseract test.jpg test.txt” produces accurate txt file with
> > OCRed content from test.jpg
> > Current setup allows us to index attachments such like structured text files
> > (txt, word, pdf, etc), but does not react in any way for attachments like
> > png, jpg. Nor it works if uploaded directly to SOLR using its web interface.
> >
> > Necessary modifications were made to the following files:
> > solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
> > PDFparser.properties.
> >
> > Would appreciate if someone helped me with this configuration.
> >
> > --
> > Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html