I second Jörn: don't deploy Tesseract + Tika on the same server as Solr. Tika, especially with Tesseract OCR enabled, will drain machine resources that Solr could otherwise use for indexing and searching. In addition, a malformed PDF could potentially bring down the Solr server. Your best bet is to run tika-server + Tesseract on a dedicated server/container, use it to extract the text (including OCR) from the documents, and then send that text to Solr.
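A minimal sketch of that pipeline in Python, in case it helps. It assumes tika-server is reachable on its default port 9998 and that Solr accepts JSON documents on the collection's /update handler. The host names, the collection name "mycollection", the field name "content_txt" and the helper function names are all placeholders I made up for illustration, and the X-Tika-OCRLanguage header is only honoured if your tika-server version and Tesseract setup support it, so treat this as a starting point rather than a finished program:

```python
import json
import urllib.request

# Assumed endpoints (hypothetical hosts and collection); adjust to your setup.
TIKA_URL = "http://tika-host:9998/tika"
SOLR_UPDATE_URL = "http://solr-host:8983/solr/mycollection/update?commit=true"


def build_tika_request(data: bytes, ocr_lang: str = "eng") -> urllib.request.Request:
    """Build a PUT request that sends the raw file to tika-server
    and asks for the extracted text as plain text."""
    return urllib.request.Request(
        TIKA_URL,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Assumption: OCR language hint understood by tika-server.
            "X-Tika-OCRLanguage": ocr_lang,
        },
    )


def build_solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """Build a POST request that sends one JSON document to Solr.
    'content_txt' is a placeholder field name, not something Solr requires."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: str, doc_id: str) -> None:
    """Extract text on the Tika machine, then post the result to Solr."""
    with open(path, "rb") as fh:
        raw = fh.read()
    # OCR can be slow, so allow a generous timeout for the extraction step.
    with urllib.request.urlopen(build_tika_request(raw), timeout=300) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=60) as resp:
        resp.read()
```

You would call something like index_file("/data/scan.pdf", "scan-1") from the machine that runs the extraction job. If Tika chokes on a bad PDF, only that job fails and Solr keeps serving queries.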
But to answer your question: Solr embeds Tika, which can be configured to use Tesseract. It's Tika that knows about Tesseract, not Solr. See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more information.

Best regards,
Edward

On Tue, Feb 11, 2020 at 3:26 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Honestly i would not run tesseract on the same server as Solr. It takes a
> lot of resources and may negatively impact Solr. Just write a small program
> using Tika+Tesseract that runs on a different server / container and posts
> the results to Solr.
>
> About your question: Probably Tika (a dependency of Solr) figured it out
> or depending on your format Pdfbox (used by Tika).
>
> > On 11.02.2020 at 19:15, Karan Jain <sachu8...@gmail.com> wrote:
> >
> > Hi All,
> >
> > The Solr version 7.6.0 is running on my local machine. I have installed
> > Tesseract through following steps:-
> > yum install tesseract
> > echo export PATH=$PATH:/usr/share/tesseract >> ~/.bash_profile
> > echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile
> >
> > Now the deployed Solr is supporting tesseract. I searched TESSDATA_PREFIX
> > in https://github.com/apache/lucene-solr and found no reference there. I
> > could not understand How Solr came to know about the deployed tesseract.
> > Please tell the specific java class in Solr if possible.
> >
> > Thanks for your time,
> > Best,
> > Karan
>