I would do neither. I’d put all of the extraction and OCR work on an external server and use _that_, then send the finished docs to Solr.
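A minimal sketch of that external approach, assuming the tesseract command-line tool is installed on the external box and Solr is reachable over HTTP. The Solr URL, the collection name "docs", and the field names "id" and "content_txt" are placeholders for illustration, not anything from this thread; adjust them to your schema.

```python
import json
import subprocess
import urllib.request
from pathlib import Path

# Assumed values, not from this thread: host, port, and a collection named "docs".
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def ocr_image(path, run=subprocess.run):
    """Run the tesseract CLI on one image and return the recognized text."""
    # "tesseract <image> stdout" writes the OCR result to standard output.
    result = run(
        ["tesseract", str(path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def to_solr_doc(path, text):
    """Build one Solr document; the field names here are assumptions."""
    return {"id": Path(path).name, "content_txt": text}


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Wrap a list of documents in a JSON POST to Solr's update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_images(paths, url=SOLR_UPDATE_URL):
    """OCR every image outside Solr, then send the finished docs in one request."""
    docs = [to_solr_doc(p, ocr_image(p)) for p in paths]
    with urllib.request.urlopen(build_update_request(docs, url)) as resp:
        return resp.status
```

Calling index_images(["scan1.png", "scan2.png"]) on the external box would OCR both files there and post only the extracted text, so Solr never runs Tesseract itself.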
The problem with putting this all on Solr is at least three-fold:
1> you’re talking heavy-duty work here to do the OCR, which takes away from the resources available for searching and indexing
2> any problems with either one will potentially blow up Solr
3> if you’re processing very many docs, you’ll have to parallelize somehow

Here’s the long form: https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick

> On Oct 26, 2019, at 12:37 PM, Edward Ribeiro <edward.ribe...@gmail.com> wrote:
>
> No. You should install tesseract-ocr on the same box your Solr instance is on,
> and configure Solr so that embedded Tika is able to use Tesseract to do the
> OCR of images.
>
> Best,
> Edward
>
> On Wed, Oct 23, 2019, 20:08, suresh pendap <sureshpen...@gmail.com> wrote:
>
>> Hi Alex,
>> Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
>> to implement a custom update processor or extend the
>> ExtractingRequestProcessor?
>>
>> Regards
>> Suresh
>>
>> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>
>>> I believe Tika, which powers this, can do so with extra libraries (tesseract?),
>>> but Solr does not bundle those extras.
>>>
>>> In any case, you may want to run Tika externally to keep the
>>> conversion/extraction process from being a burden on Solr itself.
>>>
>>> Regards,
>>> Alex
>>>
>>> On Wed, Oct 23, 2019, 1:58 PM suresh pendap <sureshpen...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> I am reading the Solr documentation about integration with Tika and the Solr
>>>> Cell framework here:
>>>>
>>>> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>>>>
>>>> I would like to know if the Solr Cell framework can also be used to
>>>> extract text from image files.
>>>>
>>>> Regards
>>>> Suresh