I would do neither. I’d put Tika and Tesseract on an external server, do the
extraction and OCR there, then send the finished docs to Solr.
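
As a rough sketch of that flow (not a definitive implementation): the
snippet below assumes the `tesseract` command-line tool is installed on the
external box, and that Solr has a collection named `docs` with a `text_txt`
field. The collection name, field names, URL, and function names are all
placeholders made up for illustration.

```python
import json
import subprocess
import urllib.request


def ocr_image(path):
    """Run the tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_update_request(solr_url, collection, docs):
    """Build (but do not send) a JSON update request for a list of docs."""
    # commitWithin lets Solr batch commits rather than committing per request.
    url = f"{solr_url}/{collection}/update?commitWithin=10000"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_image(path, solr_url="http://localhost:8983/solr", collection="docs"):
    """OCR one image outside Solr, then send only the finished doc."""
    doc = {"id": path, "text_txt": ocr_image(path)}  # hypothetical field names
    req = build_update_request(solr_url, collection, [doc])
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```

The point of the split is that Solr only ever sees plain text fields; all the
image handling happens in a process that can crash without touching Solr.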

The problem with putting this all on Solr is at least three-fold:
1> you’re talking heavy-duty work here to do the OCR, which takes resources
away from searching and indexing
2> any problems with either one (Tika or Tesseract) will potentially blow up Solr
3> if you’re processing a lot of docs, you’ll have to parallelize somehow
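
For points 2 and 3, here is a minimal sketch of how an external client might
parallelize the extraction and keep one bad file from stopping the run. The
`extract` callable is a stand-in for whatever does the real OCR, and the
`id` / `text_txt` field names are assumptions, not anything Solr requires.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(paths, extract, workers=4):
    """Run extract(path) over many files in parallel.

    Returns (docs, failures): docs are ready to send to Solr, failures are
    (path, error message) pairs for files that could not be processed.
    """
    docs, failures = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front, then collect results in input order.
        futures = {pool.submit(extract, p): p for p in paths}
        for fut, path in futures.items():
            try:
                docs.append({"id": path, "text_txt": fut.result()})
            except Exception as exc:
                # A corrupt file fails here, in the client, not inside Solr.
                failures.append((path, str(exc)))
    return docs, failures
```

Threads are enough when `extract` shells out to an external OCR process; if
the extraction runs in-process and is CPU-bound, `ProcessPoolExecutor` from
the same module is probably the better fit.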

Here’s the long form: 
https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick

> On Oct 26, 2019, at 12:37 PM, Edward Ribeiro <edward.ribe...@gmail.com> wrote:
> 
> No. You should install tesseract-ocr on the same box as your Solr instance
> and configure Solr so that the embedded Tika can use Tesseract to OCR the
> images.
> 
> Best,
> Edward
> 
> On Wed, Oct 23, 2019 at 20:08, suresh pendap <sureshpen...@gmail.com>
> wrote:
> 
>> Hi Alex,
>> Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
>> to implement a custom update processor or extend the
>> ExtractingRequestProcessor?
>> 
>> Regards
>> Suresh
>> 
>> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>> 
>>> I believe Tika, which powers this, can do so with extra libraries
>>> (tesseract?), but Solr does not bundle those extras.
>>> 
>>> In any case, you may want to run Tika externally so that the
>>> conversion/extraction process is not a burden on Solr itself.
>>> 
>>> Regards,
>>>     Alex
>>> 
>>> On Wed, Oct 23, 2019, 1:58 PM suresh pendap, <sureshpen...@gmail.com>
>>> wrote:
>>> 
>>>> Hello,
>>>> I am reading the Solr documentation about integration with Tika and the
>>>> Solr Cell framework here:
>>>>
>>>> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>>>>
>>>> I would like to know if the Solr Cell framework can also be used to
>>>> extract text from image files?
>>>> 
>>>> Regards
>>>> Suresh
>>>> 
>>> 
>> 
