Just to stir the pot on this topic, here is an article about why and how to use Tika inside of Solr:
https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/

> On Oct 23, 2019, at 7:21 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too,
> but you can pull that part out pretty easily):
> https://lucidworks.com/post/indexing-with-solrj/
>
>> On Oct 23, 2019, at 7:16 PM, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>>
>> Again, I think you are best to do it out of Solr.
>>
>> But even if you want to get it to work in Solr, I think you start by
>> getting it to work directly in Tika. Then, get the missing libraries and
>> configuration into Solr.
>>
>> Regards,
>> Alex
>>
>> On Wed, Oct 23, 2019, 7:08 PM suresh pendap, <sureshpen...@gmail.com> wrote:
>>
>>> Hi Alex,
>>> Thanks for your reply. How do we integrate tesseract with Solr? Do we have
>>> to implement Custom update processor or extend the
>>> ExtractingRequestProcessor?
>>>
>>> Regards
>>> Suresh
>>>
>>> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch <arafa...@gmail.com>
>>> wrote:
>>>
>>>> I believe Tika that powers this can do so with extra libraries (tesseract?)
>>>> But Solr does not bundle those extras.
>>>>
>>>> In any case, you may want to run Tika externally to avoid the
>>>> conversion/extraction process being a burden to Solr itself.
>>>>
>>>> Regards,
>>>> Alex
>>>>
>>>> On Wed, Oct 23, 2019, 1:58 PM suresh pendap, <sureshpen...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> I am reading the Solr documentation about integration with Tika and Solr
>>>>> Cell framework over here
>>>>>
>>>>> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>>>>>
>>>>> I would like to know if the Solr Cell framework can also be used to extract
>>>>> text from the image files?
>>>>>
>>>>> Regards
>>>>> Suresh

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
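
For readers following the "run Tika outside Solr" advice in this thread, here is a rough sketch of what that flow might look like. It is not taken from either linked article. It assumes a Tika Server listening on localhost:9998 and a Solr collection named "docs" with a dynamic "*_txt" field; the URLs, the collection name, the field name and the helper function names are all hypothetical. Whether text comes back from image files depends on Tesseract being installed where Tika runs, which is the extra dependency Alex mentions.

```python
import json
import urllib.request

# Assumed endpoints, not taken from the thread: adjust to your own setup.
TIKA_URL = "http://localhost:9998/tika"
# "docs" is a hypothetical collection name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(file_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server for plain extracted text."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that adds one JSON document to Solr."""
    # "content_txt" is an assumed field name relying on a "*_txt" dynamic field.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path, doc_id):
    """Extract text with Tika (outside Solr), then send the result to Solr."""
    with open(path, "rb") as fh:
        tika_req = build_tika_request(fh.read())
    with urllib.request.urlopen(tika_req) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        return resp.status
```

Calling index_file("scan.png", "scan-1") would need both services running. The point of the split is that the extraction work, OCR included, happens in the Tika process, so a slow or crashing parse does not take Solr down with it.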