I totally think that for any heavy-duty use case, or any use case where the
documents are not constrained to a known set with polite characteristics
(i.e. known not to be password protected, reasonable length, etc.), Tika
should not run inside Solr. That said, as I see it the key downside of not
having solr-cell as part of Solr would be that we would likely remove the
docs for it too, and the entire concept of how to get a "normal" document
into Solr evaporates from our ref guide. So I like the sound of it being an
official package as Eric suggests, and perhaps even the canonical example
of how to install a package... along with heavy documentation caveats about
why Tika should run outside of Solr for most production purposes, of
course.
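
To make the "Tika outside of Solr" caveat concrete, here is a rough sketch of
what the client-side flow might look like: send the raw file to a separately
running Tika Server, then post the extracted text to Solr as a plain JSON
document. Everything specific here is an assumption for illustration only: the
host/port values, the collection name "docs", and the field name "content_txt"
(which relies on a `*_txt` dynamic field being present in the schema). It is
not how SolrCell or any proposed replacement handler actually works.

```python
import json
import urllib.request

# Assumed endpoints; adjust for a real deployment.
TIKA_URL = "http://localhost:9998/tika"   # Tika Server running in its own JVM
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "docs"                        # hypothetical collection name


def tika_request(raw: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to Tika Server that asks for plain text back."""
    return urllib.request.Request(
        tika_url, data=raw, method="PUT", headers={"Accept": "text/plain"}
    )


def solr_request(doc_id: str, text: str, solr_url: str = SOLR_URL,
                 collection: str = COLLECTION) -> urllib.request.Request:
    """Build a JSON update for Solr; 'content_txt' is a hypothetical field."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: str, doc_id: str, timeout: float = 30.0) -> None:
    """Extract text via Tika Server, then index it into Solr.

    If Tika hangs or crashes on a bad document, only this call fails
    (via the timeout or a connection error); Solr itself is unaffected.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    with urllib.request.urlopen(tika_request(raw), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(solr_request(doc_id, text), timeout=timeout) as resp:
        resp.read()
```

The point of the split is just that a misbehaving PDF takes down (at worst) the
extraction process, and the caller gets a timeout instead of Solr suffering.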

-Gus


On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> I did a series of blog posts about Tika, and while conventional wisdom is
> that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> years.
> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
>
> Having said that, my bigger beef with Tika in Solr is about all the
> dependencies that it drags along.   I am constantly looking up a package
> wondering how we use it in Solr just to find it’s a Tika package….  So….
> For that reason I think we need to do something better.
>
> I like moving SolrCell to a package
> (https://issues.apache.org/jira/browse/SOLR-15951). We have this
> powerful packaging feature, and yet we hardly dog food it ourselves….  I’d
> love to see us separate out SolrCell and make it easy to do `bin/solr
> package install solrcell` and have it work!  It would both validate the
> whole Package concept, and minimize the dependencies in Solr’s tarball.
>
> Secondly, for folks who really do want to run a separate Tika server, I’d
> love to make it easier to use.    Tika has introduced a new “pipes” concept
> to reduce the amount of back and forth when working with Tika Server that
> might tie nicely into the Solr update pipeline.  I don’t think any real
> work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
>
> Eric
>
>
> > On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> >> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
> >> * Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
> >> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
> >> * Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
> >
> > As you likely know, the big problem is that Tika has a habit of crashing
> > or misbehaving, particularly with PDFs, and if it's running inside Solr,
> > then Solr itself is going to suffer whatever bad effects Tika causes.
> >
> >> My current thinking / proposal is to:
> >> * Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
> >> * Deprecate SolrCell in current form
> >> * From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server
> >
> > I was thinking something along these lines too.  A separate JVM running
> > Tika Server that can crash without taking Solr down, and communication so
> > ERH can send commands to it, receive extracted data, and hopefully know
> > when the other JVM crashes.  If we design it well, then the framework could
> > be used to integrate with other extraction mechanisms besides Tika.  I
> > think that would be quite a bit of work.
> >
> > It might be a good idea to make that a separate project as was done for
> > DIH, but I have no way of guessing whether there is enough interest in the
> > community to keep it maintained.  If it's a separate project, then I think
> > it would just incorporate SolrJ and Tika, rather than using a special
> > handler.  I have never used ERH in a production setting, and barely have
> > experience with it in non-production.
> >
> > Thanks,
> > Shawn
> >
> >
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
