I firmly believe that Tika should not run inside Solr for any heavy-duty use case, or for any use case where the documents are not constrained to a known set with polite characteristics (i.e. known not to be password protected, of reasonable length, etc.). That said, as I see it, the key downside of not having solr-cell as part of Solr is that we would likely remove the docs for it too, and the entire concept of how to get a "normal" document into Solr would evaporate from our ref guide. So I like the sound of it being an official package as Eric suggests, and perhaps even the canonical example of how to install a package... along with heavy documentation caveats about why Tika should run outside of Solr for most production purposes, of course.
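For what it's worth, the "Tika outside of Solr" arrangement can be sketched client-side in a few lines. This is only an illustration, not a proposed design. It assumes a Tika Server listening on its default port 9998, a hypothetical collection named `docs`, and a `content_txt` field (which only works if the schema has a matching `*_txt` dynamic field). The helper names are made up for the example.

```python
import json
import urllib.request

# Assumed endpoints for illustration only; adjust to the actual deployment.
TIKA_URL = "http://localhost:9998/tika"               # Tika Server, default port
SOLR_URL = "http://localhost:8983/solr/docs/update"   # "docs" is a hypothetical collection


def build_tika_request(data, tika_url=TIKA_URL):
    """PUT the raw document bytes to Tika Server and ask for plain text back."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id, text, solr_url=SOLR_URL):
    """POST one JSON document to Solr's update handler and commit."""
    # "content_txt" is an assumed field name, not something Solr requires.
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url + "?commit=true",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path, doc_id, timeout=30.0):
    """Extract text in the separate Tika process, then index it into Solr.

    Returns False if extraction fails, so a crashed or hung Tika Server
    costs us one document instead of affecting Solr.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        with urllib.request.urlopen(build_tika_request(data), timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError, HTTPError and socket timeouts are all OSError
        print(f"extraction failed for {path}: {exc}")
        return False
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        return resp.status == 200
```

Calling `extract_and_index("report.pdf", "report-1")` would then push one file through both services. The point of the split is that the timeout and the `except` branch sit on the Tika side of the boundary, so a misbehaving PDF never runs inside Solr's JVM.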
-Gus

On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com> wrote:

> I did a series of blog posts about Tika, and while conventional wisdom is that running Tika in Solr is bad, I’ve had GREAT luck with it over the years.
> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
>
> Having said that, my bigger beef with Tika in Solr is about all the dependencies that it drags along. I am constantly looking up a package wondering how we use it in Solr just to find it’s a Tika package…. So…. For that reason I think we need to do something better.
>
> I like moving SolrCell to a package (https://issues.apache.org/jira/browse/SOLR-15951). We have this powerful packaging feature, and yet we hardly dog food it ourselves…. I’d love to see us separate out SolrCell and make it easy to do `bin/solr package install solrcell` and have it work! It would both validate the whole Package concept, and minimize the dependencies in Solr’s tarball.
>
> Secondly, for folks who really do want to run a separate Tika server, I’d love to make it easier to use. Tika has introduced a new “pipes” concept to reduce the amount of back and forth when working with Tika Server that might tie nicely into the Solr update pipeline. I don’t think any real work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
>
> Eric
>
> > On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> >> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
> >> * Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
> >> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
> >> * Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
> >
> > As you likely know, the big problem is that Tika has a habit of crashing or misbehaving, particularly with PDFs, and if it's running inside Solr, then Solr itself is going to suffer whatever bad effects Tika causes.
> >
> >> My current thinking / proposal is to:
> >> * Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
> >> * Deprecate SolrCell in current form
> >> * From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server
> >
> > I was thinking something along these lines too. A separate JVM running Tika Server that can crash without taking Solr down, and communication so ERH can send commands to it, receive extracted data, and hopefully know when the other JVM crashes. If we design it well, then the framework could be used to integrate with other extraction mechanisms besides Tika. I think that would be quite a bit of work.
> >
> > It might be a good idea to make that a separate project as was done for DIH, but I have no way of guessing whether there is enough interest in the community to keep it maintained. If it's a separate project, then I think it would just incorporate SolrJ and Tika, rather than using a special handler. I have never used ERH in a production setting, and barely have experience with it in non-production.
> >
> > Thanks,
> > Shawn
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)