Sounds good, Jan. If you're heading in this direction, I'd recommend the /tika endpoint with an Accept header set to "application/json".
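
A minimal sketch of the call suggested above, in Python for brevity. It assumes a tika-server instance reachable at http://localhost:9998; the function names are made up for this illustration and are not part of Tika or Solr. The exact shape of the JSON that comes back depends on the Tika version, so the sketch returns the parsed response without interpreting it.

```python
import json
import urllib.request


def build_tika_request(document_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT to tika-server's /tika endpoint.

    base_url is an assumption: a tika-server running locally on port 9998.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        # Ask for JSON, as recommended in the message above.
        headers={"Accept": "application/json"},
    )


def extract_with_tika(document_bytes, base_url="http://localhost:9998", timeout=60):
    """Send the document to tika-server and return the parsed JSON response."""
    request = build_tika_request(document_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `extract_with_tika(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder file name.
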
Please let me know if I can help.

Best,

Tim

On Thu, Mar 23, 2023 at 2:43 PM Jan Høydahl <jan....@cominvent.com> wrote:

> Documentation-wise, we can rewrite the chapter we have on rich text indexing
> to mention several options, including tika-server and tika-pipes with the
> Solr emitter.
>
> Wrt the SolrCell successor, I still think a super-thin module forwarding to
> TikaServer is the best. Users would get the same features and API as today,
> so users who rely on SolrCell have a simple migration path. It may also be a
> benefit that they get better control over their Tika Server wrt version,
> scaling, what parsers are included etc. I want to do a quick POC on this to
> see how it flies.
>
> Jan
>
> > On 23 Mar 2023, at 17:14, Tim Allison <talli...@apache.org> wrote:
> >
> > Apologies for being late to the show, and thank you Eric for pinging me on
> > this.
> >
> > I'm 100% for factoring out Tika from the same jvm as Solr. I see three
> > options for removing Tika from Solr's jvm, making it easier for users and
> > keeping Tika's jar hell all to itself.
> >
> > 1) As already proposed, use Tika server and somehow figure out how to
> > integrate that seamlessly.
> >
> > 2) Use Tika pipes within Solr directly or within a package (as Eric
> > suggests). This forks a process for parsing, and all the heavy dependencies
> > go into the forked process. Solr would need tika-core, but could specify a
> > directory with tika-app.jar in it. The dependency nightmare in
> > tika-app.jar would not get loaded into Solr's jvm. We'd probably have to
> > make some mods to tika-pipes for this to work roughly as Tika is being used
> > now, but I think something like this is doable...
> >
> > 3) Direct users to tika-pipes directly. We have a Solr emitter. Users can
> > aim tika-pipes at a directory of files, an S3 bucket, a gcs thing, etc, and
> > Tika will safely parse the files in a forked process and forward the
> > results to Solr.
> > This is not as easy as curling bytes to Solr and having those bytes
> > parsed, but it is possible.
> >
> > Please let me know how I can help.
> >
> > Best,
> >
> > Tim
> >
> > On 2023/03/10 03:57:45 Gus Heck wrote:
> >> I totally think that for any heavy-duty use case, or any use case
> >> where the documents are not constrained to a known set with polite
> >> characteristics (i.e. known not to be password protected, reasonable
> >> length, etc), Tika should not run inside Solr. That said, as I see it the
> >> key downside of not having solr-cell as part of Solr would be that we would
> >> likely remove the docs for it too, and the entire concept of how to get a
> >> "normal" document into Solr evaporates from our ref guide. So I like the
> >> sound of it being an official package as Eric suggests, and perhaps even
> >> the canonical example of how to install a package... Along with heavy
> >> documentation caveats of why Tika should run outside of Solr for most
> >> production purposes of course.
> >>
> >> -Gus
> >>
> >>
> >> On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com>
> >> wrote:
> >>
> >>> I did a series of blog posts about Tika, and while conventional wisdom is
> >>> that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> >>> years.
> >>> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
> >>>
> >>> Having said that, my bigger beef with Tika in Solr is about all the
> >>> dependencies that it drags along. I am constantly looking up a package
> >>> wondering how we use it in Solr just to find it’s a Tika package…. So….
> >>> For that reason I think we need to do something better.
> >>>
> >>> I like moving SolrCell to a package
> >>> (https://issues.apache.org/jira/browse/SOLR-15951).
> >>> We have this powerful packaging feature, and yet we hardly dog food it
> >>> ourselves…. I’d love to see us separate out SolrCell and make it easy to
> >>> do `bin/solr package install solrcell` and have it work! It would both
> >>> validate the whole Package concept, and minimize the dependencies in
> >>> Solr’s tarball.
> >>>
> >>> Secondly, for folks who really do want to run a separate Tika server, I’d
> >>> love to make it easier to use. Tika has introduced a new “pipes” concept
> >>> to reduce the amount of back and forth when working with Tika Server that
> >>> might tie nicely into the Solr update pipeline. I don’t think any real
> >>> work has been done on this…. Hoping Tim Allison weighs in on this topic
> >>> ;-)
> >>>
> >>> Eric
> >>>
> >>>
> >>>> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>>
> >>>> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> >>>>> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951
> >>>>>   <https://issues.apache.org/jira/browse/SOLR-15951>
> >>>>> * Deprecate SolrCell SOLR-13973
> >>>>>   <https://issues.apache.org/jira/browse/SOLR-13973>
> >>>>> * Keep in Solr but use Tika-Server
> >>>>>   <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632
> >>>>>   <https://issues.apache.org/jira/browse/SOLR-7632>
> >>>>> * Integrate Tika client-side SOLR-1526
> >>>>>   <https://issues.apache.org/jira/browse/SOLR-1526>
> >>>>
> >>>> As you likely know, the big problem is that Tika has a habit of crashing
> >>>> or misbehaving, particularly with PDFs, and if it's running inside Solr,
> >>>> then Solr itself is going to suffer whatever bad effects Tika causes.
> >>>>
> >>>>> My current thinking / proposal is to:
> >>>>> * Build a new, thin Solr module that exposes a compatible
> >>>>>   /update/extract handler, delegating to Tika-Server (user-hosted)
> >>>>> * Deprecate SolrCell in current form
> >>>>> * From 10.0, Solr will not ship with embedded Tika, only the new
> >>>>>   handler delegating to Tika-Server
> >>>>
> >>>> I was thinking something along these lines too. A separate JVM running
> >>>> Tika Server that can crash without taking Solr down, and communication so
> >>>> ERH can send commands to it, receive extracted data, and hopefully know
> >>>> when the other JVM crashes. If we design it well, then the framework
> >>>> could be used to integrate with other extraction mechanisms besides Tika.
> >>>> I think that would be quite a bit of work.
> >>>>
> >>>> It might be a good idea to make that a separate project as was done for
> >>>> DIH, but I have no way of guessing whether there is enough interest in the
> >>>> community to keep it maintained. If it's a separate project, then I think
> >>>> it would just incorporate SolrJ and Tika, rather than using a special
> >>>> handler. I have never used ERH in a production setting, and barely have
> >>>> experience with it in non-production.
> >>>>
> >>>> Thanks,
> >>>> Shawn
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> >>>> For additional commands, e-mail: dev-h...@solr.apache.org
> >>>
> >>> _______________________
> >>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> >>> http://www.opensourceconnections.com | My Free/Busy
> >>> <http://tinyurl.com/eric-cal>
> >>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> >>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> >>>
> >>> This e-mail and all contents, including attachments, is considered to be
> >>> Company Confidential unless explicitly stated otherwise, regardless of
> >>> whether attachments are marked as such.
> >>
> >> --
> >> http://www.needhamsoftware.com (work)
> >> http://www.the111shift.com (play)
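
The "thin module forwarding to Tika Server" idea discussed in this thread boils down to two steps: take the JSON that Tika Server returns for a document, then hand a flat document to Solr. The Python sketch below illustrates only that data flow and is not how SolrCell or any proposed module actually works. The assumptions: extracted text arrives under an "X-TIKA:content" key, Solr runs at http://localhost:8983/solr, and documents are posted to the collection's /update/json/docs handler. The field names (`content_txt`, the `_ss` suffix) and function names are hypothetical.

```python
import json
import re
import urllib.request

# Assumed key under which Tika's JSON output carries the extracted text.
TIKA_CONTENT_KEY = "X-TIKA:content"


def tika_json_to_solr_doc(tika_json, doc_id, content_field="content_txt"):
    """Map a Tika JSON response (a dict) to a flat Solr document.

    The naming scheme is hypothetical: metadata keys are lowercased,
    runs of non-alphanumerics become "_", and "_ss" is appended.
    """
    doc = {
        "id": doc_id,
        content_field: (tika_json.get(TIKA_CONTENT_KEY) or "").strip(),
    }
    for key, value in tika_json.items():
        if key == TIKA_CONTENT_KEY:
            continue
        field = re.sub(r"[^0-9a-z]+", "_", key.lower()).strip("_") + "_ss"
        # Metadata values may be a single string or a list; normalize to a list.
        doc[field] = value if isinstance(value, list) else [value]
    return doc


def build_solr_update_request(doc, solr_url="http://localhost:8983/solr",
                              collection="mycollection"):
    """Build (but do not send) a POST of one JSON document to Solr."""
    return urllib.request.Request(
        url=f"{solr_url.rstrip('/')}/{collection}/update/json/docs",
        data=json.dumps(doc).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

A real handler would also need the parts raised in the thread that this sketch leaves out: detecting a crashed or unresponsive Tika Server, timeouts, and compatibility with the existing /update/extract parameters.
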