On 3/7/2023 3:48 PM, Jan Høydahl wrote:
* Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
* Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
* Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
* Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
As you likely know, the big problem is that Tika has a habit of crashing or misbehaving, particularly with PDFs, and if it's running inside Solr, then Solr itself is going to suffer whatever bad effects Tika causes.
My current thinking / proposal is to:
* Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
* Deprecate SolrCell in current form
* From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server
I was thinking something along these lines too. A separate JVM running Tika Server that can crash without taking Solr down, and communication so ERH can send commands to it, receive extracted data, and hopefully know when the other JVM crashes. If we design it well, then the framework could be used to integrate with other extraction mechanisms besides Tika. I think that would be quite a bit of work.
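To make the out-of-process idea a bit more concrete, here is a rough sketch of what the calling side might look like. It is only an illustration under assumptions, not a proposed design: the class name TikaServerClientSketch and its methods are made up for this example, it assumes a Tika Server reachable at a base URL such as http://localhost:9998 (the Tika Server default port), and it uses the server's PUT /tika endpoint to get plain text back. A refused connection, dropped connection, or timeout is treated as "extractor unavailable" rather than as a failure of the caller, which is the isolation property discussed above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper, not part of Solr or Tika.
final class TikaServerClientSketch {
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final URI base;

    // baseUrl is an assumption, e.g. "http://localhost:9998".
    public TikaServerClientSketch(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    // Builds a PUT to Tika Server's /tika endpoint, asking for plain text.
    public HttpRequest buildExtractRequest(byte[] document) {
        return HttpRequest.newBuilder(base.resolve("/tika"))
                .timeout(Duration.ofSeconds(60)) // bound the wait on a hung parse
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Returns the extracted text, or empty if the server is down,
    // times out, or answers with a non-200 status.
    public Optional<String> extractText(byte[] document) throws InterruptedException {
        try {
            HttpResponse<String> resp =
                    http.send(buildExtractRequest(document), HttpResponse.BodyHandlers.ofString());
            return resp.statusCode() == 200 ? Optional.of(resp.body()) : Optional.empty();
        } catch (IOException e) {
            // Connection refused/reset or timeout: the other JVM may have crashed.
            return Optional.empty();
        }
    }
}
```

A caller would check the returned Optional and decide whether to retry, skip the document, or report the error. Swapping in a different extraction service would mostly mean changing the request-building method.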
It might be a good idea to make that a separate project as was done for DIH, but I have no way of guessing whether there is enough interest in the community to keep it maintained. If it's a separate project, then I think it would just incorporate SolrJ and Tika, rather than using a special handler. I have never used ERH in a production setting, and barely have experience with it in non-production.
Thanks,
Shawn