Re: [DISCUSS] Future of SolrCell in Solr

Shawn Heisey Wed, 08 Mar 2023 18:49:49 -0800

On 3/7/2023 3:48 PM, Jan Høydahl wrote:

* Move SolrCell to a package, outside of Solr's tarball SOLR-15951 
<https://issues.apache.org/jira/browse/SOLR-15951>
* Deprecate SolrCell SOLR-13973 
<https://issues.apache.org/jira/browse/SOLR-13973>
* Keep in Solr but use Tika-Server 
<https://cwiki.apache.org/confluence/display/TIKA/TikaServer>,  SOLR-7632 
<https://issues.apache.org/jira/browse/SOLR-7632>
* Integrate Tika client-side SOLR-1526 
<https://issues.apache.org/jira/browse/SOLR-1526>

As you likely know, the big problem is that Tika has a habit of crashing or misbehaving, particularly with PDFs, and if it's running inside Solr, then Solr itself is going to suffer whatever bad effects Tika causes.

My current thinking / proposal is to:
* Build a new, thin Solr module that exposes a compatible /update/extract 
handler, delegating to Tika-Server (user-hosted)
* Deprecate SolrCell in current form
* From 10.0, Solr will not ship with embedded Tika, only the new handler 
delegating to Tika-Server

I was thinking something along these lines too. A separate JVM running Tika Server that can crash without taking Solr down, and communication so ERH can send commands to it, receive extracted data, and hopefully know when the other JVM crashes. If we design it well, then the framework could be used to integrate with other extraction mechanisms besides Tika. I think that would be quite a bit of work.
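To make the idea concrete, here is a rough sketch of the calling side, written in Python only to keep it short. It assumes a Tika Server listening on its default port 9998 and uses the server's PUT /tika endpoint with "Accept: text/plain". The names extract_text and ExtractResult are made up for illustration and are not part of Solr or Tika. The point is only that a crashed or unreachable Tika JVM becomes an error value the caller can handle, rather than something that takes the caller down.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional

# Assumption: Tika Server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


@dataclass
class ExtractResult:
    """Hypothetical result type: exactly one of text / error is set."""
    text: Optional[str]
    error: Optional[str]


def extract_text(data: bytes, tika_url: str = TIKA_URL,
                 timeout: float = 30.0) -> ExtractResult:
    """Send raw document bytes to Tika Server and return the plain text.

    Any failure in the other process (crash, hang, HTTP error) is
    reported in the result instead of being raised.
    """
    req = urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return ExtractResult(text=body, error=None)
    except urllib.error.HTTPError as e:
        # Tika answered, but could not process this document.
        return ExtractResult(text=None, error=f"tika returned HTTP {e.code}")
    except OSError as e:
        # Covers URLError, refused connections, and timeouts:
        # the other JVM is down or not responding.
        return ExtractResult(text=None, error=f"tika unreachable: {e}")
```

A handler built this way could turn the error case into a normal failure response for that one document and keep serving everything else.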

It might be a good idea to make that a separate project as was done for DIH, but I have no way of guessing whether there is enough interest in the community to keep it maintained. If it's a separate project, then I think it would just incorporate SolrJ and Tika, rather than using a special handler. I have never used ERH in a production setting, and barely have experience with it in non-production.


Thanks,
Shawn
