Hi,

This has been a recurring topic, and there have been many suggestions for what 
to do with "Solr Cell" aka Extracting Request Handler aka Tika.

Most agree it's a bad idea to parse huge PDFs in Solr's JVM process like we do.

Proposals over the years have been:

* Move SolrCell to a package, outside of Solr's tarball SOLR-15951 
<https://issues.apache.org/jira/browse/SOLR-15951>
* Deprecate SolrCell SOLR-13973 
<https://issues.apache.org/jira/browse/SOLR-13973>
* Keep in Solr but use Tika-Server 
<https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632 
<https://issues.apache.org/jira/browse/SOLR-7632>
* Integrate Tika client-side SOLR-1526 
<https://issues.apache.org/jira/browse/SOLR-1526>

We should make a plan now for what the Tika story will be for Solr 10.0. We 
should not underestimate the number of Solr users who rely on SolrCell, and 
should therefore not take this decision lightly. A well-communicated story and 
a well-executed migration path will keep users satisfied. A bad experience 
will repel them.

Personally, I prefer to run Tika on the client side and index the 
already-extracted text into Solr. We already document that Solr Cell is not 
recommended for production use 
<https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications>.
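
To illustrate the client-side approach, here is a minimal Python sketch. It 
assumes a Tika-Server on localhost:9998 and a Solr collection named "docs" with 
a "content_txt" field; the host names, collection, field name and function 
names are all placeholders for illustration, not part of any proposal. It sends 
the raw file to Tika-Server's /tika endpoint and posts the returned plain text 
to Solr's JSON update handler.

```python
import json
import urllib.request

# Assumed locations; adjust to your own deployment.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika-Server for the plain-text content."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id: str, text: str,
                       solr_url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a JSON update request; "content_txt" is a hypothetical field name."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path: str, doc_id: str) -> int:
    """Extract text outside Solr's JVM, then index only the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        return resp.status
```

With both services running, extract_and_index("report.pdf", "report-1") would 
push one document. The point is only that the heavy parsing happens in a 
separate process, so a huge or malformed PDF cannot take down Solr.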

My current thinking / proposal is to:
* Build a new, thin Solr module that exposes a compatible /update/extract 
handler, delegating to Tika-Server (user-hosted)
* Deprecate SolrCell in its current form
* From 10.0, Solr will not ship with embedded Tika, only the new handler 
delegating to Tika-Server

WDYT?

Jan
