Hi,

This has been a recurring topic, and there have been many suggestions for what to do with "Solr Cell" aka Extracting Request Handler aka Tika.
Most agree it's a bad idea to parse huge PDFs inside Solr's JVM process, as we do today. Proposals over the years have been:

* Move SolrCell to a package, outside of Solr's tarball: SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
* Deprecate SolrCell: SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
* Keep it in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>: SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
* Integrate Tika client-side: SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>

We should make a plan now for what the Tika story will be for Solr 10.0. We should not underestimate the number of Solr users who rely on SolrCell, and should therefore not take this decision lightly. A well-communicated story and a well-executed migration path will keep users satisfied. A bad experience will repel users.

Personally I prefer to run Tika on the client side and index the already-extracted text into Solr. We already document that Solr Cell is not recommended for production use <https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications>.

My current thinking / proposal is to:

* Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
* Deprecate SolrCell in its current form
* From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server

WDYT?

Jan
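
P.S. To make the client-side option concrete, here is a rough sketch of what "run Tika outside Solr and index the extracted text" could look like. It is an illustration under assumptions, not a proposed implementation: it assumes a user-hosted Tika-Server on its default port 9998, a Solr collection named "docs", and a text field named "content_txt". The collection name, the field name, and the helper function names are all hypothetical.

```python
import json
import urllib.request

# Assumed endpoints: adjust to your own deployment.
TIKA_URL = "http://localhost:9998/tika"  # Tika-Server text extraction endpoint
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"  # "docs" is hypothetical


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends the raw file bytes to Tika-Server
    and asks for the extracted plain text in the response."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id: str, text: str,
                       solr_url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a JSON update request that indexes the already-extracted text.
    The field name "content_txt" is an assumption about the schema."""
    payload = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(path: str, doc_id: str) -> int:
    """Extract text with Tika-Server, then index it into Solr.
    Returns the HTTP status of the Solr update. Needs both services running."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        return resp.status
```

With both services up, something like extract_and_index("report.pdf", "report-1") would do the round trip. The point is only that the heavy parsing happens in the Tika-Server process, so a problematic PDF cannot take down Solr's JVM.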