Re: [DISCUSS] Future of SolrCell in Solr

2023-03-23 Thread Tim Allison
Sounds good, Jan. If you're heading in this direction, I'd recommend the /tika endpoint with an Accept header set to "application/json". Please let me know if I can help.

Best,
Tim

On Thu, Mar 23, 2023 at 2:43 PM Jan Høydahl wrote:
> Documentation wise we can re-write the chapter we
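
For illustration, a call to the /tika endpoint with that Accept header might look like the sketch below. It is only a sketch under assumptions that are not stated in the thread: a tika-server instance listening on localhost:9998, and the helper names build_tika_request and extract, which are made up here. The exact shape of the JSON that comes back depends on the Tika version, so the code just decodes it as-is.

```python
import json
import urllib.request

# Assumption: tika-server is reachable locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the /tika endpoint for a JSON response."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            # The header recommended in the message above.
            "Accept": "application/json",
            # Set explicitly so urllib does not default to a form-encoded type.
            "Content-Type": "application/octet-stream",
        },
    )


def extract(data: bytes, url: str = TIKA_URL) -> dict:
    """Send the document bytes to tika-server and decode the JSON reply.

    Hypothetical helper: needs a running tika-server, and the keys in the
    returned dict are whatever that server version emits.
    """
    with urllib.request.urlopen(build_tika_request(data, url), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A caller would read the file as bytes and pass it to extract(); a client on the Solr side could then map the returned fields onto a Solr document.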

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-23 Thread Jan Høydahl
Documentation-wise, we can rewrite the chapter we have on rich text indexing to mention several options, including tika-server and tika-pipes with the Solr emitter. With regard to a SolrCell successor, I still think a super-thin module forwarding to TikaServer is the best. Users would get the same features and API as t

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-23 Thread Tim Allison
Apologies for being late to the show, and thank you Eric for pinging me on this. I'm 100% for moving Tika out of Solr's JVM. I see three options for doing so, each of which makes things easier for users and keeps Tika's jar hell all to itself. 1) As already proposed, use T

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-09 Thread Gus Heck
I totally think that for any heavy-duty use case, or any use case where the documents are not constrained to a known set with polite characteristics (i.e. known not to be password protected, of reasonable length, etc.), Tika should not run inside Solr. That said, as I see it the key downside of n

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-09 Thread Eric Pugh
I did a series of blog posts about Tika, and while conventional wisdom is that running Tika in Solr is bad, I’ve had GREAT luck with it over the years. https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/

Re: [DISCUSS] Future of SolrCell in Solr

2023-03-08 Thread Shawn Heisey
On 3/7/2023 3:48 PM, Jan Høydahl wrote:
* Move SolrCell to a package, outside of Solr's tarball SOLR-15951
* Deprecate SolrCell SOLR-13973
* Keep in Solr but use Tika-Server

[DISCUSS] Future of SolrCell in Solr

2023-03-07 Thread Jan Høydahl
Hi,

This has been a recurring topic, and there have been many suggestions for what to do with "Solr Cell" aka Extracting Request Handler aka Tika. Most agree it's a bad idea to parse huge PDFs in Solr's JVM process like we do. Proposals over the years have been

* Move SolrCell to a package, ou