Hi

Regarding Tika, I had been hoping that we could replace the extracting handler
with a processor that delegates to Tika Server but otherwise has feature
parity. That would remove tons of dependencies and attack surface from Solr.

I tried a POC once but could not find a suitable Java client for the Tika
Server REST API. Perhaps one exists now?
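
As a rough illustration of what the delegation could look like without a
dedicated client library, here is a minimal sketch that uses only the JDK's
built-in java.net.http.HttpClient (Java 11+). It assumes a Tika Server
reachable at a base URL such as http://localhost:9998 and calls the server's
PUT /tika endpoint with "Accept: text/plain". The class TikaServerClient and
its method names are hypothetical and are not part of Solr or Tika.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical thin client for a Tika Server instance (illustration only). */
public class TikaServerClient {
    private final URI baseUri;
    private final HttpClient http = HttpClient.newHttpClient();

    /** @param baseUrl assumed server location, e.g. "http://localhost:9998" */
    public TikaServerClient(String baseUrl) {
        // Ensure a trailing slash so that resolve("tika") appends to the path.
        this.baseUri = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    /** Builds the PUT /tika request that asks for plain-text extraction. */
    public HttpRequest buildExtractRequest(byte[] document) {
        return HttpRequest.newBuilder(baseUri.resolve("tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the raw document bytes and returns the extracted text. */
    public String extractText(byte[] document) throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(
                buildExtractRequest(document),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A processor could then call something like
new TikaServerClient("http://localhost:9998").extractText(bytes) per document.
Metadata mapping, timeouts, and error handling would still be needed to reach
feature parity with the current extracting handler.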

Jan Høydahl

> On 12 Aug 2024, at 16:20, Christos Malliaridis <c.malliari...@gmail.com> wrote:
> 
> Hello everyone,
> 
> I've been looking into the dependencies of the project and thought that we
> could update a couple of them, together with their license files (wherever
> necessary).
> 
> I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> which is a huge step due to some restructuring of Apache Tika. The affected
> modules are extraction and langid.
> 
> There is a PR from solrbot <https://github.com/apache/solr/pull/2583> that
> requires some manual work that I have already picked up for learning
> purposes. I'd like to create a ticket for the upgrade, but saw that
> there is also SOLR-13973
> <https://issues.apache.org/jira/browse/SOLR-13973> that
> is titled "Deprecate Tika". From the age and conversation on the ticket, it
> sounds like Tika will not be deprecated and the ticket can be closed. But I
> am not sure and would like to ask for your input on this.
> 
> In the migration to 2.9.2 it seems that there are some conflicts with the
> way the title from documents is extracted. Some metadata tags have also
> been removed / replaced, which needs more attention. See Migrating to Tika
> 2.0.0
> <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0> for
> more details.
> 
> I'd be happy to create a PR for the upgrade and look into the fixes with
> someone who has already worked with Apache Tika 2.X or the affected
> modules (extraction/langid).
> 
> Best,
> Christos

