Hi,

Wrt Tika, I had been hoping that we could replace the extracting handler with a processor that delegates to Tika Server but otherwise has feature parity. It would remove tons of dependencies and attack surface from Solr.
I tried a POC once but could not find a suitable Java client for the Tika Server REST API. Perhaps one exists now?

Jan Høydahl

> On 12 Aug 2024, at 16:20, Christos Malliaridis <c.malliari...@gmail.com> wrote:
>
> Hello everyone,
>
> I've been looking into the dependencies of the project and thought that we
> could update a couple of them, together with their license files (wherever
> necessary).
>
> I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> which is a huge step due to some restructuring of Apache Tika. The affected
> modules are extraction and langid.
>
> There is a PR from solrbot <https://github.com/apache/solr/pull/2583> that
> requires some manual work that I have already picked up for learning
> purposes. I'd like to create a ticket for the upgrade, but also saw that
> there is also SOLR-13973
> <https://issues.apache.org/jira/browse/SOLR-13973> that
> is titled "Deprecate Tika". From the age and conversation on the ticket, it
> sounds like Tika will not be deprecated and the ticket can be closed. But I
> am not sure and would like to ask for your input on this.
>
> In the migration to 2.9.2 it seems that there are some conflicts with the
> way the title from documents is extracted. Some metadata tags have also
> been removed / replaced, which needs more attention. See Migrating to Tika
> 2.0.0
> <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0> for
> more details.
>
> I'd be happy to create a PR for the upgrade and look into the fixes with
> someone that has already worked with Apache Tika 2.X or the affected
> modules (extraction/langid).
>
> Best,
> Christos