Hello everyone,

I've been looking into the dependencies of the project and thought that we
could update a couple of them, together with their license files (wherever
necessary).

I started with Apache Tika, upgrading it from 1.28.5 to 2.9.2,
which is a huge step due to some restructuring of Apache Tika. The affected
modules are extraction and langid.

There is a PR from solrbot <https://github.com/apache/solr/pull/2583> that
requires some manual work that I have already picked up for learning
purposes. I'd like to create a ticket for the upgrade, but I saw that
there is already SOLR-13973
<https://issues.apache.org/jira/browse/SOLR-13973>, which
is titled "Deprecate Tika". Judging from the age of the ticket and the
conversation on it, it sounds like Tika will not be deprecated and the
ticket can be closed. But I am not sure and would like to ask for your
input on this.

In the migration to 2.9.2, there seem to be some conflicts with the
way document titles are extracted. Some metadata tags have also
been removed or replaced, which needs more attention. See Migrating to Tika
2.0.0
<https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0> for
more details.

I'd be happy to create a PR for the upgrade and look into the fixes with
someone who has already worked with Apache Tika 2.x or the affected
modules (extraction/langid).

Best,
Christos
