Hello everyone,

I've been looking into the dependencies of the project and thought that we could update a couple of them, together with their license files (wherever necessary).
I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2, which is a huge step due to some restructuring of Apache Tika. The affected modules are extraction and langid. There is a PR from solrbot <https://github.com/apache/solr/pull/2583> that requires some manual work, which I have already picked up for learning purposes.

I'd like to create a ticket for the upgrade, but I also saw that there is SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>, titled "Deprecate Tika". From the age of the ticket and the conversation on it, it sounds like Tika will not be deprecated and the ticket can be closed. I am not sure, though, and would like to ask for your input on this.

In the migration to 2.9.2 there seem to be some conflicts with the way the title is extracted from documents. Some metadata tags have also been removed or replaced, which needs more attention. See Migrating to Tika 2.0.0 <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0> for more details.

I'd be happy to create a PR for the upgrade and look into the fixes with someone who has already worked with Apache Tika 2.X or the affected modules (extraction/langid).

Best,
Christos