[ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190751#comment-17190751 ]
Alexandre Rafalovitch commented on SOLR-7632:
---------------------------------------------

I agree on the critical path. I was just wondering whether, given the number of internal changes and the explanations required at release time, it would make sense to also move to a more flexible architecture on the Solr side.

Making it a URP (UpdateRequestProcessor) would, I think, allow composing it with other pipeline elements in a different order (e.g. preprocess the file name, feed to Tika, apply DateParser), or possibly even distributing the load by running it on each node instead of as the first step.

But that's just an idea. If others do not see the benefits, it is not worth chasing.

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>            Priority: Major
>              Labels: gsoc2017, memex
>
> It's a pain to upgrade Tika's jars all the times when we release, and if Tika
> fails it messes up the ExtractingRequestHandler (e.g., the document type
> caused Tika to fail, etc). A more reliable way and also separated, and easier
> to deploy version of the ExtractingRequestHandler would make a network call
> to the Tika JAXRS server, and then call Tika on the Solr server side, get the
> results and then index the information that way. I have a patch in the works
> from the DARPA Memex project and I hope to post it soon.
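
To make the composition idea from the comment concrete (preprocess the file name, feed to Tika, apply a date parser), here is a minimal, language-neutral sketch in Python. It is not Solr's UpdateRequestProcessor API and not the patch discussed in the issue: every name in it (`normalize_filename`, `make_extract_stage`, `parse_dates`, `run_chain`, `fake_extract`) is hypothetical, and the extractor is a local stand-in for what would be a network call to a Tika server. The date format is also an assumption chosen only for illustration.

```python
from datetime import datetime
from typing import Any, Callable, Dict, Iterable

Doc = Dict[str, Any]
Stage = Callable[[Doc], Doc]


def normalize_filename(doc: Doc) -> Doc:
    """Hypothetical pre-processing stage: trim and lower-case the file name."""
    out = dict(doc)
    out["filename"] = str(doc.get("filename", "")).strip().lower()
    return out


def make_extract_stage(extract: Callable[[bytes], Dict[str, str]]) -> Stage:
    """Wrap an extractor (for example a client for a remote Tika server) as a stage."""
    def stage(doc: Doc) -> Doc:
        out = dict(doc)
        out.update(extract(doc["raw"]))  # merge extracted fields into the document
        return out
    return stage


def parse_dates(doc: Doc) -> Doc:
    """Hypothetical post-processing stage: rewrite 'created' from DD/MM/YYYY to ISO 8601."""
    out = dict(doc)
    created = doc.get("created")
    if isinstance(created, str):
        out["created"] = datetime.strptime(created, "%d/%m/%Y").date().isoformat()
    return out


def run_chain(stages: Iterable[Stage], doc: Doc) -> Doc:
    """Apply each stage in order; reordering the list reorders the pipeline."""
    for stage in stages:
        doc = stage(doc)
    return doc


def fake_extract(raw: bytes) -> Dict[str, str]:
    # Stand-in for the remote extraction call (assumption: no real server is contacted).
    return {"content": raw.decode("utf-8"), "created": "04/09/2020"}


chain = [normalize_filename, make_extract_stage(fake_extract), parse_dates]
result = run_chain(chain, {"filename": "  Report.TXT ", "raw": b"hello"})
```

Because each stage only takes and returns a document, the extraction step can be moved, repeated, or swapped for a different extractor without touching the other stages, which is the flexibility the comment is pointing at.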