Hello! Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have in our SharePoint system. I ran tika-app.jar directly against the document in question, and it does _not_ throw an exception; it extracts the contents just fine. So it would seem Solr is doing something different from a standalone Tika installation.
After some Googling, I found out that Solr uses its own custom HtmlMapper (MostlyPassthroughHtmlMapper), which passes all elements in the HTML document through to Tika. Since Tika limits nested elements to 100 levels, this causes Tika to throw an exception: "Suspected zip bomb: 100 levels of XML element nesting". This is mentioned in TIKA-2091 (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The "solution" is to use Tika's default parsing/mapping mechanism, but no details were provided on how to configure this in Solr. I'm hoping some folks here know how to configure Solr to bypass its built-in MostlyPassthroughHtmlMapper and use Tika's implementation instead.

Thank you!
Harinder