Hello!

Solr (i.e. Tika) throws a "zip bomb" exception for certain documents we have 
in our SharePoint system. I ran tika-app.jar directly against the document in 
question, and it does _not_ throw an exception; it extracts the contents just 
fine. So it would seem Solr is doing something different from a standalone 
Tika installation.

After some Googling, I found out that Solr uses its own custom HtmlMapper 
(MostlyPassthroughHtmlMapper), which passes all elements in the HTML 
document through to Tika. As Tika limits element nesting to 100 levels, this 
causes Tika to throw an exception: "Suspected zip bomb: 100 levels of XML 
element nesting". This is mentioned in TIKA-2091 
(https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
The "solution" is to use Tika's default parsing/mapping mechanism, but no 
details have been provided on how to configure this in Solr.
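
To make the nesting limit concrete, here is a minimal sketch of how element 
nesting depth can be counted with the JDK's own SAX parser. It is not Tika's 
or Solr's code: the class name NestingDepth, the methods maxDepth and 
exceedsLimit, and the limit argument are my own illustrative names, and it 
assumes the input is well-formed XML/XHTML. Note also that, depending on the 
version, the JDK parser may enforce its own depth limit through the 
jdk.xml.maxElementDepth property.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only; not part of Tika or Solr.
public class NestingDepth {

    /** Returns the deepest level of element nesting in a well-formed XML string. */
    static int maxDepth(String xml) throws Exception {
        final int[] depth = {0, 0}; // depth[0] = current depth, depth[1] = maximum seen
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                depth[0]++; // one level deeper for every opening tag
                if (depth[0] > depth[1]) {
                    depth[1] = depth[0];
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth[0]--; // back up one level for every closing tag
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return depth[1];
    }

    /** True if the document reaches the given nesting limit (assumed: 100 for Tika). */
    static boolean exceedsLimit(String xml, int limit) throws Exception {
        return maxDepth(xml) >= limit;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><div><p><b>text</b></p></div></body></html>";
        System.out.println("max nesting depth: " + maxDepth(xml));
        System.out.println("reaches a limit of 100: " + exceedsLimit(xml, 100));
    }
}
```

Because a pass-through mapper forwards every element, a deeply nested 
SharePoint page keeps all of its levels, which would explain why the same 
document can trip the limit in Solr but not in tika-app.jar.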

I'm hoping some folks here know how to configure Solr to bypass its built-in 
MostlyPassthroughHtmlMapper and use Tika's default implementation instead.

Thank you!
Harinder

