I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it from bringing down your Solr installation.
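As a rough sketch of what running Tika outside Solr could look like: the script below sends each file to a separately running Tika Server, catches any extraction failure per document, and posts only the successfully extracted text to Solr's JSON update handler. The host names, ports, the collection name `docs`, and the field name `content_txt` are assumptions for illustration, not details of your setup, so adjust them to match your deployment and schema.

```python
import json
import urllib.request

# Assumed endpoints; change these to match your own deployment.
TIKA_URL = "http://localhost:9998/tika"  # Tika Server text extraction endpoint
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"  # "docs" is a hypothetical collection


def extract_text(data, tika_url=TIKA_URL, timeout=60.0):
    """PUT raw document bytes to Tika Server and return the extracted plain text."""
    req = urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_solr_doc(doc_id, text):
    """Build one Solr document; "content_txt" is an assumed field name."""
    return {"id": doc_id, "content_txt": text}


def post_to_solr(docs, solr_url=SOLR_UPDATE_URL, timeout=60.0):
    """POST a list of documents to Solr's update handler as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def index_documents(items, extract=extract_text, send=post_to_solr):
    """Extract and index (doc_id, bytes) pairs.

    A failure while extracting one document is recorded and skipped, so it
    never reaches Solr. Returns (indexed_ids, failures).
    """
    docs, failures = [], []
    for doc_id, data in items:
        try:
            text = extract(data)
        except OSError as exc:
            # urllib's HTTPError and URLError, and socket timeouts, are all
            # OSError subclasses, so this covers Tika error responses too.
            failures.append((doc_id, str(exc)))
            continue
        docs.append(build_solr_doc(doc_id, text))
    if docs:
        send(docs)
    return [d["id"] for d in docs], failures
```

With this arrangement a document that makes Tika fail ends up in the `failures` list for later inspection, and Solr only ever receives plain text.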
Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our Sharepoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extracts the contents just fine. So it would seem Solr is doing something
> different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this at Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr to
> effectively by-pass its built in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder