How to use Tika (Solr Cell) to extract content from HTML documents instead of Solr's MostlyPassthroughHtmlMapper?

Hanjan, Harinder Mon, 09 Apr 2018 09:01:01 -0700

Hello!

Solr (i.e. Tika) throws a "zip bomb" exception for certain documents in our
SharePoint system. When I run tika-app.jar directly against the document in
question, it does _not_ throw an exception and extracts the contents just
fine. So it would seem Solr is doing something different from a standalone
Tika installation.

After some Googling, I found out that Solr uses its own custom HtmlMapper
(MostlyPassthroughHtmlMapper), which passes through all elements in the HTML
document to Tika. Because Tika limits element nesting to 100 levels, deeply
nested documents cause Tika to throw an exception: "Suspected zip bomb: 100
levels of XML element nesting". This is mentioned in TIKA-2091
(https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
The "solution" is to use Tika's default parsing/mapping mechanism, but no
details have been provided on how to configure this in Solr.
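
To help narrow down which documents are affected, here is a rough diagnostic
sketch in Python (standard library only) that estimates the deepest element
nesting in an HTML file. It is not Tika's or Solr's parser, and the names
DepthCounter and max_nesting_depth are made up for illustration, so treat the
result as an approximation: Tika normalizes markup differently and may count
slightly different depths.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not add to the depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "param", "source", "track", "wbr"}


class DepthCounter(HTMLParser):
    """Tracks the deepest stack of open elements seen while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_nesting_depth(html_text):
    """Return the approximate maximum element nesting depth of html_text."""
    counter = DepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.max_depth
```

Running max_nesting_depth() over the text of each exported document and
flagging anything at or near 100 (the limit quoted in the exception message)
should give a shortlist of candidates to test against Solr.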

I'm hoping some folks here know how to configure Solr to effectively bypass
its built-in MostlyPassthroughHtmlMapper and use Tika's implementation
instead.

Thank you!
Harinder

