RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

Hanjan, Harinder Mon, 09 Apr 2018 11:26:49 -0700

Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr.
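
For anyone following the thread, here is a rough sketch of the shape that plan could take: extraction runs in the client app, and a document that fails to parse is skipped instead of reaching Solr. The `Extractor` and `Indexer` interfaces and the class name are hypothetical placeholders, not part of Tika or SolrJ. In a real app the extractor would presumably wrap something like Tika's `Tika.parseToString(...)` and the indexer something like SolrJ's `SolrClient.add(...)`. The stubs in `main` only simulate a parse failure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExternalExtractionSketch {

    /** Hypothetical stand-in for the Tika call that turns a document into text. */
    public interface Extractor {
        String extract(String docId) throws Exception;
    }

    /** Hypothetical stand-in for the SolrJ call that sends the text to Solr. */
    public interface Indexer {
        void index(String docId, String text) throws Exception;
    }

    /**
     * Extracts and indexes each document in turn. An extraction failure is
     * recorded and the document is skipped, so one bad file does not stop
     * the run. Returns the ids of the documents that failed extraction.
     */
    public static List<String> run(List<String> docIds, Extractor extractor, Indexer indexer)
            throws Exception {
        List<String> failed = new ArrayList<>();
        for (String id : docIds) {
            String text;
            try {
                text = extractor.extract(id);
            } catch (Exception e) {
                // e.g. a "Suspected zip bomb" parse error: note it and move on
                failed.add(id);
                continue;
            }
            indexer.index(id, text);
        }
        return failed;
    }

    public static void main(String[] args) throws Exception {
        List<String> indexed = new ArrayList<>();
        // Stub extractor: pretends one document trips the parser.
        Extractor fake = id -> {
            if (id.equals("bad.html")) {
                throw new IllegalStateException("Suspected zip bomb");
            }
            return "text of " + id;
        };
        List<String> failed = run(Arrays.asList("a.html", "bad.html", "b.html"),
                fake, (id, text) -> indexed.add(id));
        System.out.println("indexed=" + indexed + " failed=" + failed);
    }
}
```

This only guards against exceptions. If the parser can exhaust memory or hang, running the extraction in a separate process would be the safer variant of the same idea.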



-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, April 09, 2018 11:24 AM
To: solr-user@lucene.apache.org
Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

+1



https://lucidworks.com/2012/02/14/indexing-with-solrj/



We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 😊





-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:


> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> we have in our SharePoint system. I have used the tika-app.jar
> directly to extract the document in question and it does _not_ throw
> an exception and extracts the contents just fine. So it would seem Solr
> is doing something different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the
> HTML document to Tika. As Tika limits nested elements to 100, this
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of
> XML element nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr
> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
> use Tika's implementation.
>
> Thank you!
> Harinder

