As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web
service https://github.com/mattflax/dropwizard-tika-server written by a
colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca>
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__
> lucidworks.com_2012_02_14_indexing-2Dwith-2Dsolrj_&d=DwIGaQ&c=jdm1Hby_
> BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-
> HO9gO9CysWnvGGoKrSNEuM3U&m=7XZTNWKY6A53HuY_2qeWA_
> 3ndvYmpHBHjZXJ5pTMP2w&s=YbP_o22QJ_tsZDUPgSfDvEXZ9asBUFFHz53s2yTH8Q0&e=
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and the
> link to Erick's blog post whenever Tika is used. 😊
>
>
>
>
>
> -----Original Message-----
>
> From: Charlie Hull [mailto:char...@flax.co.uk]
>
> Sent: Monday, April 9, 2018 12:44 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
>
>
> I'd recommend you run Tika externally to Solr, which will allow you to
> catch this kind of problem and prevent it bringing down your Solr
> installation.
>
>
>
> Cheers
>
>
>
> Charlie
>
>
>
> On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca>
>
> wrote:
>
>
>
> > Hello!
>
> >
>
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>
> > we have in our Sharepoint system. I have used the tika-app.jar
>
> > directly to extract the document in question and it does _not_ throw
>
> > an exception and extract the contents just fine. So it would seem Solr
>
> > is doing something different than a Tika standalone installation.
>
> >
>
> > After some Googling, I found out that Solr uses its custom HtmlMapper
>
> > (MostlyPassthroughHtmlMapper) which passes through all elements in the
>
> > HTML document to Tika. As Tika limits nested elements to 100, this
>
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>
> > XML element nesting. This is metioned in TIKA-2091
>
> > (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__issues.apache.org_&d=DwIGaQ&c=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDu
> vdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U&m=
> 7XZTNWKY6A53HuY_2qeWA_3ndvYmpHBHjZXJ5pTMP2w&s=Il6-
> in8tGiAN3MaNlXmqvIkc3VyCCeG2qK2cGyMOuw0&e= jira/browse/TIKA-2091?
> focusedCommentId=15514131&page=com.atlassian.jira.
>
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The
>
> > "solution" is to use Tika's default parsing/mapping mechanism but no
>
> > details have been provided on how to configure this at Solr.
>
> >
>
> > I'm hoping some folks here have the knowledge on how to configure Solr
>
> > to effectively by-pass its built in MostlyPassthroughHtmlMapper and
>
> > use Tika's implementation.
>
> >
>
> > Thank you!
>
> > Harinder
>
> >
>
> >
>
> > ________________________________
>
> > NOTICE -
>
> > This communication is intended ONLY for the use of the person or
>
> > entity named above and may contain information that is confidential or
>
> > legally privileged. If you are not the intended recipient named above
>
> > or a person responsible for delivering messages or communications to
>
> > the intended recipient, YOU ARE HEREBY NOTIFIED that any use,
>
> > distribution, or copying of this communication or any of the
>
> > information contained in it is strictly prohibited. If you have
>
> > received this communication in error, please notify us immediately by
>
> > telephone and then destroy or delete this communication, or return it
>
> > to us by mail if requested by us. The City of Calgary thanks you for
> your attention and co-operation.
>
> >
>
>

Reply via email to