As a bonus, here's a Dropwizard Tika wrapper, written by a colleague of mine at Flax, that gives you a Tika web service:
https://github.com/mattflax/dropwizard-tika-server

Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> We should add a chatbot to the list that includes Charlie's advice and
> the link to Erick's blog post whenever Tika is used. 😊
>
> > -----Original Message-----
> > From: Charlie Hull [mailto:char...@flax.co.uk]
> > Sent: Monday, April 9, 2018 12:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> > document instead of Solr's MostlyPassthroughHtmlMapper ?
> >
> > I'd recommend you run Tika externally to Solr, which will allow you
> > to catch this kind of problem and prevent it bringing down your Solr
> > installation.
> >
> > Cheers
> >
> > Charlie
> >
> > On 9 April 2018 at 16:59, Hanjan, Harinder
> > <harinder.han...@calgary.ca> wrote:
> >
> > > Hello!
> > >
> > > Solr (i.e. Tika) throws a "zip bomb" exception with certain
> > > documents we have in our SharePoint system. I have used the
> > > tika-app.jar directly to extract the document in question and it
> > > does _not_ throw an exception and extracts the contents just fine.
> > > So it would seem Solr is doing something different than a Tika
> > > standalone installation.
> > >
> > > After some Googling, I found out that Solr uses its custom
> > > HtmlMapper (MostlyPassthroughHtmlMapper) which passes through all
> > > elements in the HTML document to Tika. As Tika limits nested
> > > elements to 100, this causes Tika to throw an exception: Suspected
> > > zip bomb: 100 levels of XML element nesting. This is mentioned in
> > > TIKA-2091
> > > (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> > > The "solution" is to use Tika's default parsing/mapping mechanism
> > > but no details have been provided on how to configure this in Solr.
> > >
> > > I'm hoping some folks here have the knowledge on how to configure
> > > Solr to effectively bypass its built-in MostlyPassthroughHtmlMapper
> > > and use Tika's implementation.
> > >
> > > Thank you!
> > > Harinder
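
For anyone trying to work out which documents are likely to trip the nesting limit described in the original question before they get anywhere near Solr, here is a rough Java sketch that measures element nesting depth with the JDK's own SAX parser. Treat it as an illustration only: the class and method names (NestingDepthCheck, maxDepth, reachesLimit) are made up for this example, it is not how Tika implements its check, and it only accepts well-formed XML, which much real-world HTML is not. Whether Tika trips at exactly the limit or just past it is also an assumption here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical pre-flight check; not part of Tika or Solr. */
class NestingDepthCheck {

    /** Records the deepest element nesting seen during a SAX parse. */
    private static final class DepthHandler extends DefaultHandler {
        private int current = 0;
        private int max = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            current++;
            if (current > max) {
                max = current;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            current--;
        }
    }

    /** Returns the maximum element nesting depth of a well-formed XML string. */
    static int maxDepth(String xml) {
        DepthHandler handler = new DepthHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Surface parse problems as an unchecked exception so callers can skip the document.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.max;
    }

    /** True when the document nests at least `limit` levels deep (assumed comparison). */
    static boolean reachesLimit(String xml, int limit) {
        return maxDepth(xml) >= limit;
    }

    public static void main(String[] args) {
        String sample = "<html><body><div><p>hello</p></div></body></html>";
        System.out.println("depth = " + maxDepth(sample));
        System.out.println("reaches 100? " + reachesLimit(sample, 100));
    }
}
```

In an external indexing app along the lines Charlie suggests, a check like this could run before extraction so that suspect documents are logged and skipped rather than failing the whole batch.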