Re: detailed Error reporting in Solr

Jack Krupansky Thu, 04 Apr 2013 10:40:44 -0700

I'm trying to understand what the context is here... are you trying to crawl web pages that have bad HTML? Or, ... what?


-- Jack Krupansky

-----Original Message-----
From: eShard

Sent: Thursday, April 04, 2013 10:23 AM
To: [email protected]
Subject: detailed Error reporting in Solr

Good morning,
I'm currently running Solr 4.0 final with Tika v1.2 and ManifoldCF v1.2 dev,
and I'm battling Tika XML parse errors again.
Solr reports this error:
org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: XML parse error
which is too vague.
I had to manually run the link against the Tika app to get a much more
detailed error:
Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105;
The entity "nbsp" was referenced, but not declared.
So there are old-school non-breaking space entities (&nbsp;) in the HTML that
Tika can't handle.

for example: <li> Cyber Systems and Technology&nbsp;&rsaquo;
</mission/CST/CST.html>   </li>
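
The failure above can be reproduced outside Tika with any strict XML parser. The following is a minimal sketch in Python (not Tika's or Solr's own code) that assumes a simplified version of the snippet: it shows that an undeclared HTML entity such as &nbsp; is rejected, and that rewriting named entities as numeric character references before parsing is one possible workaround. The helper name replace_html_entities is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(markup: str) -> str:
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references (&#160;)."""
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # not an HTML-only entity: keep as-is
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(_sub, markup)


# Simplified stand-in for the markup quoted in the message.
snippet = "<li>Cyber Systems and Technology&nbsp;&rsaquo;</li>"

# A strict XML parser rejects the undeclared entity.
try:
    ET.fromstring(snippet)
    parse_error = None
except ET.ParseError as exc:
    parse_error = exc  # exc.position is a (line, column) tuple

# After the rewrite the same markup parses cleanly.
fixed = replace_html_entities(snippet)
element = ET.fromstring(fixed)
```

Numeric references need no DTD, which is why the rewritten markup is accepted. Whether such a preprocessing step can be placed in front of Tika in a ManifoldCF/Solr pipeline is a separate question that this sketch does not answer.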

My question is twofold:
1) How do I get Solr to report more detailed errors, and
2) how do I get Tika to accept (or ignore) nbsp?
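
On the first question, the detail was present in the innermost "Caused by" exception and only the outer wrapper's message was reported. As a language-neutral illustration (a Python sketch, not Solr configuration), the code below walks an exception chain to its root cause, much as one would follow Throwable.getCause() in Java. The exception classes and the parse/extract/index functions are hypothetical stand-ins for the Solr, Tika, and SAX layers.

```python
class ExtractionError(Exception):
    """Hypothetical stand-in for the extraction layer's wrapper exception."""


class IndexingError(Exception):
    """Hypothetical stand-in for the indexing layer's wrapper exception."""


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = set()
    while id(exc) not in seen:
        seen.add(id(exc))
        inner = exc.__cause__ or exc.__context__
        if inner is None:
            break
        exc = inner
    return exc


def parse():
    # Innermost failure, carrying the useful position information.
    raise ValueError(
        'lineNumber: 4; columnNumber: 105; '
        'The entity "nbsp" was referenced, but not declared.'
    )


def extract():
    try:
        parse()
    except ValueError as exc:
        raise ExtractionError("XML parse error") from exc


def index():
    try:
        extract()
    except ExtractionError as exc:
        raise IndexingError("extraction failed") from exc


try:
    index()
except IndexingError as exc:
    outer_message = str(exc)         # the vague top-level message
    detail = str(root_cause(exc))    # the innermost, detailed message
```

Logging the root cause alongside the outer message keeps the line and column information that the top-level message drops.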

thanks,

--

View this message in context: http://lucene.472066.n3.nabble.com/detailed-Error-reporting-in-Solr-tp4053821.html
Sent from the Solr - User mailing list archive at Nabble.com.
