I'm trying to understand what the context is here... are you trying to crawl web
pages that have bad HTML? Or... what?
-- Jack Krupansky
-----Original Message-----
From: eShard
Sent: Thursday, April 04, 2013 10:23 AM
To: [email protected]
Subject: detailed Error reporting in Solr
Good morning,
I'm currently running Solr 4.0 final with Tika v1.2 and ManifoldCF v1.2 dev,
and I'm battling Tika XML parse errors again.
Solr reports this error, which is too vague:
org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: XML parse error
I had to manually run the link against the Tika app, and I got a much more
detailed error:
Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105;
The entity "nbsp" was referenced, but not declared.
So there are old-school non-breaking space entities in the HTML that Tika can't handle.
for example: <li> Cyber Systems and Technology ›
</mission/CST/CST.html> </li>
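
A minimal sketch of the failure outside Solr, assuming the page is being handed to a strict XML parser rather than a lenient HTML parser. It uses Python's standard xml.etree.ElementTree as a stand-in for the Java XML parser that Tika uses, so the exact error text differs from the SAXParseException above. FRAGMENT, parse_error and replace_nbsp are hypothetical names made up for illustration, and the fragment is only modeled on the line above, not copied from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the failing line; not the actual page source.
FRAGMENT = (
    '<li><a href="/mission/CST/CST.html">'
    "Cyber Systems and Technology&nbsp;</a></li>"
)


def parse_error(markup):
    """Return the XML parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


def replace_nbsp(markup):
    """Swap the HTML-only named entity for its numeric form, which XML accepts."""
    return markup.replace("&nbsp;", "&#160;")


if __name__ == "__main__":
    # Plain XML predefines only amp, lt, gt, apos and quot, so nbsp is rejected.
    print("original:", parse_error(FRAGMENT))
    # The numeric character reference needs no declaration.
    print("patched: ", parse_error(replace_nbsp(FRAGMENT)))
```

If the same thing is happening here, one possible workaround is to rewrite the entity before the content reaches Tika. Whether that is practical in a ManifoldCF pipeline is an open question, and this sketch does not address it.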
My question is twofold:
1) How do I get Solr to report more detailed errors, and
2) How do I get Tika to accept (or ignore) nbsp?
Thanks,
--
View this message in context:
http://lucene.472066.n3.nabble.com/detailed-Error-reporting-in-Solr-tp4053821.html