It is not a bug. XML parsers are required to reject documents with undefined
character entities.
Try parsing it as HTML or XHTML.
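For example, here's a minimal sketch of the difference (assuming Tika is on the classpath; the snippet, class name, and entity choices are mine):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class EntityDemo {
        static final String SNIPPET =
            "<html><body>Cyber Systems and Technology &rsaquo;&nbsp;</body></html>";

        public static void main(String[] args) throws Exception {
            // A conforming XML parser must reject the undefined entities.
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(SNIPPET.getBytes("UTF-8")));
            } catch (org.xml.sax.SAXParseException e) {
                System.out.println("XML parse rejected it: " + e.getMessage());
            }
            // An HTML parser knows the named entities and is lenient.
            BodyContentHandler handler = new BodyContentHandler();
            new HtmlParser().parse(
                new ByteArrayInputStream(SNIPPET.getBytes("UTF-8")),
                handler, new Metadata(), new ParseContext());
            System.out.println("HTML parse succeeded: " + handler.toString());
        }
    }

The first parse dies with a "referenced, but not declared" SAXParseException; the HTML parse goes through.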
wunder
On Apr 4, 2013, at 11:14 AM, eShard wrote:
> Yes, that's it exactly.
> I crawled a link with these ( ›) in each list item and solr
> couldn't handle it; it threw [...]
[...]ality, in most cases, can simply be ignored.
Yes, by all means ask on the Tika list. Solr is just wrapping the error Tika
reports.
-- Jack Krupansky
-----Original Message-----
From: eShard
Sent: Thursday, April 04, 2013 2:14 PM
To: solr-user@lucene.apache.org
Subject: Re: detailed Error reporting in Solr
I'm trying to understand what the context is here... are you trying to crawl
web pages that have bad HTML? Or, ... what?
-- Jack Krupansky
-----Original Message-----
From: eShard
Sent: Thursday, April 04, 2013 10:23 AM
To: solr-user@lucene.apache.org
Subject: detailed Error reporting in Solr
ok, one possible fix is to add the xml equivalent of nbsp, which is:
<!DOCTYPE html [
<!ENTITY nbsp "&#160;">
]>
but how do I add this into the tika configuration?
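In standalone Java the idea would look something like this (a sketch only; it
pre-processes the input rather than configuring Tika, and the class name is
mine):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class NbspWorkaround {
        // Internal DTD subset defining the entities a strict XML parser rejects.
        static final String DTD =
            "<!DOCTYPE html [ <!ENTITY nbsp \"&#160;\"> <!ENTITY rsaquo \"&#8250;\"> ]>";

        public static void main(String[] args) throws Exception {
            String doc =
                "<html><body>Cyber Systems and Technology &rsaquo;&nbsp;</body></html>";
            // Prepend the declarations so the entities are defined, then parse.
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream((DTD + doc).getBytes("UTF-8")));
            System.out.println("parsed OK");
        }
    }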
[...]couldn't handle.
for example: Cyber Systems and Technology ›
My question is twofold:
1) how do I get solr to report more detailed errors and
2) how do I get tika to accept (or ignore) nbsp?
thanks,