It is not a bug. XML parsers are required to reject documents with undefined
character entities.
Try parsing it as HTML or XHTML.
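To see the failure in isolation, here is a minimal standalone sketch using the JDK's DocumentBuilder; the input string is a made-up stand-in for one of your list items:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class NbspDemo {
    public static void main(String[] args) throws Exception {
        // &nbsp; is not one of the five entities XML predefines
        // (&amp; &lt; &gt; &apos; &quot;), so a conforming XML parser
        // must treat the reference as a fatal error.
        String doc = "<li>Item&nbsp;&rsaquo;</li>";
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(doc.getBytes("UTF-8")));
        } catch (SAXParseException e) {
            // Typically: The entity "nbsp" was referenced, but not declared.
            System.out.println("XML parse failed: " + e.getMessage());
        }
    }
}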
wunder
On Apr 4, 2013, at 11:14 AM, eShard wrote:
> Yes, that's it exactly.
> I crawled a link with these ( ›) in each list item and Solr
> couldn't handle it; it threw the XML parse error and the crawler
> terminated the job.
ality, in most cases, can
simply be ignored.
Yes, by all means ask on the Tika list. Solr is just wrapping the error Tika
reports.
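For what it's worth, here is a rough standalone sketch of driving Tika's own HTML parser directly (the input string is made up, and the class names are from the Tika parsers jar): the HTML tokenizer accepts &nbsp; where the strict XML path cannot.

import java.io.ByteArrayInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaHtmlDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><li>Item&nbsp;&rsaquo;</li></body></html>";
        BodyContentHandler handler = new BodyContentHandler();
        // The HTML parser tolerates the named entities that made the
        // strict XML parse blow up.
        new HtmlParser().parse(
            new ByteArrayInputStream(html.getBytes("UTF-8")),
            handler, new Metadata(), new ParseContext());
        System.out.println(handler.toString());
    }
}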
-- Jack Krupansky
-----Original Message-----
From: eShard
Sent: Thursday, April 04, 2013 2:14 PM
To: solr-user@lucene.apache.org
Subject: Re: detailed Error reporting in Solr
Yes, that's it exactly.
I crawled a link with these ( ›) in each list item and Solr
couldn't handle it; it threw the XML parse error and the crawler terminated
the job.
Is this fixable? Or do I have to submit a bug to the tika folks?
Thanks,
I'm trying to understand what the context is here... are you trying to crawl
web pages that have bad HTML? Or, ... what?
-- Jack Krupansky
-----Original Message-----
From: eShard
Sent: Thursday, April 04, 2013 10:23 AM
To: solr-user@lucene.apache.org
Subject: detailed Error reporting in Solr
Good
ok, one possible fix is to add the XML equivalent of nbsp, which is:
<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>
but how do I add this into the tika configuration?
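To illustrate what I mean: prepending that declaration does satisfy a strict
XML parser. A minimal standalone sketch, where the input and the root element
name are just placeholders:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

public class NbspWorkaround {
    public static void main(String[] args) throws Exception {
        // Placeholder content standing in for the crawled page.
        String page = "<li>Item&nbsp;&rsaquo;</li>";
        // Declare the offending entities up front; the DOCTYPE name must
        // match the root element ("li" in this toy example).
        String doctype = "<!DOCTYPE li [<!ENTITY nbsp \"&#160;\">"
                       + "<!ENTITY rsaquo \"&#8250;\">]>";
        DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                (doctype + page).getBytes("UTF-8")));
        System.out.println("parsed OK once the entities are declared");
    }
}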
--
View this message in context:
http://lucene.472066.n3.nabble.com/detailed-Error-reporting-in-Solr-tp4053821p4053823.html
Sent from the Solr - User mailing list archive at Nabble.com.