slightly to call URLDecoder on text.
Thanks and best regards, Lisheng
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 16, 2013 2:41 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr exception when parsing XML
In Apache Nutch we strip
solr-user@lucene.apache.org
> Subject: RE: Solr exception when parsing XML
>
> Hi Alex,
>
> Thanks very much for the help! I switched to the following (I am using PHP on the client side):
>
> createTextNode(urlencode($value))
>
> so the control-character problem is avoided, but I noticed that some
@lucene.apache.org
Subject: Re: Solr exception when parsing XML
Interesting point. Looks like CDATA is more limiting than I thought:
http://en.wikipedia.org/wiki/CDATA#Issues_with_encoding . Basically, the
recommendation is to avoid CDATA and automatically encode characters such
as yours, as well as less
Looking at this a second time, maybe we have an XY problem. Why was
that symbol in there in the first place?
Was it a field separator instead of using multiple fields? Was it a
character in an encoding other than UTF-8?
My guess is that the character will not make sense to Solr during either
On Tue, Jan 15, 2013 at 3:55 PM, Alexandre Rafalovitch
wrote:
> Basically, the
> recommendation is to avoid CDATA and automatically encode characters such
> as yours, as well as less-than/greater-than and ampersand.
Unfortunately that doesn't even work. Just as a raw control character
like a 0 byte is invali
Forgot the link: http://en.wikipedia.org/wiki/Valid_characters_in_XML
André
On 01/16/2013 02:24 PM, Andre Bois-Crettez wrote:
It is worth noting that some characters are completely forbidden in XML,
such as "chr(0)".
When dealing with external text input, some cleanup might be necessary
to avoid breaking indexing.
For example, you could replace each forbidden XML character with " ".
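A minimal sketch of that cleanup, written in Python purely for illustration (the client in this thread is PHP). The function name strip_invalid_xml_chars is made up here; the character ranges are the ones XML 1.0 allows (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF), and everything else is replaced:

```python
import re

# Anything outside the XML 1.0 "Char" production is forbidden:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace every character that XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# chr(0) is replaced; tab and newline are legal and left alone.
cleaned = strip_invalid_xml_chars("foo\x00bar\tbaz\n")
```

Running this on the field value before building the XML document means the forbidden character never reaches the parser; pass replacement="" to drop such characters instead of substituting a space.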
André
On 01/15/2013 09:55 PM, Alexandre
Interesting point. Looks like CDATA is more limiting than I thought:
http://en.wikipedia.org/wiki/CDATA#Issues_with_encoding . Basically, the
recommendation is to avoid CDATA and automatically encode characters such
as yours, as well as less-than/greater-than and ampersand.
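For the three markup characters, a rough sketch in Python (for illustration only; most XML libraries do this for you when you create a text node). The field name "title" and the sample value are made up:

```python
from xml.sax.saxutils import escape

# escape() rewrites &, < and > as entity references so the value can sit
# inside an element without being mistaken for markup.
value = "AT&T <rocks> 1 < 2"
field_xml = '<field name="title">' + escape(value) + "</field>"

# Note: escape() only touches those three characters; a control character
# such as chr(0) passes through unchanged.
untouched = escape("a\x00b")
```

So entity escaping takes care of markup characters only, and does nothing for characters that XML does not allow at all.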
Regards,
Alex.