Of course, I can't print the system bell and stuff like that in XML. I'll
improve the method to get rid of non-printable control characters as well.
On Monday 27 June 2011 18:16:08 Mike Sokolov wrote:
> Markus - if you want to make sure not to offend XML parsers, you should
> strip all characters
Markus - if you want to make sure not to offend XML parsers, you should
strip all characters not in this list:
http://en.wikipedia.org/wiki/XML#Valid_characters
You'll see that article talks about XML 1.1, which accepts a wider range
of characters than XML 1.0, and I believe the Woodstox parse
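As a rough illustration of the advice above, here is a minimal Java sketch that keeps only code points allowed by the XML 1.0 `Char` production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names are hypothetical and not part of Solr, SolrJ, Nutch or Woodstox; XML 1.1 accepts a wider range and would need a different check.

```java
// Hypothetical helper, not part of any library mentioned in this thread.
final class Xml10CharFilter {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every non-XML-1.0 code point removed.
    // Iterates by code point so that surrogate pairs are handled as one unit;
    // unpaired surrogates fall in 0xD800..0xDFFF and are dropped.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

This also removes the system bell (0x07) and the other C0 control characters apart from tab, line feed and carriage return.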
Of course it doesn't work like this: use AND instead of OR!
On Monday 27 June 2011 17:50:01 Markus Jelsma wrote:
> Hi all, thanks for your comments. I seem to have fixed it by now by simply
> stripping away all non-character codepoints [1] by iterating over the
> individual chars and checking them
Hi all, thanks for your comments. I seem to have fixed it by now by simply
stripping away all non-character codepoints [1] by iterating over the
individual chars and checking them against:
if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >=
0xfdef)) { pass; }
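A minimal Java sketch of a filter along these lines, using AND-style logic as the follow-up in this thread points out. It assumes the Unicode definition of noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF and so on). The class and method names are made up for illustration; this is not the code that was actually used in Nutch.

```java
// Hypothetical helper for illustration only.
final class NonCharacterFilter {

    // A code point is a noncharacter if its low 16 bits are 0xFFFE or 0xFFFF
    // (in any plane), or if it lies in the range U+FDD0..U+FDEF.
    static boolean isNonCharacter(int cp) {
        return (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    // Returns a copy of the input without noncharacter code points.
    // Works on code points rather than chars so that supplementary-plane
    // noncharacters such as U+1FFFF are caught as well.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNonCharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Note that this only covers noncharacters; control characters such as the bell (0x07) pass through and need a separate check.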
I don't think this is a BOM - that would be 0xfeff. Anyway the problem
we usually see w/processing XML with BOMs is in UTF8 (which really
doesn't need a BOM since it's a byte stream anyway), in which if you
transform the stream (bytes) into a reader (chars) before the xml parser
can see it, th
On Monday 27 June 2011 16:33:16 lee carroll wrote:
> Hi Markus
>
> I've seen similar issue before (but not with solr) when processing files as
> xml. In our case the problem was due to processing a utf16 file with a
> byte order mark. This presents itself as
> 0xffff to the xml parser which is n
Hi Markus
I've seen similar issue before (but not with solr) when processing files as xml.
In our case the problem was due to processing a utf16 file with a byte
order mark. This presents itself as
0xffff to the xml parser which is not used by utf8 (the bom unicode
would be represented as efbfbf i
Hi,
It's the same error I mentioned here
http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html.
Also, if you use Solr 1.4.1 there is no problem like that.
Hello,
On 27.06.2011 at 12:40, Markus Jelsma wrote:
> Hi,
>
> I came across the indexing error below. It happened in a huge batch update
> from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace
> the error back to a specific document. So I try my luck here: anyone seen
OK - re-reading your message it seems maybe that is what you were trying
to say too, Robert. FWIW I agree with you that XML is rigid, sometimes
for purely arbitrary reasons. But nobody has really helped Markus here
- unfortunately, there is no easy way out of this mess. What I do to
handle i
Actually - you are both wrong!
It is true that 0xffff is a valid UTF8 character, and not a valid UTF8
byte sequence.
But the parser is reporting (or trying to) that 0xffff is an invalid XML
character.
And Robert - if the wording offends you, you might want to send a note
to Tatu (http://ji
On 27.06.2011 14:48, Robert Muir wrote:
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling
wrote:
correct!!!
but what i said, is totally different than what you said.
you are still wrong.
http://www.unicode.org/faq//utf_bom.html
see Q: What is a UTF?
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling
wrote:
>
> correct!!!
>
but what i said, is totally different than what you said.
you are still wrong.
On 27.06.2011 14:35, Robert Muir wrote:
On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling
wrote:
Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is right.
But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is
illegal
and that's what the java.io.CharConversionException
On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling
wrote:
> Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is right.
>
> But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is
> illegal
> and that's what the java.io.CharConversionException is complaining about.
> "Invalid UTF-
On 27.06.2011 14:02, Robert Muir wrote:
On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling
wrote:
So there is no UTF-8 0xffff. It is illegal.
you are wrong: it is legally encoded as a three byte sequence: ef bf bf
Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is right.
But I wa
On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling
wrote:
>
> So there is no UTF-8 0xffff. It is illegal.
>
you are wrong: it is legally encoded as a three byte sequence: ef bf bf
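The two positions in this exchange can be checked directly. The sketch below is only an illustration with hypothetical class and method names: it encodes a code point with Java's built-in UTF-8 charset and tests whether a raw byte sequence decodes cleanly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Check {

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes a single code point as UTF-8 and returns the bytes in hex.
    static String encode(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // True if the bytes form a well-formed UTF-8 sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

`Utf8Check.encode(0xFFFF)` returns "ef bf bf", while `Utf8Check.isWellFormedUtf8(new byte[] {(byte) 0xff, (byte) 0xff})` returns false. So the code point U+FFFF has a well-formed three-byte encoding, and the raw bytes "ff ff" are not valid UTF-8.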
I suggest avoid illegal UTF-8 characters by pre-filtering your
contentstream before loading.
Unicode    UTF-8 (hex)
U+07FF     df bf
U+0800     e0 a0 80
So there is no UTF-8 0xffff. It is illegal.
Regards
On 27.06.2011 12:40, Markus Jelsma wrote:
Hi,
I came across the indexing error below. It