OK - re-reading your message it seems maybe that is what you were trying
to say too, Robert. FWIW I agree with you that XML is rigid, sometimes
for purely arbitrary reasons. But nobody has really helped Markus here
- unfortunately, there is no easy way out of this mess. What I do to
handle issues like this is to wrap the stream I'm handing to the parser
in some kind of cleanup stream that handles a few yucky issues. You
could, e.g., just strip out invalid XML characters. Maybe Nutch should be
doing this, or at least handling the error better?
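
Roughly what I mean by a cleanup stream, as a minimal sketch only. The
class name XmlCleanupReader and the helper isAllowed are made up for
illustration; this is not something that exists in Nutch or Solr, and
the character ranges come from the Char production in the XML 1.0 spec.
It works on decoded characters (a Reader), so it assumes the bytes have
already been decoded with the right charset.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: drops characters that XML 1.0 does not allow. */
final class XmlCleanupReader extends FilterReader {

    XmlCleanupReader(Reader in) {
        super(in);
    }

    // XML 1.0 Char production, checked per UTF-16 code unit.
    // Surrogates are passed through so supplementary characters survive;
    // an unpaired surrogate would still be rejected by the parser.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || Character.isSurrogate(c)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n;
        while ((n = super.read(buf, off, len)) != -1) {
            int kept = 0;
            // Compact the allowed characters to the front of the chunk.
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (isAllowed(c)) {
                    buf[off + kept++] = c;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was filtered out: read again instead of returning 0.
        }
        return -1;
    }
}
```

You would then hand something like
new XmlCleanupReader(new InputStreamReader(in, StandardCharsets.UTF_8))
to the parser instead of the raw stream. Silently dropping characters
changes the data, so logging what was removed is probably a good idea.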
-Mike
On 06/27/2011 09:19 AM, Mike Sokolov wrote:
Actually - you are both wrong!
It is true that U+FFFF is a valid Unicode code point (EF BF BF in UTF-8),
and that the bytes 0xFF 0xFF are not a valid UTF-8 byte sequence.
But the parser is reporting (or trying to) that 0xffff is an invalid
XML character.
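
To make the distinction concrete, here is a small Java sketch (the
class Utf8Check and its method names are invented for illustration).
It shows how the code point U+FFFF is encoded in UTF-8, and checks
whether a raw byte sequence decodes as strict UTF-8. Neither check says
anything about XML validity, which is a separate rule applied by the
parser.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    // Hex dump of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the bytes decode as UTF-8 with malformed input reported as an error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this, Utf8Check.utf8Hex("\uFFFF") gives the three-byte encoding,
while Utf8Check.isValidUtf8(new byte[] {(byte) 0xFF, (byte) 0xFF})
returns false because 0xFF can never appear in UTF-8.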
And Robert - if the wording offends you, you might want to send a note
to Tatu (http://jira.codehaus.org/) suggesting that he alter the
wording of the error message :)
-Mike
On 06/27/2011 09:01 AM, Bernd Fehling wrote:
On 27.06.2011 14:48, Robert Muir wrote:
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
correct!!!
But what I said is totally different from what you said.
You are still wrong.
http://www.unicode.org/faq//utf_bom.html
see Q: What is a UTF?