Hi!
I have an issue parsing XML containing Unicode strings with surrogate
characters (Xerces 2.11.0). The following exception is thrown:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 18;
Character reference "&#xD840" is an invalid XML character.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
Simple code to reproduce the issue:
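import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// U+2002A as big-endian UTF-16: high surrogate D840, low surrogate DC2A
byte[] enc1 = new byte[] {(byte) 0xd8, 0x40, (byte) 0xdc, 0x2a};
String result = new String(enc1, "UTF-16");
System.out.println(result); // Outputs 𠀪 correctly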
String saml = "<name>lz1&#xD840;&#xDC2A;.cct.cm</name>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(saml))); // Throws exception
Am I parsing the XML correctly?
The XML I parse contains the following string:
lz1𠀪.cct.cm
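
For comparison, here is a minimal sketch of how the two spellings of that character behave. It assumes the failing document writes the character as two numeric references, one per surrogate half, and it uses whatever JAXP parser is on the classpath. XML 1.0 requires every character reference to name a legal XML character, and the surrogate range D800-DFFF is excluded, so a conforming parser should reject that form. A single reference to the code point U+2002A, or the literal character itself, should be accepted. The class name and the helpers parseText and isRejected are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class SurrogateReferenceDemo {

    // Hypothetical helper: parses a small document and returns the text
    // content of its root element.
    public static String parseText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }

    // Hypothetical helper: true if the parser reports a fatal parse error.
    public static boolean isRejected(String xml) throws Exception {
        try {
            parseText(xml);
            return false;
        } catch (SAXParseException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        int codePoint = 0x2002A;
        String ch = new String(Character.toChars(codePoint));

        // A supplementary character occupies two UTF-16 code units in a Java String.
        System.out.printf("U+%X -> %04X %04X%n",
                codePoint, (int) ch.charAt(0), (int) ch.charAt(1)); // U+2002A -> D840 DC2A

        // One reference per surrogate half: not a legal XML character reference.
        String perSurrogate = "<name>lz1&#xD840;&#xDC2A;.cct.cm</name>";
        System.out.println("per-surrogate references rejected: " + isRejected(perSurrogate));

        // One reference for the whole code point.
        String perCodePoint = "<name>lz1&#x2002A;.cct.cm</name>";
        System.out.println("code point reference parsed: "
                + parseText(perCodePoint).equals("lz1" + ch + ".cct.cm"));

        // The literal character, passed through as a surrogate pair in the String.
        String literal = "<name>lz1" + ch + ".cct.cm</name>";
        System.out.println("literal character parsed: "
                + parseText(literal).equals("lz1" + ch + ".cct.cm"));
    }
}
```

If this holds for Xerces 2.11.0 as well, the problem would be in whatever produces the XML rather than in the parsing code: the serializer would need to emit one reference per code point instead of one per UTF-16 code unit.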