[
https://issues.apache.org/jira/browse/XERCESC-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334409#comment-16334409
]
Andreas Krantz commented on XERCESC-2130:
-----------------------------------------
https://issues.apache.org/jira/browse/XERCESC-1854
describes how Xerces could be used to write files that can no longer be read.
[http://svn.apache.org/viewvc/xerces/c/trunk/src/xercesc/dom/impl/DOMLSSerializerImpl.cpp?r1=768978&r2=1226891]
introduced the new method
DOMLSSerializerImpl::ensureValidString,
which does not validate the characters #x10000-#x10FFFF correctly.
Those valid characters cannot be represented by a single 16-bit XMLCh; two
16-bit XMLChs are needed.
The range D800 - DFFF is reserved for encoding those characters.
[https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF]
Each character is stored as a leading (high) 16-bit XMLCh followed by a
trailing (low) 16-bit XMLCh.
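The mapping between a character in #x10000-#x10FFFF and its two 16-bit units can be sketched as follows. This is only an illustration of the UTF-16 arithmetic, not code from Xerces: {{Utf16Unit}} is a stand-in for XMLCh (assumed to be 16 bits wide), and the names {{SurrogatePair}}, {{toSurrogates}}, {{fromSurrogates}} and {{surrogateDemo}} are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for Xerces' XMLCh, assumed here to be a 16-bit UTF-16 code unit.
using Utf16Unit = char16_t;

struct SurrogatePair {
    Utf16Unit high;  // leading unit, 0xD800-0xDBFF
    Utf16Unit low;   // trailing unit, 0xDC00-0xDFFF
};

// Split a code point in #x10000-#x10FFFF into its two UTF-16 code units.
// The caller is expected to pass a value inside that range.
inline SurrogatePair toSurrogates(std::uint32_t codePoint) {
    const std::uint32_t offset = codePoint - 0x10000;  // 20 significant bits
    SurrogatePair pair;
    pair.high = static_cast<Utf16Unit>(0xD800 + (offset >> 10));    // top 10 bits
    pair.low  = static_cast<Utf16Unit>(0xDC00 + (offset & 0x3FF));  // low 10 bits
    return pair;
}

// Combine a high/low surrogate pair back into the code point it encodes.
inline std::uint32_t fromSurrogates(Utf16Unit high, Utf16Unit low) {
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}

// Round trip for U+1F600, an emoticon outside the 16-bit range.
inline void surrogateDemo() {
    const SurrogatePair pair = toSurrogates(0x1F600);
    assert(pair.high == 0xD83D && pair.low == 0xDE00);
    assert(fromSurrogates(pair.high, pair.low) == 0x1F600);
}
```

Both units of such a pair fall into D800 - DFFF, so a check that looks at each XMLCh in isolation will reject every character above #xFFFF.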
Checking
[http://svn.apache.org/viewvc/xerces/c/trunk/src/xercesc/util/XMLChar.cpp]
will show you that the
{{isXMLChar}}
method that is already used is aware of this fact and can validate characters
that span two XMLChs.
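A pair-aware check could look roughly like the sketch below. It does not call into Xerces and is not the attached patch: it uses {{std::u16string}} in place of an XMLCh buffer, follows the XML 1.0 Char ranges, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Part of the XML 1.0 Char production that fits into one 16-bit unit.
inline bool isSingleUnitXmlChar(char16_t c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD);
}

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Returns true if every character is a valid XML character, where a high
// surrogate immediately followed by a low surrogate counts as one character
// in #x10000-#x10FFFF. Unpaired surrogates are rejected.
inline bool isSerializableXml(const std::u16string& text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        if (isSingleUnitXmlChar(c)) {
            continue;
        }
        if (isHighSurrogate(c) && i + 1 < text.size() &&
            isLowSurrogate(text[i + 1])) {
            ++i;  // skip the trailing unit of the pair
            continue;
        }
        return false;
    }
    return true;
}

inline void validationDemo() {
    assert(isSerializableXml(u"smile: \U0001F600"));                 // paired surrogates
    assert(!isSerializableXml(std::u16string(1, char16_t(0xD83D)))); // lone high surrogate
}
```

With a check of this shape, text containing emoticons would pass, while control characters and unpaired surrogates would still be refused.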
*An easy fix would be:*
* *reopen XERCESC-1854*
* *clear the content of ensureValidString so that it does nothing*
* *make sure this is redistributed, so that writing #x10000-#x10FFFF is
possible again*
I have used Xerces for over a decade, and writing invalid files was always
possible. So it does no harm to remove this broken feature (introduced in 3.2.0) again.
P.S.: Signing a CLA does not seem that easy. I am checking.
> UTF16 Surrogate values 0xD800-0xDFFF can no longer be written with xerces
> 3.2.0 (e.g. emoticons)
> ------------------------------------------------------------------------------------------------
>
> Key: XERCESC-2130
> URL: https://issues.apache.org/jira/browse/XERCESC-2130
> Project: Xerces-C++
> Issue Type: Bug
> Components: DOM
> Affects Versions: 3.2.0
> Reporter: Andreas Krantz
> Priority: Critical
> Attachments: fix.patch, patch_.cpp, reproduce.cpp
>
>
> The solution for XERCESC-1854 introduced the method
> {{DOMLSSerializerImpl::ensureValidString}},
> which has an error in its validation.
> The method validates XMLCh values, which represent UTF-16 code units.
> [Valid Characters|https://www.w3.org/TR/REC-xml/#NT-Char] #x9 | #xA | #xD |
> [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> are the valid UTF32 characters.
> The UTF16 surrogate range from xD800 - xDFFF is used to represent
> [#x10000-#x10FFFF] and should not be handled as invalid.
> *The reader treats this correctly and does not complain, which leads to
> asymmetric behavior*
> Reading DOM => OK
> Save back DOM => Exception
> I tried to attach an example to show the behavior.
> The methods that are used, such as
> {{bool XMLChar1_1::isXMLChar(const XMLCh toCheck, const XMLCh toCheck2)}},
> already have a second, optional parameter for checking surrogate values.