Johannes Willnecker created XERCESC-2158:
--------------------------------------------
Summary: XMLUTF8Transcoder: One multibyte UTF8 character is
swallowed from the srcData when the resulting surrogate pair does not fit in
toFill at the end
Key: XERCESC-2158
URL: https://issues.apache.org/jira/browse/XERCESC-2158
Project: Xerces-C++
Issue Type: Bug
Components: Utilities
Affects Versions: 3.2.2, 3.1.4
Environment: OS independent: Linux (RedHat 7.5)/Windows 10
Compiler independent
Reporter: Johannes Willnecker
Attachments: UTF8.xml, xerces.patch
*Bug found in Xerces-C++ Version 3.1.4* (based on code review, newer
versions are also affected)
*How to reproduce:* Run SAX2Print on the attached UTF8.xml file: "SAX2Print
UTF8.xml".
One Chinese character is missing from the name attribute of the second-to-last
Instance element.
*Fix:* The fix for this bug is included in the xerces.patch file.
In XMLUTF8Transcoder.cpp a check for this case already exists, but it relies
on the assumption that the bytes read are only updated at the end of the loop,
which is wrong.
The bytes read (bytesEaten) are calculated from srcPtr, which has
already been advanced past the multibyte character when the check is made.
Therefore srcPtr needs to be moved back to the start of that character when
the surrogate pair does not fit into the toFill buffer.
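To illustrate the idea, here is a minimal standalone sketch of the rewind. It is not the Xerces-C++ code and not the content of xerces.patch: decodeChunk, decodeAll and ChunkResult are hypothetical names, input is assumed to be well-formed UTF-8, and all validation is omitted. Only the names srcPtr and bytesEaten follow the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type: bytes consumed from the source and
// UTF-16 code units written to the output buffer.
struct ChunkResult {
    std::size_t bytesEaten;
    std::size_t unitsWritten;
};

// Decodes well-formed UTF-8 into at most maxUnits UTF-16 code units.
ChunkResult decodeChunk(const unsigned char* src, std::size_t srcLen,
                        char16_t* out, std::size_t maxUnits)
{
    const unsigned char* srcPtr = src;
    const unsigned char* const srcEnd = src + srcLen;
    char16_t* outPtr = out;
    char16_t* const outEnd = out + maxUnits;

    while (srcPtr < srcEnd && outPtr < outEnd) {
        const unsigned char first = *srcPtr;
        const std::size_t len =
            first < 0x80 ? 1 : first < 0xE0 ? 2 : first < 0xF0 ? 3 : 4;
        if (static_cast<std::size_t>(srcEnd - srcPtr) < len)
            break;  // incomplete trailing sequence, leave it unread

        std::uint32_t cp = len == 1 ? first
                         : len == 2 ? (first & 0x1Fu)
                         : len == 3 ? (first & 0x0Fu)
                                    : (first & 0x07u);
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (srcPtr[i] & 0x3Fu);

        // The source pointer is advanced before the output space check.
        srcPtr += len;

        if (cp >= 0x10000) {
            if (outPtr + 1 >= outEnd) {
                // Only one slot left: the pair does not fit. Rewind so the
                // character is not counted as eaten and is decoded again
                // on the next call.
                srcPtr -= len;
                break;
            }
            cp -= 0x10000;
            *outPtr++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *outPtr++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            *outPtr++ = static_cast<char16_t>(cp);
        }
    }
    return { static_cast<std::size_t>(srcPtr - src),
             static_cast<std::size_t>(outPtr - out) };
}

// Hypothetical caller: transcodes the whole input through a small buffer.
std::u16string decodeAll(const std::string& utf8, std::size_t chunkUnits)
{
    assert(chunkUnits >= 2);  // a surrogate pair must fit in an empty buffer
    std::vector<char16_t> buf(chunkUnits);
    std::u16string result;
    std::size_t offset = 0;
    while (offset < utf8.size()) {
        const ChunkResult r = decodeChunk(
            reinterpret_cast<const unsigned char*>(utf8.data()) + offset,
            utf8.size() - offset, buf.data(), buf.size());
        if (r.bytesEaten == 0)
            break;  // nothing consumed: truncated input
        result.append(buf.data(), r.unitsWritten);
        offset += r.bytesEaten;
    }
    return result;
}
```

Without the "srcPtr -= len" line, the first call in a case like decodeAll("a\xF0\xA0\x80\x80" "b", 2) would report five bytes eaten while writing only one code unit, and the four-byte character would be lost, which matches the symptom described in this report.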
*Contributor related:*
Author Name of the code being contributed: Johannes Willnecker
Employer: Siemens AG
I have the right to grant the copyright licenses for the contribution.
My employer has rights to the code that I have written. My employer gave me
permission to contribute this code on its behalf.
I am not aware of any third-party license or other restrictions.