Re: [Interest] Splitting a unicode text into "characters" for RFC 2047 encoding

Thiago Macieira Tue, 25 Dec 2012 13:12:10 -0800

On Tuesday, 25 December 2012 17:25:07, Jan Kundrát wrote:
> I think that when testing whether a string can be split at a particular
> index, my code should check whether the next symbol is a "combining
> character". However, I know nothing about various non-Latin scripts, and I
> wasn't able to tell which QChar method I should use in this context. It
> looks like QChar::isHighSurrogate() and QChar::isLowSurrogate() will be
> part of the solution, but they apparently only work for non-BMP characters.
>
> In short, I know nothing about Unicode details, but want to split the string
> at offsets where it is "safe". How do I tell where to split?


Are you sure you need to keep the combining characters together in the same
RFC 2047 chunk?

If you do, you can use QChar::category [1] and check for the mark categories:
QChar::Mark_SpacingCombining, and most likely QChar::Mark_NonSpacing and
QChar::Mark_Enclosing as well. If you run into a surrogate, get both halves of
the pair, calculate the UCS-4 value (see QChar::surrogateToUcs4 [2]) and check
the category of that value instead.
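
To make that concrete, here is a rough sketch of the check in plain C++ on a
std::u16string, so that it compiles without Qt. All four helper names are
hypothetical. The first three mirror QChar::isHighSurrogate,
QChar::isLowSurrogate and QChar::surrogateToUcs4. isCombiningMark is only a
placeholder covering two small ranges; in Qt code it would presumably be
replaced by comparing QChar::category(ucs4) against the mark categories such
as QChar::Mark_SpacingCombining. Treat it as an illustration of the idea, not
as tested production code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-ins for QChar::isHighSurrogate() / QChar::isLowSurrogate().
static bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Same result as QChar::surrogateToUcs4(high, low).
static char32_t surrogateToUcs4(char16_t high, char16_t low) {
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// HYPOTHETICAL placeholder for the category lookup. It only knows two
// ranges and is far from complete; with Qt, use QChar::category(ucs4).
static bool isCombiningMark(char32_t ucs4) {
    return (ucs4 >= 0x0300 && ucs4 <= 0x036F)      // Combining Diacritical Marks
        || (ucs4 >= 0x1D165 && ucs4 <= 0x1D169);   // musical combining marks (non-BMP)
}

// True if the string may be cut immediately before text[index].
static bool canSplitAt(const std::u16string &text, std::size_t index) {
    assert(index <= text.size());
    if (index == 0 || index == text.size())
        return true;
    // Never cut between the two halves of a surrogate pair.
    if (isLowSurrogate(text[index]) && isHighSurrogate(text[index - 1]))
        return false;
    // Work out the full code point that would start the next chunk.
    char32_t next = text[index];
    if (isHighSurrogate(text[index]) && index + 1 < text.size()
            && isLowSurrogate(text[index + 1]))
        next = surrogateToUcs4(text[index], text[index + 1]);
    // A combining mark has to stay with the character before it.
    return !isCombiningMark(next);
}
```

For the string u"e\u0301x" (an "e" followed by a combining acute accent and
an "x"), canSplitAt refuses index 1 and allows index 2, so the accent stays in
the same chunk as its base letter.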

My currently experimental QStringIterator class [3] would return UCS-4 values
when iterating over a string.

[1] http://qt-project.org/doc/qt-5.0/qtcore/qchar.html#category-2
[2] http://qt-project.org/doc/qt-5.0/qtcore/qchar.html#surrogateToUcs4
[3] https://codereview.qt-project.org/669
--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center


_______________________________________________
Interest mailing list
Interest@qt-project.org
http://lists.qt-project.org/mailman/listinfo/interest
