[Interest] Splitting a unicode text into "characters" for RFC 2047 encoding

Jan Kundrát Tue, 25 Dec 2012 08:25:17 -0800

Hi,
RFC 2047 is the standard which defines how to encode non-ASCII data in MIME 
message headers (i.e. it describes how to use non-ASCII text in e-mail 
subjects, human-readable names in the From/To headers, etc.). There's no 
support for that in Qt, so I wrote my own implementation (after trying the one 
from the Qt Messaging Framework, which unfortunately did not work well in the 
real world). It's available at [1].


One step in the process involves splitting the Unicode string into a series of 
"chunks" where the size of each encoded chunk is limited by some constant. 
Each chunk has to be encoded in some character encoding (UTF-8, for example), 
and the resulting byte array has to be encoded again via either Base64 or 
Quoted-Printable. It is the size of the result of all these transformations 
which counts.
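
To make the size constraint concrete, here is a minimal sketch in plain C++ (no Qt) of how the final length of one Base64 encoded-word could be computed. The function names are hypothetical and not taken from [1]. The 75-character limit is the one RFC 2047 sets for a single encoded-word, counting the charset, the encoding letter, the encoded text and the delimiters. Only the "B" (Base64) encoding is covered here.

```cpp
#include <cstddef>
#include <string>

// Length of the padded Base64 text produced from `rawBytes` input bytes.
std::size_t base64Length(std::size_t rawBytes)
{
    return ((rawBytes + 2) / 3) * 4;
}

// Total length of "=?<charset>?B?<base64>?=" for a payload of `rawBytes` bytes.
// (Hypothetical helper, for illustration only.)
std::size_t encodedWordLength(const std::string &charset, std::size_t rawBytes)
{
    // "=?" + charset + "?B?" + Base64 payload + "?="
    return 2 + charset.size() + 3 + base64Length(rawBytes) + 2;
}

// Whether a chunk of `rawBytes` already-encoded bytes fits into one
// encoded-word under the 75-character limit of RFC 2047.
bool fitsInEncodedWord(const std::string &charset, std::size_t rawBytes)
{
    const std::size_t maxEncodedWord = 75;
    return encodedWordLength(charset, rawBytes) <= maxEncodedWord;
}
```

Under these assumptions a UTF-8 chunk can carry at most 45 raw bytes per Base64 encoded-word (72 characters in total); a 46th byte pushes the result to 76 characters.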

When decoding these chunks, the decoder takes each chunk, reverses the Base64 
or Q-P encoding and then uses an appropriate decoder (like the UTF-8 one, or 
the one for Latin1, ...) to convert the array of bytes back into a Unicode 
string. This means that each chunk has to be "self contained" (I've seen buggy 
implementations which are happy to split between the two bytes required for 
the UTF-8 representation of the "á" character, for example).
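
As an illustration of the "self contained" requirement at the byte level, the following sketch finds a cut position in an already UTF-8 encoded buffer that does not fall inside a multi-byte sequence. It is plain C++ with hypothetical names, it assumes the input is valid UTF-8, and it is not how the code in [1] works (that code iterates over QChars before encoding).

```cpp
#include <cstddef>
#include <string>

// True if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isUtf8Continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Returns the largest offset <= maxBytes at which `utf8` can be cut without
// separating the bytes of a single code point. Assumes valid UTF-8 input.
// (Hypothetical helper, for illustration only.)
std::size_t safeUtf8Cut(const std::string &utf8, std::size_t maxBytes)
{
    if (maxBytes >= utf8.size())
        return utf8.size();
    std::size_t cut = maxBytes;
    // Step back while the byte right after the cut continues a code point.
    while (cut > 0 && isUtf8Continuation(static_cast<unsigned char>(utf8[cut])))
        --cut;
    return cut;
}
```

For the bytes of "aáz" (0x61 0xC3 0xA1 0x7A), asking for a cut at offset 2 yields offset 1, so the two bytes of "á" stay together in the next chunk. This only protects code points; it does nothing about combining sequences.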

I've tried to do my best in this area, iterating over the individual QChars 
which together make up the string (see lines 252..268 of [1]). However, I know 
that there are certain "combining characters" in Unicode, i.e. that the "á" 
character I used earlier can actually be written as an ordinary "a" followed 
by a special symbol. I also suspect that certain combinations can only be 
expressed through the combining syntax, which means that QString::normalized() 
won't help me.

I think that when testing whether a string can be split at a particular index, 
my code should check whether the next symbol is a "combining character". 
However, I know nothing about the various non-Latin scripts, and I wasn't able 
to tell which method of QChar I should use in this context. It looks like 
QChar::isHighSurrogate() and QChar::isLowSurrogate() will be part of the 
solution, but they apparently only work for non-BMP characters.
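
A rough sketch of the kind of check described above, in plain C++ on UTF-16 code units so that it stays self-contained. All names are hypothetical. The combining-mark test is an assumption and deliberately incomplete: it only covers a few Unicode blocks of combining diacritics, whereas a real implementation would consult the Unicode character category of each code unit (or proper grapheme-cluster boundaries) instead.

```cpp
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// ASSUMPTION: incomplete stand-in for a real Unicode "mark" category lookup.
// Covers only the dedicated combining-diacritics blocks.
bool isCombiningMark(char16_t u)
{
    return (u >= 0x0300 && u <= 0x036F)   // Combining Diacritical Marks
        || (u >= 0x1AB0 && u <= 0x1AFF)   // ... Extended
        || (u >= 0x1DC0 && u <= 0x1DFF)   // ... Supplement
        || (u >= 0x20D0 && u <= 0x20FF)   // ... for Symbols
        || (u >= 0xFE20 && u <= 0xFE2F);  // Combining Half Marks
}

// Whether `text` may be split so that text[index] starts the next chunk.
bool canSplitBefore(const std::u16string &text, std::size_t index)
{
    if (index == 0 || index >= text.size())
        return true;  // splitting at either end separates nothing
    if (isHighSurrogate(text[index - 1]) && isLowSurrogate(text[index]))
        return false; // would cut a surrogate pair in half
    if (isCombiningMark(text[index]))
        return false; // would detach a mark from its base character
    return true;
}
```

With "a" followed by U+0301 (combining acute accent) and "b", a split before the accent is rejected while a split before "b" is allowed. The surrogate test handles the non-BMP case, and the mark test handles the combining case inside the BMP.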

In short, I know nothing about Unicode details, but want to split the string at 
offsets where it is "safe". How do I tell where to split?

(As a side note, is there any interest in having an RFC 2047 encoder/decoder in 
Qt? I'll be happy to make this part of Qt5 if people are interested and I have 
some free time.)

With kind regards,
Jan

[1] http://quickgit.kde.org/?p=trojita.git&a=blob&f=src/Imap/Encoders.cpp

-- 
Trojitá, a fast Qt IMAP e-mail client -- http://trojita.flaska.net/
_______________________________________________
Interest mailing list
Interest@qt-project.org
http://lists.qt-project.org/mailman/listinfo/interest
