On 23.01.19 23:15, André Pönitz wrote:
On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
23.01.2019, 16:55, "Edward Welbourne" <edward.welbou...@qt.io>:
All of this discussion ignores a major elephant: QString's indexing is
by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
for a couple of decades now.
We *should* have a string type (I don't care what you call it) that acts
on strings indexed by Unicode characters, not in terms of a
representation. Whether that string type internally uses UTF-16 or
UTF-8 should be invisible to its user. Ideally it would be capable of
carrying its data internally in either form (so as to avoid needless
conversion when both producer and consumer use the same form) and of
converting between the two (e.g. so as to append efficiently) as needed.
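To make the code unit versus code point distinction concrete, here is a minimal sketch. It uses std::u16string as a stand-in for QString's UTF-16 storage, and countCodePoints is a hypothetical helper, not an existing Qt function. It assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string.
// Assumption: the input is well formed, i.e. every high surrogate is
// followed by a low surrogate.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t unit : s) {
        // Low surrogates (0xDC00..0xDFFF) are the second half of a pair,
        // so they do not start a new code point.
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++n;
    }
    return n;
}
```

For a string such as u"a\U0001F600", size() reports three code units while the helper reports two code points, which is the mismatch being described.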
I think this is excessive. Most common operations with strings in application
code are:
* Pass the string around or compare as an opaque token
* Draw the string on screen, e.g. with QPainter (while technically it
falls in the previous category, I think it's important enough to
deserve a separate item)
* Find substring or pattern (regex) inside the string
* Split the string by character, pattern, or index boundaries found by means
of previous item
I think the only common cases where dealing with Unicode grapheme clusters
is required are
* Handling of text cursor movement
* Implementation of text shaping, i.e. what Harfbuzz is doing
I think having a special iterator would be quite enough for the cursor case.
Such an iterator could abstract away the underlying encoding, instead of
forcing everyone to convert to UTF-16 first.
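A rough sketch of what "abstracting away the encoding" could look like: two overloads of a hypothetical codePoints() function, one for UTF-8 and one for UTF-16, that yield the same sequence of code points. This only decodes code points; cursor movement would still need grapheme cluster segmentation layered on top, which is not shown. Both overloads assume well-formed input and do no validation.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Decodes UTF-8 into code points. Assumption: input is well formed.
std::vector<char32_t> codePoints(std::string_view utf8)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // The lead byte determines the length of the sequence.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Decodes UTF-16 into code points. Assumption: input is well formed.
std::vector<char32_t> codePoints(std::u16string_view utf16)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        char32_t cp = utf16[i];
        const bool isHighSurrogate = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHighSurrogate && i + 1 < utf16.size()) {
            // Combine the surrogate pair into one supplementary code point.
            const char32_t low = utf16[i + 1];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;
        }
        out.push_back(cp);
    }
    return out;
}
```

Code written against the resulting sequence does not need to know which encoding the caller started from.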
All of that is scarily close to my opinion on the topic.
Same here. I think Konstantin is spot on.
Another example of good string design, I think, is Rust's String. Their
string is encoded in valid UTF-8, indexed by bytes, and splitting the string in
the middle of a code point is a programmer error.
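For readers who have not used Rust: slicing a str at a byte index that is not a character boundary panics, and str::is_char_boundary lets you check beforehand. A C++ approximation of the same idea might look like the sketch below. isCharBoundary and checkedPrefix are hypothetical names, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// True if `index` does not fall inside a multi-byte UTF-8 sequence.
// Assumption: `utf8` holds valid UTF-8.
bool isCharBoundary(std::string_view utf8, std::size_t index)
{
    if (index == utf8.size())
        return true;
    if (index > utf8.size())
        return false;
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (static_cast<unsigned char>(utf8[index]) & 0xC0) != 0x80;
}

// Returns the first `byteCount` bytes, but refuses to cut a code point
// in half: a bad index is reported as an error instead of producing
// invalid UTF-8.
std::string_view checkedPrefix(std::string_view utf8, std::size_t byteCount)
{
    if (!isCharBoundary(utf8, byteCount))
        throw std::invalid_argument("byte index is not on a code point boundary");
    return utf8.substr(0, byteCount);
}
```

The indexing stays cheap (plain byte offsets), and the cost of the check is a single byte comparison.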
As already mentioned, UTF-16 is quite a bad choice; it is only there because
of legacy.
The argument that developers wrongly using indexes causes more problems with
UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
means that developers will see and fix their bugs quickly.
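To put numbers on "a lot more characters", here is a small sketch with two hypothetical helpers that return how many code units one code point occupies in each encoding. They assume a valid Unicode scalar value.

```cpp
#include <cstddef>

// Number of UTF-8 bytes needed for one code point.
// Assumption: `cp` is a valid Unicode scalar value.
constexpr std::size_t utf8Length(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of UTF-16 code units needed for one code point.
constexpr std::size_t utf16Length(char32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

With UTF-8, any non-ASCII character such as U+00E9 already takes more than one unit, so naive indexing breaks on the first accented letter in a test. With UTF-16, the same code keeps working until something outside the Basic Multilingual Plane such as U+1F600 shows up, which is much easier to miss.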
I understand changing QString to UTF-8 is a difficult task if we want to do it
in a compatible way. However, I think there is a way:
In Qt 5.x:
- Introduce some iterator that iterates over Unicode code points.
- Deprecate utf16() and other APIs that assume that QString is UTF-16.
- Replace them with a toUtf16() which returns a QVector<ushort>. I believe that
it is possible to make the content implicitly shared with the QString, avoiding
copies (since it is just a QTypedArrayData internally).
Then in Qt6 one can simply change the representation without breaking
compatibility with non-deprecated functions.
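As a rough illustration of why a by-value toUtf16() survives a change of representation, here is a sketch with a hypothetical Utf8BackedString class that stores UTF-8 and converts on demand. It uses std::vector<char16_t> in place of QVector<ushort>, assumes well-formed UTF-8, and does not attempt the implicit sharing mentioned above, which would only apply while the internal storage is still UTF-16.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a string whose internal representation is UTF-8.
class Utf8BackedString
{
public:
    explicit Utf8BackedString(std::string utf8) : m_utf8(std::move(utf8)) {}

    // Converts on demand, so callers never depend on the internal encoding.
    // Assumption: m_utf8 holds well-formed UTF-8.
    std::vector<char16_t> toUtf16() const
    {
        std::vector<char16_t> out;
        std::size_t i = 0;
        while (i < m_utf8.size()) {
            // Decode one code point from UTF-8.
            const unsigned char lead = static_cast<unsigned char>(m_utf8[i]);
            const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
            for (std::size_t k = 1; k < len && i + k < m_utf8.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(m_utf8[i + k]) & 0x3F);
            i += len;

            // Encode it as one UTF-16 unit or as a surrogate pair.
            if (cp < 0x10000) {
                out.push_back(static_cast<char16_t>(cp));
            } else {
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return out;
    }

private:
    std::string m_utf8;
};
```

A pointer-returning accessor like utf16() cannot be kept once the storage is no longer UTF-16, whereas a function returning an owned container can simply start converting.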
--
Olivier
Woboq - Qt services and support - https://woboq.com - https://code.woboq.org