On 23.01.19 23:15, André Pönitz wrote:
On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
23.01.2019, 16:55, "Edward Welbourne" <edward.welbou...@qt.io>:
All of this discussion ignores a major elephant: QString's indexing is
by 16-bit UTF-16 code units, not by Unicode characters. We've had Unicode
for a couple of decades now.

We *should* have a string type (I don't care what you call it) that acts
on strings indexed by Unicode characters, not in terms of a
representation. Whether that string type internally uses UTF-16 or
UTF-8 should be invisible to its user. Ideally it would be capable of
carrying its data internally in either form (so as to avoid needless
conversion when both producer and consumer use the same form) and of
converting between the two (e.g. so as to append efficiently) as needed.

I think this is excessive. Most common operations with strings in application
code are:

* Pass the string around or compare as an opaque token
* Draw the string on screen e.g. with QPainter (while technically it
   falls in the previous category, I think it's important enough to
   deserve a separate item)
* Find substring or pattern (regex) inside the string
* Split the string by character, pattern, or index boundaries found by means
   of the previous item

I think the only common cases where dealing with Unicode grapheme clusters
is required are

* Handling of text cursor movement
* Implementation of text shaping, i.e. what Harfbuzz is doing

I think having a special iterator would be quite enough for the cursor case. Such
an iterator could abstract away the underlying encoding, instead of forcing everyone
to convert to UTF-16 first.

All of that is scarily close to my opinion on the topic.

Same here. I think Konstantin is spot on.

Another example of good string design, I think, is Rust's String. It is always valid UTF-8 and is indexed by bytes, and splitting the string in the middle of a code point is a programmer error.
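
To illustrate the Rust behaviour described above, here is a minimal sketch using only the standard library. The helper name safe_prefix is made up for this example; it is not a std function. The literal uses a \u{e9} escape so the byte counts do not depend on how the source file is normalised.

```rust
// Sketch only: `safe_prefix` is a hypothetical helper, not a std API.

/// Returns the first `n` bytes of `s` if the cut lands on a code point
/// boundary, or `None` if it would split a code point (or run past the end).
fn safe_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    // 'n' takes 1 byte in UTF-8, U+00E9 (e with acute accent) takes 2.
    let s = "n\u{e9}";

    // len() counts bytes; counting code points requires iteration.
    println!("bytes: {}", s.len());
    println!("chars: {}", s.chars().count());

    // Byte offset 1 is between the two characters, offset 2 is inside U+00E9.
    println!("boundary at 1: {}", s.is_char_boundary(1));
    println!("boundary at 2: {}", s.is_char_boundary(2));

    println!("{:?}", safe_prefix(s, 1));
    println!("{:?}", safe_prefix(s, 2));

    // Slicing with `&s[..2]` would panic at run time instead of returning
    // None, because byte index 2 is not a char boundary.
}
```

The checked get() returns an Option, while the plain slicing syntax panics; in both cases a cut in the middle of a code point cannot silently produce an invalid string.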

As mentioned before, UTF-16 is quite a bad choice; it is only there for legacy reasons.

The argument that developers who wrongly use indexes will cause more problems with UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually means that developers will see and fix their bugs quickly.
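
A small sketch of why such bugs surface sooner with UTF-8, again in Rust with only the standard library. The two function names are invented for this example. Each one checks whether the (wrong) assumption "one index step equals one character" happens to hold for a given string in the given encoding.

```rust
/// True if every code point of `s` occupies exactly one UTF-8 byte,
/// i.e. naive byte indexing would happen to work for this string.
fn naive_indexing_ok_utf8(s: &str) -> bool {
    s.len() == s.chars().count()
}

/// True if every code point of `s` occupies exactly one UTF-16 code unit,
/// i.e. naive 16-bit indexing would happen to work for this string.
fn naive_indexing_ok_utf16(s: &str) -> bool {
    s.encode_utf16().count() == s.chars().count()
}

fn main() {
    // Sample strings chosen for illustration only.
    let samples = [
        ("ASCII", "hello"),
        ("accented Latin", "caf\u{e9}"),
        ("CJK", "\u{65e5}\u{672c}"),
        ("emoji outside the BMP", "\u{1f600}"),
    ];

    for &(label, s) in samples.iter() {
        println!(
            "{}: utf8 ok = {}, utf16 ok = {}",
            label,
            naive_indexing_ok_utf8(s),
            naive_indexing_ok_utf16(s)
        );
    }
}
```

With UTF-8 the assumption already breaks on the accented and CJK samples, which are likely to appear in everyday test data. With UTF-16 it only breaks on the sample outside the Basic Multilingual Plane, so the same bug can stay hidden for much longer.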

I understand changing QString to UTF-8 is a difficult task if we want to do it in a compatible way. However, I think there is a way:
In Qt 5.x:
 - Introduce some iterator that iterates over Unicode code points.
 - Deprecate utf16() and other API that assumes that QString is UTF-16.
 - Replace them with a toUtf16() that returns a QVector<ushort>. I believe that it is possible to make the content implicitly shared with the QString, avoiding copies (since it is just a QTypedArrayData internally).

Then in Qt6 one can simply change the representation without breaking compatibility with non-deprecated functions.
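
As a rough model of that plan, here is a sketch in Rust (chosen because its String was the example above) rather than in Qt code. The type Text and its methods code_points() and to_utf16() are hypothetical stand-ins and not Qt API. The implicit sharing that would avoid copies is not modelled: this to_utf16() always allocates.

```rust
/// Hypothetical stand-in for a string class whose public API does not
/// reveal its internal encoding. This is not a Qt type.
struct Text {
    // Internal representation. Callers never see it, so it could be
    // swapped for UTF-16 storage without changing the methods below.
    utf8: String,
}

impl Text {
    fn new(s: &str) -> Self {
        Text { utf8: s.to_owned() }
    }

    /// Encoding-independent iteration. Rust's `char` is a Unicode scalar
    /// value, which is used here as the closest match for "code point".
    fn code_points(&self) -> impl Iterator<Item = char> + '_ {
        self.utf8.chars()
    }

    /// Explicit conversion, playing the role of the proposed toUtf16().
    /// Assumption: always copies; the sharing optimisation is left out.
    fn to_utf16(&self) -> Vec<u16> {
        self.utf8.encode_utf16().collect()
    }
}

fn main() {
    // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
    let t = Text::new("a\u{1f600}");

    println!("code points: {}", t.code_points().count());
    println!("utf-16 units: {:x?}", t.to_utf16());
}
```

Code written against code_points() and to_utf16() alone keeps working whichever encoding the type stores, which is the property the deprecation step is meant to establish before the representation changes.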

--
Olivier

Woboq - Qt services and support - https://woboq.com - https://code.woboq.org




_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
