24.01.2019, 10:34, "Olivier Goffart" <oliv...@woboq.com>:
> On 23.01.19 23:15, André Pönitz wrote:
>> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>>> 23.01.2019, 16:55, "Edward Welbourne" <edward.welbou...@qt.io>:
>>>> All of this discussion ignores a major elephant: QString's indexing is
>>>> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
>>>> for a couple of decades now.
>>>>
>>>> We *should* have a string type (I don't care what you call it) that acts
>>>> on strings indexed by Unicode characters, not in terms of a
>>>> representation. Whether that string type internally uses UTF-16 or
>>>> UTF-8 should be invisible to its user. Ideally it would be capable of
>>>> carrying its data internally in either form (so as to avoid needless
>>>> conversion when both producer and consumer use the same form) and of
>>>> converting between the two (e.g. so as to append efficiently) as needed.
>>>
>>> I think this is excessive. Most common operations with strings in
>>> application code are:
>>>
>>> * Pass the string around or compare as an opaque token
>>> * Draw the string on screen e.g. with QPainter (while technically it
>>>   falls in the previous category, I think it's important enough to
>>>   deserve separate item)
>>> * Find substring or pattern (regex) inside the string
>>> * Split the string by character, pattern, or index boundaries found by
>>>   means of previous item
>>>
>>> I think the only common cases when dealing with Unicode grapheme clusters
>>> is required are
>>>
>>> * Handling of text cursor movement
>>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>>
>>> I think having special iterator would be quite enough for cursor case. Such
>>> iterator could abstract away underlying encoding, instead of forcing
>>> everyone to convert to UTF-16 first.
>>
>> All of that is scarily close to my opinion on the topic.
>
> Same here. I think Konstantin is spot on.
>
> Another example of good string design, I think, is Rust's String. Their
> string is encoded in valid UTF-8, indexed by bytes, and splitting the string
> in the middle of a code point is a programmer error.
>
> As already mentioned before, UTF-16 is quite a bad choice, if it weren't for
> legacy.
>
> The argument that developers wrongly using indexes causes more problems with
> UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually
> means that the developer will see and fix their bugs quickly.
>
> I understand changing QString to UTF-8 is a difficult task if we want to do it
> in a compatible way. However, I think there is a way:
> In Qt5.x:
>  - Introduce some iterator that iterates over unicode code points.
>  - Deprecate utf16() and other API that assume that QString is UTF-16
>  - Replace them by a toUtf16 which returns a QVector<ushort>. I believe that
>    it is possible to make the content implicitly shared with the QString,
>    avoiding copies. (since it is just a QTypedArrayData internally)

I will be officially pissed off if the possibility to access the raw data of
QString without an extra copy is gone :(

It would be better if there were a way to figure out the internal storage
encoding (e.g. isUtf16()) and access the raw data.

>
> Then in Qt6 one can simply change the representation without breaking
> compatibility with non-deprecated functions.
>
> --
> Olivier
>
> Woboq - Qt services and support - https://woboq.com - https://code.woboq.org
>
> _______________________________________________
> Development mailing list
> Development@qt-project.org
> https://lists.qt-project.org/listinfo/development

-- 
Regards,
Konstantin
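
To make the "iterator that abstracts away the underlying encoding" idea from the thread concrete, here is a minimal sketch. It is not Qt API: std::string (UTF-8) and std::u16string (UTF-16) merely stand in for the two possible internal storages, and the names nextCodePoint, codePoints and kReplacement are hypothetical. It deliberately skips the stricter UTF-8 checks (overlong forms, encoded surrogates) that a real implementation would need.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Emitted for malformed input (U+FFFD REPLACEMENT CHARACTER).
constexpr char32_t kReplacement = 0xFFFD;

// Decode one code point from UTF-16 storage, advancing pos past it.
char32_t nextCodePoint(const std::u16string &s, std::size_t &pos)
{
    assert(pos < s.size());
    const char16_t unit = s[pos++];
    if (unit >= 0xD800 && unit <= 0xDBFF) {            // high surrogate
        if (pos < s.size() && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF) {
            const char16_t low = s[pos++];
            return 0x10000 + ((char32_t(unit) - 0xD800) << 10)
                           + (char32_t(low) - 0xDC00);
        }
        return kReplacement;                           // unpaired high surrogate
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF)
        return kReplacement;                           // stray low surrogate
    return unit;
}

// Decode one code point from UTF-8 storage, advancing pos past it.
// Simplification: overlong forms and encoded surrogates are not rejected.
char32_t nextCodePoint(const std::string &s, std::size_t &pos)
{
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80)
        return lead;                                   // ASCII
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else
        return kReplacement;                           // invalid lead byte
    for (int i = 0; i < extra; ++i) {
        if (pos >= s.size())
            return kReplacement;                       // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        if ((c & 0xC0) != 0x80)
            return kReplacement;                       // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++pos;
    }
    return cp;
}

// Caller-facing view: the same code points whichever storage is used.
template <typename String>
std::vector<char32_t> codePoints(const String &s)
{
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < s.size();)
        out.push_back(nextCodePoint(s, pos));
    return out;
}
```

With this, codePoints("a\xC3\xA9\xF0\x9F\x98\x80") and codePoints(u"a\u00E9\U0001F600") give the same three code points, although the raw storage is 7 bytes in one case and 4 UTF-16 units in the other. A cursor-movement iterator would additionally have to group code points into grapheme clusters, which this sketch does not attempt.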