Hi all,

Last week I had a longer chat with Thiago about how to evolve QString for 
Qt 6.

Some work has already happened, so both QString and QByteArray now share the 
data structure with QList/QVector, enabling zero-copy conversion between the 
types. There are also some pending changes to transition those classes to 
qsizetype and remove the 32-bit limitations we currently have.

My high level goal for the string classes in Qt 6 was to complete the Unicode 
related changes that we started for Qt 5.0, where we made utf8 and utf16 the 
main encodings, and simplify things. I believe it’s important to leave the 
non-Unicode world behind us, and offer as consistent a cross-platform story 
here as we can.

Qt 5.x still has some left-overs from the pre-Unicode world:

* QTextStream encodes in Latin1 by default, so do a couple of classes in some 
places
* While we assume Utf8 as the source encoding for Qt, we still use 
QLatin1String all over the place
* We have extensive support for legacy text encodings in Qt Core, that should 
not be there anymore in 2020
* We offer options to generate HTML or XML in legacy encodings, even though the 
standard clearly says that those are deprecated
* to/fromLocal8Bit() should be equivalent to to/fromUtf8() on all but Windows 
(where we’re still a few years away from fully getting rid of this)
* source code encoding is undefined

Cleaning this up has progressed quite a bit, and a lot of changes in various 
classes have been merged. There’s a large set of changes currently being 
reviewed that remove QTextCodec as a dependency in Qt (it’ll get moved to 
libQt5Compat), and introduce a new QStringConverter class, that can handle 
transcoding between Unicode encodings, Latin1 and the system locale. For all 
systems except Windows, we make the additional assumption that the system 
locale is UTF-8 (see also my other mail about UTF-8 as System locale on 
Windows).
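
To make "transcoding between Unicode encodings and Latin1" concrete, here is a 
minimal sketch in plain standard C++. It is not the QStringConverter API; the 
function name latin1ToUtf8 is made up for illustration. It relies only on the 
fact that Latin-1 bytes map 1:1 to the code points U+0000..U+00FF:

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of Qt): transcode Latin-1 bytes to UTF-8.
std::string latin1ToUtf8(std::string_view latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xE4 ('ä') turns into the two bytes 0xC3 0xA4.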


A next step is to change the build system, so that it (by default) assumes that 
source code is encoded in UTF-8. We already set compiler flags to ensure 
this when building Qt itself, but are not doing this yet for user code.

But gcc and clang do already treat all source code as UTF-8 by default (and I 
believe ICC does the same at least on platforms other than Windows). MSVC will 
require a /utf-8 flag to enable this, something that I want to add to the 
default config for both qmake and cmake when compiling a Qt app. Without it, 
MSVC will still assume the source code is encoded in the current ANSI code page 
and u"…" or u8"…" will result in garbage. Worse, it’ll lead to non-portable 
code, that might compile correctly on one developer machine and create garbage 
on the next one (as it uses a different locale).
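
A small standard C++ sketch of the effect described above (no Qt involved; the 
constant names are made up). It assumes the file is saved as UTF-8 and that the 
compiler decodes it as UTF-8:

```cpp
#include <cstddef>

// 'ä' is U+00E4: one UTF-16 code unit, two UTF-8 bytes.
// sizeof includes the terminating zero, so subtract one element.
constexpr std::size_t kUtf16Units = sizeof(u"ä") / sizeof(char16_t) - 1;
constexpr std::size_t kUtf8Bytes = sizeof(u8"ä") - 1;

// These only hold when the compiler reads the source file as UTF-8.
static_assert(kUtf16Units == 1, "source file was not decoded as UTF-8");
static_assert(kUtf8Bytes == 2, "source file was not decoded as UTF-8");
```

If the compiler instead decodes the two source bytes through a single-byte 
ANSI code page, each byte typically becomes its own character, the counts 
change, and the asserts fire.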

Changing this also for our users will make source code written for Qt more 
portable and bring Qt on par with most other programming languages in the world 
that already mandate utf8 as the source encoding (JS, Swift, Java, etc).


Our string handling classes currently consist of the following classes: 
QByteArray, QString, QStringView, QStringRef, QStringLiteral and QLatin1String. 
The set is too large and inconsistent, and needs cleaning up:

* With the source code encoding being utf8, QLatin1String makes a lot less 
sense, and my goal is to deprecate/deprioritize it in Qt 6. Instead, I would 
like to advocate the use of u"…" to directly encode the string as UTF-16.
* QStringRef has been superseded by QStringView and should get deprecated. The 
main hurdle here is its use in QXmlStream. The plan is to extend QXmlStringRef 
(yes, that one exists as well…) to cover the use case. Both QXmlStringRef and 
QStringRef will get a cast operator to QStringView. With that we can then 
remove all API that takes a QStringRef and replace it with API taking either a 
QString or a QStringView
* QStringLiteral should turn into a small wrapper around u"…", and probably 
also get deprecated. Maybe we could add a user defined literal for it instead 
that returns a read-only QString (QString s = "…"_q;). So u"…" would lead to a 
QStringView, u"…"_q to a read-only QString.
* We should add a QByteArrayView to keep symmetry between the QString and 
QByteArray APIs. This is somewhat independent from the rest though and lower 
priority.
* QStringView and QByteArrayView need to be completed to implement all const 
methods of QString/QByteArray
* A basic difference between QString and QStringView will be that the view 
class can contain non-zero-terminated data and is read-only, while QString 
will guarantee zero termination (I checked whether we can remove that 
enforcement, but it would break too much code). Side note: currently, 
fromRawData() together with utf16() can break this assumption; we should fix 
this.
* QByteArray’s methods like toUpper() will only handle ASCII characters (they 
assume Latin1 in Qt 5).
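
To illustrate the last bullet, here is a rough sketch of what ASCII-only case 
conversion on byte data looks like. asciiToUpper is a made-up free function 
operating on std::string, not the QByteArray implementation:

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for an ASCII-only toUpper() on byte data.
std::string asciiToUpper(std::string_view bytes)
{
    std::string out(bytes);
    for (char &c : out) {
        // Only 'a'..'z' are touched; every byte >= 0x80 is left alone.
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - 'a' + 'A');
    }
    return out;
}
```

Leaving bytes >= 0x80 untouched means UTF-8 multi-byte sequences stored in the 
array pass through unchanged, whereas a Latin1 interpretation would rewrite 
some of those bytes.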

This would leave us with four string-related classes: QByteArray(View) and 
QString(View).
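
The user defined literal idea from the QStringLiteral bullet could look 
roughly like this. This is only a sketch in standard C++: ReadOnlyString is a 
hypothetical stand-in for a read-only QString, std::u16string_view stands in 
for QStringView, and the _q suffix is just the name floated above:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a read-only QString: it only references the
// static literal data and never allocates or copies.
struct ReadOnlyString {
    const char16_t *data;
    std::size_t size;
};

// Hypothetical literal operator; the compiler passes the literal's
// address and its length (without the terminating zero).
constexpr ReadOnlyString operator""_q(const char16_t *str, std::size_t len) noexcept
{
    return ReadOnlyString{str, len};
}

constexpr std::u16string_view kView = u"hello";   // plain u"…": a view
constexpr ReadOnlyString kString = u"hello"_q;    // u"…"_q: a string, no copy
```

Because the data comes from a string literal, it is zero-terminated, which 
matches the guarantee a QString has to give.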

Another step that is already partially implemented is to allow read-only string 
data inside a QString with a null d-pointer. This will serve as a replacement 
for QStringLiteral, and allow us to pass read-only data without copying into 
all of our API. As opposed to QStringView, QString will however require zero 
termination.
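
A very rough model of the null d-pointer idea, in standard C++. This is not 
QString's actual layout; SketchString and all of its members are invented for 
illustration, with a std::shared_ptr playing the role of the d-pointer:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical model: 'd' owns heap data when set; a null 'd' means 'ptr'
// refers to read-only data that we neither own nor free.
class SketchString {
public:
    // Wrap a zero-terminated literal without copying (null d-pointer).
    template <std::size_t N>
    static SketchString fromLiteral(const char16_t (&lit)[N])
    {
        return SketchString(nullptr, lit, N - 1);
    }

    // Copy arbitrary data into owned, zero-terminated storage.
    static SketchString fromCopy(const char16_t *s, std::size_t n)
    {
        auto d = std::make_shared<std::u16string>(s, n);
        const char16_t *p = d->c_str();
        return SketchString(std::move(d), p, n);
    }

    bool isReadOnlyData() const { return d == nullptr; }
    const char16_t *utf16() const { return ptr; }  // zero-terminated in both cases
    std::size_t size() const { return len; }

private:
    SketchString(std::shared_ptr<std::u16string> dd, const char16_t *p, std::size_t n)
        : d(std::move(dd)), ptr(p), len(n) {}

    std::shared_ptr<std::u16string> d;  // null for literal-backed strings
    const char16_t *ptr;
    std::size_t len;
};
```

The literal path never allocates, while data without a terminating zero has to 
go through the copying path to get one.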

One open question is whether we should add a QUtf8String with a char8_t. I am 
not yet convinced that we actually need the class though.

The next question is what we do with our API methods. Currently we have many 
places where we have three to four overloads for the same method (taking a 
QString, a QStringView, a QStringRef and a QLatin1String). We can’t have 4 
overloads for each method in all of Qt, so we need to restrict overloads to the 
places where it is required. IMO this is mainly the string related classes 
themselves. And even there we can probably cut down on the number of overloads.

In most other places we should by default only use QString, unless there are 
very significant performance benefits to be had from using QStringView. This 
helps us keep an API that’s both easy to use and maintain. With the ideas 
above, you can still create a read-only string, so data copies can in many 
cases be avoided if required.

I believe the many overloads add only API clutter with very little benefit to 
our users. We should cut down on that multitude and offer only one version. In 
most cases the API should simply take a QString (especially as we can do 
read-only strings without allocation). QStringView should be used in low-level 
APIs where performance matters and we are certain that we will never require a 
copy of the input string. The QStringRef and QLatin1String overloads should 
simply disappear.

So this would give the following API guidelines for using QString(View) and 
related classes in Qt:

For string-related classes:

* All methods not taking ownership of the passed arguments take a QStringView
* If the method stores a pointer to the passed data it should take a QString to 
not surprise users. Exceptions can be made where it makes sense, but then the 
method naming has to give clear indications that this happens (like e.g. 
fromRawData())
* Return a QString in QString itself and when doing conversions, return 
QStringView from QStringView
* No QStringRef!
* QLatin1String for backwards compatibility, can be disabled with a macro 
(similar to QT_NO_CAST_FROM_ASCII)
* Remove or deprecate overloads taking a (const Char *, length) pair. Replace 
them with QStringView
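
A sketch of the first two guidelines and the last one, using standard C++ 
types as stand-ins (std::u16string_view for QStringView, std::u16string for 
QString). countChar and Label are hypothetical names, not Qt API:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Does not take ownership: a single view parameter serves literals,
// owning strings and former (pointer, length) callers alike.
std::size_t countChar(std::u16string_view text, char16_t ch)
{
    std::size_t n = 0;
    for (char16_t c : text) {
        if (c == ch)
            ++n;
    }
    return n;
}

// Keeps the data around after the call, so it takes an owning string.
class Label {
public:
    void setText(std::u16string text) { m_text = std::move(text); }
    const std::u16string &text() const { return m_text; }

private:
    std::u16string m_text;
};
```

A caller that used to pass a pointer and a length would construct the view 
from that pair at the call site, so no separate overload is needed.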

Most other classes:

* Only take and return QString
* Exceptions can be made where significant performance gains can be 
demonstrated and the API will by design not require a copy of the data (e.g. 
XML writer, stream writers, date time handling)

Implementation wise, here are the steps we need to take:

* Finish the QString/QStringView API symmetry. QStringView should offer the 
complete const API of QString. QString’s const methods should be implemented 
in terms of QStringView (or of a static method in a private namespace that is 
used by both).
* (Lower priority) Do the same for QByteArray
* Remove all QStringRef uses except for QXmlStream (which is a beast in itself)
* Rework QXmlStreamReader. Most likely simply return a QXmlStringRef instead of 
QStringRef (and extend its API slightly + add cast operators to QStringView 
and QString). This is a somewhat source-incompatible change (SIC), but 
fortunately XML parsing is usually limited to a very restricted part of any 
project
* Deprecate QStringRef
* Enable /utf-8 mode in MSVC for qmake and cmake builds (for us and our users) 
by default. Offer a simple way to turn it off for backwards compatibility
* Consider doing some of the things for utf-8 support on Windows outlined in 
https://lists.qt-project.org/pipermail/development/2020-May/039421.html
* Our QLatin1String uses are in most cases about pure ASCII strings. In any 
case, we should consider mass porting them over to u"…" instead.
* I don’t think we can deprecate QL1String and QStringLiteral in 6.0, but we 
should offer a mode to disable them.
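
The first step in the list (const methods of the owning class implemented in 
terms of the view) could be structured roughly as follows. SketchView and 
SketchOwning are hypothetical stand-ins for QStringView and QString, built on 
standard C++ types:

```cpp
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for QStringView.
class SketchView {
public:
    constexpr SketchView(std::u16string_view v) : m_v(v) {}

    // The single real implementation of this const method lives on the view.
    constexpr bool startsWith(SketchView prefix) const
    {
        return m_v.size() >= prefix.m_v.size()
            && m_v.substr(0, prefix.m_v.size()) == prefix.m_v;
    }

private:
    std::u16string_view m_v;
};

// Hypothetical stand-in for QString.
class SketchOwning {
public:
    explicit SketchOwning(std::u16string s) : m_s(std::move(s)) {}

    // The owning class forwards to the view instead of duplicating the logic.
    bool startsWith(SketchView prefix) const
    {
        return SketchView(m_s).startsWith(prefix);
    }

private:
    std::u16string m_s;
};
```

With this layout both classes expose the same const API, and a fix in the view 
implementation automatically applies to the owning class as well.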

Comments are welcome, help to implement the plan even more :)

Cheers,
Lars
_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
