Hello,

On 05/06/2024 15:18, Ivan Solovev via Development wrote:
Hi,

I'm now working on introducing std::format support for some of the Qt types.
I decided to start with the variety of Qt string types, and I have some open
question regarding the implementation that I want to discuss.

First, I'd like to give a very short summary of my understanding of how
std::format works in plain C++ when it comes to string formatting.
Basically, we have two types of formatters:
* std::formatter<T, char>, which handles the std::string, const char *,
  and const char (&)[N] overloads.
* std::formatter<T, wchar_t>, which handles the std::wstring,
  const wchar_t *, and const wchar_t (&)[N] overloads.

The encoding for the wide char strings is usually known - it's either UTF-16
on Windows or UTF-32 on Linux and macOS.
But what is the encoding for the char strings? The answer is that std::format does not care: it simply formats the characters according to the format string, and what you see in the terminal depends entirely on your terminal's encoding.
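
For concreteness, here is a minimal, plain-C++ illustration of the two formatter families; nothing in it is Qt-specific, and the string values are made up:

    #include <format>
    #include <string>

    int main()
    {
        // Narrow ("char") formatters: std::string, const char *, and
        // char arrays. std::format copies the bytes through without
        // caring about their encoding -- "w\xC3\xB6rld" is "wörld"
        // spelled with an explicit UTF-8 o-umlaut.
        std::string n = std::format("{} {}", std::string("hello"), "w\xC3\xB6rld");

        // Wide ("wchar_t") formatters: std::wstring, const wchar_t *,
        // and wchar_t arrays.
        std::wstring w = std::format(L"{} {}", std::wstring(L"hello"), L"world");
    }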

I think we should conceptually separate formatting from printing on a terminal. std::format isn't _just_ for printing on terminals (we now have std::print for that). Having said that, I admit that I've fallen quite behind on this topic.

So, back to the main question: how should we format the Qt string types?

Just for the sake of discussion, we can also leave the problem unsolved until std::format works with Unicode strings. As much as that's a pain point for users, we won't paint ourselves into a corner (see below).


The support for wide char formatters is straightforward - we can use
QString::toStdWString() and be sure that we do not get any unreadable
characters in the formatted output.
I already have a WIP patch implementing it [0].
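
For reference, the rough shape such a formatter could take -- this is an illustrative sketch, not necessarily what the patch in [0] actually does:

    #include <format>
    #include <string>
    #include <QString>

    // Illustrative sketch: delegate to the standard std::wstring
    // formatter after converting with QString::toStdWString().
    template <>
    struct std::formatter<QString, wchar_t>
        : std::formatter<std::wstring, wchar_t>
    {
        template <typename FormatContext>
        auto format(const QString &s, FormatContext &ctx) const
        {
            return std::formatter<std::wstring, wchar_t>::format(
                    s.toStdWString(), ctx);
        }
    };

With something like that in place, std::format(L"{}", str) works for a QString str, at the cost of the conversion's allocation.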

In general, I'm not too fond of the idea that we need to re-encode strings (= allocations) in order to format them, but I don't see an easy way out given the tools at our disposal...


But what should we do with the char formatters? Should we aim for the
formatted strings to always be readable, or should we simply not care,
like std::formatter<char> does?

What do you mean by "readable" here?


I see several options here:

1. Treat everything as UTF-8

Traditionally, all QString(View) constructors taking char arrays or std::string treat the data as UTF-8. Also, QString::toStdString() provides a UTF-8 encoded
std::string. So this would be more or less the expected behavior for Qt users.

With this approach QLatin1StringView should also be converted to UTF-8 before
being processed by the formatter.
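
As a sketch, option 1 would boil down to something like the following for QString (illustrative only; a real implementation would also cover QStringView and the other string classes):

    #include <format>
    #include <string>
    #include <QString>

    // Illustrative sketch of option 1: always convert to UTF-8.
    // QString::toStdString() already returns UTF-8 encoded data.
    template <>
    struct std::formatter<QString, char>
        : std::formatter<std::string, char>
    {
        template <typename FormatContext>
        auto format(const QString &s, FormatContext &ctx) const
        {
            return std::formatter<std::string, char>::format(
                    s.toStdString(), ctx);
        }
    };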

That definitely sounds appealing, in the sense that in any text-based API we expect `char` to mean UTF-8. So formatting into chars means formatting into UTF-8.

2. Treat everything as Local8Bit

Basically similar to the previous approach, but use toLocal8Bit() instead of
toUtf8() when passing the data to the formatter. On Linux and macOS that would
actually be equivalent to the first approach, because toLocal8Bit() simply
assumes UTF-8 as an encoding. On Windows it would use CP_ACP to do the
conversion.

In this case the behavior would be similar to what qDebug() does.

Again, I'm not really sure about entangling consoles with this.
If you go for this approach and std::print a QString on Windows, what kind of output do you get?

The drawback is that the formatted string might differ from the original one. For example, `Ü` might be replaced with `U`, and some other symbols might be
replaced with `?`, depending on the currently selected code page.

Similarly to the previous option, QLatin1StringView and QUtf8StringView should
also be converted to Local8Bit before formatting.
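
Compared to the option 1 sketch above, only the conversion call would change; toLocal8Bit() and QByteArray::toStdString() are existing Qt APIs, while the formatter shape itself is again illustrative:

    #include <format>
    #include <string>
    #include <QString>

    // Illustrative sketch of option 2: route through the local 8-bit
    // encoding (CP_ACP on Windows, effectively UTF-8 elsewhere).
    template <>
    struct std::formatter<QString, char>
        : std::formatter<std::string, char>
    {
        template <typename FormatContext>
        auto format(const QString &s, FormatContext &ctx) const
        {
            return std::formatter<std::string, char>::format(
                    s.toLocal8Bit().toStdString(), ctx);
        }
    };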

3. Try not to guess the encoding for the user

Basically, QUtf8StringView and QLatin1StringView explicitly mention their
encoding in their class names, so we can assume that users who pass these
classes to std::format expect UTF-8 or Latin-1 output, respectively.

I'm not following this. If I do

 std::format("{} {}", utf8string, latin1string)

what am I supposed to get out? A string which is a mix of two different encodings? I don't think that's ever what anyone wants.
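
To make that concrete at the byte level:

    #include <string>

    int main()
    {
        // "é" is the two bytes 0xC3 0xA9 in UTF-8, but the single byte
        // 0xE9 in Latin-1. If each formatter passed its own bytes
        // through unchanged, the combined output would mix both
        // encodings and decode correctly under neither.
        const std::string utf8_e   = "\xC3\xA9";
        const std::string latin1_e = "\xE9";
        const std::string mixed    = utf8_e + " " + latin1_e;
    }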


The question here is how to deal with QString(View)?
  3a. Convert it to UTF-8, because that's the pre-existing behavior,
      which should already be familiar to users.
  3b. Do not implement std::formatter<QString(View), char> at all, and
      let the users explicitly convert QString to something else first.

Option 3b is inconvenient and defeats the purpose of std::format support
for Qt types, so I'd personally prefer 3a here.

The concern I was quoting before is this: suppose that tomorrow we have a formatter for `const char16_t *` into char, and that this formatter does some kind of transcoding. Then QString(View) ought to do precisely the same! If we make a different decision now, we risk compatibility problems down the line.

Now, I don't really know if formatting char16_t is anywhere on SG16's radar in the short term, but that definitely sounds like something to investigate and report back on, in order to make a more informed decision.

(Not to mention formatting _into_ char16_t, which would unlock something like QString::format to *create* a QString!)
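
Until then, assuming the char formatters settle on UTF-8, a helper along these lines could already be built on top of the char path, at the cost of one transcoding step (the name and signature are invented here for illustration):

    #include <format>
    #include <string_view>
    #include <QString>

    // Hypothetical helper (name invented for illustration): format via
    // the char/UTF-8 path, then transcode once into a QString.
    template <typename... Args>
    QString qFormatToQString(std::string_view fmt, Args &&...args)
    {
        // The named parameters are lvalues, so this stays valid even
        // with the stricter std::make_format_args in newer drafts.
        return QString::fromStdString(
                std::vformat(fmt, std::make_format_args(args...)));
    }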


Thanks,

--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - Trusted Software Excellence

