On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> Also, given a function like
>
>   setFoo(const QByteArray &);
>
> what does this actually expect? An UTF-8 string? A local 8-bit string?
> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
> master of none.
Like that, it's just an "array of bytes of an arbitrary encoding (or none)". There's still a reason to have QByteArray, and it'll need to exist in networking and file I/O code. That means the string classes, if any, need to be convertible to QByteArray anyway.

> So, assuming the premiss that QByteArray should not be string-ish
> anymore, what do we want to have as the result type of QString::toUtf8()
> and QString::toLatin1()? Do we really want mere bytes?
>
> I don't think so.

Since for Qt, String = UTF-16, anything in another encoding is "a bag of bytes". QByteArray does serve that purpose.

> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
> (as they are on Windows). It makes a _ton_ of sense to have a container
> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
> to do string processing in UTF-8 without potentially doubling the
> storage requirements by first converting it to UTF-16, then doing the
> processing, then converting it back.

Unless you're processing Cyrillic or Greek text, in which case your memory usage will be about the same. Or if you're processing CJK, in which case UTF-16 is a 33% reduction in memory use.

> Qt should have a strong story not just for UTF-16, but also for UTF-8.

So long as it's not confusing which class to use, sure. If that means a proliferation of overloads everywhere, we've gone wrong somewhere.

> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,
> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
> team has them down to within 5% of each other, not sure that's
> possible).

The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is within 5% of the other. The UTF-8 codec is optimised towards US-ASCII; the difference in performance is the need to check whether the high bit is set. Both codecs are vectorised, with SSE2 and AVX2 implementations. There are also Neon implementations, but I don't know their benchmark numbers (note: the UTF-8 Neon code is AArch64-only, while the Latin1 one also runs on 32-bit).

For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending on how dense the non-ASCII characters are in the string. But given that we want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 should be rare.

I also have an implementation of a UTF-16-to-ASCII codec, which is the same as UTF-16 to Latin1 but without error checking. That requires that the string class store whether it contains only US-ASCII. I've never pushed this to Qt.

> Anyway, we'd have two class templates, and they'd just be
> instantiated with different Char types to flesh out all of the above,
> with the exception of the byte array ones:
>
>   using QUtf8String = QBasicString<char8_t>;
>   using QString = QBasicString<char16_t>;
>   using QLatin1String = QBasicString<char>;
>   (using QByteArray = QVector<std::byte>;)

BTW, I've said this before: QVector should over-allocate by one element and memset it to zero, if the element is small enough (4 or 8 bytes). This should be done behind the scenes, so the API would never notice it. But it would allow transferring the ownership of a QByteArray's payload to any of the other classes and still have a null-terminated string.
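To illustrate the idea (this is only a sketch with made-up names, not actual or proposed Qt API; ByteBuffer and Utf8String here stand in for QByteArray and a string class): because the container always allocates one extra element and keeps it zeroed, its buffer can be adopted by a string class and used as a null-terminated string without any copy.

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <utility>

    // Made-up stand-in for a QByteArray-like container that over-allocates
    // by one element and keeps that element zeroed.
    struct ByteBuffer
    {
        std::unique_ptr<char[]> data;
        std::size_t size = 0;

        static ByteBuffer copyFrom(const char *src, std::size_t n)
        {
            ByteBuffer b;
            b.data = std::make_unique<char[]>(n + 1); // one extra, zero-initialised
            std::memcpy(b.data.get(), src, n);        // data[n] stays '\0'
            b.size = n;
            return b;
        }
    };

    // Made-up stand-in for a string class adopting the payload: thanks to
    // the extra zeroed element, constData() is already null-terminated, so
    // the ownership transfer needs no reallocation and no copy.
    struct Utf8String
    {
        std::unique_ptr<char[]> d;
        std::size_t len = 0;

        static Utf8String adopt(ByteBuffer &&b)
        {
            Utf8String s;
            s.len = b.size;
            s.d = std::move(b.data);
            b.size = 0;
            return s;
        }

        const char *constData() const { return d.get(); }
    };

    int main()
    {
        ByteBuffer b = ByteBuffer::copyFrom("hello", 5);
        Utf8String s = Utf8String::adopt(std::move(b));
        return s.constData()[5] == '\0' ? 0 : 1;   // terminator came for free
    }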
I don't mind having a QUtf8String{,View}, but there needs to be a limit on how much we add to its API. Do we have indexOf(char32_t) optimised with vectorisation? Do we have indexOf(QRegularExpression)? The latter would make us link to libpcre2-8 in addition to libpcre2-16, or would require on-the-fly conversions and memory allocations. If your objective is to speed things up, having too many methods may actually make it worse. And then there's the overload set for generic functions.

I'm going to insist on a single, clear rule that does not depend on implementation details and is reasonably future-proof. It has to be about *what* the function does, not *how* it does it.

> If, after getting all of the above running, we _then_ want The One String
> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
> need a QAnyString), which can contain any of the 2-4 string (view)
> classes above (but not QByteArray(View)), but which doesn't have
> string-ish API. Instead, you need to inspect it to extract the actual
> string class (QLatin1String, QUtf8String, QString) contained, or simply
> ask for the one you want, and it will convert, if necessary.

Excluding QLatin1String, since I don't think we need that, I'm willing to see this effort through. We need proofs of concept to show it works. And that's after we decide what QUtf8String is in the first place -- and that practically requires C++20. There isn't enough time for that before 6.0. Therefore, we need a solution for the APIs that doesn't include QAnyString. So I can't take this suggestion:

> With this, your typical Qt function taking strings would look like this:
>
>   QLineEdit::setText(QAnyStringView text)
>
>   Meep parseMeep(QAnyStringView str)
>   {
>       return str.visit([](auto str) {
>           Meep meep;
>           for (auto me : str.tokenize(u'\n'))
>               meep += parse(me);
>           return meep;
>       });
>   }
>
> iow: instead of a bunch of overloads, you write your code as a template
> and let QAnyStringView instantiate your lambda with the actual type of
> string view passed.

At the cost of a code size increase. More likely, our code will instead convert the content to UTF-16 and operate on that, which trades code size for runtime memory consumption (sometimes).

> bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
> {
>     return lhs.visit([rhs](auto lhs) {
>         return rhs.visit([lhs](auto rhs) {
>             return lhs == rhs;
>         });
>     });
> }

This MUST be non-inline and vectorised. Latin1-to-UTF-16 comparisons are easy and we have them; UTF-8-to-UTF-16, not so much: as I explained above, our vector code only operates on US-ASCII. That might suffice for our needs.

Another problem with UTF-8-to-UTF-16 comparisons is that the lengths can't be directly compared, whereas with Latin1 and UTF-16 they can. That means this part of the comparison between QString and QLatin1String can't hold:

    bool QString::operator==(QLatin1String other) const noexcept
    {
        if (size() != other.size())
            return false;
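For illustration (a sketch only, not Qt code, and it assumes C++20 for char8_t/std::u8string_view): the best a UTF-8-to-UTF-16 comparison can do up front is a bounds check on the lengths, because each UTF-16 code unit corresponds to one to three UTF-8 bytes (and a surrogate pair, two code units, to four).

    #include <string_view>

    // Necessary, but not sufficient, condition for a UTF-8 string and a
    // UTF-16 string to encode the same text:
    //   utf16.size() <= utf8.size() <= 3 * utf16.size()
    // Contrast with Latin1 vs UTF-16, where the lengths must match exactly.
    bool mayBeEqual(std::u8string_view utf8, std::u16string_view utf16)
    {
        return utf8.size() >= utf16.size()
            && utf8.size() <= 3 * utf16.size();
    }

    int main()
    {
        using namespace std::literals;
        // "h\u00e9llo" is 6 UTF-8 bytes but only 5 UTF-16 code units.
        return mayBeEqual(u8"h\u00e9llo"sv, u"h\u00e9llo"sv) ? 0 : 1;
    }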
See compareElementRecursive() in qcborvalue.cpp for the comparison combinations.

    // Officially with CBOR, we sort first the string with the shortest
    // UTF-8 length. The length of an ASCII string is the same as its UTF-8
    // and UTF-16 ones, but the UTF-8 length of a string is bigger than the
    // UTF-16 equivalent.

The combinations are:

    // 1) UTF-16 and UTF-16
    // 2) UTF-16 and UTF-8       <=== this is the problem case
    // 3) UTF-16 and US-ASCII
    // 4) UTF-8 and UTF-8
    // 5) UTF-8 and US-ASCII
    // 6) US-ASCII and US-ASCII

There are a couple of vector implementations on the Internet (see branchless.org) that decode full UTF-8 into UTF-32 entirely in vector registers at rates approaching 3 bytes per cycle, which is about the rate of our UTF-8 decoder when run with US-ASCII content, or the rate of the Latin1 decoder. If everyone cares to upgrade to Intel Ice Lake processors, we can do that. For everyone stuck with 2019 processors or older, there are slightly worse implementations.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products