On Wednesday, 19 January 2022 09:28:40 PST Edward Welbourne wrote:
> Thiago Macieira (19 January 2022 17:48) replied:
> > That's a misconception. AVX and especially AVX2 introduce a lot of
> > codegen opportunities for the compilers, which they've been able to
> > use for years.
>
> Is the difference here:
> * We have code that overtly conditions on the availability of CPU
>   features (for example in the places Lars mentioned) vs
> * The compiler can achieve some optimizations, on which we currently
>   miss out, if we pass a relevant command-line option telling it to do
>   so (or omit one telling it not to) ?
>
> (Hoping you'll educate me if I'm being dense.)

Hello Eddy

It's both.

The compilers can generate better code if the relevant flags are passed on
the command-line or are built-in. We build QtCore and QtGui (or is it maybe
all libraries now in Qt6?) with -O3, which enables the auto-vectoriser in GCC
and Clang. Raising the minimum targeted architecture gives them more options
to generate faster code.

But it's not just vector code. Compare
  https://gcc.godbolt.org/z/h3WfWsGEz (x86-64 baseline)
with
  https://gcc.godbolt.org/z/vPcKqbT7P (x86-64-v2, except for MSVC)

The floor() function in any good libc/libm will have the runtime detection and
use the SSE4.1 instruction to implement the functionality. glibc does:
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/s_floor-sse4_1.S.html

The problems are:
* not all libc/libm are good. In particular, MinGW's support lacks those
  optimisations (I've just checked). I haven't disassembled MSVC's Runtime to
  find out what it does.
* some compilers may opt to not make a function call, like GCC did in the
  example above. When running on a pre-SSE4 CPU, the GCC generated code is
  actually better. However, since the overwhelming majority of CPUs this code
  will run on do have SSE4, GCC's codegen is actually a pessimisation.
* even in the best case scenario (the other compilers), you still have a
  function call, which on ELF platforms means going through the PLT. So
  instead of a single instruction, we have a CALL, then an indirect JMP, then
  that instruction. That's unnecessary overhead for 99.9% of all users.

This is not an isolated example. If I disassemble my QtCore, I see a lot of
other scalar instructions:

$ objdump -d libQt6Core.so | egrep -c '(movbe|sarx|shrx|shlx|[tl]zcnt|popcnt)'
638

Each of those saves a cycle here and there, so it's not worth making a runtime
decision to use them. Instead, they must be used opportunistically. This would
be especially beneficial for math-heavy libraries like Qt3D.

Then there's our optimised code.
qstring.cpp has a lot of it and it's not selected at runtime, for the same
reason: the overhead of selecting is higher than the benefit. Most of our
strings are fairly small: a histogram of all calls to those functions from a
Qt Creator start shows they peak around 5-10 characters, then drop sharply
with a long tail. This means those operations suffer greatly from overhead and
what matters most is latency, not throughput. That's very different from image
manipulation, in QtGui's drawhelpers: even a small 16x16 image is 1024 bytes
(at 4 bytes per pixel). So any overhead in making a selection is quickly
amortised there, but not so for strings.

Would it be worth it for some of those operations in qstring.cpp? Probably,
particularly after my last round of optimisations. I especially think so if I
could use GNU IFUNC support, which would mean all callers would jump directly
into one of the optimised functions, instead of calling a function that then
calls another.

But optimising the entire library means we get more, at the cost of some extra
time building and some more files in your system. It's also a generic
solution, instead of targeting particular functions. So it should be a
win-win: better performance at lower maintenance cost.

(*) we've long-since shortened it to 4 characters. And AVX512 has a trick that
allows us to use it even down to a single byte (see my outgoing changes in
Gerrit).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
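
Illustration of the floor() case discussed above. This is a minimal sketch and not code from Qt or glibc: the names floorPlain, floorSse41, pickFloor and floorDispatched are made up for the example, and the x86 path assumes GCC or Clang on x86-64. floorPlain is the "let the compiler decide" variant: built for the baseline it becomes a call into libm (or an inline fallback), while with something like -march=x86-64-v2 it should collapse into a single rounding instruction. floorDispatched is the "overt runtime check" variant, which pays for an indirect call on every use.

```cpp
#include <cmath>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#  include <immintrin.h>
#  define SKETCH_HAVE_X86_DISPATCH 1
#endif

// Variant 1: plain code. What the compiler emits here depends entirely on
// the architecture flags the translation unit is built with.
double floorPlain(double x)
{
    return std::floor(x);
}

#ifdef SKETCH_HAVE_X86_DISPATCH
// Hand-written SSE4.1 path. The target attribute lets this one function use
// SSE4.1 even when the rest of the file is built for the x86-64 baseline.
__attribute__((target("sse4.1")))
static double floorSse41(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_floor_sd(v, v));
}
#endif

using FloorFn = double (*)(double);

// Runtime selection: ask the CPU once which implementation is usable.
static FloorFn pickFloor()
{
#ifdef SKETCH_HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("sse4.1"))
        return floorSse41;
#endif
    return floorPlain;
}

// Variant 2: every call goes through a function pointer. That indirection is
// the overhead that is hard to amortise for an operation this small.
double floorDispatched(double x)
{
    static const FloorFn fn = pickFloor();
    return fn(x);
}
```

Both variants return the same values; the difference is only in how many instructions sit between the caller and the actual rounding. Comparing the disassembly of floorPlain with and without -march=x86-64-v2 is one way to see the effect on a given compiler.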
