On Thursday, 17 November 2022 10:24:35 PST Volker Hilsheimer via Development wrote:
> > Though I am postponing the QString vectorisation update to 6.6 because I
> > don't have time to collect the benchmarks to prove I'm right before
> > feature freeze next Friday.
>
> Next Friday is the platform & module freeze. Feature freeze is not until
> December 9th, i.e. another 3 weeks to go.

Next Friday is also the day after Thanksgiving here in the US. I don't
expect I can finish the benchmarking in 3 weeks, especially since I still
need to finish the IPC work, which includes a couple of changes I haven't
even started yet (like the ability to clean up after itself).

For the benchmarking, I've already collected the data by instrumenting each
of the functions in question and running a Qt build, a Qt Creator start, and
a Qt build inside Qt Creator:

  qt-build-data.tar.xz:        1197.3 MB
  qtcreator-nosession.tar.xz:  2690.0 MB
  qtcreator-session.tar.xz:   35134.6 MB

The captured data retains its intra-cacheline alignment.

The way I'm seeing it, for each of the algorithm generations, I need to:

1) Find the asymptotic limits, given the L1, L2 and L3 cache sizes.
   That is, the algorithms should be fast enough that the bottleneck is the
   transfer of data. There's no way that running qustrchr on 35 GB is going
   to be bound by anything other than RAM bandwidth or, in my laptop's
   case, the NVMe. So what are those limits?

2) Benchmark at several data set sizes (half to 75% of L1, half to 75% of
   L2) on several processor generations.
   Confirm that each algorithm runs close to, or better than, the ideal
   run that llvm-mca showed when I designed it. I know I can benchmark
   throughput to see whether we're reaching the target bytes/cycle (a
   rough sketch of such a harness follows this list), but I don't know if
   I can benchmark the latency. I also don't know if it matters.

3) Benchmark at several input sizes (i.e., strings of 4 characters, 8
   characters, etc.).
   Same as #2, but instead of running over the sample that adds up to a
   certain data size, select the inputs so that the strings always have
   the same length.

4) Compare to the previous generation's algorithm to confirm the new one
   is actually better.
   Different instructions have different pros and cons; what works for one
   at a given data size may not work for another.
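To make #1 and #2 concrete, here's a rough sketch of the kind of throughput
harness I have in mind. This is not our actual benchmark code: findChar()
is just a scalar stand-in for the real qustrchr() implementations, the file
name is made up, and the cache sizes (32 kB L1D, 1 MB L2) are assumptions
to be replaced with each machine's real ones.

    // g++ -O2 -march=native throughput-sketch.cpp
    #include <x86intrin.h>   // __rdtsc()
    #include <cstdio>
    #include <vector>

    // Scalar stand-in for the function under test; the real runs would
    // call the SSE2 / SSE 4.1 / AVX2 / Avx256 builds of qustrchr().
    static const char16_t *findChar(const char16_t *p, size_t n, char16_t c)
    {
        for (size_t i = 0; i < n; ++i)
            if (p[i] == c)
                return p + i;
        return nullptr;
    }

    int main()
    {
        // Data set sizes relative to the assumed cache sizes, plus one
        // RAM-bound size to probe the asymptotic limit of #1.
        const size_t sizes[] = { 16 << 10, 24 << 10, 512 << 10, 768 << 10,
                                 256 << 20 };
        for (size_t bytes : sizes) {
            size_t n = bytes / sizeof(char16_t);
            std::vector<char16_t> haystack(n, u'a'); // no match: scans all
            const char16_t *volatile sink;           // defeat optimization

            // A faithful replay of the captured data would also place each
            // string at its original offset within a 64-byte cache line.
            sink = findChar(haystack.data(), n, u'b'); // warm up caches
            const int reps = 100;
            unsigned long long start = __rdtsc();
            for (int r = 0; r < reps; ++r)
                sink = findChar(haystack.data(), n, u'b');
            unsigned long long ticks = __rdtsc() - start;

            // Caveat: the TSC ticks at the nominal frequency, so bytes per
            // tick only approximates bytes/cycle if turbo is disabled.
            printf("%9zu kB: %.2f bytes/tick\n", bytes >> 10,
                   double(bytes) * reps / ticks);
        }
    }

Plotting bytes per tick against the data set size should show a plateau at
each cache level; those plateaus are the asymptotic limits #1 asks for.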
The algorithms available are:
 * baseline SSE2: no comparisons
 * SSE 4.1: compare to the baseline SSE2 and the current SSE 4.1
 * AVX2: compare to the new SSE 4.1 and the current AVX2
 * AVX512 with 256-bit vectors ("Avx256"): compare to the new AVX2

I plan on collecting data on 3 laptop processors (Haswell, Skylake and
Tiger Lake) and 2 desktop processors (Coffee Lake and Skylake Extreme).
The Skylake should match the performance of almost all the Skylake
derivatives since 2016; the Coffee Lake NUC has the same processor as my
Mac Mini; the Tiger Lake should represent the performance of modern
processors. The Skylake Extreme and the Tiger Lake can run the AVX512 code
too. I don't know whether the AVX512 code on Skylake will show a
performance gain or a loss, because despite using only 256 bits, it may
need to power on the OpMask registers. If it doesn't show a gain, I will
adjust the feature detection so that it applies only to Ice Lake and
later.

I have a new Alder Lake which would be nice to benchmark, to get the
performance of both the Golden Cove P-cores and the Gracemont E-cores, but
the machine runs Windows and the IT-mandated virus scans, so I will not
bother.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering

_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development