[Bug middle-end/51017] New: GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

             Bug #: 51017
            Summary: GCC 4.6 performance regression (vs. 4.4/4.5)
     Classification: Unclassified
            Product: gcc
            Version: 4.6.2
             Status: UNCONFIRMED
           Severity: normal
           Priority: P3
          Component: middle-end
         AssignedTo: unassig...@gcc.gnu.org
         ReportedBy: solar-...@openwall.com

On at least x86_64, GCC 4.6 produces approximately 25% slower code than 4.4
and 4.5 did for John the Ripper 1.7.8's bitslice DES implementation.  To
reproduce, download
http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8.tar.bz2 and
build it with "make linux-x86-64" (will use SSE2 intrinsics), "make
linux-x86-64-avx" (will use AVX instead), or "make generic" (won't use any
intrinsics).  Then run "../run/john -te=1".

With GCC 4.4 and 4.5, the "Traditional DES" benchmark reports a speed of
around 2500K c/s for the "linux-x86-64" (SSE2) build on a 2.33 GHz Core 2
(this is using one core).  With 4.6, this drops to about 1850K c/s.  A
similar slowdown was observed for AVX on Core i7-2600K when going from GCC
4.5.x to 4.6.x.  It is also reproducible for the without-intrinsics code,
although that's of less practical importance (the intrinsics are so much
faster).  A similar slowdown with GCC 4.6 was reported by a Mac OS X user.
It was also spotted by Phoronix in their recently published C compiler
benchmarks, but misinterpreted as a GCC vs. clang difference.

Adding "-Os" to OPT_INLINE in the Makefile partially corrects the
performance (to something like 2000K c/s - still 20% slower than GCC
4.4/4.5's).  Applying the OpenMP patch from
http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8-omp-des-4.diff.gz
and then running with OMP_NUM_THREADS=1 (for a fair comparison) corrects the
performance almost fully.  Keeping the patch applied, but removing -fopenmp,
still keeps the performance at a good level.  So it's some change made to
the source code by this patch that mitigates the GCC regression.

Similar behavior is seen with the current CVS version of John the Ripper,
even though it has OpenMP support for DES heavily revised and integrated
into the tree.
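Spelled out as a single sequence, the reproduction steps above amount to
(a sketch; assuming the usual john-1.7.8/src source tree layout):

$ wget http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8.tar.bz2
$ tar xjf john-1.7.8.tar.bz2
$ cd john-1.7.8/src
$ make linux-x86-64      # or linux-x86-64-avx, or generic
$ ../run/john -te=1      # note the "Traditional DES" c/s figure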
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #1 from Alexander Peslyak 2011-11-08 00:47:49 UTC ---
(In reply to comment #0)
> [...] Similar behavior is seen with the current CVS version of John the
> Ripper, even though it has OpenMP support for DES heavily revised and
> integrated into the tree.

I forgot to note that in the CVS version, I changed the default for
non-OpenMP builds to use the supplied SSE2 assembly code, which hides this
GCC issue for SSE2 non-OpenMP builds.  The C code may be re-enabled in
x86-64.h, or alternatively an -avx or generic build may be used.  (Yes, -avx
is still fully affected by the GCC regression even in the latest version of
the JtR code.)  But it is probably simpler to use the 1.7.8 release to
reproduce this bug anyway.
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #2 from Alexander Peslyak 2011-11-08 00:56:47 UTC ---
The affected code is in DES_bs_b.c: DES_bs_crypt_25().  (Sorry, I should
have mentioned that right away.)
[Bug web/51019] New: unclear documentation on -fomit-frame-pointer default for -Os and different platforms
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51019

             Bug #: 51019
            Summary: unclear documentation on -fomit-frame-pointer default
                     for -Os and different platforms
     Classification: Unclassified
            Product: gcc
            Version: 4.6.2
             Status: UNCONFIRMED
           Severity: normal
           Priority: P3
          Component: web
         AssignedTo: unassig...@gcc.gnu.org
         ReportedBy: solar-...@openwall.com

The texinfo documentation for GCC 4.6.2 says:

     Starting with GCC version 4.6, the default setting (when not
     optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
     targets has been changed to `-fomit-frame-pointer'.  The default
     can be reverted to `-fno-omit-frame-pointer' by configuring GCC
     with the `--enable-frame-pointer' configure option.

     Enabled at levels `-O', `-O2', `-O3', `-Os'.

The "when not optimizing for size" comment feels contradictory to having
"-Os" listed on the "Enabled at levels" line.  Also, it is not clear what
the default is on targets other than "32-bit Linux x86 and 32-bit Darwin
x86".

In practice, I observe the following behavior with GCC 4.6.2: on
Linux/x86_64, -fomit-frame-pointer is the default at both -O2 and -Os (I did
not test other levels); on Linux/i386, it is the default at -O2, but not at
-Os.  This needs to be documented more clearly.
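One way to check the effective default on a given target is to look for the
frame-pointer prologue in the generated assembly (a sketch; the test
function and file name are arbitrary):

$ echo 'int f(int x) { return x + 1; }' > fp.c
$ gcc -S -O2 -o - fp.c | grep -c bp    # 0 means the frame pointer is omitted
$ gcc -S -Os -o - fp.c | grep -c bp    # non-zero means push/mov of %ebp/%rbp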
[Bug target/13822] enable -fomit-frame-pointer or at least -momit-frame-pointer by default on x86/dwarf2 platforms
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13822

Alexander Peslyak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot
                   |                            |com

--- Comment #5 from Alexander Peslyak 2011-11-08 01:40:18 UTC ---
Shouldn't this bug be closed now, with GCC 4.6's change of default for
-fomit-frame-pointer?
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #4 from Alexander Peslyak 2012-01-03 04:45:43 UTC ---
(In reply to comment #3)
> It might be interesting to get numbers for the trunk.  There have been some
> register allocator fixes which might have improved this.

I've just tested the gcc-4.7-20111231 snapshot vs. the 4.6.2 release.
There's no improvement as it relates to this issue: I am getting the same
poor performance (a lot worse than for 4.5).  This is for generating x86-64
code with SSE2 intrinsics, benchmarking the resulting code on a Core 2'ish
CPU (I used a Xeon E5420 this time).
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #5 from Alexander Peslyak 2012-01-04 19:39:26 UTC ---
I wrote and ran some scripts to test many versions/snapshots of gcc.  It
turns out that 4.6-20100703 (the oldest 4.6 snapshot available for FTP) was
already affected by this regression, whereas 4.5-20111229 and 4.4-20120103
are not affected (as expected).  Also, it turns out that there was a smaller
regression on this same benchmark between 4.3 and 4.4.  That is, 4.3
produces the fastest code of all gcc versions I tested.  Here are some
numbers:

4.3.5 20100502 - 2950K c/s, 28229 bytes
4.3.6 20110626 - 2950K c/s, 28229 bytes
4.4.5 20100504 - 2697K c/s, 29764 bytes
4.4.7 20120103 - 2691K c/s, 29316 bytes
4.5.1 20100603 - 2729K c/s, 29203 bytes
4.5.4 20111229 - 2710K c/s, 29203 bytes
4.6.0 20100703 - 2133K c/s, 29911 bytes
4.6.0 20100807 - 2119K c/s, 29940 bytes
4.6.0 20100904 - 2142K c/s, 29848 bytes
4.6.0 20101106 - 2124K c/s, 29848 bytes
4.6.0 20101204 - 2114K c/s, 29624 bytes
4.6.3 20111230 - 2116K c/s, 29624 bytes
4.7.0 20111231 - 2147K c/s, 29692 bytes

These are for JtR 1.7.9 with DES_BS_ASM set to 0 on line 157 of x86-64.h (to
disable this version's workaround for this GCC 4.6 regression), built with
"make linux-x86-64" and run on one core of a Xeon E5420 2.5 GHz (the system
is otherwise idle).  The code sizes given are for .text of DES_bs_b.o (which
contains three similar functions, of which one is in use by this benchmark -
that is, the code size in the loop is about 10 KB).

As you can see, 4.3 generated code that was both significantly faster and a
bit smaller than all other versions'.  In 4.4, the speed decreased by 8.5%
and the code size increased by 4.4%.  4.5 corrected this to a very limited
extent - still 8% slower and 3.5% larger than 4.3's.  4.6 brought a huge
performance drop and a slight code size increase.  4.7.0 20111231's code is
still 27% slower than 4.3's.
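For reference, the change mentioned above is a one-line macro edit (a
sketch; the surrounding context of x86-64.h is not reproduced here):

/* x86-64.h (JtR 1.7.9): 0 selects the C (intrinsics) code, re-exposing the
 * regression; the release default hides it behind supplied assembly code */
#define DES_BS_ASM			0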
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #7 from Alexander Peslyak 2012-01-04 23:00:24 UTC ---
(I ran the tests below and wrote this comment before seeing Jakub's.  Then I
thought I'd post it anyway.)

Here are some numbers for gcc releases:

4.0.0 - 383K c/s, 71879 bytes (this old version of gcc generates function
calls for SSE2 intrinsics)
4.1.0 - 2959K c/s, 28182 bytes
4.1.2 - 2964K c/s, 28365 bytes
4.2.0 - 2968K c/s, 28363 bytes
4.2.4 - 2971K c/s, 28382 bytes
4.3.0 - 2971K c/s, 28229 bytes
4.3.6 - 2959K c/s, 28229 bytes
4.4.0 - 2625K c/s, 29770 bytes
4.4.6 - 2695K c/s, 29316 bytes
4.5.0 - 2729K c/s, 29203 bytes
4.5.3 - 2716K c/s, 29203 bytes
4.6.0 - 2111K c/s, 29624 bytes
4.6.2 - 2123K c/s, 29624 bytes

So things were really good for versions 4.1.0 through 4.3.6, but started to
get worse afterwards and got really bad with 4.6.  To be fair, things are
very different for some other hash/cipher types supported by JtR - e.g., for
Blowfish-based hashing we went from 560 c/s with 4.1.0 to 700 c/s with
4.6.2.

JtR 1.7.9 and 1.7.9-jumbo include a benchmark comparison tool called
relbench, which calculates the geometric mean, median, and some other
metrics for the multiple individual outputs from a pair of JtR benchmark
invocations (e.g., built with different versions of gcc).  In 1.7.9-jumbo-5,
there are over 160 individual benchmark outputs (for different
hashes/ciphers), and it may be built in a variety of ways (with/without
explicit assembly code, with/without intrinsics, etc.)  relbench combines
those 160+ outputs into a nice summary showing overall speedup/slowdown and
more.  It might be useful for testing future gcc versions for potential
performance regressions like this one.
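A typical comparison run might look like this (a sketch; relbench's
two-file invocation is assumed, and the file names are made up for the
example):

$ ./john --test > bench-gcc-4.3.txt    # binary built with gcc 4.3
$ ./john --test > bench-gcc-4.6.txt    # same source rebuilt with gcc 4.6
$ ./relbench bench-gcc-4.3.txt bench-gcc-4.6.txt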
[Bug target/54349] _mm_cvtsi128_si64 unnecessary stores value at stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349

Alexander Peslyak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot com

--- Comment #10 from Alexander Peslyak ---
I confirm that this is fixed in 4.9.  Since a lot of people are still using
pre-4.9 gcc and may stumble upon this bug, here's my experience with the bug
and with working around it:

The bug manifests itself the worst when only a pre-SSE4.1 instruction set is
available (such as when compiling for x86_64 with no -m... options given),
and (at least for me) especially on AMD Bulldozer: over 26% speedup from
fully working around the bug in a plain SSE2 build of yescrypt with Ubuntu
12.04's gcc 4.6.3 on FX-8120.  On Intel CPUs, the impact of the bug is
typically 5% to 10%.  Enabling SSE4.1 (or AVX or better) mostly mitigates
the bug, resulting in in-between or full speeds (varying by CPU), since
"(v)pextrq $0," is then generated and it is almost as good as "(v)movq" (but
not exactly).

The suggested "-mtune=corei7" workaround works, but is only recognized by
gcc 4.6 and up (thus, is only for versions 4.6.x to 4.8.x).  At source file
level, this works:

#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

A related bug is that those versions of gcc with that workaround wrongly
generate "movd" (as in e.g. "movd %xmm0, %rax") instead of "movq".  Luckily,
binutils primarily looks at the register names and silently corrects this
error (there's "movq" in the disassembly).

For a much wider range of gcc versions - 4.0 and up - this works:

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 9
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
	uint64_t result; \
	__asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
	result; \
})
#endif

A drawback of using inline asm for a single instruction is that it might
negatively affect gcc's instruction scheduling (where gcc ends up unaware of
the inlined instruction's timings).  However, on this specific occasion
(with yescrypt) I am not seeing any slowdown of such code compared to the
"tune=corei7" approach, nor compared to gcc 4.9+.  It just works for me.
Still, because of this concern, it might be wise to combine the two
approaches, only resorting to inline asm on pre-4.6 gcc:

/* gcc before 4.9 would unnecessarily use store/load (without SSE4.1) or
 * (V)PEXTR (with SSE4.1 or AVX) instead of simply (V)MOV. */
#if defined(__x86_64__) && \
    __GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

#include <stdint.h>
#include <emmintrin.h>

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 6
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
	uint64_t result; \
	__asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
	result; \
})
#endif

Unfortunately, unlike the pure inline asm workaround, this relies on
binutils correcting the "movd" for gcc 4.6.x to 4.8.x.  Oh well.

I've tested the above combined workaround on these gcc versions (and it
works):

4.0.0 4.1.0 4.1.2 4.2.0 4.2.4 4.3.0 4.3.6 4.4.0 4.4.1 4.4.2 4.4.3 4.4.4
4.4.5 4.4.6 4.5.0 4.5.3 4.6.0 4.6.2 4.7.0 4.7.4 4.8.0 4.8.4 4.9.0 4.9.2
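For context, here is a minimal function exposing the difference (a sketch;
the function name is made up).  With gcc 4.9+ this should compile to a
single (v)movq, whereas affected versions without SSE4.1 go through a stack
store/load:

#include <stdint.h>
#include <emmintrin.h>

uint64_t low64(__m128i x)
{
	/* ideally a single "(v)movq %xmm0,%rax" */
	return (uint64_t)_mm_cvtsi128_si64(x);
}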
[Bug target/54349] _mm_cvtsi128_si64 unnecessary stores value at stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349

--- Comment #11 from Alexander Peslyak ---
Turns out that gcc 4.6.x to 4.8.x generating "movd" instead of "movq" is
actually a deliberate hack, to support binutils older than 2.17 ("movq"
support committed in 2005, released in 2006) and (presumably) non-GNU
assemblers:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43215

Also related, on "vmovd":

https://sourceware.org/ml/binutils/2008-05/msg00257.html

Per H.J. Lu, this is because of an error in AMD's spec for x86-64.

More detail on this cursed intrinsic: gcc got the _mm_cvtsi128_si64x() (with
'x') form before it got Intel's _mm_cvtsi128_si64() name (without 'x').
(When using the inline asm workaround above, this does not matter, as the
macro brings the without-'x' form to older gcc as well.)  Older MSVC and
Open64 had bugs for the intrinsic (without 'x'):

http://www.thesalmons.org/john/random123/releases/1.08/docs/sse_8h_source.html#l00108

This refers to https://bugs.open64.net/show_bug.cgi?id=873 for the Open64
bug, and I had looked at it before, but unfortunately right now their bug
tracker refuses connections (for https; and gives 404 for that path with
http).  I have no detail on what the MSVC bug was.  Apparently, these could
result in incorrect computation at runtime (the comment at the URL above
mentions failed assertions).  Using _mm_extract_epi64(x, 0) is a workaround
(SSE4.1+, sometimes slower).
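That workaround would look like this (a sketch; the function name is made
up, and SSE4.1 must be enabled, e.g. with -msse4.1):

#include <stdint.h>
#include <smmintrin.h>	/* SSE4.1 */

uint64_t low64_sse41(__m128i x)
{
	/* compiles to "(v)pextrq $0" rather than "(v)movq" */
	return (uint64_t)_mm_extract_epi64(x, 0);
}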
[Bug tree-optimization/65427] New: ICE in emit_move_insn with wide vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65427

            Bug ID: 65427
           Summary: ICE in emit_move_insn with wide vector types
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: solar-gcc at openwall dot com

Created attachment 35037
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35037&action=edit
testcase

GCC 4.7.0 through at least 4.9.2 and the 5.0 20150215 snapshot (I haven't
tested newer ones) fails with an ICE when compiling the attached md5slice.c
testcase on and for Linux x86_64:

$ gcc md5slice.c -o md5slice -O2 -DVECTOR -Wno-attributes -ftree-loop-vectorize
md5slice.c: In function 'GG':
md5slice.c:302:27: internal compiler error: in emit_move_insn, at expr.c:3609
 static MAYBE_INLINE3 void GG(a, b, c, d, x, s, ac)
                           ^
0x6974d2 emit_move_insn(rtx_def*, rtx_def*)
	../../gcc/expr.c:3608
0x5e5294 expand_gimple_stmt_1
	../../gcc/cfgexpand.c:3288
0x5e5294 expand_gimple_stmt
	../../gcc/cfgexpand.c:3322
0x5e589b expand_gimple_basic_block
	../../gcc/cfgexpand.c:5162
0x5e7b56 gimple_expand_cfg
	../../gcc/cfgexpand.c:5741
0x5e7b56 execute
	../../gcc/cfgexpand.c:5961

Without -ftree-loop-vectorize, compilation succeeds.  With -O3, it fails
slightly differently:

$ gcc md5slice.c -o md5slice -O3 -DVECTOR -Wno-attributes
md5slice.c: In function 'II.constprop':
md5slice.c:328:27: internal compiler error: in emit_move_insn, at expr.c:3609
 static MAYBE_INLINE3 void II(a, b, c, d, x, s, ac)
                           ^
0x6974d2 emit_move_insn(rtx_def*, rtx_def*)
	../../gcc/expr.c:3608
0x5e5294 expand_gimple_stmt_1
	../../gcc/cfgexpand.c:3288
0x5e5294 expand_gimple_stmt
	../../gcc/cfgexpand.c:3322
0x5e589b expand_gimple_basic_block
	../../gcc/cfgexpand.c:5162
0x5e7b56 gimple_expand_cfg
	../../gcc/cfgexpand.c:5741
0x5e7b56 execute
	../../gcc/cfgexpand.c:5961

With -mavx or -mavx2, it succeeds (despite -O3).

GCC 4.7.0 does not have the -ftree-loop-vectorize option, but a similar
problem is seen with -O3:

$ gcc md5slice.c -o md5slice -O3 -DVECTOR -Wno-attributes
md5slice.c: In function 'GG':
md5slice.c:302:27: internal compiler error: in emit_move_insn, at expr.c:3435

So far, all of this is with:

typedef element vector __attribute__ ((vector_size (32)));

on line 41.  Reducing the vector width to 16 makes the plain SSE2
compilation succeed with any optimizations.  Conversely, increasing the
vector width to 64 makes compilation fail even with AVX/AVX2 enabled.

Ideally, when the vector type width is in excess of the current target
architecture's native SIMD vector width, GCC should transparently split it
into multiple sub-vectors of the natively supported width.  This is useful
not only for being able to build/use wider-vector source code for/on older
CPUs, but also to hide instruction latencies by having the compiler
interleave operations on the sub-vectors due to the extra parallelism the
excessive vector width provides.  For example, once this is supported, 32
could actually work faster than 16 on SSE2, and 64 faster than 32 on AVX2,
for some applications (as long as the register pressure does not become too
high).  Failing that, the compiler should at least report that this is
unsupported, rather than fail with an ICE.

With GCC 4.6.2 and older, the ICE does not occur, for the rather unfortunate
reason that (at least for me) these versions generate scalar code (so ~10x
slower) when the type's vector width exceeds what's supported natively.
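The construct in question uses GCC's generic vector extension; a
self-contained sketch of the same idea (with hypothetical names, not the
attached md5slice.c) would be:

/* 32-byte generic vectors, wider than the native SIMD width when only
 * SSE2 (16-byte) is enabled */
typedef unsigned int element;
typedef element vector __attribute__ ((vector_size (32)));

/* MD5's F() bitwise select, applied element-wise across the whole vector */
vector F(vector x, vector y, vector z)
{
	return (x & y) | (~x & z);
}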
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #9 from Alexander Peslyak ---
(In reply to Andrew Pinski from comment #8)
> Can you try GCC 4.9?

Yes.  Bad news: things mostly became even worse.  Same machine, same JtR
version, same test script as in my previous comment:

4.9.2 - 1849K c/s, 28256 bytes

The code size is back to 4.1.0 to 4.3.6 levels (good), but the performance
decreased by another 13% since 4.6.2 (and by 38% since it peaked with
4.3.0).  I ran this benchmark multiple times, and I also re-ran benchmarks
with some previous gcc versions to make sure this isn't caused by some
change in my environment - no, I am getting consistently poor results for
4.9.2, and the same results as before for other gcc versions.  I'll plan to
test with some versions in the range 4.7.0 to 4.9.0 next.  (I also see some
much smaller regressions with 4.9.2 for other hash types.)
[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #10 from Alexander Peslyak ---
I decided to take a look at the generated code.  Compared to 4.6.2, GCC
4.9.2 started generating lots of xorps, orps, andps, andnps where it
previously generated pxor, por, pand, pandn.  Changing those with:

sed -i 's/xorps/pxor/g; s/orps/por/g; s/andps/pand/g; s/andnps/pandn/g'

made no difference for performance on this machine (still 4.9.2's poor
performance).

The next suspects were the varieties of MOV instructions.  In 4.9.2's
generated code, there were 1319 movaps and 721 movups.  In 4.6.2's, there
were 1258 movaps and 465 movups.  Simply changing all movups to movaps in
4.9.2's original code with sed (thus, with no other changes except for this
one), resulting in a total of 2040 movaps, brought the performance to levels
similar to GCC 4.4 and 4.5's (better than 4.6's, but worse than 4.3's).  So
movups appears to be the main culprit.  The same hack for 4.6.2's code
brought its performance almost to 4.3's level (still 5% worse, though), and
significantly above 4.9.2's (so there's still some other, smaller regression
with 4.9.2).  Here are my new results:

4.1.0o - 2960K c/s, 28182 bytes, 1758 movaps, 0 movups
4.3.6o - 2956K c/s, 28229 bytes, 1755 movaps, 0 movups
4.4.6o - 2694K c/s, 29316 bytes, 1709 movaps, 7 movups
4.4.6h - 2714K c/s, 29316 bytes, 1716 movaps, 0 movups
4.5.3o - 2709K c/s, 29203 bytes, 1669 movaps, 0 movups
4.6.2o - 2121K c/s, 29624 bytes, 1258 movaps, 465 movups
4.6.2h - 2817K c/s, 29624 bytes, 1723 movaps, 0 movups
4.9.2o - 1852K c/s, 28256 bytes, 1319 movaps, 721 movups
4.9.2h - 2688K c/s, 28256 bytes, 2040 movaps, 0 movups

"o" means original, "h" means hacked generated assembly code (all movups
changed to movaps).  (BTW, there were no movdqa/movdqu in any of these code
versions.)

Now I am wondering to what extent this is a GCC issue and to what extent it
might be my source code's, if GCC is somehow unsure it can assume alignment.
What are the conditions when GCC should in fact use movups?  Is it
intentional that newer versions of GCC are being more careful about this,
resulting in worse performance?
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #12 from Alexander Peslyak ---
(In reply to Richard Biener from comment #11)
> I wonder if you could share the exact CPU type you are using?

This is on a (dual) Xeon E5420 (using only one core for these benchmarks),
but there was a similar slowdown with GCC 4.6 on other Core 2'ish CPUs as
well (such as desktop Core 2 Duo CPUs).  You might not call these "modern".

> Note that we have to use movups because [...]

Thank you for looking into this.  I still have a question, though: does this
mean you're treating older GCC's behavior, where it dared to use movaps
anyway, as a bug?  I was under the impression that with most SSE*/AVX*
intrinsics (except for those explicitly defined to do unaligned
loads/stores) natural alignment is assumed and is supposed to be provided by
the programmer.  Not only with GCC, but with compilers for x86(-64) in
general.  I thought this was part of the contract: I use intrinsics and I
guarantee alignment.  (Things would certainly not work for me, at least with
older GCC, if I assumed the compiler would use unaligned loads whenever it
was unsure of alignment.)  Was I wrong, or has this changed (in GCC? or in
some compiler-neutral specification?), or is GCC wrong in not assuming
alignment now?  Is there a command-line option to ask GCC to assume
alignment, like it did before?
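For reference, the intrinsics themselves do distinguish the two cases; a
sketch of the contract as described above (function names made up):

#include <emmintrin.h>

/* Aligned load: the programmer guarantees 16-byte alignment of p, and the
 * compiler may emit movaps/movdqa. */
__m128i load_aligned(const __m128i *p)
{
	return _mm_load_si128(p);
}

/* Explicitly unaligned load: no alignment guarantee, movups/movdqu. */
__m128i load_unaligned(const __m128i *p)
{
	return _mm_loadu_si128(p);
}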
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #13 from Alexander Peslyak ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was
using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops
-finline-functions" - yes, with -Os added after -O2 when compiling this
specific source file.  IIRC, this was experimentally derived as producing
the best performance with 4.6.x or older.  Adding -fno-tree-pre after all of
these options merely changes the label names in the generated assembly code,
while resulting in identical object files (and obviously no performance
change).

Also, I now realize -Os was probably the reason why GCC preferred SSE
"floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones
(they have longer encodings).  Omitting -Os results in usage of the SSE2
instructions (both bitwise and MOVs), with correspondingly larger code.  And
yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the
same performance, and then to s/movdqu/movdqa/g to regain almost the full
speed (movdqu is just as slow as movups on this CPU).  I've just tested all
of this with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8).
So I think you uncovered yet another performance regression I had already
worked around with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC
4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level.
(But the .text size would be significantly larger because of the SSE2
instruction encodings.  This is why I show the assembly code sizes for this
comparison.)
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #14 from Alexander Peslyak ---
For completeness, here are the results for 4.7.x, 4.8.x, and 4.9.0:

4.7.0o - 2142K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.0h - 2823K c/s, 29692 bytes, 1732 movaps, 0 movups
4.7.4o - 2144K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.4h - 2827K c/s, 29692 bytes, 1732 movaps, 0 movups
4.8.0o - 1825K c/s, 27813 bytes, 1341 movaps, 721 movups
4.8.0h - 2792K c/s, 27813 bytes, 2062 movaps, 0 movups
4.8.4o - 1827K c/s, 27807 bytes, 1341 movaps, 721 movups
4.8.4h - 2786K c/s, 27807 bytes, 2062 movaps, 0 movups
4.9.0o - 1852K c/s, 28262 bytes, 1319 movaps, 721 movups
4.9.0h - 2685K c/s, 28262 bytes, 2040 movaps, 0 movups

4.8 produces the smallest code so far, but even with the aligned loads hack
it is still 6% slower than 4.3.  All of these are with "-O2
-fomit-frame-pointer -Os -funroll-loops -finline-functions", like the
similar results I had posted before.  Xeon E5420, x86_64.
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #17 from Alexander Peslyak ---
(In reply to Richard Biener from comment #16)
> I'm completely confused now as to what the original regression was reported
> against.

I'm sorry, I should have re-read my original description of the regression
before I wrote comment 13.  Together, these are indeed confusing.

> I thought it was the default options in the Makefile, -O2
> -fomit-frame-pointer, which showed the regression and you found -Os would
> mitigate it somewhat (and I more specifically told you it is -fno-tree-pre
> that makes the actual difference).

That's one of the regressions I mentioned in the original description.  Yes,
you identified -fno-tree-pre as the component of -Os that makes the
difference - thank you!  However, I also mentioned in the original
description that a bigger regression with 4.6+ vs. 4.5 and 4.4 remained
despite -Os, and I had no similar workaround for it at the time (but
enabling -fopenmp made it go away, perhaps due to changes to declarations in
the source code in #ifdef _OPENMP blocks).  I think we can now say that this
bigger 4.6+ regression was primarily caused by the unaligned load
instructions.  So two regressions are figured out, and the remaining
slowdown (not investigated yet) vs. 4.1 to 4.3 (which worked best) is only
6% to 10% in recent versions (9% in 4.9.2).

> So - what options give good results with old compilers but bad results with
> new compilers?

On CPUs where movups/movdqu are slower than their aligned counterparts (for
addresses that happen to be aligned), any sane optimization options for 4.6+
give bad results as compared to pre-4.6 with the same options.  As you say,
this can be fixed in the source code (and I most likely will fix it there),
but I think many other programs may experience similar slowdowns, so maybe
GCC should do something about this too.  Other than that, either -Os or
-fno-tree-pre works around the second worst slowdown seen in 4.6+.

To avoid confusion, maybe this bug should focus on one of the three
regressions?  Should we keep it for PRE only?  Should we create a new bug
for the unnecessary and non-optional use of unaligned load instructions for
source code like this, or is this considered the new intended behavior
despite the major slowdown on such CPUs?  (Presumably not only for JtR.  I'd
expect this to affect many programs.)  Should we also create a bug for
investigating the remaining slowdown of 9% in 4.9.2 (vs. 4.1 to 4.3), or is
it considered too minor to bother?  Thank you!
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #18 from Alexander Peslyak ---
(In reply to Richard Biener from comment #11)
> Note that we have to use movups because DES_bs_all is not aligned as seen
> from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
> CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
> unaligned moves are the sources fault.  Annotating that with CC_CACHE_ALIGN
> produces the desired movaps instructions

Confirmed also with GCC 4.9.2 on JtR 1.8.0's version of the code.

> (with no effect on performance for me).

... with the expected performance improvement for me.  I'll commit this fix.
Thanks again!
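In code terms, the pitfall and the fix look roughly like this (a sketch; the
struct layout and the alignment value are hypothetical, while DES_bs_all,
DES_bs.c/DES_bs.h, and CC_CACHE_ALIGN are as discussed above):

/* as in JtR's arch headers; the exact alignment value is assumed here */
#define CC_CACHE_ALIGN __attribute__ ((aligned (64)))

struct DES_bs_combined { unsigned int v[64]; };	/* hypothetical layout */

/* DES_bs.h before the fix: no alignment visible to other translation
 * units, so DES_bs_b.c had to be compiled without the guarantee and newer
 * GCC emitted movups: */
extern struct DES_bs_combined DES_bs_all;

/* DES_bs.c: only the definition carried the attribute: */
CC_CACHE_ALIGN struct DES_bs_combined DES_bs_all;

/* The fix: annotate the declaration in DES_bs.h as well, i.e.
 *   extern CC_CACHE_ALIGN struct DES_bs_combined DES_bs_all;
 * so every user sees the guarantee and GCC can emit movaps. */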
[Bug tree-optimization/59124] [4.8/4.9/5 Regression] Wrong warnings "array subscript is above array bounds"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59124

Alexander Peslyak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot com

--- Comment #8 from Alexander Peslyak ---
Here's another testcase:

$ gcc -S -Wall -O2 -funroll-loops testcase.c
testcase.c: In function 'DES_std_set_key':
testcase.c:14:17: warning: array subscript is above array bounds
[-Warray-bounds]
   while (DES_key[i++]) k += 2;
                 ^

=== 8< ===
static int DES_KS_updates;
static char DES_key[16];

void DES_std_set_key(char *key)
{
	int i, j, k, l;

	j = key[0];

	for (k = i = 0; (l = DES_key[i]) && (j = key[i]); i++)
		;
	if (!j) {
		j = i;
		while (DES_key[i++]) k += 2;
	}
	if (k < j && ++DES_KS_updates) {
	}
	DES_key[0] = key[0];
}
=== >8 ===

GCC 4.7.4 and below report no warning; 4.8.0 and 4.9.2 report the warning
above.  Either -O2 -funroll-loops or -O3 results in the warning; simple -O2
does not.  While i++ could potentially run beyond the end of DES_key[],
depending on what's in DES_key[] and key[], this isn't the case in the
program this snippet is taken from (and simplified), whereas the warning
definitively claims "is" rather than "might be".

For comparison, Dmitry's first testcase (from this bug's description)
results in no warning with -O2 -funroll-loops (but does give the warning to
me with -O3, as reported by Dmitry), whereas his second testcase (from
comment 2) also reports the warning with -O2 -funroll-loops (but not with
simple -O2).  I tested this with 4.9.2.

I hope this is similar enough to add to this bug (same affected versions,
one of the two testcases also affected by -funroll-loops).
[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #19 from Alexander Peslyak ---
(In reply to Alexander Peslyak from comment #17)
> Should we create a new bug for the unnecessary and non-optional use of
> unaligned load instructions for source code like this, or is this
> considered the new intended behavior despite the major slowdown on such
> CPUs?  (Presumably not only for JtR.  I'd expect this to affect many
> programs.)

Upon further analysis, I now think that this was my fault, and (presumably)
not common in other programs.  What I had was a differing definition vs.
declaration, so a bug.  The lack of an alignment specification in the
declaration of the struct essentially told (newer) GCC not to assume
alignment - to an extent greater than e.g. a pointer would.  As far as I can
tell, GCC does not currently produce unaligned load instructions (so assumes
that SSE* vectors are properly aligned) when all it has is a pointer coming
from another object file.  I think that's the common scenario, whereas mine
was uncommon (and incorrect).  So let's focus on PRE only.
[Bug tree-optimization/59124] [4.8/4.9/5 Regression] Wrong warnings "array subscript is above array bounds"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59124

--- Comment #9 from Alexander Peslyak ---
(In reply to Alexander Peslyak from comment #8)
> $ gcc -S -Wall -O2 -funroll-loops testcase.c
> testcase.c: In function 'DES_std_set_key':
> testcase.c:14:17: warning: array subscript is above array bounds

With GCC 5.0.0 20150215, this warning is gone.  I also confirm that Dmitry's
comment #2 warning is gone.  The original one from this bug's description
remains.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #17 from solar-gcc at openwall dot com 2010-08-24 11:07 ---
(In reply to comment #16)
> I would really like to see this bug tackled.

I second that.

> Fixing it is easily done by lowering the spin count as proposed.  Otherwise,
> please show cases where a low spin count hurts performance.

Unfortunately, yes, I've since identified real-world test cases where
GOMP_SPINCOUNT=1 hurts performance significantly (compared to gcc 4.5.0's
default).  Specifically, this was the case when I experimented with my John
the Ripper patches on a dual-X5550 system (16 logical CPUs).  On a few
real-world'ish runs, GOMP_SPINCOUNT=1 would halve the speed.  On most other
tests I ran, it would slow things down by about 10%.  That's on an otherwise
idle system.  I was surprised, as I had previously only seen
GOMP_SPINCOUNT=1 hurt performance on systems with server-like unrelated load
(and it would help tremendously with certain other kinds of load).

> In general, for a tuning parameter, a good-natured rather value should be
> preferred over a value that gives best results in one case, but very bad
> ones in another case.

In general, I agree.  Even the 50% worst-case slowdown I observed with
GOMP_SPINCOUNT=1 is not as bad as the 400x worst-case slowdown observed
without that option.  On the other hand, a 50% slowdown would be fatal as it
relates to comparison of libgomp vs. competing implementations.  Also, HPC
cluster nodes may well be allocated such that there's no other load on each
individual node.  So having the defaults tuned for a system with no other
load makes some sense to me, and I am really unsure whether simply changing
the defaults is the proper fix here.  I'd be happy to see this problem fixed
differently, such that the unacceptable slowdowns are avoided in "both"
cases.  Maybe the new default could be to auto-tune the setting while the
program is running?

Meanwhile, if it's going to take a long time until we have a code fix,
perhaps the problem and the workaround need to be documented prominently.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #19 from solar-gcc at openwall dot com 2010-08-24 12:18 ---
(In reply to comment #18)
> Then, at the start of the spinning libgomp could initialize that flag and
> check it from time to time (say every few hundred or thousand iterations)
> whether it has lost the CPU.

Without a kernel API like that, you can achieve a similar effect by issuing
the rdtsc instruction (or its equivalents for non-x86 archs) and seeing if
the cycle counter changes unexpectedly (say, by 1000 or more for a single
loop iteration), which would indicate that there was a context switch.  For
an arch-independent implementation, you could also use a syscall such as
times(2) or gettimeofday(2), but then you'd need to do it very infrequently
(e.g., maybe just to see if there's a context switch between 10k to 20k
spins).

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
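A minimal sketch of the rdtsc idea from the comment above (x86-64 inline
asm; the 1000-cycle threshold is the ballpark figure mentioned there, and
all names are illustrative, not libgomp's actual code):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Spin on a flag, but fall back to blocking if we appear to have been
 * scheduled away (detected as a large jump in the cycle counter). */
void spin_until(volatile int *flag)
{
	uint64_t prev = rdtsc();

	while (!*flag) {
		uint64_t now = rdtsc();
		if (now - prev > 1000)
			break;	/* likely context switch: stop spinning, sleep instead */
		prev = now;
	}
}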
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #22 from solar-gcc at openwall dot com 2010-09-05 11:37 ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=10?

Unfortunately, I no longer have access to the dual-X5550 system, and I did
not try other values for this parameter when I was benchmarking that system.
On systems that I do currently have access to, the slowdown from
GOMP_SPINCOUNT=1 was typically no more than 10% (and most of the time there
was either no effect or a substantial speedup).  I can try 10 on those,
although it'd be difficult to tell the difference from 1 because of the
changing load.  I'll plan on doing this next time I run this sort of
benchmarks.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #23 from Alexander Peslyak 2010-11-09 16:32:53 UTC ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=10?

I reproduced slowdowns of 5% to 35% (on different pieces of code) on an
otherwise-idle dual-E5520 system (16 logical CPUs) when going from gcc
4.5.0's defaults to GOMP_SPINCOUNT=1.  On all but one test, the original
full speed is restored with GOMP_SPINCOUNT=10.  On the remaining test, the
threshold appears to be between 10 (still 35% slower than full speed) and 20
(original full speed).  So if we're not going to have a code fix soon
enough, maybe the new default should be slightly higher than 20.  It won't
help as much as 1 would for cases where this is needed, but it would be of
some help.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #25 from Alexander Peslyak 2010-11-12 11:19:13 UTC ---
(In reply to comment #24)
> If only one out of 35 tests becomes slower,

You might have misread what I wrote.  I did not mention "35 tests"; I
mentioned that a test became slower by 35%.  The total number of different
tests was 4 (and each was invoked multiple times per spincount setting,
indeed).  One out of four stayed 35% slower until I increased GOMP_SPINCOUNT
to 20.

> I would rather blame it to this one (probably badly parallelized)
> application, not the OpenMP runtime system.

This makes some sense, but the job of an optimizing compiler and runtime
libraries is to deliver the best performance they can, even with somewhat
non-optimal source code.  There are plenty of real-world cases where
spending time on application redesign for speed is unreasonable or can only
be completed at a later time - yet it is desirable to squeeze a little bit
of extra performance out of the existing code.  There are also cases where
more efficient parallelization - implemented at a higher level to avoid
frequent switches between parallel and sequential execution - makes the
application harder to use.  To me, one of the very reasons to use OpenMP was
to avoid/postpone that redesign and the user-visible complication for now.
If I went for a more efficient higher-level solution, I would not need
OpenMP in the first place.

> So I would suggest a threshold of 10 for now.

My suggestion is 25.

> IMHO, something should really happen to this problem before the 4.6 release.

Agreed.  It'd be best to have a code fix, though.
[Bug libgomp/43706] scheduling two threads on one core leads to starvation
--- Comment #14 from solar-gcc at openwall dot com 2010-07-02 01:39 ---
We're also seeing this problem on OpenMP-using code built with the gcc 4.5.0
release on linux-x86_64.  Here's a user's report (400x slowdown on an 8-core
system when there's a single other process running on a CPU):

http://www.openwall.com/lists/john-users/2010/06/30/3

Here's my confirmation of the problem report (I easily reproduced similar
slowdowns), and workarounds:

http://www.openwall.com/lists/john-users/2010/06/30/6

GOMP_SPINCOUNT=1 (this specific value) turned out to be nearly optimal in
cases affected by this problem, as well as on idle systems, although I was
also able to identify cases (with server-like unrelated load: short requests
to many processes, which quickly go back to sleep) where this setting
lowered the measured best-case speed by 15% (over multiple benchmark
invocations), even though it might have improved the average speed even in
those cases.

All of this is reproducible with the John the Ripper 1.7.6 release on
Blowfish hashes ("john --test --format=bf") and with the -omp-des patch
(current revision is 1.7.6-omp-des-4) on DES-based crypt(3) hashes ("john
--test --format=des").  The use of OpenMP needs to be enabled by
uncommenting the OMPFLAGS line in the Makefile.  JtR and the patch can be
downloaded from:

http://www.openwall.com/john/
http://openwall.info/wiki/john/patches

To reproduce the problem, it is sufficient to have one other CPU-using
process running when invoking the John benchmark.  I was using a non-OpenMP
build of John itself as that other process.

Overall, besides this specific "bug", OpenMP-using programs are very
sensitive to other system load - e.g., unrelated server-like load of 10%
often slows an OpenMP program down by 50%.  Any improvements in this area
would be very welcome.  However, this specific "bug" is extreme, with its
400x slowdowns, so perhaps it is to be treated with priority.

Jakub - thank you for your work on gcc's OpenMP support.  The ease of use is
great!

--

solar-gcc at openwall dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |solar-gcc at openwall dot
                   |                            |com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
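For reference, the GOMP_SPINCOUNT workaround described in the linked posts
is applied at run time via the environment, e.g. (a sketch; the path to the
OpenMP-enabled john build is assumed):

$ GOMP_SPINCOUNT=1 ./john --test --format=bf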