[Bug middle-end/51017] New: GCC 4.6 performance regression (vs. 4.4/4.5)

2011-11-07 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

 Bug #: 51017
   Summary: GCC 4.6 performance regression (vs. 4.4/4.5)
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: solar-...@openwall.com


GCC 4.6 happens to produce approx. 25% slower code on at least x86_64 than 4.4
and 4.5 did for John the Ripper 1.7.8's bitslice DES implementation.  To
reproduce, download
http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8.tar.bz2 and
build it with "make linux-x86-64" (will use SSE2 intrinsics), "make
linux-x86-64-avx" (will use AVX instead), or "make generic" (won't use any
intrinsics).  Then run "../run/john -te=1".  With GCC 4.4 and 4.5, the
"Traditional DES" benchmark reports a speed of around 2500K c/s for the
"linux-x86-64" (SSE2) build on a 2.33 GHz Core 2 (this is using one core). 
With 4.6, this drops to about 1850K c/s.  Similar slowdown was observed for AVX
on Core i7-2600K when going from GCC 4.5.x to 4.6.x.  And it is reproducible
for the without-intrinsics code as well, although that's of less practical
importance (the intrinsics are so much faster).  Similar slowdown with GCC 4.6
was reported by a Mac OS X user.  It was also spotted by Phoronix in their
recently published C compiler benchmarks, but misinterpreted as a GCC vs. clang
difference.

Adding "-Os" to OPT_INLINE in the Makefile partially corrects the performance
(to something like 2000K c/s - still 20% slower than GCC 4.4/4.5's).  Applying
the OpenMP patch from
http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8-omp-des-4.diff.gz
and then running with OMP_NUM_THREADS=1 (for a fair comparison) corrects the
performance almost fully.  Keeping the patch applied, but removing -fopenmp
still keeps the performance at a good level.  So it's some change made to the
source code by this patch that mitigates the GCC regression.  Similar behavior
is seen with current CVS version of John the Ripper, even though it has OpenMP
support for DES heavily revised and integrated into the tree.


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2011-11-07 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #1 from Alexander Peslyak  
2011-11-08 00:47:49 UTC ---
(In reply to comment #0)
> [...] Similar behavior
> is seen with current CVS version of John the Ripper, even though it has OpenMP
> support for DES heavily revised and integrated into the tree.

I forgot to note that in the CVS version, I changed the default for non-OpenMP
builds to use the supplied SSE2 assembly code, which hides this GCC issue for
SSE2 non-OpenMP builds.  The C code may be re-enabled in x86-64.h, or
alternatively an -avx or generic build may be used.  (Yes, -avx is still fully
affected by the GCC regression even in the latest version of JtR code.)

But it is probably simpler to use the 1.7.8 release to reproduce this bug
anyway.


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2011-11-07 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #2 from Alexander Peslyak  
2011-11-08 00:56:47 UTC ---
The affected code is in DES_bs_b.c: DES_bs_crypt_25().  (Sorry, I should have
mentioned that right away.)


[Bug web/51019] New: unclear documentation on -fomit-frame-pointer default for -Os and different platforms

2011-11-07 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51019

 Bug #: 51019
   Summary: unclear documentation on -fomit-frame-pointer default
for -Os and different platforms
Classification: Unclassified
   Product: gcc
   Version: 4.6.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: solar-...@openwall.com


The texinfo documentation for GCC 4.6.2 says:

 Starting with GCC version 4.6, the default setting (when not
 optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86
 targets has been changed to `-fomit-frame-pointer'.  The default
 can be reverted to `-fno-omit-frame-pointer' by configuring GCC
 with the `--enable-frame-pointer' configure option.

 Enabled at levels `-O', `-O2', `-O3', `-Os'.

The "when not optimizing for size" comment feels contradictory to having "-Os"
listed on the "Enabled at levels" line.  Also, it is not clear what the default
is on targets other than "32-bit Linux x86 and 32-bit Darwin x86".  In practice, I
observe the following behavior with GCC 4.6.2: on Linux/x86_64,
-fomit-frame-pointer is the default at both -O2 and -Os (I did not test
others); on Linux/i386, it is the default at -O2, but not at -Os.  This needs
to be documented more clearly.


[Bug target/13822] enable -fomit-frame-pointer or at least -momit-frame-pointer by default on x86/dwarf2 platforms

2011-11-07 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13822

Alexander Peslyak  changed:

   What|Removed |Added

 CC||solar-gcc at openwall dot com

--- Comment #5 from Alexander Peslyak  
2011-11-08 01:40:18 UTC ---
Shouldn't this bug be closed now, with GCC 4.6's change of default for
-fomit-frame-pointer?


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2012-01-02 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #4 from Alexander Peslyak  
2012-01-03 04:45:43 UTC ---
(In reply to comment #3)
> It might be interesting to get numbers for the trunk.  There have been some
> register allocator fixes which might have improved this.

I've just tested the gcc-4.7-20111231 snapshot vs. 4.6.2 release.  There's no
improvement as it relates to this issue: I am getting the same poor performance
(a lot worse than for 4.5).  This is for generating x86-64 code with SSE2
intrinsics, benchmarking the resulting code on a Core 2'ish CPU (I used Xeon
E5420 this time).


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2012-01-04 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #5 from Alexander Peslyak  
2012-01-04 19:39:26 UTC ---
I wrote and ran some scripts to test many versions/snapshots of gcc.  It turns
out that 4.6-20100703 (oldest 4.6 snapshot available for FTP) was already
affected by this regression, whereas 4.5-20111229 and 4.4-20120103 are not
affected (as expected).  Also, it turns out that there was a smaller regression
on this same benchmark between 4.3 and 4.4.  That is, 4.3 produces the fastest
code of all gcc versions I tested.  Here are some numbers:

4.3.5 20100502 - 2950K c/s, 28229 bytes
4.3.6 20110626 - 2950K c/s, 28229 bytes
4.4.5 20100504 - 2697K c/s, 29764 bytes
4.4.7 20120103 - 2691K c/s, 29316 bytes
4.5.1 20100603 - 2729K c/s, 29203 bytes
4.5.4 20111229 - 2710K c/s, 29203 bytes
4.6.0 20100703 - 2133K c/s, 29911 bytes
4.6.0 20100807 - 2119K c/s, 29940 bytes
4.6.0 20100904 - 2142K c/s, 29848 bytes
4.6.0 20101106 - 2124K c/s, 29848 bytes
4.6.0 20101204 - 2114K c/s, 29624 bytes
4.6.3 20111230 - 2116K c/s, 29624 bytes
4.7.0 20111231 - 2147K c/s, 29692 bytes

These are for JtR 1.7.9 with DES_BS_ASM set to 0 on line 157 of x86-64.h (to
disable this version's workaround for this GCC 4.6 regression), built with
"make linux-x86-64" and run on one core in a Xeon E5420 2.5 GHz (the system is
otherwise idle).  The code sizes given are for .text of DES_bs_b.o (which
contains three similar functions, of which one is in use by this benchmark -
that is, the code size in the loop is about 10 KB).

As you can see, 4.3 generated code that was both significantly faster and a bit
smaller than all other versions'.  In 4.4, the speed decreased by 8.5% and code
size increased by 4.4%.  4.5 corrected this to a very limited extent - still 8%
slower and 3.5% larger than 4.3's.  4.6 brought a huge performance drop and a
slight code size increase.  4.7.0 20111231's code is still 27% slower than
4.3's.


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2012-01-04 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #7 from Alexander Peslyak  
2012-01-04 23:00:24 UTC ---
(I ran the tests below and wrote this comment before seeing Jakub's.  Then I
thought I'd post it anyway.)

Here are some numbers for gcc releases:

4.0.0 - 383K c/s, 71879 bytes (this old version of gcc generates function calls
for SSE2 intrinsics)
4.1.0 - 2959K c/s, 28182 bytes
4.1.2 - 2964K c/s, 28365 bytes
4.2.0 - 2968K c/s, 28363 bytes
4.2.4 - 2971K c/s, 28382 bytes
4.3.0 - 2971K c/s, 28229 bytes
4.3.6 - 2959K c/s, 28229 bytes
4.4.0 - 2625K c/s, 29770 bytes
4.4.6 - 2695K c/s, 29316 bytes
4.5.0 - 2729K c/s, 29203 bytes
4.5.3 - 2716K c/s, 29203 bytes
4.6.0 - 2111K c/s, 29624 bytes
4.6.2 - 2123K c/s, 29624 bytes

So things were really good for versions 4.1.0 through 4.3.6, but started to get
worse afterwards and got really bad with 4.6.

To be fair, things are very different for some other hash/cipher types
supported by JtR - e.g., for Blowfish-based hashing we went from 560 c/s for
4.1.0 to 700 c/s for 4.6.2.

JtR 1.7.9 and 1.7.9-jumbo include a benchmark comparison tool called
relbench, which calculates geometric mean, median, and some other metrics for
multiple individual outputs from a pair of JtR benchmark invocations (e.g.,
built with different versions of gcc).  In 1.7.9-jumbo-5, there are over 160
individual benchmark outputs (for different hashes/ciphers), and it may be built
in a variety of ways (with/without explicit assembly code, with/without
intrinsics, etc.).  relbench combines those 160+ outputs into a nice summary
showing overall speedup/slowdown and more.  It might be useful for testing
future gcc versions for potential performance regressions like this.


[Bug target/54349] _mm_cvtsi128_si64 unnecessary stores value at stack

2016-02-26 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349

Alexander Peslyak  changed:

   What|Removed |Added

 CC||solar-gcc at openwall dot com

--- Comment #10 from Alexander Peslyak  ---
I confirm that this is fixed in 4.9.  Since a lot of people are still using
pre-4.9 gcc and may stumble upon this bug, here's my experience with the bug
and with working around it:

The bug manifests itself worst when only a pre-SSE4.1 instruction set is
available (such as when compiling for x86_64 with no -m... options given), and
(at least for me) especially on AMD Bulldozer: over 26% speedup from fully
working around the bug in plain SSE2 build of yescrypt with Ubuntu 12.04's gcc
4.6.3 on FX-8120.  On Intel CPUs, the impact of the bug is typically 5% to 10%.
 Enabling SSE4.1 (or AVX or better) mostly mitigates the bug, resulting in
intermediate or full speeds (varying by CPU), since "(v)pextrq $0," is then
generated and it is almost as good as "(v)movq" (but not exactly).

The suggested "-mtune=corei7" workaround works, but is only recognized by gcc
4.6 and up (thus, is only for versions 4.6.x to 4.8.x).  At source file level,
this works:

#if defined(__x86_64__) && \
__GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

A related bug is that those versions of gcc with that workaround wrongly
generate "movd" (as in e.g. "movd %xmm0, %rax") instead of "movq".  Luckily,
binutils primarily looks at the register names and silently corrects this error
(there's "movq" in the disassembly).

For a much wider range of gcc versions - 4.0 and up - this works:

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 9
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
uint64_t result; \
__asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
result; \
})
#endif

A drawback of using inline asm for a single instruction is that it might
negatively affect gcc's instruction scheduling (where gcc ends up unaware of
the inlined instruction's timings).  However, on this specific occasion (with
yescrypt) I am not seeing any slowdown of such code compared to the
"tune=corei7" approach, nor compared to gcc 4.9+.  It just works for me. 
Still, because of this concern, it might be wise to combine the two approaches,
only resorting to inline asm on pre-4.6 gcc:

/* gcc before 4.9 would unnecessarily use store/load (without SSE4.1) or
 * (V)PEXTR (with SSE4.1 or AVX) instead of simply (V)MOV. */
#if defined(__x86_64__) && \
__GNUC__ == 4 && __GNUC_MINOR__ >= 6 && __GNUC_MINOR__ < 9
#pragma GCC target ("tune=corei7")
#endif

#include <stdint.h>
#include <emmintrin.h>

#if defined(__x86_64__) && __GNUC__ == 4 && __GNUC_MINOR__ < 6
#ifdef __AVX__
#define MAYBE_V "v"
#else
#define MAYBE_V ""
#endif
#define _mm_cvtsi128_si64(x) ({ \
uint64_t result; \
__asm__(MAYBE_V "movq %1,%0" : "=r" (result) : "x" (x)); \
result; \
})
#endif

Unfortunately, unlike the pure inline asm workaround, this relies on binutils
correcting the "movd" for gcc 4.6.x to 4.8.x.  Oh well.

I've tested the above combined workaround on these gcc versions (and it works):
4.0.0 4.1.0 4.1.2 4.2.0 4.2.4 4.3.0 4.3.6 4.4.0 4.4.1 4.4.2 4.4.3 4.4.4 4.4.5
4.4.6 4.5.0 4.5.3 4.6.0 4.6.2 4.7.0 4.7.4 4.8.0 4.8.4 4.9.0 4.9.2

[Bug target/54349] _mm_cvtsi128_si64 unnecessary stores value at stack

2016-02-26 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54349

--- Comment #11 from Alexander Peslyak  ---
Turns out that gcc 4.6.x to 4.8.x generating "movd" instead of "movq" is
actually a deliberate hack, to support binutils older than 2.17 ("movq" support
committed in 2005, released in 2006) and (presumably) non-GNU assemblers:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43215

Also related, on "vmovd":

https://sourceware.org/ml/binutils/2008-05/msg00257.html

Per H.J. Lu, this is because of an error in AMD's spec for x86-64.

More detail on this cursed intrinsic: gcc got the _mm_cvtsi128_si64x() (with
'x') form before it got Intel's _mm_cvtsi128_si64() name (without 'x').  (When
using the inline asm workaround above, this does not matter, as the macro
provides the without-'x' form on older gcc as well.)  Older MSVC and Open64 had bugs for
the intrinsic (without 'x'):

http://www.thesalmons.org/john/random123/releases/1.08/docs/sse_8h_source.html#l00108

This refers to https://bugs.open64.net/show_bug.cgi?id=873 for the Open64 bug,
and I had looked at it before, but unfortunately right now their bug tracker
refuses connections (for https; and gives 404 for that path with http).  I have
no detail on what the MSVC bug was.  Apparently, these could result in
incorrect computation at runtime (the comment at the URL above mentions failed
assertions).  Using _mm_extract_epi64(x, 0) is a workaround (SSE4.1+, sometimes
slower).

[Bug tree-optimization/65427] New: ICE in emit_move_insn with wide vector types

2015-03-14 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65427

Bug ID: 65427
   Summary: ICE in emit_move_insn with wide vector types
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: solar-gcc at openwall dot com

Created attachment 35037
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35037&action=edit
testcase

GCC 4.7.0 through at least 4.9.2 and the 5.0 20150215 snapshot (I haven't tested
newer ones) fail with an ICE when compiling the attached md5slice.c testcase on
and for Linux x86_64:

$ gcc md5slice.c -o md5slice -O2 -DVECTOR -Wno-attributes -ftree-loop-vectorize
md5slice.c: In function 'GG':
md5slice.c:302:27: internal compiler error: in emit_move_insn, at expr.c:3609
 static MAYBE_INLINE3 void GG(a, b, c, d, x, s, ac)
   ^
0x6974d2 emit_move_insn(rtx_def*, rtx_def*)
../../gcc/expr.c:3608
0x5e5294 expand_gimple_stmt_1
../../gcc/cfgexpand.c:3288
0x5e5294 expand_gimple_stmt
../../gcc/cfgexpand.c:3322
0x5e589b expand_gimple_basic_block
../../gcc/cfgexpand.c:5162
0x5e7b56 gimple_expand_cfg
../../gcc/cfgexpand.c:5741
0x5e7b56 execute
../../gcc/cfgexpand.c:5961

Without -ftree-loop-vectorize, compilation succeeds.  With -O3, it fails
slightly differently:

$ gcc md5slice.c -o md5slice -O3 -DVECTOR -Wno-attributes 
md5slice.c: In function 'II.constprop':
md5slice.c:328:27: internal compiler error: in emit_move_insn, at expr.c:3609
 static MAYBE_INLINE3 void II(a, b, c, d, x, s, ac)
   ^
0x6974d2 emit_move_insn(rtx_def*, rtx_def*)
../../gcc/expr.c:3608
0x5e5294 expand_gimple_stmt_1
../../gcc/cfgexpand.c:3288
0x5e5294 expand_gimple_stmt
../../gcc/cfgexpand.c:3322
0x5e589b expand_gimple_basic_block
../../gcc/cfgexpand.c:5162
0x5e7b56 gimple_expand_cfg
../../gcc/cfgexpand.c:5741
0x5e7b56 execute
../../gcc/cfgexpand.c:5961

With -mavx or -mavx2, it succeeds (even with -O3).

GCC 4.7.0 does not have the -ftree-loop-vectorize option, but a similar problem
is seen with -O3:

$ gcc md5slice.c -o md5slice -O3 -DVECTOR -Wno-attributes
md5slice.c: In function 'GG':
md5slice.c:302:27: internal compiler error: in emit_move_insn, at expr.c:3435

So far, all of this is with:

typedef element vector __attribute__ ((vector_size (32)));

on line 41.  Reducing the vector width to 16 makes the plain SSE2 compilation
succeed with any optimizations.  Conversely, increasing the vector width to 64
makes compilation fail even with AVX/AVX2 enabled.

Ideally, when the vector type width is in excess of the current target
architecture's native SIMD vector width, GCC should transparently split it into
multiple sub-vectors of the natively supported width.  This is useful not only
for being able to build/use wider-vector source code for/on older CPUs, but
also to hide instruction latencies by having the compiler interleave operations
on the sub-vectors due to the extra parallelism the excessive vector width
provides.  For example, once this is supported 32 could actually work faster
than 16 on SSE2, and 64 faster than 32 on AVX2, for some applications (as long
as the register pressure does not become too high).

Failing that, at least the compiler should report that this is unsupported,
rather than fail with an ICE.

With GCC 4.6.2 and older, the ICE does not occur, for the rather unfortunate
reason that (at least for me) these versions generate scalar code (so ~10x
slower) when the type's vector width exceeds what's supported natively.


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2015-02-15 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #9 from Alexander Peslyak  ---
(In reply to Andrew Pinski from comment #8)
> Can you try GCC 4.9?

Yes.  Bad news: things mostly became even worse.  Same machine, same JtR
version, same test script as in my previous comment:

4.9.2 - 1849K c/s, 28256 bytes

The code size is back to 4.1.0 to 4.3.6 levels (good), but the performance
decreased by another 13% since 4.6.2 (and by 38% since it peaked with 4.3.0). 
I ran this benchmark multiple times, and I also re-ran benchmarks with some
previous gcc versions to make sure this isn't caused by some change in my
environment - no, I am getting consistently poor results for 4.9.2, and the
same results as before for other gcc versions.  I plan to test with some
versions in the range 4.7.0 to 4.9.0 next.

(I also see some much smaller regressions with 4.9.2 for other hash types.)


[Bug middle-end/51017] GCC 4.6 performance regression (vs. 4.4/4.5)

2015-02-15 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #10 from Alexander Peslyak  ---
I decided to take a look at the generated code.  Compared to 4.6.2, GCC 4.9.2
started generating lots of xorps, orps, andps, andnps where it previously
generated pxor, por, pand, pandn.  Changing those with:

sed -i 's/xorps/pxor/g; s/orps/por/g; s/andps/pand/g; s/andnps/pandn/g'

made no difference for performance on this machine (still 4.9.2's poor
performance).

The next suspects were the varieties of MOV instructions.  In 4.9.2's generated
code, there were 1319 movaps, 721 movups.  In 4.6.2's, there were 1258 movaps,
465 movups.  Simply changing all movups to movaps in 4.9.2's original code with
sed (thus, with no other changes except for this one), resulting in a total of
2040 movaps, brought the performance to levels similar to GCC 4.4 and 4.5's
(and is better than 4.6's, but worse than 4.3's).  So movups appear to be the
main culprit.  The same hack for 4.6.2's code brought its performance almost to
4.3's level (still 5% worse, though), and significantly above 4.9.2's (so
there's still some other, smaller regression with 4.9.2).

Here are my new results:

4.1.0o - 2960K c/s, 28182 bytes, 1758 movaps, 0 movups
4.3.6o - 2956K c/s, 28229 bytes, 1755 movaps, 0 movups
4.4.6o - 2694K c/s, 29316 bytes, 1709 movaps, 7 movups
4.4.6h - 2714K c/s, 29316 bytes, 1716 movaps, 0 movups
4.5.3o - 2709K c/s, 29203 bytes, 1669 movaps, 0 movups
4.6.2o - 2121K c/s, 29624 bytes, 1258 movaps, 465 movups
4.6.2h - 2817K c/s, 29624 bytes, 1723 movaps, 0 movups
4.9.2o - 1852K c/s, 28256 bytes, 1319 movaps, 721 movups
4.9.2h - 2688K c/s, 28256 bytes, 2040 movaps, 0 movups

"o" means original, "h" means hacked generated assembly code (all movups
changed to movaps).  (BTW, there were no movdqa/movdqu in any of these code
versions.)

Now I am wondering to what extent this is a GCC issue and to what extent it
might be my source code's, if GCC is somehow unsure it can assume alignment. 
What are the conditions when GCC should in fact use movups?  Is it intentional
that newer versions of GCC are being more careful at this, resulting in worse
performance?


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-16 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #12 from Alexander Peslyak  ---
(In reply to Richard Biener from comment #11)
> I wonder if you could share the exact CPU type you are using?

This is on (dual) Xeon E5420 (using only one core for these benchmarks), but
there was similar slowdown with GCC 4.6 on other Core 2'ish CPUs as well (such
as desktop Core 2 Duo CPUs). You might not call these "modern".

> Note that we have to use movups because [...]

Thank you for looking into this. I still have a question, though: does this
mean you're treating older GCC's behavior, where it dared to use movaps anyway,
as a bug?

I was under the impression that with most SSE*/AVX* intrinsics (except for those
explicitly defined to do unaligned loads/stores) natural alignment is assumed
and is supposed to be provided by the programmer. Not only with GCC, but with
compilers for x86(-64) in general. I thought this was part of the contract: I
use intrinsics and I guarantee alignment. (Things would certainly not work for
me at least with older GCC if I assumed the compiler would use unaligned loads
whenever it was unsure of alignment.) Was I wrong, or has this changed (in GCC?
or in some compiler-neutral specification?), or is GCC wrong in not assuming
alignment now?

Is there a command-line option to ask GCC to assume alignment, like it did
before?


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-16 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #13 from Alexander Peslyak  ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was
using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops
-finline-functions" - yes, with -Os added after -O2 when compiling this
specific source file.  IIRC, this was experimentally derived as producing best
performance with 4.6.x or older.  Adding -fno-tree-pre after all of these
options merely changes the label names in the generated assembly code, while
resulting in identical object files (and obviously no performance change). 
Also, I now realize -Os was probably the reason why GCC preferred SSE
"floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones
(they have longer encodings). Omitting -Os results in usage of the SSE2
instructions (both bitwise and MOVs), with correspondingly larger code. And
yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same
performance, and then to s/movdqu/movdqa/g to regain almost the full speed
(movdqu is just as slow as movups on this CPU). I've just tested all of this
with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8). So I think
you uncovered yet another performance regression I had already worked around
with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level. (But
the .text size would be significantly larger because of the SSE2 instruction
encodings.  This is why I show the assembly code sizes for this comparison.)


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-16 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #14 from Alexander Peslyak  ---
For completeness, here are the results for 4.7.x, 4.8.x, and 4.9.0:

4.7.0o - 2142K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.0h - 2823K c/s, 29692 bytes, 1732 movaps, 0 movups
4.7.4o - 2144K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.4h - 2827K c/s, 29692 bytes, 1732 movaps, 0 movups
4.8.0o - 1825K c/s, 27813 bytes, 1341 movaps, 721 movups
4.8.0h - 2792K c/s, 27813 bytes, 2062 movaps, 0 movups
4.8.4o - 1827K c/s, 27807 bytes, 1341 movaps, 721 movups
4.8.4h - 2786K c/s, 27807 bytes, 2062 movaps, 0 movups
4.9.0o - 1852K c/s, 28262 bytes, 1319 movaps, 721 movups
4.9.0h - 2685K c/s, 28262 bytes, 2040 movaps, 0 movups

4.8 produces the smallest code so far, but even with the aligned-loads hack it is
still 6% slower than 4.3.

All of these are with "-O2 -fomit-frame-pointer -Os -funroll-loops
-finline-functions", like similar results I had posted before.  Xeon E5420,
x86_64.


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-17 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #17 from Alexander Peslyak  ---
(In reply to Richard Biener from comment #16)
> I'm completely confused now as to what the original regression was reported
> against.

I'm sorry, I should have re-read my original description of the regression
before I wrote comment 13.  Together, these are indeed confusing.

> I thought it was the default options in the Makefile, -O2
> -fomit-frame-pointer, which showed the regression and you found -Os would
> mitigate it somewhat (and I more specifically told you it is -fno-tree-pre
> that makes the actual difference).

That's one of the regressions I mentioned in the original description.  Yes,
you identified -fno-tree-pre as the component of -Os that makes the difference
- Thank You!  However, I also mentioned in the original description that a
bigger regression with 4.6+ vs. 4.5 and 4.4 remained despite -Os, and I had
no similar workaround for it at the time (but enabling -fopenmp made it go
away, perhaps due to changes to declarations in the source code in #ifdef
_OPENMP blocks).  I think we can now say that this bigger 4.6+ regression was
primarily caused by the unaligned load instructions.  So two regressions are
figured out, and the remaining slowdown (not investigated yet) vs. 4.1 to 4.3
(which worked best) is only 6% to 10% in recent versions (9% in 4.9.2).

> So - what options give good results with old compilers but bad results with
> new compilers?

On CPUs where movups/movdqu are slower than their aligned counterparts (for
addresses that happen to be aligned), any sane optimization options of 4.6+
give bad results compared to pre-4.6 with the same options.  As you say, this
can be fixed in the source code (and I most likely will fix it there), but I
think many other programs may experience similar slowdowns, so maybe GCC should
do something about this too.

Other than that, either -Os or -fno-tree-pre works around the second worst
slowdown seen in 4.6+.

To avoid confusion, maybe this bug should focus on one of the three
regressions?  Should we keep it for PRE only?

Should we create a new bug for the unnecessary and non-optional use of
unaligned load instructions for source code like this, or is this considered
the new intended behavior despite the major slowdown on such CPUs?
(Presumably not only for JtR.  I'd expect this to affect many programs.)

Should we also create a bug for investigating the remaining slowdown of 9% in
4.9.2 (vs. 4.1 to 4.3), or is it considered too minor to bother?

Thank you!


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-17 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #18 from Alexander Peslyak  ---
(In reply to Richard Biener from comment #11)
> Note that we have to use movups because DES_bs_all is not aligned as seen
> from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
> CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
> unaligned moves are the sources fault.  Annotating that with CC_CACHE_ALIGN
> produces the desired movaps instructions

Confirmed also with GCC 4.9.2 on JtR 1.8.0's version of the code.

> (with no effect on performance for me).

... with the expected performance improvement for me.  I'll commit this fix. 
Thanks again!


[Bug tree-optimization/59124] [4.8/4.9/5 Regression] Wrong warnings "array subscript is above array bounds"

2015-02-17 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59124

Alexander Peslyak  changed:

   What|Removed |Added

 CC||solar-gcc at openwall dot com

--- Comment #8 from Alexander Peslyak  ---
Here's another testcase:

$ gcc -S -Wall -O2 -funroll-loops testcase.c 
testcase.c: In function 'DES_std_set_key':
testcase.c:14:17: warning: array subscript is above array bounds
[-Warray-bounds]
   while (DES_key[i++]) k += 2;
 ^

=== 8< ===
static int DES_KS_updates;
static char DES_key[16];

void DES_std_set_key(char *key)
{
int i, j, k, l;

j = key[0];
for (k = i = 0; (l = DES_key[i]) && (j = key[i]); i++)
;

if (!j) {
j = i;
while (DES_key[i++]) k += 2;
}

if (k < j && ++DES_KS_updates) {
}

DES_key[0] = key[0];
}
=== >8 ===

GCC 4.7.4 and below report no warning, 4.8.0 and 4.9.2 report the warning
above.  Either -O2 -funroll-loops or -O3 result in the warning; simple -O2 does
not.  While i++ could potentially run beyond the end of DES_key[], depending on
what's in DES_key[] and key[], this isn't the case in the program this snippet
is taken from (and simplified), whereas the warning definitively claims "is"
rather than "might be".

For comparison, Dmitry's first testcase (from this bug's description) results
in no warning with -O2 -funroll-loops (but does give the warning to me with
-O3, as reported by Dmitry), whereas his second testcase (from comment 2) also
reports the warning with -O2 -funroll-loops (but not with simple -O2).  I
tested this with 4.9.2.

I hope this is similar enough to add to this bug (same affected versions, one
of the two testcases also affected by -funroll-loops).


[Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure

2015-02-17 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #19 from Alexander Peslyak  ---
(In reply to Alexander Peslyak from comment #17)
> Should we create a new bug for the unnecessary and non-optional use of
> unaligned load instructions for source code like this, or is this considered
> the new intended behavior despite the major slowdown on such CPUs? 
> (Presumably not only for JtR.  I'd expect this to affect many programs.)

Upon further analysis, I now think that this was my fault, and (presumably) not
common in other programs.  What I had was a definition that differed from its
declaration - in other words, a bug.  The lack of an alignment specification in
the declaration of the struct essentially told (newer) GCC not to assume
alignment - to a greater extent than it would for, e.g., a plain pointer.  As
far as I can tell, GCC does not currently
produce unaligned load instructions (so assumes that SSE* vectors are properly
aligned) when all it has is a pointer coming from another object file.  I think
that's the common scenario, whereas mine was uncommon (and incorrect).

So let's focus on PRE only.


[Bug tree-optimization/59124] [4.8/4.9/5 Regression] Wrong warnings "array subscript is above array bounds"

2015-02-17 Thread solar-gcc at openwall dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59124

--- Comment #9 from Alexander Peslyak  ---
(In reply to Alexander Peslyak from comment #8)
> $ gcc -S -Wall -O2 -funroll-loops testcase.c 
> testcase.c: In function 'DES_std_set_key':
> testcase.c:14:17: warning: array subscript is above array bounds

With GCC 5.0.0 20150215, this warning is gone.  I also confirm that Dmitry's
comment #2 warning is gone.  The original one from this bug's description
remains.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-24 Thread solar-gcc at openwall dot com


--- Comment #17 from solar-gcc at openwall dot com  2010-08-24 11:07 ---
(In reply to comment #16)
> I would really like to see this bug tackled.

I second that.

> Fixing it is easily done by lowering the spin count as proposed.  Otherwise,
> please show cases where a low spin count hurts performance.

Unfortunately, yes, I've since identified real-world test cases where
GOMP_SPINCOUNT=1 hurts performance significantly (compared to gcc 4.5.0's
default).  Specifically, this was the case when I experimented with my John the
Ripper patches on a dual-X5550 system (16 logical CPUs).  On a few
real-world'ish runs, GOMP_SPINCOUNT=1 would halve the speed.  On most other
tests I ran, it would slow things down by about 10%.  That's on an otherwise
idle system.  I was surprised as I previously only saw GOMP_SPINCOUNT=1
hurt performance on systems with server-like unrelated load (and it would help
tremendously with certain other kinds of load).

> In general, for a tuning parameter, a rather good-natured value should be
> preferred over a value that gives best results in one case, but very bad ones
> in another case.

In general, I agree.  Even the 50% worst-case slowdown I observed with
GOMP_SPINCOUNT=1 is not as bad as the 400x worst-case slowdown observed
without that option.  On the other hand, a 50% slowdown would be fatal when
comparing libgomp against competing implementations.  Also, HPC
cluster nodes may well be allocated such that there's no other load on each
individual node.  So having the defaults tuned for a system with no other load
makes some sense to me, and I am really unsure whether simply changing the
defaults is the proper fix here.

I'd be happy to see this problem fixed differently, such that the unacceptable
slowdowns are avoided in "both" cases.  Maybe the new default could be to
auto-tune the setting while the program is running?

Meanwhile, if it's going to take a long time until we have a code fix, perhaps
the problem and the workaround need to be documented prominently.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706



[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-08-24 Thread solar-gcc at openwall dot com


--- Comment #19 from solar-gcc at openwall dot com  2010-08-24 12:18 ---
(In reply to comment #18)
> Then, at the start of the spinning libgomp could initialize that flag and 
> check
> it from time to time (say every few hundred or thousand iterations) whether it
> has lost the CPU.

Without a kernel API like that, you can achieve a similar effect by issuing the
rdtsc instruction (or its equivalents for non-x86 archs) and seeing if the
cycle counter changes unexpectedly (say, by 1000 or more for a single loop
iteration), which would indicate that there was a context switch.  For an
arch-independent implementation, you could also use a syscall such as times(2)
or gettimeofday(2), but then you'd need to do it very infrequently (e.g., maybe
just to check whether a context switch occurred once every 10k to 20k spins).





[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-09-05 Thread solar-gcc at openwall dot com


--- Comment #22 from solar-gcc at openwall dot com  2010-09-05 11:37 ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=10?

Unfortunately, I no longer have access to the dual-X5550 system, and I did not
try other values for this parameter when I was benchmarking that system.  On
systems that I do currently have access to, the slowdown from
GOMP_SPINCOUNT=1 was typically no more than 10% (and most of the time there
was either no effect or substantial speedup).  I can try 10 on those,
although it'd be difficult to tell the difference from 1 because of the
changing load.  I plan to do this the next time I run this sort of
benchmark.





[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-09 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #23 from Alexander Peslyak 2010-11-09 16:32:53 UTC ---
(In reply to comment #20)
> Maybe we could agree on a compromise for a start.  Alexander, what are the
> corresponding results for GOMP_SPINCOUNT=10?

I reproduced slowdowns of 5% to 35% (on different pieces of code) on an
otherwise-idle dual-E5520 system (16 logical CPUs) when going from gcc 4.5.0's
defaults to GOMP_SPINCOUNT=1.  On all but one test, the original full speed
is restored with GOMP_SPINCOUNT=10.  On the remaining test, the threshold
appears to be between 10 (still 35% slower than full speed) and 20
(original full speed).  So if we're not going to have a code fix soon enough,
maybe the new default should be slightly higher than 20.  It won't help as
much as 1 would in the cases where this is needed, but it would be of some
help.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-11-12 Thread solar-gcc at openwall dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706

--- Comment #25 from Alexander Peslyak 2010-11-12 11:19:13 UTC ---
(In reply to comment #24)
> If only one out of 35 tests becomes slower,

You might have misread what I wrote.  I did not mention "35 tests"; I mentioned
that a test became slower by 35%.  The total number of different tests was 4
(and each was invoked multiple times per spincount setting, indeed).  One out
of four stayed 35% slower until I increased GOMP_SPINCOUNT to 20.

> I would rather blame it to this one (probably badly parallelized) 
> application, not the OpenMP runtime system.

This makes some sense, but the job of an optimizing compiler and runtime
libraries is to deliver the best performance they can even with somewhat
non-optimal source code.  There are plenty of real-world cases where spending
time on application redesign for speed is unreasonable or can only be completed
at a later time - yet it is desirable to squeeze a little bit of extra
performance out of the existing code.  There are also cases where more
efficient parallelization - implemented at a higher level to avoid frequent
switches between parallel and sequential execution - makes the application
harder to use.  To me, one of the very reasons to use OpenMP was to
avoid/postpone that redesign and the user-visible complication for now.  If I
went for a more efficient higher-level solution, I would not need OpenMP in the
first place.

> So I would suggest a threshold of 10 for now.

My suggestion is 25.

> IMHO, something should really happen to this problem before the 4.6 release.

Agreed.  It'd be best to have a code fix, though.


[Bug libgomp/43706] scheduling two threads on one core leads to starvation

2010-07-01 Thread solar-gcc at openwall dot com


--- Comment #14 from solar-gcc at openwall dot com  2010-07-02 01:39 ---
We're also seeing this problem on OpenMP-using code built with gcc 4.5.0
release on linux-x86_64.  Here's a user's report (400x slowdown on an 8-core
system when there's a single other process running on a CPU):

http://www.openwall.com/lists/john-users/2010/06/30/3

Here's my confirmation of the problem report (I easily reproduced similar
slowdowns), and workarounds:

http://www.openwall.com/lists/john-users/2010/06/30/6

GOMP_SPINCOUNT=1 (this specific value) turned out to be nearly optimal in
cases affected by this problem, as well as on idle systems, although I was also
able to identify cases (with server-like unrelated load: short requests to many
processes, which quickly go back to sleep) where this setting lowered the
measured best-case speed by 15% (over multiple benchmark invocations), even
though it might have improved the average speed even in those cases.

All of this is reproducible with John the Ripper 1.7.6 release on Blowfish
hashes ("john --test --format=bf") and with the -omp-des patch (current
revision is 1.7.6-omp-des-4) on DES-based crypt(3) hashes ("john --test
--format=des").  The use of OpenMP needs to be enabled by uncommenting the
OMPFLAGS line in the Makefile.  JtR and the patch can be downloaded from:

http://www.openwall.com/john/
http://openwall.info/wiki/john/patches

To reproduce the problem, it is sufficient to have one other CPU-using process
running when invoking the John benchmark.  I was using a non-OpenMP build of
John itself as that other process.

Overall, besides this specific "bug", OpenMP-using programs are very sensitive
to other system load - e.g., unrelated server-like load of 10% often slows an
OpenMP program down by 50%.  Any improvements in this area would be very
welcome.  However, this specific "bug" is extreme, with its 400x slowdowns, so
perhaps it should be treated with priority.

Jakub - thank you for your work on gcc's OpenMP support.  The ease of use is
great!


-- 

solar-gcc at openwall dot com changed:

   What|Removed |Added

 CC||solar-gcc at openwall dot com

