[4.4] Strange performance regression?

2009-10-13 Thread francesco biscani
Hi list,

I'm experiencing a strange behaviour with GCC 4.4.1. Basically I have
some C++ mathematical code which gets a ~x2 performance drop if I
*remove* the following debug line from the code:

---

std::cout << "Block size: " << block_size << '\n';

---

Where block_size is a std::size_t variable. It took a while to bisect
this issue, but now I can reproduce it consistently by removing just
that line. In order to have the expected performance, both the strings
and the variable must be printed. The architecture is x86_64, 64-bit
Gentoo Linux on Intel Core2 Q6600 CPU. The same problem is not there
on another machine, a 64-bit Ubuntu Linux with GCC 4.3.3 and Intel
Xeon Core2 (8 cores total).

Unfortunately the offending portion of code is buried quite deep into
templated code, so it is a bit difficult for me to reduce the test
case to a minimum. However, some background may be helpful in
isolating the possible causes. That portion of the code is
conceptually quite simple. It is a polynomial multiplication routine,
which deals with two vectors of coefficients (in the specific case,
double-precision coefficients) and two vectors of long ints
representing the exponents (it's a kind of sparse representation of
two univariate polynomials). The coefficients are multiplied
one-by-one and the corresponding exponents are added one-by-one so
that the resulting integers indicate the positions of the results of
coefficient multiplication in a third coefficient vector (which
represents the result of the multiplication). To achieve the best
performance, cache blocking is employed to promote spatial and
temporal locality.
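
For readers following the thread, the routine described above can be
sketched roughly as follows. This is not the actual code, which is
templated and was not posted; it is a minimal illustration assuming
C++11, non-negative exponents and a dense result vector indexed by
exponent. The name blocked_multiply and its parameters are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the original routine: multiply two sparse
// univariate polynomials given as (coefficient, exponent) vectors.
// Assumes all exponents are >= 0; the result is dense, indexed by exponent.
std::vector<double> blocked_multiply(const std::vector<double>& c1,
                                     const std::vector<long>& e1,
                                     const std::vector<double>& c2,
                                     const std::vector<long>& e2,
                                     std::size_t block_size)
{
    assert(block_size > 0);
    assert(c1.size() == e1.size() && c2.size() == e2.size());
    if (c1.empty() || c2.empty()) {
        return {};
    }
    const long max_e1 = *std::max_element(e1.begin(), e1.end());
    const long max_e2 = *std::max_element(e2.begin(), e2.end());
    std::vector<double> result(static_cast<std::size_t>(max_e1 + max_e2) + 1, 0.0);

    // Walk both inputs in blocks so that each pair of blocks, and the part
    // of the result it touches, is more likely to stay in cache.
    for (std::size_t ib = 0; ib < c1.size(); ib += block_size) {
        const std::size_t i_end = std::min(ib + block_size, c1.size());
        for (std::size_t jb = 0; jb < c2.size(); jb += block_size) {
            const std::size_t j_end = std::min(jb + block_size, c2.size());
            for (std::size_t i = ib; i < i_end; ++i) {
                for (std::size_t j = jb; j < j_end; ++j) {
                    // Exponents add; the sum is the position of the product.
                    result[static_cast<std::size_t>(e1[i] + e2[j])] += c1[i] * c2[j];
                }
            }
        }
    }
    return result;
}
```

In this sketch block_size plays the role of the variable printed by
the debug line; how the real code chooses it is not stated in the
post.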

Since this portion of the code is quite critical, I've been
consistently trying to make sure the performance was always optimal.
In fact, when the code is as fast as expected, the processor is fully
utilizing its computing power, averaging around 4-5 clock cycles per
coefficient multiplication on the three different Core2 processors
I've tested the code with. This performance figure has been maintained
consistently for at least one year throughout various releases of GCC,
until this problem arose.

I've tried playing around a bit with the optimizations but I could not
identify any concrete lead. The only maybe relevant clue is that the
problem can be mitigated a bit by choosing -Os optimization level
instead of -O2. To my non-expert eyes this would seem like a case of
missed optimization (which the print to screen somehow re-enables?),
but at this point I am really at a loss. I would like to open a bug
report, but first I wanted to check whether there is something that
I'm completely missing.

Any help or comment would be really appreciated!

Thanks,

  Francesco.

PS: if you reply, please CC me, as I'm not subscribed to the list.


Re: [4.4] Strange performance regression?

2009-10-14 Thread francesco biscani
Hi Joern and list(s),

On Wed, Oct 14, 2009 at 12:05 PM, Joern Rennecke wrote:
> He also said that it was a different machine, Core 2 Q6600 vs
> some kind of Xeon Core 2 system with a total of eight cores.
> As different memory subsystems are likely to affect the code, it
> is not an established regression till he can reproduce a performance
> drop going from an older to a current compiler on the same or
> sufficiently similar machines, under comparable load conditions -
> which generally means that the machine must be idle apart from the
> benchmark.
>

I decided to bite the bullet and went back to GCC 4.3.4 on the very
same machine where I'm experiencing the issue. With these flags:

-O2 -march=core2 -fomit-frame-pointer

the performance is the same as on 4.4.1 *with* -funroll-loops
(actually, around 5% better, but probably it is not statistically
significant). So, with 4.3.4, I get the expected *good* performance.
Just to give an order of magnitude, the "good" performance measure is
~5.1-5.2 seconds, while the "bad" performance is ~11-12 seconds for
this test. I ran the tests both on an idle setup (no X, just a couple
of services in the background) and with a "busy" machine (Firefox,
audio playing in the background,...) but I could hardly notice any
difference in all cases.
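
For anyone wanting to repeat this kind of comparison, one way to take
wall-clock timings that are less sensitive to background load is to
run the workload several times and keep the fastest run. The sketch
below assumes C++11; best_time_seconds is a hypothetical helper, not
how the figures above were obtained.

```cpp
#include <chrono>
#include <limits>

// Hypothetical helper: call `workload` `repetitions` times and return
// the shortest wall-clock time of a single call, in seconds.
// Returns +infinity if repetitions <= 0.
template <typename F>
double best_time_seconds(F&& workload, int repetitions)
{
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repetitions; ++r) {
        const auto start = std::chrono::steady_clock::now();
        workload();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        if (elapsed.count() < best) {
            best = elapsed.count();
        }
    }
    return best;
}
```

Building the same source with each compiler and flag set and
comparing the values returned for the multiplication test would give
numbers comparable to the ~5 s and ~11-12 s quoted above.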

I can try to investigate further if anyone is interested.

Thanks again to everybody,

  Francesco

PS: hope I'm not infringing the netiquette by cross-posting to two
mailing lists.