[Bug c/70686] New: -fprofile-generate (not fprofile-use) somehow produces much faster binary

2016-04-15 Thread alekshs at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

Bug ID: 70686
   Summary: -fprofile-generate (not fprofile-use) somehow produces
much faster binary
   Product: gcc
   Version: 5.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alekshs at hotmail dot com
  Target Milestone: ---

I have a small benchmark that does 100mn loops of 20 divisions by 2.
Periodically it bumps the values back up so that there is always something left
to divide by 2. I time this and look at the results.

---



#include <stdio.h> 
#include <math.h> 
#include <time.h>

int main() 
{
printf("\n");

const double a = 333.3456743289;  //initial randomly assigned values to start halving
const double aa = 555.444334244;
const double aaa = 777.;
const double aaaa = 3276.123458;

unsigned int i;
double score;
double g; //the number to be used for making the divisions, so essentially halving everything each round

double b; 
double bb;
double bbb;
double bbbb;

g = 2;  

b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;

double total_time;
clock_t start, end;

start = 0;
end = 0;
score = 0;

start = clock();

 for (i = 1; i <10001; i++) 
 {
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

   if (b    < 1.001)  {b=b+i+12.432432432;}      //just adding more stuff in order for the number
   if (bb   < 1.001)  {bb=bb+i+15.4324442;}      //to return back to larger values after several
   if (bbb  < 1.001)  {bbb=bbb+i+19.42884;}      //rounds of halving
   if (bbbb < 1.001)  {bbbb=bbbb+i+34.481;} 
}

 end = clock();

 total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;

 score = (1000 / total_time);
 printf("\nFinal number: %0.20f", (b+bb+bbb+));

 printf("\nTime elapsed: %0.0f msecs", total_time);   
 printf("\nScore: %0.0f\n", score);

 return 0;
}
---



This was run on a quad-core Q8200 @ 1.75GHz.

Now notice the times:

gcc Maths4asm.c -lm -O0  => 6224ms
gcc Maths4asm.c -lm -O2 and -O3  => 1527ms
gcc Maths4asm.c -lm -Ofast  => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona  => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2  => 1197ms  (I have a Core Quad, technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate  => 624ms
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate  => 530ms
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use  => 1258ms (slower than without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use  => 1222ms (slower than without PGO, slower than -fprofile-generate)
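
For reference, the -fprofile-use numbers come from the usual two-step cycle, roughly like this (the output name is just my local choice):

gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate -o maths4
./maths4          (training run, writes the .gcda profile data)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use -o maths4
./maths4          (timed run with the profile applied)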

So PGO optimization made things worse, but the most mind-blowing part is the
profiling run itself getting execution times down to 530ms. The instrumented
(-fprofile-generate) run should normally push this up to 4000-5000ms or more,
since it monitors the process to create a log file.

I have never run into a -fprofile-generate build that wasn't at least 2-3 times
slower than a normal build - let alone 2-3 times faster. This is totally
absurd. 

And then, to top it all off, -fprofile-use (which uses the log file to create the
best binary) produced worse binaries.

Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead
of core2.

This stuff is almost unbelievable. I initially thought the profiler must be
activating multithreading, but no: I scripted 4 simultaneous runs and they all
gave the same time, which means there was no extra CPU use in other threads.

The implication of all this is that if -fprofile-generate can somehow produce
code that executes in ~500ms while the non-instrumented binaries run at ~1200ms,
then serious performance is being left on the table in normal builds.

[Bug tree-optimization/70686] GIMPLE if-conversion slows down code

2016-04-18 Thread alekshs at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

--- Comment #2 from alekshs at hotmail dot com ---
(In reply to Richard Biener from comment #1)
> It's not so mind-blowing - it's simply that -fprofile-generate makes our
> GIMPLE level if-conversion no longer apply.  Without -fprofile-generate
> we if-convert the loop into
> 
>  for (i = 1; i <10001; i++) 
>  {
>  ...
> 
>b = b + (b < 1.1) ? i + 12.43 : 0.0; 
> ...
> }
> 
> thus we always evaluate the i + 12.43 and one additional addition of zero.
> 
> We do this to eventually enable vectorization but without any check
> on whether it would be profitable when not vectorizing (your testcase
> shows it's not profitable).
> 
> Confirmed.  -fno-tree-loop-if-convert should fix it in this particular case.

Aha, thanks for the swift reply.

Regarding profitability, I should note that the PGO misses entirely the fact
that 20 mulsd could become 10 mulpd:


  400560:   f2 0f 59 e9 mulsd  %xmm1,%xmm5
  400564:   f2 0f 59 e1 mulsd  %xmm1,%xmm4
  400568:   f2 0f 59 d9 mulsd  %xmm1,%xmm3
  40056c:   f2 0f 59 d1 mulsd  %xmm1,%xmm2
  400570:   f2 0f 59 e9 mulsd  %xmm1,%xmm5
  400574:   f2 0f 59 e1 mulsd  %xmm1,%xmm4
  400578:   f2 0f 59 d9 mulsd  %xmm1,%xmm3
  40057c:   f2 0f 59 d1 mulsd  %xmm1,%xmm2
  400580:   f2 0f 59 e9 mulsd  %xmm1,%xmm5
  400584:   f2 0f 59 e1 mulsd  %xmm1,%xmm4
  400588:   f2 0f 59 d9 mulsd  %xmm1,%xmm3
  40058c:   f2 0f 59 d1 mulsd  %xmm1,%xmm2
  400590:   f2 0f 59 e9 mulsd  %xmm1,%xmm5
  400594:   f2 0f 59 e1 mulsd  %xmm1,%xmm4
  400598:   f2 0f 59 d9 mulsd  %xmm1,%xmm3
  40059c:   f2 0f 59 d1 mulsd  %xmm1,%xmm2
  4005a0:   f2 0f 59 e9 mulsd  %xmm1,%xmm5
  4005a4:   f2 0f 59 e1 mulsd  %xmm1,%xmm4
  4005a8:   f2 0f 59 d9 mulsd  %xmm1,%xmm3
  4005ac:   f2 0f 59 d1 mulsd  %xmm1,%xmm2


...So there was work to be done there. That's at -O3 -march=native, by the way
(to preserve accuracy, unlike -Ofast). -Ofast doesn't pack them either; it kind
of splits things into scalar muls and packed adds.
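
To illustrate what I mean by packing, here is a minimal hand-written sketch (SSE2 intrinsics; the function and variable names are just mine, not anything gcc generates) of how the four independent chains could share packed multiplies - 2 mulpd per round instead of 4 mulsd:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: keep {b, bb} and {bbb, bbbb} packed in two __m128d
   values, so each "divide by 2" round is 2 mulpd instead of 4 mulsd
   (multiplying by 0.5 is what -Ofast turns the /2 into anyway). */
static void halve_round(__m128d *p1, __m128d *p2)
{
    const __m128d half = _mm_set1_pd(0.5);
    *p1 = _mm_mul_pd(*p1, half);   /* {b, bb}     *= 0.5 -> one mulpd */
    *p2 = _mm_mul_pd(*p2, half);   /* {bbb, bbbb} *= 0.5 -> one mulpd */
}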

It's a similar situation with another small benchmark I made that does 4 sqrts
all the time (with some values added back in whenever they get too low, so as to
keep going): the 2 packed sqrts I wrote in asm were much faster than the 4
scalar ones gcc was generating (at every level of optimization and profiling -
it never did 2 packed, it kept doing 4 scalar). I'm attaching the benchmark at
the end.

It seems like gcc avoids packing instructions like the plague in non-array code,
even when there are obvious and seriously measurable benefits. Perhaps the
heuristics need some tuning up for both profiled and non-profiled compilation.


-
code of sqrtbench.c
-

#include <stdio.h> 
#include <math.h> 
#include <time.h>

int main() 
{
const double a = 911798473;  // assigning some randomly chosen constants to begin math functions
const double aa = 143314345;
const double aaa = 531432117;
const double aaaa = 343211418;

unsigned int i; //loop counter

double b; //variables that will be used for storing square roots
double bb;
double bbb;
double bbbb;

b = a;  //assign some large values to the variables in order to start finding square roots
bb = aa;
bbb = aaa;
bbbb = aaaa;

double score; // score
double time1; //how much time the program took

clock_t start, end; //stopwatch timers

start = clock();

 for (i = 1; i <10001; i++) 
 {
   b=sqrt (b);
   bb=sqrt(bb);
   bbb=sqrt(bbb);
   bbbb=sqrt(bbbb);

   if (b<= 1.001)  {b=b+i+12.432432432;} 
   if (bb   <= 1.001)  {bb=bb+i+15.4324442;} 
   if (bbb  <= 1.001)  {bbb=bbb+i+19.42884;}
   if (bbbb <= 1.001)  {bbbb=bbbb+i+34.481;}
  }

 end = clock();

 time1 = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;

 score = (1000 / time1); // Just a way to give a "score" instead of just time elapsed.
                         // Baseline calibration is at 1000 points rewarded for 1ms delay...
                         // In other words if you finish 5 times faster, say 2000ms, you get 5000 points

 printf("\nFinal number: %0.16f", (b+bb+bbb+));  // The number that
resulted from all the math functions - useful for checking math accuracy from
unsafe optimizations

 if (b+bb+bbb+ > 4.032938759028) {printf("Result [INCORRECT -
4.032938759027 expected]");} //checking result
 if (b+bb+bbb+ < 4.032938759026) {printf("Result [INCORRECT -
4.032938759027 expected]");} //checking result 

 printf("\nTime elapsed: %0.0f msecs", time1);   // Time elapsed announced to
the user
 printf("\nScore: %0.0f\n", score);  // Score announced to the user

 return 0;
}

-end code 
(the above consistently generates 4 sqrtsd instead of 2 sqrtpd, at -O3 and with
PGO).
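
For comparison, this is roughly the shape of the packed version I hand-wrote, shown here as SSE2 intrinsics rather than my original asm (so the function and variable names are just mine) - 2 sqrtpd per iteration for all four variables, with the conditional bump done branchlessly so the loop body stays packed:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: one packed step of the sqrt benchmark, keeping
   {b, bb} (or {bbb, bbbb}) in a single __m128d.  The "bump up when the
   value gets too small" is done with a compare mask instead of a branch. */
static __m128d sqrt_step(__m128d v, __m128d bump)
{
    v = _mm_sqrt_pd(v);                                  /* one sqrtpd for two lanes     */
    __m128d low = _mm_cmple_pd(v, _mm_set1_pd(1.001));   /* all-ones mask where <= 1.001 */
    return _mm_add_pd(v, _mm_and_pd(low, bump));         /* add bump only in those lanes */
}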

[Bug tree-optimization/70686] GIMPLE if-conversion slows down code

2016-04-18 Thread alekshs at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

--- Comment #4 from alekshs at hotmail dot com ---
I would be somewhat understanding in the context of -O2/-O3 (where the compiler
is guessing), but not in the context of PGO: after a training run the compiler
understands the flow, so it should be able to see that these ifs can't possibly
be an obstacle to packed math. Or that is my rationale anyway, which may be
totally irrelevant to how these things should be done, the cost/reward of
implementing such changes, etc.

[Bug c/81467] New: AVX-512 support for inline assembly

2017-07-17 Thread alekshs at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81467

Bug ID: 81467
   Summary: AVX-512 support for inline assembly
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: alekshs at hotmail dot com
  Target Milestone: ---

I'm trying some AVX-512 code with inline assembly.

1) Clobbering xmm16-31 and the k mask registers won't work. I guess it won't
work with ymm16-31 or zmm16-31 either: 

error: unknown register name ‘%xmm16’ in ‘asm’
error: unknown register name ‘%k1’ in ‘asm’

2) I'm having a problem trying to issue a

"VMOVDQU32 0(%0), %%ZMM0 {k1}{z}\n"

I also tried it with 

 "VMOVDQU32 0(%0), %%ZMM0 {%k1}{z}\n"

and

 "VMOVDQU32 0(%0), %%ZMM0 {%%k1}{z}\n"

and I also added % and %% in front of the z flag to see if that would work. It's
possible I'm doing something wrong in the syntax, although I tried several
permutations. Googling for examples didn't get me any meaningful results.
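
Put together, a trimmed-down version of what I'm attempting looks roughly like this (the function and buffer names are just from my test case); compiled with -march=native on this machine it gives the "unknown register name" errors for the clobbers, plus the masking-syntax problem above:

#include <stdint.h>

/* Minimal reproducer: masked 512-bit load into zmm0 under k1, with one of
   the upper xmm registers and the mask register listed as clobbers. */
void load_masked(const int32_t *src)
{
    __asm__ volatile (
        "VMOVDQU32 0(%0), %%ZMM0 {%%k1}{z}\n"
        :                          /* no outputs */
        : "r" (src)                /* pointer to the source buffer */
        : "xmm16", "k1", "memory"  /* these clobbers trigger the errors */
    );
}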

[Bug target/81467] AVX-512 support for inline assembly

2017-07-18 Thread alekshs at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81467

--- Comment #3 from alekshs at hotmail dot com ---
Aha, ok thanks for the clarification. It was pretty helpful.

Regarding clobbering, I was compiling on a Skylake Xeon which has

avx512f avx512dq avx512cd avx512bw avx512vl 

using -march=native... so I didn't expect AVX-512 detection to be an issue (the
registers are present on the CPU, and the native flag says go ahead and use the
maximum possible). 

Perhaps it would be somewhat simpler if it were possible to allow clobbering
xmm/ymm/zmm16-31 and the k registers based on the native flag (without the
conditional ifs). Maybe an improvement for the future?

Thanks again.