[Bug c/70686] New: -fprofile-generate (not fprofile-use) somehow produces much faster binary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

            Bug ID: 70686
           Summary: -fprofile-generate (not fprofile-use) somehow produces
                    much faster binary
           Product: gcc
           Version: 5.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alekshs at hotmail dot com
  Target Milestone: ---

I have a small benchmark that does 100mn loops of 20 divisions by 2. Periodically it bumps the values back up so that there is always something left to divide by 2. I time this and compare the results.

---
#include <stdio.h>
#include <math.h>
#include <time.h>

int main()
{
  printf("\n");

  const double a = 333.3456743289; //initial randomly assigned values to start halving
  const double aa = 555.444334244;
  const double aaa = 777.;
  const double aaaa = 3276.123458;

  unsigned int i;
  double score;
  double g; //the number to be used for making the divisions, so essentially halving everything each round

  double b;
  double bb;
  double bbb;
  double bbbb;

  g = 2;

  b = a;
  bb = aa;
  bbb = aaa;
  bbbb = aaaa;

  double total_time;
  clock_t start, end;

  start = 0;
  end = 0;
  score = 0;

  start = clock();

  for (i = 1; i <10001; i++)
  {
    b=b/g; b=b/g; b=b/g; b=b/g; b=b/g;
    bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g;
    bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g;
    bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g;

    if (b < 1.001) {b=b+i+12.432432432;} //just adding more stuff in order for the number
    if (bb < 1.001) {bb=bb+i+15.4324442;} //to return back to larger values after several
    if (bbb < 1.001) {bbb=bbb+i+19.42884;} //rounds of halving
    if (bbbb < 1.001) {bbbb=bbbb+i+34.481;}
  }

  end = clock();

  total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
  score = (1000 / total_time);

  printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));
  printf("\nTime elapsed: %0.0f msecs", total_time);
  printf("\nScore: %0.0f\n", score);

  return 0;
}
---

This is run on a quad q8200 @ 1.75ghz

Now notice the times:

gcc Maths4asm.c -lm -O0  => 6224ms
gcc Maths4asm.c -lm -O2 and -O3  => 1527ms
gcc Maths4asm.c -lm -Ofast  => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms (I have a core quad, technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than without PGO, slower than -fprofile-generate).

So PGO optimization made it worse, but the most mindblowing thing is the profiling run itself, which gets execution time down to 530ms. The profiling run (-generate) should normally take this to 4000-5000ms or above, as it monitors the process to create a log file. I have never run into a -fprofile-generate build that wasn't at least 2-3 times slower than a normal build - let alone 2-3 times faster. This is totally absurd. And then, to top it all, -fprofile-use (using the logfile to create the best binary) created worse binaries. Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead of core2. This stuff is almost unbelievable.

I thought initially that the profiler must be activating multithreading, but no. I scripted 4 simultaneous runs and they all give the same time - that means there was no extra cpu use in other threads.

The implication of all this is that if -fprofile-generate can somehow give code that executes in 500ms, and the non -fprofile-generate binaries run in 1200ms, then serious performance is left on the table in normal builds.
[Bug tree-optimization/70686] GIMPLE if-conversion slows down code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

--- Comment #2 from alekshs at hotmail dot com ---
(In reply to Richard Biener from comment #1)
> It's not so mind-blowing - it's simply that -fprofile-generate makes our
> GIMPLE level if-conversion no longer apply.  Without -fprofile-generate
> we if-convert the loop into
>
>   for (i = 1; i <10001; i++)
>     {
>       ...
>       b = b + (b < 1.1) ? i + 12.43 : 0.0;
>       ...
>     }
>
> thus we always evaluate the i + 12.43 and one additional addition of zero.
>
> We do this to eventually enable vectorization but without any check
> on whether it would be profitable when not vectorizing (your testcase
> shows it's not profitable).
>
> Confirmed.  -fno-tree-loop-if-convert should fix it in this particular case.

Aha, thanks for the swift reply.

Regarding profitability, I should note that the PGO entirely misses the fact that 20 mulsd could become 10 mulpd:

  400560:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400564:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400568:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40056c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400570:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400574:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400578:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40057c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400580:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400584:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400588:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40058c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400590:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  400594:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400598:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40059c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
  4005a0:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
  4005a4:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
  4005a8:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
  4005ac:       f2 0f 59 d1             mulsd  %xmm1,%xmm2

...So there was a job to be done there. That's at -O3 -march=native btw (to preserve accuracy, unlike -Ofast). -Ofast doesn't pack them either. It kind of splits into scalar muls and packed adds.

It's a similar situation with another small benchmark I made that does 4 x sqrts all the time (with some stuff added when the values get too low, so as to keep going): the 2x packed sqrts I did in asm were much faster than the 4 scalar ones that gcc was generating (at every level of optimization and profiling - it never did 2x packed... it kept doing 4x scalar). I'm attaching the bench at the end.

It seems like gcc avoids packing instructions like the plague in non-array code even when there are obvious and serious measurable benefits. Perhaps the heuristics need some tuning for both profiled and non-profiled compilation.

- code of sqrtbench.c -

#include <stdio.h>
#include <math.h>
#include <time.h>

int main()
{
  const double a = 911798473; // assigning some randomly chosen constants to begin math functions
  const double aa = 143314345;
  const double aaa = 531432117;
  const double aaaa = 343211418;

  unsigned int i; //loop counter

  double b; //variables that will be used for storing square roots
  double bb;
  double bbb;
  double bbbb;

  b = a; //assign some large values to the variables in order to start finding square roots
  bb = aa;
  bbb = aaa;
  bbbb = aaaa;

  double score; // score
  double time1; //how much time the program took
  clock_t start, end; //stopwatch timers

  start = clock();

  for (i = 1; i <10001; i++)
  {
    b=sqrt (b);
    bb=sqrt(bb);
    bbb=sqrt(bbb);
    bbbb=sqrt(bbbb);

    if (b <= 1.001) {b=b+i+12.432432432;}
    if (bb <= 1.001) {bb=bb+i+15.4324442;}
    if (bbb <= 1.001) {bbb=bbb+i+19.42884;}
    if (bbbb <= 1.001) {bbbb=bbbb+i+34.481;}
  }

  end = clock();

  time1 = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
  score = (1000 / time1); // Just a way to give a "score" instead of just time elapsed.
  // Baseline calibration is at 1000 points rewarded for 1ms delay...
  // In other words if you finish 5 times faster, say 2000ms, you get 5000 points

  printf("\nFinal number: %0.16f", (b+bb+bbb+bbbb)); // The number that resulted from all the math functions - useful for checking math accuracy from unsafe optimizations

  if (b+bb+bbb+bbbb > 4.032938759028) {printf("Result [INCORRECT - 4.032938759027 expected]");} //checking result
  if (b+bb+bbb+bbbb < 4.032938759026) {printf("Result [INCORRECT - 4.032938759027 expected]");} //checking result

  printf("\nTime elapsed: %0.0f msecs", time1); // Time elapsed announced to the user
  printf("\nScore: %0.0f\n", score); // Score announced to the user

  return 0;
}

-end code

(the above consistently generates 4 sqrtsd instead of 2 sqrtpd, at -O3 and with PGO).
[Bug tree-optimization/70686] GIMPLE if-conversion slows down code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686

--- Comment #4 from alekshs at hotmail dot com ---
I would be somewhat understanding in the context of -O2/-O3 (compiler guessing) but not in the context of PGO (the compiler understands the flow after a run, so it should be able to see that these IFs can't possibly be an obstacle for packed math... or that is my rationale anyway, which may be totally irrelevant to how these things should be done, the cost/reward of implementing such changes, etc etc).
[Bug c/81467] New: AVX-512 support for inline assembly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81467

            Bug ID: 81467
           Summary: AVX-512 support for inline assembly
           Product: gcc
           Version: 6.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alekshs at hotmail dot com
  Target Milestone: ---

I'm trying some avx-512 code with the inline assembly.

1) Clobbering xmm16-31 and k-type registers won't work. I guess it won't work in ymm16-31 or zmm16-31 either:

error: unknown register name ‘%xmm16’ in ‘asm’
error: unknown register name ‘%k1’ in ‘asm’

2) I'm having a problem trying to issue a

"VMOVDQU32 0(%0), %%ZMM0 {k1}{z}\n"

I also tried it with

"VMOVDQU32 0(%0), %%ZMM0 {%k1}{z}\n"

and

"VMOVDQU32 0(%0), %%ZMM0 {%%k1}{z}\n"

and I also added % and %% in front of the z flag to see if it'll work.

It's possible I'm doing something wrong in the syntax although I tried several permutations. Googling for examples didn't get me any meaningful results.
[Bug target/81467] AVX-512 support for inline assembly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81467

--- Comment #3 from alekshs at hotmail dot com ---
Aha, ok thanks for the clarification. It was pretty helpful.

Regarding clobbering, I was compiling on a Skylake Xeon which has avx512f avx512dq avx512cd avx512bw avx512vl using -march=native... so I didn't expect avx512 detection to be an issue (the registers were present on the cpu and the native flag says go on and use the maximum possible).

Perhaps it would be somewhat simpler if it was able to allow clobbering xmm/ymm/zmm16-31 and k-regs based on the native flag (without the conditional ifs). Maybe an improvement (?) for the future (?).

Thanks again.