[Bug c/94212] New: [AARCH64] [Regression] Incorrect vectorization of loop with FP calculations

2020-03-18 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212

            Bug ID: 94212
           Summary: [AARCH64] [Regression] Incorrect vectorization of loop
                    with FP calculations
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dpochepk at gmail dot com
  Target Milestone: ---

Created attachment 48054
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48054&action=edit
example application returning different result for O3 and O2

The example application (FP polynomial loop calculations) gives different results
at -O2 and -O3. The difference is 1 ulp, so it might be some kind of
rounding error (unsafe math leaked?).
"-O3 -fno-tree-vectorize" gives the correct result.

This issue seems to affect aarch64 only (at least x86_64 is fine).
I tried several gcc versions:

trunk: affected
gcc8.3: affected
gcc7.4: not affected

(I haven't investigated the assembly.)

The example application is attached. The function "foo" has a vectorized loop, which
is probably the trigger for this bug.

[Bug tree-optimization/94212] [8/9/10 Regression] Incorrect vectorization of loop with FP calculations

2020-03-18 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212

--- Comment #6 from Dmitrij Pochepko  ---
Just checked: the non-vectorized assembly for aarch64 (O2) uses fmadd and fmsub
intensively.

[Bug tree-optimization/94212] [8/9/10 Regression] Incorrect vectorization of loop with FP calculations

2020-03-18 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212

--- Comment #8 from Dmitrij Pochepko  ---
(In reply to Richard Biener from comment #7)
> (In reply to Dmitrij Pochepko from comment #6)
> > Just checked: non-vectorized assembly for aarch64 (O2) is using fmadd and
> > fmsub intensively.
> 
> Try with -ffp-contract=off then.  Note due to effective unrolling of
> the loop with vectorization we might end up forming "different" fmadd
> groups.  So you might also want to check whether the vectorized loop still
> sees fmadd use.

-O2 -ffp-contract=off
-O3 -ffp-contract=off
produce the same calculation result as -O2.


Regarding the assembly:
the vectorized version uses fmla and fmls, which are the vector forms of
multiply-add/sub. It's hard to say how the grouping of multiplications and
additions/subtractions differs without a detailed step-by-step comparison
though.
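
To illustrate the contraction effect discussed above: a fused multiply-add rounds once, while a separate multiply and add round twice, so the two can differ in the last bit. The following is a minimal sketch, not taken from the attached test case. The function names and input values are hypothetical, and fusion is only emulated for float operands by doing the arithmetic in double (exact for these inputs), so that the example does not depend on compiler flags or target instructions.

```c
#include <stdio.h>

/* Multiply, round to float, then add: two roundings.
   The volatile temporary keeps the compiler from contracting
   the two operations into a single fused instruction. */
float mul_add_unfused(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Emulated fused multiply-add for float operands (hypothetical helper).
   The product of two floats fits exactly in a double, so for the inputs
   used here only the final conversion back to float rounds. */
float mul_add_fused_emulated(float a, float b, float c)
{
    volatile double product = (double)a * (double)b;
    return (float)(product + (double)c);
}

/* Prints both results for one input triple where they differ. */
void show_contraction_difference(void)
{
    float a = 1.0f + 0x1p-12f;     /* 1 + 2^-12 */
    float c = -(1.0f + 0x1p-11f);  /* -(1 + 2^-11) */

    printf("unfused: %a\n", mul_add_unfused(a, a, c));
    printf("fused:   %a\n", mul_add_fused_emulated(a, a, c));
}
```

For these inputs the exact product is 1 + 2^-11 + 2^-24, which sits halfway between two floats. The unfused path rounds it to 1 + 2^-11 and returns 0, while the emulated fused path keeps the low bit and returns 2^-24.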

[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins

2020-03-31 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

Dmitrij Pochepko  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #11 from Dmitrij Pochepko  ---
Is it still under development/improvement?

[Bug c++/94532] New: ICE while compiling speccpu2017 blender

2020-04-08 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

            Bug ID: 94532
           Summary: ICE while compiling speccpu2017 blender
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dpochepk at gmail dot com
  Target Milestone: ---

Fails with the ToT revision, at least on aarch64.
Passed with the 27 March build.

Log:

blender/source/blender/blenkernel/intern/curve.c:1063:6: error: missing definition
 1063 | void BKE_nurb_makeFaces(Nurb *nu, float *coord_array, int rowstride, int resolu, int resolv)
      |      ^~
for SSA_NAME: _907 in statement:
fp_833 = _907;
during GIMPLE pass: vect
blender/source/blender/blenkernel/intern/curve.c:1063:6: internal compiler error: verify_ssa failed
0xf62d63 verify_ssa(bool, bool)
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/tree-ssa.c:1208
0xc30ca7 execute_function_todo
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:1992
0xc31acb do_per_function
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:1640
0xc31acb execute_todo
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:2039
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
specmake: *** [/opt/cpu2017/benchspec/Makefile.defaults:347: blender/source/blender/blenkernel/intern/curve.o] Error 1
specmake: *** Waiting for unfinished jobs...
specmake: *** Waiting for unfinished jobs...


I'm currently bisecting the problem.

[Bug tree-optimization/94532] [10 Regression] ICE while compiling speccpu2017 blender

2020-04-09 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

Dmitrij Pochepko  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #3 from Dmitrij Pochepko  ---


*** This bug has been marked as a duplicate of bug 94443 ***

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-09 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Dmitrij Pochepko  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #19 from Dmitrij Pochepko  ---
*** Bug 94532 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/94532] [10 Regression] ICE while compiling speccpu2017 blender

2020-04-09 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

--- Comment #4 from Dmitrij Pochepko  ---
Yes. It's a duplicate of bug 94443.

[Bug tree-optimization/96208] New: non-power-of-2 group size can be vectorized for 2-element vectors case

2020-07-15 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96208

            Bug ID: 96208
           Summary: non-power-of-2 group size can be vectorized for
                    2-element vectors case
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dpochepk at gmail dot com
  Target Milestone: ---

Created attachment 48879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48879&action=edit
initial implementation

The current loop vectorizer only vectorizes loops whose group size is a power of 2
or 3, due to the specifics of the vector permutation generation algorithm.
However, for 2-element vectors a simple permutation scheme can be used to
support any group size: insert each vector element into its required position.
This needs only a reasonable number of operations for 2-element vectors.

Initial version is attached.
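
As an illustration of what "group size" means here, the following hypothetical loop reads memory in groups of 5 consecutive doubles, which is neither a power of 2 nor 3. It is a sketch written for this note (the function name and signature are assumptions, not taken from the attached patch). Whether a given GCC version vectorizes it depends on the version and flags.

```c
#include <stddef.h>

/* Sums each group of 5 adjacent doubles into one output element.
   Vectorizing across i would need vectors that hold the same field
   (in[5*i + k] for a fixed k) from consecutive groups, which is where
   the element permutations come from. */
void sum_groups_of_5(double *restrict out, const double *restrict in,
                     size_t n_groups)
{
    for (size_t i = 0; i < n_groups; i++) {
        out[i] = in[5 * i] + in[5 * i + 1] + in[5 * i + 2]
               + in[5 * i + 3] + in[5 * i + 4];
    }
}
```

With 2-element vectors, each per-field vector needs only two lanes filled, one from each of two consecutive groups, which is the case the report argues is cheap enough to handle by plain element inserts.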

[Bug rtl-optimization/92892] New: [AARCH64] TBL-based permutations can be implemented more efficiently for 2-element vectors

2019-12-10 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92892

            Bug ID: 92892
           Summary: [AARCH64] TBL-based permutations can be implemented
                    more efficiently for 2-element vectors
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dpochepk at gmail dot com
  Target Milestone: ---

The current vector element permutation implementation generates different
instructions depending on the specific form of the permutation. For permutations like
"target[0] = src1[0]; target[1] = src2[1];" the TBL instruction is used and
the following instruction sequence is generated:

mov tmpReg1, src1;
mov tmpReg2, src2;
tbl target, {tmpReg1, tmpReg2}, ...
// tmpReg1 and tmpReg2 are consecutively numbered registers, as
// required by the tbl instruction

For 2-element vectors this sequence can be reduced to:

mov target[0], src1[0]
mov target[1], src2[1]


It can be reduced further to a single mov when target = src, which is already
implemented in the patch prototype I'm working on.
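
The permutation above can be written in C with the vector extensions GCC provides. The sketch below is only an illustration (the type and function names are hypothetical and it is not part of the patch prototype): the first function builds the result from one lane of each source, and the second expresses the case where the target is src1 itself as a single lane insert. Which instructions the compiler picks for either form depends on the GCC version and target.

```c
/* Two-lane vector of doubles, using the vector_size extension. */
typedef double v2df __attribute__((vector_size(16)));

/* target[0] = src1[0]; target[1] = src2[1]; */
v2df select_lanes(v2df src1, v2df src2)
{
    v2df target = { src1[0], src2[1] };
    return target;
}

/* Same result when the target is src1 itself:
   only lane 1 has to be overwritten. */
v2df insert_high_lane(v2df src1, v2df src2)
{
    src1[1] = src2[1];
    return src1;
}
```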

[Bug target/93390] New: AARCH64: FP move costs needs improvements for ThunderX2

2020-01-22 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93390

            Bug ID: 93390
           Summary: AARCH64: FP move costs needs improvements for
                    ThunderX2
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dpochepk at gmail dot com
  Target Milestone: ---
            Target: aarch64-thunderx2t99

The current cpu_regmove_cost for thunderx2t99 seems suboptimal. Preliminary
experiments and benchmarks (SPEC) show that slightly lower values for the
FP-related entries give better results. I'd like to proceed on this one.

[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins

2020-02-14 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

--- Comment #2 from Dmitrij Pochepko  ---
I have a patch which recognizes this pattern and emits ins instructions. The example
in this issue's description compiles fine and produces this assembly:

 :
   0:   6e184420    mov v0.d[1], v1.d[1]
   4:   d65f03c0    ret

However, there are slightly more complicated examples where register
allocation prevents optimal assembly from being generated
(example: part of the blender benchmark from speccpu):
#include <math.h>

static inline float dot_v3v3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static inline float len_v3(const float a[3])
{
    return sqrtf(dot_v3v3(a, a));
}


void window_translate_m4(float winmat[4][4], float perspmat[4][4], const float x, const float y)
{
    if (winmat[2][3] == -1.0f) {
        /* in the case of a win-matrix, this means perspective always */
        float v1[3];
        float v2[3];
        float len1, len2;

        v1[0] = perspmat[0][0];
        v1[1] = perspmat[1][0];
        v1[2] = perspmat[2][0];

        v2[0] = perspmat[0][1];
        v2[1] = perspmat[1][1];
        v2[2] = perspmat[2][1];

        len1 = (1.0f / len_v3(v1));
        len2 = (1.0f / len_v3(v2));

        winmat[2][0] += len1 * winmat[0][0] * x;
        winmat[2][1] += len2 * winmat[1][1] * y;
    }
    else {
        winmat[3][0] += x;
        winmat[3][1] += y;
    }
}


This will produce:
...
  24:   fd400010    ldr d16, [x0]
  28:   fd400807    ldr d7, [x0,#16]
...
  34:   6e040611    mov v17.s[0], v16.s[0]
  38:   6e0c24f1    mov v17.s[1], v7.s[1]
...
# d16/v16 and d7/v7 are not used anywhere else

while it can be:

...
  24:   fd400010    ldr d16, [x0]
  28:   fd400807    ldr d7, [x0,#16]
...
  38:   6e0c24f1    mov v16.s[1], v7.s[1]
...
# and v16 is used instead of v17.


It looks like peephole2 can be used to optimize this. I'm currently looking into
it.

[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins

2020-02-16 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

--- Comment #6 from Dmitrij Pochepko  ---
Created attachment 47851
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47851&action=edit
current patch version

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-10-02 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Dmitrij Pochepko  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #2 from Dmitrij Pochepko  ---
aarch64 won't necessarily be faster with such a fix.
531.deepsjeng_r on ThunderX2 runs about 0.5% slower with 31-clz(a).

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-10-07 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

--- Comment #4 from Dmitrij Pochepko  ---
(In reply to Andrew Pinski from comment #3)
> ...

I haven't specifically tracked the data deepsjeng passes to the logL function. I only
measured totals. The slowdown might not be directly related to logL execution time.

I also measured separate synthetic benchmarks with loop-based and
non-loop-based implementations (a simple logL calculation in a loop,
adding the result into an accumulator). For arguments 0 and 1 I see about 2% slower
numbers with the synthetic benchmark on T99. I hope this info helps anyone
who decides to work on this patch.
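
For readers unfamiliar with the two variants being compared, here is a minimal sketch of a loop-based and a clz-based integer log2. It is an assumption about the general shape of such code, not the deepsjeng source or the benchmark used for the measurements above. The function names are hypothetical, and a 32-bit unsigned int is assumed.

```c
/* Loop-based floor(log2(a)); returns -1 for a == 0. */
int log2_loop(unsigned int a)
{
    int result = -1;
    while (a != 0) {
        a >>= 1;
        result++;
    }
    return result;
}

/* clz-based variant: 31 - clz(a) for a 32-bit unsigned int.
   __builtin_clz is undefined for 0, so that case is handled separately. */
int log2_clz(unsigned int a)
{
    return a != 0 ? 31 - __builtin_clz(a) : -1;
}
```

For small arguments such as 0 and 1 the loop finishes after at most one iteration, which may be one reason the loop form can stay competitive on such inputs.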