[Bug c/94212] New: [AARCH64] [Regression] Incorrect vectorization of loop with FP calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212

Bug ID: 94212
Summary: [AARCH64] [Regression] Incorrect vectorization of loop with FP calculations
Product: gcc
Version: tree-ssa
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: dpochepk at gmail dot com
Target Milestone: ---

Created attachment 48054
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48054&action=edit
example application returning different result for O3 and O2

The example application (FP polynomial loop calculations) gives different results at -O2 and -O3. The difference is 1 ulp, so it might be some kind of rounding error (unsafe math leaked?). "-O3 -fno-tree-vectorize" gives the correct result. This issue seems to affect aarch64 only (at least x86_64 is fine).

Tried several gcc versions:
trunk: affected
gcc8.3: affected
gcc7.4: not affected
(I haven't investigated the assembly.)

The example application is attached. Method "foo" has a vectorized loop, which is probably the trigger for this bug.
[Bug tree-optimization/94212] [8/9/10 Regression] Incorrect vectorization of loop with FP calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212 --- Comment #6 from Dmitrij Pochepko --- Just checked: non-vectorized assembly for aarch64 (O2) is using fmadd and fmsub intensively.
[Bug tree-optimization/94212] [8/9/10 Regression] Incorrect vectorization of loop with FP calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94212

--- Comment #8 from Dmitrij Pochepko ---
(In reply to Richard Biener from comment #7)
> (In reply to Dmitrij Pochepko from comment #6)
> > Just checked: non-vectorized assembly for aarch64 (O2) is using fmadd and
> > fmsub intensively.
>
> Try with -ffp-contract=off then. Note due to effective unrolling of
> the loop with vectorization we might end up forming "different" fmadd
> groups. So you might also want to check whether the vectorized loop still
> sees fmadd use.

Both "-O2 -ffp-contract=off" and "-O3 -ffp-contract=off" produce the same calculation result as -O2.

Regarding the assembly: the vectorized version uses fmla and fmls, which are the vector forms of multiply-add/sub. It's hard to say how the grouping of multiplications and additions/subtractions differs without a detailed step-by-step comparison, though.
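For anyone following the contraction discussion, here is a minimal sketch of why fusing a multiply with an add can change a result. It is not the attached test case: the function names and the input value are made up for illustration, and the fused operation is emulated in double precision instead of being taken from generated fmadd/fmla code. In general that emulation can round twice, but for the inputs used in the check every intermediate value is exact.

```c
#include <assert.h>

/* Multiply, round the product to float, then add: two roundings. */
float mul_add_separate(float a, float b, float c)
{
    /* volatile forces the product to be rounded to float and keeps the
       compiler from contracting the two operations into one. */
    volatile float product = a * b;
    return product + c;
}

/* Emulated fused multiply-add (illustration only, not a real fmadd).
   The product of two floats is exact in double (24 + 24 bits fit in 53). */
float mul_add_fused_emulated(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}

/* Hypothetical check: a*a - 1 with a = 1 + 2^-13.
   Exact product: 1 + 2^-12 + 2^-26; rounded to float it loses the 2^-26 term. */
void contraction_demo(void)
{
    float a = 1.0f + 0x1p-13f;
    assert(mul_add_separate(a, a, -1.0f) == 0x1p-12f);
    assert(mul_add_fused_emulated(a, a, -1.0f) == 0x1p-12f + 0x1p-26f);
}
```

The two functions see identical inputs and still disagree, which is the kind of effect that different fmadd grouping between scalar and vectorized code could produce.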
[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #11 from Dmitrij Pochepko ---
Is this still under development/improvement?
[Bug c++/94532] New: ICE while compiling speccpu2017 blender
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

Bug ID: 94532
Summary: ICE while compiling speccpu2017 blender
Product: gcc
Version: tree-ssa
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: dpochepk at gmail dot com
Target Milestone: ---

Fails with the ToT revision, at least on aarch64. Passed with the 27 March build.

Log:

blender/source/blender/blenkernel/intern/curve.c:1063:6: error: missing definition
 1063 | void BKE_nurb_makeFaces(Nurb *nu, float *coord_array, int rowstride, int resolu, int resolv)
      | ^~
for SSA_NAME: _907 in statement:
fp_833 = _907;
during GIMPLE pass: vect
blender/source/blender/blenkernel/intern/curve.c:1063:6: internal compiler error: verify_ssa failed
0xf62d63 verify_ssa(bool, bool)
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/tree-ssa.c:1208
0xc30ca7 execute_function_todo
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:1992
0xc31acb do_per_function
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:1640
0xc31acb execute_todo
        /mnt/cdn/ayoudkev/gcc-10/src/gcc/gcc/passes.c:2039
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
specmake: *** [/opt/cpu2017/benchspec/Makefile.defaults:347: blender/source/blender/blenkernel/intern/curve.o] Error 1
specmake: *** Waiting for unfinished jobs...

Currently bisecting the problem.
[Bug tree-optimization/94532] [10 Regression] ICE while compiling speccpu2017 blender
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #3 from Dmitrij Pochepko ---
*** This bug has been marked as a duplicate of bug 94443 ***
[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #19 from Dmitrij Pochepko ---
*** Bug 94532 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/94532] [10 Regression] ICE while compiling speccpu2017 blender
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94532

--- Comment #4 from Dmitrij Pochepko ---
Yes. It's a duplicate of 94443.
[Bug tree-optimization/96208] New: non-power-of-2 group size can be vectorized for 2-element vectors case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96208

Bug ID: 96208
Summary: non-power-of-2 group size can be vectorized for 2-element vectors case
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dpochepk at gmail dot com
Target Milestone: ---

Created attachment 48879
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48879&action=edit
initial implementation

The current loop vectorizer only vectorizes loops whose group size is a power of 2 or 3, due to the specifics of the vector permutation generation algorithm. However, for 2-element vectors a simple permutation scheme can support any group size: insert each vector element into its required position. With only two elements per vector, this takes a reasonable number of operations.

An initial version is attached.
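To make "group size" concrete, here is a hypothetical loop (not taken from the attached patch or its tests) that reads an interleaved group of 5 doubles per iteration. With 128-bit vectors a double vector has 2 elements, so this is the shape of access the report is about. The function names are made up, and the sketch makes no claim about which GCC versions vectorize it.

```c
#include <assert.h>
#include <stddef.h>

/* Each output element is the sum of one group of 5 consecutive inputs,
   so the loads form an interleaved access group of size 5. */
void sum_groups_of_five(double *restrict out, const double *restrict in,
                        size_t n_groups)
{
    for (size_t i = 0; i < n_groups; i++)
        out[i] = in[5 * i] + in[5 * i + 1] + in[5 * i + 2]
               + in[5 * i + 3] + in[5 * i + 4];
}

/* Hypothetical usage check with two groups: 0..4 and 5..9. */
void sum_groups_demo(void)
{
    double in[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    double out[2];
    sum_groups_of_five(out, in, 2);
    assert(out[0] == 10.0);
    assert(out[1] == 35.0);
}
```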
[Bug rtl-optimization/92892] New: [AARCH64] TBL-based permutations can be implemented more efficiently for 2-element vectors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92892

Bug ID: 92892
Summary: [AARCH64] TBL-based permutations can be implemented more efficiently for 2-element vectors
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dpochepk at gmail dot com
Target Milestone: ---

The current implementation of vector element permutation generates different instructions depending on the specific form of the permutation. For permutations like:

    target[0] = src1[0];
    target[1] = src2[1];

the TBL instruction is used and the following instruction sequence is generated:

    mov tmpReg1, src1;
    mov tmpReg2, src2;
    tbl target, {tmpReg1, tmpReg2}, ...
    // tmpReg1 and tmpReg2 are numbered consecutively, as required by the tbl instruction

For 2-element vectors this sequence can be reduced to:

    mov target[0], src1[0]
    mov target[1], src2[1]

It can be reduced further to a single mov when target = src, which is already implemented in the patch prototype I'm working on.
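For reference, a permutation of this form can be written at the source level with GCC's generic vector extensions. The sketch below is an assumption about what a reproducer might look like, not the test case from this report; the type and function names are made up.

```c
#include <assert.h>

/* 2-element vector of doubles (GCC/Clang vector extension). */
typedef double v2df __attribute__((vector_size(16)));

/* target[0] = src1[0]; target[1] = src2[1]; */
v2df take_low_and_high(v2df src1, v2df src2)
{
    v2df target = { src1[0], src2[1] };
    return target;
}

/* Hypothetical usage check. */
void permutation_demo(void)
{
    v2df a = { 1.0, 2.0 };
    v2df b = { 3.0, 4.0 };
    v2df r = take_low_and_high(a, b);
    assert(r[0] == 1.0);
    assert(r[1] == 4.0);
}
```

Compiling a function like this for aarch64 and looking at the assembly is one way to see whether a tbl or an element mov (ins) is emitted for a given compiler build.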
[Bug target/93390] New: AARCH64: FP move costs needs improvements for ThunderX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93390

Bug ID: 93390
Summary: AARCH64: FP move costs needs improvements for ThunderX2
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dpochepk at gmail dot com
Target Milestone: ---
Target: aarch64-thunderx2t99

The current cpu_regmove_cost for thunderx2t99 does not seem to be optimal. Preliminary experiments and benchmarks (SPEC) show that slightly lower values for the FP-related entries give better results. I'd like to proceed on this one.
[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

--- Comment #2 from Dmitrij Pochepko ---
I have a patch which recognizes this pattern and emits ins instructions. The example in this issue's description compiles fine and produces this assembly:

       0:   6e184420    mov v0.d[1], v1.d[1]
       4:   d65f03c0    ret

However, there are slightly more complicated examples where the allocated registers prevent optimal assembly from being generated (example: part of the blender benchmark from speccpu):

    #include <math.h>

    static inline float dot_v3v3(const float a[3], const float b[3])
    {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    static inline float len_v3(const float a[3])
    {
        return sqrtf(dot_v3v3(a, a));
    }

    void window_translate_m4(float winmat[4][4], float perspmat[4][4], const float x, const float y)
    {
        if (winmat[2][3] == -1.0f) {
            /* in the case of a win-matrix, this means perspective always */
            float v1[3];
            float v2[3];
            float len1, len2;

            v1[0] = perspmat[0][0];
            v1[1] = perspmat[1][0];
            v1[2] = perspmat[2][0];

            v2[0] = perspmat[0][1];
            v2[1] = perspmat[1][1];
            v2[2] = perspmat[2][1];

            len1 = (1.0f / len_v3(v1));
            len2 = (1.0f / len_v3(v2));

            winmat[2][0] += len1 * winmat[0][0] * x;
            winmat[2][1] += len2 * winmat[1][1] * y;
        }
        else {
            winmat[3][0] += x;
            winmat[3][1] += y;
        }
    }

This will produce:

    ...
      24:   fd400010    ldr d16, [x0]
      28:   fd400807    ldr d7, [x0,#16]
    ...
      34:   6e040611    mov v17.s[0], v16.s[0]
      38:   6e0c24f1    mov v17.s[1], v7.s[1]
    ...
    # v16/d16 and d7/v7 are not used in any other places

while it can be:

    ...
      24:   fd400010    ldr d16, [x0]
      28:   fd400807    ldr d7, [x0,#16]
    ...
      38:   6e0c24f1    mov v16.s[1], v7.s[1]
    ...
    # and v16 is used instead of v17.

It looks like peephole2 can be used to optimize this. I'm currently looking into it.
[Bug target/93720] [10 Regression] vector creation from two parts of two vectors produces TBL rather than ins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720 --- Comment #6 from Dmitrij Pochepko --- Created attachment 47851 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47851&action=edit current patch version
[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #2 from Dmitrij Pochepko ---
aarch64 won't necessarily be faster with such a fix. 531.deepsjeng_r on ThunderX2 is about 0.5% slower with 31-clz(a).
[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

--- Comment #4 from Dmitrij Pochepko ---
(In reply to Andrew Pinski from comment #3)
> ...

I haven't tracked specifically what data deepsjeng passes to the logL function. I only measured totals, so the slowdown might not be directly related to logL execution time.

I also measured separate synthetic benchmarks with loop-based and non-loop-based implementations (a simple logL calculation in a loop, adding the result into an accumulator). For arguments 0 and 1 the synthetic benchmark is about 2% slower on T99.

I hope this info helps anyone who decides to work on this patch.
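For context, a rough sketch of the two variants being compared might look like the code below. This is an assumption, not the deepsjeng source or the benchmark that was actually run: the function names are made up, unsigned int is assumed to be 32 bits wide, and the zero guard in the clz variant is added because __builtin_clz(0) is undefined.

```c
#include <assert.h>

/* Loop-based floor(log2(a)); returns 0 for both a == 0 and a == 1. */
int log2_loop(unsigned int a)
{
    int result = 0;
    while ((a >>= 1) != 0)
        result++;
    return result;
}

/* clz-based variant: 31 - clz(a), assuming 32-bit unsigned int.
   __builtin_clz(0) is undefined, so zero is handled separately. */
int log2_clz(unsigned int a)
{
    return a != 0 ? 31 - __builtin_clz(a) : 0;
}

/* Synthetic benchmark body: accumulate results over [from, to). */
long accumulate_log2(int (*fn)(unsigned int), unsigned int from,
                     unsigned int to)
{
    long acc = 0;
    for (unsigned int a = from; a < to; a++)
        acc += fn(a);
    return acc;
}

/* Hypothetical check that both variants agree on a small range. */
void log2_demo(void)
{
    assert(log2_loop(0) == 0 && log2_clz(0) == 0);
    assert(log2_loop(1) == 0 && log2_clz(1) == 0);
    assert(log2_loop(1024) == 10 && log2_clz(1024) == 10);
    assert(accumulate_log2(log2_loop, 0, 4096) ==
           accumulate_log2(log2_clz, 0, 4096));
}
```

For small arguments such as 0 and 1 the loop exits almost immediately, which would be consistent with the clz form not being a clear win on those inputs.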