[Bug tree-optimization/57223] Auto-vectorization fails for nested multiple loops depending on type of array
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57223

Usishchev Yury changed:

           What    |Removed                     |Added
                 CC|                            |y.usishchev at samsung dot com

--- Comment #3 from Usishchev Yury ---
I'm testing it on current trunk, and the second loop is not vectorized with either floating point or integer types.

For floating point types it is not vectorized due to control flow in the loop:

    : // ...
    if (t_56 > _61) goto ; else goto ;
    :
    :
    # iftmp.2_7 = PHI <_61(16), t_56(15)>

This could be optimized to MIN_EXPR in the phiopt pass, but it is not because of NaNs:

tree-ssa-phiopt.c:876:

    /* The optimization may be unsafe due to NaNs. */
    if (HONOR_NANS (TYPE_MODE (type)))
      return false;

If compiled with -ffinite-math-only, the second loop is still not vectorized:

    not_always_good.c:16:7: note: not vectorized: latch block not empty.

(The same occurs with integer types.) The latch block in that case is:

    :
    pretmp_176 = *prephitmp_173;
    goto ;

The statement here is generated in the pre pass. For vectorization to work, we can either not generate it in pre or move it into the head of the loop in the vectorizer. Right now I'm trying to find out how to prevent pre from generating statements in empty latch blocks.
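A minimal sketch of why the NaN check quoted above matters. The function name select_min is hypothetical, and the function only mirrors the shape of the quoted conditional (take _61 when t_56 > _61, otherwise t_56). It is not the reporter's testcase. Any ordered comparison involving a NaN is false, so the conditional always falls through to its "else" operand, and the result depends on which argument is the NaN. This asymmetry is presumably what the HONOR_NANS test guards against when phiopt declines to form a MIN_EXPR.

```c
#include <math.h>

/* Hypothetical helper mirroring the quoted conditional:
   if (t > x) result = x; else result = t;  */
static float select_min(float t, float x)
{
    return (t > x) ? x : t;
}

/* With a NaN in the first position the comparison is false,
   so the NaN itself is returned.  */
static int nan_first_yields_nan(void)
{
    return isnan(select_min(NAN, 1.0f));
}

/* With a NaN in the second position the comparison is also false,
   so the ordinary number is returned.  */
static int nan_second_yields_number(void)
{
    return select_min(1.0f, NAN) == 1.0f;
}
```

Without NaNs the function behaves as an ordinary minimum, which fits the report that -ffinite-math-only gets past this first obstacle.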
[Bug target/66433] New: Arm NEON postincrement optimization missed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66433

            Bug ID: 66433
           Summary: Arm NEON postincrement optimization missed
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: y.usishchev at samsung dot com
  Target Milestone: ---

Created attachment 35701
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35701&action=edit
test with vld and vst

GCC from trunk, configured with --target=armv7l-tizen-linux-gnueabi and run with options "-O2 -mfpu=neon" on the attached testcase, does not generate autoincrement addressing for vld/vst instructions. The auto-inc-dec pass misses the opportunity to optimize vld/vst instructions. For code such as

    for () { //some loop
      s0_32x4 = vld1q_u32(s);
      s1_32x4 = vld1q_u32(s+4);
      s+=8;
      ...
    }

gcc generates

    vld1.32 {d6-d7}, [r1]
    add.w r4, r1, #16
    adds r1, #32
    vld1.32 {d28-d29}, [r4]

instead of

    vld1.32 {d6-d7}, [r1]!
    vld1.32 {d28-d29}, [r1]!

This is caused by a presumably wrong cost estimate: a vld1.32 instruction without increment costs 4, but with increment its cost is 16 (gcc/config/arm/arm.c:9415):

    case MEM:
      if (REG_P (XEXP (x, 0)))
        *cost = COSTS_N_INSNS (1);
      ...
      else
        *cost = COSTS_N_INSNS (ARM_NUM_REGS (mode));
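The 4 versus 16 figures can be reproduced by hand with a small sketch. It rests on two assumptions: that COSTS_N_INSNS (N) expands to N * 4, and that ARM_NUM_REGS yields the number of 4-byte words needed to hold the mode. The macros below are local stand-ins written for illustration, not GCC's own headers. The function mem_cost is hypothetical and covers only the two branches quoted above, not the elided ones.

```c
/* Local stand-ins modeled on GCC's macros (assumed values for a
   32-bit ARM target); these are not taken from GCC's headers.  */
#define UNITS_PER_WORD 4
#define COSTS_N_INSNS(N) ((N) * 4)
#define NUM_REGS_FOR_SIZE(BYTES) \
  (((BYTES) + UNITS_PER_WORD - 1) / UNITS_PER_WORD)

/* Hypothetical reduction of the quoted MEM case: a plain register
   address costs one instruction, any other address form is charged
   one instruction per word of the accessed mode.  */
static int mem_cost(int address_is_plain_reg, int mode_size_bytes)
{
    if (address_is_plain_reg)
        return COSTS_N_INSNS(1);
    return COSTS_N_INSNS(NUM_REGS_FOR_SIZE(mode_size_bytes));
}
```

Under these assumptions, a 16-byte quad-register access through [r1] costs mem_cost(1, 16), which is 4. The same access through a post-increment address costs mem_cost(0, 16), which is 16. This matches the numbers in the report and would explain why auto-inc-dec considers the post-increment form unprofitable.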
[Bug target/66433] Arm NEON postincrement optimization missed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66433

--- Comment #1 from Usishchev Yury ---
Created attachment 35702
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35702&action=edit
patch with fix

I have attached a patch that, in my opinion, fixes the issue.