Commit as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/115299
* gcc.target/i386/pr86722.c: Also scan for blendvpd.
---
gcc/testsuite/gcc.target/i386/pr86722.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/testsuite/gcc.target/i386/pr86722.c
b/gc
> Are there any assumptions that BB_HEAD must be a note or label?
> Maybe we should move ix86_align_loops into a separate pass and insert
> the pass just before pass_final.
The patch inserts .p2align after the endbr pass; it can also fix the issue.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3
It results in 2 failures for x86_64-pc-linux-gnu{\
-march=cascadelake};
gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
For pr113560.c, now GCC generates mulx instead of mulq with
-march=cascadelake, which should be optimal,
>From [1]
> > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > source, especially since ix86_expand_vector_mode does have code to
> > cope with constant operand[1]s. emit_move_insn_1 doesn't check the
> > predicates anyway, so the predicate will have little effect.
> >
> > A
When none of -mprefer-vector-width, avx256_optimal/avx128_optimal,
avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will
set ix86_{move_max,store_max} to the max available vector length except
for the AVX part.
if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
&&
Looks like -mprefer-vector-width=128 doesn't impact store_max/move_max
on the GCC13/GCC12 branches, so explicitly use -mmove-max=128 and
-mstore-max=128 for those testcases.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pieces-memcpy-10.c: Use -mmove-max=256 and
-mst
For mode2 bigger than 16 bytes, when it can be allocated to FIRST_SSE_REGS,
then it can only be allocated to ALL_SSE_REGS, and it can be tieable
to all mode1 with a smaller size which is available to FIRST_SSE_REGS.
When mode2 is equal to 16 bytes, exclude non-vector modes (TI/TFmode).
This is need fo
Also try to handle redundant broadcasts when there's already a
broadcast to a bigger mode with exactly the same component value.
For broadcast, component mode needs to be the same.
For all-zeros/ones, only need to check the bigger mode.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and
> You are possibly overwriting src_related_elt - I'd suggest to either break
> here or do the loop below for each found elt?
Changed.
> Do we know that will always succeed?
1) validate_subreg allows subreg for 2 vector modes with same component modes.
2) gen_lowpart in cse.cc is defined as gen_low
For function arguments/return, when it's BLK mode, it's put in a
parallel with an expr_list, and the expr_list contains the real mode
and registers.
Current ix86_check_avx_upper_register only checked for SSE_REG_P and
failed to handle that. The patch extends the check to each subrtx.
Bootstrapped
> Can you add a testcase for this? I don't mind if it's x86 specific and
> does a bit of asm scanning.
>
> Also note that the context for this patch has changed, so it won't
> automatically apply. So be extra careful when updating so that it goes
> into the right place (all the more reason to hav
For power10, there are 3 extra REG_EQUIV notes with (fix:SI. To avoid
the failure, check that (fix:SI is from the pattern, not the NOTE.
gcc/testsuite/ChangeLog:
PR target/115365
* gcc.dg/pr100927.c: Don't scan fix:SI from the note.
---
gcc/testsuite/gcc.dg/pr100927.c | 2 +-
1 file changed
gcc/testsuite/ChangeLog:
* gcc.dg/vect/pr112325.c: Add additional option --param
max-completely-peeled-insns=200 for power64*-*-*.
---
gcc/testsuite/gcc.dg/vect/pr112325.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c
b/gcc/testsuite/gcc.
In theory, const_wide_int can also be handled with an extra check for each
component of the HOST_WIDE_INT array, and the check is needed for both
shift and bit_and operands.
I assume the optimization opportunity is rare, so the patch just adds an
extra check to make sure GET_MODE_INNER (mode) can fit into
>
> I think if you only handle CONST_INT_P, you should check just for that, and
> in both places where you check for CONST_VECTOR_DUPLICATE_P (there is one
> spot 2 lines above this).
> So add
> && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0))
> and
> && CONST_INT_P (XVECEXP (op1, 0, 0))
> tests righ
r15-1100-gec985bc97a0157 improves handling of ternlog instructions;
now GCC can recognize lots of pternlog_operand with different
variants.
The patch adjusts rtx_costs for that, so pass_combine can
reasonably generate more optimal vpternlog instructions.
I.e.
for avx512f-vpternlog-3.c, with the pa
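For reference, a C-level sketch of the kind of three-operand bitwise
expression this affects (my own example, not the testcase from the patch);
with AVX512 the loop body can be combined into a single vpternlog:

/* d = (a & b) | (~a & c) is a per-bit select; with AVX512 it maps to
   one vpternlogd per vector.  */
void
bitsel (int *restrict d, const int *restrict a,
        const int *restrict b, const int *restrict c, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (a[i] & b[i]) | (~a[i] & c[i]);
}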
Use reg_or_subregno instead.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Committed as an obvious patch.
gcc/ChangeLog:
PR target/115452
* config/i386/i386-features.cc (scalar_chain::convert_op): Use
reg_or_subregno instead of REGNO to avoid ICE.
gcc/testsui
The tune was added by PR79390 for SciMark2 on Broadwell.
For the latest GCC, with and without -mtune-ctrl=^one_if_conv_insn,
GCC generates the same binary for SciMark2. And for SPEC2017,
there's no big impact on SKX/CLX/ICX, and small improvements on SPR
and later.
gcc/ChangeLog:
* co
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization done in ix86_expand_int_vcond to match.pd.
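A minimal scalar illustration of the transform (the patch handles the
vector form in match.pd; the function names here are mine):

/* For 32-bit int x:
   x < 0 ? -1 : 0  is  x >> 31            (arithmetic shift)
   x < 0 ?  1 : 0  is  (unsigned) x >> 31 (logical shift)   */
int neg_mask (int x) { return x < 0 ? -1 : 0; }   /* -> x >> 31 */
int neg_bit  (int x) { return x < 0 ?  1 : 0; }   /* -> (unsigned) x >> 31 */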
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, aarch64-linux-gnu.
Ok for trunk?
gcc/ChangeLog:
PR target/114189
> I think the check for TYPE_UNSIGNED should be of TREE_TYPE (@0) rather
> than type here.
Changed
> Or maybe you need `types_match (type, TREE_TYPE (@0))` too.
And use tree_nop_conversion_p (type, TREE_TYPE (@0)) and add view_convert to
rshift.
Bootstrapped and regtested on x86_64-pc-linux-gnu
Here's the patch committed.
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization done in ix86_expand_int_vcond to match.pd.
gcc/ChangeLog:
PR target/114189
* match.pd: Simplify a < 0 ? -1 : 0 to (signed) >> 31 and a
416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
The commit adjusts rtx_cost of mem to reduce the cost of (add op0 disp).
But the cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea.
That is the case in the PR; the patch uses the lower cost to enable more
simplification and fix th
ix86_hardreg_mov_ok is added by r11-5066-gbe39636d9f68c4
>The solution proposed here is to have the x86 backend/recog prevent
>early RTL passes composing instructions (that set likely_spilled hard
>registers) that they (combine) can't simplify, until after reload.
>We allow sets fr
For the pattern below, RA may still allocate r162 as a v/k register and try
to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi,
which results in a linker error.
(set (reg:DI 162)
(mem/u/c:DI
(const:DI (unspec:DI
[(symbol_ref:DI ("a") [flags 0x60] )]
(insn 98 94 387 2 (parallel [
(set (reg:TI 337 [ _32 ])
(ashift:TI (reg:TI 329)
(reg:QI 521)))
(clobber (reg:CC 17 flags))
]) "test.c":11:13 953 {ashlti3_doubleword}
is reloaded into
(insn 98 452 387 2 (parallel [
(se
Ok for trunk?
---
htdocs/gcc-14/changes.html| 7 +++
htdocs/gcc-14/porting_to.html | 9 +
2 files changed, 16 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index ca4cae0f..b023a4b9 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/ch
So when both the source operand and the dest operand require avx512 MASK_REGS,
RA can allocate a MASK_REGS register instead of a GPR to avoid reloading it
from GPR to MASK_REGS.
It's similar to what was done for the logic patterns.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
The Intel Decimal Floating-Point Math Library is available as open-source on
Netlib[1].
[1] https://www.netlib.org/misc/intel/.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
libgcc/config/libbid/ChangeLog:
* bid128_fma.c (add_and_round): Fix bug: the result
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (usdot_prodv*qi): Extend to VI1_AVX512
with vpmaddwd when avxvnni/avx512vnni is not available.
---
gcc/config/i386/sse.md | 55 +++---
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
PR target/113079
* config/i386/mmx.md (usdot_prodv8qi): New expander.
(sdot_prodv8qi): Ditto.
(udot_prodv8qi): Ditto.
(usdot_prodv4hi): Ditto.
(udot_prodv4
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready push to trunk.
gcc/ChangeLog:
PR target/113090
* config/i386/i386-expand.cc
(expand_vec_perm_punpckldq_pshuf): New function.
(ix86_expand_vec_perm_const_1): Try
expand_vec_perm_punpckldq_pshuf f
The Fortran standard does not specify what the result of the MAX
and MIN intrinsics is if one of the arguments is a NaN. So it
should be ok to transform the reduction for IFN_COND_MIN with vectorized
COND_MIN and REDUC_MIN.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and bac
As shown by the testcase in the PR, O3 cunrolli may prevent vectorization of the
innermost loop and increase register pressure.
The patch removes the 1/3 reduction of unr_insn for the innermost loop for UL_ALL.
ul != UL_ALL is needed since some small-loop complete unrolling at O2 relies
on the reduction.
Bootstrappe
> Can the above loop be a part of ix86_check_avx_upper_register, so this
> function would scan the full RTX for avx upper register?
Changed, also adjust ix86_check_avx_upper_stores and ix86_avx_u128_mode_needed
to either inline the old ix86_check_avx_upper_register or replace
FOR_EACH_SUBRTX
with
*_eq3_1 supports
nonimm_or_0_operand for op1 and op2, but pass_combine would fail to lower
the avx512 comparison back to the avx2 one when op1/op2 is const0_rtx. It's
because the splitter only supports nonimmediate_operand.
Failed to match this instruction:
(set (reg/i:V16QI 20 xmm0)
(vec_merge:V16QI (con
It fixes the regression by
a51f2fc0d80869ab079a93cc3858f24a1fd28237 is the first bad commit
commit a51f2fc0d80869ab079a93cc3858f24a1fd28237
Author: liuhongt
Date: Wed Sep 4 15:39:17 2024 +0800
Handle const0_operand for *avx2_pcmp3_1.
caused
FAIL: gcc.target/i386/pr59539-1.c scan-assembler
According to the Intel Software Optimization Manual[1], the Redwood Cove
microarchitecture supports LD+OP and MOV+OP macro fusions.
The patch enables MOV+OP tune for GNR.
[1]
https://www.intel.com/content/www/us/en/content-details/814198/intel-64-and-ia-32-architectures-optimization-reference-manual
GCC12 enables vectorization for O2 with the very-cheap cost model, which is
restricted to constant tripcount. The vectorization capacity is very limited
with consideration of the codesize impact.
The patch extends the very-cheap cost model a little bit to support variable
tripcount.
But still disable peel
gcc/testsuite/ChangeLog:
* gcc.dg/fstack-protector-strong.c: Adjust
scan-assembler-times.
* gcc.dg/graphite/scop-6.c: Add
-Wno-aggressive-loop-optimizations.
* gcc.dg/graphite/scop-9.c: Ditto.
* gcc.dg/tree-ssa/ivopts-lt-2.c: Add -fno-tree-vectorize.
>So should we adjust very-cheap to allow niter peeling as proposed or
>should we switch the default at -O2 to cheap?
I prefer the former.
Update in V2:
Adjust testcase after relax O2 vectorization.
Ok for trunk?
gcc/ChangeLog:
* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
r15-1737-gb06a108f0fbffe lowers AVX512 kmask comparisons to AVX2 ones,
but wrongly lowered unsigned comparisons to signed ones; for unsigned
comparisons, only EQ/NEQ can be lowered.
The commit fixes that.
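A scalar sketch of why only EQ/NEQ are safe (my own values): AVX2 only has
signed integer compares, and reinterpreting an unsigned ordering compare as
signed changes the result, while equality is unaffected.

#include <stdio.h>
int
main (void)
{
  unsigned char a = 0x80, b = 0x01;
  /* Unsigned: 128 > 1 is true; as signed chars, -128 > 1 is false.  */
  printf ("%d %d\n", a > b, (signed char) a > (signed char) b);   /* 1 0 */
  /* Equality does not depend on signedness.  */
  printf ("%d %d\n", a == b, (signed char) a == (signed char) b); /* 0 0 */
  return 0;
}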
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
Update in V3.
>The testcase looks bogus:
>
> b[i+k] = b[i+k-5] + 2;
>
>accesses b[-3], can you instead adjust the inner loop to start with k == 4?
Changed, also adjust b[100] to b[200] to avoid array out of bounds.
>Please remove this testcase - even with fully masking we'd need alias
>versi
>We'd also need to update the documentation:
>... The @samp{very-cheap} model only
>allows vectorization if the vector code would entirely replace the
>scalar code that is being vectorized. For example, if each iteration
>of a vectorized loop would only be able to handle exactly four iterations
>
For masked FMA, there are 2 forms of RTL representation:
1) (vec_merge (fma: op2 op1 op3) op1 mask)
2) (vec_merge (fma: op1 op2 op3) op1 mask)
It's because op1 and op2 are commutative in RTL (the second op1 is
written as (match_dup 1));
we once tried to replace (match_dup 1)
with (match_operand:VFH_AV
For x86 masked fma, there're 2 rtl representations
1) (vec_merge (fma op2 op1 op3) op1 mask)
2) (vec_merge (fma op1 op2 op3) op1 mask).
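A C sketch of the per-element semantics of the two forms above (op1/op2/op3/mask
are illustrative names, not the patch's operands); since fma's first two operands
commute, both forms describe the same operation:

#include <math.h>
/* form 1: dst[i] = mask[i] ? fma (op2[i], op1[i], op3[i]) : op1[i]
   form 2: dst[i] = mask[i] ? fma (op1[i], op2[i], op3[i]) : op1[i]  */
void
masked_fma (float *dst, const float *op1, const float *op2,
            const float *op3, const unsigned char *mask, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = mask[i] ? fmaf (op1[i], op2[i], op3[i]) : op1[i];
}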
(define_insn "_fmadd__mask"
  [(set (match_operand:VFH_AVX512VL 0 "register_operand" "=v,v")
    (vec_merge:VFH_AVX512VL
      (fma:VF
diate_operand" "0")) to enable more flexibility for pattern
match and recog, but it triggered an ICE in reload (reload can handle
at most one operand with the "0" constraint).
So we need either add 2 patterns in the backend or just do the
canonicalization in the middle-end.
The
---
htdocs/gcc-15/changes.html | 10 ++
1 file changed, 10 insertions(+)
diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index 6dc46a52..8a238256 100644
--- a/htdocs/gcc-15/changes.html
+++ b/htdocs/gcc-15/changes.html
@@ -36,6 +36,16 @@ a work-in-progress.
General
Also add the hard_float target requirement to avoid failures on arm-eabi, cortex-m0.
Verified with cross compilers for powerpc64le-linux-gnu and sparc-sun-solaris2.11.
Ready push to trunk.
gcc/testsuite/ChangeLog:
PR testsuite/115365
* gcc.dg/pr100927.c: Adjust testcase to avoid scan FIX in REG_EQUIV.
-
According to the Intel SOM[1], for Crestmont, most 256-bit Intel AVX2
instructions can be decomposed into two independent 128-bit
micro-operations, except for a subset of Intel AVX2 instructions,
known as cross-lane operations, which can only compute the result for an
element by utilizing one or more source
ped and regtested on x86_64-pc-linux-gnu{-m32,}.
The patch generally improves SPEC2017 allrate geomean by 1% with
-march=sierraforest -Ofast on SRF.
Ready push to trunk.
liuhongt (2):
[x86] Add new microarchitecture tune for SRF/GRR/CWF.
[x86] Add a new tune avx256_avoid_vec_perm for SRF.
For Crestmont, 4-operand vex blendv instructions come from MSROM and
are slower than the 3-instruction sequence (op1 & mask) | (op2 & ~mask).
The legacy blendv instruction can still be handled by the decoder.
The patch adds a new tune which is enabled for all processors except
for SRF/CWF. It will use vpan
The optimization relies on other patterns which are only available in
GCC14 and above, so restore the xfail for the GCC13/12 branches.
Pushed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512bw-pr103750-2.c: Add xfail for ia32.
---
gcc/testsuite/gcc.target/i386/avx512bw-pr1
r15-974-gbf7745f887c765e06f2e75508f263debb60aeb2e has optimized for
jcc/setcc, but missed movcc.
The patch supports movcc.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
PR target/117232
* config/i386/sse.md (*kortest_cmp_movqicc):
r12-6103-g1a7ce8570997eb combines vpcmpuw + zero_extend to vpcmpuw
with the pre_reload splitter, but the splitter transforms the
zero_extend into a subreg which makes reload think the upper part is
garbage; that's not correct.
The patch adjusts the zero_extend define_insn_and_split to
define_insn to
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk and backport to release branch.
gcc/ChangeLog:
PR target/117240
* config/i386/i386-builtin.def: Add avx/avx512f to vaes
ymm/zmm builtins.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr11
It's supported by a vector permutation with a zero vector.
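A scalar sketch of why a permutation with a zero vector works (bit-layout
only; the helper name is mine): a bfloat16 value is exactly the upper 16 bits
of the corresponding float, so extending it means interleaving each bf16
element with a zero element below it.

#include <stdint.h>
#include <string.h>
float
bf16_to_float (uint16_t b)
{
  uint32_t bits = (uint32_t) b << 16;  /* bf16 payload in the upper half, zeros below */
  float f;
  memcpy (&f, &bits, sizeof f);
  return f;
}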
gcc/ChangeLog:
* config/i386/i386-expand.cc
(ix86_expand_vector_bf2sf_with_vec_perm): New function.
* config/i386/i386-protos.h
(ix86_expand_vector_bf2sf_with_vec_perm): New Declare.
* config/i386/mmx.m
Generate the native instruction whenever possible; otherwise use a vector
permutation with odd indices.
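A scalar sketch of the fallback (the helper name is mine): the truncating
conversion keeps the upper 16-bit half of each float, i.e. the odd-indexed
16-bit elements on little-endian, which is what the odd-index permutation
selects.

#include <stdint.h>
#include <string.h>
uint16_t
float_to_bf16_trunc (float f)
{
  uint32_t bits;
  memcpy (&bits, &f, sizeof bits);
  return bits >> 16;  /* keep the upper 16 bits; truncation, no rounding */
}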
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
* config/i386/i386-expand.cc
(ix86_expand_vector_sf2bf_with_vec_perm): New function.
force_operand issues an ICE when the input
is (subreg:DI (us_truncate:V8QI)), probably because it's an
invalid rtx, so refine the backend patterns for that.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
PR target/117318
* config/i386/s
Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT.
Otherwise NULL_RTX.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
* config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro.
gcc/testsuite/ChangeLog:
* gcc.dg/rtl/x8
Disable the tune for Zhaoxin/CLX/SKX since it could hurt performance
for the inner loop.
According to the last test, align_loop helps performance for SPEC2017 on EMR and
Znver4, so I'll still keep the tune for the generic part.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comment?
gcc/
The hw instruction doesn't raise exceptions, turns sNaN into qNaN quietly,
and always rounds to nearest (even). Output denormals are always
flushed to zero and input denormals are always treated as zero. MXCSR
is neither consulted nor updated.
W/o native instructions, flag_unsafe_math_optimizations is neede
When a loop requires any kind of versioning which could increase register
pressure too much, and it's in a deeply nested big loop, don't do
vectorization.
I tested the patch with both Ofast and O2 for SPEC2017; besides 548.exchange_r,
the other benchmarks produce the same binary.
Bootstrapped and regtested on x
r15-919-gef27b91b62c3aa removed the 1/3 size reduction for the innermost
loop, but it doesn't accurately remember what's "innermost" for the 2
testcases in PR117888.
1) For pass_cunroll, the "innermost" loop could be an originally outer
loop with inner loop completely unrolled by cunrolli. The patch moves
l
r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in the cost
calculation when the preferred register class is not known yet.
It regressed powerpc PR109610 and PR109858; it looks too aggressive to use
NO_REGS when the mode can be allocated with GENERAL_REGS.
The patch takes a step back, still use
2 and r14-1252 to
GCC13 and GCC12 release branch.
Note r14-1252 is a fix to r14-172 which regressed powerpc testcase in PR109610.
I have verified the fix also works on GCC13/GCC12 branch for PR109610.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, and aarch64-linux-gnu.
Ok for backport
gcc/ChangeLog:
PR rtl-optimization/108707
* ira-costs.cc (scan_one_insn): Use NO_REGS instead of
GENERAL_REGS when preferred reg_class is not known.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr108707.c: New test.
(cherry picked from commit 0368d169492017cfab5622
After the optimization for RA, the memory op is not propagated into
instructions (>1), and it makes the testcases not generate vxorps since
the memory is loaded into the dest, and the dest is never unused now.
So rewrite the testcases to make the codegen more stable.
gcc/testsuite/ChangeLog:
* gcc.target/
> Please pass 'sbitmap' instead of auto_sbitmap&, it should properly
> decay to that. Applies everywhere I think.
>
Changed.
> In fact I wonder whether we should simply populate the bitmap
> from a
>
> for (auto loop : loops_list (cfun, LI_ONLY_INNERMOST))
> bitmap_set_bit (original_innerm
It could cause a weird spill in RA when register pressure is high.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
BTW, it's difficult to get a decent testcase for the issue since the spill
is not exposed in a simple testcase.
gcc/ChangeLog:
PR target/117562
Since there's a regression when using vpermq, and it's manually disabled by
!TARGET_AVX512BW, I remove the code related to vpermq and make
ix86_expand_vecop_qihi2 only handle the vpmovbw + op + vpmovwb case.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
PR target/118489
* config/i386/sse.md (VF1_AVX512BW): Fix typo.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr118489.c: New test.
---
gcc/config/i386/sse.md |
*jcc only supports ix86_fp_comparison_operator for CCFP; when the
comparison code is LT, there's an ICE. W/o AVX10.2 it's ok, since
do_compare_rtx_and_jump will transform LT to GT, but w/ AVX10.2 it
goes directly into ix86_expand_branch, which doesn't handle it.
Use ix86_fp_comparison_operator in cbran
It looks like the testcase is fragile; it's supposed to check the
compiler's ability to generate the code_6_gottpoff_reloc instruction, but
it failed since there's a seg_prefixed memory
usage (r14-6242-gd564198f960a2f).
mov r13, QWORD PTR j@gottpoff[rip]
mov r12, QWORD PTR a@gottpo
Looks like those operand modifiers are only for internal use
in .md files, so for simplicity I'll just remove them from extend.texi.
Ready push to trunk.
gcc/ChangeLog:
PR documentation/108134
* doc/extend.texi: Remove documents from r11-344-g0fec3f62b9bfc0.
---
gcc/doc/extend
In some benchmarks, I noticed stv failed because the cost was unprofitable, but
the igain is inside the loop while the sse<->integer conversion is outside the
loop; the current cost model doesn't consider the frequency of those gains/costs.
The patch weights those costs with frequency just like LRA does.
Bootstrapped a
From: "hongtao.liu"
When FMA is available, the N-R step can be rewritten as
a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
which generates 2 FMAs.[1]
[1] https://bugs.llvm.org/show_bug.cgi?id=21385
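A sketch of that expansion with intrinsics rather than the backend RTL
(assumes -mfma; illustrative only, not the patch's implementation):

#include <immintrin.h>
static inline __m128
div_nr_fma (__m128 a, __m128 b)
{
  __m128 x0 = _mm_rcp_ps (b);          /* reciprocal estimate of b         */
  __m128 t  = _mm_mul_ps (x0, a);      /* rcp(b) * a                       */
  __m128 e  = _mm_fnmadd_ps (t, b, a); /* a - rcp(b) * a * b      (FMA #1) */
  return _mm_fmadd_ps (e, x0, t);      /* e * rcp(b) + rcp(b) * a (FMA #2) */
}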
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog
Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand,
or vpandn, the current register_operand/vector_operand could lose some
optimization opportunities.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/predicates.md (vector
This is originally from [1]
For the command line, or target attribute, the actual operation goes
into ix86_handle_option, and as long as we get it right in this
ix86_handle_option, everything else should be fine.
As for the macros generated by the mask name (TARGET_SSE4_1_P), their
mea
cat test.c
void
foo ()
{
__mmask8 mask1 = _mm_cmpeq_epu32_mask (pi128[0], pi128[1]);
a = mask1 & 15;
}
with -O2 -march=x86-64-v4, gcc generates
foo():
movq    pi128(%rip), %rax
vmovdqa (%rax), %xmm0
vpcmpeqd        16(%rax), %xmm0, %k0
kmovb %k0, %eax
For the testcase in the PR, we have
br64 = br;
br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
n->n: 0x300200.
n->range: 32.
n->type: uint64.
The original code assumes n->range is the same as TYPE_PRECISION (n->type),
and tries to rotate the mask from 0x30200
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi -> v2qi vector
truncates.
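A minimal testcase-style sketch of code that wants such a pattern
(hypothetical, not necessarily the testcase added by the patch):

/* With -O2 the two lanes can be vectorized as a v2si -> v2hi truncation.  */
void
trunc_v2si_v2hi (short *restrict dst, const int *restrict src)
{
  dst[0] = (short) src[0];
  dst[1] = (short) src[1];
}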
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/92658
* config/i386/mmx.md (truncv2hiv2qi2): New define_insn.
(truncv2si2): Ditto.
gcc/testsu
We already use an intermediate type in case WIDEN, but not for NONE;
this patch extends that.
I didn't do that in pattern recog since we need to know whether the
stmt belongs to any slp_node to decide the vectype; the related optabs
are checked according to vectype_in and vectype_out. For non-sl
This patch only supports vec_pack/unpacks optabs for vector modes whose length >=
128.
For 32/64-bit vectors, they're mostly handled by the BB vectorizer with
truncmn2/extendmn2/fix{,uns}_truncmn2.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
*
r14-1145 folds the intrinsics into gimple ABS_EXPR, which has UB for
TYPE_MIN, but PABSB will store the unsigned result into dst. The patch
uses ABSU_EXPR + VCE instead of ABS_EXPR.
Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
vector absm2 is guarded with TARGET_MMX_WITH_SSE.
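A scalar illustration of the difference (my own example): signed abs overflows
for the most negative value, while the unsigned-absolute result the instruction
produces is well defined.

#include <stdint.h>
/* ABS_EXPR on int8_t has UB for -128; ABSU_EXPR yields (uint8_t) 128,
   matching what pabsb stores, and a VIEW_CONVERT_EXPR reinterprets the
   unsigned result back to the original vector element type.  */
uint8_t
abs_u8 (int8_t x)
{
  return x < 0 ? (uint8_t) -(unsigned int) x : (uint8_t) x;  /* -128 -> 128 */
}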
B
Since mask < 0 will always be false with -funsigned-char, but
vpblendvb needs to check the most significant bit.
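A minimal illustration of the issue (hypothetical names): with -funsigned-char
a plain char is unsigned, so the folded comparison is always false, while
vpblendvb selects on bit 7 of the mask byte.

/* With -funsigned-char, m < 0 folds to 0 and the select is optimized away,
   but the intended vpblendvb semantics key off the sign bit of m.  */
unsigned char
blend1 (char m, unsigned char a, unsigned char b)
{
  return m < 0 ? a : b;
}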
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to GCC12/GCC13 release branch?
gcc/ChangeLog:
PR target/110108
* config/i386/i386-b
> I think this is a better patch and will always be correct and still
> get folded at the gimple level (correctly):
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..02bf5ba93a5 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -18561,8
Since there's no evex version for vpcmpeq ymm, ymm, ymm.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk and backport to GCC13.
gcc/ChangeLog:
PR target/110227
* config/i386/sse.md (mov_internal>): Use x instead of v
for alternative 2 sinc
packuswb/packusdw do unsigned saturation for a signed source, but rtl
us_truncate means unsigned saturation for an unsigned source.
So for the value -1, packuswb will produce 0, but us_truncate produces 255.
So for value -1, packuswb will produce 0, but us_truncate produces
255. The patch reimplement those related patterns and functions with
UNSPEC_US_TRUNCATE instead of
The packing in vpacksswb/vpackssdw is not a simple concat; it's an
interleave from src1 and src2 for every 128 bits (or 64 bits for the
ss_truncate result).
I.e.
dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127] = ss_truncate (src2[0-127])
dst[0-63] =
optimize_insn_for_speed () in assemble output is not aligned with the
splitter condition, and it causes an ICE when building SPEC2017
blender_r.
Not sure if ctrl is supposed to be reliable in assemble output; the patch just
removes that as a workaround.
Bootstrapped and regtested on x86_64-pc-linux-gnu
> I see some regressions most likely with this change on i686-linux,
> in particular:
> +FAIL: gcc.dg/pr107547.c (test for excess errors)
> +FAIL: gcc.dg/torture/floatn-convert.c -O0 (test for excess errors)
> +UNRESOLVED: gcc.dg/torture/floatn-convert.c -O0 compilation failed to
> produce execu
For Intel processors, after TARGET_AVX, vmovdqu is optimized to be as fast
as vlddqu, so UNSPEC_LDDQU can be removed to enable more optimizations.
Can someone confirm this with the AMD folks?
If AMD doesn't like such an optimization, I'll put my optimization under
micro-architecture tuning.
Bootstrapped and regte
Prevent rtl optimization of vec_duplicate + zero_extend to
vpbroadcastm since there could be an extra kmov after RA.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
PR target/110788
* config/i386/sse.md (avx512cd_maskb_vec_dup): Add
After
b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit
commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb
Author: Jan Hubicka
Date: Fri Jul 28 09:16:09 2023 +0200
loop-split improvements, part 1
Now we have
vpbroadcastd %ecx, %xmm0
vpaddd .LC3(%rip), %xmm0, %xmm0
v
AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
Also remove scalar mode from fmaddsub/fmsubadd pattern since there's
no scalar instruction for that.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/81904
* config/i3
In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics
after AVX2; it was rejected as
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart