vectorizable_nonlinear_induction doesn't always guard
vect_peel_nonlinear_iv_init when it's called by
vect_update_ivs_after_vectorizer which is supposed to be guarded
by vect_can_advance_ivs_p. The patch moves part of the code from
vectorizable_nonlinear_induction into a new function
vect_can_peel_nonlinea
Adjust lrintmn2 operand predicates according to real instructions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok as an obvious fix?
gcc/ChangeLog:
PR target/107185
* config/i386/i386.md (lrint2): Swap
predicate of operands[0] and operands[1].
gcc/testsuite
For general_reg_operand, it will be split into xor + not.
For mask_reg_operand, it will be split with UNSPEC_MASK_OP just
like what we did for other logic operations.
The patch will optimize xor+not to kxnor when possible.
Bootstrapped and regtested on x86_64-pc-linux-gnu.
Ok for trunk?
g
.. in ix86_expand_vector_move and
ix86_convert_const_wide_int_to_broadcast (called by the former).
ix86_expand_vector_move is called by emit_move_insn, which is used by
many pre_reload passes; ix86_gen_scratch_sse_rtx will break data flow
when there's explicit usage of xmm7/xmm15/xmm31.
Bootstrapped
This is an incremental patch based on [1]; it enables the optimization below:
- vbroadcastss .LC1(%rip), %xmm0
+ movl $-45, %edx
+ vmovd %edx, %xmm0
+ vpshufd $0, %xmm0, %xmm0
According to microbenchmark, it's faster than broadcast from memory.
[1] https://gcc.gnu.org/
For parameter passing through the stack, a vectorized load from a parm_decl
in the callee may trigger a serious STLF issue. This is why GCC 12 regresses
50% for cray at -O2 compared to GCC 11.
The patch adds an extremely large number to stmt_cost to prevent
vectorization for loads from parm_decl under very-cheap co
>What happens if you set preferred_for_speed to false for alternative 1?
It works, and I've removed the newly added splitter in this patch.
Also I tried to do similar things to *vec_dup with mode iterator
AVX2_VEC_DUP_MODE, but it hit an ICE during reload since x86 doesn't have a
direct move for QImode
This patch only handles pure SLP for by-value passed parameters, which
has nothing to do with IPA but with the psABI. For by-reference passed
parameters, IPA is required.
The patch is aggressive in determining STLF failure: any
unaligned_load of a parm_decl passed on the stack is assumed to have an STLF
stall issue. It
__builtin_ia32_blendvpd is defined under sse4.1 and gimple folded
to ((v2di) c) < 0 ? b : a, where vec_cmpv2di is under sse4.2, w/o which
it's veclowered to scalar operations and not combined back in rtl.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to main trunk.
gcc/ChangeLog:
Set the attr from HImode to HFmode, which uses vmovsh instead of vmovw for
movement between sse registers.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for main trunk?
gcc/ChangeLog:
PR target/104974
* config/i386/i386.md (*movhi_internal): Set attr type from HI
Failed to match this instruction:
(set (reg/v:SI 88 [ z ])
(if_then_else:SI (eq (zero_extract:SI (reg:SI 92)
(const_int 1 [0x1])
(zero_extend:SI (subreg:QI (reg:SI 93) 0)))
(const_int 0 [0]))
(reg:SI 95)
(reg:SI 94)))
but it's equal t
In validate_subreg, both (subreg:V2HF (reg:SI) 0)
and (subreg:V8HF (reg:V2HF) 0) are valid, but not
for (subreg:V8HF (reg:SI) 0) which causes ICE.
Ideally it should be handled in validate_subreg to support
subreg for all modes available in TARGET_CAN_CHANGE_MODE_CLASS, but
that would be too risky
pinsrw is available for both reg and mem operand under sse2.
pextrw requires sse4.1 for mem operands.
The patch changes attr "isa" for the pinsrw mem alternative from sse4_noavx
to noavx, which enables the optimization below.
-movzwl (%rdi), %eax
pxor %xmm1, %xmm1
-pinsrw $0, %
Since cfg is freed before machine_reorg, just do a rough calculation
of the window according to the layout.
Also according to an experiment on CLX, set window size to 64.
Currently only V2DFmode loads are handled since they don't need any scratch
registers, and it's sufficient to recover cray performanc
Update in V2:
1. Use get_insns instead of FOR_EACH_BB_CFUN and FOR_BB_INSNS.
2. Return for any_uncondjump_p and ANY_RETURN_P.
3. Add dump info for splitting instructions.
4. Restrict ix86_split_stlf_stall_load under TARGET_SSE2.
Since cfg is freed before machine_reorg, just do a rough calculation
of
Update in V3:
1. Add -param=x86-stlf-window-ninsns= (default 64).
2. Exclude call in the window.
Since cfg is freed before machine_reorg, just do a rough calculation
of the window according to the layout.
Also according to an experiment on CLX, set window size to 64.
Currently only handle V2DFmod
There's no masked vpandw or vpandb, similar for vpxor/vpor/vpandn.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (_andnot3_mask):
Removed.
(_andnot3): Disable V*HFmode patterns
for mask_applied
> your description above hints at that the actual modes involved in the
> vec_pack_sbool_trunc are the same so the TYPE_MODE (narrow_vectype)
> and TYPE_MODE (vectype) are not the actual modes participating. I think
> it would be way better to fix that.
>
> I suppose that since we know TYPE_VECTOR
The pattern converts (cond (cmp a b) (convert c) (convert d))
to (convert (cond (cmp a b) c d)) when
1) types_match (c, d)
2) single_use for (convert c) and (convert d)
3) TYPE_PRECISION (TREE_TYPE (c)) == TYPE_PRECISION (TREE_TYPE (a))
4) INTEGRAL_TYPE_P (TREE_TYPE (c))
The pattern can save pack
For i386, it enables optimization like:
vmovd %xmm0, %edx
- vmovd %xmm0, %eax
+ movl %edx, %eax
Bootstrapped and regtested on CLX for both
x86_64-pc-linux-gnu{-m32,} and
x86_64-pc-linux-gnu{-m32\ -march=native,\ -march=native}
Ok for trunk?
gcc/ChangeLog:
PR
ix86_expand_vector_init expects vals to be a parallel containing
values of individual fields which should be either element mode of the
vector mode, or a vector mode with the same element mode and smaller
number of elements.
But in the expander ashlv16qi3, the second operand is SImode which
can't
>But in principle @2 or @3 could safely differ in sign, you'd then need to
>ensure
>to insert sign conversions to @2/@3 to the signedness of @4/@5.
Changed.
>you are not testing for this anywhere?
It's tested in vect_recog_cond_expr_convert_pattern; I've moved it to match.pd
>Btw, matching up the
>>> Confirmed. When uncond_op is expensive (there's *div amongst them) that's
>>> definitely unwanted. OTOH when it is cheap then combining will reduce
>>> latency.
>>>
>>> GIMPLE wise it's a neutral transform if uncond_op is not single-use unless
>>> we need two v_c_es.
>>
>> We can leave it t
> > +(match (cond_expr_convert_p @0 @2 @3 @6)
> > + (cond (simple_comparison@6 @0 @1) (convert@4 @2) (convert@5 @3))
> > + (if (types_match (TREE_TYPE (@2), TREE_TYPE (@3))
>
> But in principle @2 or @3 could safely differ in sign, you'd then need to
> ensure
> to insert sign conversions to @2/@3
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/cpuid.h (bit_MPX): Removed.
(bit_BNDREGS): Ditto.
(bit_BNDCSR): Ditto.
---
gcc/config/i386/cpuid.h | 5 -
1 file changed, 5 deletions(-)
diff --git a/gcc/config/i386/cp
> I find this quite unreadable, it looks like if @2 and @3 are treated
> differently. I think keeping the old 3 lines and just adding
> && (TYPE_PRECISION (TREE_TYPE (@0)) >= TYPE_PRECISION (type)
> || (TYPE_UNSIGNED (TREE_TYPE (@2))
> == TYPE_UNSIGNED (TREE_TYPE (@3)
For evex encoding vp{xor,or,and}, suffix is needed.
Or there would be an error for
vpxor %ymm0, %ymm31, %ymm1
Error: unsupported instruction `vpxor'
Bootstrapped and regtested x86_64-pc-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/sse.md (v1ti3): Add suffix and repla
The patch fixes ICE in ix86_gimple_fold_builtin.
gcc/ChangeLog:
PR target/104666
* config/i386/i386-expand.cc
(ix86_check_builtin_isa_match): New func.
(ix86_expand_builtin): Move code to
ix86_check_builtin_isa_match and call it.
* config/i386/i386-
Make them equal to the cost of unaligned ones to avoid odd alignment
peeling.
Impact for SPEC2017 on CLX:
fprate:
503.bwaves_r    BuildSame
507.cactuBSSN_r -0.22
508.namd_r      -0.02
510.parest_r    -0.28
511.povray_r    -0.20
519.lbm_r       BuildSame
521.wrf_r
Following the change in the assembler (refer to [1]), this patch disallows
mask/sse/mmx moves in TLS code sequences, which require integer MOV
instructions.
[1]
https://sourceware.org/git/?p=binutils-gdb.git;a=patch;h=d7e3e627027fcf37d63e284144fe27ff4eba36b5
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>Why is the above declared as a special memory constraint? Also the
Changed to define_memory_constraint since it's ok for
reload to make them match by converting the operand to the form
'(mem (reg X))', where X is a base register (from the register class specified
by BASE_REG_CLASS
>predicate comme
TYPE_PRECISION (type) < TYPE_PRECISION (TREE_TYPE (@2)) is supposed to check
the integer type but not the pointer type, so use the second parameter instead,
i.e. the first parameter is VPTR, the second parameter is I4.
582 DEF_SYNC_BUILTIN (BUILT_IN_ATOMIC_FETCH_OR_4,
583   "__atomic_fetch_or_4",
584
There're several failures reported in [1]:
1. unsupported instruction `pextrw` for "pextrw $0, %xmm31, 16(%rax)"
%vpextrw should be used in output templates.
2. ICE in get_attr_memory for movhi_internal since some alternatives
are marked as TYPE_SSELOG.
Explicitly set memory_attr for those alterna
As discussed in the PR, this patch does the following optimizations:
1. No memory is needed to move HI/HFmode between GPR and SSE registers
under TARGET_SSE2 and above, pinsrw/pextrw are used for them w/o
AVX512FP16.
2. Use gen_sse2_pinsrph/gen_vec_setv4sf_0 to replace
ix86_expand_vector_set in extendhfsf2/truncsfhf2
ix86_attr_length_immediate_default assumes TYPE ishift only has 1
constant operand,
but *x86_64_shld_1/*x86_shld_1/*x86_64_shrd_1/*x86_shrd_1 has 2, with
condition: INTVAL (operands[3]) == 32 - INTVAL (operands[2]) or
INTVAL (operands[3]) == 64 - INTVAL (operands[2]), and hit
gcc_assert.
Explicitly
The patch helps reload choose GENERAL_REGS alternatives for
SSE_FLOAT_MODE and enables optimizations like
- vmovd %xmm0, -4(%rsp)
- movl $1, %eax
- addl -4(%rsp), %eax
+ movd %xmm0, %eax
+ addl $1, %eax
Bootstrapped and regtested on x86_64-pc-linux
Hi:
> Please also consider TARGET_INTER_UNIT_MOVES_TO_VEC and
> TARGET_INTER_UNIT_MOVES_FROM_VEC.
Here's updated patch.
Also honor TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC in
preferred_{,out_}reload_class.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32\ -march=k8,\ -march=k8}.
Ok?
gcc/Cha
When moves between integer and sse registers are cheap.
2021-12-06 Hongtao Liu
Uroš Bizjak
gcc/ChangeLog:
PR target/95740
* config/i386/i386.c (ix86_preferred_reload_class): Allow
integer regs when moves between register units are cheap.
* config/i
Fix unexpected non-canon form from gimple vector selector.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/107271
* config/i386/i386-expand.cc (ix86_vec_perm_index_canon): New.
(expand_vec_perm_shufps_shufps): Call
Enable V4BFmode and V2BFmode with the same ABI as V4HFmode and
V2HFmode. No real operation is supported for them except for movement.
This should solve PR target/107261.
Also I noticed there's redundancy in VALID_AVX512FP16_REG_MODE, and
removed V2BFmode from it.
Bootstrapped and regtested on x86
Matching constraints are used in these circumstances. More precisely,
the two operands that match must include one input-only operand and
one output-only operand. Moreover, the digit must be a smaller number
than the number of the operand that uses it in the constraint.
In pr107057, the 2 operands
>You have a couple of other patterns where operand 1 is matched to
>produce vmovddup insn. These are *avx512f_unpcklpd512 and
>avx_unpcklpd256. You can also remove expander in both
>cases.
Yes, changed in V2 patch.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
Matching
This patch doesn't handle variable count since it requires 5 insns to
be combined to get the wanted pattern, but current pass_combine only
supports at most 4.
This patch doesn't handle 16-bit shrd/shld either.
Ideally, we can avoid redundancy of
*x86_64_shld_shrd_1_nozext/*x86_shld_shrd_1_nozext
if mid
Since 256-bit vector integer comparison is under TARGET_AVX2,
and gimple folding for vblendvpd/vblendvps/vpblendvb relies on that.
Restrict gimple fold condition to TARGET_AVX2.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/106704
>Looks good overall - a few comments inline. Also can you please add
>SLP support?
>I've tried hard to fill in gaps where SLP support is missing since my
>goal is still to get
>rid of non-SLP.
For slp with different induction types, they need separate iv updates and
a vector permutation. And if the
zmm-version vcvtps2ph is special, it encodes {sae} in evex, but put
round control in the imm. For intrinsic _mm512_cvt_roundps_ph (a,
imm), imm contains both {sae} and round control, we need to separate
it in the assembly output since vcvtps2ph will ignore imm[3:7].
Corresponding llvm patch.
Intri
Optimize
_1 = *srcp_3(D);
_4 = VEC_PERM_EXPR <_1, _1, { 4, 5, 6, 7, 4, 5, 6, 7 }>;
_5 = BIT_FIELD_REF <_4, 128, 0>;
to
_1 = *srcp_3(D);
_5 = BIT_FIELD_REF <_1, 128, 128>;
the above will finally be optimized to
_5 = BIT_FIELD_REF <*srcp_3(D), 128, 128>;
Bootstrapped and regtested on
Enable the optimization for TImode only under 32-bit targets; for 64-bit
targets there could be extra integer <-> sse moves regarding psABI,
which is not efficient.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR target/104610
* config/i386/i386-expand.c
This is adjusted patch only for OImode.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/104610
* config/i386/i386-expand.cc (ix86_expand_branch): Use ptest
for QImode when code is EQ or NE.
* config/i386/sse.md (cbr
Similarly optimize movl + vmovq to vmovd.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/104915
* config/i386/sse.md (*vec_set_0_zero_extendhi): New
pre_reload define_insn_and_split.
(*vec_setv2di_0_zero_extendhi_1
Here's the adjusted patch.
Ok for trunk?
Optimize
_4 = VEC_PERM_EXPR <_1, _1, { 4, 5, 6, 7, 4, 5, 6, 7 }>;
_5 = BIT_FIELD_REF <_4, 128, 0>;
to
_5 = BIT_FIELD_REF <_1, 128, 128>;
gcc/ChangeLog:
PR tree-optimization/102583
* tree-ssa-forwprop.cc (simplify_bitfield_ref): Extende
This patch will enable below optimization:
{
- int bit;
- long long unsigned int _1;
- long long unsigned int _2;
-
[local count: 46707768]:
-
- [local count: 1027034057]:
- # tmp_11 = PHI
- # bit_13 = PHI
- _1 = 1 << bit_13;
- _2 = ~_1;
- tmp_8 = _2 & tmp_11;
- bit_9 = bit_13 +
pand/pandn may be used to clear the upper/lower bits of the operands; in
that case there will be 4-5 instructions for the permutation, but it's
still better than scalar code.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/105354
* confi
Cleanup of the upper 16 bits is not needed for pmovzxbq/pmovsxbq.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/105072
* config/i386/sse.md (*sse4_1_v2qiv2di2_1):
New define_insn.
(*sse4_1_zero_extendv2qiv2di2_2): Ne
Assembly Optimization like:
- vmovq %xmm0, %xmm2
- vmovdqa .LC0(%rip), %xmm0
vmovq %xmm1, %xmm1
- vpermi2w %xmm1, %xmm2, %xmm0
+ vmovq %xmm0, %xmm0
+ vpunpcklqdq %xmm1, %xmm0, %xmm0
...
-.LC0:
- .value 0
- .value 1
- .valu
Here's updated patch which adds ix86_pre_reload_split () to those 2
define_insn_and_splits.
Assembly Optimization like:
- vmovq %xmm0, %xmm2
- vmovdqa .LC0(%rip), %xmm0
vmovq %xmm1, %xmm1
- vpermi2w %xmm1, %xmm2, %xmm0
+ vmovq %xmm0, %xmm0
+ vpun
When d->perm[i] == d->perm[i-1] + 1 and d->perm[i] == nelt, it's not
continuous. It should fail if there's more than 2 continuous areas.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/105587
* config/i386/i386-expand.cc
(
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR tree-optimization/105591
* tree-ssa-forwprop.cc (simplify_bitfield_ref): Clamp
vec_perm_expr index.
gcc/testsuite/ChangeLog:
* gcc.dg/pr105591.c: New test.
---
gcc/testsuite
backend has
16550 (define_insn "*bmi2_bzhi_3_2"
16551   [(set (match_operand:SWI48 0 "register_operand" "=r")
16552     (and:SWI48
16553       (plus:SWI48
16554         (ashift:SWI48 (const_int 1)
16555           (match_operand:QI 2 "register_operand" "r"))
16556 (
kmovd only uses port5, which is often the performance bottleneck.
Also, from a latency perspective, spill and reload can mostly
be STLF or even MRN, which only takes 1 cycle.
So the patch increases the move cost between gpr and mask to be the same as
gpr <-> sse register.
Bootstrapped and regtested on
Right now, mem_cost for the separate mem alternative is 1 * frequency, which
is pretty small and causes the unnecessary SSE spill in the PR. I've tried
to rework the backend cost model, but RA is still not happy with that (it
regresses somewhere else). I think the root cause of this is the cost for
the separate 'm' alternati
Hi:
Details discussed in PR.
Bootstrapped and regtested on x86-64_linux-gnu{-m32,}.
Pushed to master and GCC-11.
gcc/ChangeLog:
PR target/102166
* config/i386/amxbf16intrin.h : Remove macro check for __AMX_BF16__.
* config/i386/amxint8intrin.h : Remove macro check fo
For 32-bit libgcc configured w/o sse2, there would be an error since
GCC only supports _Float16 under sse2. Explicitly add -msse2 for those
HF related libgcc functions, so users can still link them w/ the
above configuration.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?
Hi:
As discussed in [1], most of (currently unopposed) targets want
auto-vectorization at O2, and IMHO now would be a good time to enable O2
vectorization for GCC trunk, so it would leave enough time to expose
related issues and fix them.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
Hi:
As discussed in [1], adjust the layout for x86 _Float16 description.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* doc/extend.texi: (@node Floating Types): Adjust the wording.
(@node Half-Precision): Ditto.
---
gcc/doc/extend.te
Hi:
For the conversion from _Float16 to int, if the corresponding optab
does not exist, the compiler will try the wider mode (SFmode here),
but when floatsfsi exists but FAILs, FROM will be rewritten, which
leads to the runtime error reported in the PR.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok
Hi:
The optimization is described in the PR.
The two instruction sequences are almost as fast, but the optimized
instruction sequence could be one mov instruction less on sse2 and
2 mov instructions less on sse3.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
gcc/ChangeLog:
PR
Hi:
As described in the PR, valign{d,q} can be used to extract one element from
a vector. For elements located in the lower 128 bits, only one instruction is
needed, so this patch only optimizes elements located above 128 bits.
The optimization is like:
- vextracti32x8 $0x1, %zmm0, %ymm0
- v
Hi:
As a follow up of [1], the patch removes all scalar mode copysign related
post_reload splitters/define_insns and expands copysign directly into the
following using paradoxical subregs.
op3 = op1 & ~mask;
op4 = op2 & mask;
dest = op3 | op4;
It can sometimes generate better code just like avx512dq
Currently for (vec_concat:M (vec_select op0 idx1)(vec_select op0 idx2)),
the optimizer wouldn't simplify if op0 has a different mode from M, but that's
too restrictive and prevents the optimization below; the condition can be
relaxed to only require op0 to have the same inner mode as M.
(set (reg:V2DF 87 [ xx ])
Hi:
In general_operand, paradoxical subregs w/ outermode SCALAR_FLOAT_MODE_P
are not allowed unless lra_in_progress, so this patch adds the restriction
to validate_subreg as well.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
Also the newly added tests are compiled with aarch64-linu
Hi:
Details discussed in
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579170.html.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?
liuhongt (2):
Revert "Get rid of all float-int special cases in validate_subreg."
validate_subreg b
This reverts commit d2874d905647a1d146dafa60199d440e837adc4d.
PR target/102254
PR target/102154
PR target/102211
---
gcc/emit-rtl.c | 40
1 file changed, 40 insertions(+)
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 77ea8948ee8..ff3b4449b37 100644
-
gcc/ChangeLog:
* expmed.c (extract_bit_field_using_extv): Call validate_subreg
before calling gen_lowpart.
---
gcc/expmed.c | 6 +-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/gcc/expmed.c b/gcc/expmed.c
index 3143f38e057..10d62d857a8 100644
--- a/gcc/expmed.c
+++ b/g
Hi:
UNSPEC_COPYSIGN/XORSIGN are only used by related post_reload splitters
which have been removed by r12-3417 and r12-3435.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/i386.md: (UNSPEC_COPYSIGN): Remove.
(UNSPEC_XORSI
Hi:
As described in the PR, use vextract instead of valign when
byte_offset % 16 == 0.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,}.
Pushed to trunk.
2020-09-13 Hongtao Liu
Peter Cordes
gcc/ChangeLog:
PR target/91103
* config/i386/sse.md (extract_suf):
Hi:
The optimization is described in the PR.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,}.
All avx512fp16 runtest cases passed on SPR.
gcc/ChangeLog:
PR target/102327
* config/i386/i386-expand.c
(ix86_expand_vector_init_interleave): Use punpcklwd to pack 2
Ping
Rebased on latest trunk.
gcc/ChangeLog:
* common.opt (ftree-vectorize): Add Var(flag_tree_vectorize).
* doc/invoke.texi (Options That Control Optimization): Update
documents.
* opts.c (default_options_table): Enable auto-vectorization at
O2 with very-c
Ping.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,},
aarch64-unknown-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR middle-end/102080
* match.pd: Check mask type when doing cond_op related gimple
simplification.
* tree.c (is_truth_type_for): New funct
Bootstrapped and regtest on x86_64-pc-linux-gnu{-m32,}.
Runtime tests passed under sde{-m32,}.
gcc/ChangeLog:
PR target/87767
* config/i386/i386.c (ix86_print_operand): Handle
V8HF/V16HF/V32HFmode.
* config/i386/i386.h (VALID_BCST_MODE_P): Add HFmode.
*
Besides conversion instructions, pass_rpad also handles scalar
sqrt/rsqrt/rcp/round instructions, while r12-3614 only intended to
handle conversion instructions, so fix it.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,} w/ configure
--enable-checking=yes,rtl,extra, failed tests are fixed
Hi:
fma/fms/fnma/fnmsv2sf4 are defined only under (TARGET_FMA || TARGET_FMA4).
The patch extends the expanders to TARGET_AVX512VL.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/mmx.md (fmav2sf4): Extend to AVX512 fma.
(f
Pushed to trunk.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr92658-avx512f.c: Refine testcase.
* gcc.target/i386/pr92658-avx512vl.c: Adjust scan-assembler,
only v2di->v2qi truncate is not supported, v4di->v4qi should
be supported.
---
gcc/testsuite/gcc.target/i38
---
htdocs/gcc-12/changes.html | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index 81f62fe3..14149212 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.html
@@ -165,8 +165,12 @@ a work-in-progre
expander for smin/maxhf3.
AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes
AVX512FP16: Add float(uns)?mn2 expander
AVX512FP16: add truncmn2/extendmn2 expanders
AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes.
liuhongt (2):
AVX512FP16: Add expander for rint
gcc/ChangeLog:
* config/i386/i386.md (rinthf2): New expander.
(nearbyinthf2): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512fp16-builtin-round-1.c: Add new testcase.
---
gcc/config/i386/i386.md | 22 +++
.../i386/avx
gcc/ChangeLog:
* config/i386/sse.md (FMAMODEM): extend to handle FP16.
(VFH_SF_AVX512VL): Extend to handle HFmode.
(VF_SF_AVX512VL): Deleted.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512fp16-fma-1.c: New test.
* gcc.target/i386/avx512fp16vl-fma-1.c: N
From: Hongyu Wang
gcc/ChangeLog:
* config/i386/i386.md (hf3): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512fp16-builtin-minmax-1.c: New test.
---
gcc/config/i386/i386.md | 11 ++
.../i386/avx512fp16-builtin-minmax-1.c| 35 +++
From: Hongyu Wang
NB: 64bit/32bit vectorize for HFmode is not supported for now, will
adjust this patch when V2HF/V4HF operations supported.
gcc/ChangeLog:
* config/i386/i386.md (fix_trunchf2): New expander.
(fixuns_trunchfhi2): Likewise.
(*fixuns_trunchfsi2zext): New de
From: Hongyu Wang
gcc/ChangeLog:
* config/i386/sse.md (float2):
New expander.
(avx512fp16_vcvt2ph_):
Rename to ...
(floatv4hf2): ... this, and drop constraints.
(avx512fp16_vcvtqq2ph_v2di): Rename to ...
(floatv2div2hf2): ... this, and like
From: Hongyu Wang
gcc/ChangeLog:
* config/i386/sse.md (extend2):
New expander.
(extendv4hf2): Likewise.
(extendv2hfv2df2): Likewise.
(trunc2): Likewise.
(avx512fp16_vcvt2ph_): Rename to ...
(truncv4hf2): ... this, and drop constraints.
From: Hongyu Wang
gcc/ChangeLog:
* config/i386/i386-expand.c (ix86_use_mask_cmp_p): Enable
HFmode mask_cmp.
* config/i386/sse.md (sseintvecmodelower): Add HF vector modes.
(_store_mask): Extend to support HF vector modes.
(vec_cmp): Likewise.
(vcon
Updated, mention _Float16 support.
---
htdocs/gcc-12/changes.html | 13 -
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index 81f62fe3..f19c6718 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.
Hi:
Related discussion in [1] and PR.
Bootstrapped and regtest on x86_64-linux-gnu{-m32,}.
Ok for trunk?
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574330.html
gcc/ChangeLog:
PR target/102464
* config/i386/i386.c (ix86_optab_supported_p):
Return true f
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580207.html
gcc/ChangeLog:
* doc/extend.texi (Half-Precision): Remove storage only
description for _Float16 w/o avx512fp16.
---
gcc/doc/extend.texi | 11 +--
1 file changed, 5 insertions(+), 6 deletions(-)
diff
Hi:
> Please don't add the -fno- option to the warning tests. As I said,
> I would prefer to either suppress the vectorization for the failing
> cases by tweaking the test code or xfail them. That way future
> regressions won't be masked by the option. Once we've moved
> the warning to a more su
Revert due to performance regression.
This reverts commit 8f323c712ea76cc4506b03895e9b991e4e4b2baf.
PR target/102473
PR target/101059
---
gcc/config/i386/sse.md| 39 ++-
gcc/testsuite/gcc.target/i386/sse2-pr101059.c | 32 ---
gcc/tests