For the testcase in the PR, we have
br64 = br;
br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
n->n: 0x300200.
n->range: 32.
n->type: uint64.
The original code assumes n->range is the same as TYPE_PRECISION (n->type),
and tries to rotate the mask from 0x30200
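A minimal Python sketch of the distinction (the constants mirror the testcase above; the helper name is hypothetical): rotating the mask within n->range (32 bits) wraps the shifted-out bits around, while rotating within the full 64-bit TYPE_PRECISION does not, so the two assumptions give different masks.

```python
def rotate_left(mask, amount, width):
    # Rotate 'mask' left by 'amount' bits within a 'width'-bit field.
    amount %= width
    full = (1 << width) - 1
    return ((mask << amount) | ((mask & full) >> (width - amount))) & full

# Within n->range (32 bits) the high bits wrap around ...
within_range = rotate_left(0x300200, 16, 32)      # 0x02000030
# ... but within TYPE_PRECISION (64 bits) they do not.
within_precision = rotate_left(0x300200, 16, 64)  # 0x3002000000
```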
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi -> v2qi vector
truncate.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/92658
* config/i386/mmx.md (truncv2hiv2qi2): New define_insn.
(truncv2si2): Ditto.
gcc/testsu
We already use an intermediate type in the WIDEN case, but not for NONE;
this patch extends that.
I didn't do that in pattern recog since we need to know whether the
stmt belongs to any slp_node to decide the vectype, the related optabs
are checked according to vectype_in and vectype_out. For non-sl
This patch only supports vec_pack/unpacks optabs for vector modes whose length is >=
128 bits.
32/64-bit vectors are mostly handled by the BB vectorizer with
truncmn2/extendmn2/fix{,uns}_truncmn2.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
*
r14-1145 folds the intrinsics into gimple ABS_EXPR, which has UB for
TYPE_MIN, but PABSB stores an unsigned result into dst. The patch
uses ABSU_EXPR + VCE instead of ABS_EXPR.
Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
vector absm2 is guarded with TARGET_MMX_WITH_SSE.
B
mask < 0 will always be false with -funsigned-char, but
vpblendvb needs to check the most significant bit.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to GCC12/GCC13 release branch?
gcc/ChangeLog:
PR target/110108
* config/i386/i386-b
> I think this is a better patch and will always be correct and still
> get folded at the gimple level (correctly):
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..02bf5ba93a5 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -18561,8
g
Since there's no evex version for vpcmpeq ymm, ymm, ymm.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk and backport to GCC13.
gcc/ChangeLog:
PR target/110227
* config/i386/sse.md (mov_internal>): Use x instead of v
for alternative 2 sinc
packuswb/packusdw do unsigned saturation of a signed source, but RTL
us_truncate means unsigned saturation of an unsigned source.
So for the value -1, packuswb produces 0, but us_truncate produces
255. The patch reimplements those related patterns and functions with
UNSPEC_US_TRUNCATE instead of
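A small Python model of the two saturation semantics for a 16-bit element (the helper names are illustrative, not from the patch):

```python
def packuswb_lane(x):
    # packuswb: unsigned saturation of a *signed* 16-bit element to 8 bits.
    return 0 if x < 0 else min(x, 255)

def us_truncate8(x):
    # RTL us_truncate: unsigned saturation of an *unsigned* source,
    # so -1 is first seen as the 16-bit pattern 0xffff.
    return min(x & 0xffff, 255)

# For the value -1: packuswb clamps to 0, us_truncate yields 255.
```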
The packing in vpacksswb/vpackssdw is not a simple concat; it's an
interleave of src1 and src2 for every 128 bits (or 64 bits of the
ss_truncate result), i.e.
dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127] = ss_truncate (src2[0-127])
dst[0-63] =
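A Python model of that per-lane layout, element-wise (helper names are hypothetical; the last dst lane is completed here on the assumption that it mirrors the pattern above, i.e. ss_truncate of the low lane of src1):

```python
def ss_truncate16(x):
    # Signed saturation of a 16-bit element to 8 bits.
    return max(-128, min(127, x))

def pack_interleave(src1, src2, lane_elems=8):
    # vpacksswb layout: for each 128-bit lane, emit the narrowed lane of
    # src1 followed by the narrowed lane of src2 -- an interleave per lane,
    # not a concat of the two full sources.
    dst = []
    for base in range(0, len(src1), lane_elems):
        dst += [ss_truncate16(x) for x in src1[base:base + lane_elems]]
        dst += [ss_truncate16(x) for x in src2[base:base + lane_elems]]
    return dst
```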
optimize_insn_for_speed () in assemble output is not aligned with the
splitter condition, and it causes an ICE when building SPEC2017
blender_r.
Not sure if ctrl is supposed to be reliable in assemble output; the patch just
removes that as a workaround.
Bootstrapped and regtested on x86_64-pc-linux-gnu
> I see some regressions most likely with this change on i686-linux,
> in particular:
> +FAIL: gcc.dg/pr107547.c (test for excess errors)
> +FAIL: gcc.dg/torture/floatn-convert.c -O0 (test for excess errors)
> +UNRESOLVED: gcc.dg/torture/floatn-convert.c -O0 compilation failed to
> produce execu
For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast
as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations.
Can someone confirm this with AMD folks?
If AMD doesn't like such optimization, I'll put my optimization under
micro-architecture tuning.
Bootstrapped and regte
Prevent rtl optimization of vec_duplicate + zero_extend to
vpbroadcastm since there could be an extra kmov after RA.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
PR target/110788
* config/i386/sse.md (avx512cd_maskb_vec_dup): Add
After
b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit
commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb
Author: Jan Hubicka
Date: Fri Jul 28 09:16:09 2023 +0200
loop-split improvements, part 1
Now we have
vpbroadcastd %ecx, %xmm0
vpaddd .LC3(%rip), %xmm0, %xmm0
v
AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
Also remove scalar mode from fmaddsub/fmsubadd pattern since there's
no scalar instruction for that.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/81904
* config/i3
In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics
after AVX2; it was rejected as
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart
And split it to GPR-version instruction after reload.
This will enable below optimization for 16/32/64-bit vector bit_op
- movd    (%rdi), %xmm0
- movd    (%rsi), %xmm1
- pand    %xmm1, %xmm0
- movd    %xmm0, (%rdi)
+ movl    (%rsi), %eax
+ andl    %eax, (%rdi)
The patch only handles load/store (including ctor/permutation, except
gather/scatter) for complex type; other operations don't need to be
handled since they will be lowered by pass cplxlower. (MASK_LOAD is not
supported for complex type, so no need to handle it either.)
Instead of supporting vector(2) _C
And split it to GPR-version instruction after reload.
> ?r was introduced under the assumption that we want vector values
> mostly in vector registers. Currently there are no instructions with
> memory or immediate operand, so that made sense at the time. Let's
> keep ?r until logic instructions w
And split it after reload.
>IMO, the only case it is worth adding is a direct immediate store to
>memory, which HJ recently added.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/106038
* config/i386/mmx.md (3): Extend to AND mem,
V2 update:
Handle VMAT_ELEMENTWISE, VMAT_CONTIGUOUS_PERMUTE, VMAT_STRIDED_SLP,
VMAT_CONTIGUOUS_REVERSE, VMAT_CONTIGUOUS_DOWN for complex type.
I've run SPECspeed@2017 627.cam4_s; there are some vectorization cases,
but no big performance impact (since this patch only handles load/store).
Any co
And split it after reload.
> You will need ix86_binary_operator_ok insn constraint here with
> corresponding expander using ix86_fixup_binary_operands_no_copy to
> prepare insn operands.
Split define_expand with just register_operand, and allow
memory/immediate in define_insn, assume combine/forwp
__builtin_cexpi can't be vectorized since there's a gap between it and
the vectorized sincos version (in libmvec, it passes a double and two
double pointers and returns nothing). And it will lose some
vectorization opportunities if sin & cos are optimized to cexpi before
the vectorizer.
I'm trying to add vect_r
> My original comments still stand (it feels like this should be more generic).
> Can we go the way lowering complex loads/stores first? A large part
> of the testcases
> added by the patch should pass after that.
This is the patch as suggested, one additional change is handling COMPLEX_CST
for r
And split it after reload.
gcc/ChangeLog:
PR target/106038
* config/i386/mmx.md (3): New define_expand; it's the
original "3".
(*3): New define_insn; it's the original
"3" extended to handle memory and immediate
operands with ix86_binary_operator_ok. Also
r13-1762-gf9d4c3b45c5ed5f45c8089c990dbd4e181929c3d lowers complex type
moves to scalars, but testcase pr23911 is supposed to scan for a __complex__
constant which is no longer available, so adjust the testcase to scan for
IMAGPART_EXPR/REALPART_EXPR constants separately.
Pushed as an obvious patch.
gcc/testsuite/ChangeLog
For neg, the patch creates a vec_init as [ a, -a, a, -a, ... ] and no
vec_step is needed to update the vectorized iv since vf is always a multiple
of 2 (negative * negative is positive).
For shift, the patch creates a vec_init as [ a, a >> c, a >> 2*c, .. ]
and a vec_step as [ c * nunits, c * nunits, c * nuni
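A Python sketch of the idea (vf and the function names are illustrative): the vectorized shift IV starts from [ a, a >> c, a >> 2*c, ... ] and each vector iteration shifts every lane by c * vf more, while the neg IV simply repeats because (-1)**vf == 1 for even vf.

```python
def scalar_shift_iv(a, c, n):
    # Scalar induction: x >>= c each iteration.
    out, x = [], a
    for _ in range(n):
        out.append(x)
        x >>= c
    return out

def vectorized_shift_iv(a, c, n, vf=4):
    # vec_init = [a, a >> c, a >> 2*c, ...]; vec_step shifts each lane by c * vf.
    vec = [a >> (i * c) for i in range(vf)]
    out = []
    for _ in range(n // vf):
        out.extend(vec)
        vec = [x >> (c * vf) for x in vec]
    return out

def vectorized_neg_iv(a, n, vf=4):
    # vec_init = [a, -a, a, -a]; no vec_step needed since (-1)**vf == 1.
    vec = [a if i % 2 == 0 else -a for i in range(vf)]
    out = []
    for _ in range(n // vf):
        out.extend(vec)
    return out
```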
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:
In function ‘matmul_i1_avx512f’:
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1:
internal compiler error: RTL check: expected
Similar like r14-2786-gade30fad6669e5, the patch is for V4HF/V2HFmode.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/110762
* config/i386/mmx.md (3): Changed from define_insn
to define_expand and broken into ..
(v4
Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
supported via EAX.
Intel documentation says invalid subleaves return 0. We had been
relying on that behavior instead of checking the max subleaf number.
It appears that some Sandy Bridge CPUs return at least the subleaf 0
EDX value for subl
> Please rather do it in a more self-descriptive way, as proposed in the
> attached patch. You won't need a comment then.
>
Adjusted in V2 patch.
Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
supported via EAX.
Intel documentation says invalid subleaves return 0. We had been
relying
This minor fix is preapproved in [1].
Committed to trunk.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626758.html
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_available_features):
Rename local variable subleaf_level to max_subleaf_level.
---
gcc/common/config
Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named
patterns in order to avoid generation of partial-vector V8HFmode
trapping instructions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR target/110832
* config/i386/mmx.md
Currently we have 3 different independent tunes for gather,
"use_gather,use_gather_2parts,use_gather_4parts";
similarly for scatter, there are
"use_scatter,use_scatter_2parts,use_scatter_4parts".
The patch supports 2 standardized options to enable/disable
vectorization for all gather/scatter instructio
For more details of GDS (Gather Data Sampling), refer to
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
After the microcode update, there's a performance regression. To avoid that,
the patch disables gather gene
Rename the original use_gather to use_gather_8parts, support
-mtune-ctrl={,^}use_gather to set/clear tune features
use_gather_{2parts, 4parts, 8parts}, and support the new option -mgather
as an alias of -mtune-ctrl=, use_gather, ^use_gather.
Similar for use_scatter.
How about this version?
gcc/ChangeLog:
vmovapd can enable register renaming and has the same code size as
vmovsd. Similarly for vmovsh vs. vmovaps: vmovaps is 1 byte shorter than
vmovsh.
When TARGET_AVX512VL is not available, still generate
vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31.
Bootstrapped and regtested on x86_64-pc-linux-gn
Alderlake-N is E-core only, add it as an alias of Alderlake.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_intel_cpu): Detect
Alderlake-N.
* common/config/i386/i386-common.cc (alias_table): Suppo
---
htdocs/gcc-14/changes.html | 4
1 file changed, 4 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index eae25f1a..2c888660 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -151,6 +151,10 @@ a work-in-progress.
-march=luna
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512f-pr88464-2.c: Add -mgather to
options.
* gcc.target/i386/avx512f-pr88464-3.c: Ditto.
* gcc.target/i386/avx512f-pr88464-4.c: Ditto.
* gcc.target/i386/avx512f-pr88464-6.c: Ditto.
* gcc.target/i386/avx51
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/invariant-ternlog-1.c: Only scan %rdx under
TARGET_64BIT.
---
gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/testsuite/gcc.target/i386
I noticed there's some refactoring in vectorizable_conversion
for code_helper, so I've adjusted my patch to that.
Here's the patch I'm going to commit.
We already use an intermediate type in the WIDEN case, but not for NONE;
this patch extends that.
gcc/ChangeLog:
PR target/110018
* tre
If mem_addr points to a memory region with less than whole-vector-size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent
it from being transformed to vpblendd.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32
> > Hmm, good question. GENERIC has a direct truncation to unsigned char
> > for example, the C standard generally says if the integral part cannot
> > be represented then the behavior is undefined. So I think we should be
> > safe here (0x1.0p32 doesn't fit an int).
>
> We should be following An
The new assembly looks better than the original one, so I adjusted those testcases.
Ok for trunk?
gcc/testsuite/ChangeLog:
PR tree-optimization/110371
PR tree-optimization/110018
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Scan scvt +
sxtw instead of scvt + zip1 + z
When there are multiple operands in vec_oprnds0, vec_dest will be
overwritten to vectype_out, but in the multi_step_cvt case, cvt_type is
expected. It caused an ICE in verify_gimple_in_cfg.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and aarch64-linux-gnu.
Ok for trunk?
gcc/ChangeLog:
__bfloat16 is redefined from a typedef of short to the real __bf16 since GCC
13. The patch issues a warning for potential silent implicit
conversion between __bf16 and short where users may only expect a
data movement.
To avoid too many false positives, the warning is only issued under
TARGET_AVX512BF16.
Bootstrapp
At the rtl level, we cannot guarantee that the maskstore is not optimized
to other full-memory accesses, since the current implementations are equivalent
in terms of pattern. To solve this potential problem, this patch refines
the pattern of the maskstore and the intrinsics with unspec.
One thing I'm
pass_insert_vzeroupper is under condition
TARGET_AVX && TARGET_VZEROUPPER
&& flag_expensive_optimizations && !optimize_size
But the documentation of -mvzeroupper doesn't mention that the insertion
requires -O2 and above; it may confuse users when they explicitly
use -Os -mvzeroupper.
mvzeroupp
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/82735
* config/i386/i386.cc (ix86_avx_u127_mode_needed): Don't emit
vzeroupper for vzeroupper call_insn.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx-vzeroupper-30.
vpternlog is also used for optimizations which don't need any valid
input operand; in that case, the destination is used as an input of the
instruction, and that creates a false dependence.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR t
For testcase
void __cond_swap(double* __x, double* __y) {
bool __r = (*__x < *__y);
auto __tmp = __r ? *__x : *__y;
*__y = __r ? *__y : *__x;
*__x = __tmp;
}
GCC-14 with -O2 and -march=x86-64 options generates the following code:
__cond_swap(double*, double*):
movsd xmm1, QWORD
They should have the same cost as the vector modes since both generate
pand/pandn/pxor/por instructions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for
DF/SFmode AND/IOR/XOR/ANDN operations.
We have ix86_expand_sse_fp_minmax to detect min/max semantics, but
it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false; for
the testcase in the PR, there's an extra move from cmp_op0 to if_true,
so it fails ix86_expand_sse_fp_minmax.
This patch adds pre_reload splitter to detect the
> Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX
> and the other emitting UNSPEC_IEEE_MIN.
Split.
> The test involves blendv instruction, which is SSE4.1, so it is
> pointless to test it without -msse4.1. Please add -msse4.1 instead of
> -march=x86_64 and use sse4_runtime
A false dependency happens when the destination is only updated by
pternlog. There is no false dependency when the destination is also used
as a source. So either a pxor should be inserted, or the input operand
should be set with constraint '0'.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to p
Similar to what we did for cmpxchg, but extended to all
ix86_comparison_int_operator since cmpccxadd sets EFLAGS exactly the same
as CMP.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,},
Ok for trunk?
gcc/ChangeLog:
PR target/110591
* config/i386/sync.md (cmpccxadd_): Add a new
Here's updated patch.
1. Use optimize_insn_for_speed_p instead of optimize_function_for_speed_p.
2. Explicitly move the memory operand to the destination register to avoid a false
dependence in the one_cmpl pattern.
A false dependency happens when the destination is only updated by
pternlog. There is no false dependency whe
Similar to what we did for CMPXCHG, but extended to all
ix86_comparison_int_operator since CMPCCXADD sets EFLAGS exactly the same
as CMP.
When the operand order in the CMP insn is the same as in CMPCCXADD,
the CMP insn can be eliminated directly.
When the operand order is swapped in the CMP insn, only optimize
cmpccxadd +
Antony Polukhin 2023-07-11 09:51:58 UTC
There's a typo at
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/testsuite/g%2B%2B.target/i386/pr110170.C;h=e638b12a5ee2264ecef77acca86432a9f24b103b;hb=d41a57c46df6f8f7dae0c0a8b349e734806a837b#l87
It should be `|| !test3() || !test3r()` rather than `|| !te
> The quoted patch shows -shared in context and you didn't post a
> backport version
> to look at. But yes, we shouldn't change -shared behavior on a
> branch, even less so make it
> inconsistent between targets.
Here's the patch.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/89701
* common.opt: Refactor -fcf-protection= to support combination
of param.
* lto-wrapper.c (merge_and_complain): Adjusted.
* opts.c (parse_cf_protection_opt
> I think this could be simplified if you use either EnumSet or
> EnumBitSet instead in common.opt for `-fcf-protection=`.
Use EnumSet instead of EnumBitSet since CF_FULL is not a power of 2.
The set classification is a bit tricky: cf_branch and cf_return
should be in different sets, but they bo
r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in cost
calculation when the preferred register classes are not known yet.
It regressed powerpc PR109610 and PR109858; it looks too aggressive to use
NO_REGS when the mode can be allocated with GENERAL_REGS.
The patch takes a step back, still use
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/109900
* config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
_mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to
transform notl + pbroadcast + pand into pbroadcast + pandn for
VI124_AVX2, which leaves out all DI-element-size ones as
well as all 512-bit ones.
This patch extends the splitter to VI_AVX2, which will handle DImode for
AVX2, and V64QI
lzcnt/tzcnt has been fixed since Skylake; popcnt has been fixed since
Icelake. At least for Icelake and later Intel Core processors, the
errata tune is not needed. And the tune isn't needed for ATOM either.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/Chang
Hi:
This patch supports cond_add/sub/mul/div expanders for vector float/double.
There are still cond_fma/fms/fnms/fma/max/min/xor/ior/and left, for which I failed
to figure out a testcase to validate them.
Also cond_add/sub/mul for vector integer.
Bootstrap is ok, and it survives the regression test on
gcc/ChangeLog:
* config/i386/i386-modes.def (FLOAT_MODE): Define ieee HFmode.
* config/i386/i386.c (enum x86_64_reg_class): Add
X86_64_SSEHF_CLASS.
(merge_classes): Handle X86_64_SSEHF_CLASS.
(examine_argument): Ditto.
(construct_container): Ditto.
Update from v2:
1. Support -fexcess-precision=16 which will enable
FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.
2. Update ix86_get_excess_precision, so -fexcess-precision=standard
should not do anything different from -fexcess-precision=fast
regarding _Float16.
3. Avoiding
gcc/ada/ChangeLog:
* gcc-interface/misc.c (gnat_post_options): Issue an error for
-fexcess-precision=16.
gcc/c-family/ChangeLog:
* c-common.c (excess_precision_mode_join): Update below comments.
(c_ts18661_flt_eval_method): Set excess_precision_type to
EXC
libgcc/ChangeLog:
* soft-fp/eqhf2.c: New file.
* soft-fp/extendhfdf2.c: New file.
* soft-fp/extendhfsf2.c: New file.
* soft-fp/extendhfxf2.c: New file.
* soft-fp/half.h (FP_CMP_EQ_H): New macro.
* soft-fp/truncdfhf2.c: New file.
* soft-fp/trunc
libgcc/ChangeLog:
* config/i386/32/sfp-machine.h (_FP_NANFRAC_H): New macro.
* config/i386/64/sfp-machine.h (_FP_NANFRAC_H): Ditto.
* config/i386/sfp-machine.h (_FP_NANSIGN_H): Ditto.
* config/i386/t-softfp: Add hf soft-fp.
* config.host: Add i386/64/t-softf
gcc/ChangeLog:
* config/i386/avx512fp16intrin.h (_mm_set_ph): New intrinsic.
(_mm256_set_ph): Likewise.
(_mm512_set_ph): Likewise.
(_mm_setr_ph): Likewise.
(_mm256_setr_ph): Likewise.
(_mm512_setr_ph): Likewise.
(_mm_set1_ph): Likewise.
From: "Guo, Xuepeng"
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_available_features):
Detect FEATURE_AVX512FP16.
* common/config/i386/i386-common.c
(OPTION_MASK_ISA_AVX512FP16_SET,
OPTION_MASK_ISA_AVX512FP16_UNSET,
OPTION_MASK_ISA2_AVX512FP1
Hi:
This is a follow up of [1].
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Pushed to trunk.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576514.html
gcc/ChangeLog:
* config/i386/sse.md (cond_): New expander.
(cond_mul): Ditto.
gcc/testsuite/ChangeLo
Hi:
The define_peephole2 which was added by r12-2640-gf7bf03cf69ccb7dc
should only work on general registers; considering that x86 also
supports mov instructions between gpr, sse reg and mask reg, limit the
peephole2 predicate to general_reg_operand.
I failed to construct a testcase, but I believ
Hi:
This patch adds expanders cond_{fma,fms,fnma,fnms}
for vector float/double modes.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/sse.md (cond_fma): New expander.
(cond_fms): Ditto.
(cond_fnma): Ditto.
Hi:
Pushed to trunk as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/cond_op_addsubmul_d-2.c: Add
dg-require-effective-target for avx512.
* gcc.target/i386/cond_op_addsubmul_q-2.c: Ditto.
* gcc.target/i386/cond_op_addsubmul_w-2.c: Ditto.
* gc
Hi:
Together with the previous 3 patches, all cond_op expanders of vector
modes are supported (if they have a corresponding avx512 mask instruction).
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
liuhongt (3):
[i386] Support cond_{smax,smin,umax,umin} for vector integer modes
gcc/ChangeLog:
* config/i386/sse.md (cond_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/cond_op_maxmin_b-1.c: New test.
* gcc.target/i386/cond_op_maxmin_b-2.c: New test.
* gcc.target/i386/cond_op_maxmin_d-1.c: New test.
* gcc.target/i386/cond
gcc/ChangeLog:
* config/i386/sse.md (cond_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/cond_op_anylogic_d-1.c: New test.
* gcc.target/i386/cond_op_anylogic_d-2.c: New test.
* gcc.target/i386/cond_op_anylogic_q-1.c: New test.
* gcc.target/i38
gcc/ChangeLog:
* config/i386/sse.md (cond_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/cond_op_maxmin_double-1.c: New test.
* gcc.target/i386/cond_op_maxmin_double-2.c: New test.
* gcc.target/i386/cond_op_maxmin_float-1.c: New test.
* gcc.ta
Hi:
---
OK, I think sth is amiss here upthread. insv/extv do look like they
are designed
to work on integer modes (but docs do not say anything about this here).
In fact the caller of extract_bit_field_using_extv is named
extract_integral_bit_field. Of course nothing seems to check what kind of
m
Hi:
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR rtl-optimization/101796
* simplify-rtx.c
(simplify_context::simplify_binary_operation_1): Simplify
vector shift/rotate with const_vec_duplicate to vector
shift/rot
Hi:
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
gcc/ChangeLog:
* config/i386/sse.md (cond_): New expander.
(VI248_AVX512VLBW): New mode iterator.
* config/i386/predicates.md
(nonimmediate_or_const_vec_dup_operand): New predicate.
gcc/testsuite/ChangeLo
Hi:
AVX512F supports vscalefs{s,d}, which is the same as ldexp except the second
operand is floating point.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
gcc/ChangeLog:
PR target/98309
* config/i386/i386.md (ldexp3): Extend to vscalefs[sd]
when TARGET_
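As a sketch of the relation to ldexp (per Intel's description, vscalef computes src1 * 2**floor(src2); the helper name here is illustrative):

```python
import math

def vscalef_scalar(x, y):
    # vscalefs{s,d} semantics: x * 2**floor(y).  With an integral y this
    # matches ldexp exactly, except that y is passed as a float.
    return x * (2.0 ** math.floor(y))

# ldexp(3.0, 4) and vscalef(3.0, 4.0) both give 48.0
```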
Hi:
Add a define_insn_and_split to combine avx_vec_concatv16si/2 and
avx512f_zero_extendv16hiv16si2_1 since the latter already zero_extends
the upper bits, similar to other patterns which are related to
pmovzx{bw,wd,dq}.
It will do optimization like
- vmovdqa %ymm0, %ymm0# 7 [c=4 l=
Hi:
This is the patch I'm going to check in.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,};
2021-08-12 Uros Bizjak
gcc/ChangeLog:
PR target/98309
* config/i386/i386.md (avx512f_scalef2): New
define_insn.
(ldexp3): Adjust for new define_insn.
Hi:
This is another patch to optimize vec_perm_expr to match vpmov{dw,dq,wb}
under AVX512.
For scenarios (like pr101846-2.c) where the upper half is not used, this patch
generates better code with only one vpmov{wb,dw,qd} instruction. For
scenarios (like pr101846-3.c) where the upper half is actu
Hi:
Here's the updated patch, which does 3 things:
1. Support vpermw/vpermb in ix86_expand_vec_one_operand_perm_avx512.
2. Support 256/128-bits vpermi2b in ix86_expand_vec_perm_vpermt2.
3. Add define_insn_and_split to optimize specific vector permutation to
opmov{dw,wb,qd}.
Bootstrapped and regtes
Hi:
avx512f_scalef2 only accepts register_operand for operands[1];
force it to a reg in ldexp3.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/101930
* config/i386/i386.md (ldexp3): Force operands[1] to
reg.
gcc/testsuite
Hi:
This patch adds a new x86 tune named X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD
to enable haddpd for v2df vector reduction; the tune is disabled by default.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
Ok for trunk?
gcc/ChangeLog:
PR target/97147
* config/i386/i386.h
This reverts commit 872da9a6f664a06d73c987aa0cb2e5b830158a10.
PR target/101936
PR target/101929
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}
Pushed to master.
---
gcc/config/i386/i386.c | 6 +-
gcc/config/i386/i386.h | 1 -
gcc/config/i386/x8
Performance impact for the commit with option:
-march=x86-64 -O2 -ftree-vectorize -fvect-cost-model=very-cheap
SPEC2017 fprate
503.bwaves_r     BuildSame
507.cactuBSSN_r  -0.04
508.namd_r        0.14
510.parest_r     -0.54
511.povray_r      0.10
519.lbm_r        B
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
PR target/102016
* config/i386/sse.md (*avx512f_pshufb_truncv8hiv8qi_1): Add
TARGET_AVX512BW to condition.
gcc/testsuite/ChangeLog:
PR target/102016
* gcc.target/i3
Also optimize the 3 forms below to vpternlog; op1, op2, op3 are
register_operand or unary_p as (not reg):
A: (any_logic (any_logic op1 op2) op3)
B: (any_logic (any_logic op1 op2) (any_logic op3 op4)) op3/op4 should
be equal to op1/op2
C: (any_logic (any_logic (any_logic:op1 op2) op3) op4) op3/op4 shoul
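Each of those forms collapses into a single 3-input boolean function, which vpternlog encodes as an 8-bit truth-table immediate. A small Python sketch of computing that immediate (the bit-index convention used here is one common choice, not taken from the patch):

```python
def ternlog_imm(fn):
    # Bit i of the immediate holds fn(a, b, c) for the input combination
    # i = (a << 2) | (b << 1) | c -- i.e. the 8-entry truth table of fn.
    imm = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if fn(a, b, c):
                    imm |= 1 << ((a << 2) | (b << 1) | c)
    return imm

# Form A instantiated with AND/OR, (a & b) | c, has truth table 0xea;
# the classic parity function a ^ b ^ c is 0x96.
```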
Hi:
This patch extends change_zero_ext to change illegitimate constants
into constant-pool references; this will enable simplification of the below:
Trying 5 -> 7:
5: r85:V4SF=[`*.LC0']
REG_EQUAL const_vector
7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
REG_DEAD r85:V4SF