Here's the patch I've committed.
The patch also removes % for vfmaddcph.
gcc/ChangeLog:
PR target/111306
PR target/111335
* config/i386/sse.md (int_comm): New int_attr.
(fma__):
Remove % for Complex conjugate operations since they're not
commutative.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} on SPR.
Ready to push to trunk and backport to GCC13/GCC12.
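As a quick standalone illustration (my example, not part of the patch)
of why the conjugate forms must not be marked commutative:

#include <complex>
#include <cstdio>

int main ()
{
  std::complex<float> a (1, 2), b (3, 4);
  // a * conj (b) and b * conj (a) differ in the sign of the imaginary
  // part, so the operands of the conjugate FMA patterns can't be swapped:
  std::complex<float> x = a * std::conj (b);  // (11, 2)
  std::complex<float> y = b * std::conj (a);  // (11, -2)
  std::printf ("(%g,%g) vs (%g,%g)\n", x.real (), x.imag (),
               y.real (), y.imag ());
}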
gcc/ChangeLog:
PR target/111306
* config/i386/sse.md (int_comm): New int_attr.
(fma__):
Remove % for Complex conjugate operations since they're not
commutative.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md
(_vpermt2var3): New define_insn.
(VHFBF_AVX512VL): New mode iterator.
(VI2HFBF_AVX512VL): New mode iterator.
---
gcc/config/i386/sse.md | 32
On SPR, vmovsh can be executed on 3 ports, while vpblendw can only be
executed on 2 ports.
On znver4, vpblendw can be executed on 4 ports; if vmovsh is similar
to vmovss, then it can also be executed on 4 ports.
So there should be no difference on znver4, but vmovsh is better
optimized on SPR.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
gather/scatter.
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener
Date: Wed Jan 18 11:04:49 2023 +0100
Adjust costing of emulated vectorized gather/scatter
Emulated gather/scatter behave similar to
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (_blendm): Merge
VF_AVX512HFBFVL into VI12HFBF_AVX512VL.
(VF_AVX512HFBF16): Renamed to VHFBF.
(VF_AVX512FP16VL): Renamed to VHF_AVX512VL.
(VF_
vpmaskmov{d,q} is available for TARGET_AVX2, vmaskmov{ps,pd} is
available for TARGET_AVX; w/o TARGET_AVX2, we can use vmaskmov{ps,pd}
for VI48_128_256.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/19
* config/i386/sse.md (
Merge V_128H and V_256H into V_128 and V_256, adjust related patterns.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (vec_set): Removed.
(V_128H): Merge into ..
(V_128): .. this.
(V_256H): Merge
Both "graniterapid-d" and "graniterapids" are attached with
PROCESSOR_GRANITERAPID in processor_alias_table but mapped to
different __cpu_subtype in get_intel_cpu.
And get_builtin_code_for_version will try to match the first
PROCESSOR_GRANITERAPIDS in processor_alias_table which maps to
"granitepr
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/invariant-ternlog-1.c: Only scan %rdx under
TARGET_64BIT.
---
gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/testsuite/gcc.target/i386
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512f-pr88464-2.c: Add -mgather to
options.
* gcc.target/i386/avx512f-pr88464-3.c: Ditto.
* gcc.target/i386/avx512f-pr88464-4.c: Ditto.
* gcc.target/i386/avx512f-pr88464-6.c: Ditto.
* gcc.target/i386/avx51
---
htdocs/gcc-14/changes.html | 4
1 file changed, 4 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index eae25f1a..2c888660 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -151,6 +151,10 @@ a work-in-progress.
-march=luna
Alderlake-N is E-core only, add it as an alias of Alderlake.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_intel_cpu): Detect
Alderlake-N.
* common/config/i386/i386-common.cc (alias_table): Suppo
vmovapd can enable register renaming and has the same code size as
vmovsd. Similarly for vmovsh vs. vmovaps: vmovaps is 1 byte shorter
than vmovsh.
When TARGET_AVX512VL is not available, still generate
vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Rename the original use_gather to use_gather_8parts, support
-mtune-ctrl={,^}use_gather to set/clear the tune features
use_gather_{2parts,4parts,8parts}, and support the new option -mgather
as an alias of -mtune-ctrl=use_gather / ^use_gather.
Similarly for use_scatter.
How about this version?
gcc/ChangeLog:
For more details of GDS (Gather Data Sampling), refer to
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
After the microcode update, there's a performance regression. To avoid
that, the patch disables gather gene
Currently we have 3 different independent tunes for gather
"use_gather,use_gather_2parts,use_gather_4parts",
similarly for scatter, there are
"use_scatter,use_scatter_2parts,use_scatter_4parts"
The patch supports 2 standardized options to enable/disable
vectorization for all gather/scatter instructio
Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named
patterns in order to avoid generation of partial vector V8HFmode
trapping instructions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/110832
* config/i386/mmx.md
This minor fix is preapproved in [1].
Committed to trunk.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626758.html
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_available_features):
Rename local variable subleaf_level to max_subleaf_level.
---
gcc/common/config
> Please rather do it in a more self-descriptive way, as proposed in the
> attached patch. You won't need a comment then.
>
Adjusted in V2 patch.
Don't access leaf 7 subleaf 1 unless subleaf 0 says it is
supported via EAX.
Intel documentation says invalid subleaves return 0. We had been
relying on that behavior instead of checking the max subleaf number.
It appears that some Sandy Bridge CPUs return at least the subleaf 0
EDX value for subl
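A minimal sketch of the fixed probing order, using the __cpuid_count
macro from GCC's <cpuid.h> (illustrative only, not the patch itself):

#include <cpuid.h>

void probe_leaf7 (void)
{
  unsigned int eax, ebx, ecx, edx;
  // leaf 7 subleaf 0: EAX reports the maximum supported subleaf
  __cpuid_count (7, 0, eax, ebx, ecx, edx);
  unsigned int max_subleaf = eax;
  // only query subleaf 1 once subleaf 0 says it exists
  if (max_subleaf >= 1)
    __cpuid_count (7, 1, eax, ebx, ecx, edx);
}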
Similar to r14-2786-gade30fad6669e5, this patch handles V4HF/V2HFmode.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/110762
* config/i386/mmx.md (3): Changed from define_insn
to define_expand and break into ..
(v4
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:
In function ‘matmul_i1_avx512f’:
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1:
internal compiler error: RTL check: expected
In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics
after AVX2; it was rejected as
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart
AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
Also remove scalar mode from fmaddsub/fmsubadd pattern since there's
no scalar instruction for that.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/81904
* config/i3
After
b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit
commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb
Author: Jan Hubicka
Date: Fri Jul 28 09:16:09 2023 +0200
loop-split improvements, part 1
Now we have
vpbroadcastd %ecx, %xmm0
vpaddd .LC3(%rip), %xmm0, %xmm0
v
Prevent rtl optimization of vec_duplicate + zero_extend to
vpbroadcastm since there could be an extra kmov after RA.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
PR target/110788
* config/i386/sse.md (avx512cd_maskb_vec_dup): Add
For Intel processors, after TARGET_AVX, vmovdqu is optimized to be as
fast as vlddqu, so UNSPEC_LDDQU can be removed to enable more
optimizations.
Can someone confirm this with the AMD folks?
If AMD doesn't like such an optimization, I'll put my optimization under
micro-architecture tuning.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> I see some regressions most likely with this change on i686-linux,
> in particular:
> +FAIL: gcc.dg/pr107547.c (test for excess errors)
> +FAIL: gcc.dg/torture/floatn-convert.c -O0 (test for excess errors)
> +UNRESOLVED: gcc.dg/torture/floatn-convert.c -O0 compilation failed to
> produce execu
optimize_insn_for_speed () in assemble output is not aligned with the
splitter condition, and it causes an ICE when building SPEC2017
blender_r.
Not sure if ctrl is supposed to be reliable in assemble output; the
patch just removes that as a workaround.
Bootstrapped and regtested on x86_64-pc-linux-gnu
Antony Polukhin 2023-07-11 09:51:58 UTC
There's a typo at
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/testsuite/g%2B%2B.target/i386/pr110170.C;h=e638b12a5ee2264ecef77acca86432a9f24b103b;hb=d41a57c46df6f8f7dae0c0a8b349e734806a837b#l87
It should be `|| !test3() || !test3r()` rather than `|| !te
Similar to what we did for CMPXCHG, but extended to all
ix86_comparison_int_operator since CMPCCXADD sets EFLAGS exactly the
same as CMP.
When the operand order in the CMP insn is the same as that in
CMPCCXADD, the CMP insn can be eliminated directly.
When the operand order is swapped in the CMP insn, only optimize
cmpccxadd +
Here's the updated patch.
1. Use optimize_insn_for_speed_p instead of optimize_function_for_speed_p.
2. Explicitly move memory to the dest register to avoid a false
dependence in the one_cmpl pattern.
A false dependency happens when the destination is only updated by
pternlog. There is no false dependency whe
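A hedged sketch of the pternlog-only case with intrinsics: tying every
source to the destination's value avoids the pure-output form that
creates the false dependence:

#include <immintrin.h>

// ~a via vpternlog: imm 0x55 is NOT of the third source operand, and
// all three sources are `a`, so the result never depends on stale dst
__m512i not_v (__m512i a)
{
  return _mm512_ternarylogic_epi64 (a, a, a, 0x55);
}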
Similar to what we did for cmpxchg, but extended to all
ix86_comparison_int_operator since cmpccxadd sets EFLAGS exactly the
same as CMP.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/110591
* config/i386/sync.md (cmpccxadd_): Add a new
A false dependency happens when the destination is only updated by
pternlog. There is no false dependency when the destination is also
used as a source. So either a pxor should be inserted, or an input
operand should be given constraint '0'.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to p
> Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX
> and the other emitting UNSPEC_IEEE_MIN.
Split.
> The test involves blendv instruction, which is SSE4.1, so it is
> pointless to test it without -msse4.1. Please add -msse4.1 instead of
> -march=x86_64 and use sse4_runtime
We have ix86_expand_sse_fp_minmax to detect min/max semantics, but it
requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false. For the
testcase in the PR, there's an extra move from cmp_op0 to if_true, so
it fails ix86_expand_sse_fp_minmax.
This patch adds a pre_reload splitter to detect the
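A minimal testcase shaped like the failing pattern (hypothetical,
modeled on the description above); the extra copy makes if_true differ
from cmp_op0 even though the semantics are still a min:

double dmin (double a, double b)
{
  double t = a;          // extra move: if_true is no longer cmp_op0
  return a < b ? t : b;  // still minsd semantics
}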
They should have the same cost as the vector mode since both generate
pand/pandn/pxor/por instructions.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for
DF/SFmode AND/IOR/XOR/ANDN operations.
For testcase
void __cond_swap(double* __x, double* __y) {
bool __r = (*__x < *__y);
auto __tmp = __r ? *__x : *__y;
*__y = __r ? *__y : *__x;
*__x = __tmp;
}
GCC-14 with -O2 and -march=x86-64 options generates the following code:
__cond_swap(double*, double*):
movsd xmm1, QWORD
vpternlog is also used for optimizations which don't need any valid
input operand; in that case, the destination is used as an input in the
instruction and that creates a false dependence.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR t
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/82735
* config/i386/i386.cc (ix86_avx_u127_mode_needed): Don't emit
vzeroupper for vzeroupper call_insn.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx-vzeroupper-30.
pass_insert_vzeroupper is under condition
TARGET_AVX && TARGET_VZEROUPPER
&& flag_expensive_optimizations && !optimize_size
But the documentation of -mvzeroupper doesn't mention that the
insertion requires -O2 and above, which may confuse users when they
explicitly use -Os -mvzeroupper.
mvzeroupp
At the rtl level, we cannot guarantee that the maskstore is not
optimized to other full-memory accesses, as the current implementations
are equivalent in terms of pattern. To solve this potential problem,
this patch refines the pattern of the maskstore and the intrinsics with
an unspec.
One thing I'm
__bfloat16 is redefined from a typedef of short to the real __bf16
since GCC 13. The patch issues a warning for potential silent implicit
conversions between __bf16 and short where users may only expect a
data movement.
To avoid too many false positives, the warning is only issued under
TARGET_AVX512BF16.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
When there are multiple operands in vec_oprnds0, vec_dest will be
overwritten to vectype_out, but in the multi_step_cvt case, cvt_type is
expected. It caused an ICE in verify_gimple_in_cfg.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and aarch64-linux-gnu.
Ok for trunk?
gcc/ChangeLog:
The new assembly looks better than the original one, so I adjusted those testcases.
Ok for trunk?
gcc/testsuite/ChangeLog:
PR tree-optimization/110371
PR tree-optimization/110018
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Scan scvt +
sxtw instead of scvt + zip1 + z
> > Hmm, good question. GENERIC has a direct truncation to unsigned char
> > for example, the C standard generally says if the integral part cannot
> > be represented then the behavior is undefined. So I think we should be
> > safe here (0x1.0p32 doesn't fit an int).
>
> We should be following An
If mem_addr points to a memory region with less than whole-vector-size
bytes of accessible memory, and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent
it from being transformed to vpblendd.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
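An illustrative use of that guarantee with AVX2 intrinsics (my
example): the masked-off lanes must never be read, so rewriting the
load as a full load + vpblendd could fault:

#include <immintrin.h>

// p points to a buffer with only 3 accessible ints; the zeroed mask
// lanes must not be touched in memory
__m256i load3 (const int *p)
{
  __m256i k = _mm256_setr_epi32 (-1, -1, -1, 0, 0, 0, 0, 0);
  return _mm256_maskload_epi32 (p, k);
}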
I noticed there's some refactoring in vectorizable_conversion
for code_helper, so I've adjusted my patch to that.
Here's the patch I'm going to commit.
We already use an intermediate type in case WIDEN, but not for NONE;
this patch extends that.
gcc/ChangeLog:
PR target/110018
* tre
The packing in vpacksswb/vpackssdw is not a simple concat; it's an
interweave from src1 and src2 for every 128 bits (or 64 bits for the
ss_truncate result).
I.e.
dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127] = ss_truncate (src2[0-127])
dst[0-63] =
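The same per-128-bit-lane interleave is visible at the intrinsics
level (illustrative sketch):

#include <immintrin.h>

// 256-bit vpacksswb/vpackssdw pack within each 128-bit lane, so the
// result is lane0(a), lane0(b), lane1(a), lane1(b) - not concat(a, b)
__m256i pack (__m256i a, __m256i b)
{
  return _mm256_packs_epi16 (a, b);
}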
packuswb/packusdw do unsigned saturation of a signed source, but RTL
us_truncate does unsigned saturation of an unsigned source.
So for the value -1, packuswb will produce 0, but us_truncate produces
255. The patch reimplements the related patterns and functions with
UNSPEC_US_TRUNCATE instead of
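A worked example of the difference (standalone snippet, not from the
patch):

#include <immintrin.h>

// packuswb saturates the *signed* input -1 to 0; us_truncate of the
// same 0xffff bits, read as unsigned, would give 255 instead
__m128i all_zero (void)
{
  __m128i v = _mm_set1_epi16 (-1);
  return _mm_packus_epi16 (v, v);   // every byte is 0, not 255
}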
Since there's no evex version for vpcmpeq ymm, ymm, ymm.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk and backport to GCC13.
gcc/ChangeLog:
PR target/110227
* config/i386/sse.md (mov_internal): Use x instead of v
for alternative 2 sinc
r14-1145 folds the intrinsics into gimple ABS_EXPR, which has UB for
TYPE_MIN, but PABSB will store the unsigned result into dst. The patch
uses ABSU_EXPR + VCE instead of ABS_EXPR.
Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
vector absm2 is guarded with TARGET_MMX_WITH_SSE.
g
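The TYPE_MIN corner case, as a sketch:

#include <immintrin.h>

// |-128| = 128 doesn't fit in signed char, so gimple ABS_EXPR would be
// UB here; pabsb simply stores the unsigned value 0x80, which matches
// ABSU_EXPR plus a view-convert back to the signed vector type
__m128i abs_min (void)
{
  return _mm_abs_epi8 (_mm_set1_epi8 (-128));   // all bytes 0x80
}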
> I think this is a better patch and will always be correct and still
> get folded at the gimple level (correctly):
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..02bf5ba93a5 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -18561,8
mask < 0 will always be false with -funsigned-char, but vpblendvb
needs to check the most significant bit.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to GCC12/GCC13 release branch?
gcc/ChangeLog:
PR target/110108
* config/i386/i386-b
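A sketch of the trigger (hypothetical testcase): with -funsigned-char,
plain char is unsigned, so the folded comparison disagrees with what a
vectorized vpblendvb would test:

// `c < 0` folds to false for unsigned plain char, yet vpblendvb
// selects on each byte's most significant bit
char sel (char c, char a, char b)
{
  return c < 0 ? a : b;
}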
This patch only supports vec_pack/unpacks optabs for vector modes whose
length >= 128.
32/64-bit vectors are instead handled by the BB vectorizer with
truncmn2/extendmn2/fix{,uns}_truncmn2.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
*
We already use an intermediate type in case WIDEN, but not for NONE;
this patch extends that.
I didn't do that in pattern recog since we need to know whether the
stmt belongs to any slp_node to decide the vectype; the related optabs
are checked according to vectype_in and vectype_out. For non-sl
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi -> v2qi vector
truncate.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/92658
* config/i386/mmx.md (truncv2hiv2qi2): New define_insn.
(truncv2si2): Ditto.
gcc/testsu
For the testcase in the PR, we have
br64 = br;
br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
n->n: 0x300200.
n->range: 32.
n->type: uint64.
The original code assumes n->range is the same as
TYPE_PRECISION (n->type), and tries to rotate the mask from 0x30200
lzcnt/tzcnt has been fixed since Skylake, and popcnt has been fixed
since Icelake. At least for Icelake and later Intel Core processors,
the errata tune is not needed. And the tune isn't needed for Atom
either.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/Chang
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to
transform notl + pbroadcast + pand to pbroadcast + pandn for
VI124_AVX2 which leaves out all DI-element-size ones as
well as all 512-bit ones.
This patch extends the splitter to VI_AVX2, which will handle DImode
for AVX2, and V64QI
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/109900
* config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
_mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and
r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in cost
calculation when the preferred register classes are not known yet.
It regressed powerpc PR109610 and PR109858; it looks too aggressive to
use NO_REGS when the mode can be allocated with GENERAL_REGS.
The patch takes a step back, still use
> I think this could be simplified if you use either EnumSet or
> EnumBitSet instead in common.opt for `-fcf-protection=`.
Use EnumSet instead of EnumBitSet since CF_FULL is not a power of 2.
Sets classification is a bit tricky: cf_branch and cf_return should be
in different sets, but they bo
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/89701
* common.opt: Refactor -fcf-protection= to support combination
of param.
* lto-wrapper.c (merge_and_complain): Adjusted.
* opts.c (parse_cf_protection_opt
> The quoted patch shows -shared in context and you didn't post a
> backport version
> to look at. But yes, we shouldn't change -shared behavior on a
> branch, even less so make it
> inconsistent between targets.
Here's the patch.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for
The patch doesn't handle:
1. cast64_to_32,
2. memory source with rsize < range.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR middle-end/108938
* gimple-ssa-store-merging.cc (is_bswap_or_nop_p): New
function, cut from origin
> > @@ -4799,7 +4800,8 @@ vect_create_vectorized_demotion_stmts (vec_info *vinfo, vec *vec_oprnds,
> >                                         stmt_vec_info stmt_info,
> >                                         vec &vec_dsts,
> >                                         gimple_stmt_iterator *gsi,
r14-172-g0368d169492017 uses NO_REGS instead of GENERAL_REGS in memory
cost calculation when the preferred register class is unknown.
+  /* Costs for NO_REGS are used in cost calculation on the
+     1st pass when the preferred register classes are not
+     known yet.  In this case we take the
Here's the updated patch with documentation in md.texi.
Ok for trunk?
--
Use swap_commutative_operands_p for canonicalization. When both values
have the same operand precedence, the first bit in the mask should
select the first operand.
The canonicalization should help backends with pattern matching.
Similar to WIDEN FLOAT_EXPR, when the direct_optab doesn't exist, try
an intermediate integer type whenever the gimple ranger can tell it's
safe. I.e. when there's no direct optab for vector long long -> vector
float, but the value range of the integer can be represented as int,
try vector int -> vector float.
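A hypothetical testcase in that spirit: the ranger can prove the
masked value fits in int, so vector int -> vector float is usable:

void cvt (const long long *a, float *r, int n)
{
  for (int i = 0; i < n; i++)
    // value range is [0, 0xffff], so an int intermediate is safe
    r[i] = (float) (a[i] & 0xffff);
}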
Ready to push to trunk.
gcc/testsuite/ChangeLog:
PR tree-optimization/109011
* gcc.target/i386/pr109011-b1.c: New test.
* gcc.target/i386/pr109011-b2.c: New test.
* gcc.target/i386/pr109011-d1.c: New test.
* gcc.target/i386/pr109011-d2.c: New test.
* g
> > + if (!TARGET_SSE2)
> > +{
> > + if (c_dialect_cxx ()
> > + && cxx_dialect > cxx20)
>
> Formatting, both conditions are short, so just put them on one line.
Changed.
> But for the C++23 macros, more importantly I think we really should
> also in ix86_target_macros_internal add
> But for the C++23 macros, more importantly I think we really should
> also in ix86_target_macros_internal add
> if (c_dialect_cxx ()
> && cxx_dialect > cxx20
> && (isa_flag & OPTION_MASK_ISA_SSE2))
> {
> def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> def_or_undef
Use swap_commutative_operands_p for canonicalization. When both values
have the same operand precedence, the first bit in the mask should
select the first operand.
The canonicalization should help backends with pattern matching. I.e.
the x86 backend has lots of vec_merge patterns, and combine will create any f
After the optimization for RA, a memory op is not propagated into
instructions (>1), which makes the testcases not generate vxorps since
the memory is loaded into the dest, and the dest is never unused now.
So rewrite the testcases to make the codegen more stable.
gcc/testsuite/ChangeLog:
* gcc.target/
1547 /* If this insn loads a parameter from its stack slot, then it
1548 represents a savings, rather than a cost, if the parameter is
1549 stored in memory. Record this fact.
1550
1551 Similarly if we're loading other constants from memory (constant
1552 pool, TOC references, sma
-- Jakub's comments --
That said, these fundamental types whose presence/absence depends on ISA flags
are quite problematic IMHO, as they are incompatible with the target
attribute/pragmas. Whether they are available or not available depends on
whether in this case SSE2 is enabled during c
There's a potential performance issue when the backend returns some
unreasonable value for a mode which can never be allocated with the
reg class.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk (or GCC14 stage1)?
gcc/ChangeLog:
PR rtl-optimization/109351
* ira.
Looked through all backends which define signbitm2:
1. When m is a scalar mode, the dest is SImode.
2. When m is a vector mode, the dest mode is the vector integer
mode that has the same size and number of elements as m.
Ok for trunk?
gcc/ChangeLog:
* doc/md.texi: Document signbitm2.
---
gcc/doc
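For the scalar case, a small illustration (my example): with a
signbitdf2 pattern, the DFmode input yields an SImode result, which is
what __builtin_signbit expands to:

int sbit (double x)
{
  return __builtin_signbit (x);
}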
RA sometimes will use the lowest cost of the mode among all different
regclasses w/o checking hard_regno_mode_ok.
It's impossible to put modes whose size > 8 into MASK_REGS, so adjust
the cost to avoid a potential performance issue.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for t
> > Just rename the instruction and fix all its call sites. The name of
> > the insn pattern is internal to the compiler and can be renamed at
> > will.
>
> Ideally, we should standardize all the names to a standard name, so
> e.g. ufix_ -> fixuns_ and ufloat -> floatuns.
Updated.
There's some t
There are some typos in the standard pattern names for
unsigned_{float,fix}: they should be floatunsmn2/fixuns_truncmn2, not
ufloatmn2/ufix_truncmn2 as in current trunk; the patch fixes the typos.
Also vcvttps2udq is available under AVX512VL, so it can be generated
directly instead of being emulated via vcvt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for GCC14 stage-1 (or maybe trunk)?
gcc/ChangeLog:
* config/i386/i386-expand.cc (expand_vec_perm_blend): Generate
vpblendd instead of vpblendw for V4SI under avx2.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr888
The target hook is only used by i386, and the current definition is
the same as the default gen_reg_rtx. So there's no need for this
target hook.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk (or GCC14)?
gcc/ChangeLog:
* builtins.cc (builtin_memset_read_str): Replace
Normally when vf is not constant, it will be rejected by
vectorizable_nonlinear_induction, but this case failed to go
into
  if (STMT_VINFO_RELEVANT_P (stmt_info))
    {
      need_to_vectorize = true;
      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def
          &&
Ready to push to trunk.
---
htdocs/gcc-12/changes.html | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index 30fa4d6e..49055ffe 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.html
@@ -754,7 +754,7 @@
The official name is AVX512-FP16.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386.opt: Change AVX512FP16 to AVX512-FP16.
* doc/invoke.texi: Ditto.
---
gcc/config/i386/i386.opt | 2 +-
gcc/doc/invoke.texi | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
Patches [1] and [2] fixed PR55522 for x86-linux but left all other x86
targets unfixed (x86-cygwin, x86-darwin and x86-mingw32).
This patch applies a similar change to other specs using crtfastmath.o.
Ok for trunk?
[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608528.html
[2] https:
Update in v2:
1. Support -mno-daz-ftz, and make the option effectively three-state:
   if (mdaz-ftz)
     link crtfastmath.o
   else if ((Ofast || ffast-math || funsafe-math-optimizations)
            && !shared && !mno-daz-ftz)
     link crtfastmath.o
   else
     don't link crtfastmath.o
2. Still make the op
Update in V2:
Split -shared change into a separate commit and add some documentation
for it.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
Don't add crtfastmath.o for -shared to avoid changing the MXCSR register
when loading a shared library. crtfastmath.o will be used onl
Don't add crtfastmath.o for -shared to avoid changing the MXCSR
register when loading a shared library. crtfastmath.o will be used
only when building executables.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/55522
PR target/368
ice.i:7:1: error: unrecognizable insn:
7 | }
| ^
(insn 7 6 8 2 (set (reg:V2SF 84 [ vect__3.8 ])
(unspec:V2SF [
(reg:V2SF 86 [ vect__1.7 ])
(const_int 11 [0xb])
] UNSPEC_ROUND)) "ice.i":5:14 -1
(nil))
during RTL pass: vregs
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
After supporting extendbfsf2_1, ix86_expand_fast_convert_bf_to_sf can
also be improved with pslld.
CONST_INT_P is not handled since a constant shift can be optimized away.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
* config/i386/i386-expand.cc
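The underlying trick, sketched in scalar form (assumed illustration):
__bf16 is the high half of a float, so the widening is a 16-bit left
shift, which vectorizes to a single pslld:

#include <cstring>

float bf16_to_float (unsigned short bits)
{
  unsigned int u = (unsigned int) bits << 16;   // pslld $16 on vectors
  float f;
  std::memcpy (&f, &u, sizeof f);
  return f;
}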
;; if reg/mem op
(define_insn_reservation "slm_sseishft_3" 2
  (and (eq_attr "cpu" "slm")
       (and (eq_attr "type" "sseishft")
            (not (match_operand 2 "immediate_operand"))))
  "slm-complex, slm-all-eu")
In slm.md it will check operands[2] for type sseishft, but for
extendbfsf2_1 the
Update in V2:
Add documentation for -mlam={none,u48,u57} to x86 options in invoke.texi.
gcc/ChangeLog:
* doc/invoke.texi (x86 options): Document
-mlam={none,u48,u57}.
* config/i386/i386-opts.h (enum lam_type): New enum.
* config/i386/i386.c (ix86_memtag_can_tag_add
For __builtin_ia32_vec_set_v16qi (a, -1, 2) with
!flag_signed_char, it's transformed to
__builtin_ia32_vec_set_v16qi (_4, 255, 2) in the gimple,
and expanded to (const_int 255) in the rtl. But immediate_operand
expects (const_int 255) to be sign-extended to
(const_int -1). The mismatch ca
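A hypothetical source-level trigger: inserting the byte value 0xff,
which must become (const_int -1) in canonical QImode RTL:

#include <smmintrin.h>

// the char constant 0xff reaches expand as 255 when plain char is
// unsigned; immediate_operand wants the sign-extended (const_int -1)
__m128i set_byte (__m128i a)
{
  return _mm_insert_epi8 (a, 0xff, 2);
}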
Update in V3:
Remove !flag_signaling_nans since there's already HONOR_NANS (BFmode).
Here's the patch:
After supporting the real __bf16, the implementation of _mm_cvtsbh_ss
went wrong.
The patch adds a builtin to generate pslld for the intrinsic; also,
extendbfsf2 is supported with pslld when !HONOR_N
After supporting the real __bf16, the implementation of _mm_cvtsbh_ss
went wrong.
The patch adds a builtin to generate pslld for the intrinsic; also,
extendbfsf2 is supported with pslld when !flag_signaling_nans &&
!HONOR_NANS (BFmode).
truncsfbf2 is supported with vcvtneps2bf16 when !flag_signaling_na