[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #37 from Hongtao Liu  ---
(In reply to Richard Biener from comment #36)
> For example with AVX512VL and the following, using -O -fgimple -mavx512vl
> we get simply
> 
> notl    %esi
> orl     %esi, %edi
> cmpb    $15, %dil
> je      .L6
> 
> typedef long v4si __attribute__((vector_size(4*sizeof(long))));
> typedef v4si v4sib __attribute__((vector_mask));
> typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));
> 
> void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
> {
>   v4sib tem;
> 
> __BB(2):
>   tem_5 = ~v2_2(D);
>   tem_3 = v1_1(D) | tem_5;
>   tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
> _Literal (sbool1) -1, _Literal (sbool1) -1 };
>   if (tem_3 == tem_4)
> goto __BB3;
>   else
> goto __BB4;
> 
> __BB(3):
>   __builtin_abort ();
> 
> __BB(4):
>   return;
> }
> 
> 
> the question is whether that matches the semantics of GIMPLE (the padding
> is inverted, too), whether it invokes undefined behavior (don't do it - it
> seems for people using intrinsics that's what it is?) or whether we
> should avoid affecting padding.
> 
> Note after the patch I proposed on the mailing list the constant mask is
> now expanded with zero padding.

I think we should also mask off the upper bits of a variable mask?

notl    %esi
orl     %esi, %edi
notl    %edi
andl    $15, %edi
je      .L3
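
A minimal scalar sketch of the padding problem (assuming a 4-lane mask kept in
an 8-bit register, as for the kmask case above; the helper is purely
illustrative):

/* Only the low 4 bits of v1/v2 are real lanes.  The NOT also flips the
   upper 4 padding bits, so comparing the whole register against 0x0f
   can misfire unless the padding is masked off first.  */
static int
all_lanes_set (unsigned char v1, unsigned char v2)
{
  unsigned char m = (unsigned char) (v1 | (unsigned char) ~v2);
  return (m & 0x0f) == 0x0f;
}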

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #38 from Hongtao Liu  ---

> I think we should also mask off the upper bits of a variable mask?
> 
> notl    %esi
> orl     %esi, %edi
> notl    %edi
> andl    $15, %edi
> je      .L3

With -mbmi, it's

andn    %esi, %edi, %edi
andl    $15, %edi
je      .L3

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #39 from Hongtao Liu  ---
> > the question is whether that matches the semantics of GIMPLE (the padding
> > is inverted, too), whether it invokes undefined behavior (don't do it - it
> > seems for people using intrinsics that's what it is?)
For the intrinsics, the instructions only care about the lower bits, so it's not
a big issue? It sounds like a similar issue to _BitInt(4)/_BitInt(2); I assume
there is garbage in the upper bits.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #43 from Hongtao Liu  ---

> Well, yes, the discussion in this bug was whether to do this at consumers
> (that's sth new) or with all mask operations (that's how we handle
> bit-precision integer operations, so it might be relatively easy to
> do that - specifically spot the places eventually needing adjustment).
> 
> There's do_store_flag to fixup for uses not in branches and
> do_compare_and_jump for conditional jumps.

Reasonable enough for me.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #44 from Hongtao Liu  ---
> 
> Note the AND is removed by combine if I add it:
> 
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
> (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
> (const_int 15 [0xf]))
> (const_int 0 [0])))
> 
> (*testhi_not)
> 
> -9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
> +  REG_DEAD r99:QI
> +9: NOTE_INSN_DELETED
> +   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
>REG_DEAD r102:QI
> -  REG_UNUSED flags:CC
> -   12: flags:CCZ=cmp(r103:QI,0xf)
> -  REG_DEAD r103:QI
> 
> and we get
> 
> foo:
> .LFB0:
> .cfi_startproc
> notl    %esi
> orl     %esi, %edi
> notl    %edi
> testb   $15, %dil
> je      .L6
> ret
> 
> which I'm not sure is OK?
> 

Yes, I think it's on purpose:

;; Split and;cmp (as optimized by combine) into not;test
;; Except when TARGET_BMI provides andn (*andn_<mode>_ccno).
(define_insn_and_split "*test<mode>_not"
  [(set (reg:CCZ FLAGS_REG)
        (compare:CCZ
          (and:SWI
            (not:SWI (match_operand:SWI 0 "register_operand"))
            (match_operand:SWI 1 ""))
          (const_int 0)))]
  "ix86_pre_reload_split ()
   && (!TARGET_BMI || !REG_P (operands[1]))"
  "#"
  "&& 1"
  [(set (match_dup 2) (not:SWI (match_dup 0)))
   (set (reg:CCZ FLAGS_REG)
        (compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
                     (const_int 0)))]
  "operands[2] = gen_reg_rtx (<MODE>mode);")

;; Split and;cmp (as optimized by combine) into andn;cmp $0
(define_insn_and_split "*test<dwi>_not_doubleword"
[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #45 from Hongtao Liu  ---

> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
> 
> reasonable enough for me.
I mean we only handle it at the consumers where the upper bits matter.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #57 from Hongtao Liu  ---
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

Makes sense.

[Bug tree-optimization/109885] gcc does not generate movmskps and testps instructions (clang does)

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885

--- Comment #4 from Hongtao Liu  ---
int sum() {
   int ret = 0;
   for (int i=0; i<8; ++i) ret +=(0==v[i]);
   return ret;
}

int sum2() {
   int ret = 0;
   auto m = v==0;
   for (int i=0; i<8; ++i) ret += m[i];
   return ret;
}

For sum, GCC tries to reduce a {0/1, 0/1, ...} vector; for sum2, it tries to
reduce a {0/-1, 0/-1, ...} vector. But LLVM reduces a {0/1, 0/1, ...} vector
for both sum and sum2. Not sure which is correct?

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu  ---
perm_cost is very low in the x86 backend, and that may be OK for 128-bit
vectors since pshufb/shufps are available for most cases.
But for 256/512-bit vectors, when the permutation is cross-lane, the cost could
be higher. One solution is to increase perm_cost when the vector size is more
than 128 bits, since vperm is most likely used instead of
vblend/vpblend/vpshuf/vshuf.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #8 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #7)
> perm_cost is very low in the x86 backend, and that may be OK for 128-bit
> vectors since pshufb/shufps are available for most cases.
> But for 256/512-bit vectors, when the permutation is cross-lane, the cost
> could be higher. One solution is to increase perm_cost when the vector size
> is more than 128 bits, since vperm is most likely used instead of
> vblend/vpblend/vpshuf/vshuf.

Furthermore, if we can get the indices in the backend when calculating the
vec_perm cost, we can check whether the permutation is cross-lane or not and
set the cost more accurately for 256/512-bit vector permutations.
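
A rough sketch of such a cross-lane check (a hypothetical helper, not an
existing hook; it only assumes the permutation indices and the element count
per 128-bit lane are available at costing time):

#include <stdbool.h>

/* Return true if any element moves across a 128-bit lane boundary,
   i.e. its source lane differs from its destination lane.  */
static bool
perm_is_cross_lane (const unsigned *indices, unsigned nelts,
                    unsigned nelts_per_lane)
{
  for (unsigned i = 0; i < nelts; ++i)
    if (indices[i] / nelts_per_lane != i / nelts_per_lane)
      return true;
  return false;
}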

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #11 from Hongtao Liu  ---
(In reply to N Schaeffer from comment #9)
> In addition, optimizing for size with -Os leads to a non-vectorized
> double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced
> by clang -Os) leads to 40 bytes.
> It is thus also a missed optimization for -Os.

Vectorization is enabled at -O2 but not at -Os.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #9 from Hongtao Liu  ---
The original case is a little different from the one in the PR.
It comes from ggml:

#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;
static float table_f32_f16[1 << 16];

inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
uint16_t s;
memcpy(&s, &f, sizeof(uint16_t));
return table_f32_f16[s];
}

typedef struct {
ggml_fp16_t d;
ggml_fp16_t m;
uint8_t qh[4];
uint8_t qs[32 / 2];
} block_q5_1;

typedef struct {
float d;
float s;
int8_t qs[32];
} block_q8_1;

void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void *
restrict vx, const void * restrict vy) {
const int qk = 32;
const int nb = n / qk;

const block_q5_1 * restrict x = vx;
const block_q8_1 * restrict y = vy;

float sumf = 0.0;

for (int i = 0; i < nb; i++) {
uint32_t qh;
memcpy(&qh, x[i].qh, sizeof(qh));

int sumi = 0;

for (int j = 0; j < qk/2; ++j) {
const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;

const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
const int32_t x1 = (x[i].qs[j] >> 4) | xh_1;

sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
}

sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi +
ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
}

*s = sumf;
}

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #10 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #9)
> The original case is a little different from the one in PR.
But the issue is similar: after cunrolli, GCC fails to vectorize the outer
loop.

The interesting thing is in estimated_unrolled_size: the original unr_insns is
288, which is bigger than param_max_completely_peeled_insns (200), but
unr_insns is decreased by 1/3 due to

   Loop body is likely going to simplify further, this is difficult
   to guess, we just decrease the result by 1/3.  */

In practice, this loop body is not simplified by 1/3 of its instructions.

Considering that the unroll factor is 16 and unr_insns is still large (192), I
was wondering if we could add some heuristic to avoid the complete loop unroll
here, because usually for such a big loop neither the loop nor the BB
vectorizer performs well.
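
Worked through with the numbers above: 288 * 2/3 = 192 <= 200
(param_max_completely_peeled_insns), so the complete unroll goes ahead; without
the 1/3 reduction, 288 > 200 and the loop would have been left alone.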

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #11 from Hongtao Liu  ---

>Loop body is likely going to simplify further, this is difficult
>to guess, we just decrease the result by 1/3.  */
> 

This was introduced by r0-68074-g91a01f21abfe19:

/* Estimate number of insns of completely unrolled loop.  We assume
+   that the size of the unrolled loop is decreased in the
+   following way (the numbers of insns are based on what
+   estimate_num_insns returns for appropriate statements):
+
+   1) exit condition gets removed (2 insns)
+   2) increment of the control variable gets removed (2 insns)
+   3) All remaining statements are likely to get simplified
+  due to constant propagation.  Hard to estimate; just
+  as a heuristics we decrease the rest by 1/3.
+
+   NINSNS is the number of insns in the loop before unrolling.
+   NUNROLL is the number of times the loop is unrolled.  */
+
+static unsigned HOST_WIDE_INT
+estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+unsigned HOST_WIDE_INT nunroll)
+{
+  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  if (unr_insns <= 0)
+unr_insns = 1;
+  unr_insns *= (nunroll + 1);
+
+  return unr_insns;
+}

And r0-93444-g08f1af2ed022e0 tried to do it more accurately by marking
likely_eliminated stmts and subtracting them from the total insns, but the 2/3
factor is still kept.

+/* Estimate number of insns of completely unrolled loop.
+   It is (NUNROLL + 1) * size of loop body with taking into account
+   the fact that in last copy everything after exit conditional
+   is dead and that some instructions will be eliminated after
+   peeling.

-   NINSNS is the number of insns in the loop before unrolling.
-   NUNROLL is the number of times the loop is unrolled.  */
+   Loop body is likely going to simplify futher, this is difficult
+   to guess, we just decrease the result by 1/3.  */

 static unsigned HOST_WIDE_INT
-estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+estimated_unrolled_size (struct loop_size *size,
 unsigned HOST_WIDE_INT nunroll)
 {
-  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  HOST_WIDE_INT unr_insns = ((nunroll)
+* (HOST_WIDE_INT) (size->overall
+   -
size->eliminated_by_peeling));
+  if (!nunroll)
+unr_insns = 0;
+  unr_insns += size->last_iteration -
size->last_iteration_eliminated_by_peeling;
+
+  unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
 unr_insns = 1;
-  unr_insns *= (nunroll + 1);

It looks to me like the 1/3 reduction overestimates the instructions that can
be optimized away, especially since we've already subtracted
eliminated_by_peeling.

[Bug target/114125] New: Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125

Bug ID: 114125
   Summary: Support vcond_mask_qiqi and friends.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

Quote from https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646587.html

> On Linux/x86_64,
>
> af66ad89e8169f44db723813662917cf4cbb78fc is the first bad commit
> commit af66ad89e8169f44db723813662917cf4cbb78fc
> Author: Richard Biener 
> Date:   Fri Feb 23 16:06:05 2024 +0100
>
> middle-end/114070 - folding breaking VEC_COND expansion
>
> caused
>
> FAIL: gcc.dg/tree-ssa/andnot-2.c scan-tree-dump-not forwprop3 "_expr"

This shows that the x86 backend is missing vcond_mask_qiqi and friends
(for AVX512 mask modes).  Either that or both expand_vec_cond_expr_p
and all the machinery behind it (ISEL pass, lowering) should handle
pure integer mode VEC_COND_EXPR via bit operations.  I think quite some
targets now implement patterns for these variants, whatever their
boolean vector modes are.

One complication with the change, which was

  (simplify
   (op @3 (vec_cond:s @0 @1 @2))
-  (vec_cond @0 (op! @3 @1) (op! @3 @2
+  (if (TREE_CODE_CLASS (op) != tcc_comparison
+   || types_match (type, TREE_TYPE (@1))
+   || expand_vec_cond_expr_p (type, TREE_TYPE (@0), ERROR_MARK))
+   (vec_cond @0 (op! @3 @1) (op! @3 @2)

is that expand_vec_cond_expr_p can also handle comparison defined
masks, but whether or not we have this isn't visible here so we
can only check whether vcond_mask expansion would work.

We have optimize_vectors_before_lowering_p but we shouldn't even there
turn supported into not supported ops and as said, what's supported or
not cannot be finally decided (if it's only vcond and not vcond_mask
that is supported).  Also optimize_vectors_before_lowering_p is set
for a short time between vectorization and vector lowering and we
definitely do not want to turn supported vectorizer emitted stmts
into ones that we need to lower.  For GCC 15 we should see to move
vector lowering before vectorization (before loop optimization I'd
say) to close this particula hole (and also reliably ICE when the
vectorizer creates unsupported IL).  We also definitely want to
retire vcond expanders (no target I know of supports single-instruction
compare-and-select).

So short term we either live with this regression (the testcase
verifies we perform constant folding to { 0, 0 }), implement
the four missing patterns (qi, hi, si and di missing value mode
vcond_mask patterns) or see to implement generic code for this.

Given precedent I'd tend towards adding the x86 patterns.

Hongtao, can you handle that?

[Bug target/114125] Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125

Hongtao Liu  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-02-27
 Target||x86_64-*-* i?86-*-*

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #14 from Hongtao Liu  ---
(In reply to rguent...@suse.de from comment #13)
> On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > 
> > --- Comment #11 from Hongtao Liu  ---
> > 
> > >Loop body is likely going to simplify further, this is difficult
> > >to guess, we just decrease the result by 1/3.  */
> > > 
> > 
> > This is introduced by r0-68074-g91a01f21abfe19
> > 
> > /* Estimate number of insns of completely unrolled loop.  We assume
> > +   that the size of the unrolled loop is decreased in the
> > +   following way (the numbers of insns are based on what
> > +   estimate_num_insns returns for appropriate statements):
> > +
> > +   1) exit condition gets removed (2 insns)
> > +   2) increment of the control variable gets removed (2 insns)
> > +   3) All remaining statements are likely to get simplified
> > +  due to constant propagation.  Hard to estimate; just
> > +  as a heuristics we decrease the rest by 1/3.
> > +
> > +   NINSNS is the number of insns in the loop before unrolling.
> > +   NUNROLL is the number of times the loop is unrolled.  */
> > +
> > +static unsigned HOST_WIDE_INT
> > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +unsigned HOST_WIDE_INT nunroll)
> > +{
> > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  if (unr_insns <= 0)
> > +unr_insns = 1;
> > +  unr_insns *= (nunroll + 1);
> > +
> > +  return unr_insns;
> > +}
> > 
> > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > keeped.
> > 
> > +/* Estimate number of insns of completely unrolled loop.
> > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > +   the fact that in last copy everything after exit conditional
> > +   is dead and that some instructions will be eliminated after
> > +   peeling.
> > 
> > -   NINSNS is the number of insns in the loop before unrolling.
> > -   NUNROLL is the number of times the loop is unrolled.  */
> > +   Loop body is likely going to simplify futher, this is difficult
> > +   to guess, we just decrease the result by 1/3.  */
> > 
> >  static unsigned HOST_WIDE_INT
> > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +estimated_unrolled_size (struct loop_size *size,
> >  unsigned HOST_WIDE_INT nunroll)
> >  {
> > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > +* (HOST_WIDE_INT) (size->overall
> > +   -
> > size->eliminated_by_peeling));
> > +  if (!nunroll)
> > +unr_insns = 0;
> > +  unr_insns += size->last_iteration -
> > size->last_iteration_eliminated_by_peeling;
> > +
> > +  unr_insns = unr_insns * 2 / 3;
> >if (unr_insns <= 0)
> >  unr_insns = 1;
> > -  unr_insns *= (nunroll + 1);
> > 
> > It looks to me 1 / 3 overestimates the instructions that can be optimised 
> > away,
> > especially if we've subtracted eliminated_by_peeling
> 
> Yes, that 1/3 reduction is a bit odd - you could have the same effect
> by increasing the instruction limit by 1/3, but that means it doesn't
> really matter, does it?  It would be interesting to see if increasing
> the limit by 1/3 and removing the above is neutral on SPEC?

Removing the 1/3 reduction gets a ~2% improvement for 525.x264_r on SPR with
-march=native -O3, with no big impact on the other integer benchmarks.

The regression comes from the function below: cunrolli unrolls the inner loop,
cunroll then unrolls the outer loop, and that causes lots of spills.

typedef unsigned long long uint64_t;
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
uint64_t x264_pixel_var_8x8(uint8_t *pix, int i_stride )
{
uint32_t sum = 0, sqr = 0;
for( int y = 0; y < 8; y++ )
{
for( int x = 0; x < 8; x++ ) 
{
sum += pix[x]; 
sqr += pix[x] * pix[x]; 
}  
pix += i_stride;   
}   
return sum + ((uint64_t)sqr << 32);
}

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #16 from Hongtao Liu  ---

> I'm all for removing the 1/3 for innermost loop handling (in cunroll
> the unrolled loop is then innermost).  I'm more concerned about
> unrolling more than one level which is exactly what's required for
> 454.calculix.

Removing 1/3 for the innermost loop would be sufficient to solve both the issue
in the PR and x264_pixel_var_8x8 from 525.x264_r. I'll try to benchmark that.

[Bug tree-optimization/114164] simdclone vectorization creates unsupported IL

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114164

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > I'm not sure who's responsible to reject this, whether the vectorizer can
> > expect there's a way to create the mask arguments when the simdclone is
> > marked usable by the target or whether it has to verify that itself.
> > 
> > This becomes an ICE if we move vector lowering before vectorization.
> 
> Wasn't this valid when VEC_COND_EXPR allowed the comparison directly in the
> operand?
> Or maybe I misremember.  Certainly I believe -mavx -mno-avx2 should be able
> to do
> 256-bit conditional moves of float/double elements.

Here, the mask is v4si, which is 128-bit, and the vector is v4df, which is
256-bit. Without AVX512, the x86 backend only supports vcond/vcond_mask with
equal sizes (vcond{,_mask}v4sfv4si or vcond{,_mask}v4dfv4di), but not
vcond{,_mask}v4dfv4si.

BTW, we can get a v4di mask from a v4si mask by

vshufps xmm1, xmm0, xmm0, 80    # xmm1 = xmm0[0,0,1,1]
vshufps xmm0, xmm0, xmm0, 250   # xmm0 = xmm0[2,2,3,3]
vinsertf128 ymm0, ymm1, xmm0, 1

under AVX; under AVX2 we can just use vpmovsxdq.
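
For illustration, an intrinsics sketch of the AVX2 variant (the helper name is
made up; it just shows the single-instruction widening):

#include <immintrin.h>

/* Widen a 4 x 32-bit element mask to a 4 x 64-bit element mask with one
   sign extension (vpmovsxdq); under plain AVX the vshufps + vinsertf128
   sequence above would be needed instead.  */
static inline __m256i
widen_mask_v4si_to_v4di (__m128i mask32)
{
  return _mm256_cvtepi32_epi64 (mask32);
}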

[Bug d/114171] [13/14 Regression] gdc -O2 -mavx generates misaligned vmovdqa instruction

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114171

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
   Last reconfirmed||2024-3-1

--- Comment #2 from Hongtao Liu  ---
At the RTL level, we get

(insn 7 6 8 2 (set (reg:CCZ 17 flags)
(compare:CCZ (mem:TI (plus:DI (reg/f:DI 100 [ _5 ])
(const_int 24 [0x18])) [0 MEM[(ucent *)_5 + 24B]+0 S16
A128])
(const_int 0 [0]))) "test.d":15:16 30 {*cmpti_doubleword}
 (nil))

It's 16-byte aligned.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #16 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #11)
> (In reply to Richard Biener from comment #10)
> > The easiest fix would be to refuse applying STV to a insn that
> > can_throw_internal () (that's an insn that has associated EH info).  
> > Updating
> > in this case would require splitting the BB or at least moving the now
> > no longer throwing insn to the next block (along the fallthru edge).
> 
> This would be simply:
> 
> --cut here--
> diff --git a/gcc/config/i386/i386-features.cc
> b/gcc/config/i386/i386-features.cc
> index 1de2a07ed75..90acb33db49 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -437,6 +437,10 @@ scalar_chain::add_insn (bitmap candidates, unsigned int
> insn_uid,
>&& !HARD_REGISTER_P (SET_DEST (def_set)))
>  bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  if (cfun->can_throw_non_call_exceptions
> +  && can_throw_internal (insn))
> +return false;
> +
>/* ???  The following is quadratic since analyze_register_chain
>   iterates over all refs to look for dual-mode regs.  Instead this
>   should be done separately for all regs mentioned in the chain once.  */
> --cut here--
> 
> But I think, we could do better. Adding CC.

It looks like a similar issue to the one we solved in PR89650 with
r9-6543-g12fb7712a8a20f, where we manually split the block after the insn.

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #12 from Hongtao Liu  ---
(In reply to Sam James from comment #11)
> Calling it a 11..14 regression as we know 14 is bad and 7.5 is OK, but I
> can't test 11/12 on an avx512 machine right now.

I can't reproduce it with GCC 11/12, but I can with GCC 13 for the case in
PR114276.

It looks like the codegen is already wrong in .expand; the offending part is
mentioned in comment #0:

> Now, if `__asan_option_detect_stack_use_after_return` is 0, the variable at
> %rcx-128 is correctly aligned to 64. However, if it is 1,
> __asan_stack_malloc_1 returns something aligned to 64 << 1 (as per
> https://github.com/gcc-mirror/gcc/blob/master/gcc/asan.cc#L1917) and adding
> 160 results in %rcx-128 being only aligned to 32. And thus the segfault.


;; Function foo (_Z3foov, funcdef_no=14, decl_uid=3962, cgraph_uid=10,
symbol_order=9)

(note 1 0 37 NOTE_INSN_DELETED)
;; basic block 2, loop depth 0, maybe hot
;;  prev block 0, next block 3, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   ENTRY (FALLTHRU)
(note 37 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 37 3 2 (parallel [
(set (reg:DI 105)
(plus:DI (reg/f:DI 19 frame)
(const_int -160 [0xff60])))
(clobber (reg:CC 17 flags))
]) "test1.cc":7:12 247 {*adddi_1}
 (nil))
(insn 3 2 4 2 (set (reg:DI 106)
(reg:DI 105)) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
(insn 4 3 5 2 (set (reg:CCZ 17 flags)
(compare:CCZ (mem/c:SI (symbol_ref:DI
("__asan_option_detect_stack_use_after_return") [flags 0x40]  ) [4
__asan_option_detect_stack_use_after_return+0 S4 A32])
(const_int 0 [0]))) "test1.cc":7:12 7 {*cmpsi_ccno_1}
 (nil))
(jump_insn 5 4 93 2 (set (pc)
(if_then_else (eq (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref 11)
(pc))) "test1.cc":7:12 995 {*jcc}
 (nil)
 -> 11)
;;  succ:   5
;;  3 (FALLTHRU)

;; basic block 3, loop depth 0, maybe hot
;;  prev block 2, next block 4, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   2 (FALLTHRU)
(note 93 5 6 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 6 93 7 3 (set (reg:DI 5 di)
(const_int 128 [0x80])) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
(call_insn 7 6 8 3 (set (reg:DI 0 ax)
(call (mem:QI (symbol_ref:DI ("__asan_stack_malloc_1") [flags 0x41] 
) [0  S1 A8])
(const_int 0 [0]))) "test1.cc":7:12 1013 {*call_value}
 (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
(nil))
(expr_list (use (reg:DI 5 di))
(nil)))
(insn 8 7 9 3 (set (reg:CCZ 17 flags)
(compare:CCZ (reg:DI 0 ax)
(const_int 0 [0]))) "test1.cc":7:12 8 {*cmpdi_ccno_1}
 (nil))
(jump_insn 9 8 94 3 (set (pc)
(if_then_else (eq (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref 11)
(pc))) "test1.cc":7:12 995 {*jcc}
 (nil)
 -> 11)
;;  succ:   5
;;  4 (FALLTHRU)
;; basic block 4, loop depth 0, maybe hot
;;  prev block 3, next block 5, flags: (NEW, REACHABLE, RTL, MODIFIED)
;;  pred:   3 (FALLTHRU)
(note 94 9 10 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 10 94 11 4 (set (reg:DI 105)
(reg:DI 0 ax)) "test1.cc":7:12 82 {*movdi_internal}
 (nil))
;;  succ:   5 (FALLTHRU)

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #13 from Hongtao Liu  ---
So the stack looks like:

--- stack top

-32

- (offset -32)

-64 (32 bytes redzone)

- (offset -64)

-128 (64 bytes __m512)

 (offset -128)

 (32-bytes redzone)

---(offset -160)   <--- __asan_stack_malloc_128 tries to allocate a buffer here


  /* Emit the prologue sequence.  */
  if (asan_frame_size > 32 && asan_frame_size <= 65536 && pbase
  && param_asan_use_after_return)
{
  use_after_return_class = floor_log2 (asan_frame_size - 1) - 5;
  /* __asan_stack_malloc_N guarantees alignment
 N < 6 ? (64 << N) : 4096 bytes.  */
  if (alignb > (use_after_return_class < 6
? (64U << use_after_return_class) : 4096U))
use_after_return_class = -1;
  else if (alignb > ASAN_RED_ZONE_SIZE && (asan_frame_size & (alignb - 1)))
base_align_bias = ((asan_frame_size + alignb - 1)
   & ~(alignb - HOST_WIDE_INT_1)) - asan_frame_size;
}

  /* Align base if target is STRICT_ALIGNMENT.  */
  if (STRICT_ALIGNMENT)
{
  const HOST_WIDE_INT align
= (GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT) << ASAN_SHADOW_SHIFT;
  base = expand_binop (Pmode, and_optab, base, gen_int_mode (-align,
Pmode),
   NULL_RTX, 1, OPTAB_DIRECT);
}

  if (use_after_return_class == -1 && pbase)
emit_move_insn (pbase, base);

  base = expand_binop (Pmode, add_optab, base,
   gen_int_mode (base_offset - base_align_bias, Pmode),
   NULL_RTX, 1, OPTAB_DIRECT); -- suspicious add

  orig_base = NULL_RTX;
  if (use_after_return_class != -1)
{
  ...
  ret = emit_library_call_value (ret, NULL_RTX, LCT_NORMAL, ptr_mode,
 GEN_INT (asan_frame_size
  + base_align_bias),
 TYPE_MODE (pointer_sized_int_node));
  /* __asan_stack_malloc_[n] returns a pointer to fake stack if succeeded
 and NULL otherwise.  Check RET value is NULL here and jump over the
 BASE reassignment in this case.  Otherwise, reassign BASE to RET.  */
  emit_cmp_and_jump_insns (ret, const0_rtx, EQ, NULL_RTX,
   VOIDmode, 0, lab,
   profile_probability:: very_unlikely ());
  ret = convert_memory_address (Pmode, ret);
  emit_move_insn (base, ret);
  emit_label (lab);
  emit_move_insn (pbase, expand_binop (Pmode, add_optab, base,
   gen_int_mode (base_align_bias
 - base_offset, Pmode),
   NULL_RTX, 1, OPTAB_DIRECT));


base_align_bias is calculated to make (asan_frame_size (128) +
base_align_bias (0)) a multiple of alignb (64), but it does not make
`base_offset (160) - base_align_bias (0)` a multiple of 64. So when
__asan_stack_malloc_128 returns an address aligned to 64 and we then add
(base_offset (160) - base_align_bias (0)) to it, the result is misaligned.
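
Plugging in the numbers: __asan_stack_malloc_1 returns some R aligned to at
least 64; base becomes R + (base_offset - base_align_bias) = R + 160, and since
160 % 64 == 32, base (and the 64-byte variable placed 128 bytes below it) ends
up only 32-byte aligned.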

[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291

2024-03-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #16 from Hongtao Liu  ---
(In reply to Thomas Neumann from comment #15)
> Created attachment 57679 [details]
> fixed patch
> 
> Can you please try the updated patch? I had accidentally dropped an if
> nesting level when trying to adhere to the gcc style, sorry for that.

I'm trying to validate your patch, but it could take some time to set up the
environment.

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #14 from Hongtao Liu  ---
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 0de299c62e3..92062378d8e 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class
stack_vars_data *data)
{
  if (data->asan_vec.is_empty ())
{
- align_frame_offset (ASAN_RED_ZONE_SIZE);
+ align_frame_offset (MAX (alignb, ASAN_RED_ZONE_SIZE));
  prev_offset = frame_offset.to_constant ();
}
  prev_offset = align_base (prev_offset,


This fixes the issue, but I'm not sure it's the correct way.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from Hongtao Liu  ---
I forgot to add the PR to my commit; it's solved by r14-9459-g618e34d56cc38e
and backported to r13-8438-gbdbcfbfcf59138 and r12-10214-ga861f940efffae.

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #15 from Hongtao Liu  ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647604.html

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-03-15
 CC||liuhongt at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Hongtao Liu  ---
Mine

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862

--- Comment #5 from Hongtao Liu  ---
> Now, it seems AVX512BW (and AVX512VL in some cases) has the needed
> instructions,
> in particular VMOVDQU{8,16}, but it is not reflected in maskload and
> maskstore expanders.  CCing Kyrill and Uros on this.

With -mavx512bw and -mavx512vl, the loop has been vectorized since GCC 8.1.

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Hongtao Liu  ---
Fixed in GCC14.

[Bug middle-end/114347] wrong constant folding when casting __bf16 to int

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114347

--- Comment #9 from Hongtao Liu  ---
(In reply to Richard Biener from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > You can use -fexcess-precision=16 if you don't want treating _Float16 and
> > __bf16 as having excess precision.  With excess precision, I think the above
> > behavior is correct.
> > You'd need (int) (__bf16) 257.0bf16 to get 256 even with excess precision.
> 
> Ah, -fexcess-precision=16 doesn't seem to be documented though (how does
> this influence long double handling then?)

Oh, I forgot to add that to invoke.texi.

-fexcess-precision=16 doesn't affect types with precision greater than 16, and
it's not compatible with -mfpmath=387.

[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #6 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #5)
> /app/example.cpp:5:20: note:   vect_is_simple_use: operand # RANGE [irange]
> short unsigned int [0, 2047][3294, 3294][6589, 6589][13179, 13179][26359,
> 26359][52719, 52719]
> val_16 = PHI , type of def: induction
> 
> We detect it now.
> But then it is still not vectorized ...

We don't know how to peel for a variable niter; there could be undefined
behavior if we peel it like val(epilogue) = val >> ((max / vf) * vf).

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #15 from Hongtao Liu  ---
(In reply to Richard Biener from comment #9)
> (In reply to Robin Dapp from comment #8)
> > No fallout on x86 or aarch64.
> > 
> > Of course using false instead of TYPE_SIGN (utype) is also possible and
> > maybe clearer?
> 
> Well, wi::from_mpz doesn't take a sign argument.  It's comment says
> 
> /* Returns X converted to TYPE.  If WRAP is true, then out-of-range
>values of VAL will be wrapped; otherwise, they will be set to the
>appropriate minimum or maximum TYPE bound.  */
> wide_int
> wi::from_mpz (const_tree type, mpz_t x, bool wrap)
> 
> I'm not sure if we really want saturating behavior here, so 'true' is
> more correct?  Note if we want an unsigned result we should pass utype here,
> that might be the bug?  So
> 
> begin = wi::from_mpz (utype, res, true);
> 
> ?
Yes, it should be.
> 
> The to_mpz args look like they could be mixing signs as well:
> 
> case vect_step_op_mul:
>   {
> tree utype = unsigned_type_for (type);
> init_expr = gimple_convert (stmts, utype, init_expr);
> wide_int skipn = wi::to_wide (skip_niters);
> wide_int begin = wi::to_wide (step_expr);
> auto_mpz base, exp, mod, res;
> wi::to_mpz (begin, base, TYPE_SIGN (type));
> 
> TYPE_SIGN (step_expr)?
step_expr should have the same type as init_expr.
> 
> wi::to_mpz (skipn, exp, UNSIGNED);
> 
> TYPE_SIGN (skip_niters) (which should be UNSIGNED I guess)?
skipn must be a positive value, so I assume UNSIGNED/SIGNED doesn't make any
difference here.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #16 from Hongtao Liu  ---
Mine.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #17 from Hongtao Liu  ---
> > 
> > The to_mpz args look like they could be mixing signs as well:
> > 
I tried the testcase below; it looks like mixing the signs works well.
Debugging shows step_expr is -5 and signed.

short a = 0xF;
short b[16];
unsigned short ua = 0xF;
unsigned short ub[16];

int main() {
  for (int e = 0; e < 9; e += 1)
b[e] = a *= 0x5;
  __builtin_printf("decimal: %d\n", a);
  __builtin_printf("hex: %X\n", a);

  for (int e = 0; e < 9; e += 1)
b[e] = a *= -5;
  __builtin_printf("decimal: %d\n", a);
  __builtin_printf("hex: %X\n", a);

  for (int e = 0; e < 9; e += 1)
ub[e] = ua *= 0x5;
  __builtin_printf("decimal: %d\n", ua);
  __builtin_printf("hex: %X\n", ua);

  for (int e = 0; e < 9; e += 1)
ub[e] = ua *= -5;
  __builtin_printf("decimal: %d\n", ua);
  __builtin_printf("hex: %X\n", ua);

}

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu  ---
Another simple case is 

typedef int v4si __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));

v8hi a;
v4si b;
void
foo ()
{
   b = __extension__(v4si){0, 0, 0, 0};
   a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
}

GCC generates 2 pxor

foo():
vpxor   xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR b[rip], xmm0
vpxor   xmm0, xmm0, xmm0
vmovdqa XMMWORD PTR a[rip], xmm0
ret

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #9 from Hongtao Liu  ---

> If we were to expose that vpxor before postreload we'd likely CSE but
> we have
> 
> 5: xmm0:V4SI=const_vector
>   REG_EQUIV const_vector
> 6: [`b']=xmm0:V4SI
> 7: xmm0:V8HI=const_vector
>   REG_EQUIV const_vector
> 8: [`a']=xmm0:V8HI
> 
> until the very end.  But since we have the same mode size on the xmm0
> sets CSE could easily handle (integral) constants by hashing/comparing
> on their byte representation rather than by using the RTX structure.
> OTOH as we mostly have special constants allowed in the IL like this
> treating all-zeros and all-ones specially might be good enough ...

We only handle scalar code; I guess we could do something similar, maybe
1. iterate over vector modes with the same vector length?
2. iterate over vector modes with the same component mode but a bigger vector
length?

But that would still miss the v8hi/v8si pxor. Another alternative is to
canonicalize the const_vector with a scalar mode, i.e. v4si -> TI, v8si -> OI,
v16si -> XI; then we can just query with TI/OI/XImode?


  /* See if we have a CONST_INT that is already in a register in a
     wider mode.  */

  if (src_const && src_related == 0 && CONST_INT_P (src_const)
      && is_int_mode (mode, &int_mode)
      && GET_MODE_PRECISION (int_mode) < BITS_PER_WORD)
    {
      opt_scalar_int_mode wider_mode_iter;
      FOR_EACH_WIDER_MODE (wider_mode_iter, int_mode)
        {
          scalar_int_mode wider_mode = wider_mode_iter.require ();
          if (GET_MODE_PRECISION (wider_mode) > BITS_PER_WORD)
            break;

          struct table_elt *const_elt
            = lookup (src_const, HASH (src_const, wider_mode), wider_mode);

          if (const_elt == 0)
            continue;

          for (const_elt = const_elt->first_same_value;
               const_elt; const_elt = const_elt->next_same_value)
            if (REG_P (const_elt->exp))
              {
                src_related = gen_lowpart (int_mode, const_elt->exp);
                break;
              }

          if (src_related != 0)
            break;
        }
    }

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #20 from Hongtao Liu  ---
(In reply to JuzheZhong from comment #19)
> I think it's better to add pr114396.c into vect testsuite instead of x86
> target test since it's the bug not only happens on x86.

Sure, there are no target-specific intrinsics in the testcase; I'll move it to
the vect testsuite.

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from Hongtao Liu  ---
Fixed in GCC14 and GCC13.3.

[Bug target/114427] New: [x86] vec_pack_truncv8si/v4si can be optimized with pblendw instead of pand for AVX2 target

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114427

Bug ID: 114427
   Summary: [x86] vec_pack_truncv8si/v4si can be optimized with
pblendw instead of pand for AVX2 target
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

void
foo (int* a, short* __restrict b, int* c)
{
for (int i = 0; i != 8; i++)
  b[i] = c[i] + a[i];
}

gcc -O2 -march=x86-64-v3 -S

mov eax, 65535
vmovd   xmm0, eax
vpbroadcastd    xmm0, xmm0
vpand   xmm2, xmm0, XMMWORD PTR [rdi+16]
vpand   xmm1, xmm0, XMMWORD PTR [rdi]
vpackusdw   xmm1, xmm1, xmm2
vpand   xmm2, xmm0, XMMWORD PTR [rdx]
vpand   xmm0, xmm0, XMMWORD PTR [rdx+16]
vpackusdw   xmm0, xmm2, xmm0
vpaddw  xmm0, xmm1, xmm0
vmovdqu XMMWORD PTR [rsi], xmm0


It can be better with the sequence below:

vpxor       %xmm0, %xmm0, %xmm0
vpblendw    $85, 16(%rdi), %xmm0, %xmm2
vpblendw    $85, (%rdi), %xmm0, %xmm1
vpackusdw   %xmm2, %xmm1, %xmm1
vpblendw    $85, (%rdx), %xmm0, %xmm2
vpblendw    $85, 16(%rdx), %xmm0, %xmm0
vpackusdw   %xmm0, %xmm2, %xmm0
vpaddw      %xmm0, %xmm1, %xmm0
vmovdqu     %xmm0, (%rsi)

Currently, we're using (const_vector:v4si (const_int 0xffff) x4) as the mask to
clear the upper 16 bits, but pblendw with a zero vector can also be used, and a
zero vector is much cheaper to materialize than (const_vector:v4si (const_int
0xffff) x4):

mov eax, 65535
vmovd   xmm0, eax
vpbroadcastdxmm0, xmm0

pblendw has the same latency as pand, but could be a little bit worse from a
throughput point of view (0.33 -> 0.5 on ADL P-cores, the same on Zen4).
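
For reference, a small intrinsics sketch of the two equivalent ways to clear
the upper halves (illustrative helpers, not what the vectorizer emits):

#include <immintrin.h>

/* Both helpers zero the upper 16 bits of every 32-bit element; the pblendw
   variant only needs a zero register instead of the broadcast 0xffff mask.
   _mm_blend_epi16 requires SSE4.1.  */
static inline __m128i
clear_high16_pand (__m128i x)
{
  return _mm_and_si128 (x, _mm_set1_epi32 (0xffff));
}

static inline __m128i
clear_high16_pblendw (__m128i x)
{
  /* Immediate 0xAA selects the zero vector for the odd (high) 16-bit words.  */
  return _mm_blend_epi16 (x, _mm_setzero_si128 (), 0xAA);
}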

[Bug target/114428] New: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428

Bug ID: 114428
   Summary: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector
(0xffff x4) can be optimized to psrld
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef unsigned short uint16_t;
typedef short int16_t;

#define QUANT_ONE( coef, mf, f )\
{ \
if( (coef) > 0 ) \
(coef) = (f + (coef)) * (mf) >> 16; \
else \
(coef) = - ((f - (coef)) * (mf) >> 16); \
nz |= (coef); \
}

int quant_4x4( int16_t dct[16], uint16_t mf[16], uint16_t bias[16] )
{
int nz = 0;
for( int i = 0; i < 16; i++ )
QUANT_ONE( dct[i], mf[i], bias[i] );
return !!nz;
}


gcc -O2 -march=x86-64-v3 -S

mov edx, 65535
vmovd   xmm4, edx
vpbroadcastd    ymm4, xmm4
...
vpsrad  ymm2, ymm2, 16
vpsrad  ymm6, ymm6, 16
vpsrad  ymm0, ymm0, 16
vpand   ymm2, ymm4, ymm2
vpsrad  ymm1, ymm1, 16
vpand   ymm6, ymm4, ymm6
vpand   ymm0, ymm4, ymm0
vpand   ymm4, ymm4, ymm1
vpackusdw   ymm2, ymm2, ymm6
vpackusdw   ymm0, ymm0, ymm4
vpermq  ymm2, ymm2, 216
vpermq  ymm0, ymm0, 216
...

It can be optimized to the sequence below:

vpsrld  ymm2, ymm2, 16
vpsrld  ymm6, ymm6, 16
vpsrld  ymm0, ymm0, 16
vpsrld  ymm1, ymm1, 16
vpackusdw   ymm2, ymm2, ymm6
vpackusdw   ymm0, ymm0, ymm4
vpermq  ymm2, ymm2, 216
vpermq  ymm0, ymm0, 216

The optimization opportunity is exposed after vec_pack_trunc_expr is expanded
to vpand + vpackusdw.
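
A quick scalar check of the underlying equivalence (it relies on GCC's
arithmetic right shift for signed values, which ISO C leaves
implementation-defined):

#include <assert.h>
#include <limits.h>

int
main (void)
{
  /* (x >> 16) & 0xffff keeps exactly bits 16..31 of x, which is what a
     logical shift right by 16 produces, so psrad + pand can become psrld.  */
  for (long long v = INT_MIN; v <= INT_MAX; v += 99991)
    {
      int x = (int) v;
      assert (((x >> 16) & 0xffff) == (int) ((unsigned) x >> 16));
    }
  return 0;
}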

[Bug target/114429] New: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Bug ID: 114429
   Summary: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef unsigned char uint8_t;
uint8_t x264_clip_uint8( int x )
{
return x&(~255) ? (-x)>>31 : x;
}

void
foo (int* a, int* __restrict b, int n)
{
for (int i = 0; i != 8; i++)
  b[i] = x264_clip_uint8 (a[i]);
}

gcc -O2 -march=x86-64-v3 -S


foo(int*, int*, int):
..
mov eax, 255
vpxor   xmm0, xmm0, xmm0
vmovd   xmm1, eax
vpbroadcastd    ymm1, xmm1
vmovdqu ymm2, YMMWORD PTR [rdi]
vpminud ymm3, ymm2, ymm1
vpsubd  ymm0, ymm0, ymm2
vmovdqa YMMWORD PTR [rsp-32], ymm3
vpsrad  ymm0, ymm0, 31
vpcmpeqd    ymm3, ymm2, YMMWORD PTR [rsp-32]
vpblendvb   ymm0, ymm0, ymm2, ymm3
vpand   ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rsi], ymm1


It can be better with

mov eax, 255
vmovd   xmm1, eax
vpxor       xmm0, xmm0, xmm0
vpbroadcastd    ymm1, xmm1
vmovdqu ymm2, YMMWORD PTR [rdi]
vpminud ymm3, ymm2, ymm1
vmovdqa YMMWORD PTR [rsp-32], ymm3
vcmpgtps  ymm0, ymm2, ymm0
vpcmpeqd    ymm3, ymm2, YMMWORD PTR [rsp-32]
vpblendvb   ymm0, ymm0, ymm2, ymm3
vpand   ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rsi], ymm1

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Hongtao Liu  changed:

   What|Removed |Added

 Target||x86_64-*-* i?86-*-*

--- Comment #1 from Hongtao Liu  ---
When x is INT_MIN, I assume -x is UD, so the compiler can do anything;
otherwise, (-x) >> 31 is just x > 0.
From the RTL point of view, neg of INT_MIN is assumed to be 0 after it's
truncated:
(neg:m x)
(ss_neg:m x)
(us_neg:m x)
These two expressions represent the negation (subtraction from zero) of the
value represented by x, carried out in mode m. They differ in the behavior on
overflow of integer modes. In the case of neg, the negation of the operand may
be a number not representable in mode m, in which case it is truncated to m.
ss_neg and us_neg ensure that an out-of-bounds result saturates to the maximum
or minimum signed or unsigned value.

So we can optimize (neg a) >> 31 to a > 0.
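
A quick scalar check of that claim (assuming GCC's arithmetic right shift on
signed values; INT_MIN itself is excluded):

#include <assert.h>
#include <limits.h>

int
main (void)
{
  /* For every x except INT_MIN, (-x) >> 31 is -1 exactly when x > 0,
     i.e. the all-ones mask a greater-than-zero compare would produce.  */
  for (long long v = (long long) INT_MIN + 1; v <= INT_MAX; v += 65537)
    {
      int x = (int) v;
      assert (((-x) >> 31) == -(x > 0));
    }
  return 0;
}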

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

--- Comment #2 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #1)
> when x is INT_MIN, I assume -x is UD, so compiler can do anything.
> otherwise, (-x) >> 31 is just x > 0.
> From rtl view. neg of INT_MIN is assumed to 0 after it's truncated.

Wait, is -INT_MIN truncated to INT_MIN? If that's the case, we can't do the
optimization at the RTL level.

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #3 from Hongtao Liu  ---
Then invalid.

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #5 from Hongtao Liu  ---
Maybe we should always use a kmask under AVX512. Currently only vectors of
_Float16 that are >= 128 bits use a kmask; vectors smaller than 128 bits still
use a vector mask.
  /* Scalar mask case.  */
  if ((TARGET_AVX512F && TARGET_EVEX512 && vector_size == 64)
      || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16))
      /* AVX512FP16 only supports vector comparison
         to kmask for _Float16.  */
      || (TARGET_AVX512VL && TARGET_AVX512FP16
          && GET_MODE_INNER (data_mode) == E_HFmode))
    {
      if (elem_size == 4
          || elem_size == 8
          || (TARGET_AVX512BW && (elem_size == 1 || elem_size == 2)))
        return smallest_int_mode_for_size (nunits);
    }

  scalar_int_mode elem_mode
    = smallest_int_mode_for_size (elem_size * BITS_PER_UNIT);

  gcc_assert (elem_size * nunits == vector_size);

  return mode_for_vector (elem_mode, nunits);

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471

--- Comment #6 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #5)
> Maybe we should always use kmask under AVX512, currently only >= 128-bits
> vector of vector _Float16 use kmask, < 128 bits vector still use vector mask.
> 
And we need to support vec_cmp/vcond_mask for 64/32/16-bit vectors.
For the testcase, there's no kmask used at all; I wonder why x86-64-v3 doesn't
issue an error.

[Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Bug ID: 114514
   Summary: v16qi >> 7 can be optimized with vpcmpgtb
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

v16qi
foo2 (v16qi a, v16qi b)
{
return a >> 7;
}

It can be optimized with

vpxor       xmm1, xmm1, xmm1
vpcmpgtb    xmm0, xmm1, xmm0
ret

Currently we generate (emulated with v16hi):

movl        $16843009, %eax
vpsraw      $7, %xmm0, %xmm0
vmovd       %eax, %xmm1
vpbroadcastd    %xmm1, %xmm1
vpandn      %xmm1, %xmm0, %xmm0
vpsubb      %xmm1, %xmm0, %xmm0
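
A scalar check of the per-element equivalence (assuming GCC's arithmetic right
shift for signed values):

#include <assert.h>

int
main (void)
{
  /* For an 8-bit signed value, an arithmetic shift right by 7 gives -1 for
     negative inputs and 0 otherwise, which is exactly what pcmpgtb with a
     zero first operand computes per element.  */
  for (int i = -128; i < 128; i++)
    {
      signed char c = (signed char) i;
      assert ((signed char) (c >> 7) == (0 > c ? -1 : 0));
    }
  return 0;
}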

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #3 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
> 
> Note non sign bit can be improved too:
> ```
I assume you're talking about broadcasting from an immediate versus loading
directly from the constant pool. GCC chooses the former; with -Os we can also
generate the latter. According to a microbenchmark, the former is better. I
also tried disabling the broadcast from an immediate and tested with stress-ng
vecmath; the performance is similar.

[Bug target/114544] New: [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

Bug ID: 114544
   Summary: [x86] stv should transform (subreg DI (V1TI) 8) as
(vec_select:DI (V2DI) (const_int 1))
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef __uint128_t v128_t __attribute__((vector_size(16)));

v128_t c;


v128_t
foo1 (v128_t *a, v128_t *b)
{
c =  (*a >> 1 & *b) / (__extension__(v128_t){(__int128_t)0x3 << 120
| (__int128_t)0x3 << 112
| (__int128_t)0x3 << 104
| (__int128_t)0x3 << 96
| (__int128_t)0x3 << 88
| (__int128_t)0x3 << 80
| (__int128_t)0x3 << 72
| (__int128_t)0x3 << 64
| (__int128_t)0x3 << 56
| (__int128_t)0x3 << 48
| (__int128_t)0x3 << 40
| (__int128_t)0x3 << 32
| (__int128_t)0x3 << 24
| (__int128_t)0x3 << 16
| (__int128_t)0x3 << 8
| (__int128_t)0x3 << 0});
}


stv generates

(insn 32 11 35 2 (set (reg:DI 124 [ _4 ])
(subreg:DI (reg:V1TI 111 [ _4 ]) 0)) "/app/example.c":28:25 84
{*movdi_internal}
 (nil))
(insn 35 32 12 2 (set (reg:DI 127 [+8 ])
(subreg:DI (reg:V1TI 111 [ _4 ]) 8)) "/app/example.c":28:25 84
{*movdi_internal}
 (expr_list:REG_DEAD (reg:V1TI 111 [ _4 ])

(subreg:DI (reg:V1TI 111 [ _4 ]) 8) makes reload spill.


foo1:
        movabsq $217020518514230019, %rdx  # 57  [c=1 l=10]  *movdi_internal/4
        subq    $24, %rsp                  # 59  [c=4 l=4]  pro_epilogue_adjust_stack_add_di/0
        vmovdqa (%rdi), %xmm0              # 8   [c=9 l=4]  movv1ti_internal/3
        movq    %rdx, %rcx                 # 58  [c=4 l=3]  *movdi_internal/3
        vpsrldq $8, %xmm0, %xmm1           # 42  [c=4 l=5]  sse2_lshrv1ti3/1
        vpsrlq  $1, %xmm0, %xmm0           # 45  [c=4 l=5]  lshrv2di3/1
        vpsllq  $63, %xmm1, %xmm1          # 46  [c=4 l=5]  ashlv2di3/1
        vpor    %xmm1, %xmm0, %xmm0        # 47  [c=4 l=4]  *iorv2di3/1
        vpand   (%rsi), %xmm0, %xmm2       # 10  [c=13 l=4]  andv1ti3/1
        vmovdqa %xmm2, (%rsp)              # 52  [c=4 l=5]  movv1ti_internal/4
        movq    (%rsp), %rdi               # 56  [c=5 l=4]  *movdi_internal/3
        movq    8(%rsp), %rsi              # 35  [c=9 l=5]  *movdi_internal/3
        call    __udivti3                  # 19  [c=13 l=5]  *call_value
        vmovq   %rax, %xmm3                # 53  [c=4 l=5]  *movdi_internal/20
        vpinsrq $1, %rdx, %xmm3, %xmm0     # 23  [c=4 l=6]  vec_concatv2di/2
        vmovdqa %xmm0, c(%rip)             # 25  [c=4 l=8]  movv1ti_internal/4
        addq    $24, %rsp                  # 62  [c=4 l=4]  pro_epilogue_adjust_stack_add_di/0
        ret                                # 63  [c=0 l=1]  simple_return_internal

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #1 from Hongtao Liu  ---
;; Turn SImode or DImode extraction from arbitrary SSE/AVX/AVX512F
;; vector modes into vec_extract*.
(define_split
  [(set (match_operand:SWI48x 0 "nonimmediate_operand")
        (subreg:SWI48x (match_operand 1 "register_operand") 0))]
  "can_create_pseudo_p ()
   && REG_P (operands[1])
   && VECTOR_MODE_P (GET_MODE (operands[1]))
   && ((TARGET_SSE && GET_MODE_SIZE (GET_MODE (operands[1])) == 16)
       || (TARGET_AVX && GET_MODE_SIZE (GET_MODE (operands[1])) == 32)
       || (TARGET_AVX512F && TARGET_EVEX512
           && GET_MODE_SIZE (GET_MODE (operands[1])) == 64))
   && (<MODE>mode == SImode || TARGET_64BIT || MEM_P (operands[0]))"
  [(set (match_dup 0) (vec_select:SWI48x (match_dup 1)
                                         (parallel [(const_int 0)])))]
{
  rtx tmp;

We need to do something similar.

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #2 from Hongtao Liu  ---
Also for 
void
foo2 (v128_t* a, v128_t* b)
{
   c = (*a & *b)+ *b;
}

(insn 9 8 10 2 (set (reg:V1TI 108 [ _3 ])
(and:V1TI (reg:V1TI 99 [ _2 ])
(mem:V1TI (reg:DI 113) [1 *a_6(D)+0 S16 A128])))
"/app/example.c":49:12 7100 {andv1ti3}
 (expr_list:REG_DEAD (reg:DI 113)
(nil)))
(insn 10 9 13 2 (parallel [
(set (reg:TI 109 [ _11 ])
(plus:TI (subreg:TI (reg:V1TI 108 [ _3 ]) 0)
(subreg:TI (reg:V1TI 99 [ _2 ]) 0)))
(clobber (reg:CC 17 flags))
]) "/app/example.c":49:17 256 {*addti3_doubleword}
 (expr_list:REG_DEAD (reg:V1TI 108 [ _3 ])
(expr_list:REG_DEAD (reg:V1TI 99 [ _2 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)

Since V1TImode can only be allocated to SSE_REGS, reload uses the stack for
(subreg:TI (reg:V1TI 108 [ _3 ]) 0), since the latter only supports GENERAL_REGS.

[Bug rtl-optimization/114556] New: weird loop unrolling when there's attribute aligned inside the loop

2024-04-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114556

Bug ID: 114556
   Summary: weird loop unrolling when there's attribute aligned
inside the loop
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

v32qi
z (void* pa, void* pb, void* pc)
{
v32qi __attribute__((aligned(64))) a;
v32qi __attribute__((aligned(64))) b;
v32qi __attribute__((aligned(64))) c;
__builtin_memcpy (&a, pa, sizeof (a));
__builtin_memcpy (&b, pb, sizeof (a));
__builtin_memcpy (&c, pc, sizeof (a));
#pragma GCC unroll 8
for (int i = 0; i != 2048; i++)
  a += b;
  return a;
}

-O2 -mavx2, we have 

z:
vmovdqu (%rsi), %ymm1
vpaddb  (%rdi), %ymm1, %ymm0
movl$2041, %eax
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
jmp .L2
.L3:
vpaddb  %ymm0, %ymm1, %ymm0
subl$8, %eax
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
vpaddb  %ymm0, %ymm1, %ymm0
.L2:
vpaddb  %ymm0, %ymm1, %ymm0
cmpl$1, %eax
jne .L3
ret

But wouldn't it be better as:

z:
vmovdqu (%rsi), %ymm1
vmovdqu (%rdi), %ymm0
movl$2048, %eax
.L2:
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
vpaddb  %ymm1, %ymm0, %ymm0
subl$8, %eax
jne .L2
ret

[Bug target/114570] New: GCC doesn't perform good loop invariant code motion for very long vector operations.

2024-04-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114570

Bug ID: 114570
   Summary: GCC doesn't perform good loop invariant code motion
for very long vector operations.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

typedef float v128_32 __attribute__((vector_size (128 * 4), aligned(2048)));
v128_32
foo (v128_32 a, v128_32 b, v128_32 c, int n)
{
for (int i = 0; i != 2048; i++)
{
a = a / c;
a = a / b;
}
return a;
}

   [local count: 1063004408]:
  # a_13 = PHI 
  # ivtmp_2 = PHI 
  # DEBUG i => NULL
  # DEBUG a => NULL
  # DEBUG BEGIN_STMT
  _14 = BIT_FIELD_REF ;
  _15 = BIT_FIELD_REF ;
  _10 = _14 / _15;
  _11 = BIT_FIELD_REF ;
  _12 = BIT_FIELD_REF ;
  _16 = _11 / _12;
  _17 = BIT_FIELD_REF ;
  _18 = BIT_FIELD_REF ;
  _19 = _17 / _18;
  _20 = BIT_FIELD_REF ;
  _21 = BIT_FIELD_REF ;
  _22 = _20 / _21;
  _23 = BIT_FIELD_REF ;
  _24 = BIT_FIELD_REF ;
  _25 = _23 / _24;
  _26 = BIT_FIELD_REF ;
  _27 = BIT_FIELD_REF ;
  _28 = _26 / _27;
  _29 = BIT_FIELD_REF ;
  _30 = BIT_FIELD_REF ;
  _31 = _29 / _30;
  _32 = BIT_FIELD_REF ;
  _33 = BIT_FIELD_REF ;
  _34 = _32 / _33;
  _35 = BIT_FIELD_REF ;
  _36 = BIT_FIELD_REF ;
  _37 = _35 / _36;
  _38 = BIT_FIELD_REF ;
  _39 = BIT_FIELD_REF ;
  _40 = _38 / _39;
  _41 = BIT_FIELD_REF ;
  _42 = BIT_FIELD_REF ;
  _43 = _41 / _42;
  _44 = BIT_FIELD_REF ;
  _45 = BIT_FIELD_REF ;
  _46 = _44 / _45;
  _47 = BIT_FIELD_REF ;
  _48 = BIT_FIELD_REF ;
  _49 = _47 / _48;
  _50 = BIT_FIELD_REF ;
  _51 = BIT_FIELD_REF ;
  _52 = _50 / _51;
  _53 = BIT_FIELD_REF ;
  _54 = BIT_FIELD_REF ;
  _55 = _53 / _54;
  _56 = BIT_FIELD_REF ;
  _57 = BIT_FIELD_REF ;
  _58 = _56 / _57;
  # DEBUG a => {_10, _16, _19, _22, _25, _28, _31, _34, _37, _40, _43, _46,
_49, _52, _55, _58}
  # DEBUG BEGIN_STMT
  _59 = BIT_FIELD_REF ;
  _60 = _10 / _59;
  _61 = BIT_FIELD_REF ;
  _62 = _16 / _61;
  _63 = BIT_FIELD_REF ;
  _64 = _19 / _63;
  _65 = BIT_FIELD_REF ;
  _66 = _22 / _65;
  _67 = BIT_FIELD_REF ;
  _68 = _25 / _67;
  _69 = BIT_FIELD_REF ;
  _70 = _28 / _69;
  _71 = BIT_FIELD_REF ;
  _72 = _31 / _71;
  _73 = BIT_FIELD_REF ;
  _74 = _34 / _73;
  _75 = BIT_FIELD_REF ;
  _76 = _37 / _75;
  _77 = BIT_FIELD_REF ;
  _78 = _40 / _77;
  _79 = BIT_FIELD_REF ;
  _80 = _43 / _79;
  _81 = BIT_FIELD_REF ;
  _82 = _46 / _81;
  _83 = BIT_FIELD_REF ;
  _84 = _49 / _83;
  _85 = BIT_FIELD_REF ;
  _86 = _52 / _85;
  _87 = BIT_FIELD_REF ;
  _88 = _55 / _87;
  _89 = BIT_FIELD_REF ;
  _90 = _58 / _89;
  a_9 = {_60, _62, _64, _66, _68, _70, _72, _74, _76, _78, _80, _82, _84, _86,
_88, _90};
  # DEBUG a => a_9
  # DEBUG BEGIN_STMT
  # DEBUG i => NULL
  # DEBUG a => a_9
  # DEBUG BEGIN_STMT
  ivtmp_1 = ivtmp_2 + 4294967295;
  if (ivtmp_1 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

Ideally, those BIT_FIELD_REFs could be hoisted out of the loop, and the a_13
PHI could be optimized to work directly on those 256-bit vectors.
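
A rough C-level sketch of that shape (hypothetical, not compiler output):
split the 512-byte vectors into sixteen 256-bit pieces once, keep the pieces
of a live across the loop, and only redo the divisions inside it.

typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical hand-hoisted form of the same computation.  */
void
foo_hoisted (v8sf a[16], const v8sf b[16], const v8sf c[16])
{
  v8sf av[16], bv[16], cv[16];
  for (int j = 0; j < 16; j++)   /* loop-invariant loads, hoisted */
    {
      av[j] = a[j];
      bv[j] = b[j];
      cv[j] = c[j];
    }
  for (int i = 0; i != 2048; i++)
    for (int j = 0; j < 16; j++)
      av[j] = (av[j] / cv[j]) / bv[j];
  for (int j = 0; j < 16; j++)
    a[j] = av[j];
}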

Currently, however, we generate

foo:
pushq   %rbp
movq%rdi, %rax
movl$2048, %edx
movq%rsp, %rbp
subq$408, %rsp
leaq-120(%rsp), %r8
.L2:
vmovaps 16(%rbp), %ymm15
vmovaps 48(%rbp), %ymm14
movq%r8, %rsi
vdivps  1040(%rbp), %ymm15, %ymm15
vmovaps 80(%rbp), %ymm13
vmovaps 112(%rbp), %ymm12
vdivps  528(%rbp), %ymm15, %ymm15
vdivps  1072(%rbp), %ymm14, %ymm14
vmovaps 144(%rbp), %ymm11
vmovaps 176(%rbp), %ymm10
vdivps  560(%rbp), %ymm14, %ymm14
vdivps  1104(%rbp), %ymm13, %ymm13
vmovaps 208(%rbp), %ymm9
vmovaps 240(%rbp), %ymm8
vdivps  592(%rbp), %ymm13, %ymm13
vdivps  1136(%rbp), %ymm12, %ymm12
vmovaps 272(%rbp), %ymm7
vmovaps 304(%rbp), %ymm6
vdivps  624(%rbp), %ymm12, %ymm12
vdivps  1168(%rbp), %ymm11, %ymm11
vmovaps 336(%rbp), %ymm5
vdivps  656(%rbp), %ymm11, %ymm11
vdivps  1200(%rbp), %ymm10, %ymm10
vdivps  1232(%rbp), %ymm9, %ymm9
vdivps  688(%rbp), %ymm10, %ymm10
vdivps  720(%rbp), %ymm9, %ymm9
vdivps  1264(%rbp), %ymm8, %ymm8
vdivps  1296(%rbp), %ymm7, %ymm7
vdivps  752(%rbp), %ymm8, %ymm8
vdivps  784(%rbp), %ymm7, %ymm7
vdivps  1328(%rbp), %ymm6, %ymm6
movl$64, %ecx
vdivps  816(%rbp), %ymm6, %ymm6
leaq16(%rbp), %rdi
vdivps  1360(%rbp), %ymm5, %ymm5
vdivps  848(%rbp), %ymm5, %ymm5
vmovaps 368(%rbp), %ymm4
vmovaps 400(%rbp), %ymm3
vdivps  1392(%rbp), %ymm4, %ymm4
vdivps  1424(%rbp), %ymm3, %ymm3
vmovaps 432(%rbp), %ymm2
vmovaps 464(%rbp), %ymm1
vdivps  880(%rbp), %ymm4, %ymm4
vdivps  912(%rbp), %ymm3, %ymm3
vmovaps 496

[Bug target/113744] Unnecessary "m" constraint in *adddi_4

2024-07-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |lingling.kong7 at gmail 
dot com
 Ever confirmed|0   |1
   Last reconfirmed|2024-02-04 00:00:00 |2024-07-31

--- Comment #4 from Hongtao Liu  ---
Then please remove the constraint from the pattern.

[Bug target/116157] AVX2 _mm256_exp_ps function is missing in the compiler

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116157

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Hongtao Liu  ---
We don't plan to support it in GCC.

*** This bug has been marked as a duplicate of bug 85236 ***

[Bug target/85236] missing _mm256_atan2_ps

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85236

Hongtao Liu  changed:

   What|Removed |Added

 CC||binklings at 163 dot com

--- Comment #8 from Hongtao Liu  ---
*** Bug 116157 has been marked as a duplicate of this bug. ***

[Bug target/116122] [14/15 regression] __FLT16_MAX__ is defined even with -mno-sse2 on 32-bit x86

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116122

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Hongtao Liu  ---
Mentioned in GCC14 "Changes" and "Porting to" documentation.

[Bug target/115981] [14/15 Regression] Redundant vmovaps to itself after vmovups since r14-537

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115981

--- Comment #4 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #3)
> Created attachment 58786 [details]
> gcc15-pr115981.patch
> 
> Untested fix.  As since that commit it checks swap_commutative_operands_p:
> 1) CONST_VECTOR I think has commutative_operand_precedence -4
> 2) REG has commutative_operand_precedence -1 or -2
> 3) SUBREG of object has commutative_operand_precedence -3
> 4) VEC_DUPLICATE has commutative_operand_precedence 0
> Which means the VEC_DUPLICATE operand will always come first and whatever
> matches reg_or_0_operand will always come second, i.e. exactly not the order
> in the pattern, so we don't need to add another one, can just change order
> of this one.

Patch LGTM.

[Bug target/113744] Unnecessary "m" constraint in *adddi_4

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC15.

[Bug tree-optimization/89749] Very odd vector constructor

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89749

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
  Known to work||12.1.0
 CC||liuhongt at gcc dot gnu.org
 Status|NEW |RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC12 and above.

[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst

2024-08-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao Liu  ---
Fixed in GCC15.

[Bug rtl-optimization/115021] [14 regression] unnecessary spill for vpternlog

2024-08-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition

2024-08-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
With the patch below, compiled with -march=x86-64-v3 -O3, the redundant spills are gone.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f044826269c..e8bcf314752 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20292,6 +20292,10 @@ inline_secondary_memory_needed (machine_mode mode, reg_class_t class1,
   if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
return true;

+  /* *movti_internal supports movement between SSE_REGS and GENERAL_REGS.  */
+  if (mode == TImode)
+   return false;
+
   int msize = GET_MODE_SIZE (mode);

   /* Between SSE and general, we have moves no larger than word size.  */


struct aq { long x,y; };
long testq(struct aq a) { return a.x+a.y; }

struct aw { short a0,a1,a2,a3,a4,a5,a6,a7; };
short testw(struct aw a) { return a.a0+a.a1+a.a2+a.a3+a.a4+a.a5+a.a6+a.a7; }

struct ad { int x,y,z,w; };
int testd(struct ad a) { return a.x+a.y+a.z+a.w; }

testq:
.LFB0:
.cfi_startproc
vmovq   %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddq  %xmm1, %xmm0, %xmm0
vmovq   %xmm0, %rax
ret
.cfi_endproc
.LFE0:
.size   testq, .-testq
.p2align 4
.globl  testw
.type   testw, @function
testw:
.LFB1:
.cfi_startproc
vmovq   %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddw  %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddw  %xmm1, %xmm0, %xmm0
vpsrldq $2, %xmm0, %xmm1
vpaddw  %xmm1, %xmm0, %xmm0
vpextrw $0, %xmm0, %eax
ret
.cfi_endproc
.LFE1:
.size   testw, .-testw
.p2align 4
.globl  testd
.type   testd, @function
testd:
.LFB2:
.cfi_startproc
vmovq   %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddd  %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vmovd   %xmm0, %eax
ret
.cfi_endproc

But with -march=x86-64-v2 or -march=x86-64 -O3, the spills are still there,
hmm.

[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition

2024-08-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #5 from Hongtao Liu  ---
For the non-AVX case, it looks like it hits here:

  748  /* Special case TImode to 128-bit vector conversions via V2DI.  */   
  749  if (VECTOR_MODE_P (mode) 
  750  && GET_MODE_SIZE (mode) == 16
  751  && SUBREG_P (op1)
  752  && GET_MODE (SUBREG_REG (op1)) == TImode 
  753  && TARGET_64BIT && TARGET_SSE
  754  && can_create_pseudo_p ())   
  755{  
  756  rtx tmp = gen_reg_rtx (V2DImode);
  757  rtx lo = gen_reg_rtx (DImode);   
  758  rtx hi = gen_reg_rtx (DImode);   
  759  emit_move_insn (lo, gen_lowpart (DImode, SUBREG_REG (op1))); 
  760  emit_move_insn (hi, gen_highpart (DImode, SUBREG_REG (op1)));
  761  emit_insn (gen_vec_concatv2di (tmp, lo, hi));
  762  emit_move_insn (op0, gen_lowpart (mode, tmp));   
  763  return;  
  764}

[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition

2024-08-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #6 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #5)
> For non-avx case, looks like it hits here
> 
>   748  /* Special case TImode to 128-bit vector conversions via V2DI.  */   
> 


Preventing that, we get

.file   "test.c"
.text
.p2align 4
.globl  testq
.type   testq, @function
testq:
.LFB0:
.cfi_startproc
movq%rdi, %xmm1
pinsrq  $1, %rsi, %xmm1
movdqa  %xmm1, %xmm0
psrldq  $8, %xmm0
paddq   %xmm1, %xmm0
movq%xmm0, %rax
ret
.cfi_endproc
.LFE0:
.size   testq, .-testq
.p2align 4
.globl  testw
.type   testw, @function
testw:
.LFB1:
.cfi_startproc
movq%rdi, %xmm1
pinsrq  $1, %rsi, %xmm1
movdqa  %xmm1, %xmm0
psrldq  $8, %xmm0
paddw   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
psrldq  $4, %xmm1
paddw   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
psrldq  $2, %xmm1
paddw   %xmm1, %xmm0
pextrw  $0, %xmm0, %eax
ret
.cfi_endproc
.LFE1:
.size   testw, .-testw
.p2align 4
.globl  testd
.type   testd, @function
testd:
.LFB2:
.cfi_startproc
movq%rdi, %xmm1
pinsrq  $1, %rsi, %xmm1
movdqa  %xmm1, %xmm0
psrldq  $8, %xmm0
paddd   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
psrldq  $4, %xmm1
paddd   %xmm1, %xmm0
movd%xmm0, %eax
ret
.cfi_endproc
.LFE2:

[Bug target/113729] Missing APX NDD optimization

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113729

Hongtao Liu  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #7 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/116174] [14/15 regression] Alignment request is added before endbr with -fcf-protection=branch since r15-888-gb644126237a1aa

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116174

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #9 from Hongtao Liu  ---
(In reply to Arnd Bergmann from comment #7)
> I confirmed that the patch from comment #6 addresses the build warnings I
> see in the kernel.

Does the commit also fix the issue? If so, I'll backport the patch to the
GCC14 release branch.

[Bug target/115756] default tuning for x86_64 produces shifts for `*240`

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug 115749 depends on bug 115756, which changed state.

Bug 115756 Summary: default tuning for x86_64 produces shifts for `*240`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #14 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/113600] [14/15 regression] 525.x264_r run-time regresses by 8% with PGO -Ofast -march=znver4

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600

--- Comment #10 from Hongtao Liu  ---
I think it should be fixed by r15-2820-gab18785840d7b8

[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

2024-08-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #5 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #4)
> Interesting clang does:
> ```
> movzx   ecx, word ptr [rdi + 2*rax]
> popcnt  ecx, ecx
> lea rsi, [rsi + 2*rcx]
> ```
> 
> 
> While GCC 14+ does:
> ```
> xor eax, eax
> add rdi, 2
> mov WORD PTR [rsi], ax
> popcnt  ax, WORD PTR [rdi-2]
> and eax, 31
> ```
> 
> So clang has a zero extend before the popcount while GCC has it afterwards
> ...

I guess the zero_extend before popcnt is used to clear the popcnt false
dependence (the same register is used for source and dest; GCC uses xor for
that instead), and clang knows the upper bits of the popcnt result must be
zero, so there's no zero_extend afterwards.

Maybe we should simplify that at the RTL level for a zero_extend of popcnt, or
an and of popcnt with an immediate.
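
The value-range argument, spelled out as a standalone helper (a sketch of the
reasoning only, not GCC code): popcount of an N-bit operand lies in [0, N], so
an AND whose mask keeps every bit those values can use is redundant.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: would "(and (popcount x) mask)" simplify to
   "(popcount x)" when x has operand_bits bits?  E.g. for a 16-bit operand
   the mask 31 (the "and eax, 31" above) keeps every possible popcount
   value, so the AND is redundant.  */
bool
popcount_and_is_redundant (unsigned operand_bits, uint64_t mask)
{
  uint64_t needed = 1;
  while (needed <= operand_bits)  /* smallest power of two > operand_bits */
    needed <<= 1;
  needed -= 1;                    /* e.g. 16-bit operand -> 0x1f */
  return (mask & needed) == needed;
}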

[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition

2024-08-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #8 from Hongtao Liu  ---

> 
> codegen is probably an RA/LRA artifact caused by bad instruction constraints
> and the refuse to reload to a gpr.  Not sure if a move high to gpr is a
> thing,
> pextrq would work for sure.  But an unpck looks like a better match anyway.

RA issue is fixed.

[Bug target/116174] [14/15 regression] Alignment request is added before endbr with -fcf-protection=branch since r15-888-gb644126237a1aa

2024-08-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116174

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #12 from Hongtao Liu  ---
Fixed in GCC14.3 and GCC15

[Bug target/115982] [15 Regression] ICE: unrecognizable insn in ira_remove_insn_scratches with -mavx512vl since r15-1742

2024-08-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115982

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
  Known to fail||15.0
  Known to work||15.0
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115683] [15 Regression] SSE2 regressions after obselete of vcond{,u,eq}.

2024-08-19 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115683

--- Comment #6 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Hongtao Liu from comment #0)
> 
> > g++: g++.target/i386/pr100637-1b.C 
> > g++: g++.target/i386/pr100637-1w.C
> > g++: g++.target/i386/pr103861-1.C
> >
> > There're extra 1 pcmpeq instruction generated in below 3 testcase for
> > comparison of GTU, x86 doesn't support native GTU comparison, but use
> > psubusw + pcmpeq + pcmpeq, the second pcmpeq is used to negate the mask, and
> > the negate can be eliminated in vcond{,u,eq} expander by just swapping
> > if_true and if_else.
> 
> How to do that? The output from vec_cmpu is a mask value in the output
> register that is used by vcond_mask as an input. I fail to see how the swap
> of if_true and if_false operands (in vcond_mask RTX) can be communicated
> from vec_cmpu to vcond_mask.

One possible solution is to define a "fake" blendv pattern to help combine do
the optimization, and then split this fake pattern back to
(op1 & mask) | (op2 & ~mask) when !TARGET_SSE4_1.
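
The !TARGET_SSE4_1 fallback described above, written as plain C vector code (a
sketch for illustration, not the proposed pattern itself):

typedef char v16qi __attribute__((vector_size(16)));

/* Select a where mask is all-ones, b where mask is all-zeros: the
   and/andnot/or sequence a blendv would be split back into.  */
v16qi
blend_fallback (v16qi a, v16qi b, v16qi mask)
{
  return (a & mask) | (b & ~mask);
}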

[Bug target/116497] Need no_caller_saved_registers with SSE support

2024-08-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116497

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #6 from Hongtao Liu  ---
(In reply to Andi Kleen from comment #1)
> Disable check for no_caller_saved_registers enforcing non FP.
> 
> diff --git a/gcc/config/i386/i386-options.cc
> b/gcc/config/i386/i386-options.cc
> index f79257cc764..cec652cc9e6 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -3639,8 +3639,8 @@ ix86_set_current_function (tree fndecl)
>  reinit_regs ();
>  
>if (cfun->machine->func_type != TYPE_NORMAL
> -  || (cfun->machine->call_saved_registers
> - == TYPE_NO_CALLER_SAVED_REGISTERS))
> +  /* || (cfun->machine->call_saved_registers
> +== TYPE_NO_CALLER_SAVED_REGISTERS) */)
>  {
>/* Don't allow SSE, MMX nor x87 instructions since they
>  may change processor state.  */

I think the RA is smart enough to save and restore SSE, MMX or x87 registers,
and we can remove the TYPE_NO_CALLER_SAVED_REGISTERS part from this check.
Or are there any other concerns here regarding the comments? @hj
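
For reference, the kind of function this affects (hypothetical example, not
from the PR): with the quoted check in place, SSE/MMX/x87 are disabled in the
body, so a function like this can't use them today; the suggestion is to let
the RA save and restore whatever vector registers the function actually uses.

__attribute__((no_caller_saved_registers))
void
scale (float *dst, const float *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 2.0f;
}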

[Bug target/116512] [12/13/14/15 Regression] vzeroupper emitted even though the upper half of the z registers are returned

2024-08-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116512

Hongtao Liu  changed:

   What|Removed |Added

   Last reconfirmed|2024-08-28 00:00:00 |
 Status|NEW |UNCONFIRMED
 Ever confirmed|1   |0
  Known to work|6.1.0, 6.4.0, 7.2.0 |
   Keywords|wrong-code  |
  Known to fail|6.5.0, 7.3.0, 7.5.0, 8.1.0  |

--- Comment #3 from Hongtao Liu  ---
static bool
ix86_check_avx_upper_register (const_rtx exp)
{
  return (SSE_REG_P (exp)
          && !EXT_REX_SSE_REG_P (exp)
          && GET_MODE_BITSIZE (GET_MODE (exp)) > 128);
}

But the code doesn't check modes.

[Bug target/116512] [12/13/14/15 Regression] vzeroupper emitted even though the upper half of the z registers are returned

2024-08-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116512

--- Comment #4 from Hongtao Liu  ---
gdb shows crtl->return_rtx is

(parallel/i:BLK [
    (expr_list:REG_DEP_TRUE (reg:XI 20 xmm0)
        (const_int 0 [0]))
  ])

[Bug target/116512] [12/13/14/15 Regression] vzeroupper emitted even though the upper half of the z registers are returned

2024-08-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116512

--- Comment #6 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to Hongtao Liu from comment #4)
> > gdb shows crtl->return_rtx is
> > 
> > 21(parallel/i:BLK [ 
> > 
> > 22(expr_list:REG_DEP_TRUE (reg:XI 20 xmm0)  
> > 
> > 23(const_int 0 [0]))
> > 
> > 24])
> 
> Oh, so ix86_avx_u128_mode_exit does not understand parallel here.

For function arguments/returns, when it's BLK mode, it's put in a parallel with
an expr_list, which needs extra handling.

[Bug target/116512] [12/13/14/15 Regression] vzeroupper emitted even though the upper half of the z registers are returned

2024-09-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116512

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #11 from Hongtao Liu  ---
Fixed in GCC12.5/GCC13.4/GCC14.3/GCC15

[Bug target/116582] gather is a win in some cases on zen CPUs

2024-09-04 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #5 from Hongtao Liu  ---
(In reply to Richard Biener from comment #4)
> (In reply to Jan Hubicka from comment #3)
> > Just for completeness the codegen for parest sparse matrix multiply is:
> > 
> >   0.31 │320:   kmovb %k1,%k4
> >   0.25 │   kmovb %k1,%k5
> >   0.28 │   vmovdqu32 (%rcx,%rax,1),%zmm0
> >   0.32 │   vpmovzxdq %ymm0,%zmm4
> >   0.31 │   vextracti32x8 $0x1,%zmm0,%ymm0
> >   0.48 │   vpmovzxdq %ymm0,%zmm0
> >  10.32 │   vgatherqpd(%r14,%zmm4,8),%zmm2{%k4}
> >   1.90 │   vfmadd231pd   (%rdx,%rax,2),%zmm2,%zmm1
> >  14.86 │   vgatherqpd(%r14,%zmm0,8),%zmm5{%k5}   
> >   0.27 │   vfmadd231pd   0x40(%rdx,%rax,2),%zmm5,%zmm1
> >   0.26 │   add   $0x40,%rax
> >   0.23 │   cmp   %rax,%rdi   
> >│ ↑ jne   320 
> > 
> > which looks OK to me.
> 
> The in-loop mask moves are odd, but yes.
> 
>
It's because vgatherqpd sets %k4 to 0, so it needs to be reinitialized to -1
(from %k1) on every iteration.

[Bug target/116617] x86_64: arch lunarlake not documented

2024-09-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116617

Hongtao Liu  changed:

   What|Removed |Added

 CC||haochen.jiang at intel dot com,
   ||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
@Haochen, could you add that?

[Bug target/116617] x86_64: arch lunarlake not documented

2024-09-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116617

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Hongtao Liu  ---
Fixed.

[Bug middle-end/116658] New: [GCC15 regression] ICE in vect_is_slp_load_node

2024-09-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116658

Bug ID: 116658
   Summary: [GCC15 regression] ICE in vect_is_slp_load_node
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59082
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59082&action=edit
test111.i

g++ -O3 test111.i -march=x86-64-v4

during GIMPLE pass: vect
test111.i: In member function ‘void a::bg::r::dc(unsigned int,
a::bq::bq, const az*, a::bu&) [with int dim = 1; az =
a::bb]’:
test111.i:173:6: internal compiler error: Segmentation fault
  173 | void r< dim, az >::dc(unsigned q, bq::bq de, const az *p,
  |  ^~~~
0x30c5da6 internal_error(char const*, ...)
   
/export/users/liuhongt/gcc/git_trunk/master/gcc/diagnostic-global-context.cc:492
0x1987e16 crash_signal
/export/users/liuhongt/gcc/git_trunk/master/gcc/toplev.cc:321
0x1cc8d44 vect_is_slp_load_node
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-slp.cc:3269
0x1cc8d44 optimize_load_redistribution_1
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-slp.cc:3305
0x1cc970b optimize_load_redistribution
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-slp.cc:3375
0x1cc970b vect_analyze_slp(vec_info*, unsigned int)
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-slp.cc:4759
0x1c9665a vect_analyze_loop_2
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-loop.cc:2862
0x1c97a0f vect_analyze_loop_1
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-loop.cc:3409
0x1c9820b vect_analyze_loop(loop*, vec_info_shared*)
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vect-loop.cc:3567
0x1ce4ffa try_vectorize_loop_1
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vectorizer.cc:1068
0x1ce4ffa try_vectorize_loop
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vectorizer.cc:1184
0x1ce57d4 execute
/export/users/liuhongt/gcc/git_trunk/master/gcc/tree-vectorizer.cc:1300
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug tree-optimization/116674] New: [15 regression] ICE in vectorizable_simd_clone_call bisected to r15-3509-gd34cda72098867

2024-09-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116674

Bug ID: 116674
   Summary: [15 regression] ICE in vectorizable_simd_clone_call
bisected to r15-3509-gd34cda72098867
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59093
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59093&action=edit
test.i

during GIMPLE pass: vect
test.i: In function ‘void b(int)’:
test.i:71:6: internal compiler error: in vectorizable_simd_clone_call, at
tree-vect-stmts.cc:4039
   71 | void b(int bc) {
  |  ^
0x27f90f5 internal_error(char const*, ...)
/home/liuhongt/work/gcc/master/gcc/diagnostic-global-context.cc:517
0xa836ce fancy_abort(char const*, int, char const*)
/home/liuhongt/work/gcc/master/gcc/diagnostic.cc:1657
0x987e86 vectorizable_simd_clone_call
/home/liuhongt/work/gcc/master/gcc/tree-vect-stmts.cc:4039
0x16a8628 vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*,
_slp_instance*, vec*)
/home/liuhongt/work/gcc/master/gcc/tree-vect-stmts.cc:13362
0x1706999 vect_slp_analyze_node_operations_1
/home/liuhongt/work/gcc/master/gcc/tree-vect-slp.cc:7364
0x1706999 vect_slp_analyze_node_operations
/home/liuhongt/work/gcc/master/gcc/tree-vect-slp.cc:7567
0x17068d0 vect_slp_analyze_node_operations
/home/liuhongt/work/gcc/master/gcc/tree-vect-slp.cc:7546
0x17085ed vect_slp_analyze_operations(vec_info*)
/home/liuhongt/work/gcc/master/gcc/tree-vect-slp.cc:7962
0x16d3fc6 vect_analyze_loop_2
/home/liuhongt/work/gcc/master/gcc/tree-vect-loop.cc:2954
0x16d6010 vect_analyze_loop_1
/home/liuhongt/work/gcc/master/gcc/tree-vect-loop.cc:3409
0x16d676b vect_analyze_loop(loop*, vec_info_shared*)
/home/liuhongt/work/gcc/master/gcc/tree-vect-loop.cc:3567
0x171e1b4 try_vectorize_loop_1
/home/liuhongt/work/gcc/master/gcc/tree-vectorizer.cc:1068
0x171e1b4 try_vectorize_loop
/home/liuhongt/work/gcc/master/gcc/tree-vectorizer.cc:1184
0x171eadc execute
/home/liuhongt/work/gcc/master/gcc/tree-vectorizer.cc:1300
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.


gcc -Ofast -march=x86-64-v3 -S

[Bug tree-optimization/116674] [15 regression] ICE in vectorizable_simd_clone_call bisected to r15-3509-gd34cda72098867

2024-09-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116674

--- Comment #1 from Hongtao Liu  ---
Created attachment 59094
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59094&action=edit
test.i

A more reduced case.

[Bug target/116675] No blend constant permute for V8HImode with just SSE2

2024-09-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116675

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
 5032(define_expand "vcond_mask_" 
 5033  [(set (match_operand:VI_128 0 "register_operand")
 5034(vec_merge:VI_128  
 5035  (match_operand:VI_128 1 "vector_operand")
 5036  (match_operand:VI_128 2 "nonimm_or_0_operand")   
 5037  (match_operand: 3 "register_operand")))]  
 5038  "TARGET_SSE2"
 5039{  
 5040  ix86_expand_sse_movcc (operands[0], operands[3], 
 5041 operands[1], operands[2]);
 5042  DONE;
 5043}) 

Yes, a blend should be the same as vcond_mask.
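
For reference, the kind of constant blend permute this is about (hypothetical
example, not the PR testcase): even lanes from a, odd lanes from b.  Without
SSE4.1 pblendw it can still be expanded as the and/andnot/or sequence that
ix86_expand_sse_movcc emits for vcond_mask.

typedef short v8hi __attribute__((vector_size(16)));

v8hi
blend_even_odd (v8hi a, v8hi b)
{
  return __builtin_shufflevector (a, b, 0, 9, 2, 11, 4, 13, 6, 15);
}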

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

--- Comment #3 from Hongtao Liu  ---
 <__umodti3>:
...

 37  58:   66 48 0f 6e c7  movq   %rdi,%xmm0
 38  5d:   66 48 0f 6e d6  movq   %rsi,%xmm2
 39  62:   66 0f 6c c2 punpcklqdq %xmm2,%xmm0
 40  66:   0f 29 44 24 f0  movaps %xmm0,-0x10(%rsp)
 41  6b:   48 8b 44 24 f0  mov-0x10(%rsp),%rax
 42  70:   48 8b 54 24 f8  mov-0x8(%rsp),%rdx
 43  75:   5b  pop%rbx
 44  76:   c3  ret

Looks like the misoptimization is also present in __umodti3.

[Bug target/113288] [i386] Missing #define for -mavx10.1-256 and -mavx10.1-512

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113288

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Hongtao Liu  ---
.

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao Liu  ---
Fixed since GCC8 with avx512bw and avx512vl.
Without avx512bw, x86 doesn't support packed int16 mask{load,store}, so the
loop can't be vectorized.
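
For reference, the kind of conditional int16 store this refers to
(hypothetical example): vectorizing it needs a masked 16-bit store, which x86
only provides with AVX512BW (plus AVX512VL for the 128/256-bit widths).

void
cond_store (short *a, const short *b, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)
    if (b[i] > 0)
      a[i] = b[i];
}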

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Jakub Jelinek from comment #2)
> > This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6
> 
> Strange, if I revert the constraints to the previous setting with: 
> 
> --cut here--
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 10ae3113ae8..262dd25a8e0 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2870,9 +2870,9 @@ (define_peephole2
>  
>  (define_insn "*movhi_internal"
>[(set (match_operand:HI 0 "nonimmediate_operand"
> -"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
> +"=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
> (match_operand:HI 1 "general_operand"
> -"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
> +"r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r  ,C  ,v,m ,x,v"))]
>"!(MEM_P (operands[0]) && MEM_P (operands[1]))
> && ix86_hardreg_mov_ok (operands[0], operands[1])"
>  {
> --cut here--
> 
> I still get:
> 
> movlv1(%rip), %eax  # 6 [c=6 l=6]  *zero_extendsidi2/3
> movq%rax, v2(%rip)  # 16[c=4 l=7]  *movdi_internal/5
> movzwl  v1(%rip), %eax  # 7 [c=5 l=7]  *movhi_internal/2

My experience is that the memory cost for an operand with "rm" vs. separate
"r", "m" alternatives is different, which impacts the RA decision.

https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #5 from Hongtao Liu  ---
> My experience is that the memory cost for an operand with "rm" vs. separate
> "r", "m" alternatives is different, which impacts the RA decision.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

Changing operands[1] alternative 2 from m to rm, the RA then makes the perfect
decision.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #9 from Hongtao Liu  ---

> 
> It looks that different modes of memory read confuse LRA to not CSE the read.
> 
> IMO, if the preloaded value is later accessed in different modes, LRA should
> leave it. Alternatively, LRA should CSE memory accesses in different modes.

(insn 7 6 12 2 (set (reg:HI 101 [ _5 ])
(subreg:HI (reg:SI 98 [ v1.0_1 ]) 0)) "test.c":6:12 86
{*movhi_internal}
 (expr_list:REG_DEAD (reg:SI 98 [ v1.0_1 ])
(nil)))

Maybe we should reduce the cost of a simple move instruction (with a subreg?)
when calculating total_cost, since it will probably be eliminated by later RTL
optimizations.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #11 from Hongtao Liu  ---
unsigned v;
long long v2;
char foo ()
{
v2 = v;
return v;
}

This is related to *movqi_internal, and the codegen has been worse since GCC 8.1:

foo:
movlv(%rip), %eax
movq%rax, v2(%rip)
movzbl  v(%rip), %eax
ret

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #12 from Hongtao Liu  ---
short a;
short c;
short d;
void
foo (short b, short f)
{
  c = b + a;
  d = f + a;
}

foo(short, short):
addwa(%rip), %di
addwa(%rip), %si
movw%di, c(%rip)
movw%si, d(%rip)
ret

This one has been bad since GCC 10.1, and there's no subreg involved. The
problem is that if the operand is used by more than one insn and they all
support a separate "m" constraint, mem_cost is quite small (just 1, while the
register move cost is 2), which makes the RA more inclined to propagate the
memory access across insns. I guess the RA assumes the separate "m" means the
insn only supports memory_operand?

 961  if (op_class == NO_REGS)
 962/* Although we don't need insn to reload from
 963   memory, still accessing memory is usually more
 964   expensive than a register.  */
 965pp->mem_cost = frequency;
 966  else

[Bug middle-end/110027] [11/12/13/14 regression] Stack objects with extended alignments (vectors etc) misaligned on detect_stack_use_after_return

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #19 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #17)
> Both of the posted patches are incorrect, this needs to be fixed in
> asan_emit_stack_protection, account for the different offsets[0] which
> happens when a stack pointer guard is created.
> I'll deal with it tomorrow.

It seems to me that the only offending place is the one I've modified; are
there other places where align_frame_offset (ASAN_RED_ZONE_SIZE) is also
added?

Also, your patch adds a gcc_assert for offsets[0], which suggests there was
already an assumption that offsets[0] should be a multiple of alignb, thus
making my patch more reasonable?

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #15 from Hongtao Liu  ---
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) memory accesses from the same location close to each
> other can be faster (so, -O2, not -Os) than preloading the value to the
> register first.
At least for memory in a vector mode, it's better to preload the value into a
register first.
> 
> In contrast, the example from the Comment #11 already has the correct value
> in %eax, so there is no need to reload it again from memory, even in a
> narrower mode.

So the problem is why CSE can't handle the same memory access in a narrower
mode; maybe it's because there's a zero_extend in the first load. CSE looks
like it can handle the simple wider-mode case:

4952  /* See if a MEM has already been loaded with a widening operation;
4953 if it has, we can use a subreg of that.  Many CISC machines
4954 also have such operations, but this is only likely to be
4955 beneficial on these machines.  */

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #16 from Hongtao Liu  ---

> 
> 4952  /* See if a MEM has already been loaded with a widening operation;
> 4953 if it has, we can use a subreg of that.  Many CISC machines
> 4954 also have such operations, but this is only likely to be
> 4955 beneficial on these machines.  */

Oh, it's the pre-reload cse_insn, not post-reload gcse.
