On Sun, Oct 3, 2021 at 7:27 AM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> My recent attempts to come up with a testcase for my patch to evaluate
> ss_plus in simplify-rtx.c, identified a missed optimization opportunity
> (that's potentially a long-time regression): The RTL optimizers no longer
> place constants in the constant pool.
>
> The motivating x86_64 example is the simple program:
>
> typedef char v8qi __attribute__ ((vector_size (8)));
>
> v8qi foo()
> {
>   v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
>   v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
>   v8qi t = __builtin_ia32_paddsb(tx, ty);
>   return t;
> }
>
> which (with my previous patch) currently results in:
> foo:    movq    .LC0(%rip), %xmm0
>         movq    .LC1(%rip), %xmm1
>         paddsb  %xmm1, %xmm0
>         ret
>
> even though the RTL contains the result in a REG_EQUAL note:
>
> (insn 7 6 12 2 (set (reg:V8QI 83)
>         (ss_plus:V8QI (reg:V8QI 84)
>             (reg:V8QI 85))) "ssaddqi3.c":7:12 1419 {*mmx_ssaddv8qi3}
>      (expr_list:REG_DEAD (reg:V8QI 85)
>         (expr_list:REG_DEAD (reg:V8QI 84)
>             (expr_list:REG_EQUAL (const_vector:V8QI [
>                         (const_int 3 [0x3])
>                         (const_int 0 [0]) repeated x7
>                     ])
>                 (nil)))))
>
> Together with the patch below, GCC will now generate the much
> more sensible:
> foo:    movq    .LC2(%rip), %xmm0
>         ret
>
> My first approach was to look in cse.c (where the REG_EQUAL note gets
> added) and notice that the constant pool handling functionality has been
> unreachable for a while.  A quick search for constant_pool_entries_cost
> shows that it's initialized to zero, but never set to a non-zero value,
> meaning that force_const_mem is never called.  This functionality used
> to work way back in 2003, but has been lost over time:
> https://gcc.gnu.org/pipermail/gcc-patches/2003-October/116435.html
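>
> For reference, the dead branch in cse_insn has (paraphrased, with
> some of the conditions omitted) roughly the following shape:
>
>   /* constant_pool_entries_cost is zero-initialized in cse_main and
>      never assigned a non-zero value anywhere, so this branch never
>      fires and the force_const_mem path it enables is unreachable.  */
>   else if (constant_pool_entries_cost
>            && CONSTANT_P (trial)
>            /* ... remaining conditions omitted ...  */)
>     {
>       src_folded_force_flag = 1;
>       src_folded = trial;
>       src_folded_cost = constant_pool_entries_cost;
>       src_folded_regcost = constant_pool_entries_regcost;
>     }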
>
> The changes to cse.c below restore this functionality (placing suitable
> constants in the constant pool) with two significant refinements:
> (i) it only attempts to do this if the function already uses a constant
> pool (thanks to the availability of crtl->uses_constant_pool since 2003).
> (ii) it allows different constants (i.e. modes) to have different costs,
> so that floating point "doubles" and 64-bit, 128-bit, 256-bit and 512-bit
> vectors don't all have to share the same cost.  Back in 2003, the
> assumption was that everything in a constant pool had the same
> cost, hence the global variable constant_pool_entries_cost.
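>
> One hypothetical shape for (ii), illustrative only (the patch may
> structure this differently), is to cost each candidate constant as
> the pool load it would become:
>
>   /* Price a constant by the cost of loading it from its (would-be)
>      constant pool slot, so that a DFmode double and a V16QImode
>      vector need not share one global cost.  Note force_const_mem
>      records the constant in the pool as a side effect.  */
>   static int
>   pool_constant_cost (machine_mode mode, rtx x, bool speed)
>   {
>     rtx mem = force_const_mem (mode, x);
>     return mem ? set_src_cost (mem, mode, speed) : -1;
>   }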
>
> Although this is a useful CSE fix, it turns out that it doesn't cure my
> motivating problem above.  CSE only considers a single instruction,
> so it determines that it's cheaper to perform the ss_plus (COSTS_N_INSNS(1))
> than to read the result from the constant pool (COSTS_N_INSNS(2)).  It's
> only when the other reads from the constant pool are also eliminated
> that this transformation is a win.  Hence a better place to perform
> this transformation is in combine, where, after failing to "recog" the
> load of a suitable constant, it can retry after calling force_const_mem.
> This achieves the desired transformation and allows the backend insn_cost
> callback to control whether or not using the constant pool is preferable.
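>
> As a rough sketch of the idea (not the actual patch; the exact
> guards and placement may differ), the retry in recog_for_combine
> looks something like:
>
>   /* If recog failed on a move of a constant into a register, retry
>      with the constant forced into the constant pool.  force_const_mem
>      returns NULL_RTX when the constant can't be placed in a pool.  */
>   if (insn_code_number < 0
>       && GET_CODE (pat) == SET
>       && REG_P (SET_DEST (pat))
>       && CONSTANT_P (SET_SRC (pat)))
>     {
>       rtx mem = force_const_mem (GET_MODE (SET_DEST (pat)),
>                                  SET_SRC (pat));
>       if (mem != NULL_RTX)
>         {
>           int num_clobbers = 0;
>           SUBST (SET_SRC (pat), mem);
>           insn_code_number = recog (pat, insn, &num_clobbers);
>         }
>     }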
>
> Alas, it's rare to change code generation without affecting something in
> GCC's testsuite.  On x86_64-pc-linux-gnu there were two families of new
> failures (and I'd predict similar benign fallout on other platforms).
> One failure was gcc.target/i386/387-12.c (aka PR target/26915), where
> the test is missing an explicit -m32 flag.  On i686, it's very reasonable
> to materialize -1.0 using "fld1; fchs", but on x86_64-pc-linux-gnu we
> currently generate the awkward:
> testm1: fld1
>         fchs
>         fstpl   -8(%rsp)
>         movsd   -8(%rsp), %xmm0
>         ret
>
> which combine now very reasonably simplifies to just:
> testm1: movsd   .LC3(%rip), %xmm0
>         ret
>
> The other class of x86_64-pc-linux-gnu failure was from materialization
> of vector constants using vpbroadcast (e.g. gcc.target/i386/pr90773-17.c)
> where the decision is finely balanced: loading an integer register
> with an immediate constant followed by a vpbroadcast is deemed to
> cost COSTS_N_INSNS(2), whereas a load from the constant pool is also
> reported as COSTS_N_INSNS(2).  My solution is to tweak i386.c's
> rtx_costs so that, all other things being equal, an instruction
> (sequence) that accesses memory is fractionally more expensive than
> one that doesn't.
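>
> Concretely, the tie-break is on the order of the following sketch
> (illustrative only; the actual patch may hook this in elsewhere in
> ix86_rtx_costs):
>
>   /* COSTS_N_INSNS (1) is 4, so an extra 1 makes a memory-touching
>      rtx a fraction of an instruction more expensive, just enough to
>      break ties in favour of register-only sequences.  */
>   if (speed && MEM_P (x))
>     *total += 1;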
>
>
> Hopefully, this all makes sense.  If someone could benchmark this for
> me, that would be much appreciated.  This patch has been tested on
> x86_64-pc-linux-gnu with "make bootstrap" and "make -k check" with no
> new failures.  Ok for mainline?
>
>
> 2021-10-03  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * combine.c (recog_for_combine): For an unrecognized move/set of
>         a constant, try force_const_mem to place it in the constant pool.
>         * cse.c (constant_pool_entries_cost, constant_pool_entries_regcost):
>         Delete global variables (that are no longer assigned a cost value).
>         (cse_insn): Simplify logic for deciding whether to place a folded
>         constant in the constant pool using force_const_mem.
>         (cse_main): Remove zero initialization of constant_pool_entries_cost
>         and constant_pool_entries_regcost.
>         * config/i386/i386.c (ix86_rtx_costs): Make memory accesses
>         fractionally more expensive, when optimizing for speed.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/387-12.c: Add explicit -m32 option.

Should it use

/* { dg-do compile { target ia32 } } */

instead?

> Roger
> --



-- 
H.J.
