https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59393

--- Comment #8 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 31 Mar 2016, law at redhat dot com wrote:

> --- Comment #7 from Jeffrey A. Law <law at redhat dot com> ---
> I was looking at this and noticed we have several sequences like
> 
>   _18 = l_11 >> 16;
>   _19 = _18 & 255;
>   _20 = _19 + 256;
>   _21 = _20 * 8;
>
> There are variations in the constants, but the pattern repeats regularly.  My
> first thought was to rewrite that as
> 
>  _18 = l_11 >> 13;
>  _19 = _18 & 0x7f8;
>  _20 = _19 + 0x800;
> 
> That seemed to be slightly worse on x86_64.  I'd already noticed that the
> addition was setting bits we knew to be zero, so it could be rewritten using
> an IOR like this:
> 
>  _18 = l_11 >> 13;
>  _19 = _18 & 0x7f8;
>  _20 = _19 | 0x800;
> 
> In isolation, that looked good on x86_64.  So my thought was that we may have
> a gcc-7 improvement that could be made for this code.  But then I coded up a
> quick pattern in match.pd and tested it, and the resulting assembly code was
> considerably worse on x86_64 for the benchmark code.
> 
> There are a couple of things in play here on x86_64.  In the benchmark code
> these are address computations.  The *8 and +256 in the original sequence can
> be part of the effective address in the memory reference.  Furthermore, the
> masking is a 2-byte movzbl in the original sequence, but a 3-byte
> and-immediate in the later sequences.  That negates all the gain from using
> IOR instead of PLUS, which was shorter for x86_64.

Yeah, I think we have several fold-const.c pieces that try to preserve
or create shifts and ANDs that match mode widths.
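
For reference, here is a quick standalone C sanity check of the three
forms quoted above (a sketch, not GCC code; the function names are made
up, and the constants are the ones from the quoted GIMPLE):

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Original GIMPLE shape: ((l >> 16) & 255) + 256, later multiplied
     by 8 when it feeds the address computation.  */
  static uint32_t
  form_original (uint32_t l)
  {
    uint32_t t = (l >> 16) & 255;
    return (t + 256) * 8;
  }

  /* Fold the * 8 into the shift amount and the constants.  */
  static uint32_t
  form_plus (uint32_t l)
  {
    return ((l >> 13) & 0x7f8) + 0x800;
  }

  /* PLUS -> IOR is valid because 0x7f8 and 0x800 share no bits.  */
  static uint32_t
  form_ior (uint32_t l)
  {
    return ((l >> 13) & 0x7f8) | 0x800;
  }

  int
  main (void)
  {
    for (uint64_t l = 0; l <= UINT32_MAX; l += 0x10001)  /* sparse sweep */
      {
        uint32_t a = form_original ((uint32_t) l);
        assert (a == form_plus ((uint32_t) l));
        assert (a == form_ior ((uint32_t) l));
      }
    puts ("all three forms agree");
    return 0;
  }

The x86_64 size difference Jeff describes comes from the surrounding
memory reference: in the original form the + 256 and * 8 can become the
displacement and scale of the effective address and the mask is a movzbl,
while the rewritten forms leave an explicit and (and or) in the
instruction stream.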

> mips16 does slightly better with the second sequence, saving ~76 bytes on the
> included testcase.
> 
> However, given how highly dependent this is on the target's addressing modes,
> match.pd is probably not the place to attack this problem.  Combine is likely
> a better place, using either a generic splitting sequence that self-tunes via
> rtx_cost or a target-specific splitter.

True, though the idea of having target-specific match.pd bits is still
on the table - we'd have something like config/$arch/$arch.pd which we
could include from match.pd, and we could guard those patterns with
something like compile-phase == pre-RTL-expand so they get enabled only
in late GIMPLE (after the loop optimizations).  We'd add those mainly to
reduce expand's complexity and its reliance on TER to see complex
expressions for better initial instruction selection.

> The closest we get right now is this combine attempt:
> 
> (set (reg:SI 1077)
>     (plus:SI (ashift:SI (and:SI (lshiftrt:SI (reg:SI 1073)
>                     (const_int 8 [0x8]))
>                 (reg:SI 1074))
>             (const_int 2 [0x2]))
>         (const_int 1024 [0x400])))
> 
> 
> reg:SI 1074 is (const_int 255), but we can't blindly substitute it because
> reg 1074 has other uses, as seen by this attempt:
> 
> (parallel [
>         (set (reg:SI 1077)
>             (plus:SI (and:SI (ashift:SI (reg:SI 1072)
>                         (const_int 2 [0x2]))
>                     (const_int 1020 [0x3fc]))
>                 (const_int 1024 [0x400])))
>         (set (reg:SI 1074)
>             (const_int 255 [0xff]))
>     ])

Yeah, the multi-use restriction in combine is a serious limitation.
OTOH we face a similar issue in GIMPLE forwprop and all those
"artificial" single_use tests in match.pd - to do better, the
pattern detection / replacement would need to be done with a
cost model that covers all pattern applications (at least those
that have uses in common).
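
To make the multi-use point concrete, a small hypothetical C example in
the spirit of the testcase (not the actual benchmark source):

  /* On a target that keeps the 255 mask in a register (as in the MIPS
     RTL above), one register feeds several byte extractions.  Folding
     the constant into just one of them does not make that register set
     dead, so combine must keep it -- the second arm of the PARALLEL
     above -- and whether the substitution pays off depends on the
     combined cost of all the uses, which is exactly what a per-pattern
     single_use check cannot see.  */
  unsigned int
  sum_bytes (unsigned int l, const unsigned int *t)
  {
    unsigned int a = t[((l >>  8) & 255) + 256];
    unsigned int b = t[((l >> 16) & 255) + 512];
    unsigned int c = t[((l >> 24) & 255) + 768];
    return a + b + c;
  }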
