Re: Live range Analysis based on tree representations

2015-09-02 Thread Aaron Sawdey
On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> All:
> 
> The live range info on the tree SSA representation is an important step
> towards SSA-based code motion optimizations, since code motion based on the
> SSA representation affects register pressure and is a source of performance
> bottlenecks.
> 
> I am proposing a live range analysis based on the SSA representation. The
> live range analysis traverses the dominator tree. The SSA and phi variables
> are placed based on dominance frontier info, and the SSA representation
> reflects the dominance info. Based on such dominance info, an overlapping
> live range analysis can be derived.
> 
> Variable V's live range can intersect W's only if Vdef dominates Wdef (or
> vice versa). The live ranges of V and W intersect at a point p if Vdef
> dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef dominates
> Udef, then Vdef dominates Udef; thus if the live range of V intersects W,
> and the live range of W intersects U, the live range of V can intersect U.
> Such dominance info can be used to build the overlapping live range
> analysis, and register pressure is derived from the overlapping live ranges
> based on the dominator info inherited from the SSA representation. The SSA
> representation is derived based on dominance frontiers, and a traversal of
> the dominator tree over the SSA form can derive the overlapping live ranges.
> 
> The above overlapping live range info can be used to derive register
> pressure, and optimizations based on the tree representation can use the
> overlapping live ranges to take register pressure into account.

Ajit,
  I did a prototype of this kind of analysis at one point last year to
see if it could help improve inlining decisions in LTO. Basically I did
exactly what you suggest and computed the number of overlapping SSA live
ranges and used that as a proxy for register pressure. It did appear to
be able to help in some circumstances but the real solution is to
improve register allocation so it doesn't fall down when register
pressure gets high.
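
For illustration, here's a minimal standalone sketch of the counting idea
(not the lto-pressure code itself; statement indices and the interval
representation are simplifications): treat each SSA value as live from its
def to its last use, and take the maximum number of values alive at any
point as the pressure estimate.

#include <stdio.h>

/* One SSA value, live from statement index DEF to LAST_USE.  */
struct live_range { int def, last_use; };

/* Maximum number of simultaneously live values -- the pressure proxy --
   found by a brute-force sweep over statement indices.  */
static int
max_overlap (const struct live_range *lr, int n, int nstmts)
{
  int pressure = 0;
  for (int p = 0; p < nstmts; p++)
    {
      int live = 0;
      for (int i = 0; i < n; i++)
        if (lr[i].def <= p && p <= lr[i].last_use)
          live++;
      if (live > pressure)
        pressure = live;
    }
  return pressure;
}

int
main (void)
{
  /* Hypothetical block with four values.  */
  struct live_range lr[] = { {0, 3}, {1, 2}, {2, 5}, {4, 5} };
  printf ("estimated pressure: %d\n", max_overlap (lr, 4, 6)); /* prints 3 */
  return 0;
}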

The code is in a branch called lto-pressure.

  Aaron

> 
> Thanks & Regards
> Ajit
> 

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Live range Analysis based on tree representations

2015-09-04 Thread Aaron Sawdey
On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
> 
> 
> -Original Message-
> From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com] 
> Sent: Wednesday, September 02, 2015 8:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
> Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live range Analysis based on tree representations
> 
> On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> > All:
> > 
> > The live range info on the tree SSA representation is an important step
> > towards SSA-based code motion optimizations, since code motion based on
> > the SSA representation affects register pressure and is a source of
> > performance bottlenecks.
> > 
> > I am proposing a live range analysis based on the SSA representation.
> > The live range analysis traverses the dominator tree. The SSA and phi
> > variables are placed based on dominance frontier info, and the SSA
> > representation reflects the dominance info. Based on such dominance
> > info, an overlapping live range analysis can be derived.
> > 
> > Variable V's live range can intersect W's only if Vdef dominates Wdef
> > (or vice versa). The live ranges of V and W intersect at a point p if
> > Vdef dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef
> > dominates Udef, then Vdef dominates Udef; thus if the live range of V
> > intersects W, and the live range of W intersects U, the live range of V
> > can intersect U. Such dominance info can be used to build the
> > overlapping live range analysis, and register pressure is derived from
> > the overlapping live ranges based on the dominator info inherited from
> > the SSA representation. The SSA representation is derived based on
> > dominance frontiers, and a traversal of the dominator tree over the SSA
> > form can derive the overlapping live ranges.
> > 
> > The above overlapping live range info can be used to derive register
> > pressure, and optimizations based on the tree representation can use
> > the overlapping live ranges to take register pressure into account.
> 
> >> Ajit,
> >> I did a prototype of this kind of analysis at one point last year to see
> >> if it could help improve inlining decisions in LTO. Basically I did
> >> exactly what you suggest and computed the number of overlapping SSA live
> >> ranges and used that as a proxy for register pressure. It did appear to
> >> be able to help in some circumstances but the real solution is to
> >> improve register allocation so it doesn't fall down when register
> >> pressure gets high.
> 
> Aaron:
> Would you mind explaining in what circumstances it helps and when it
> won't? The live ranges on the SSA representation form chordal graphs,
> which may have different colorability requirements than the real register
> allocator sees. This may not give exact register pressure compared to the
> register allocator, since the register allocator is much further down the
> optimization and code generation pipeline, but it forms the basis of
> SSA-level optimizations that affect register pressure.
>  
> >> The code is in a branch called lto-pressure.
> 
> Thanks. I would like to see the code.

Ajit,
  The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
The analysis is in ipa-inline.c; if you search for "pressure" you'll
find the code.

The live ranges in SSA are certainly different from what the register
allocator is going to see, but it's what we have to work with at the
point where the inlining decisions are made, which is why I was looking
at it. My hope was that it would be a reasonable proxy for downstream
register pressure. 

I went and did this after seeing a particular situation in bzip2 where a
bunch of inlining done by LTO sent register pressure up a lot and
resulted in a measurable increase in loads/stores due to extra spill
code. Functions that are called in only one place and not externally
visible will be inlined regardless of their size. There is one function
in bzip2 that has particularly complex control and inlining this into
another non-trivial function caused all the excess spill code.

Setting limits on "inline functions called only once" did work, i.e.
inhibited the particular inlining that I wanted to eliminate; it just
didn't help enough to justify the complexity of the extra analysis. So I
went a step further and put the caller and callee pressure in as a term
in the edge badness used to prioritize inlining of small functions. This
seemed to help in some cases and hurt others, probably because SSA
pressure doesn't accurately represent the kind of pressure the register
allocator ultimately sees.

Re: Live range Analysis based on tree representations

2015-09-15 Thread Aaron Sawdey
On Sat, 2015-09-12 at 18:45 +, Ajit Kumar Agarwal wrote:
> 
> 
> -Original Message-
> From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com] 
> Sent: Friday, September 04, 2015 11:51 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
> Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live range Analysis based on tree representations
> 
> On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
> > 
> > 
> > -Original Message-
> > From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
> > Sent: Wednesday, September 02, 2015 8:23 PM
> > To: Ajit Kumar Agarwal
> > Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; 
> > Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
> > Mekala
> > Subject: Re: Live range Analysis based on tree representations
> > 
> > On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> > > All:
> > > 
> > > The live range info on the tree SSA representation is an important
> > > step towards SSA-based code motion optimizations, since code motion
> > > based on the SSA representation affects register pressure and is a
> > > source of performance bottlenecks.
> > > 
> > > I am proposing a live range analysis based on the SSA representation.
> > > The live range analysis traverses the dominator tree. The SSA and phi
> > > variables are placed based on dominance frontier info, and the SSA
> > > representation reflects the dominance info. Based on such dominance
> > > info, an overlapping live range analysis can be derived.
> > > 
> > > Variable V's live range can intersect W's only if Vdef dominates Wdef
> > > (or vice versa). The live ranges of V and W intersect at a point p if
> > > Vdef dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef
> > > dominates Udef, then Vdef dominates Udef; thus if the live range of V
> > > intersects W, and the live range of W intersects U, the live range of
> > > V can intersect U. Such dominance info can be used to build the
> > > overlapping live range analysis, and register pressure is derived from
> > > the overlapping live ranges based on the dominator info inherited from
> > > the SSA representation. The SSA representation is derived based on
> > > dominance frontiers, and a traversal of the dominator tree over the
> > > SSA form can derive the overlapping live ranges.
> > > 
> > > The above overlapping live range info can be used to derive register
> > > pressure, and optimizations based on the tree representation can use
> > > the overlapping live ranges to take register pressure into account.
> > 
> > >> Ajit,
> > >> I did a prototype of this kind of analysis at one point last year to
> > >> see if it could help improve inlining decisions in LTO. Basically I
> > >> did exactly what you suggest and computed the number of overlapping
> > >> SSA live ranges and used that as a proxy for register pressure. It
> > >> did appear to be able to help in some circumstances but the real
> > >> solution is to improve register allocation so it doesn't fall down
> > >> when register pressure gets high.
> > 
> > Aaron:
> > Would you mind explaining in what circumstances it helps and when it
> > won't? The live ranges on the SSA representation form chordal graphs,
> > which may have different colorability requirements than the real
> > register allocator sees. This may not give exact register pressure
> > compared to the register allocator, since the register allocator is
> > much further down the optimization and code generation pipeline, but it
> > forms the basis of SSA-level optimizations that affect register
> > pressure.
> >  
> > >> The code is in a branch called lto-pressure.
> > 
> > Thanks. I would like to see the code.
> 
> Ajit,
> >> The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
> >> The analysis is in ipa-inline.c; if you search for "pressure" you'll
> >> find the code.
> 
> >> The live ranges in SSA are certainly different from what the register
> >> allocator is going to see, but it's what we have to work with at the
> >> point where the inlining decisions are made, which is why I was looking
> >> at it. My hope was that it would be a reasonable proxy for downstream
> >> register pressure.
> 
> >> I went and did this after seeing a particular situation in bzip2 where
> >> a bunch of inlining done by LTO sent register pressure up a lot and
> >> resulted in a measurable increase in loads/stores due to extra spill
> >> code.

guessed profile counts leading to incorrect static branch hints on ppc64

2015-12-09 Thread Aaron Sawdey
So, I'm wondering if anyone can help with where the ultimate problem may
lie here. I've seen some cases where gcc generates branches with the
static branch hints bit set. It is happening because the branch
probability gets set to "always". This happens in
tree-ssa-threadupdate.c:

static void
recompute_probabilities (basic_block bb)
{
  edge esucc;
  edge_iterator ei;
  FOR_EACH_EDGE (esucc, ei, bb->succs)
{
  if (!bb->count)
continue;

  /* Prevent overflow computation due to insane profiles.  */
  if (esucc->count < bb->count)
esucc->probability = GCOV_COMPUTE_SCALE (esucc->count,
 bb->count);
      else
        {
          /* Can happen with missing/guessed probabilities, since we
             may determine that more is flowing along duplicated
             path than joiner succ probabilities allowed.
             Counts and freqs will be insane after jump threading,
             at least make sure probability is sane or we will
             get a flow verification error.
             Not much we can do to make counts/freqs sane without
             redoing the profile estimation.  */
          esucc->probability = REG_BR_PROB_BASE;
        }
}
}
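
For reference (the constants here are stated as an assumption, not quoted
from the gcov headers): REG_BR_PROB_BASE is 10000, and GCOV_COMPUTE_SCALE
(num, den) computes num * 10000 / den, rounded. So a sane edge with count
75 out of a block count of 100 gets probability 7500 (75%), while the
fallback path pins the edge at 10000, i.e. "always taken", which is what
surfaces as the static branch hint bit.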

It would appear that the guessed counts are getting changed
inconsistently before this during the tree-ssa-dom pass.

Any trail of breadcrumbs to follow through the forest would be helpful
here ...

Thanks!

   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



determining reassociation width

2016-05-02 Thread Aaron Sawdey
So, my first cut at the function to select reassociation width for
power was modeled after what I saw i386 and aarch64 doing, which is to
return something based on the number of that kind of op we can do at
the same time:

static int
rs6000_reassociation_width (unsigned int opc, enum machine_mode mode)
{
  switch (rs6000_cpu)
    {
    case PROCESSOR_POWER8:
    case PROCESSOR_POWER9:
      if (VECTOR_MODE_P (mode))
        return 2;
      if (INTEGRAL_MODE_P (mode))
        {
          if (opc == MULT_EXPR)
            return 2;
          return 6; /* correct for all integral modes? */
        }
      if (FLOAT_MODE_P (mode))
        return 2;
      /* decimal float gets default 1 */
      break;
    default:
      break;
    }

  return 1;
}

However, I think the reality of the situation is a bit more
complicated.

* If we want maximum parallelism, we should really base this on the
number of units times the latency (see the sketch after this list). I.e.
for float on p8 we have 2 units and 6 cycles latency, so we would want to
issue up to 12 fadd or fmul in parallel; then the result from the first
one would be ready for the next series of dependent ops.
* Of course this may cause massive register spills and so we can't
really make things that wide. So, reassociation ought to be aware of
how much register pressure it is creating and how much has been created
by things that want to be live across this bb. 
* Ideally we would also be aware of whether we are reassociating a tree
of fp additions whose terms are fp multiplies, because now we have
fused multiply-adds to consider. See PR 70912 for more on this.
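
Here is roughly what that first bullet would look like in code (the unit
and latency numbers are just the p8 float example from above, not a
vetted per-processor table):

static int
rs6000_reassociation_width_by_latency (unsigned int opc ATTRIBUTE_UNUSED,
                                       enum machine_mode mode)
{
  if (FLOAT_MODE_P (mode) && !DECIMAL_FLOAT_MODE_P (mode))
    {
      int units = 2;   /* two FP pipes on p8 */
      int latency = 6; /* fadd/fmul latency on p8 */
      return units * latency; /* up to 12 independent chains in flight */
    }
  return 1;
}

But as the second bullet says, blindly returning 12 would likely just
trade latency stalls for spill code.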

Suggestions?

Thanks, 
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain




cmpstrnsi pattern should check for zero byte?

2016-11-01 Thread Aaron Sawdey
I am about to put up a patch to make a change to the way
builtins.c/expand_builtin_strncmp does things. At present it will only
expand if one of the strings is a constant, and the length it passes in
is MIN(len, strlen(constant string)+1). I want to change it so that it
will also attempt to expand strncmp where neither of the input strings
is a constant.
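
For reference, a C sketch of that clamp (not the builtins.c source;
const_str and len are placeholder names):

  /* strncmp never compares past the first NUL, so with a constant
     argument the length can be clamped to strlen + 1.  */
  size_t bound = strlen (const_str) + 1;
  if (bound < len)
    len = bound;  /* len = MIN (len, strlen (const_str) + 1) */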

What I ran into is that the i386 target cmpstrnsi pattern uses repz
cmpsb which does not check for the zero byte and thus depends on
expand_builtin_strncmp to arrange things so the length is not past the
end of both strings.

SH and RX are the other two targets that have cmpstrnsi patterns.

My reading of sh_expand_cmpnstr() in sh-mem.cc is that the code that it
emits will both look for a zero byte and compare the strings so it will
not have trouble with this change to expand_builtin_strncmp.

It also looks to me like rx.md emits the "scmpu" instruction to do the
comparison. The RX manual I found shows pseudocode for scmpu indicating
that it both checks for a zero byte and compares the strings.

If this isn't correct, please let me know here or on the patch itself.

Thanks,
    Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Register Pressure guided Unroll and Jam in GCC !!

2014-06-16 Thread Aaron Sawdey
On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
> Hello All:
>  
> I have worked on the Open64 compiler, where register-pressure-guided
> unroll and jam gave a good amount of performance improvement on the C and
> C++ SPEC benchmarks and also on Fortran benchmarks.
> 
> Unroll and jam increases the register pressure in the unrolled loop,
> leading to an increase in spill and reload code that degrades the
> performance of the unrolled loop. The cache-locality benefit achieved
> through unroll and jam is undone by the spill instructions introduced by
> the increased register pressure. It is better to base the decision about
> the loop's unroll factor on a performance model of the register pressure.
> 
> Most loop optimizations like unroll and jam are implemented on the
> high-level IR. Register-pressure-based unroll and jam requires
> calculating register pressure on the high-level IR, similar to the
> register pressure we calculate at register allocation. This makes the
> implementation complex.
> 
> To overcome this, the Open64 compiler makes unrolling decisions both in
> the high-level IR and at the code generation level, with some decisions
> deferred to the very end of code generation. The advantage of Open64's
> approach is that it can use the register pressure information calculated
> by the register allocator, which makes the implementation much simpler
> and less complex.
> 
> Can we take this approach in GCC: make unroll-and-jam decisions in the
> high-level IR and also defer some of the decisions to the code generation
> level, like Open64?
> 
>  Please let me know what you think.

I have been working on calculating something analogous to register
pressure using a count of the number of live SSA values during the
ipa-inline pass. I've been working on steering inlining (especially in
LTO) away from decisions that explode the register pressure downstream,
with a similar goal of avoiding situations that cause a lot of spill
code.

I have been working in a branch if you want to take a look:
gcc/branches/lto-pressure

Aaron

> 
> Thanks & Regards
> Ajit
> 

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Register Pressure guided Unroll and Jam in GCC !!

2014-06-16 Thread Aaron Sawdey
On Mon, 2014-06-16 at 14:42 -0400, Vladimir Makarov wrote:
> On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
> > On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
> >> Hello All:
> >>
> >> I have worked on the Open64 compiler, where register-pressure-guided
> >> unroll and jam gave a good amount of performance improvement on the C
> >> and C++ SPEC benchmarks and also on Fortran benchmarks.
> >>
> >> Unroll and jam increases the register pressure in the unrolled loop,
> >> leading to an increase in spill and reload code that degrades the
> >> performance of the unrolled loop. The cache-locality benefit achieved
> >> through unroll and jam is undone by the spill instructions introduced
> >> by the increased register pressure. It is better to base the decision
> >> about the loop's unroll factor on a performance model of the register
> >> pressure.
> >>
> >> Most loop optimizations like unroll and jam are implemented on the
> >> high-level IR. Register-pressure-based unroll and jam requires
> >> calculating register pressure on the high-level IR, similar to the
> >> register pressure we calculate at register allocation. This makes the
> >> implementation complex.
> >>
> >> To overcome this, the Open64 compiler makes unrolling decisions both
> >> in the high-level IR and at the code generation level, with some
> >> decisions deferred to the very end of code generation. The advantage
> >> of Open64's approach is that it can use the register pressure
> >> information calculated by the register allocator, which makes the
> >> implementation much simpler and less complex.
> >>
> >> Can we take this approach in GCC: make unroll-and-jam decisions in the
> >> high-level IR and also defer some of the decisions to the code
> >> generation level, like Open64?
> >>
> >>   Please let me know what you think.
> >
> > I have been working on calculating something analogous to register
> > pressure using a count of the number of live SSA values during the
> > ipa-inline pass. I've been working on steering inlining (especially in
> > LTO) away from decisions that explode the register pressure downstream,
> > with a similar goal of avoiding situations that cause a lot of spill
> > code.
> >
> > I have been working in a branch if you want to take a look:
> > gcc/branches/lto-pressure
> >
> 
> Any pressure evaluation is better than its absence.  But on this level 
> it is hard to evaluate it accurately.
> 
> E.g. pressure in a loop can be high for general regs, for fp regs, or 
> both.  Using live SSA values is still too inaccurate to make the right 
> decision for the transformations.
> 

Yes, the jump I have not made yet is to classify the pressure by what
register class it might end up in. The other big piece that's
potentially missing at that point is pressure caused by temps and by
scheduling. But I think you can still get order-of-magnitude type
estimates.
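
A hypothetical shape for that classification step (the type test is a
guess at what would be needed, not code from the branch):

enum pressure_class { PC_GPR, PC_FPR, PC_NCLASSES };

/* Guess which register class an SSA value's type would land in.  */
static enum pressure_class
classify_pressure (tree type)
{
  if (SCALAR_FLOAT_TYPE_P (type))
    return PC_FPR;
  return PC_GPR;  /* integers, pointers, booleans */
}

with a separate live-value count kept per class instead of one total.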

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



[RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

2018-10-17 Thread Aaron Sawdey
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. Here are some of the
other items I have in the pipeline that I hope to get into gcc9:

* vector/vsx support for inline expansion of memcmp to non-loop code.
  This improves performance of small memcmp.
* vector/vsx support for inline expansion of memcmp to loop code. This
  will close the performance gap for lengths of about 128-512 bytes
  by making the loop code closer to the performance of the library
  memcmp.
* generate inline expansion to a loop for strcmp/strncmp. This closes
  another performance gap because the strcmp/strncmp vector/vsx code
  currently generated is much faster than the library call, but we
  only generate comparison of 64 bytes to avoid exploding code size.
  Similar code in a loop would be compact and allow comparison of
  maybe the first 512 bytes inline before dumping to the library
  function.

If anyone has any other input on the inline expansion work I've been
doing for the rs6000 target, please let me know.

Thanks!
Aaron


-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

2018-10-18 Thread Aaron Sawdey
On 10/17/18 4:03 PM, Florian Weimer wrote:
> * Aaron Sawdey:
> 
>> I've previously posted a patch to add vector/vsx inline expansion of
>> strcmp/strncmp for the power8/power9 processors. Here are some of the
>> other items I have in the pipeline that I hope to get into gcc9:
>>
>> * vector/vsx support for inline expansion of memcmp to non-loop code.
>>   This improves performance of small memcmp.
>> * vector/vsx support for inline expansion of memcmp to loop code. This
>>   will close the performance gap for lengths of about 128-512 bytes
>>   by making the loop code closer to the performance of the library
>>   memcmp.
>> * generate inline expansion to a loop for strcmp/strncmp. This closes
>>   another performance gap because strcmp/strncmp vector/vsx code
>>   currently generated is lots faster than the library call but we
>>   only generate comparison of 64 bytes to avoid exploding code size.
>>   Similar code in a loop would be compact and allow inline comparison
>>   of maybe the first 512 bytes inline before dumping to the library
>>   function.
>>
>> If anyone has any other input on the inline expansion work I've been
>> doing for the rs6000 target, please let me know.
> 
> The inline expansion of strcmp is problematic for valgrind:
> 
>   <https://bugs.kde.org/show_bug.cgi?id=386945>

I'm aware of this. One thing that will help is that I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with
current gcc9 trunk the problem should only be seen if one of the strings is
known at compile time to be less than 16 bytes, or if -mcpu=power7, or
if vector/vsx is disabled. My position is that it is valgrind's problem
if it doesn't understand correct code, but I also want valgrind to be a
useful tool so I'm going to take a look and see if I can find a gpr
sequence that is equally fast that it can understand.

> We currently see around 0.5 KiB of instructions for each call to
> strcmp.  I find it hard to believe that this improves general system
> performance except in micro-benchmarks.

The expansion of strcmp where both arguments are strings of unknown
length at compile time will compare 64 bytes then call strcmp on the
remainder if no difference is found. If the gpr sequence is used (p7
or vec/vsx disabled) then the overhead is 91 instructions. If the
p8 vsx sequence is used, the overhead is 59 instructions. If the p9
vsx sequence is used, then the overhead is 41 instructions.
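
As a C-level model of the strategy (just a sketch of the shape; the real
expansion is emitted as RTL in gpr or vsx form):

#include <string.h>

static int
strcmp_inline_64 (const char *a, const char *b)
{
  for (int i = 0; i < 64; i++)
    {
      unsigned char ca = a[i], cb = b[i];
      if (ca != cb)
        return ca - cb;  /* difference found inline */
      if (ca == '\0')
        return 0;        /* both strings ended: equal */
    }
  return strcmp (a + 64, b + 64);  /* no difference in 64 bytes */
}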

Yes, this will increase the instruction footprint. However, the processors
this targets (p7, p8, p9) all have aggressive iprefetch. Doing some of the
comparison inline makes the common cases (strings that are totally
different, or identical and <= 64 bytes in length) very much faster, and
avoiding the PLT call means less pressure on the count cache and better
branch prediction elsewhere.

If you are aware of any real world code that is faster when built
with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
so I can look at avoiding those situations.

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-14 Thread Aaron Sawdey
GCC does not currently do inline expansion of overlapping memmove, nor does it
have an expansion pattern to allow for non-overlapping memcpy, so I plan to add
patterns and support to implement this in the gcc 10 timeframe.

At present memcpy and memmove are kind of entangled. Here's the current state of
play:

memcpy -> expand with movmem pattern
memmove (no overlap) -> transform to memcpy -> expand with movmem pattern
memmove (overlap) -> remains memmove -> glibc call

There are several problems currently. If the memmove() arguments do in fact
overlap, then the expansion is not used at all, which makes no sense and
costs us the performance of a library function call instead of an inline
expansion of memmove() for small blocks.

There is currently no way to have a separate memcpy pattern. I know from
experience with expansion of memcmp on power that lengths on the order of
hundreds of bytes are needed before the function call overhead is overcome by
optimized glibc code. But we need the memcpy guarantee of non-overlapping
arguments to make that happen, as we don't want to do a runtime overlap test.

There is some analysis that happens in gimple_fold_builtin_memory_op() that
determines when memmove calls cannot have an overlap between the arguments and
converts them into memcpy() which is nice.

However in builtins.c expand_builtin_memmove() does not actually do the
expansion using the movmem pattern. This is why a memmove() call that cannot be
converted to memcpy() by gimple_fold_builtin_memory_op() is not expanded and we
call glibc memmove(). Only expand_builtin_memcpy() actually uses the movmem
pattern.

So here's my proposed set of fixes:
 * Add new optab entries for the nonoverlapping_memcpy and
   overlapping_memmove cases (sketched after this list).
 * The movmem optab will continue to be treated exactly as it is today so
   that ports that might have a broken movmem pattern that doesn't actually
   handle the overlap cases will continue to work.
 * expand_builtin_memmove() needs to actually do the memmove() expansion.
 * expand_builtin_memcpy() needs to use cpymem. Currently this happens down in
   emit_block_move_via_movmem() so some functions might need to be renamed.
 * ports can then add the new overlapping move and nonoverlapping copy expanders
   and will get better expansion of both memmove and memcpy functions.
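
For the first item, the new entries would look like the existing ones in
optabs.def, roughly as below (the names are placeholders, subject to the
naming discussion):

OPTAB_D (cpymem_optab, "cpymem$a")  /* copy, arguments must not overlap */
OPTAB_D (movmem_optab, "movmem$a")  /* move, overlap allowed (memmove)  */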

I'd be interested in any comments about pieces of this machinery that need to
work a certain way, or other related issues that should be addressed in
between expand_builtin_memcpy() and emit_block_move_via_movmem().

Thanks!
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-15 Thread Aaron Sawdey
On 5/15/19 8:10 AM, Michael Matz wrote:
> On Tue, 14 May 2019, Aaron Sawdey wrote:
> 
>> memcpy -> expand with movmem pattern
>> memmove (no overlap) -> transform to memcpy -> expand with movmem pattern
>> memmove (overlap) -> remains memmove -> glibc call
> ...
>> However in builtins.c expand_builtin_memmove() does not actually do the 
>> expansion using the movmem pattern.
> 
> Because it can't: the movmem pattern is not defined to require handling 
> overlaps, and hence can't be used for any possibly overlapping 
> memmove.  (So, in a way the pattern is misnamed and should probably have 
> been called cpymem from the beginning, alas there we are).
> 
>> So here's my proposed set of fixes:
>>  * Add new optab entries for nonoverlapping_memcpy and overlapping_memmove
>>cases.
> 
> Wouldn't it be nicer to rename the current movmem pattern to cpymem 
> wholesale for all ports (i.e. roughly a big s/movmem/cpymem/ over the 
> whole tree) and then introduce a new optional movmem pattern with 
> overlapping semantics?

Yeah that makes a lot of sense. I was unaware of that history, and was led
astray by the fact that the powerpc implementation of movmem works by
doing a bunch of loads into registers followed by a bunch of stores and
so (I think) would actually work for the overlap case.

Thanks,
   Aaron




-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-15 Thread Aaron Sawdey
On 5/15/19 7:22 AM, Richard Biener wrote:
> On Tue, May 14, 2019 at 9:21 PM Aaron Sawdey  wrote:
>> I'd be interested in any comments about pieces of this machinery that need to
>> work a certain way, or other related issues that should be addressed in
>> between expand_builtin_memcpy() and emit_block_move_via_movmem().
> 
> I wonder if introducing a __builtin_memmove_with_hints specifying whether
> src < dst or dst > src or unknown and/or a safe block size where that
> doesn't matter
> would help?  It can then be safely expanded to memmove() or to specific
> inline code.

Yes this would be a nice thing to get to, a single move/copy underlying
builtin, to which we communicate what the compiler's analysis tells us
about whether the operands overlap and by how much.

Next question would be how do we move from the existing movmem pattern
(which Michael Matz tells us should be renamed cpymem anyway) to this
new thing. Are you proposing that we still have both movmem and cpymem
optab entries underneath to call the patterns but introduce this
new memmove_with_hints() to be used by things called by expand_builtin_memmove()
and expand_builtin_memcpy()?

Thanks!
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-15 Thread Aaron Sawdey
On 5/15/19 9:02 AM, Michael Matz wrote:
> On Wed, 15 May 2019, Aaron Sawdey wrote:
>> Next question would be how do we move from the existing movmem pattern 
>> (which Michael Matz tells us should be renamed cpymem anyway) to this 
>> new thing. Are you proposing that we still have both movmem and cpymem 
>> optab entries underneath to call the patterns but introduce this new 
>> memmove_with_hints() to be used by things called by 
>> expand_builtin_memmove() and expand_builtin_memcpy()?
> 
> I'd say so.  There are multiple levels at play:
> a) exposal to user: probably a new __builtin_memmove, or a new combined 
>    builtin with a hint param to differentiate (but we can't get rid of 
>    __builtin_memcpy/mempcpy/strcpy, which all can go through the same 
>    route in the middleend)
> b) getting it through the gimple pipeline, probably just a new builtin 
>    code, trivial
> c) expanding the new builtin, with the help of next items
> d) RTL block moves: they are defined as non-overlapping and I don't think 
>    we should change this (essentially they're the reflection of struct 
>    copies in C)
> e) how any of the above (builtins and RTL block moves) are implemented: 
>    currently non-overlapping only, using movmem pattern when possible; 
>    ultimately all sitting in the emit_block_move_hints() routine.
> 
> So, I'd add a new method to emit_block_move_hints indicating possible 
> overlap, disabling the use of move_by_pieces.  Then in 
> emit_block_move_via_movmem (also getting an indication of overlap), do the 
> equivalent of:
> 
>   finished = 0;
>   if (overlap_possible) {
> if (optab[movmem])
>   finished = emit(movmem)
>   } else {
> if (optab[cpymem])
>   finished = emit(cpymem);
> if (!finished && optab[movmem])  // can use movmem also for overlap
>   finished = emit(movmem);
>   }
> 
> The overlap_possible method would only ever be used from the builtin 
> expansion, and never from the RTL block move expand.  Additionally a 
> target may optionally only define the movmem pattern if it's just as good 
> as the cpymem pattern (e.g. because it only handles fixed small sizes and 
> uses a load-all then store-all sequence).

We currently have gimple_fold_builtin_memory_op() figuring out where there
is no overlap and converting __builtin_memmove() to __builtin_memcpy(). Would
you foresee looking for converting __builtin_memmove() with overlap into
a call to __builtin_memmove_hint if it is a case where we can define the
overlap precisely enough to provide the hint? My guess is that this wouldn't
be a very common case.

My goals for this are:
 * memcpy() call becomes __builtin_memcpy and goes to optab[cpymem]
 * memmove() call becomes __builtin_memmove (or __builtin_memcpy based
   on the gimple analysis) and goes through optab[movmem] or optab[cpymem]

I think what you've described meets these goals and cleans things up.

Thanks,
Aaron


-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-15 Thread Aaron Sawdey
On 5/15/19 11:31 AM, Jakub Jelinek wrote:
> On Wed, May 15, 2019 at 11:23:54AM -0500, Aaron Sawdey wrote:
>> My goals for this are:
>>  * memcpy() call becomes __builtin_memcpy and goes to optab[cpymem]
>>  * memmove() call becomes __builtin_memmove (or __builtin_memcpy based
>>on the gimple analysis) and goes through optab[movmem] or optab[cpymem]
> 
> Except for the becomes part (the memcpy call is the same thing as
> __builtin_memcpy in the middle-end; all you care about is whether it is
> BUILT_IN_MEMCPY etc. and whether it has compatible arguments), and for the
> missing optab[movmem] part and movmem->cpymem renaming, isn't that what we
> have already?

Yes. I was just trying to state what I wanted it to become, some of which
is already present. So I think I will start working on two patches:

1) rename optab movmem and the underlying patterns to cpymem.
2) add a new optab movmem that is really memmove() and add support for
having __builtin_memmove() use it.

Handling of the partial overlap case can be a separate piece of work.

Thanks,
   Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain


Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy

2019-05-15 Thread Aaron Sawdey
On 5/15/19 1:01 PM, Jakub Jelinek wrote:
> On Wed, May 15, 2019 at 12:59:01PM -0500, Aaron Sawdey wrote:
>> 1) rename optab movmem and the underlying patterns to cpymem.
>> 2) add a new optab movmem that is really memmove() and add support for
>> having __builtin_memmove() use it.
>>
>> Handling of the partial overlap case can be a separate piece of work.
> 
> That 1) and 2) can be also separate pieces of work ;).
Exactly -- make things as easy as possible when I go begging for reviewers :-)

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: k-byte memset/memcpy/strlen builtins

2017-01-11 Thread Aaron Sawdey
On Wed, 2017-01-11 at 17:16 +0100, Robin Dapp wrote:
> Hi,

Hi Robin,
  I thought I'd share some of what I've run into while doing similar
things for the rs6000 target.

First off, be aware that glibc does some macro expansion things to try
to handle 1/2/3 byte string operations in some cases.

Secondly, the way I approached this was to use the patterns 
defined in optabs.def for these things:

OPTAB_D (cmpmem_optab, "cmpmem$a")
OPTAB_D (cmpstr_optab, "cmpstr$a")
OPTAB_D (cmpstrn_optab, "cmpstrn$a")
OPTAB_D (movmem_optab, "movmem$a")
OPTAB_D (setmem_optab, "setmem$a")
OPTAB_D (strlen_optab, "strlen$a")

If you define movmemsi, that should get used by expand_builtin_memcpy
for any memcpy call that it sees.

The constraints I was able to find when implementing cmpmemsi for
memcmp were:
 * don't compare past the given length (obviously)
 * don't read past the given length
 * except it's ok to do so if you can prove via alignment or a
   runtime check that you are not going to cause a page fault.
   Not crossing a 4k boundary seems to be generally viewed as
   acceptable (see the sketch after this list).
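
The runtime check mentioned in the last item can be as simple as this
sketch (assumes a 4k page size; the real expansion emits the equivalent
test as RTL):

#include <stddef.h>
#include <stdint.h>

/* True if a WIDTH-byte load at P cannot cross into the next 4k page,
   and therefore cannot fault if P itself is readable.  */
static int
load_stays_on_page (const void *p, size_t width)
{
  return ((uintptr_t) p & 4095) + width <= 4096;
}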

I would recommend looking at preprocessed code to make sure no funny
business is happening, and then look at your .md files. It looks to me
like s390 has got both movmem and strlen patterns there already.

If I understand correctly you are wanting to do multi-byte characters.
Seems to me you need to follow the path Richard Biener suggests and
make optab expansions that handle wider chars and then perhaps map
wcslen et al. to them?

   Aaron
> 
> For 1-byte memset/memcpy the builtin functions provide a straightforward
> way to achieve this. At first sight it seemed possible to extend
> tree-loop-distribution.c to include the additional variants we need.
> However, multibyte memsets/memcpys are not covered by the C standard and
> I'm therefore unsure if such an approach is preferable or if there are
> more idiomatic ways or places where to add the functionality.
> 
> The same question goes for 2-byte strlen. I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length
> in tree-ssa-strlen.c). Would it make sense to include strlen recognition
> and subsequently handling for 2-byte strlen? The situation might of
> course be more complicated than memset because of encodings etc. My
> snippet in question used a fixed-length encoding of 2 bytes, however.
> 
> Another simple idea to tackle this would be a peephole optimization but
> I'm not sure if this is really feasible for something like memset.
> Wouldn't the peephole have to be recursive then?
> 
> Regards
>  Robin
> 
-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Re: help with PR78809 - inline strcmp for small constant strings

2017-08-05 Thread Aaron Sawdey
On Fri, 2017-08-04 at 17:30 -0500, Segher Boessenkool wrote:
> On Fri, Aug 04, 2017 at 08:38:11PM +, Wilco Dijkstra wrote:
> > Richard Henderson wrote:
> > > On 08/04/2017 05:59 AM, Prathamesh Kulkarni wrote:
> > > > For i386, it seems strcmp is expanded inline via cmpstr optab
> > > > by 
> > > > expand_builtin_strcmp if one of the strings is constant. Could
> > > > we similarly
> > > > define cmpstr pattern for AArch64?
> > > 
> > > Certainly that's possible.
> > 
> > I'd suggest doing it in a target-independent way; this is not a
> > target-specific optimization and shouldn't be done in the target
> > unless there are special strcmp instructions.
> 
> See rs6000-string.c; cc:ing Aaron.

I think I'd argue that even if there isn't an instruction that does
strcmp (e.g. repz cmpsb) you are still better off with target-specific
code. For example, on power one needs to be aware of how the different
processors deal with unaligned loads, and that power7 for example
doesn't like overlapping unaligned loads which are fine on power8. Also,
instructions like cmpb are useful and I don't really see how a target
independent expansion would end up using it.

> > > For constant strings of small length (up to 3?), I was wondering
> > > if it'd be a good idea to manually unroll the strcmp loop, similar
> > > to the __strcmp_* macros in bits/string.h?
> > > For eg in gimple-fold, transform
> > > x = __builtin_strcmp(s, "ab")
> > > to
> > > x = s[0] - 'a';
> > > if (x == 0)
> > > {
> > >    x = s[1] - 'b';
> > >    if (x == 0)
> > >  x = s[2];
> > > }

There was code to manually unroll strcmp/strncmp in
string/bits/string2.h in GLIBC but that was recently removed:

https://sourceware.org/git/?p=glibc.git;a=commit;h=f7db120f67d853e0cfa272fa39c8b9be67c0a935

Personally I'm glad to see it go not only because of the expansion into
individual comparisons of the first couple characters, but also because
it expanded strcmp/strncmp to __builtin_strcmp/strncmp which meant that
you could not disable the gcc expansions of strcmp/strncmp by using 
-fno-builtin-strcmp/strncmp.

> > 
> > If there is already code that does something similar (see comment #1
> > in PR78809), it could be easily adapted to handle more cases.
> > 
> > >   if (memcmp(s, "ab", 3) != 0)
> > > 
> > > to be implemented with cmp+ccmp+ccmp and one branch.
> > 
> > Even better would be wider loads if you either know the alignment
> > of s or its max size (although given the overhead of creating the
> > return value, that works best for equality).
> 
> All those things are handled there, too.  Most things that can really
> help
> are quite target-specific, I think.

The same thing goes for memcmp: you really need to know how the target
handles aligned/unaligned loads, and for example on power9 it is handy to
be able to use setb to produce the correct SImode result even if you just
did a DImode compare of 64 bits.

> 
> 
> Segher
> 

Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain



Avoiding truncate/sign-extend of POImode on ppc target

2020-09-02 Thread Aaron Sawdey via Gcc


PR96791 is happening because DSE is trying to truncate a
POImode reg down to DImode. The POImode is created by a
structure copy that gets inline expanded using lxvp/stxvp
which we have defined using POImode. DSE recognizes that a
following load overlaps with the stxvp and can be satisfied
by a truncation of the POImode reg used for the stxvp.

The intention of using POImode was to avoid having to fully
define OImode and it seems like defining these truncations
would be the first step along that path.  If I haven't
misrepresented it, I believe that is Segher's position on
this.

The function extract_low_bits() in expmed.c does some
querying of the target hook modes_tieable_p but not in any
way that allows us to tell it not to allow truncation of
POImode to a smaller mode.  As a result we run into the ICE
because the patterns are not provided to do this.
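
To make the shape of the problem concrete, the naive thing one might try
is to have the rs6000 modes_tieable_p hook refuse to tie POImode to
anything narrower, roughly as below (hypothetical sketch, not a proposed
patch). But as noted above, extract_low_bits() does not consult the hook
in a way that lets this veto the truncation, which is why a new or
extended hook may be needed.

static bool
rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
{
  /* POImode ties only with itself, so nothing may be reinterpreted
     as a narrower slice of a POImode reg.  */
  if ((mode1 == POImode) != (mode2 == POImode))
    return false;
  /* ... existing tieability rules for other modes ...  */
  return true;
}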

So, I put it to the community: Is there an existing way to do
this?  Or, do we need another target hook of some kind to
check this sort of thing?

Thanks,
    Aaron


Aaron Sawdey, Ph.D. saw...@linux.ibm.com
IBM Linux on POWER Toolchain
 



Re: Avoiding truncate/sign-extend of POImode on ppc target

2020-09-02 Thread Aaron Sawdey via Gcc
Meant to CC a few people, oops.

Aaron Sawdey, Ph.D. saw...@linux.ibm.com
IBM Linux on POWER Toolchain
 

> On Sep 2, 2020, at 9:22 AM, Aaron Sawdey via Gcc  wrote:
> 
> 
> PR96791 is happening because DSE is trying to truncate a
> POImode reg down to DImode. The POImode is created by a
> structure copy that gets inline expanded using lxvp/stxvp
> which we have defined using POImode. DSE recognizes that a
> following load overlaps with the stxvp and can be satisfied
> by a truncation of the POImode reg used for the stxvp.
> 
> The intention of using POImode was to avoid having to fully
> define OImode and it seems like defining these truncations
> would be the first step along that path.  If I haven't
> misrepresented it, I believe that is Segher's position on
> this.
> 
> The function extract_low_bits() in expmed.c does some
> querying of the target hook modes_tieable_p but not in any
> way that allows us to tell it not to allow truncation of
> POImode to a smaller mode.  As a result we run into the ICE
> because the patterns are not provided to do this.
> 
> So, I put it to the community: Is there an existing way to do
> this?  Or, do we need another target hook of some kind to
> check this sort of thing?
> 
> Thanks,
>Aaron
> 
> 
> Aaron Sawdey, Ph.D. saw...@linux.ibm.com
> IBM Linux on POWER Toolchain
> 
> 



Add ops_num to targetm.sched.reassociation_width hook

2021-08-03 Thread Aaron Sawdey via Gcc
Richard,

So, I’m noticing that in get_reassociation_width() we know how many ops 
(ops_num) are in the expression being considered for parallel reassociation, 
but this is not passed to the target hook. In my testing this seems like it 
might be useful to have. If you determine the maximum width that gives 
additional speedup for a large number of terms, and then use that as the width 
from the target hook, get_reassociation_width() is more aggressive than you 
would like for small expressions with maybe 4-16 terms and produces code that 
is slower than optimal. For example in many cases you want to continue using a 
width of 1 until you get to 16 terms or so. My testing shows this to be the 
case for power8, power9, and power10 processors. 
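
The hook change being floated would look roughly like this (the 16-term
cutoff is the example number from the measurements above, and
rs6000_max_reassoc_width is a hypothetical helper; this is a sketch, not
a posted patch):

static int
rs6000_reassociation_width (unsigned int opc, machine_mode mode,
                            int ops_num)
{
  /* Small expressions: a single serial chain wins.  */
  if (ops_num < 16)
    return 1;
  /* Larger expressions: use the measured per-mode maximum width.  */
  return rs6000_max_reassoc_width (opc, mode);
}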

So, I’m wondering how it might be received if I posted a patch that adds this 
to the reassociation_width target hook (and of course fixes all uses of that 
target hook)?

Thanks!
   Aaron


Aaron Sawdey, Ph.D. saw...@linux.ibm.com
IBM Linux on POWER Toolchain