Re: Live range Analysis based on tree representations
On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> All:
>
> The Live ranges info on tree SSA representation is an important step towards
> the SSA based code motion optimizations, as code motion based on the SSA
> representation affects register pressure and is a reason for performance
> bottlenecks.
>
> I am proposing a Live range Analysis based on the SSA representation. The
> live range analysis traverses the dominator tree. The SSA and phi variables
> are represented based on dominance frontier info, and the SSA representation
> reflects the dominance info. Based on such dominance info, a Live Range
> Overlapping Analysis can be derived.
>
> Variable V intersects W if Vdef dominates Wdef. Variables V and W intersect
> at point P if Vdef dominates P and Wdef dominates P. If Vdef dominates Wdef
> and Wdef dominates Udef, then Vdef dominates Udef; thus the live range of V
> intersects W, the live range of W intersects U, and therefore the live range
> of V intersects U. Such dominance info can be used to represent the
> Overlapping Live range Analysis, and register pressure is derived from the
> overlapping live ranges based on the dominator info inherited from the SSA
> representation. The SSA representation is derived from dominance frontiers,
> and the traversal of the dominator tree based on SSA can derive the
> overlapping live ranges.
>
> The above overlapping live range info can be used to derive register
> pressure, and optimizations based on the tree representation can use the
> overlapping live ranges to take register pressure into account.

Ajit,
  I did a prototype of this kind of analysis at one point last year to see if
it could help improve inlining decisions in LTO. Basically I did exactly what
you suggest and computed the number of overlapping SSA live ranges and used
that as a proxy for register pressure. It did appear to be able to help in
some circumstances, but the real solution is to improve register allocation so
it doesn't fall down when register pressure gets high.

The code is in a branch called lto-pressure.

  Aaron

> Thanks & Regards
> Ajit

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
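The dominance-based overlap counting discussed in this exchange can be shown
with a small, self-contained sketch. This is not GCC code and is far simpler
than what the lto-pressure branch does; the dominator tree, the def counts,
and the last-use counts are hard-coded purely for illustration, and phis, back
edges, and per-statement granularity are all ignored.

/* Minimal sketch: walk a dominator tree depth-first and track how many SSA
   names are live at once.  A name is counted from the block that defines it
   down through the dominator subtree, and dropped once the block holding its
   last use has been handled.  */
#include <stdio.h>

#define MAX_BLOCKS 8

struct block
{
  int ndefs;                 /* SSA names defined in this block.      */
  int nkills;                /* SSA names whose last use is here.     */
  int nchildren;             /* children in the dominator tree        */
  int children[MAX_BLOCKS];
};

static struct block blocks[MAX_BLOCKS];
static int max_pressure;

/* LIVE is the number of names live on entry to BB, i.e. defined in a
   dominator of BB and not yet killed there.  */
static void
walk (int bb, int live)
{
  live += blocks[bb].ndefs;
  if (live > max_pressure)
    max_pressure = live;
  live -= blocks[bb].nkills;
  for (int i = 0; i < blocks[bb].nchildren; i++)
    walk (blocks[bb].children[i], live);
}

int
main (void)
{
  /* Block 0 dominates blocks 1 and 2.  Three names are defined in block 0;
     two of them die in block 1, the third dies in block 2, where one more
     name is defined and also dies.  */
  blocks[0] = (struct block) { 3, 0, 2, { 1, 2 } };
  blocks[1] = (struct block) { 0, 2, 0, { 0 } };
  blocks[2] = (struct block) { 1, 2, 0, { 0 } };
  walk (0, 0);
  printf ("max overlapping SSA live ranges: %d\n", max_pressure);
  return 0;
}

For this hard-coded example the answer is 4: the two values that die in block
1 are still counted as live in the sibling block 2, because their definitions
in block 0 dominate both children.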
Re: Live range Analysis based on tree representations
On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
> -----Original Message-----
> From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
> Sent: Wednesday, September 02, 2015 8:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org;
>     Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live range Analysis based on tree representations
>
> >> Ajit,
> >> I did a prototype of this kind of analysis at one point last year to see
> >> if it could help improve inlining decisions in LTO. Basically I did
> >> exactly what you suggest and computed the number of overlapping SSA live
> >> ranges and used that as a proxy for register pressure. It did appear to
> >> be able to help in some circumstances but the real solution is to improve
> >> register allocation so it doesn't fall down when register pressure gets
> >> high.
>
> Aaron:
> Would you mind explaining in what circumstances it helps and when it won't?
> The live ranges on the SSA representation form chordal graphs that might
> have different colorability requirements than the real register allocator.
> This may not give the exact register pressure compared to the register
> allocator, as register allocation is further down the optimization and code
> generation pipeline, but it forms the basis of SSA-based optimizations that
> affect register pressure.
>
> >> The code is in a branch called lto-pressure.
>
> Thanks. I would like to see the code.

Ajit,
  The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
The analysis is in ipa-inline.c; if you search for "pressure" you'll find the
code.

The live ranges in SSA are certainly different than what the register
allocator is going to see, but it's what we have to work with at the point
where the inlining decisions are made, which is why I was looking at it. My
hope was that it would be a reasonable proxy for downstream register pressure.

I went and did this after seeing a particular situation in bzip2 where a bunch
of inlining done by LTO sent register pressure up a lot and resulted in a
measurable increase in loads/stores due to extra spill code. Functions that
are called in only one place and not externally visible will be inlined
regardless of their size. There is one function in bzip2 that has particularly
complex control flow, and inlining this into another non-trivial function
caused all the excess spill code. Setting limits on "inline functions called
only once" did work, i.e. it inhibited the particular inlining that I wanted
to eliminate; it just didn't help enough to justify the complexity of the
extra analysis. So I went a step further and put the caller and callee
pressure in as a term in the edge badness used to prioritize inlining of small
functions. This seemed to help in some cases and hurt others, probably because
SSA pressure doesn't accurately represent the kind
guessed profile counts leading to incorrect static branch hints on ppc64
So, I'm wondering if anyone can help with where the ultimate problem may lie
here. I've seen some cases where gcc generates branches with the static branch
hints bit set. It is happening because the branch probability gets set to
"always". This happens in tree-ssa-threadupdate.c:

static void
recompute_probabilities (basic_block bb)
{
  edge esucc;
  edge_iterator ei;

  FOR_EACH_EDGE (esucc, ei, bb->succs)
    {
      if (!bb->count)
        continue;

      /* Prevent overflow computation due to insane profiles.  */
      if (esucc->count < bb->count)
        esucc->probability = GCOV_COMPUTE_SCALE (esucc->count, bb->count);
      else
        {
          /* Can happen with missing/guessed probabilities, since we
             may determine that more is flowing along duplicated
             path than joiner succ probabilities allowed.
             Counts and freqs will be insane after jump threading,
             at least make sure probability is sane or we will
             get a flow verification error.
             Not much we can do to make counts/freqs sane without
             redoing the profile estimation.  */
          esucc->probability = REG_BR_PROB_BASE;
        }
    }
}

It would appear that the guessed counts are getting changed inconsistently
before this during the tree-ssa-dom pass. Any trail of breadcrumbs to follow
through the forest would be helpful here...

Thanks!
  Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
determining reassociation width
So, my first cut at the function to select reassociation width for power was
modeled after what I saw i386 and aarch64 doing, which is to return something
based on the number of that kind of op we can do at the same time:

static int
rs6000_reassociation_width (unsigned int opc, enum machine_mode mode)
{
  switch (rs6000_cpu)
    {
    case PROCESSOR_POWER8:
    case PROCESSOR_POWER9:
      if (VECTOR_MODE_P (mode))
        return 2;
      if (INTEGRAL_MODE_P (mode))
        {
          if (opc == MULT_EXPR)
            return 2;
          return 6;  /* correct for all integral modes? */
        }
      if (FLOAT_MODE_P (mode))
        return 2;
      /* decimal float gets default 1 */
      break;
    default:
      break;
    }
  return 1;
}

However, the reality of the situation is a bit more complicated, I think:

* If we want maximum parallelism, we should really base this on the number of
  units times the latency. I.e. for float on p8 we have 2 units and 6 cycles
  latency, so we would want to issue up to 12 fadd or fmul in parallel; then
  the result from the first one would be ready for the next series of
  dependent ops.

* Of course this may cause massive register spills, so we can't really make
  things that wide. Reassociation ought to be aware of how much register
  pressure it is creating and how much has been created by things that want
  to be live across this bb.

* Ideally we would also be aware of whether we are reassociating a tree of fp
  additions whose terms are fp multiplies, because now we have fused
  multiply-adds to consider. See PR 70912 for more on this.

Suggestions?

Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
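The units-times-latency rule of thumb from the first bullet above can be
written down as a tiny helper. This is illustrative only, not proposed rs6000
code; the unit count, latency, and cap are parameters precisely because the
right numbers are per-processor.

/* Illustrative only: pick a reassociation width from pipeline parameters.
   UNITS and LATENCY would come from the processor model (e.g. 2 units and
   6-cycle latency for p8 float); MAX_WIDTH is an assumed cap meant to keep
   register pressure in check.  */
static int
width_from_pipeline (int units, int latency, int max_width)
{
  int width = units * latency;  /* independent ops in flight before the
                                   first result is available again */
  return width < max_width ? width : max_width;
}

With the p8 float numbers above (2 units, 6-cycle latency) this yields 12,
which is exactly where the second bullet's concern kicks in: without a cap
informed by register pressure, the width the pipeline could exploit is wider
than the register file can sustain.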
cmpstrnsi pattern should check for zero byte?
I am about to put up a patch to make a change to the way
builtins.c/expand_builtin_strncmp does things. At present it will only expand
if one of the strings is a constant, and the length it passes in is
MIN(len, strlen(constant string)+1). I want to change it so that it will also
attempt to expand strncmp where neither of the input strings is a constant.

What I ran into is that the i386 target cmpstrnsi pattern uses repz cmpsb,
which does not check for the zero byte and thus depends on
expand_builtin_strncmp to arrange things so the length is not past the end of
both strings.

SH and RX are the other two targets that have cmpstrnsi patterns. My reading
of sh_expand_cmpnstr() in sh-mem.cc is that the code that it emits will both
look for a zero byte and compare the strings, so it will not have trouble with
this change to expand_builtin_strncmp.

It also looks to me like rx.md emits the "scmpu" instruction to do the
comparison. The RX manual I found showed pseudocode for scmpu that shows it
both checks for a zero byte as well as comparing the strings.

If this isn't correct, please let me know here or on the patch itself.

Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
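For reference, the MIN(len, strlen(constant string)+1) bound mentioned above
can be modeled in plain C. This is only an illustration of the semantics, not
the builtins.c code; the helper name is made up, and the test buffers are
padded so the bounded compare never reads past an object, which is the same
concern that makes the repz cmpsb behavior matter.

#include <stdio.h>
#include <string.h>

/* Model of the current expansion strategy: when the second argument is a
   constant, comparing only MIN (len, strlen (cst) + 1) bytes gives the same
   zero/nonzero answer as strncmp, because the extra byte covers the
   terminating NUL of the constant string.  */
static int
bounded_cmp (const char *s, const char *cst, size_t len)
{
  size_t n = strlen (cst) + 1;
  if (len < n)
    n = len;
  return memcmp (s, cst, n);
}

int
main (void)
{
  /* Each buffer is 4 bytes so a 3-byte memcmp never reads past an object.  */
  char tests[][4] = { "ab", "abc", "aa", "a", "" };
  for (int i = 0; i < 5; i++)
    printf ("\"%s\": strncmp=%d bounded=%d\n", tests[i],
            strncmp (tests[i], "ab", 100) != 0,
            bounded_cmp (tests[i], "ab", 100) != 0);
  return 0;
}

Either the strings differ somewhere in the first n bytes, or they match
through the constant's terminating NUL and are therefore equal; in both cases
nothing past the end of the constant string needs to be examined.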
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler, where Register Pressure Guided Unroll
> and Jam gave a good amount of performance improvement for the C and C++ SPEC
> benchmarks and also Fortran benchmarks.
>
> Unroll and Jam increases the register pressure in the unrolled loop, leading
> to an increase in spills and fetches that degrades the performance of the
> unrolled loop. The cache-locality benefit achieved through Unroll and Jam is
> degraded by the presence of spill instructions due to the increase in
> register pressure. It is better to base the decision on the unroll factor of
> the loop on a performance model of the register pressure.
>
> Most loop optimizations like Unroll and Jam are implemented in the
> high-level IR. Register-pressure-based Unroll and Jam requires the
> calculation of register pressure in the high-level IR, which will be similar
> to the register pressure we calculate at register allocation. This makes the
> implementation complex.
>
> To overcome this, the Open64 compiler makes the unrolling decision both in
> the high-level IR and at the code generation level, deferring some of the
> decisions to the end of code generation. The advantage of this approach is
> that it can use the register pressure information calculated by the register
> allocator, which makes the implementation simpler and less complex.
>
> Can we have this approach in GCC, making the Unroll and Jam decisions in the
> high-level IR and also deferring some of the decisions to the code
> generation level like Open64?
>
> Please let me know what you think.

I have been working on calculating something analogous to register pressure
using a count of the number of live SSA values during the ipa-inline pass.
I've been working on steering inlining (especially in LTO) away from decisions
that explode the register pressure downstream, with a similar goal of avoiding
situations that cause a lot of spill code.

I have been working in a branch if you want to take a look:
gcc/branches/lto-pressure

  Aaron

> Thanks & Regards
> Ajit

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:42 -0400, Vladimir Makarov wrote:
> On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
> > I have been working on calculating something analogous to register
> > pressure using a count of the number of live SSA values during the
> > ipa-inline pass. I've been working on steering inlining (especially in
> > LTO) away from decisions that explode the register pressure downstream,
> > with a similar goal of avoiding situations that cause a lot of spill
> > code.
> >
> > I have been working in a branch if you want to take a look:
> > gcc/branches/lto-pressure
>
> Any pressure evaluation is better than its absence. But at this level it is
> hard to evaluate it accurately.
>
> E.g. pressure in a loop can be high for general regs, for fp regs, or both.
> Using live SSA values is still very inaccurate for making the right decision
> about the transformations.

Yes, the jump I have not made yet is to classify the pressure by what register
class it might end up in. The other big piece that's potentially missing at
that point is pressure caused by temps and by scheduling. But I think you can
still get order-of-magnitude type estimates.

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
[RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. Here are some of the other
items I have in the pipeline that I hope to get into gcc9:

* vector/vsx support for inline expansion of memcmp to non-loop code. This
  improves performance of small memcmp.

* vector/vsx support for inline expansion of memcmp to loop code. This will
  close the performance gap for lengths of about 128-512 bytes by making the
  loop code closer to the performance of the library memcmp.

* generate inline expansion to a loop for strcmp/strncmp. This closes another
  performance gap because the strcmp/strncmp vector/vsx code currently
  generated is lots faster than the library call, but we only generate a
  comparison of 64 bytes to avoid exploding code size. Similar code in a loop
  would be compact and allow inline comparison of maybe the first 512 bytes
  before dumping to the library function.

If anyone has any other input on the inline expansion work I've been doing for
the rs6000 target, please let me know.

Thanks!
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc
On 10/17/18 4:03 PM, Florian Weimer wrote:
> The inline expansion of strcmp is problematic for valgrind:
>
> <https://bugs.kde.org/show_bug.cgi?id=386945>

I'm aware of this. One thing that will help is that I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with current gcc9
trunk the problem should only be seen if one of the strings is known at
compile time to be less than 16 bytes, or if -mcpu=power7, or if vector/vsx is
disabled. My position is that it is valgrind's problem if it doesn't
understand correct code, but I also want valgrind to be a useful tool, so I'm
going to take a look and see if I can find a gpr sequence that is equally fast
that it can understand.

> We currently see around 0.5 KiB of instructions for each call to strcmp. I
> find it hard to believe that this improves general system performance except
> in micro-benchmarks.

The expansion of strcmp where both arguments are strings of unknown length at
compile time will compare 64 bytes, then call strcmp on the remainder if no
difference is found. If the gpr sequence is used (p7, or vec/vsx disabled),
the overhead is 91 instructions. If the p8 vsx sequence is used, the overhead
is 59 instructions. If the p9 vsx sequence is used, the overhead is 41
instructions. Yes, this will increase the instruction footprint. However, the
processors this targets (p7, p8, p9) all have aggressive iprefetch. Doing some
of the comparison inline makes the common cases of strings being totally
different, or identical and <= 64 bytes in length, very much faster, and
avoiding the plt call also means less pressure on the count cache and better
branch prediction elsewhere.

If you are aware of any real world code that is faster when built with
-fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know so I can
look at avoiding those situations.

  Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Fixing inline expansion of overlapping memmove and non-overlapping memcpy
GCC does not currently do inline expansion of overlapping memmove, nor does it
have an expansion pattern to allow for non-overlapping memcpy, so I plan to
add patterns and support to implement this in the gcc 10 timeframe. At present
memcpy and memmove are kind of entangled. Here's the current state of play:

  memcpy               -> expand with movmem pattern
  memmove (no overlap) -> transform to memcpy -> expand with movmem pattern
  memmove (overlap)    -> remains memmove -> glibc call

There are several problems currently. If the memmove() arguments are in fact
overlapping, then the expansion is not used at all, which makes no sense and
costs the performance of a library call instead of inline expanding memmove()
of small blocks.

There is currently no way to have a separate memcpy pattern. I know from
experience with expansion of memcmp on power that lengths on the order of
hundreds of bytes are needed before the function call overhead is overcome by
optimized glibc code. But we need the memcpy guarantee of non-overlapping
arguments to make that happen, as we don't want to do a runtime overlap test.

There is some analysis that happens in gimple_fold_builtin_memory_op() that
determines when memmove calls cannot have an overlap between the arguments and
converts them into memcpy(), which is nice. However, in builtins.c
expand_builtin_memmove() does not actually do the expansion using the memmove
pattern. This is why a memmove() call that cannot be converted to memcpy() by
gimple_fold_builtin_memory_op() is not expanded and we call glibc memmove().
Only expand_builtin_memcpy() actually uses the movmem pattern.

So here's my proposed set of fixes:

* Add new optab entries for the nonoverlapping_memcpy and overlapping_memmove
  cases.
* The movmem optab will continue to be treated exactly as it is today, so that
  ports that might have a broken movmem pattern that doesn't actually handle
  the overlap cases will continue to work.
* expand_builtin_memmove() needs to actually do the memmove() expansion.
* expand_builtin_memcpy() needs to use cpymem. Currently this happens down in
  emit_block_move_via_movmem(), so some functions might need to be renamed.
* Ports can then add the new overlapping move and nonoverlapping copy
  expanders and will get better expansion of both memmove and memcpy
  functions.

I'd be interested in any comments about pieces of this machinery that need to
work a certain way, or other related issues that should be addressed in
between expand_builtin_memcpy() and emit_block_move_via_movmem().

Thanks!
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
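As an aside, the runtime overlap test that the proposal wants to avoid
emitting is conceptually just the interval check below. This is a sketch only;
in strict C the pointer comparisons are only defined when both pointers point
into the same object, which is also the only case where the answer matters.

#include <stddef.h>

/* Sketch of the runtime check the proposal wants to avoid emitting before
   every copy: the regions [d, d+n) and [s, s+n) overlap iff each one starts
   before the other ends.  */
static int
regions_overlap (const char *d, const char *s, size_t n)
{
  return d < s + n && s < d + n;
}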
Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy
On 5/15/19 8:10 AM, Michael Matz wrote:
> On Tue, 14 May 2019, Aaron Sawdey wrote:
>
>> memcpy               -> expand with movmem pattern
>> memmove (no overlap) -> transform to memcpy -> expand with movmem pattern
>> memmove (overlap)    -> remains memmove -> glibc call
> ...
>> However in builtins.c expand_builtin_memmove() does not actually do the
>> expansion using the memmove pattern.
>
> Because it can't: the movmem pattern is not defined to require handling
> overlaps, and hence can't be used for any possibly overlapping memmove.
> (So, in a way the pattern is misnamed and should probably have been called
> cpymem from the beginning, alas there we are).
>
>> So here's my proposed set of fixes:
>> * Add new optab entries for the nonoverlapping_memcpy and
>>   overlapping_memmove cases.
>
> Wouldn't it be nicer to rename the current movmem pattern to cpymem
> wholesale for all ports (i.e. roughly a big s/movmem/cpymem/ over the whole
> tree) and then introduce a new optional movmem pattern with overlapping
> semantics?

Yeah, that makes a lot of sense. I was unaware of that history, and was led
astray by the fact that the powerpc implementation of movmem works by doing a
bunch of loads into registers followed by a bunch of stores, and so (I think)
would actually work for the overlap case.

Thanks,
    Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy
On 5/15/19 7:22 AM, Richard Biener wrote:
> On Tue, May 14, 2019 at 9:21 PM Aaron Sawdey wrote:
>> I'd be interested in any comments about pieces of this machinery that need
>> to work a certain way, or other related issues that should be addressed in
>> between expand_builtin_memcpy() and emit_block_move_via_movmem().
>
> I wonder if introducing a __builtin_memmove_with_hints specifying whether
> src < dst or dst > src or unknown and/or a safe block size where that
> doesn't matter would help?  It can then be safely expanded to memmove() or
> to specific inline code.

Yes, this would be a nice thing to get to: a single move/copy underlying
builtin, to which we communicate what the compiler's analysis tells us about
whether the operands overlap and by how much.

Next question would be how we move from the existing movmem pattern (which
Michael Matz tells us should be renamed cpymem anyway) to this new thing. Are
you proposing that we still have both movmem and cpymem optab entries
underneath to call the patterns, but introduce this new memmove_with_hints()
to be used by things called by expand_builtin_memmove() and
expand_builtin_memcpy()?

Thanks!
    Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy
On 5/15/19 9:02 AM, Michael Matz wrote:
> On Wed, 15 May 2019, Aaron Sawdey wrote:
>> Next question would be how do we move from the existing movmem pattern
>> (which Michael Matz tells us should be renamed cpymem anyway) to this new
>> thing. Are you proposing that we still have both movmem and cpymem optab
>> entries underneath to call the patterns but introduce this new
>> memmove_with_hints() to be used by things called by
>> expand_builtin_memmove() and expand_builtin_memcpy()?
>
> I'd say so.  There are multiple levels at play:
> a) exposal to user: probably a new __builtin_memmove, or a new combined
>    builtin with a hint param to differentiate (but we can't get rid of
>    __builtin_memcpy/mempcpy/strcpy, which all can go through the same route
>    in the middleend)
> b) getting it through the gimple pipeline, probably just a new builtin
>    code, trivial
> c) expanding the new builtin, with the help of next items
> d) RTL block moves: they are defined as non-overlapping and I don't think
>    we should change this (essentially they're the reflection of struct
>    copies in C)
> e) how any of the above (builtins and RTL block moves) are implemented:
>    currently non-overlapping only, using movmem pattern when possible;
>    ultimately all sitting in the emit_block_move_hints() routine.
>
> So, I'd add a new method to emit_block_move_hints indicating possible
> overlap, disabling the use of move_by_pieces.  Then in
> emit_block_move_via_movmem (also getting an indication of overlap), do the
> equivalent of:
>
>   finished = 0;
>   if (overlap_possible) {
>     if (optab[movmem])
>       finished = emit(movmem);
>   } else {
>     if (optab[cpymem])
>       finished = emit(cpymem);
>     if (!finished && optab[movmem]) // can use movmem also for overlap
>       finished = emit(movmem);
>   }
>
> The overlap_possible method would only ever be used from the builtin
> expansion, and never from the RTL block move expand.  Additionally a target
> may optionally only define the movmem pattern if it's just as good as the
> cpymem pattern (e.g. because it only handles fixed small sizes and uses a
> load-all then store-all sequence).

We currently have gimple_fold_builtin_memory_op() figuring out where there is
no overlap and converting __builtin_memmove() to __builtin_memcpy(). Would you
foresee looking to convert a __builtin_memmove() with overlap into a call to
__builtin_memmove_hint when we can define the overlap precisely enough to
provide the hint? My guess is that this wouldn't be a very common case.

My goals for this are:
 * memcpy() call becomes __builtin_memcpy and goes to optab[cpymem]
 * memmove() call becomes __builtin_memmove (or __builtin_memcpy based on the
   gimple analysis) and goes through optab[movmem] or optab[cpymem]

I think what you've described meets these goals and cleans things up.

Thanks,
    Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy
On 5/15/19 11:31 AM, Jakub Jelinek wrote:
> On Wed, May 15, 2019 at 11:23:54AM -0500, Aaron Sawdey wrote:
>> My goals for this are:
>>  * memcpy() call becomes __builtin_memcpy and goes to optab[cpymem]
>>  * memmove() call becomes __builtin_memmove (or __builtin_memcpy based on
>>    the gimple analysis) and goes through optab[movmem] or optab[cpymem]
>
> Except for the becomes part (the memcpy call is the same thing as
> __builtin_memcpy in the middle-end, all you care about is whether it is
> BUILT_IN_MEMCPY etc. and whether it has compatible arguments), and for the
> missing optab[movmem] part and movmem->cpymem renaming, isn't that what we
> have already?

Yes. I was just trying to state what I wanted it to become, some of which is
already present. So I think I will start working on two patches:

1) rename the optab movmem and the underlying patterns to cpymem.
2) add a new optab movmem that is really memmove() and add support for having
   __builtin_memmove() use it.

Handling of the partial overlap case can be a separate piece of work.

Thanks,
    Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Fixing inline expansion of overlapping memmove and non-overlapping memcpy
On 5/15/19 1:01 PM, Jakub Jelinek wrote:
> On Wed, May 15, 2019 at 12:59:01PM -0500, Aaron Sawdey wrote:
>> 1) rename optab movmem and the underlying patterns to cpymem.
>> 2) add a new optab movmem that is really memmove() and add support for
>>    having __builtin_memmove() use it.
>>
>> Handling of the partial overlap case can be a separate piece of work.
>
> That 1) and 2) can be also separate pieces of work ;).

Exactly -- make things as easy as possible when I go begging for reviewers :-)

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: k-byte memset/memcpy/strlen builtins
On Wed, 2017-01-11 at 17:16 +0100, Robin Dapp wrote:
> Hi,

Hi Robin,
  I thought I'd share some of what I've run into while doing similar things
for the rs6000 target.

First off, be aware that glibc does some macro expansion things to try to
handle 1/2/3 byte string operations in some cases.

Secondly, the way I approached this was to use the patterns defined in
optabs.def for these things:

OPTAB_D (cmpmem_optab, "cmpmem$a")
OPTAB_D (cmpstr_optab, "cmpstr$a")
OPTAB_D (cmpstrn_optab, "cmpstrn$a")
OPTAB_D (movmem_optab, "movmem$a")
OPTAB_D (setmem_optab, "setmem$a")
OPTAB_D (strlen_optab, "strlen$a")

If you define movmemsi, that should get used by expand_builtin_memcpy for any
memcpy call that it sees. The constraints I was able to find when implementing
cmpmemsi for memcmp were:

* don't compare past the given length (obviously)
* don't read past the given length
* except it's ok to do so if you can prove via alignment or a runtime check
  that you are not going to cause a page fault. Not crossing a 4k boundary
  seems to be generally viewed as acceptable.

I would recommend looking at preprocessed code to make sure no funny business
is happening, and then look at your .md files. It looks to me like s390 has
got both movmem and strlen patterns there already.

If I understand correctly, you are wanting to do multi-byte characters. It
seems to me you need to follow the path Richard Biener suggests and make optab
expansions that handle wider chars, and then perhaps map wcslen et al. to
them?

  Aaron

> For 1-byte memset/memcpy the builtin functions provide a straightforward
> way to achieve this. At first sight it seemed possible to extend
> tree-loop-distribution.c to include the additional variants we need.
> However, multibyte memsets/memcpys are not covered by the C standard and
> I'm therefore unsure if such an approach is preferable or if there are more
> idiomatic ways or places where to add the functionality.
>
> The same question goes for 2-byte strlen. I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length in
> tree-ssa-strlen.c). Would it make sense to include strlen recognition and
> subsequently handling for 2-byte strlen? The situation might of course be
> more complicated than memset because of encodings etc. My snippet in
> question used a fixed-length encoding of 2 bytes, however.
>
> Another simple idea to tackle this would be a peephole optimization, but
> I'm not sure if this is really feasible for something like memset. Wouldn't
> the peephole have to be recursive then?
>
> Regards
>  Robin

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
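A sketch of the "not crossing a 4k boundary" guard mentioned in the
constraints above; the 4096-byte page size is an assumption for illustration
rather than something taken from any particular target.

#include <stddef.h>
#include <stdint.h>

/* Return nonzero if reading LEN bytes starting at P stays within one
   (assumed) 4 KiB page, so a wider-than-requested load cannot fault.  */
static int
within_one_4k_page (const void *p, size_t len)
{
  uintptr_t a = (uintptr_t) p;
  return (a & 4095) + len <= 4096;
}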
Re: help with PR78809 - inline strcmp for small constant strings
On Fri, 2017-08-04 at 17:30 -0500, Segher Boessenkool wrote:
> On Fri, Aug 04, 2017 at 08:38:11PM +, Wilco Dijkstra wrote:
> > Richard Henderson wrote:
> > > On 08/04/2017 05:59 AM, Prathamesh Kulkarni wrote:
> > > > For i386, it seems strcmp is expanded inline via the cmpstr optab by
> > > > expand_builtin_strcmp if one of the strings is constant. Could we
> > > > similarly define a cmpstr pattern for AArch64?
> > >
> > > Certainly that's possible.
> >
> > I'd suggest to do it in a target independent way; this is not a target
> > specific optimization and shouldn't be done in the target unless there are
> > special strcmp instructions.
>
> See rs6000-string.c; cc:ing Aaron.

I think I'd argue that even if there isn't an instruction that does strcmp
(i.e. repz cmpsb) you are still better off with target specific code. For
example, on power one needs to be aware of how the different processors deal
with unaligned loads, and that power7 for example doesn't like overlapping
unaligned loads, which are fine on power8. Also, instructions like cmpb are
useful, and I don't really see how a target independent expansion would end up
using that.

> > > For constant strings of small length (up to 3?), I was wondering if it'd
> > > be a good idea to manually unroll the strcmp loop, similar to the
> > > __strcmp_* macros in bits/string.h?
> > > For eg in gimple-fold, transform
> > >   x = __builtin_strcmp(s, "ab")
> > > to
> > >   x = s[0] - 'a';
> > >   if (x == 0)
> > >     {
> > >       x = s[1] - 'b';
> > >       if (x == 0)
> > >         x = s[2];
> > >     }

There was code to manually unroll strcmp/strncmp in string/bits/string2.h in
glibc, but that was recently removed:

https://sourceware.org/git/?p=glibc.git;a=commit;h=f7db120f67d853e0cfa272fa39c8b9be67c0a935

Personally I'm glad to see it go, not only because of the expansion into
individual comparisons of the first couple of characters, but also because it
expanded strcmp/strncmp to __builtin_strcmp/strncmp, which meant that you
could not disable the gcc expansions of strcmp/strncmp by using
-fno-builtin-strcmp/strncmp.

> > If there is already code that does something similar (see comment #1 in
> > PR78809), it could be easily adapted to handle more cases.
> >
> > > if (memcmp(s, "ab", 3) != 0)
> > > to be implemented with cmp+ccmp+ccmp and one branch.
> >
> > Even better would be wider loads if you either know the alignment of s or
> > its max size (although given the overhead of creating the return value,
> > that works best for equality).
>
> All those things are handled there, too. Most things that can really help
> are quite target-specific, I think.

Same thing goes for memcmp: you really need to know how the target handles
aligned/unaligned, and for example on power9 it is handy to be able to use
setb to produce the correct SImode result even if you just did a DImode
compare of 64 bits.

> Segher

  Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Avoiding truncate/sign-extend of POImode on ppc target
PR96791 is happening because DSE is trying to truncate a POImode reg down to
DImode. The POImode is created by a structure copy that gets inline expanded
using lxvp/stxvp, which we have defined using POImode. DSE recognizes that a
following load overlaps with the stxvp and can be satisfied by a truncation of
the POImode reg used for the stxvp.

The intention of using POImode was to avoid having to fully define OImode, and
it seems like defining these truncations would be the first step along that
path. If I haven't misrepresented it, I believe that is Segher's position on
this.

The function extract_low_bits() in expmed.c does some querying of the target
hook modes_tieable_p, but not in any way that allows us to tell it not to
allow truncation of POImode to a smaller mode. As a result we run into the ICE
because the patterns are not provided to do this.

So, I put it to the community: Is there an existing way to do this? Or do we
need another target hook of some kind to check this sort of thing?

Thanks,
   Aaron

Aaron Sawdey, Ph.D.  saw...@linux.ibm.com
IBM Linux on POWER Toolchain
Re: Avoiding truncate/sign-extend of POImode on ppc target
Meant to CC a few people, oops.

Aaron Sawdey, Ph.D.  saw...@linux.ibm.com
IBM Linux on POWER Toolchain
Add ops_num to targetm.sched.reassociation_width hook
Richard,
  So, I'm noticing that in get_reassociation_width() we know how many ops
(ops_num) are in the expression being considered for parallel reassociation,
but this is not passed to the target hook. In my testing this seems like it
might be useful to have.

If you determine the maximum width that gives additional speedup for a large
number of terms, and then use that as the width returned from the target hook,
get_reassociation_width() is more aggressive than you would like for small
expressions with maybe 4-16 terms and produces code that is slower than
optimal. For example, in many cases you want to continue using a width of 1
until you get to 16 terms or so. My testing shows this to be the case for
power8, power9, and power10 processors.

So, I'm wondering how it might be received if I posted a patch that adds this
to the reassociation_width target hook (and of course fixes all uses of that
target hook)?

Thanks!
   Aaron

Aaron Sawdey, Ph.D.  saw...@linux.ibm.com
IBM Linux on POWER Toolchain
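To make the shape of the change concrete, a target hook that received ops_num
could use it roughly as in the sketch below. This is hypothetical, not the
current hook signature; the 16-term threshold is the power8/power9/power10
observation described above, and the wide width stands in for whatever the
processor's parallelism supports.

/* Hypothetical sketch (not the existing hook): how a target might use an
   ops_num parameter.  The 16-term threshold and the caller-supplied
   WIDE_WIDTH are illustrative values.  */
static int
example_reassociation_width (unsigned int ops_num, int wide_width)
{
  if (ops_num < 16)
    return 1;          /* short chains: keep the serial evaluation */
  return wide_width;   /* long chains: use the machine's parallel width */
}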