On Wed, Dec 16, 2015 at 12:24 PM, Richard Earnshaw (lists) <richard.earns...@arm.com> wrote:
> On 15/12/15 23:34, Evandro Menezes wrote:
>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>> 2015-11-05  Evandro Menezes  <e.mene...@samsung.com>
>>>>>>>
>>>>>>>     gcc/
>>>>>>>
>>>>>>>         * config/aarch64/aarch64.c
>>>>>>>         (aarch64_override_options_internal):
>>>>>>>         Increase loop peeling limit.
>>>>>>>
>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>> With this change, I noticed no major regression in either
>>>>>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>>>>>> ones, improved significantly.
>>>>>>>
>>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>> from other stakeholders.
>>>>>>
>>>>>> Ping.
>>>>> I'd like to leave this for a call from the port maintainers.  I can
>>>>> see why this leads to more opportunities for vectorization, but I'm
>>>>> concerned about the wider impact on code size.  Certainly I wouldn't
>>>>> expect this to be our default at -O2 and below.
>>>>>
>>>>> My gut feeling is that this doesn't really belong in the back-end
>>>>> (there are presumably good reasons why the default for this
>>>>> parameter across GCC has fluctuated from 400 to 100 to 200 over
>>>>> recent years), but as I say, I'd like Marcus or Richard to make the
>>>>> call as to whether or not we take this patch.
>>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>> not a concern, for this heuristic is only active when unrolling
>>>> loops, when code size is already of secondary importance.
>>> My understanding was that loop peeling is enabled from -O2 upwards,
>>> and is also used to partially peel unaligned loops for vectorization
>>> (allowing the vector code to be well aligned), or to completely peel
>>> inner loops which may then become amenable to SLP vectorization.
>>>
>>> If I'm wrong then I take back these objections.  But I was sure this
>>> parameter was used in a number of situations outside of just
>>> -funroll-loops/-funroll-all-loops.  Certainly I remember seeing
>>> performance sensitivities to this parameter at -O3 in some internal
>>> workloads I was analysing.
>>
>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>> seems to me that peeling is only used by optimizations which already
>> lead to potential increase in code size.
>>
>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>> without it; with just "-O2", it is the same at 23.1MB regardless of this
>> setting.
>>
>> So it seems to me that this proposal should be neutral for up to -O2.
>>
>> Thank you,
>>
>
> My preference would be to not diverge from the global parameter
> settings.  I haven't looked in detail at this parameter but it seems to
> me there are two possible paths:
>
> 1) We could get agreement globally that the parameter should be increased.
> 2) We could agree that this specific use of the parameter is distinct
> from some other uses and deserves a new param in its own right with a
> higher value.

I think the fix is to improve the unrolled size estimates by better
taking into account constant propagation and CSE opportunities.  I have
some ideas here but not sure if I have enough free cycles to implement
this for GCC 7.

Richard.

> R.
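
To make the point about size estimates concrete, here is a minimal hand-written sketch of why a completely peeled loop can end up much smaller than "body size times trip count" suggests. It is only an illustration: the function names and the weight table are hypothetical, and it does not reproduce GCC's actual estimator or the aarch64 patch under discussion.

```c
#include <assert.h>

/* Rolled form: a small inner loop with a compile-time trip count of 4.
   A naive size estimate for complete peeling would count four copies
   of the load/multiply/add body.  */
int weighted_sum_rolled(const int *x)
{
    static const int w[4] = {1, 0, 2, 0};   /* hypothetical weights */
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += w[i] * x[i];
    return sum;
}

/* Hand-written equivalent of what remains after complete peeling plus
   constant propagation: the terms with zero weight vanish and the
   multiply by 1 disappears, so far fewer operations survive than the
   naive estimate assumes.  */
int weighted_sum_peeled(const int *x)
{
    return x[0] + 2 * x[2];
}

/* Returns 1 when both forms agree on a sample input.  */
int peeling_is_equivalent(void)
{
    const int x[4] = {3, 5, 7, 9};
    return weighted_sum_rolled(x) == weighted_sum_peeled(x);
}
```

An estimator that anticipated these simplifications would rate the peeled form as cheaper than four copies of the loop body, which is the kind of improvement suggested above as an alternative to raising the limit.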