collaborative tuning of GCC optimization heuristic

2016-03-05 Thread Grigori Fursin

Dear colleagues,

If it's of interest, we have released a new version of our open-source 
framework to share compiler optimization knowledge across diverse 
workloads and hardware. We would like to thank all the volunteers who 
ran this framework and shared some results for GCC 4.9 .. 6.0 in the 
public repository here: http://cTuning.org/crowdtuning-results-gcc


Here is a brief note on how this framework for crowdtuning compiler 
optimization heuristics works (for more details, please see 
https://github.com/ctuning/ck/wiki/Crowdsource_Experiments): you just 
install a small Android app 
(https://play.google.com/store/apps/details?id=openscience.crowdsource.experiments) 
or the Python-based Collective Knowledge framework 
(http://github.com/ctuning/ck). This program sends system properties to 
a public server. The server compiles a random shared workload using some 
flag combinations that have been found to work well on similar machines, 
as well as some new random ones. The client executes the compiled 
workload several times to account for variability, etc., and sends the 
results back to the server.
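
For illustration only, here is a rough Python sketch of what the client 
side of that loop looks like; the server URL, endpoint names and JSON 
fields below are made-up placeholders for this sketch, not the actual 
Collective Knowledge API:

import json
import platform
import subprocess
import time
import urllib.request

SERVER = "http://example-crowdtuning-server"  # placeholder, not the real server

def post(path, payload):
    """POST a JSON payload to the server and decode the JSON reply."""
    req = urllib.request.Request(
        SERVER + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# 1. Report local system properties so the server can pick flag
#    combinations that worked well on similar machines, plus new random ones.
job = post("/get_job", {
    "os": platform.system(),
    "arch": platform.machine(),
    "cpu": platform.processor(),
})

# 2. The server replies with workloads compiled for this target; run each
#    binary several times to account for run-to-run variability.
results = []
for item in job["compiled_workloads"]:
    times = []
    for _ in range(item.get("repetitions", 5)):
        t0 = time.time()
        subprocess.run([item["binary_path"]], check=True)
        times.append(time.time() - t0)
    results.append({"flags": item["flags"], "times": times})

# 3. Send the measurements back to be aggregated in the public repository.
post("/submit_results", {"job_id": job["id"], "results": results})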


If a combination of compiler flags is found that improves performance 
over the combinations found so far, it gets reduced (by removing flags 
that do not affect the performance) and uploaded to a public repository. 
Importantly, if a combination significantly degrades performance for a 
particular workload, it gets recorded as well. This potentially points 
to a problem with the optimization heuristics for a particular target, 
which may be worth investigating and improving.
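
The reduction step is essentially a greedy elimination loop. A minimal 
Python sketch, assuming a hypothetical measure() helper that compiles 
and times the workload for a given flag list (the 1% tolerance is an 
arbitrary illustrative threshold, not the value the framework uses):

def reduce_flags(flags, measure, tolerance=0.01):
    """Greedily drop flags whose removal does not change performance."""
    best_time = measure(flags)
    reduced = list(flags)
    for flag in list(reduced):
        candidate = [f for f in reduced if f != flag]
        t = measure(candidate)
        if t <= best_time * (1.0 + tolerance):  # this flag did not matter
            reduced = candidate
            best_time = min(best_time, t)
    return reduced

# For example, ["-O3", "-funroll-loops", "-ffast-math"] might reduce to
# ["-O3", "-funroll-loops"] if dropping -ffast-math leaves timing unchanged.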


At the moment, only global GCC compiler flags are exposed for 
collaborative optimization. Longer term, it can be useful to cover 
finer-grain transformation decisions (vectorization, unrolling, etc.) 
via a plugin interface. Please note that this is a prototype framework 
and much more can be done! Please get in touch if you are interested in 
learning more or contributing!


Take care,
Grigori

=
Grigori Fursin, CTO, dividiti, UK



Re: collaborative tuning of GCC optimization heuristic

2016-03-05 Thread David Edelsohn
On Sat, Mar 5, 2016 at 9:13 AM, Grigori Fursin wrote:
> Dear colleagues,
>
> If it's of interest, we have released a new version of our open-source
> framework to share compiler optimization knowledge across diverse workloads
> and hardware. We would like to thank all the volunteers who ran this
> framework and shared some results for GCC 4.9 .. 6.0 in the public
> repository here: http://cTuning.org/crowdtuning-results-gcc
>
> Here is a brief note on how this framework for crowdtuning compiler
> optimization heuristics works (for more details, please see
> https://github.com/ctuning/ck/wiki/Crowdsource_Experiments): you just
> install a small Android app
> (https://play.google.com/store/apps/details?id=openscience.crowdsource.experiments)
> or the Python-based Collective Knowledge framework
> (http://github.com/ctuning/ck). This program sends system properties to a
> public server. The server compiles a random shared workload using some flag
> combinations that have been found to work well on similar machines, as well
> as some new random ones. The client executes the compiled workload several
> times to account for variability, etc., and sends the results back to the
> server.
>
> If a combination of compiler flags is found that improves performance over
> the combinations found so far, it gets reduced (by removing flags that do
> not affect the performance) and uploaded to a public repository.
> Importantly, if a combination significantly degrades performance for a
> particular workload, it gets recorded as well. This potentially points to a
> problem with optimization heuristics for a particular target, which may be
> worth investigating and improving.
>
> At the moment, only global GCC compiler flags are exposed for collaborative
> optimization. Longer term, it can be useful to cover finer-grain
> transformation decisions (vectorization, unrolling, etc.) via a plugin
> interface. Please note that this is a prototype framework and much more
> can be done! Please get in touch if you are interested in learning more
> or contributing!

Thanks for creating and sharing this interesting framework.

I think a central issue is the "random shared workload", because the
optimal optimizations and optimization pipeline are
application-dependent.  The proposed changes to the heuristics may
benefit the particular set of workloads that the framework tests, but
why are those workloads and their particular implementations
representative of the applications of interest to end users of GCC?
GCC is tuned for an arbitrary set of workloads, but why are the
workloads from cTuning any better?

Thanks, David


Re: Implementing TI mode (128-bit) and the 2nd pipeline for the MIPS R5900

2016-03-05 Thread Richard Henderson

On 02/27/2016 01:38 AM, Woon yung Liu wrote:

> I've given up on trying to implement MMI support for this target because I
> couldn't get the larger-than-normal GPR sizes to work nicely with the GCC
> internals (registers sometimes get split due to the defined word size, or the
> stuff in expr.c will just suffer from assertion failures).

[ Apologies for assumptions being made here, since I can't find an instruction 
set reference for the r5900 anymore.  ]


You probably don't want to be using TImode for MMI support anyway, since, in 
the broader context, this instruction set extension is about SIMD.


Thus e.g. V16QImode and V8HImode might be more appropriate.



> It seems like the RTL patterns are not unique according to their names, but
> the inputs/outputs.

Correct.



> Is there a way to force GCC to use a specific pattern (i.e.
> "r5900_qword_store" and "r5900_qword_load")? I don't want to add the lq/sq
> instructions to mips_output_move because it will allow lq/sq to be used for
> stuff that isn't supported (i.e. loading TI-mode data types into a register for
> arithmetic operations that don't exist).

You can't.  For a given set of inputs, one must provide all of the valid ways 
that one can perform the operation as alternatives.


Thus for TImode move, currently defined as

(define_insn "*movti"
  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
        (match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]

one would have to add additional alternatives for the lq and sq instructions 
(and probably a register-register alternative as well, e.g. por d,s,s).


You must use the set_attr section to describe when the alternatives that you 
add are valid.  The "enabled" attribute controls this.  Looking at the mips 
port, it would appear that adding to "move_type" would be best.


It is of course simpler if the patterns that you want to add do not overlap 
with existing patterns.  Thus if you stick to the vector modes you have less 
overlap than if you describe MMI as using TImode.



r~