RE: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-07-07 Thread Sameera Deshpande


From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of 
autovectorization using unified representation.

On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande
 wrote:
> 
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep 
> Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of 
> autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande
>>  wrote:
>>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:

 On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener
  wrote:
>
> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande
>  wrote:
>>
>> Hi Richard,
>>
>> This is with reference to our discussion at GNU Tools Cauldron 2015
>> regarding my talk titled "Improving the effectiveness and generality of
>> GCC auto-vectorization."  Further to our prototype implementation of the
>> concept, we have started implementing it in GCC.
>>
>> We are following an incremental model to add language support in our
>> front-end, and the corresponding back-end support (for the auto-vectorizer)
>> will be added for feature completion.
>>
>> Looking at the complexity and scale of the project, we have divided it
>> into the subtasks listed below, for ease of implementation, testing and
>> review.
>>
>> 0. Add a new pass to perform autovectorization using the unified
>> representation - the current GCC framework does not give a complete
>> overview of the loop to be vectorized: it breaks the loop either across
>> its body or across iterations. Because of this, the existing data
>> structures cannot be reused for our approach, which gathers all the
>> information about the loop body in one place using primitive permute
>> operations. Hence, we define new data structures and populate them.
>>
>> 1. Add support for vectorization of LOAD/STORE instructions
>>   a. Create the permute order tree for the loop with LOAD and STORE
>> instructions for single- or multi-dimensional arrays and aggregates within
>> nested loops.
>>   b. Basic transformation phase to generate vectorized code for the
>> primitive reorder tree generated at stage 1a using the tree tiling
>> algorithm. This phase handles code generation for SCATTER, GATHER, strided
>> memory accesses etc., along with permute instruction generation (an
>> illustrative loop is sketched after this list).
>>
>> 2. Implementation of k-arity promotion/reduction: the permute nodes
>> within the primitive reorder tree generated from the input program can
>> have any arity. However, in most cases the target supports a maximum
>> arity of 2. Hence, we need to promote or reduce the arity of the permute
>> order tree to enable successful tree tiling.
>>
>> 3. Vector size reduction: depending upon the vector size of the target,
>> reduce the vector size per statement and adjust the loop count of the
>> vectorized loop accordingly.
>>
>> 4. Support simple arithmetic operations:
>>   a. Add support for analyzing statements with simple arithmetic
>> operations like +, -, *, / for vectorization, and create the primitive
>> reorder tree with compute_op.
>>   b. Generate vector code for the primitive reorder tree generated at
>> stage 4a using the tree tiling algorithm - here, support for complex
>> patterns like multiply-add should be checked and the appropriate
>> instructions generated.
>>
>> 5. Support the reduction operation:
>>   a. Add support for reduction operation analysis and primitive
>> reorder tree generation. The reduction operation needs special handling,
>> as the finish statement should COLLAPSE the temporary reduction vector
>> TEMP_VAR into the original reduction variable.
>>   b. The code generation for the primitive reorder tree does not need any
>> special handling, as the reduction tree is the same as the tree generated
>> in 4a; the only difference is that in 4a the destination is MEMREF
>> (because of the STORE operation), whereas for reduction it is TEMP_VAR.
>> At this stage, generate code for the COLLAPSE node in the finish
>> statements.
>>
>> 6. Support other vectorizable statements like complex arithmetic
>> operations, bitwise operations, type conversions etc.
>>   a. Add support for analysis and primitive reorder tree generation.
>>   b. Vector code generation.
>>
>> 7. Cost effective tree tiling 
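
To make the load/store and reduction stages above concrete, here is a small,
purely illustrative C example of the kind of loops the proposal targets; the
function and variable names are invented for this example and do not come
from the patch series. The interleaved (stride-2) accesses in the first loop
are what stage 1 models with permute order (LOAD/STORE) nodes, and the
accumulation in the second loop is what stage 5 would gather into a temporary
reduction vector and finally COLLAPSE back into the scalar.

/* Illustrative only; not code from the patch series.  */
void
interleave_and_reduce (float *restrict out, const float *restrict in,
                       float *restrict sum_out, int n)
{
  /* Stride-2 (interleaved) accesses: candidates for the permute order
     tree of stage 1a and the GATHER/strided code generation of 1b.  */
  for (int i = 0; i < n; i++)
    {
      out[2 * i]     = in[2 * i + 1];
      out[2 * i + 1] = in[2 * i];
    }

  /* A reduction: stage 5 would accumulate into a temporary vector
     (TEMP_VAR) and emit a final COLLAPSE into the scalar 'sum'.  */
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += in[i];
  *sum_out = sum;
}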

Re: [gimplefe] hacking pass manager

2016-07-07 Thread Prasad Ghangal
On 6 July 2016 at 14:24, Richard Biener  wrote:
> On Wed, Jul 6, 2016 at 9:51 AM, Prasad Ghangal  
> wrote:
>> On 30 June 2016 at 17:10, Richard Biener  wrote:
>>> On Wed, Jun 29, 2016 at 9:13 PM, Prasad Ghangal
>>>  wrote:
 On 29 June 2016 at 22:15, Richard Biener  
 wrote:
> On June 29, 2016 6:20:29 PM GMT+02:00, Prathamesh Kulkarni 
>  wrote:
>>On 18 June 2016 at 12:02, Prasad Ghangal 
>>wrote:
>>> Hi,
>>>
>>> I tried hacking the pass manager to execute only the given passes. For
>>> this I am adding a new member, opt_pass *custom_pass_list, to the function
>>> structure to store the passes that need to be executed, and providing the
>>> custom_pass_list to the execute_pass_list() function instead of all
>>> passes.
>>>
>>> for a test case like -
>>>
>>> int a;
>>> void __GIMPLE (execute ("tree-ccp1", "tree-fre1")) foo()
>>> {
>>> bb_1:
>>>   a = 1 + a;
>>> }
>>>
>>> it will execute only the given passes, i.e. the ccp1 and fre1 passes, on
>>> the function
>>>
>>> and for a test case like -
>>>
>>> int a;
>>> void __GIMPLE (startwith ("tree-ccp1")) foo()
>>> {
>>> bb_1:
>>>   a = 1 + a;
>>> }
>>>
>>> it will act as an entry point to the pipeline and will execute passes
>>> starting from the given pass.
>>Bike-shedding:
>>Would it make sense to have syntax for defining pass ranges to execute?
>>For instance:
>>void __GIMPLE(execute (pass_start : pass_end))
>>which would execute all the passes within the range [pass_start, pass_end];
>>that would be convenient if the range is large.
>
> But it would rely on a particular pass pipeline, e.g. pass-start
> appearing before pass-end.
>
> Currently the control doesn't work 100%, as it only replaces
> all_optimizations but not the lowering passes or early opts, nor IPA opts.
>

 Each pass needs GIMPLE in some specific form, so I am letting the lowering
 and early opt passes execute. I think we have to execute some passes
 (like cfg) anyway to get the GIMPLE into proper form.
>>>
>>> Yes, that's true.  Note that early opt passes only optimize but we need
>>> pass_build_ssa_passes at least (for into-SSA).  For proper unit-testing
>>> of GIMPLE passes we do need to guard off early opts somehow
>>> (I guess a simple if (flag_gimple && cfun->custom_pass_list) would do
>>> that).
>>>
>>> Then there is of course the question about IPA passes which I think is
>>> somewhat harder (one could always disable all IPA passes manually
>>> via flags of course or finally have a global -fipa/no-ipa like most
>>> other compilers).
>>>
>> Can we iterate through all ipa passes and do -fdisable-ipa-pass or
>> -fenable-ipa-pass equivalent for each?
>
> We could do that, yes.  But let's postpone this issue.  I think that
> startwith is going to be most useful, and rather than constructing a
> pass list for it, "native" support for it in the pass manager is likely
> to produce better results (add a 'startwith' member alongside the pass
> list member; if it is set, the pass manager skips all passes that do not
> match 'startwith' and clears the field once it reaches it).
>
> In the future I hope we can get away from a static pass list and move more
> towards rule-driven pass execution (we have all that PROP_* stuff already,
> but it isn't really used, for example).  But well, that would be a
> separate GSoC project ;)
>
> IMHO startwith will provide everything needed for unit-testing.  We can
> add a flag for whether further passes should be executed or not, and even
> a pass list like execute ("ccp1", "fre") can be implemented by startwith
> ccp1 and then, from there, executing the rest of the passes in the list
> and stopping at the end.
>
> As said, unit-testing should exercise a single pass if we can control
> its input.
>
In this patch I am skipping the execution of passes until pass_startwith
is found. Unlike the previous build, the pass manager now executes all
passes in the pipeline starting from pass_startwith, instead of just the
sub-passes.
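
For illustration only, here is a minimal, self-contained C sketch of the
skip-until-startwith behaviour described above; the struct pass type and the
run_passes/startwith names are invented for this example and are not GCC's
opt_pass or pass-manager API.

#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for a pass pipeline; not GCC's opt_pass.  */
struct pass
{
  const char *name;
  void (*execute) (void);
  struct pass *next;
};

/* Skip every pass until its name matches *STARTWITH, clear *STARTWITH,
   and execute the remaining passes in pipeline order.  */
static void
run_passes (struct pass *list, const char **startwith)
{
  for (struct pass *p = list; p; p = p->next)
    {
      if (*startwith)
        {
          if (strcmp (p->name, *startwith) != 0)
            continue;           /* Entry point not reached yet: skip.  */
          *startwith = NULL;    /* Found it; stop skipping.  */
        }
      printf ("running %s\n", p->name);
      if (p->execute)
        p->execute ();
    }
}

int
main (void)
{
  struct pass fre1  = { "fre1",  NULL, NULL };
  struct pass ccp1  = { "ccp1",  NULL, &fre1 };
  struct pass lower = { "lower", NULL, &ccp1 };
  const char *startwith = "ccp1";
  run_passes (&lower, &startwith);   /* Skips "lower", runs "ccp1", "fre1".  */
  return 0;
}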

> Thanks,
> Richard.
>
>> Thanks,
>> Prasad
>>
>>> Richard.
>>>
> Richard.
>
>>Thanks,
>>Prathamesh
>>>
>>>
>>>
>>> Thanks,
>>> Prasad Ghangal
>
>
diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 00e0bc5..d7ffdce 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -1413,7 +1413,7 @@ static c_expr c_parser_gimple_unary_expression (c_parser *);
 static struct c_expr c_parser_gimple_postfix_expression (c_parser *);
 static struct c_expr c_parser_gimple_postfix_expression_after_primary (c_parser *,
                                                                        struct c_expr);
-static void c_parser_gimple_pass_list (c_parser *, opt_pass **);
+static void c_parser_gimple_pass_list (c_parser *, opt_pass **, bool *);
 static opt_pass *c_parser_gimple_pass_list_params (c_parser *, opt_pass **);
 static void c_parser_gimple_declaration (c_parser *);
 stati

gcc-6-20160707 is now available

2016-07-07 Thread gccadmin
Snapshot gcc-6-20160707 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20160707/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-6-branch 
revision 238150

You'll find:

 gcc-6-20160707.tar.bz2   Complete GCC

  MD5=eb301d98f444e83a8b49beb630d86466
  SHA1=9b3e051d685dfba605101f21cd252b4239844c19

Diffs from 6-20160630 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Fwd: Re: GCC libatomic questions

2016-07-07 Thread Bin Fan

Hi,

I have a revised version of the libatomic ABI draft which tries to 
accommodate Richard's comments. The new version is attached. The diff is 
also appended.


Thanks,
- Bin
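
As background for the "(with at16)" rows and the inlineability wording in the
diff below: on 64-bit x86, whether a 16-byte naturally aligned atomic is
lock-free hinges on cmpxchg16b availability (e.g. GCC's -mcx16 flag). A
minimal check using GCC's __atomic_is_lock_free built-in, purely as an
illustration and not part of the ABI draft:

#include <stdio.h>

/* A 16-byte, naturally aligned atomic object (x86-64 only).  */
static _Atomic __int128 big;

int
main (void)
{
  /* Whether this prints 1 depends on the target ISA (cmpxchg16b,
     e.g. -mcx16) and on the compiler/libatomic in use.  */
  printf ("16-byte atomic lock-free: %d\n",
          (int) __atomic_is_lock_free (sizeof big, (void *) &big));
  return 0;
}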

diff ABI.txt ABI-1.1.txt
28a29,30
> - The versioning of the library external symbols
>
47a50,57
> Note
>
> Some 64-bit x86 ISAs do not support the cmpxchg16b instruction, for
> example, some early AMD64 processors and the later Intel Xeon Phi
> co-processors. Whether cmpxchg16b is supported may affect the ABI
> specification for certain atomic types. We will discuss the details
> where it has an impact.
>
101c111,112
< _Atomic __int128                    16  16      N       not applicable
---
> _Atomic __int128 (with at16)        16  16      Y       not applicable
> _Atomic __int128 (w/o at16)         16  16      N       not applicable

105c116,117
< _Atomic long double                 16  16      N       12  4   N
---
> _Atomic long double (with at16)     16  16      Y       12  4   N
> _Atomic long double (w/o at16)      16  16      N       12  4   N

106a119,120
> _Atomic double _Complex             16  16(8)   Y       16  16(8)   N
> (with at16)
107a122
> (w/o at16)
110a126,127
> _Atomic long double _Imaginary      16  16      Y       12  4   N
> (with at16)
111a129
> (w/o at16)
146a165,167
> "with at16" means the ISA supports cmpxchg16b; "w/o at16" means the ISA
> does not support cmpxchg16b.
>
191a213,214
> _Atomic struct {char a[16];}        16  16(1)   Y       16  16(1)   N
> (with at16)
192a216
> (w/o at16)
208a233,235
> "with at16" means the ISA supports cmpxchg16b; "w/o at16" means the ISA
> does not support cmpxchg16b.
>
246a274,276
> On a 64-bit x86 platform that supports the cmpxchg16b instruction,
> 16-byte atomic types whose alignment matches their size are inlineable.
>
303,306c333,338
< CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
< naturally aligned atomics are not inlineable. The support functions for
< such atomics are free to use a lock-free implementation if the instruction
< is available on specific platforms.
---
> "Inlineability" is a compile-time property, which in most cases depends
> only on the type. In a few cases it also depends on whether the target
> ISA supports the cmpxchg16b instruction. A compiler may get the ISA
> information either from compilation flags or by inquiring the hardware
> capabilities. When the hardware capability information is not available,
> the compiler should assume the cmpxchg16b instruction is not supported.
665a698,705
> The function takes the size of an object and an address, which
> is one of the following three cases:
> - the address of the object
> - a faked address that solely indicates the alignment of the
>   object's address
> - NULL, which means that the alignment of the object matches its size
> and it returns whether the object is lock-free.
>
711c751
< 5. Libatomic Assumption on Non-blocking Memory Instructions
---
> 5. Libatomic symbol versioning
712a753,868
> Here is the mapfile for symbol versioning of the libatomic library, as
> specified by this ABI specification:
>
> LIBATOMIC_1.0 {
>   global:
> __atomic_load;
> __atomic_store;
> __atomic_exchange;
> __atomic_compare_exchange;
> __atomic_is_lock_free;
>
> __atomic_add_fetch_1;
> __atomic_add_fetch_2;
> __atomic_add_fetch_4;
> __atomic_add_fetch_8;
> __atomic_add_fetch_16;
> __atomic_and_fetch_1;
> __atomic_and_fetch_2;
> __atomic_and_fetch_4;
> __atomic_and_fetch_8;
> __atomic_and_fetch_16;
> __atomic_compare_exchange_1;
> __atomic_compare_exchange_2;
> __atomic_compare_exchange_4;
> __atomic_compare_exchange_8;
> __atomic_compare_exchange_16;
> __atomic_exchange_1;
> __atomic_exchange_2;
> __atomic_exchange_4;
> __atomic_exchange_8;
> __atomic_exchange_16;
> __atomic_fetch_add_1;
> __atomic_fetch_add_2;
> __atomic_fetch_add_4;
> __atomic_fetch_add_8;
> __atomic_fetch_add_16;
> __atomic_fetch_and_1;
> __atomic_fetch_and_2;
> __atomic_fetch_and_4;
> __atomic_fetch_and_8;
> __atomic_fetch_and_16;
> __atomic_fetch_nand_1;
> __atomic_fetch_nand_2;
> __atomic_fetch_nand_4;
> __atomic_fetch_nand_8;
> __atomic_fetch_nand_16;
> __atomic_fetch_or_1;
> __atomic_fetch_or_2;
> __atomic_fetch_or_4;
> __atomic_fetch_or_8;
> __atomic_fetch_or_16;
> __atomic_fetch_sub_1;
> __atomic_fetch_sub_2;
> __atomic_fetch_sub_4;
> __atomic_fetch_sub_8;
> __atomic_fetch_sub_16;
> __atomic_fetch_xor_1;
> __atomic_fetch_xor_2;
> __atomic_fetch_xor_4;
> __atomic_fetch_xor_8;
> __atomic_fetch_xor_16;
> __atomic_load_1;
> __atomic_load_2;
> __atomic_l