RE: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande wrote:
>
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of
> autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande wrote:
>>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener wrote:
> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande wrote:
>>
>> Hi Richard,
>>
>> This is with reference to our discussion at GNU Tools Cauldron 2015
>> regarding my talk titled "Improving the effectiveness and generality
>> of GCC auto-vectorization." Further to our prototype implementation
>> of the concept, we have started implementing this concept in GCC.
>>
>> We are following an incremental model to add language support in our
>> front-end, and the corresponding back-end (for the auto-vectorizer)
>> will be added for feature completion.
>>
>> Looking at the complexity and scale of the project, we have divided
>> it into the subtasks listed below, for ease of implementation,
>> testing and review.
>>
>> 0. Add a new pass to perform autovectorization using the unified
>> representation - The current GCC framework does not give a complete
>> overview of the loop to be vectorized: it either breaks the loop
>> across the body, or across iterations. Because of this, the existing
>> data structures cannot be reused for our approach, which gathers all
>> the information about the loop body in one place using primitive
>> permute operations. Hence, define new data structures and populate
>> them.
>>
>> 1. Add support for vectorization of LOAD/STORE instructions
>> a. Create a permute order tree for the loop with LOAD and STORE
>> instructions for single- or multi-dimensional arrays and aggregates
>> within nested loops.
>> b. Basic transformation phase to generate vectorized code for the
>> primitive reorder tree generated at stage 1a using a tree tiling
>> algorithm. This phase handles code generation for SCATTER, GATHER,
>> strided memory accesses etc., along with permute instruction
>> generation.
>>
>> 2. Implementation of k-arity promotion/reduction: The permute nodes
>> within the primitive reorder tree generated from the input program
>> can have any arity. However, the target can support a maximum arity
>> of 2 in most cases. Hence, we need to promote or reduce the arity of
>> the permute order tree to enable successful tree tiling [a sketch of
>> such a reduction follows this message].
>>
>> 3. Vector size reduction: Depending upon the vector size for the
>> target, reduce the vector size per statement and adjust the loop
>> count for the vectorized loop accordingly.
>>
>> 4. Support simple arithmetic operations:
>> a. Add support for analyzing statements with simple arithmetic
>> operations like +, -, *, / for vectorization, and create a primitive
>> reorder tree with compute_op.
>> b. Generate vector code for the primitive reorder tree generated at
>> stage 4a using the tree tiling algorithm - here, support for complex
>> patterns like multiply-add should be checked and the appropriate
>> instruction generated.
>>
>> 5. Support the reduction operation:
>> a. Add support for reduction operation analysis and primitive
>> reorder tree generation. The reduction operation needs special
>> handling, as the finish statement should COLLAPSE the temporary
>> reduction vector TEMP_VAR into the original reduction variable.
>> b. The code generation for the primitive reorder tree does not need
>> any special handling - the reduction tree is the same as the tree
>> generated in 4a, with the only difference that in 4a the destination
>> is a MEMREF (because of the STORE operation) whereas for reduction
>> it is TEMP_VAR. At this stage, generate code for the COLLAPSE node
>> in the finish statements.
>>
>> 6. Support other vectorizable statements like complex arithmetic
>> operations, bitwise operations, type conversions etc.
>> a. Add support for analysis and primitive reorder tree generation.
>> b. Vector code generation.
>>
>> 7. Cost effective tree tiling
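The k-arity reduction in subtask 2 is the most self-contained step, so
a concrete illustration may help. What follows is a minimal sketch
only, with invented types and names (it is not the proposed GCC data
structure): a permute node of arity k > 2 is lowered to a tree of
arity-2 interleave nodes, which two-operand target permute
instructions can then tile.

// Hypothetical sketch of k-arity reduction for a permute order tree.
// A node is either a leaf (e.g. a strided memory reference) or an
// interleave of its children; targets usually tile only arity-2 nodes.
#include <memory>
#include <vector>

struct permute_node
{
  // Empty for leaves; interior nodes interleave their children.
  std::vector<std::unique_ptr<permute_node>> children;
};

// Repeatedly pair up children until at most two remain; every pairing
// becomes a new arity-2 interleave node, so the result is tileable.
std::unique_ptr<permute_node>
reduce_arity (std::vector<std::unique_ptr<permute_node>> kids)
{
  while (kids.size () > 2)
    {
      std::vector<std::unique_ptr<permute_node>> next;
      for (size_t i = 0; i + 1 < kids.size (); i += 2)
        {
          auto pair = std::make_unique<permute_node> ();
          pair->children.push_back (std::move (kids[i]));
          pair->children.push_back (std::move (kids[i + 1]));
          next.push_back (std::move (pair));
        }
      if (kids.size () % 2)          // odd child: carry to next round
        next.push_back (std::move (kids.back ()));
      kids = std::move (next);
    }
  auto root = std::make_unique<permute_node> ();
  root->children = std::move (kids);
  return root;
}

Arity promotion would go the other direction, widening nodes up to the
arity a target instruction expects; either way the goal is that every
interior node matches some target permute instruction.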
Re: [gimplefe] hacking pass manager
On 6 July 2016 at 14:24, Richard Biener wrote:
> On Wed, Jul 6, 2016 at 9:51 AM, Prasad Ghangal wrote:
>> On 30 June 2016 at 17:10, Richard Biener wrote:
>>> On Wed, Jun 29, 2016 at 9:13 PM, Prasad Ghangal wrote:
On 29 June 2016 at 22:15, Richard Biener wrote:
> On June 29, 2016 6:20:29 PM GMT+02:00, Prathamesh Kulkarni wrote:
>> On 18 June 2016 at 12:02, Prasad Ghangal wrote:
>>> Hi,
>>>
>>> I tried hacking the pass manager to execute only the given passes.
>>> For this I am adding a new member, opt_pass *custom_pass_list, to
>>> the function structure to store the passes that need to execute,
>>> and passing custom_pass_list to the execute_pass_list() function
>>> instead of all passes.
>>>
>>> For a test case like
>>>
>>> int a;
>>> void __GIMPLE (execute ("tree-ccp1", "tree-fre1")) foo()
>>> {
>>> bb_1:
>>>   a = 1 + a;
>>> }
>>>
>>> it will execute only the given passes, i.e. the ccp1 and fre1
>>> passes, on the function.
>>>
>>> And for a test case like
>>>
>>> int a;
>>> void __GIMPLE (startwith ("tree-ccp1")) foo()
>>> {
>>> bb_1:
>>>   a = 1 + a;
>>> }
>>>
>>> it will act as an entry point to the pipeline and will execute the
>>> passes starting from the given pass.
>>
>> Bike-shedding:
>> Would it make sense to have syntax for defining pass ranges to
>> execute?
>> For instance:
>> void __GIMPLE(execute (pass_start : pass_end))
>> which would execute all the passes within the range [pass_start,
>> pass_end], which would be convenient if the range is large.
>
> But it would rely on a particular pass pipeline, e.g. pass_start
> appearing before pass_end.
>
> Currently control doesn't work 100% as it only replaces
> all_optimizations but not lowering passes or early opts, nor IPA
> opts.

Each pass needs GIMPLE in some specific form. So I am letting the
lowering and early opt passes execute. I think we have to execute some
passes (like cfg) anyway to get the GIMPLE into proper form.

>>> Yes, that's true. Note that early opt passes only optimize but we
>>> need pass_build_ssa_passes at least (for into-SSA). For proper
>>> unit-testing of GIMPLE passes we do need to guard off early opts
>>> somehow (I guess a simple if (flag_gimple && cfun->custom_pass_list)
>>> would do that).
>>>
>>> Then there is of course the question about IPA passes which I think
>>> is somewhat harder (one could always disable all IPA passes manually
>>> via flags of course or finally have a global -fipa/no-ipa like most
>>> other compilers).
>>
>> Can we iterate through all ipa passes and do -fdisable-ipa-pass or
>> -fenable-ipa-pass equivalent for each?
>
> We could do that, yes. But let's postpone this issue. I think that
> startwith is going to be most useful, and rather than constructing a
> pass list for it, "native" support for it in the pass manager is
> likely to produce better results (add a 'startwith' member alongside
> the pass list member; if it is set, the pass manager skips all passes
> that do not match 'startwith', and once it reaches it, it clears the
> field) [a toy version of this appears after this exchange].
>
> In the future I hope we can get away from a static pass list and move
> more towards rule-driven pass execution (we have all that PROP_*
> stuff already, but it isn't really used, for example). But well, that
> would be a separate GSoC project ;)
>
> IMHO startwith will provide everything needed for unit-testing. We
> can add a flag on whether further passes should be executed or not,
> and even a pass list like execute ("ccp1", "fre") can be implemented
> by starting with ccp1 and then from there executing the rest of the
> passes in the list and stopping at the end.
>
> As said, unit-testing should exercise a single pass if we can control
> its input.
>

In this patch I am skipping execution of passes until pass_startwith
is found. Unlike the previous build, the pass manager now executes all
passes in the pipeline starting from pass_startwith, instead of just
the sub-passes.

> Thanks,
> Richard.

Thanks,
Prasad

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 00e0bc5..d7ffdce 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -1413,7 +1413,7 @@ static c_expr c_parser_gimple_unary_expression (c_parser *);
 static struct c_expr c_parser_gimple_postfix_expression (c_parser *);
 static struct c_expr c_parser_gimple_postfix_expression_after_primary
   (c_parser *, struct c_expr);
-static void c_parser_gimple_pass_list (c_parser *, opt_pass **);
+static void c_parser_gimple_pass_list (c_parser *, opt_pass **, bool *);
 static opt_pass *c_parser_gimple_pass_list_params (c_parser *, opt_pass **);
 static void c_parser_gimple_declaration (c_parser *);
 stati
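For reference, the "startwith" semantics Richard describes above (skip
every pass until the named one is reached, then clear the field and
run the rest of the pipeline) amounts to a small filter in the
pass-execution loop. Here is a self-contained toy version; the types
and the linked-list shape are invented for illustration and are not
GCC's real pass manager API.

// Toy model of 'startwith': skip passes whose name does not match,
// then clear the marker so every later pass executes normally.
#include <cstdio>
#include <cstring>

struct opt_pass
{
  const char *name;
  opt_pass *next;
  void execute () { std::printf ("running %s\n", name); }
};

void
execute_pass_list (opt_pass *pass, const char **startwith)
{
  for (; pass; pass = pass->next)
    {
      if (*startwith)
        {
          if (std::strcmp (pass->name, *startwith) != 0)
            continue;           // still before the requested entry point
          *startwith = nullptr; // reached it; run the rest normally
        }
      pass->execute ();
    }
}

int
main ()
{
  opt_pass fre   = { "fre1", nullptr };
  opt_pass ccp   = { "ccp1", &fre };
  opt_pass lower = { "lower", &ccp };
  const char *startwith = "ccp1";
  execute_pass_list (&lower, &startwith); // prints: running ccp1, running fre1
}

An execute ("ccp1", "fre1") list could then be layered on top exactly
as Richard suggests: start with the first pass in the list and stop
after the last one.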
gcc-6-20160707 is now available
Snapshot gcc-6-20160707 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20160707/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-6-branch revision 238150

You'll find:

 gcc-6-20160707.tar.bz2               Complete GCC

  MD5=eb301d98f444e83a8b49beb630d86466
  SHA1=9b3e051d685dfba605101f21cd252b4239844c19

Diffs from 6-20160630 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.
Re: Fwd: Re: GCC libatomic questions
Hi,

I have a revised version of the libatomic ABI draft which tries to
accommodate Richard's comments. The new version is attached. The diff
is also appended.

Thanks,
- Bin

diff ABI.txt ABI-1.1.txt
28a29,30
>    - The versioning of the library external symbols
>
47a50,57
> Note
>
> Some 64-bit x86 ISAs do not support the cmpxchg16b instruction, for
> example, some early AMD64 processors and the later Intel Xeon Phi co-
> processor. Whether cmpxchg16b is supported may affect the ABI
> specification for certain atomic types. We will discuss the detail
> where it has an impact.
>
101c111,112
< _Atomic __int128                  16  16    N    not applicable
---
> _Atomic __int128 (with at16)      16  16    Y    not applicable
> _Atomic __int128 (w/o at16)       16  16    N    not applicable
105c116,117
< _Atomic long double               16  16    N    12  4     N
---
> _Atomic long double (with at16)   16  16    Y    12  4     N
> _Atomic long double (w/o at16)    16  16    N    12  4     N
106a119,120
> _Atomic double _Complex           16  16(8) Y    16  16(8) N
>   (with at16)
107a122
>   (w/o at16)
110a126,127
> _Atomic long double _Imaginary    16  16    Y    12  4     N
>   (with at16)
111a129
>   (w/o at16)
146a165,167
> "with at16" means the ISA supports cmpxchg16b; "w/o at16" means the
> ISA does not support cmpxchg16b.
>
191a213,214
> _Atomic struct {char a[16];}      16  16(1) Y    16  16(1) N
>   (with at16)
192a216
>   (w/o at16)
208a233,235
> "with at16" means the ISA supports cmpxchg16b; "w/o at16" means the
> ISA does not support cmpxchg16b.
>
246a274,276
> On 64-bit x86 platforms which support the cmpxchg16b instruction,
> 16-byte atomic types whose alignment matches the size are inlineable.
>
303,306c333,338
< CMPXCHG16B is not always available on 64-bit x86 platforms, so 16-byte
< naturally aligned atomics are not inlineable. The support functions for
< such atomics are free to use a lock-free implementation if the
< instruction is available on specific platforms.
---
> "Inlineability" is a compile-time property, which in most cases depends
> only on the type. In a few cases it also depends on whether the target
> ISA supports the cmpxchg16b instruction. A compiler may get the ISA
> information either from compilation flags or by inquiring the hardware
> capabilities. When the hardware capability information is not available,
> the compiler should assume the cmpxchg16b instruction is not supported.
665a698,705
> The function takes the size of an object and an address, which is one
> of the following three cases:
>   - the address of the object
>   - a faked address that solely indicates the alignment of the
>     object's address
>   - NULL, which means that the alignment of the object matches the size
> and returns whether the object is lock-free.
>
711c751
< 5. Libatomic Assumption on Non-blocking Memory Instructions
---
> 5. Libatomic symbol versioning
712a753,868
> Here is the mapfile for symbol versioning of the libatomic library
> specified by this ABI specification:
>
> LIBATOMIC_1.0 {
>   global:
>         __atomic_load;
>         __atomic_store;
>         __atomic_exchange;
>         __atomic_compare_exchange;
>         __atomic_is_lock_free;
>
>         __atomic_add_fetch_1;
>         __atomic_add_fetch_2;
>         __atomic_add_fetch_4;
>         __atomic_add_fetch_8;
>         __atomic_add_fetch_16;
>         __atomic_and_fetch_1;
>         __atomic_and_fetch_2;
>         __atomic_and_fetch_4;
>         __atomic_and_fetch_8;
>         __atomic_and_fetch_16;
>         __atomic_compare_exchange_1;
>         __atomic_compare_exchange_2;
>         __atomic_compare_exchange_4;
>         __atomic_compare_exchange_8;
>         __atomic_compare_exchange_16;
>         __atomic_exchange_1;
>         __atomic_exchange_2;
>         __atomic_exchange_4;
>         __atomic_exchange_8;
>         __atomic_exchange_16;
>         __atomic_fetch_add_1;
>         __atomic_fetch_add_2;
>         __atomic_fetch_add_4;
>         __atomic_fetch_add_8;
>         __atomic_fetch_add_16;
>         __atomic_fetch_and_1;
>         __atomic_fetch_and_2;
>         __atomic_fetch_and_4;
>         __atomic_fetch_and_8;
>         __atomic_fetch_and_16;
>         __atomic_fetch_nand_1;
>         __atomic_fetch_nand_2;
>         __atomic_fetch_nand_4;
>         __atomic_fetch_nand_8;
>         __atomic_fetch_nand_16;
>         __atomic_fetch_or_1;
>         __atomic_fetch_or_2;
>         __atomic_fetch_or_4;
>         __atomic_fetch_or_8;
>         __atomic_fetch_or_16;
>         __atomic_fetch_sub_1;
>         __atomic_fetch_sub_2;
>         __atomic_fetch_sub_4;
>         __atomic_fetch_sub_8;
>         __atomic_fetch_sub_16;
>         __atomic_fetch_xor_1;
>         __atomic_fetch_xor_2;
>         __atomic_fetch_xor_4;
>         __atomic_fetch_xor_8;
>         __atomic_fetch_xor_16;
>         __atomic_load_1;
>         __atomic_load_2;
>         __atomic_l
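The three address cases quoted above for __atomic_is_lock_free
correspond to GCC's __atomic_is_lock_free (size, ptr) built-in. The
snippet below is only an illustration of those cases: the results are
target-dependent (in particular on cmpxchg16b availability for the
16-byte queries), the faked-address value 8 is just one way to encode
"8-byte aligned", and on some targets the program must be linked with
-latomic.

// Exercises the three kinds of second argument the ABI draft allows.
#include <cstdio>

int
main ()
{
  alignas (16) unsigned char buf[16];

  // 1. The address of an actual object.
  std::printf ("object: %d\n", (int) __atomic_is_lock_free (16, buf));

  // 2. A faked address that solely indicates alignment (here 8-byte).
  std::printf ("faked:  %d\n", (int) __atomic_is_lock_free (16, (void *) 8));

  // 3. NULL: the object's alignment is assumed to match its size.
  std::printf ("null:   %d\n", (int) __atomic_is_lock_free (16, nullptr));
}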