Re: [mem-ssa] Updated documentation
Hi Diego,

In the example of dynamic partitioning below (Figure 6), I don't understand why MEM7 is not killed in line 13 but is killed later in line 20. As far as I understand, in line 13 'c' is in the alias set, and its current definition is MEM7, so it must be killed by the store in line 14. What am I missing?

Thanks,
Ira

p5 points-to {a, b, c}
q6 points-to {b, c}

CD(v) means that the generated MEMi name is the "current definition" for v.
LU(v) looks up the "current definition" for v.
The initial SSA name for MEM is MEM7.

 1  ...
 2  # MEM8 = VDEF => CD(a)
 3  a = 2
 4
 5  # MEM10 = VDEF => CD(b)
 6  b = 5
 7
 8  # VUSE => LU(b)
 9  b.311 = b
10
11  D.153612 = b.311 + 3
12
13  # MEM25 = VDEF => CD(a, b, c)
14  *p5 = D.153612
15
16  # VUSE => CD(b)
17  b.313 = b
18  D.153714 = 10 - b.313
19
20  # MEM26 = VDEF => CD(b, c)
21  *q6 = D.153714
22
23  # VUSE => LU(a)
24  a.415 = a
25
26  # MEM17 = VDEF => CD(SFT.2)
27  X.x = a.415
28  return
}
Re: Scheduling an early complete loop unrolling pass?
Dorit Nuzman/Haifa/IBM wrote on 05/02/2007 21:13:40:

> Richard Guenther <[EMAIL PROTECTED]> wrote on 05/02/2007 17:59:00:
>
> > On Mon, 5 Feb 2007, Paolo Bonzini wrote:
> >
> > > > As we also only vectorize innermost loops I believe doing a
> > > > complete unrolling pass early will help in general (I pushed
> > > > for this some time ago).
> > > >
> > > > Thoughts?
> > >
> > > It might also hurt, though, since we don't have a basic block vectorizer.
> > > IIUC the vectorizer is able to turn
> > >
> > >   for (i = 0; i < 4; i++)
> > >     v[i] = 0.0;
> > >
> > > into
> > >
> > >   *(vector double *)v = (vector double){0.0, 0.0, 0.0, 0.0};
> >
> > That's true.
>
> That's going to change once this project goes in: "(3.2) Straight-line
> code vectorization" from http://gcc.gnu.org/wiki/AutovectBranchOptimizations.
> In fact, I think in autovect-branch, if you unroll the above loop it
> should get vectorized already. Ira - is that really the case?

The completely unrolled loop will not get vectorized because the code will
not be inside any loop (and our SLP implementation will focus, at least as
a first step, on loops). The following will get vectorized (without
permutation on autovect branch, and with redundant permutation on
mainline):

for (i = 0; i < n; i++)
  {
    v[4*i] = 0.0;
    v[4*i + 1] = 0.0;
    v[4*i + 2] = 0.0;
    v[4*i + 3] = 0.0;
  }

The original completely unrolled loop will get vectorized if it is
encapsulated in an outer loop, like so:

for (j=0; j
vcond implementation in altivec
Hi,

We were looking at the implementation of vcond for altivec and we have a
couple of questions. vcond has 6 operands; rs6000_emit_vector_cond_expr
is called from the define_expand for "vcond". It gets those operands in
their original order, as in vcond, and emits op0 = (op4 cond op5 ? op1 :
op2), where cond is op3.

Here is vcond for vector short (vconduv8hi, vcondv16qi, and vconduv16qi
are similar):

(define_expand "vcondv8hi"
  [(set (match_operand:V4SF 0 "register_operand" "=v")
        (unspec:V8HI [(match_operand:V4SI 1 "register_operand" "v")
                      (match_operand:V8HI 2 "register_operand" "v")
                      (match_operand:V8HI 3 "comparison_operator" "")
                      (match_operand:V8HI 4 "register_operand" "v")
                      (match_operand:V8HI 5 "register_operand" "v")
                      ] UNSPEC_VCOND_V8HI))]
  "TARGET_ALTIVEC"
  "
{
  if (rs6000_emit_vector_cond_expr (operands[0], operands[1], operands[2],
                                    operands[3], operands[4], operands[5]))
    DONE;
  else
    FAIL;
}
")

Is there a reason why op0 is V4SF and op1 is V4SI (and not V8HI)?

In V4SF, op1 is V4SI:

(define_expand "vcondv4sf"
  [(set (match_operand:V4SF 0 "register_operand" "=v")
        (unspec:V4SF [(match_operand:V4SI 1 "register_operand" "v")
                      (match_operand:V4SF 2 "register_operand" "v")
                      (match_operand:V4SF 3 "comparison_operator" "")
                      (match_operand:V4SF 4 "register_operand" "v")
                      (match_operand:V4SF 5 "register_operand" "v")
                      ] UNSPEC_VCOND_V4SF))]
  "TARGET_ALTIVEC"
  "
{
  if (rs6000_emit_vector_cond_expr (operands[0], operands[1], operands[2],
                                    operands[3], operands[4], operands[5]))
    DONE;
  else
    FAIL;
}
")

Same question: is there a reason for op1 to be V4SI? And also, why not
use if_then_else instead of unspec (in all vcond's)?

Thanks,
Sa and Ira
Re: Vector permutation only deals with # of vector elements same as mask?
Hi,

"Bingfeng Mei" wrote on 10/02/2011 05:35:45 PM:

> Hi,
> I noticed that vector permutation gets more use in GCC 4.6, which is
> great. It is used to handle negative step by reversing vector elements
> now.
>
> However, after reading the related code, I understood that it only
> works when the # of vector elements is the same as that of the mask
> vector in the following code.
>
> perm_mask_for_reverse (tree-vect-stmts.c)
> ...
> mask_type = get_vectype_for_scalar_type (mask_element_type);
> nunits = TYPE_VECTOR_SUBPARTS (vectype);
> if (!mask_type
>     || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS (mask_type))
>   return NULL;
> ...
>
> For PowerPC altivec, the mask_type is V16QI. It means that the compiler
> can only permute the V16QI type. But given the capability of the
> altivec vperm instruction, it can permute any 128-bit type (V8HI, V4SI,
> etc.). We just need to convert in/out of V16QI from the given types and
> do a bit more extra work in producing the mask.
>
> Do I understand correctly, or am I missing something here?

Yes, you are right. The support of reverse access is somewhat limited.
Please see vect_transform_slp_perm_load() in tree-vect-slp.c for an
example of permutation support for all types. But, anyway, reverse
accesses are not supported for altivec's load realignment scheme.

Ira

> Thanks,
> Bingfeng Mei
Re: Fw: RFC: Representing vector lane load/store operations
>> ...Ira would know best, but I don't think it would be used for this
>> kind of loop. It would be more something like:
>>
>> for (i=0; i
>> X[i] = Y[i].red + Y[i].blue + Y[i].green;
>>
>> (not a realistic example). You'd then have:
>>
>> compoundY = __builtin_load_lanes (Y);
>> red = ARRAY_REF
>> green = ARRAY_REF
>> blue = ARRAY_REF
>> D1 = red + green
>> D2 = D1 + blue
>> MEM_REF = D2;
>>
>> My understanding is that we'd never do any operations besides
>> ARRAY_REFs on the compound value, and that the individual vectors
>> would be treated pretty much like any other.
>
> Ok, I thought it might be used to have a larger vectorization factor
> for loads and stores, basically make further unrolling cheaper because
> you don't have to duplicate the loads and stores.

Right, we can do that using vld1/vst1 instructions (full load/store with
N=1) and operate on up to 4 doubleword vectors in parallel. But at the
moment we are concentrating on efficient support of strided memory
accesses.

Ira
Re: Strange vect.exp test results
gcc-ow...@gcc.gnu.org wrote on 30/05/2011 06:36:36 PM:

> Hi,
>
> I've been playing with the vectorizer for my port, and of course I use
> the testsuite to check the generated code. I fail to understand some
> of the FAILs I get. For example, in slp-3.c, the test contains:
>
> /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" {
> xfail vect_no_align } } } */
>
> This test fails for me because I get 4 vectorized loops instead of 3.
> There are multiple other tests that generate more vectorization than
> expected. I'd like to understand the reason for these failures, but I
> can't see what motivates the choice of only 3 vectorized loops among
> the 4 vectorizable loops of the test. Can someone enlighten me?

The fourth loop (line 104) has only 3 scalar iterations, too few to
vectorize unless your target has vectors of 2 shorts.

Ira

> Many thanks,
> Fred
Re: Strange vect.exp test results
Frederic Riss wrote on 31/05/2011 12:34:35 PM:

> Hi Ira,
>
> thanks for your answer, however:
>
> On 31 May 2011 08:06, Ira Rosen wrote:
> >> This test fails for me because I get 4 vectorized loops instead of 3.
> >> There are multiple other tests that generate more vectorization than
> >> expected. I'd like to understand the reason for these failures, but I
> >> can't see what motivates the choice of only 3 vectorized loops among
> >> the 4 vectorizable loops of the test. Can someone enlighten me?
> >
> > The fourth loop (line 104) has only 3 scalar iterations, too few to
> > vectorize unless your target has vectors of 2 shorts.
>
> My port has vectors of 2 shorts, but I don't expose them directly to
> GCC. The V2HI type is defined, but UNITS_PER_SIMD_WORD always returns
> 8, which I believe should prompt GCC to use V4HI, which is also
> defined.
>
> Regarding slp-3.c, I don't get why the loop you point at isn't
> vectorizable. In my version of the file (4.5 branch), I see 9 short
> copies in a loop iterating 4 times (a total of 36 short assignments).
> After the vectorization pass, I get 9 V4HI assignments which seem
> totally right. I don't see why this shouldn't be the case...

You are right. slp-3.c was fixed recently (revision 171569) on trunk for
targets with V4HI. I think there are other tests as well that fail
because of the vector size assumption. I'm planning to fix them.

Ira

> Many thanks,
> Fred
Re: SLP vectorizer on non-loop?
gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:

> Hello,
> I have one example with two very similar loops. The cunrolli pass
> unrolls one loop completely but not the other, based on slightly
> different cost estimations. The not-unrolled loop gets SLP-vectorized,
> then unrolled by the "cunroll" pass, whereas the other, unrolled loop
> cannot be vectorized since it is not a loop any more. In the end,
> there is a big difference in performance between the two loops.

Here is what I see with the current trunk on x86_64 with -O3 (with the
two loops split into different functions):

The first loop, the one that doesn't get unrolled by cunrolli, gets loop
vectorized with -fno-vect-cost-model. With the cost model the
vectorization fails because the number of iterations is not sufficient
(the vectorizer tries to apply loop peeling in order to align the
accesses), the loop gets later unrolled by cunroll, and the basic block
gets vectorized by SLP.

The second loop, unrolled by cunrolli, also gets vectorized by SLP.

The *.optimized dumps look similar:

:
  vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
  MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
  return;

:
  vect_var_.7_57 = MEM[(int *)p_input_10(D)];
  MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
  return;

> My question is why SLP vectorization has to be performed on a loop (it
> is a sub-pass under pass_tree_loop). Conceptually, cannot it be done
> on any basic block? Our port is still stuck at 4.5. But I checked 4.7,
> and it seems still the same. I also checked functions in
> tree-vect-slp.c. They use a lot of loop_vinfo structures. But in some
> places they check whether loop_vinfo exists to use it, or an
> alternative. I tried to add an extra SLP pass after pass_tree_loop,
> but it didn't work. I wonder how easy it is to make SLP work for
> non-loop code.

SLP vectorization works both on loops (in the vectorize pass) and on
basic blocks (in the slp-vectorize pass).

Ira

> Thanks,
> Bingfeng Mei
>
> Broadcom UK
>
> void foo (int *__restrict__ temp_hist_buffer,
>           int * __restrict__ p_hist_buff,
>           int *__restrict__ p_input)
> {
>   int i;
>   for(i=0;i<4;i++)
>     temp_hist_buffer[i]=p_hist_buff[i];
>
>   for(i=0;i<4;i++)
>     temp_hist_buffer[i+4]=p_input[i];
> }
RE: SLP vectorizer on non-loop?
"Bingfeng Mei" wrote on 01/11/2011 01:25:14 PM:

> Ira,
> Thank you very much for the quick answer. I will check 4.7 x86-64
> to see the difference from our port. Is there a significant change
> between 4.5 & 4.7 regarding SLP?

Yes, I think so. 4.5 can't SLP data accesses with unknown alignment that
you have here.

Ira

> Cheers,
> Bingfeng
>
> > -----Original Message-----
> > From: Ira Rosen [mailto:i...@il.ibm.com]
> > Sent: 01 November 2011 11:13
> > To: Bingfeng Mei
> > Cc: gcc@gcc.gnu.org
> > Subject: Re: SLP vectorizer on non-loop?
> >
> > gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:
> >
> > > Hello,
> > > I have one example with two very similar loops. cunrolli pass
> > > unrolls one loop completely but not the other based on slightly
> > > different cost estimations. The not-unrolled loop gets
> > > SLP-vectorized, then unrolled by "cunroll" pass, whereas the other
> > > unrolled loop cannot be vectorized since it is not a loop any more.
> > > In the end, there is a big difference of performance between the
> > > two loops.
> >
> > Here is what I see with the current trunk on x86_64 with -O3 (with
> > the two loops split into different functions):
> >
> > The first loop, the one that doesn't get unrolled by cunrolli, gets
> > loop vectorized with -fno-vect-cost-model. With the cost model the
> > vectorization fails because the number of iterations is not
> > sufficient (the vectorizer tries to apply loop peeling in order to
> > align the accesses), the loop gets later unrolled by cunroll and the
> > basic block gets vectorized by SLP.
> >
> > The second loop, unrolled by cunrolli, also gets vectorized by SLP.
> >
> > The *.optimized dumps look similar:
> >
> > :
> >   vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
> >   MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
> >   return;
> >
> > :
> >   vect_var_.7_57 = MEM[(int *)p_input_10(D)];
> >   MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
> >   return;
> >
> > > My question is why SLP vectorization has to be performed on loop
> > > (it is a sub-pass under pass_tree_loop). Conceptually, cannot it be
> > > done on any basic block? Our port is still stuck at 4.5. But I
> > > checked 4.7, it seems still the same. I also checked functions in
> > > tree-vect-slp.c. They use a lot of loop_vinfo structures. But in
> > > some places they check whether loop_vinfo exists to use it or an
> > > alternative. I tried to add an extra SLP pass after pass_tree_loop,
> > > but it didn't work. I wonder how easy it is to make SLP work for
> > > non-loop code.
> >
> > SLP vectorization works both on loops (in the vectorize pass) and on
> > basic blocks (in the slp-vectorize pass).
> >
> > Ira
> >
> > > Thanks,
> > > Bingfeng Mei
> > >
> > > Broadcom UK
> > >
> > > void foo (int *__restrict__ temp_hist_buffer,
> > >           int * __restrict__ p_hist_buff,
> > >           int *__restrict__ p_input)
> > > {
> > >   int i;
> > >   for(i=0;i<4;i++)
> > >     temp_hist_buffer[i]=p_hist_buff[i];
> > >
> > >   for(i=0;i<4;i++)
> > >     temp_hist_buffer[i+4]=p_input[i];
> > > }
Re: targetm.vectorize.builtin_vec_perm
Richard Henderson wrote on 17/11/2009 03:39:42:

> What is this hook supposed to do? There is no description of its
> arguments.
>
> What is the theory of operation of permute within the vectorizer? Do
> you actually need variable permute, or would constants be ok?

It is currently used for a specific load permutation of RGB to YUV
conversion (http://gcc.gnu.org/ml/gcc-patches/2008-07/msg00445.html). The
arguments are the vector type and the mask type (the last one is returned
by the hook).

The permute is constant; it depends on the number of loads (group size)
and their type. However, there are cases, that we may want to support in
the future, that require a variable permute - indirect accesses, for
example.

> I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> the form:
>
>    VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
>
> which would be exactly equal to
>
>    (vec_select
>      (vec_concat vlow vhigh)
>      vperm)
>
> at the rtl level. I.e. vperm is an integral vector of the same number
> of elements as vlow.
>
> Truly variable permutation is something that's only supported by ppc
> and spu.

Also Altivec and SPU support byte permutation (and not only element
permutation); however, the vectorizer does not make use of this at
present.

> Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> elements can be rearranged but only within a 128-bit subvector.
> So if you're working with 128-bit vectors, it's fully variable, but if
> you're working with 256-bit vectors, it's like doing 2 128-bit permute
> operations in parallel. Intel before AVX has no variable permute.
>
> HOWEVER! Most of the useful permutations that I can think of for the
> optimizers to generate are actually constant. And these can be
> implemented everywhere (with varying degrees of efficiency).
>
> Anyway, I'm thinking that it might be better to add such a general
> operation instead of continuing to add things like
>
>    VEC_EXTRACT_EVEN_EXPR,
>    VEC_EXTRACT_ODD_EXPR,
>    VEC_INTERLEAVE_HIGH_EXPR,
>    VEC_INTERLEAVE_LOW_EXPR,
>
> and other obvious patterns like broadcast, duplicate even to odd,
> duplicate odd to even, etc.

If the back end will be able to identify specific masks, e.g., {0,2,4,6}
as an extract-even operation, then we can certainly remove those codes.

> I can imagine having some sort of target hook that computed a cost
> metric for a given constant permutation pattern. For instance, I'd
> imagine that the interleave patterns are half as expensive as a full
> permute for altivec, due to not having to load a mask. This hook would
> be fairly complicated for x86, given all of the permuting insns that
> were incrementally added in various ISA revisions, but such is life.
>
> In any case, would a VEC_PERMUTE_EXPR, as described above, work for
> the uses of builtin_vec_perm within the vectorizer at present?

Yes.

Ira

> r~
Re: targetm.vectorize.builtin_vec_perm
> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern. For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask. This hook
> > would be fairly complicated for x86, given all of the permuting
> > insns that were incrementally added in various ISA revisions, but
> > such is life.
>
> There should be some way to account for the difference between the
> cost in straight-line code, where a mask load is a hard cost, a large
> loop, where the load can be hoisted at the cost of some
> target-dependent register pressure (e.g. being able to use inverted
> masks might save half of the cost), and a tight loop, where the
> constant load can be easily amortized over the entire loop.

The vectorizer cost model already does that. AFAIU, the vectorizer cost
model will call the cost model hook to get the cost of a permute, and
then incorporate that cost into the general loop/basic block
vectorization cost.

Ira
Re: Vectorizing 16bit signed integers
gcc-ow...@gcc.gnu.org wrote on 11/12/2009 20:25:33:

> Allan Sandfeld Jensen
> Hi
>
> I hope someone can help me. I've been trying to write some tight
> integer loops in a way that could be auto-vectorized, saving me from
> writing assembler or using specific vectorization extensions.
> Unfortunately I've not yet managed to make gcc vectorize any of them.
>
> I've simplified the case to just perform the very first operation in
> the loop: converting from two's complement to sign-and-magnitude.
>
> I've then used -ftree-vectorizer-verbose to examine whether, and if
> not why, the loops were vectorized, but I am afraid I don't understand
> the output.
>
> The simplest version of the loop is here (it appears the branch is not
> a problem, but I have another version without).
>
> inline uint16_t transsign(int16_t v) {
>    if (v<0) {
>       return 0x8000U | (1-v);
>    } else {
>       return v;
>    }
> }
>
> It very simply converts in a fashion that maintains the full effective
> bit-width.
>
> The error from the vectorizer is:
> vectorizesign.cpp:42: note: not vectorized: relevant stmt not supported:
> v.1_16 = (uint16_t) D.2157_11;
>
> It appears the unsupported operation in vectorization is the typecast
> from int16_t to uint16_t; can this really be the case, or is the
> output misleading?

Yes, the problem is in the signed->unsigned cast. I think it is related
to PR 26128.

Ira

> If it is the case, then is there a good reason for it, or can I fix it
> myself by adding additional vectorizable operations?
>
> I've attached both the test case and the full output of
> -ftree-vectorizer-verbose=9.
>
> Best regards
> `Allan
>
> [attachment "vectorizesign.cpp" deleted by Ira Rosen/Haifa/IBM]
> [attachment "vectorizesign-debug.txt" deleted by Ira Rosen/Haifa/IBM]
Re: Autovectorizing does not work with classes
[EMAIL PROTECTED] wrote on 07/10/2008 10:48:29:

> Dear gcc developers,
>
> I am new to this list. I tried to use the auto-vectorization (4.2.1
> (SUSE Linux)) but unfortunately with limited success. My code is
> basically a matrix library in C++. The vectorizer does not like the
> member variables. Consider this code compiled with
>
> gcc -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5
> -funsafe-math-optimizations
>
> that gives basically "not vectorized: unhandled data-ref"

The unhandled data-ref here is sum. It is invariant in the loop, and
invariant data-refs are currently unsupported by the data dependence
analysis. If you can change your code to pass sum by value, it will get
vectorized (at least with gcc 4.3). This is not a C++-specific problem
(for me your C version does not get vectorized either, for the same
reason).

HTH,
Ira

> class P{
> public:
>   P() : m(5),n(3) {
>     double *d = data;
>     for (int i=0; i
>       d[i] = i/10.2;
>   }
>   void test(const double& sum);
> private:
>   int m;
>   int n;
>   double data[15];
> };
>
> void P::test(const double& sum) {
>   double *d = this->data;
>   for(int i=0; i
>     d[i]+=sum;
>   }
> }
>
> whereas the more or less equivalent C version works just fine:
>
> int m=5;
> int n=3;
> double data[15];
>
> void test(const double& sum) {
>   int mn = m*n;
>   for(int i=0; i
>     data[i]+=sum;
>   }
> }
>
> Is there a fundamental problem in using the vectorizer in C++?
>
> Regards!
>   Georg
>
> [attachment "signature.asc" deleted by Ira Rosen/Haifa/IBM]
Re: Merging the alias-improvements branch
> I will announce the time I am doing the last trunk -> alias-improvements
> branch merge and freeze the trunk for that.
>
> Thus, this is a heads-up - if I collide with your planned merge schedule
> just tell me and we can sort it out.

I was planning to commit the vectorizer reorganization patch
(http://gcc.gnu.org/ml/gcc-patches/2009-02/msg00573.html). Do you prefer
that I wait, so it doesn't disturb the merge?

Thanks,
Ira
Re: Merging the alias-improvements branch
Richard Guenther wrote on 29/03/2009 13:05:56:

> On Sun, 29 Mar 2009, Ira Rosen wrote:
>
> > > I will announce the time I am doing the last trunk ->
> > > alias-improvements branch merge and freeze the trunk for that.
> > >
> > > Thus, this is a heads-up - if I collide with your planned merge
> > > schedule just tell me and we can sort it out.
> >
> > I was planning to commit the vectorizer reorganization patch (
> > http://gcc.gnu.org/ml/gcc-patches/2009-02/msg00573.html). Do you
> > prefer that I wait, so it doesn't disturb the merge?
>
> If you can commit the patch soon (like, before wednesday) you can go
> ahead. The differences are not big (see attachment below for what
> is the difference between trunk and branch in tree-vect-*), so I think
> I can deal with the reorg just fine.

Great! I will commit it today or tomorrow then.

Thanks,
Ira

> Thanks,
> Richard.
>
> Index: gcc/tree-vectorizer.c
> ===
> --- gcc/tree-vectorizer.c (.../trunk) (revision 145210)
> +++ gcc/tree-vectorizer.c (.../branches/alias-improvements) (revision 145211)
> @@ -973,7 +973,7 @@ slpeel_can_duplicate_loop_p (const struc
>    gimple orig_cond = get_loop_exit_condition (loop);
>    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
>
> -  if (need_ssa_update_p ())
> +  if (need_ssa_update_p (cfun))
>      return false;
>
>    if (loop->inner
> Index: gcc/tree-vect-analyze.c
> ===
> --- gcc/tree-vect-analyze.c (.../trunk) (revision 145210)
> +++ gcc/tree-vect-analyze.c (.../branches/alias-improvements) (revision 145211)
> @@ -3563,16 +3563,6 @@ vect_analyze_data_refs (loop_vec_info lo
>        return false;
>      }
>
> -  if (!DR_SYMBOL_TAG (dr))
> -    {
> -      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
> -        {
> -          fprintf (vect_dump, "not vectorized: no memory tag for ");
> -          print_generic_expr (vect_dump, DR_REF (dr), TDF_SLIM);
> -        }
> -      return false;
> -    }
> -
>    base = unshare_expr (DR_BASE_ADDRESS (dr));
>    offset = unshare_expr (DR_OFFSET (dr));
>    init = unshare_expr (DR_INIT (dr));
> @@ -3804,7 +3794,7 @@ vect_stmt_relevant_p (gimple stmt, loop_
>
>    /* changing memory. */
>    if (gimple_code (stmt) != GIMPLE_PHI)
> -    if (!ZERO_SSA_OPERANDS (stmt, SSA_OP_VIRTUAL_DEFS))
> +    if (gimple_vdef (stmt))
>        {
>          if (vect_print_dump_info (REPORT_DETAILS))
>            fprintf (vect_dump, "vec_stmt_relevant_p: stmt has vdefs.");
> Index: gcc/tree-vect-transform.c
> ===
> --- gcc/tree-vect-transform.c (.../trunk) (revision 145210)
> +++ gcc/tree-vect-transform.c (.../branches/alias-improvements) (revision 145211)
> @@ -51,7 +51,7 @@ static bool vect_transform_stmt (gimple,
>      slp_tree, slp_instance);
>  static tree vect_create_destination_var (tree, tree);
>  static tree vect_create_data_ref_ptr
> -  (gimple, struct loop*, tree, tree *, gimple *, bool, bool *, tree);
> +  (gimple, struct loop*, tree, tree *, gimple *, bool, bool *);
>  static tree vect_create_addr_base_for_vector_ref
>    (gimple, gimple_seq *, tree, struct loop *);
>  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
> @@ -1009,7 +1009,7 @@ vect_create_addr_base_for_vector_ref (gi
>  static tree
>  vect_create_data_ref_ptr (gimple stmt, struct loop *at_loop,
>      tree offset, tree *initial_address, gimple *ptr_incr,
> -    bool only_init, bool *inv_p, tree type)
> +    bool only_init, bool *inv_p)
>  {
>    tree base_name;
>    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> @@ -1020,7 +1020,6 @@ vect_create_data_ref_ptr (gimple stmt, s
>    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>    tree vect_ptr_type;
>    tree vect_ptr;
> -  tree tag;
>    tree new_temp;
>    gimple vec_stmt;
>    gimple_seq new_stmt_list = NULL;
> @@ -1068,42 +1067,33 @@ vect_create_data_ref_ptr (gimple stmt, s
>      }
>
>    /** (1) Create the new vector-pointer variable: **/
> -  if (type)
> -    vect_ptr_type = build_pointer_type (type);
> -  else
> -    vect_ptr_type = build_pointer_type (vectype);
> -
> -  if (TREE_CODE (DR_BASE_ADDRESS (dr)) == SSA_NAME
> -      && TYPE_RESTRICT (TREE_TYPE (DR_BASE_ADDRESS (dr
> -    vect_ptr_type = build_qualified_type (vect_ptr_type, TYPE_QUAL_RESTRICT);
> +  vect_ptr_type = build_pointer_ty
Re: Inner loop unable to compute sufficient information during vectorization
gcc-ow...@gcc.gnu.org wrote on 25/05/2009 21:53:41:

> for a loop like
>
> 1 for(i=0;i
> 2 for(j=0;j
> 3 a[i][j] = a[i][j]+b[i][j];
>
> GCC 4.3.* is unable to get the information for the inner loop that the
> array references to 'a' alias each other, and generates code for a
> runtime aliasing check during vectorization.

Both current trunk and GCC 4.4 vectorize the inner loop without any
runtime alias checks.

> Is it necessary to recompute all information in loop_vec_info in
> function vect_analyze_ref for the analysis of the inner loop also, as
> most of the information is similar to that of the outer loop for the
> program.

Maybe you are right, and it is possible to extract at least part of the
information for the inner loop from the outer loop information.

> Similarly, the outer loop is able to compute the correct chrec, i.e.
> NULL, for the array 'a' reference, while the inner loop has the chrec
> chrec_dont_know, and therefore complains about the runtime alias
> check.

The chrecs are not the same for inner and outer loops, so it is
reasonable that the results of the data dependence tests will be
different. In this case, however, it seems to be a bug.

Ira
Re: Inner loop unable to compute sufficient information during vectorization
Abhishek Shrivastav wrote on 31/05/2009 16:44:34:

> In this case, I think that the outer loop could be vectorized as there
> is no dependency in the loop, the access pattern is simple enough, and
> there is unit stride in both loops. The current version 4.4.* is not
> doing outer loop vectorization.

The memory accesses are consecutive in the inner loop and strided in the
outer loop. Therefore, inner loop vectorization is preferable in this
case (and also strided accesses are not yet supported in outer loop
vectorization).

Ira

> On Tue, May 26, 2009 at 5:57 PM, Ira Rosen wrote:
> >
> > gcc-ow...@gcc.gnu.org wrote on 25/05/2009 21:53:41:
> >
> >> for a loop like
> >>
> >> 1 for(i=0;i
> >> 2 for(j=0;j
> >> 3 a[i][j] = a[i][j]+b[i][j];
> >>
> >> GCC 4.3.* is unable to get the information for the inner loop that
> >> array reference 'a' is alias of each other and generates code for
> >> runtime aliasing check during vectorization.
> >
> > Both current trunk and GCC 4.4 vectorize the inner loop without any
> > runtime alias checks.
> >
> >> Is it necessary to recompute all information in loop_vec_info in
> >> function vect_analyze_ref for analysis of inner loop also, as most
> >> of the information is similar for the outer loop for the program.
> >
> > Maybe you are right, and it is possible to extract at least part of
> > the information for the inner loop from the outer loop information.
> >
> >> Similarly, outer loop is able to compute correct chrec i.e. NULL,
> >> for array 'a' reference, while inner loop has chrec as
> >> chrec_dont_know, and therefore complains about runtime alias check.
> >
> > The chrecs are not the same for inner and outer loops, so it is
> > reasonable that the results of the data dependence tests will be
> > different. In this case, however, it seems to be a bug.
> >
> > Ira
Re: Loops no longer vectorized
gcc-ow...@gcc.gnu.org wrote on 28/05/2010 03:52:30 PM:

> Hi,
>
> I just noticed today that (implicit) loops of the kind
>
> xmin = minval(nodes(1,inductor_number(1:number_of_nodes)))
>
> (lines 5057 to 5062 of the polyhedron test induct.f90) are no longer
> vectorized (the change occurred between revisions 158215 and 158921).
> With -ftree-vectorizer-verbose=6, I got
>
> induct.f90:5057: note: not vectorized: data ref analysis failed
> D.6088_872 = (*D.4001_143)[D.6087_871];
>
> induct.f90:5057: note: Alignment of access forced using peeling.
> induct.f90:5057: note: Vectorizing an unaligned access.
> induct.f90:5057: note: vect_model_load_cost: unaligned supported by hardware.
> induct.f90:5057: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
> induct.f90:5057: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
> induct.f90:5057: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
> induct.f90:5057: note: cost model: prologue peel iters set to vf/2.
> induct.f90:5057: note: cost model: epilogue peel iters set to vf/2
> because peeling for alignment is unknown .
> induct.f90:5057: note: Cost model analysis:
>   Vector inside of loop cost: 6
>   Vector outside of loop cost: 20
>   Scalar iteration cost: 3
>   Scalar outside cost: 7
>   prologue iterations: 2
>   epilogue iterations: 2
>   Calculated minimum iters for profitability: 5
>
> induct.f90:5057: note: Profitability threshold = 4
>
> induct.f90:5057: note: Profitability threshold is 4 loop iterations.
> induct.f90:5057: note: LOOP VECTORIZED.
>
> and now:
>
> induct.f90:5057: note: not vectorized: data ref analysis failed
> D.6017_848 = (*D.4001_131)[D.6016_847];
>
> Is this known/expected or should I open a new PR?

The loop that computes MIN_EXPR is not vectorizable because of the
indirect access. You see for both versions:

induct.f90:5057: note: not vectorized: data ref analysis failed
D.6017_848 = (*D.4001_131)[D.6016_847];

The loop that got vectorized in the older revision is another loop
associated with the same source code line:

:
  # S.648_810 = PHI
  S.648_856 = S.648_810 + 1;
  D.6082_858 = (*D.4108_840)[S.648_810];
  D.6083_859 = (integer(kind=8)) D.6082_858;
  (*pretmp.3557_2254)[S.648_810] = D.6083_859;
  if (D.4111_844 < S.648_856)
    goto ;
  else
    goto ;

And in the later revision this loop is replaced with:

:
  D.6008_833 = &(*D.5896_830)[0];
  pretmp.3873_1387 = (integer(kind=4)[0:] *) D.6008_833;

So, there is no loop now.

Ira

> Cheers
>
> Dominique
Re: Target macros vs. target hooks - policy/goal is hooks, isn't it?
Steven Bosscher wrote on 02/06/2010 06:13:36 PM: > > On Wed, May 26, 2010 at 7:16 PM, Mark Mitchell wrote: > > Ulrich Weigand wrote: > > > >>> So the question is: The goal is to have hooks, not macros, right? If > >>> so, can reviewers please take care to reject patches that introduce > >>> new macros? > >> > >> I don't know to which extent this is a formal goal these days, but I > >> personally agree that it would be nice to eliminate macros. > > > > Yes, the (informally agreed) policy is to have hooks, not macros. There > > may be situations where that is technically impossible, but I'd expect > > those to be very rare. > > Another batch of recently introduced target macros instead of target hooks: Not so recently - three years ago. > > tree-vectorizer.h:#ifndef TARG_COND_TAKEN_BRANCH_COST > tree-vectorizer.h:#ifndef TARG_COND_NOT_TAKEN_BRANCH_COST > tree-vectorizer.h:#ifndef TARG_SCALAR_STMT_COST > tree-vectorizer.h:#ifndef TARG_SCALAR_LOAD_COST > tree-vectorizer.h:#ifndef TARG_SCALAR_STORE_COST > tree-vectorizer.h:#ifndef TARG_VEC_STMT_COST > tree-vectorizer.h:#ifndef TARG_VEC_TO_SCALAR_COST > tree-vectorizer.h:#ifndef TARG_SCALAR_TO_VEC_COST > tree-vectorizer.h:#ifndef TARG_VEC_LOAD_COST > tree-vectorizer.h:#ifndef TARG_VEC_UNALIGNED_LOAD_COST > tree-vectorizer.h:#ifndef TARG_VEC_STORE_COST > tree-vectorizer.h:#ifndef TARG_VEC_PERMUTE_COST > > Could the vectorizer folks please turn these into target hooks? OK, I'll do that. Ira > > Ciao! > Steven
Re: Target macros vs. target hooks - policy/goal is hooks, isn't it?
Richard Guenther wrote on 03/06/2010 02:00:00 PM: > >> tree-vectorizer.h:#ifndef TARG_COND_TAKEN_BRANCH_COST > >> tree-vectorizer.h:#ifndef TARG_COND_NOT_TAKEN_BRANCH_COST > >> tree-vectorizer.h:#ifndef TARG_SCALAR_STMT_COST > >> tree-vectorizer.h:#ifndef TARG_SCALAR_LOAD_COST > >> tree-vectorizer.h:#ifndef TARG_SCALAR_STORE_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_STMT_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_TO_SCALAR_COST > >> tree-vectorizer.h:#ifndef TARG_SCALAR_TO_VEC_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_LOAD_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_UNALIGNED_LOAD_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_STORE_COST > >> tree-vectorizer.h:#ifndef TARG_VEC_PERMUTE_COST > >> > >> Could the vectorizer folks please turn these into target hooks? > > Btw, a single cost target hook with an enum argument would be > preferred here. Where is the best place to define such enum? Thanks, Ira > > Richard. > > > OK, I'll do that. > > > > Ira > > > >> > >> Ciao! > >> Steven > > > >
Re: Why doesn't the vectorizer skip loop peeling/versioning for targets that support hardware misaligned access?
Hi, gcc-ow...@gcc.gnu.org wrote on 24/01/2011 03:21:51 PM: > Hello, > Some of our target processors support complete hardware misaligned > memory access. I implemented movmisalignm patterns, and found > TARGET_SUPPORT_VECTOR_MISALIGNMENT > (TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT > On 4.6) hook is based on checking these patterns. Somehow this > hook doesn't seem to be used. vect_enhance_data_refs_alignment > is called regardless whether the target has HW misaligned support > or not. targetm.vectorize.support_vector_misalignment is used in vect_supportable_dr_alignment to decide whether a specific misaligned access is supported. > > Shouldn't using HW misaligned memory access be better than > generating extra code for loop peeling/versioning? Or at least > if for some architectures it is not the case, we should have > a compiler hook to choose between them. BTW, I mainly work > on 4.5, maybe 4.6 has changed. Right. And we have that implemented in 4.6 at least partially: for known misalignment and for peeling for loads. Maybe this part needs to be enhanced, concrete testcases could help. Ira > > Thanks, > Bingfeng Mei >
Re: Documentation for loop infrastructure
> Here is the documentation for the data dependence analysis. I can add a description of data-refs creation/analysis if it is useful. Ira
Re: Type yielded by TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD hook?
Hi,

Does this patch fix the problem?

Ira

Index: tree-vect-transform.c
===================================================================
--- tree-vect-transform.c	(revision 117002)
+++ tree-vect-transform.c	(working copy)
@@ -1916,10 +1916,10 @@ vectorizable_load (tree stmt, block_stmt
 	  /* Create permutation mask, if required, in loop preheader.  */
 	  tree builtin_decl;
 	  params = build_tree_list (NULL_TREE, init_addr);
-	  vec_dest = vect_create_destination_var (scalar_dest, vectype);
 	  builtin_decl = targetm.vectorize.builtin_mask_for_load ();
 	  new_stmt = build_function_call_expr (builtin_decl, params);
-	  new_stmt = build2 (MODIFY_EXPR, vectype, vec_dest, new_stmt);
+	  vec_dest = vect_create_destination_var (scalar_dest, TREE_TYPE (new_stmt));
+	  new_stmt = build2 (MODIFY_EXPR, TREE_TYPE (vec_dest), vec_dest, new_stmt);
 	  new_temp = make_ssa_name (vec_dest, new_stmt);
 	  TREE_OPERAND (new_stmt, 0) = new_temp;
 	  new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);

Dorit Nuzman/Haifa/IBM wrote on 16/09/2006 12:37:28:

> > I'm trying to add a hook for aligning vectors for loads.
> >
> > I'm using the altivec rs6000 code as a baseline.
> >
> > However, the instruction is like the iwmmxt_walign instruction in the
> > ARM port; it takes a normalish register and uses the bottom bits...
> > it doesn't use a full-width vector.
> >
> > GCC complains when my builtin pointed to by
> > TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD yields a QImode result, because
> > it has no way of converting that to the vector mode it is expecting. I
>
> Looks like it's a bug in the vectorizer - we treat both the return
> value of the mask_for_load builtin, and the 3rd argument to the
> realign_load stmt (e.g. Altivec's vperm), as variables of type
> 'vectype', instead of obtaining the type from the target machine
> description. All we need to care about is that these two variables
> have the same type.
>
> I'll look into that
>
> dorit
>
> > think the altivec side would have a similar problem, as the expected
> > output RTX is:
> >
> > (reg:V8HI 131 [ vect_var_.2540 ])
> >
> > but it changes that to:
> >
> > (reg:V16QI 160)
> >
> > for the VLSR instruction. V16QImode is what VPERM expects, and I
> > think since V8HI and V16QI mode are the same size everyone is happy.
> >
> > Is there a way to tell GCC what the type of the
> > TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD should be? Looking at
> > http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gccint/Addressing-Modes.html#Addressing-Modes
> > it reads like it must merely match the last operand of the
> > vec_realign_load_ pattern.
> >
> > --
> > Why are ``tolerant'' people so intolerant of intolerant people?
Re: Type yielded by TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD hook?
"Erich Plondke" <[EMAIL PROTECTED]> wrote on 20/09/2006 04:09:14: > On 9/19/06, Erich Plondke <[EMAIL PROTECTED]> wrote: > > On 9/19/06, Ira Rosen <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > > > Does this patch fix the problem? > > > > Well... seems pretty good. I get the instruction generated from the > > builtin, and it lives outside the body of the loop. > > > > GCC then moves the value out of the special register, zero extends it, > > and moves it back into the special register uselessly. :-( But I > > have the feeling that I have something else in my backend to blame for > > that. > > Yes, it's because the special register is a QI and the PROMOTE_MODE > macro always > says to promote the QI to an SI. > > So the patch looks great! Thanks! Great, I'll prepare a patch for the mainline then. Ira > > -- > Why are ``tolerant'' people so intolerant of intolerant people?
Re: Documentation for loop infrastructure
Sebastian Pop <[EMAIL PROTECTED]> wrote on 08/09/2006 18:04:01:

> Ira Rosen wrote:
> >
> > Here is the documentation for the data dependence analysis.
> >
> > I can add a description of data-refs creation/analysis if it is useful.
> >
>
> That's a good idea, thanks.
>
> Sebastian

Here it is.

Ira

> The data references are discovered in a particular order during the
> scanning of the loop body: the loop body is analyzed in execution
> order, and the data references of each statement are pushed at the end
> of the data reference array. Two data references syntactically occur
> in the program in the same order as in the array of data references.
> This syntactic order is important in some classical data dependence
> tests, and mapping this order to the elements of this array avoids
> costly queries to the loop body representation.

Three types of data references are currently handled: ARRAY_REF,
INDIRECT_REF and COMPONENT_REF. The data structure for the data reference
is @code{data_reference}, where @code{data_reference_p} is a name of a
pointer to the data reference structure. The structure contains the
following elements:

@itemize
@item @code{base_object_info}: Provides information about the base object
of the data reference and its access functions. These access functions
represent the evolution of the data reference in the loop relative to its
base, in keeping with the classical meaning of the data reference access
function for the support of arrays. For example, for a reference
@code{a.b[i][j]}, the base object is @code{a.b} and the access functions,
one for each array subscript, are:
@{i_init, +, i_step@}, @{j_init, +, j_step@}

@item @code{first_location_in_loop}: Provides information about the first
location accessed by the data reference in the loop and about the access
function used to represent evolution relative to this location. This data
is used to support pointers, and is not used for arrays (for which we
have base objects). Pointer accesses are represented as a one-dimensional
access that starts from the first location accessed in the loop. For
example:

@smallexample
for i
  for j
    *((int *)p + i + j) = a[i][j];
@end smallexample

The access function of the pointer access is @{0, +, 4B@} relative to
@code{p + i}. The access functions of the array are
@{i_init, +, i_step@} and @{j_init, +, j_step@} relative to @code{a}.

Usually, the object the pointer refers to is either unknown, or we can't
prove that the access is confined to the boundaries of a certain object.

Two data references can be compared only if at least one of these two
representations has all its fields filled for both data references.

The current strategy for data dependence tests is as follows:
If both @code{a} and @code{b} are represented as arrays, compare
@code{a.base_object} and @code{b.base_object};
if they are equal, apply dependence tests (use access functions based on
base_objects).
Else if both @code{a} and @code{b} are represented as pointers, compare
@code{a.first_location} and @code{b.first_location};
if they are equal, apply dependence tests (use access functions based on
first location).
However, if @code{a} and @code{b} are represented differently, only try
to prove that the bases are definitely different.

@item Aliasing information.
@item Alignment information.
@end itemize

> The structure describing the relation between two data references is
> @code{data_dependence_relation} and the shorter name for a pointer to
> such a structure is @code{ddr_p}. This structure contains:
Re: Documentation for loop infrastructure
Sebastian Pop <[EMAIL PROTECTED]> wrote on 26/09/2006 21:24:18:

> It is probably better to include the loop indexes in the example, and
> modify the syntax of the scev for making it more explicit, like:
>
> @smallexample
> for1 i
>   for2 j
>     *((int *)p + i + j) = a[i][j];
> @end smallexample
>
> and the access function becomes: @{0, +, 4B@}_for1
>

Done. I guess, I'll commit my part as soon as loop.texi (and the
Dependency analysis part) is committed.

Ira

> The data references are discovered in a particular order during the
> scanning of the loop body: the loop body is analyzed in execution
> order, and the data references of each statement are pushed at the end
> of the data reference array. Two data references syntactically occur
> in the program in the same order as in the array of data references.
> This syntactic order is important in some classical data dependence
> tests, and mapping this order to the elements of this array avoids
> costly queries to the loop body representation.

Three types of data references are currently handled: ARRAY_REF,
INDIRECT_REF and COMPONENT_REF. The data structure for the data reference
is @code{data_reference}, where @code{data_reference_p} is a name of a
pointer to the data reference structure. The structure contains the
following elements:

@itemize
@item @code{base_object_info}: Provides information about the base object
of the data reference and its access functions. These access functions
represent the evolution of the data reference in the loop relative to its
base, in keeping with the classical meaning of the data reference access
function for the support of arrays. For example, for a reference
@code{a.b[i][j]}, the base object is @code{a.b} and the access functions,
one for each array subscript, are:
@{i_init, +, i_step@}, @{j_init, +, j_step@}

@item @code{first_location_in_loop}: Provides information about the first
location accessed by the data reference in the loop and about the access
function used to represent evolution relative to this location. This data
is used to support pointers, and is not used for arrays (for which we
have base objects). Pointer accesses are represented as a one-dimensional
access that starts from the first location accessed in the loop. For
example:

@smallexample
for1 i
  for2 j
    *((int *)p + i + j) = a[i][j];
@end smallexample

The access function of the pointer access is @{0, +, 4B@}_for1 relative
to @code{p + i}. The access functions of the array are
@{i_init, +, i_step@}_for1 and @{j_init, +, j_step@}_for2 relative to
@code{a}.

Usually, the object the pointer refers to is either unknown, or we can't
prove that the access is confined to the boundaries of a certain object.

Two data references can be compared only if at least one of these two
representations has all its fields filled for both data references.

The current strategy for data dependence tests is as follows:
If both @code{a} and @code{b} are represented as arrays, compare
@code{a.base_object} and @code{b.base_object};
if they are equal, apply dependence tests (use access functions based on
base_objects).
Else if both @code{a} and @code{b} are represented as pointers, compare
@code{a.first_location} and @code{b.first_location};
if they are equal, apply dependence tests (use access functions based on
first location).
However, if @code{a} and @code{b} are represented differently, only try
to prove that the bases are definitely different.

@item Aliasing information.
@item Alignment information.
@end itemize

> The structure describing the relation between two data references is
> @code{data_dependence_relation} and the shorter name for a pointer to
> such a structure is @code{ddr_p}. This structure contains:
Re: Documentation for loop infrastructure
Zdenek Dvorak <[EMAIL PROTECTED]> wrote on 28/09/2006 15:04:07: > > I have commited the documentation, including the parts from Daniel and > Sebastian (but not yours) now. > > Zdenek I've committed my part. Ira
Added myself to MAINTAINERS (write after approval)
Index: MAINTAINERS
===================================================================
RCS file: /cvs/gcc/gcc/MAINTAINERS,v
retrieving revision 1.395
diff -c -3 -p -r1.395 MAINTAINERS
*** MAINTAINERS	14 Feb 2005 11:21:09 -	1.395
--- MAINTAINERS	17 Feb 2005 08:50:31 -
*************** Volker Reichelt [EMAIL PROTECTED]
*** 287,292 ****
--- 287,293 ----
  Tom Rix [EMAIL PROTECTED]
  Craig Rodrigues [EMAIL PROTECTED]
  Gavin Romig-Koch [EMAIL PROTECTED]
+ Ira Rosen [EMAIL PROTECTED]
  Ira Ruben [EMAIL PROTECTED]
  Douglas Rupp [EMAIL PROTECTED]
  Matthew Sachs [EMAIL PROTECTED]
Re: Mainline is now regression and documentation fixes only
Dorit Nuzman/Haifa/IBM wrote on 23/01/2008 21:49:51: > There are however a couple of small cost-model changes that were > going to be submitted this week for the Cell SPU - it's unfortunate > if these cannot get into 4.3. It's indeed unfortunate. However, those changes are not crucial and there is still some more work to be done (check on additional benchmarks, etc.). So, I guess, it will have to wait for 4.4. Ira > > dorit >
Re: Memory leaks in compiler
(I am resending this, since some of the addresses got corrupted. My
apologies.)

Hi,

[EMAIL PROTECTED] wrote on 16/01/2008 15:20:00:

> > When a loop is vectorized, some statements are removed from the basic
> > blocks, but the vectorizer information attached to these BBs is never
> > freed.
>
> Sebastian, thanks for bringing this to our attention. I'll look into this.
> I hope that removing stmts from a BB can be easily localized.
> -- Victor

The attached patch, mainly written by Victor, fixes memory leaks in the
vectorizer that were found with the help of valgrind and by examining the
code.

Bootstrapped with vectorization enabled and tested on the vectorizer
testsuite on ppc-linux. I still have to perform full regtesting.

Is it O.K. for 4.3? Or will it wait for 4.4?

Thanks,
Victor and Ira

ChangeLog:

	* tree-vectorizer.c (free_stmt_vec_info): New function.
	(destroy_loop_vec_info): Move code to free_stmt_vec_info().
	Call free_stmt_vec_info(). Free LOOP_VINFO_STRIDED_STORES.
	* tree-vectorizer.h (free_stmt_vec_info): Declare.
	* tree-vect-transform.c (vectorizable_conversion): Free
	vec_oprnds0 if it was allocated.
	(vect_permute_store_chain): Remove unused VECs.
	(vectorizable_store): Free VECs that are allocated in the
	function.
	(vect_transform_strided_load, vectorizable_load): Likewise.
	(vect_remove_stores): Simplify the code.
	(vect_transform_loop): Move code to vect_remove_stores().
	Call vect_remove_stores() and free_stmt_vec_info().

(See attached file: memleaks.txt)

Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(revision 131899)
+++ tree-vectorizer.c	(working copy)
@@ -1558,6 +1558,22 @@ new_stmt_vec_info (tree stmt, loop_vec_i
 }
 
+/* Free stmt vectorization related info.  */
+
+void
+free_stmt_vec_info (tree stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  if (!stmt_info)
+    return;
+
+  VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
+  free (stmt_info);
+  set_stmt_info (stmt_ann (stmt), NULL);
+}
+
+
 /* Function bb_in_loop_p
 
    Used as predicate for dfs order traversal of the loop bbs.  */
@@ -1714,21 +1730,13 @@ destroy_loop_vec_info (loop_vec_info loo
     {
       basic_block bb = bbs[j];
       tree phi;
-      stmt_vec_info stmt_info;
 
       for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
-        {
-          stmt_ann_t ann = stmt_ann (phi);
-
-          stmt_info = vinfo_for_stmt (phi);
-          free (stmt_info);
-          set_stmt_info (ann, NULL);
-        }
+        free_stmt_vec_info (phi);
 
       for (si = bsi_start (bb); !bsi_end_p (si); )
         {
           tree stmt = bsi_stmt (si);
-          stmt_ann_t ann = stmt_ann (stmt);
           stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 
           if (stmt_info)
@@ -1746,9 +1754,7 @@ destroy_loop_vec_info (loop_vec_info loo
                 }
 
               /* Free stmt_vec_info.  */
-              VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
-              free (stmt_info);
-              set_stmt_info (ann, NULL);
+              free_stmt_vec_info (stmt);
 
               /* Remove dead "pattern stmts".  */
               if (remove_stmt_p)
@@ -1767,6 +1773,7 @@ destroy_loop_vec_info (loop_vec_info loo
   for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
     vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
   VEC_free (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo));
+  VEC_free (tree, heap, LOOP_VINFO_STRIDED_STORES (loop_vinfo));
 
   free (loop_vinfo);
   loop->aux = NULL;

Index: tree-vectorizer.h
===================================================================
--- tree-vectorizer.h	(revision 131899)
+++ tree-vectorizer.h	(working copy)
@@ -667,6 +667,7 @@ extern bool supportable_narrowing_operat
 extern loop_vec_info new_loop_vec_info (struct loop *loop);
 extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern stmt_vec_info new_stmt_vec_info (tree stmt, loop_vec_info);
+extern void free_stmt_vec_info (tree stmt);
 
 /** In tree-vect-analyze.c **/

Index: tree-vect-transform.c
===================================================================
--- tree-vect-transform.c	(revision 131899)
+++ tree-vect-transform.c	(working copy)
@@ -3638,6 +3638,9 @@ vectorizable_conversion (tree stmt, bloc
       *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
     }
 
+  if (vec_oprnds0)
+    VEC_free (tree, heap, vec_oprnds0);
+
   return true;
 }
 
@@ -4589,11 +4592,8 @@ vect_permute_store_chain (VEC(tree,heap)
   tree scalar_dest, tmp;
   int i;
   unsigned int j;
-  VEC(tree,heap) *first, *second;
 
   scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
-  first = VEC_alloc (tree, heap, length/2);
-  second = VEC_alloc (tree, heap, length/2);
 
   /* Check that the operation is supported.  */
   if (!vect_strided_store_supported (vectype
Re: Optimizations documentation
Hi, Dorit Nuzman/Haifa/IBM wrote on 14/02/2008 17:02:45: > This is an old debt: A while back Tim had sent me a detailed report > off line showing which C++ tests (originally from the Dongara loops > suite) were vectorized by current g++ or icpc, or both, as well as > when the vectorization by icpc required a pragma, or was partial. I > went over the loops that were reported to be vectorized by icc but > not by gcc, to see which features we are missing. There are 23 such > loops (out of a total of 77). They fall into the following 7 categories: > > (1) scalar evolution analysis fails with "evolution of base is not affine". > This happens in the 3 loops in lines 4267, 4204 and 511. > Here an example: > for (i__ = 1; i__ <= i__2; ++i__) > { > a[i__] = (b[i__] + b[im1] + b[im2]) * .333f; > im2 = im1; > im1 = i__; > } > Missed optimization PR to be opened. I opened PR35224. > > (2) Function calls inside a loop. These are calls to the math > functions sin/cos, which I expect would be vectorized if the proper > simd math lib was available. > This happens in the loop in line 6932. > I think there's an open PR for this one (at least for > powerpc/Altivec?) - need to look/open. There is PR6. > > (3) This one is the most dominant missed optimization: if-conversion > is failing to if-convert, most likely due to the very limited > handling of loads/stores (i.e. load/store hoisting/sinking is required). > This happens in the 13 loops in lines 4085, 4025, 3883, 3818, 3631, > 355, 3503, 2942, 877, 6740, 6873, 5191, 7943. > There is on going work towards addressing this issue - see http: > //gcc.gnu.org/ml/gcc/2007-07/msg00942.html, http://gcc.gnu. > org/ml/gcc/2007-09/msg00308.html. (I think Victor Kaplansky is > currently working on this). > > (4) A scalar variable, whose address is taken outside the loop (in > an enclosing outer-loop) is analyzed by the data-references > analysis, which fails because it is invariant. 
> Here's an example:
> for (nl = 1; nl <= i__1; ++nl)
>   {
>     sum = 0.f;
>     for (i__ = 1; i__ <= i__2; ++i__)
>       {
>         a[i__] = c__[i__] + d__[i__];
>         b[i__] = c__[i__] + e[i__];
>         sum += a[i__] + b[i__];
>       }
>     dummy_ (ld, n, &a[1], &b[1], &c__[1], &d__[1], &e[1], &aa[aa_offset],
>             &bb[bb_offset], &cc[cc_offset], &sum);
>   }
> (Analysis of 'sum' fails with "FAILED as dr address is invariant".)
> This happens in the 2 loops in lines 5053 and 332.
> I think there is a missed optimization PR for this one already. Need
> to look/open.

The related PRs are PR33245 and PR33244. Also there is a FIXME comment in
tree-data-ref.c before the failure with the "FAILED as dr address is
invariant" error:

/* FIXME -- data dependence analysis does not work correctly for objects
   with invariant addresses.  Let us fail here until the problem is
   fixed.  */

> (5) Reduction and induction that involve multiplication (i.e. 'prod
> *= CST' or 'prod *= a[i]') are currently not supported by the
> vectorizer. It should be trivial to add support for this feature
> (for reduction, it shouldn't be much more than adding a case for
> MULT_EXPR in tree-vectorizer.c:reduction_code_for_scalar_code, I think).
> This happens in the 2 loops in lines 4921 and 4632.
> A missed-optimization PR to be opened.

Opened PR35226.

> (6) loop distribution is required to break a dependence. This may
> already be handled by Sebastian's loop-distribution pass that will
> be incorporated in 4.4.
> Here is an example:
> for (i__ = 2; i__ <= i__2; ++i__)
>   {
>     a[i__] += c__[i__] * d__[i__];
>     b[i__] = a[i__] + d__[i__] + b[i__ - 1];
>   }
> This happens in the loop in line 2136.
> Need to check if we need to open a missed optimization PR for this.

I don't think that this is a loop distribution issue. The dependence
between the store to a[i] and the load from a[i] doesn't prevent
vectorization. The problematic one is between the store to b[i] and the
load from b[i-1] in the second statement.
> > (7) A dependence, similar to the one that would be created by > predictive commoning (or even PRE), is present in the loop: > for (i__ = 1; i__ <= i__2; ++i__) > { > a[i__] = (b[i__] + x) * .5f; > x = b[i__]; > } > This happens in the loop in line 3003. > The vectorizer needs to be extended to handle such cases. > A missed optimization PR to be opened (if one doesn't exist already). I opened a new PR - 35229. (PR33244 is somewhat related). Ira
Re: Optimizations documentation
Dorit Nuzman/Haifa/IBM wrote on 18/02/2008 09:40:37: > Thanks a lot for tracking down / opening the relevant PRs. > > about: > > > > (6) loop distribution is required to break a dependence. This may > > > already be handled by Sebastian's loop-distribution pass that will > > > be incorporated in 4.4. > > > Here is an example: > > > for (i__ = 2; i__ <= i__2; ++i__) > > > { > > > a[i__] += c__[i__] * d__[i__]; > > > b[i__] = a[i__] + d__[i__] + b[i__ - 1]; > > > } > > > This happens in the loop in line 2136. > > > Need to check if we need to open a missed optimization PR for this. > > > > I don't think that this is a loop distribution issue. The dependence > > between the store to a[i] and the load from a[i] doesn't prevent > > vectorization. > > right, > > > The problematic one is between the store to b[i] and > > the load from b[i-1] in the second statement. > > ...which is exactly why loop distribution could make this loop > (partially) vectorizable - separating the first and second > statements into separate loops would allow vectorizing the first of > the two resulting loops (which is probably what icc does - icc > reports that this loop is partially vectorizable). Yes, I see now. I applied Sebastian's patch (http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00215.html) and got "FIXME: Loop 1 not distributed: failed to build the RDG." Ira > > dorit >
Re: vectorizer default in 4.3.0 changes document missing
Hi Andi, [EMAIL PROTECTED] wrote on 10/03/2008 18:32:35: > > I noticed the gcc 4.3.0 changes document on the website does not > mention that the vectorizer is now on by default in -O3. > Perhaps that should be added? It seems like an important noteworthy > change to me. Thanks for pointing this out. The vectorizer's website has not been updated for a while. I am going to do that. > > I'm not sure it applies to all architectures, but it applies to > x86 at least. Vectorization (-ftree-vectorize) is on by default in -O3 on all platforms, but many architectures require additional flags to actually apply it, like -maltivec on PowerPC. Thanks, Ira > > -Andi
Re: Auto-vectorization: need to know what to expect
[EMAIL PROTECTED] wrote on 17/03/2008 19:33:23: > I have looked more closely at the messages generated by the gcc 4.3 > vectorizer > and it seems that they fall into two categories: > > 1) complaining about alignment. > > For example: > > Unknown alignment for access: D.33485 > Unknown alignment for access: m These do not necessarily mean that the loop can't be vectorized - we can handle unknown alignment with loop peeling and loop versioning. > > I don't understand, as all my data is statically allocated doubles > (no dynamic > memory allocation) and I am using -malign-double. What more can I do? > > 2) complaining about "possible dependence" between some data and itself > > Example: > > not vectorized, possible dependence between data-refs > m.m_storage.m_data[D.43225_112] and m.m_storage.m_data[D.43225_112] These two data-refs are probably a store and a load to the same place, not the same data-ref. As has already been said, the best thing to do is to open a PR with a testcase, so we can fully analyze it and answer all the questions. Ira > > > I am wondering what to do about all that? Surely there must be documentation > about the vectorizer and its messages somewhere but I can't find it? > > Cheers, > Benoit > > > On Monday 17 March 2008 15:59:21 Richard Guenther wrote: > > On Mon, Mar 17, 2008 at 3:45 PM, Benoît Jacob <[EMAIL PROTECTED]> wrote: > > > Dear All, > > > > > > I am currently (co-)developing a Free (GPL/LGPL) C++ library for > > > vector/matrix math. > > > > > > A major decision that we need to take is, what to do regarding > > > vectorization instructions (SSE). Either we rely on GCC to > > > auto-vectorize, or we control explicitly the vectorization using GCC's > > > special primitives. The latter solution is of course more difficult, and > > > would to some degree obfuscate our source code, so we wish to know > > > whether or not it's really necessary. 
> > > > > > GCC 4.3.0 does auto-vectorize our loops, but the resulting code has > > > worse performance than a version with unrolled loops and no > > > vectorization. By contrast, ICC auto-vectorizes the same loops in a way > > > that makes them significantly faster than the unrolled-loops > > > non-vectorized version. > > > > > > If you want to know, the loops in question typically look like: > > > for(int i = 0; i < COMPILE_TIME_CONSTANT; i++) > > > { > > > // some abstract c++ code with deep recursive templates and > > > // deep recursive inline functions, but resulting in only a > > > // few assembly instructions > > > a().b().c().d(i) = x().y().z(i); > > > } > > > > > > As said above, it's crucial for us to be able to get an idea of what to > > > expect, because design decisions depend on that. Should we expect large > > > improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5 ? > > > > In general GCCs autovectorization capabilities are quite good, cases > > where we miss opportunities do of course exist. There were improvements > > regarding autovectorization capabilities in every GCC release and I expect > > that to continue for future releases (though I cannot promise anything > > as GCC is a volunteer driven project - but certainly testcases where we > > miss optimizations are welcome - often we don't know of all corner cases). > > > > If you require to get the absolute most out of your CPU I recommend to > > provide special routines tuned for the different CPU families and I > > recommend the use of the standard intrinsics headers (*mmintr.h) for > > this. Of course this comes at a high cost of maintainance (and initial > > work), so autovectorization might prove good enough. Often tuning the > > source for a given compiler has a similar effect than producing vectorized > > code manually. 
Looking at GCC tree dumps and knowing a bit about > > GCC internals helps you here ;) > > > > > A roadmap or a GCC developer sharing his thoughts would be very helpful. > > > > Thanks, > > Richard.
Re: Auto-vectorization: need to know what to expect
[EMAIL PROTECTED] wrote on 17/03/2008 21:08:43: > It might be nice to think about an option that automatically aligns large > arrays without having to do the declaration (or even have the vectorizer > override the alignment for statics/auto). The vectorizer is already doing this. Ira > > -- > Michael Meissner, AMD > 90 Central Street, MS 83-29, Boxborough, MA, 01719, USA > [EMAIL PROTECTED] >
Re: 4.3.0 manual vs changes.html
[EMAIL PROTECTED] wrote on 19/03/2008 06:01:19: > The web page > > http://gcc.gnu.org/gcc-4.3/changes.html > > states that "The -ftree-vectorize option is now on by default under -O3.", but on > > http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html > > -ftree-vectorize is not listed as one of the options enabled by -O3. > > Is the first statement correct? Yes, -ftree-vectorize is on by default under -O3. The latter document should be updated. I am preparing a patch. Thanks for pointing this out, Ira > > Brad
Re: auto vectorization - should this work ?
Yes, this should get vectorized. The problem is in data dependence analysis: we fail to prove that s_5->a[i_16] and s_5->a[i_16] access the same memory location. I think it happens because when we compare the bases of the data references (s_5->a and s_5->a) in base_object_differ_p(), we compare the trees themselves (which are pointers) rather than their contents. I'll look into this and hope to submit a fix soon (I guess using operand_equal_p instead). Thanks, Ira
Re: auto vectorization - should this work ?
Toon Moene <[EMAIL PROTECTED]> wrote on 06/05/2007 15:33:38: > I'd be willing to test out your solution privately, if you prefer such a > round first ... > Thanks. I'll send you a patch when it's ready. Ira
Re: auto vectorization - should this work ?
"Richard Guenther" <[EMAIL PROTECTED]> wrote on 06/05/2007 16:17:05: > On 5/6/07, Ira Rosen <[EMAIL PROTECTED]> wrote: > > > > Yes, this should get vectorized. The problem is in data dependencies > > analysis. We fail to prove that s_5->a[i_16] and s_5->a[i_16] access the > > same memory location. I think, it happens since when we compare the bases > > of the data references (s_5->a and s_5->a) in base_object_differ_p(), we do > > that by comparing the trees (which are pointers) and not their content. > > > > I'll look into this and, I hope, I will submit a fix for that soon (I guess > > using operand_equal_p instead). > > Duh, that function looks interesting, indeed ;) > > It should probably use get_base_address () to get at the base object > and then operand_equal_p to compare them. Note that it strips outer > variable offset as well, like for a[i].b[j] you will get 'a' as the > base object. > If data-ref cannot handle this well, just copy get_base_address () and > stop at the first ARRAY_REF you come along. But maybe > base_object_differ_p is only called from contexts that are well-defined > in this regard. base_object_differ_p is called after the data-refs analysis. So we really compare base objects here, and no further peeling is needed at this stage. At least, that was our intention. Thanks, Ira > > Richard.
Re: Some thoughts about steering committee work
"Daniel Berlin" <[EMAIL PROTECTED]> wrote on 16/06/2007: > On 6/16/07, Dorit Nuzman <[EMAIL PROTECTED]> wrote: > > > Do you have specific examples where SLP helps performance out of loops? > > hash calculations. > > For md5, you can get a 2x performance improvement by straight-line > vectorizing it > sha1 is about 2-2.5x > > (This assumes you do good pack/unpack placement using something like > lazy code motion) > > See, for example, http://arctic.org/~dean/crypto/sha1.html > > (The page is out of date, the technique they explain where they are > doing straight line computation of the hash in parallel, is exactly > what SLP would provide out of loops) I looked at the above page (and also at MD5 and SHA1 implementations). I found only computations inside loops. Could you please explain what exactly you refer to as SLP out of loops in this benchmark? Thanks, Ira
Re: Optimizations documentation
Hi, [EMAIL PROTECTED] wrote on 01/01/2008 22:00:11: > some time ago I listened that GCC supports vectorization, > but still can't find anything about it, how can I use it in my programs. Here is the link to the vectorizer's documentation: http://gcc.gnu.org/projects/tree-ssa/vectorization.html Ira