Re: [mem-ssa] Updated documentation

2007-01-02 Thread Ira Rosen

Hi Diego,

In the example of dynamic partitioning below (Figure 6), I don't understand
why MEM7 is not killed in line 13 but is killed in line 20 later. As far as
I understand, in line 13 'c' is in the alias set and its currdef is MEM7,
so it must be killed by the store in line 14. What am I missing?

Thanks,
Ira


p5 points-to {a, b, c}, q6 points-to {b, c}
CD(v) means that the generated MEM_i name is the “current definition” for
v.
LU(v) looks up the “current definition” for v.
The initial SSA name for MEM is MEM7.

1 . . .
2 # MEM8 = VDEF  => CD(a)
3 a = 2
4
5 # MEM10 = VDEF  => CD(b)
6 b = 5
7
8 # VUSE  => LU(b)
9 b.311 = b
10
11 D.153612 = b.311 + 3
12
13 # MEM25 = VDEF  => CD(a, b, c)
14 *p5 = D.153612
15
16 # VUSE  => CD(b)
17 b.313 = b
18 D.153714 = 10 ? b.313
19
20 # MEM26 = VDEF  => CD(b, c)
21 *q6 = D.153714
22
23 # VUSE  => LU(a)
24 a.415 = a
25
26 # MEM17 = VDEF  => CD(SFT.2)
27 X.x = a.415
28 return
}




Re: Scheduling an early complete loop unrolling pass?

2007-02-06 Thread Ira Rosen


Dorit Nuzman/Haifa/IBM wrote on 05/02/2007 21:13:40:

> Richard Guenther <[EMAIL PROTECTED]> wrote on 05/02/2007 17:59:00:
>
> > On Mon, 5 Feb 2007, Paolo Bonzini wrote:
> >
> > >
> > > > As we also only vectorize innermost loops I believe doing a
> > > > complete unrolling pass early will help in general (I pushed
> > > > for this some time ago).
> > > >
> > > > Thoughts?
> > >
> > > It might also hurt, though, since we don't have a basic block
vectorizer.
> > > IIUC the vectorizer is able to turn
> > >
> > >   for (i = 0; i < 4; i++)
> > > v[i] = 0.0;
> > >
> > > into
> > >
> > >   *(vector double *)v = (vector double){0.0, 0.0, 0.0, 0.0};
> >
> > That's true.
>
> That's going to change once this project goes in: "(3.2) Straight-
> line code vectorization" from http://gcc.gnu.
> org/wiki/AutovectBranchOptimizations. In fact, I think in autovect-
> branch, if you unroll the above loop it should get vectorized
> already. Ira - is that really the case?

The completely unrolled loop will not get vectorized because the code will
not be inside any loop (and our SLP implementation will focus, at least as
a first step, on loops).
The following will get vectorized (without permutation on autovect branch,
and with redundant permutation on mainline):

for (i = 0; i < n; i++)
  {
v[4*i] = 0.0;
v[4*i + 1] = 0.0;
v[4*i + 2] = 0.0;
v[4*i + 3] = 0.0;
  }

The original completely unrolled loop will get vectorized if it is
encapsulated in an outer-loop, like so:

for (j=0; j
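A sketch of that shape, where the outer bound N and the indexing are
assumptions rather than part of the original message:

for (j = 0; j < N; j++)
  {
    v[4*j] = 0.0;
    v[4*j + 1] = 0.0;
    v[4*j + 2] = 0.0;
    v[4*j + 3] = 0.0;
  }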

vcond implementation in altivec

2007-02-27 Thread Ira Rosen

Hi,

We were looking at the implementation of vcond for altivec and we have a
couple of questions.

vcond has 6 operands. rs6000_emit_vector_cond_expr is called from the
define_expand for "vcond". It gets the operands in their original
order, as in vcond, and emits op0 = (op4 cond op5) ? op1 : op2, where cond
is op3.
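As a scalar model of what the expander is asked to produce (a sketch only,
with "<" standing in for the comparison op3; this is not GCC code):

/* Per element: op0 = (op4 cond op5) ? op1 : op2.  */
void
vcond_model (short *op0, const short *op1, const short *op2,
             const short *op4, const short *op5, int nelts)
{
  int i;
  for (i = 0; i < nelts; i++)
    op0[i] = (op4[i] < op5[i]) ? op1[i] : op2[i];
}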

Here is vcond for vector short (vconduv8hi, vcondv16qi, and vconduv16qi are
similar):
(define_expand "vcondv8hi"
 [(set (match_operand:V4SF 0 "register_operand" "=v")
       (unspec:V8HI [(match_operand:V4SI 1 "register_operand" "v")
                     (match_operand:V8HI 2 "register_operand" "v")
                     (match_operand:V8HI 3 "comparison_operator" "")
                     (match_operand:V8HI 4 "register_operand" "v")
                     (match_operand:V8HI 5 "register_operand" "v")
                    ] UNSPEC_VCOND_V8HI))]
 "TARGET_ALTIVEC"
 "
 {
   if (rs6000_emit_vector_cond_expr (operands[0], operands[1], operands[2],
                                     operands[3], operands[4], operands[5]))
     DONE;
   else
     FAIL;
 }
 ")
Is there a reason why op0 is V4SF and op1 is V4SI (and not V8HI)?


In V4SF, op1 is V4SI:
(define_expand "vcondv4sf"
 [(set (match_operand:V4SF 0 "register_operand" "=v")
       (unspec:V4SF [(match_operand:V4SI 1 "register_operand" "v")
                     (match_operand:V4SF 2 "register_operand" "v")
                     (match_operand:V4SF 3 "comparison_operator" "")
                     (match_operand:V4SF 4 "register_operand" "v")
                     (match_operand:V4SF 5 "register_operand" "v")
                    ] UNSPEC_VCOND_V4SF))]
 "TARGET_ALTIVEC"
 "
 {
   if (rs6000_emit_vector_cond_expr (operands[0], operands[1], operands[2],
                                     operands[3], operands[4], operands[5]))
     DONE;
   else
     FAIL;
 }
 ")
Same question: is there a reason for op1 to be V4SI?

And also, why not use if_then_else instead of unspec (in all vcond's)?

Thanks,
Sa and Ira



Re: Vector permutation only deals with # of vector elements same as mask?

2011-02-10 Thread Ira Rosen

Hi,

"Bingfeng Mei"  wrote on 10/02/2011 05:35:45 PM:
>
> Hi,
> I noticed that vector permutation gets more use in GCC
> 4.6, which is great. It is used to handle negative step
> by reversing vector elements now.
>
> However, after reading the related code, I understood
> that it only works when the # of vector elements is
> the same as that of mask vector in the following code.
>
> perm_mask_for_reverse (tree-vect-stmts.c)
> ...
>   mask_type = get_vectype_for_scalar_type (mask_element_type);
>   nunits = TYPE_VECTOR_SUBPARTS (vectype);
>   if (!mask_type
>   || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS
(mask_type))
> return NULL;
> ...
>
> For PowerPC altivec, the mask_type is V16QI. It means that
> compiler can only permute V16QI type.  But given the capability of
> altivec vperm instruction, it can permute any 128-bit type
> (V8HI, V4SI, etc). We just need convert in/out V16QI from
> given types and a bit more extra work in producing mask.
>
> Do I understand correctly or miss something here?

Yes, you are right. The support for reverse accesses is somewhat limited.
Please see vect_transform_slp_perm_load() in tree-vect-slp.c for an example
of permutation support for all types.

But, in any case, reverse accesses are not supported with Altivec's load
realignment scheme.
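For illustration only, a reverse-element mask for V8HI written out as the
V16QI byte selector a vperm would need (assuming big-endian byte numbering;
this is a sketch, not taken from the GCC sources):

/* Output element i takes input element 7-i, i.e. bytes 2*(7-i) and
   2*(7-i)+1 of the input vector.  */
static const unsigned char reverse_v8hi_mask[16] =
  { 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 };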

Ira

>
> Thanks,
> Bingfeng Mei
>
>
>
>



Re: Fw: RFC: Representing vector lane load/store operations

2011-03-23 Thread Ira Rosen
>> ...Ira would know best, but I don't think it would be used for this
>> kind of loop.  It would be more something like:
>>
>>   for (i=0; i
>>     X[i] = Y[i].red + Y[i].blue + Y[i].green;
>>
>> (not a realistic example).  You'd then have:
>>
>>    compoundY = __builtin_load_lanes (Y);
>>    red = ARRAY_REF 
>>    green = ARRAY_REF 
>>    blue = ARRAY_REF 
>>    D1 = red + green
>>    D2 = D1 + blue
>>    MEM_REF  = D2;
>>
>> My understanding is that'd we never do any operations besides ARRAY_REFs
>> on the compound value, and that the individual vectors would be treated
>> pretty much like any other.
>
> Ok, I thought it might be used to have a larger vectorization factor for
> loads and stores, basically make further unrolling cheaper because you
> don't have to duplicate the loads and stores.

Right, we can do that using vld1/vst1 instructions (full load/store
with N=1) and operate on up to 4 doubleword vectors in parallel. But
at the moment we are concentrating on efficient support of strided
memory accesses.
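As an illustration of the kind of strided (interleaved) access meant here
(example mine, not from the thread):

/* The elements at 3*i, 3*i+1 and 3*i+2 each form a stream with stride 3;
   NEON vld3/vst3 can load/store such interleaved streams.  */
void
scale_rgb (int *rgb, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      rgb[3*i]     = rgb[3*i] * 2;
      rgb[3*i + 1] = rgb[3*i + 1] * 2;
      rgb[3*i + 2] = rgb[3*i + 2] * 2;
    }
}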

Ira


Re: Strange vect.exp test results

2011-05-30 Thread Ira Rosen


gcc-ow...@gcc.gnu.org wrote on 30/05/2011 06:36:36 PM:

>
> Hi,
>
> I've been playing with the vectorizer for my port, and of course I use
> the testsuite to check the generated code. I fail to understand some
> of the FAILs I get. For example, in slp-3.c, the test contains:
>
> /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" {
> xfail vect_no_align } } } */
>
> This test fails for me because I get 4 vectorized loops instead of 3.
> There are multiple other tests that generate more vectorization then
> expected. I'd like to understand the reason for these failures, but I
> can't see what motivates the choice of only 3 vectorized loops among
> the 4 vectorizable loops of the test. Can someone enlighten me?

The fourth loop (line 104) has only 3 scalar iterations, too few to
vectorize unless your target has vectors of 2 shorts.
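Roughly the shape of the loop in question (a sketch, not copied from
slp-3.c):

/* Only 3 scalar iterations: with V4HI (4 shorts per vector) there is
   not even one full vector iteration, so it stays scalar unless the
   target has V2HI.  */
unsigned short in[3], out[3];

void
copy3 (void)
{
  int i;
  for (i = 0; i < 3; i++)
    out[i] = in[i];
}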

Ira

>
> Many thanks,
> Fred



Re: Strange vect.exp test results

2011-05-31 Thread Ira Rosen


Frederic Riss  wrote on 31/05/2011 12:34:35 PM:

> Hi Ira,
>
> thanks for your answer, however:
>
> On 31 May 2011 08:06, Ira Rosen  wrote:
> >> This test fails for me because I get 4 vectorized loops instead of 3.
> >> There are multiple other tests that generate more vectorization then
> >> expected. I'd like to understand the reason for these failures, but I
> >> can't see what motivates the choice of only 3 vectorized loops among
> >> the 4 vectorizable loops of the test. Can someone enlighten me?
> >
> > The fourth loop (line 104) has only 3 scalar iterations, too few to
> > vectorize unless your target has vectors of 2 shorts.
>
> My port has vectors of 2 shorts, but I don't expose them directly to
> GCC. The V2HI type is defined, but UNITS_PER_SIMD_WORD always returns
> 8, which I believe should prompt GCC to use V4HI which is also
> defined.
>
> Regarding slp-3.c I don't get why the loop you point isn't
> vectorizable. I my version of the file (4.5 branch), I see 9 short
> copies in a loop iterating 4 times (a total of 36 short assignements).
> After the vectorization pass, I get 9 V4HI assignments which seem
> totally right. I don't see why this shouldn't be the case...

You are right. slp-3.c was fixed recently (revision 171569) on trunk for
targets with V4HI. I think there are other tests as well that fail because
of the vector size assumption. I'm planning to fix them.

Ira

>
> Many thanks,
> Fred



Re: SLP vectorizer on non-loop?

2011-11-01 Thread Ira Rosen


gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:

> Hello,
> I have one example with two very similar loops. cunrolli pass
> unrolls one loop completely
> but not the other based on slightly different cost estimations. The
> not-unrolled loop
> get SLP-vectorized, then unrolled by "cunroll" pass, whereas the
> other unrolled loop cannot
> be vectorized since it is not a loop any more.  In the end, there is
> big difference of
> performance between two loops.
>

Here is what I see with the current trunk on x86_64 with -O3 (with the two
loops split into different functions):

The first loop, the one that doesn't get unrolled by cunrolli, gets loop
vectorized with -fno-vect-cost-model. With the cost model the vectorization
fails because the number of iterations is not sufficient (the vectorizer
tries to apply loop peeling in order to align the accesses); the loop then
gets unrolled by cunroll and the resulting basic block gets vectorized by SLP.

The second loop, unrolled by cunrolli, also gets vectorized by SLP.

The *.optimized dumps look similar:


:
  vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
  MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
  return;


:
  vect_var_.7_57 = MEM[(int *)p_input_10(D)];
  MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
  return;


> My question is why SLP vectorization has to be performed on loop (it
> is a sub-pass under
> pass_tree_loop). Conceptually, cannot it be done on any basic block?
> Our port are still
> stuck at 4.5. But I checked 4.7, it seems still the same. I also
> checked functions in
> tree-vect-slp.c. They use a lot of loop_vinfo structures. But in
> some places it checks
> whether loop_vinfo exists to use it or other alternative. I tried to
> add an extra SLP
> pass after pass_tree_loop, but it didn't work. I wonder how easy to
> make SLP works for
> non-loop.

SLP vectorization works both on loops (in vectorize pass) and on basic
blocks (in slp-vectorize pass).
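For illustration, the kind of straight-line code the basic block SLP pass
handles (sketch only):

/* No loop at all: four independent, adjacent copies in one basic
   block can become a single vector load/store pair.  */
void
copy4 (int *__restrict__ dst, int *__restrict__ src)
{
  dst[0] = src[0];
  dst[1] = src[1];
  dst[2] = src[2];
  dst[3] = src[3];
}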

Ira

>
> Thanks,
> Bingfeng Mei
>
> Broadcom UK
>
> void foo (int *__restrict__ temp_hist_buffer,
>   int * __restrict__ p_hist_buff,
>   int *__restrict__ p_input)
> {
>   int i;
>   for(i=0;i<4;i++)
>  temp_hist_buffer[i]=p_hist_buff[i];
>
>   for(i=0;i<4;i++)
>  temp_hist_buffer[i+4]=p_input[i];
>
> }
>
>



RE: SLP vectorizer on non-loop?

2011-11-01 Thread Ira Rosen


"Bingfeng Mei"  wrote on 01/11/2011 01:25:14 PM:

> Ira,
> Thank you very much for quick answer. I will check 4.7 x86-64
> to see difference from our port. Is there significant change
> between 4.5 & 4.7 regarding SLP?

Yes, I think so. 4.5 can't SLP-vectorize the data accesses with unknown
alignment that you have here.

Ira

>
> Cheers,
> Bingfeng
>
> > -Original Message-
> > From: Ira Rosen [mailto:i...@il.ibm.com]
> > Sent: 01 November 2011 11:13
> > To: Bingfeng Mei
> > Cc: gcc@gcc.gnu.org
> > Subject: Re: SLP vectorizer on non-loop?
> >
> >
> >
> > gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:
> >
> > > Hello,
> > > I have one example with two very similar loops. cunrolli pass
> > > unrolls one loop completely
> > > but not the other based on slightly different cost estimations. The
> > > not-unrolled loop
> > > get SLP-vectorized, then unrolled by "cunroll" pass, whereas the
> > > other unrolled loop cannot
> > > be vectorized since it is not a loop any more.  In the end, there is
> > > big difference of
> > > performance between two loops.
> > >
> >
> > Here what I see with the current trunk on x86_64 with -O3 (with the two
> > loops split into different functions):
> >
> > The first loop, the one that doesn't get unrolled by cunrolli, gets
> > loop
> > vectorized with -fno-vect-cost-model. With the cost model the
> > vectorization
> > fails because the number of iterations is not sufficient (the
> > vectorizer
> > tries to apply loop peeling in order to align the accesses), the loop
> > gets
> > later unrolled by cunroll and the basic block gets vectorized by SLP.
> >
> > The second loop, unrolled by cunrolli, also gets vectorized by SLP.
> >
> > The *.optimized dumps look similar:
> >
> >
> > :
> >   vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
> >   MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
> >   return;
> >
> >
> > :
> >   vect_var_.7_57 = MEM[(int *)p_input_10(D)];
> >   MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
> >   return;
> >
> >
> > > My question is why SLP vectorization has to be performed on loop (it
> > > is a sub-pass under
> > > pass_tree_loop). Conceptually, cannot it be done on any basic block?
> > > Our port are still
> > > stuck at 4.5. But I checked 4.7, it seems still the same. I also
> > > checked functions in
> > > tree-vect-slp.c. They use a lot of loop_vinfo structures. But in
> > > some places it checks
> > > whether loop_vinfo exists to use it or other alternative. I tried to
> > > add an extra SLP
> > > pass after pass_tree_loop, but it didn't work. I wonder how easy to
> > > make SLP works for
> > > non-loop.
> >
> > SLP vectorization works both on loops (in vectorize pass) and on basic
> > blocks (in slp-vectorize pass).
> >
> > Ira
> >
> > >
> > > Thanks,
> > > Bingfeng Mei
> > >
> > > Broadcom UK
> > >
> > > void foo (int *__restrict__ temp_hist_buffer,
> > >   int * __restrict__ p_hist_buff,
> > >   int *__restrict__ p_input)
> > > {
> > >   int i;
> > >   for(i=0;i<4;i++)
> > >  temp_hist_buffer[i]=p_hist_buff[i];
> > >
> > >   for(i=0;i<4;i++)
> > >  temp_hist_buffer[i+4]=p_input[i];
> > >
> > > }
> > >
> > >
> >
>
>



Re: targetm.vectorize.builtin_vec_perm

2009-11-17 Thread Ira Rosen


Richard Henderson  wrote on 17/11/2009 03:39:42:

> Richard Henderson 
> 17/11/2009 03:39
>
> To
>
> Ira Rosen/Haifa/i...@ibmil
>
> cc
>
> gcc@gcc.gnu.org
>
> Subject
>
> targetm.vectorize.builtin_vec_perm
>
> What is this hook supposed to do?  There is no description of its
arguments.
>
> What is the theory of operation of permute within the vectorizer?  Do
> you actually need variable permute, or would constants be ok?

It is currently used for a specific load permutation of RGB to YUV
conversion (http://gcc.gnu.org/ml/gcc-patches/2008-07/msg00445.html). The
arguments are the vector type and the mask type (the latter is returned by
the hook).

The permute is constant; it depends on the number of loads (the group size)
and their type. However, there are cases that we may want to support in the
future that require a variable permute - indirect accesses, for example.
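An example of the indirect-access case that would need a variable permute
(illustrative only):

/* The load order depends on idx[], which is only known at run time,
   so no constant permute mask can be precomputed.  */
void
gather (int *out, const int *in, const int *idx, int n)
{
  int i;
  for (i = 0; i < n; i++)
    out[i] = in[idx[i]];
}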

>
> I'm contemplating adding a tree- and gimple-level VEC_PERMUTE_EXPR of
> the form:
>
>VEC_PERMUTE_EXPR (vlow, vhigh, vperm)
>
> which would be exactly equal to
>
>(vec_select
>  (vec_concat vlow vhigh)
>  vperm)
>
> at the rtl level.  I.e. vperm is an integral vector of the same number
> of elements as vlow.
>
> Truly variable permutation is something that's only supported by ppc and
> spu.

Also, Altivec and SPU support byte permutation (not only element
permutation); however, the vectorizer does not make use of this at present.

> Intel AVX has a limited variable permutation -- 64-bit or 32-bit
> elements can be rearranged but only within a 128-bit subvector.
> So if you're working with 128-bit vectors, it's fully variable, but if
> you're working with 256-bit vectors, it's like doing 2 128-bit permute
> operations in parallel.  Intel before AVX has no variable permute.
>
> HOWEVER!  Most of the useful permutations that I can think of for the
> optimizers to generate are actually constant.  And these can be
> implemented everywhere (with varying degrees of efficiency).
>
> Anyway, I'm thinking that it might be better to add such a general
> operation instead of continuing to add things like
>
>VEC_EXTRACT_EVEN_EXPR,
>VEC_EXTRACT_ODD_EXPR,
>VEC_INTERLEAVE_HIGH_EXPR,
>VEC_INTERLEAVE_LOW_EXPR,
>
> and other obvious patterns like broadcast, duplicate even to odd,
> duplicate odd to even, etc.

If the back end is able to identify specific masks, e.g., {0,2,4,6} as an
extract-even operation, then we can certainly remove those codes.
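For concreteness, a scalar model of the proposed VEC_PERMUTE_EXPR, showing
how a mask of {0,2,4,6} selects the even elements (a sketch, not GCC code):

/* result[i] = concat (vlow, vhigh)[vperm[i]], i.e.
   (vec_select (vec_concat vlow vhigh) vperm).  With nelts == 4 and
   vperm == {0, 2, 4, 6} this is exactly extract-even.  */
void
vec_permute_model (int *result, const int *vlow, const int *vhigh,
                   const unsigned *vperm, int nelts)
{
  int i;
  for (i = 0; i < nelts; i++)
    {
      unsigned idx = vperm[i] % (2 * nelts);
      result[i] = idx < (unsigned) nelts ? vlow[idx] : vhigh[idx - nelts];
    }
}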

>
> I can imagine having some sort of target hook that computed a cost
> metric for a given constant permutation pattern.  For instance, I'd
> imagine that the interleave patterns are half as expensive as a full
> permute for altivec, due to not having to load a mask.  This hook would
> be fairly complicated for x86, given all of the permuting insns that
> were incrementally added in various ISA revisions, but such is life.
>
> In any case, would a VEC_PERMUTE_EXPR, as described above, work for the
> uses of builtin_vec_perm within the vectorizer at present?

Yes.

Ira

>
>
> r~



Re: targetm.vectorize.builtin_vec_perm

2009-11-17 Thread Ira Rosen

> > I can imagine having some sort of target hook that computed a cost
> > metric for a given constant permutation pattern.  For instance, I'd
> > imagine that the interleave patterns are half as expensive as a full
> > permute for altivec, due to not having to load a mask.  This hook would
> > be fairly complicated for x86, given all of the permuting insns that
> > were incrementally added in various ISA revisions, but such is life.
>
> There should be some way to account for the difference between the cost
> in straight-line code, where a mask load is a hard cost, a large loop,
> where the load can be hoisted at the cost of some target-dependent
> register pressure (e.g. being able to use inverted masks might save half
> of the cost), and a tight loop, where the constant load can be easily
> amortized over the entire loop.

The vectorizer cost model already does that. AFAIU, it calls the cost model
hook to get the cost of a permute, and then incorporates that cost into the
overall loop/basic block vectorization cost.

Ira




Re: Vectorizing 16bit signed integers

2009-12-14 Thread Ira Rosen


gcc-ow...@gcc.gnu.org wrote on 11/12/2009 20:25:33:

> Allan Sandfeld Jensen 
> Hi
>
> I hope someone can help me. I've been trying to write some tight
> integer loops
> in way that could be auto-vectorized, saving me to write assembler or
using
> specific vectorization extensions. Unfortunately I've not yet managed to
make
> gcc vectorize any of them.
>
> I've simplified the case to just perform the very first operation in the
> loop;
> converting from two's complement to sign-and-magnitude.
>
> I've then used -ftree-vectorizer-verbose to examine if and if not,
> why not the
> loops were not vectorized, but I am afraid I don't understand the output.
>
> The simplest version of the loop is here (it appears the branch is not a
> problem, but I have another version without).
>
> inline uint16_t transsign(int16_t v) {
> if (v<0) {
> return 0x8000U | (1-v);
> } else {
> return v;
> }
> }
>
> It very simply converts in a fashion that maintains the full effective
bit-
> width.
>
> The error from the vectorizer is:
> vectorizesign.cpp:42: note: not vectorized: relevant stmt not supported:
> v.1_16 = (uint16_t) D.2157_11;
>
> It appears the unsupported operation in vectorization is the typecast
from
> int16_t to uint16_t, can this really be the case, or is the output
misleading?

Yes, the problem is in the signed->unsigned cast. I think it is related to
PR 26128.

Ira

>
> If it is the case, then is there good reason for it, or can I fix
> it myself by
> adding additional vectorizable operations?
>
> I've attached both test case and full output of
ftree-vectorized-verbose=9
>
> Best regards
> `Allan
>
> [attachment "vectorizesign.cpp" deleted by Ira Rosen/Haifa/IBM]
> [attachment "vectorizesign-debug.txt" deleted by Ira Rosen/Haifa/IBM]



Re: Autovectorizing does not work with classes

2008-10-07 Thread Ira Rosen


[EMAIL PROTECTED] wrote on 07/10/2008 10:48:29:

> Dear gcc developers,
>
> I am new to this list.
> I tried to use the auto-vectorization (4.2.1 (SUSE Linux)) but
unfortunately
> with limited success.
> My code is bassically a matrix library in C++. The vectorizer does not
like
> the member variables. Consider this code compiled with
> gcc -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 -funsafe-
> math-optimizations
> that gives basically  "not vectorized: unhandled data-ref"

The unhandled data-ref here is sum. It is invariant in the loop, and
invariant data-refs are currently unsupported by the data dependence
analysis. If you can change your code to pass sum by value, it will get
vectorized (at least with gcc 4.3).
This is not a C++-specific problem (for me your C version does not get
vectorized either, for the same reason).
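A sketch of the suggested change, applied to the C version (the rewrite
itself is mine):

/* Taking sum by value removes the loop-invariant data reference that
   the dependence analysis cannot handle.  */
int m = 5;
int n = 3;
double data[15];

void
test (double sum)
{
  int mn = m * n;
  int i;
  for (i = 0; i < mn; i++)
    data[i] += sum;
}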

HTH,
Ira

> 
> class P{
> public:
>   P() : m(5),n(3) {
> double *d = data;
> for (int i=0; i   d[i] = i/10.2;
>   }
>   void test(const double& sum);
> private:
>   int m;
>   int n;
>   double data[15];
> };
>
> void P::test(const double& sum) {
>   double *d = this->data;
>   for(int i=0; i d[i]+=sum;
>   }
> }
> 
> whereas the more or less equivalent C version works just fine:
> 
> int m=5;
> int n=3;
> double data[15];
>
> void test(const double& sum) {
>   int mn = m*n;
>   for(int i=0; i data[i]+=sum;
>   }
> }
> 
>
> Is there a fundamental problem in using the vectorizer in C++?
>
> Regards!
>Georg
> [attachment "signature.asc" deleted by Ira Rosen/Haifa/IBM]



Re: Merging the alias-improvements branch

2009-03-29 Thread Ira Rosen

> I will announce the time I am doing the last trunk -> alias-improvements
> branch merge and freeze the trunk for that.
>
> Thus, this is a heads-up - if I collide with your planned merge schedule
> just tell me and we can sort it out.

I was planning to commit the vectorizer reorganization patch (
http://gcc.gnu.org/ml/gcc-patches/2009-02/msg00573.html). Do you prefer
that I wait, so it doesn't disturb the merge?

Thanks,
Ira



Re: Merging the alias-improvements branch

2009-03-29 Thread Ira Rosen


Richard Guenther  wrote on 29/03/2009 13:05:56:

> On Sun, 29 Mar 2009, Ira Rosen wrote:
>
> >
> > > I will announce the time I am doing the last trunk ->
alias-improvements
> > > branch merge and freeze the trunk for that.
> > >
> > > Thus, this is a heads-up - if I collide with your planned merge
schedule
> > > just tell me and we can sort it out.
> >
> > I was planning to commit the vectorizer reorganization patch (
> > http://gcc.gnu.org/ml/gcc-patches/2009-02/msg00573.html). Do you prefer
> > that I wait, so it doesn't disturb the merge?
>
> If you can commit the patch soon (like, before wednesday) you can go
> ahead.  The differences are not big (see attachment below for what
> is the difference between trunk and branch in tree-vect-*), so I think
> I can deal with the reorg just fine.

Great! I will commit it today or tomorrow then.

Thanks,
Ira

>
> Thanks,
> Richard.
>
>
> Index: gcc/tree-vectorizer.c
> ===
> --- gcc/tree-vectorizer.c   (.../trunk)   (revision 145210)
> +++ gcc/tree-vectorizer.c   (.../branches/alias-improvements)
> (revision 145211)
> @@ -973,7 +973,7 @@ slpeel_can_duplicate_loop_p (const struc
>gimple orig_cond = get_loop_exit_condition (loop);
>gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
>
> -  if (need_ssa_update_p ())
> +  if (need_ssa_update_p (cfun))
>  return false;
>
>if (loop->inner
> Index: gcc/tree-vect-analyze.c
> ===
> --- gcc/tree-vect-analyze.c   (.../trunk)   (revision 145210)
> +++ gcc/tree-vect-analyze.c   (.../branches/alias-improvements)
> (revision 145211)
> @@ -3563,16 +3563,6 @@ vect_analyze_data_refs (loop_vec_info lo
>return false;
>  }
>
> -  if (!DR_SYMBOL_TAG (dr))
> -{
> -  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
> -{
> -  fprintf (vect_dump, "not vectorized: no memory tag for ");
> -  print_generic_expr (vect_dump, DR_REF (dr), TDF_SLIM);
> -}
> -  return false;
> -}
> -
>base = unshare_expr (DR_BASE_ADDRESS (dr));
>offset = unshare_expr (DR_OFFSET (dr));
>init = unshare_expr (DR_INIT (dr));
> @@ -3804,7 +3794,7 @@ vect_stmt_relevant_p (gimple stmt, loop_
>
>/* changing memory.  */
>if (gimple_code (stmt) != GIMPLE_PHI)
> -if (!ZERO_SSA_OPERANDS (stmt, SSA_OP_VIRTUAL_DEFS))
> +if (gimple_vdef (stmt))
>{
> if (vect_print_dump_info (REPORT_DETAILS))
>   fprintf (vect_dump, "vec_stmt_relevant_p: stmt has vdefs.");
> Index: gcc/tree-vect-transform.c
> ===
> --- gcc/tree-vect-transform.c   (.../trunk)   (revision 145210)
> +++ gcc/tree-vect-transform.c   (.../branches/alias-improvements)
> (revision 145211)
> @@ -51,7 +51,7 @@ static bool vect_transform_stmt (gimple,
>   slp_tree, slp_instance);
>  static tree vect_create_destination_var (tree, tree);
>  static tree vect_create_data_ref_ptr
> -  (gimple, struct loop*, tree, tree *, gimple *, bool, bool *, tree);
> +  (gimple, struct loop*, tree, tree *, gimple *, bool, bool *);
>  static tree vect_create_addr_base_for_vector_ref
>(gimple, gimple_seq *, tree, struct loop *);
>  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char
*);
> @@ -1009,7 +1009,7 @@ vect_create_addr_base_for_vector_ref (gi
>  static tree
>  vect_create_data_ref_ptr (gimple stmt, struct loop *at_loop,
> tree offset, tree *initial_address, gimple *ptr_incr,
> -   bool only_init, bool *inv_p, tree type)
> +   bool only_init, bool *inv_p)
>  {
>tree base_name;
>stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> @@ -1020,7 +1020,6 @@ vect_create_data_ref_ptr (gimple stmt, s
>tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>tree vect_ptr_type;
>tree vect_ptr;
> -  tree tag;
>tree new_temp;
>gimple vec_stmt;
>gimple_seq new_stmt_list = NULL;
> @@ -1068,42 +1067,33 @@ vect_create_data_ref_ptr (gimple stmt, s
>  }
>
>/** (1) Create the new vector-pointer variable:  **/
> -  if (type)
> -vect_ptr_type = build_pointer_type (type);
> -  else
> -vect_ptr_type = build_pointer_type (vectype);
> -
> -  if (TREE_CODE (DR_BASE_ADDRESS (dr)) == SSA_NAME
> -  && TYPE_RESTRICT (TREE_TYPE (DR_BASE_ADDRESS (dr
> -vect_ptr_type = build_qualified_type (vect_ptr_type,
TYPE_QUAL_RESTRICT);
> +  vect_ptr_type = build_pointer_ty

Re: Inner loop unable to compute sufficient information during vectorization

2009-05-26 Thread Ira Rosen


gcc-ow...@gcc.gnu.org wrote on 25/05/2009 21:53:41:

> for a loop like
>
> 1   for(i=0;i
> 2     for(j=0;j
> 3       a[i][j] = a[i][j]+b[i][j];
>
> GCC 4.3.* is unable to get the information for the inner loop that
> array reference 'a'  is alias of each other and generates code for
> runtime aliasing check during vectorization.

Both current trunk and GCC4.4 vectorize the inner loop without any runtime
alias checks.

> Is it necessary to
> recompute all information in loop_vec_info in function
> vect_analyze_ref for analysis of inner loop also, as most of the
> information is similar for the outer loop for the program.

Maybe you are right, and it is possible to extract at least part of the
information for the inner loop from the outer loop information.

>
> Similarly, outer loop is able to compute correct chrec i.e. NULL , for
> array 'a' reference, while innerloop has chrec as chrec_dont_know, and
> therfore complaint about runtime alias check.

The chrecs are not the same for inner and outer loops, so it is reasonable
that the results of the data dependence tests will be different.
In this case, however, it seems to be a bug.

Ira





Re: Inner loop unable to compute sufficient information during vectorization

2009-06-02 Thread Ira Rosen


Abhishek Shrivastav  wrote on 31/05/2009
16:44:34:

> In this case, I think that Outer loop could be vectorized as there is
> no dependency in the loop,the access pattern is simple enough and
> there is unit stride in both the loops. Current version 4.4.* is not
> doing outer loop vectorization.

The memory accesses are consecutive in the inner loop and strided in the
outer loop. Therefore, inner loop vectorization is preferable in this case
(and also strided accesses are not yet supported in outer loop
vectorization).

Ira

>
> On Tue, May 26, 2009 at 5:57 PM, Ira Rosen  wrote:
> >
> >
> > gcc-ow...@gcc.gnu.org wrote on 25/05/2009 21:53:41:
> >
> >> for a loop like
> >>
> >> 1   for(i=0;i
> >> 2     for(j=0;j
> >> 3       a[i][j] = a[i][j]+b[i][j];
> >>
> >> GCC 4.3.* is unable to get the information for the inner loop that
> >> array reference 'a'  is alias of each other and generates code for
> >> runtime aliasing check during vectorization.
> >
> > Both current trunk and GCC4.4 vectorize the inner loop without any
runtime
> > alias checks.
> >
> >> Is it necessary to
> >> recompute all information in loop_vec_info in function
> >> vect_analyze_ref for analysis of inner loop also, as most of the
> >> information is similar for the outer loop for the program.
> >
> > Maybe you are right, and it is possible to extract at least part of the
> > information for the inner loop from the outer loop information.
> >
> >>
> >> Similarly, outer loop is able to compute correct chrec i.e. NULL , for
> >> array 'a' reference, while innerloop has chrec as chrec_dont_know, and
> >> therfore complaint about runtime alias check.
> >
> > The chrecs are not the same for inner and outer loops, so it is
reasonable
> > that the results of the data dependence tests will be different.
> > In this case, however, it seems to be a bug.
> >
> > Ira
> >
> >
> >
> >



Re: Loops no longer vectorized

2010-05-30 Thread Ira Rosen


gcc-ow...@gcc.gnu.org wrote on 28/05/2010 03:52:30 PM:

> Hi,
>
> I just noticed today that (implicit) loops of the kind
>
> xmin = minval(nodes(1,inductor_number(1:number_of_nodes)))
>
> (lines 5057 to 5062 of the polyhedron test induct.f90) are no longer
> vectorized (the change occurred between revisions 158215 and
> 158921). With -ftree-vectorizer-verbose=6, I got
>
> induct.f90:5057: note: not vectorized: data ref analysis failed D.
> 6088_872 = (*D.4001_143)[D.6087_871];
>
> induct.f90:5057: note: Alignment of access forced using peeling.
> induct.f90:5057: note: Vectorizing an unaligned access.
> induct.f90:5057: note: vect_model_load_cost: unaligned supported by
hardware.
> induct.f90:5057: note: vect_model_load_cost: inside_cost = 2,
> outside_cost = 0 .
> induct.f90:5057: note: vect_model_simple_cost: inside_cost = 2,
> outside_cost = 0 .
> induct.f90:5057: note: vect_model_store_cost: inside_cost = 2,
> outside_cost = 0 .
> induct.f90:5057: note: cost model: prologue peel iters set to vf/2.
> induct.f90:5057: note: cost model: epilogue peel iters set to vf/2
> because peeling for alignment is unknown .
> induct.f90:5057: note: Cost model analysis:
>   Vector inside of loop cost: 6
>   Vector outside of loop cost: 20
>   Scalar iteration cost: 3
>   Scalar outside cost: 7
>   prologue iterations: 2
>   epilogue iterations: 2
>   Calculated minimum iters for profitability: 5
>
> induct.f90:5057: note:   Profitability threshold = 4
>
> induct.f90:5057: note: Profitability threshold is 4 loop iterations.
> induct.f90:5057: note: LOOP VECTORIZED.
>
> and now:
>
> induct.f90:5057: note: not vectorized: data ref analysis failed D.
> 6017_848 = (*D.4001_131)[D.6016_847];
>
> Is this known/expected or should I open a new PR?

The loop that computes the MIN_EXPR is not vectorizable because of an
indirect access. You can see this for both versions:

induct.f90:5057: note: not vectorized: data ref analysis failed D.
6017_848 = (*D.4001_131)[D.6016_847];

The loop that got vectorized in the older revision is another loop
associated with the same source code line:

:
  # S.648_810 = PHI 
  S.648_856 = S.648_810 + 1;
  D.6082_858 = (*D.4108_840)[S.648_810];
  D.6083_859 = (integer(kind=8)) D.6082_858;
  (*pretmp.3557_2254)[S.648_810] = D.6083_859;
  if (D.4111_844 < S.648_856)
goto ;
  else
goto ;


And in the later revision this loop is replaced with:

:
  D.6008_833 = &(*D.5896_830)[0];
  pretmp.3873_1387 = (integer(kind=4)[0:] *) D.6008_833;


So, there is no loop now.

Ira

>
> Cheers
>
> Dominique



Re: Target macros vs. target hooks - policy/goal is hooks, isn't it?

2010-06-03 Thread Ira Rosen


Steven Bosscher  wrote on 02/06/2010 06:13:36 PM:

>
> On Wed, May 26, 2010 at 7:16 PM, Mark Mitchell 
wrote:
> > Ulrich Weigand wrote:
> >
> >>> So the question is: The goal is to have hooks, not macros, right? If
> >>> so, can reviewers please take care to reject patches that introduce
> >>> new macros?
> >>
> >> I don't know to which extent this is a formal goal these days, but I
> >> personally agree that it would be nice to eliminate macros.
> >
> > Yes, the (informally agreed) policy is to have hooks, not macros.
There
> > may be situations where that is technically impossible, but I'd expect
> > those to be very rare.
>
> Another batch of recently introduced target macros instead of target
hooks:

Not so recently - three years ago.

>
> tree-vectorizer.h:#ifndef TARG_COND_TAKEN_BRANCH_COST
> tree-vectorizer.h:#ifndef TARG_COND_NOT_TAKEN_BRANCH_COST
> tree-vectorizer.h:#ifndef TARG_SCALAR_STMT_COST
> tree-vectorizer.h:#ifndef TARG_SCALAR_LOAD_COST
> tree-vectorizer.h:#ifndef TARG_SCALAR_STORE_COST
> tree-vectorizer.h:#ifndef TARG_VEC_STMT_COST
> tree-vectorizer.h:#ifndef TARG_VEC_TO_SCALAR_COST
> tree-vectorizer.h:#ifndef TARG_SCALAR_TO_VEC_COST
> tree-vectorizer.h:#ifndef TARG_VEC_LOAD_COST
> tree-vectorizer.h:#ifndef TARG_VEC_UNALIGNED_LOAD_COST
> tree-vectorizer.h:#ifndef TARG_VEC_STORE_COST
> tree-vectorizer.h:#ifndef TARG_VEC_PERMUTE_COST
>
> Could the vectorizer folks please turn these into target hooks?

OK, I'll do that.

Ira

>
> Ciao!
> Steven



Re: Target macros vs. target hooks - policy/goal is hooks, isn't it?

2010-06-03 Thread Ira Rosen


Richard Guenther  wrote on 03/06/2010 02:00:00
PM:

> >> tree-vectorizer.h:#ifndef TARG_COND_TAKEN_BRANCH_COST
> >> tree-vectorizer.h:#ifndef TARG_COND_NOT_TAKEN_BRANCH_COST
> >> tree-vectorizer.h:#ifndef TARG_SCALAR_STMT_COST
> >> tree-vectorizer.h:#ifndef TARG_SCALAR_LOAD_COST
> >> tree-vectorizer.h:#ifndef TARG_SCALAR_STORE_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_STMT_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_TO_SCALAR_COST
> >> tree-vectorizer.h:#ifndef TARG_SCALAR_TO_VEC_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_LOAD_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_UNALIGNED_LOAD_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_STORE_COST
> >> tree-vectorizer.h:#ifndef TARG_VEC_PERMUTE_COST
> >>
> >> Could the vectorizer folks please turn these into target hooks?
>
> Btw, a single cost target hook with an enum argument would be
> preferred here.

Where is the best place to define such an enum?
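For concreteness, one possible shape of such a hook (an illustrative sketch;
the enum and hook names here are mine, not from this discussion):

/* One enumerator per cost currently expressed as a TARG_* macro;
   the hook returns the cost for the requested kind.  */
enum vect_cost_kind
{
  COST_SCALAR_STMT,
  COST_SCALAR_LOAD,
  COST_SCALAR_STORE,
  COST_VEC_STMT,
  COST_VEC_TO_SCALAR,
  COST_SCALAR_TO_VEC,
  COST_VEC_LOAD,
  COST_VEC_UNALIGNED_LOAD,
  COST_VEC_STORE,
  COST_VEC_PERMUTE,
  COST_COND_TAKEN_BRANCH,
  COST_COND_NOT_TAKEN_BRANCH
};

/* Single target hook replacing the individual macros.  */
int targetm_vectorization_cost (enum vect_cost_kind kind);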

Thanks,
Ira

>
> Richard.
>
> > OK, I'll do that.
> >
> > Ira
> >
> >>
> >> Ciao!
> >> Steven
> >
> >



Re: Why doesn't vetorizer skips loop peeling/versioning for target supports hardware misaligned access?

2011-01-24 Thread Ira Rosen
Hi,

gcc-ow...@gcc.gnu.org wrote on 24/01/2011 03:21:51 PM:

> Hello,
> Some of our target processors support complete hardware misaligned
> memory access. I implemented movmisalignm patterns, and found
> TARGET_SUPPORT_VECTOR_MISALIGNMENT
> (TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT
> On 4.6) hook is based on checking these patterns. Somehow this
> hook doesn't seem to be used. vect_enhance_data_refs_alignment
> is called regardless whether the target has HW misaligned support
> or not.

targetm.vectorize.support_vector_misalignment is used in
vect_supportable_dr_alignment to decide whether a specific misaligned
access is supported.

>
> Shouldn't using HW misaligned memory access be better than
> generating extra code for loop peeling/versioning? Or at least
> if for some architectures it is not the case, we should have
> a compiler hook to choose between them. BTW, I mainly work
> on 4.5, maybe 4.6 has changed.

Right. And we have that implemented in 4.6, at least partially: for known
misalignment and for peeling for loads. Maybe this part needs to be
enhanced; concrete testcases would help.

Ira

>
> Thanks,
> Bingfeng Mei
>



Re: Documentation for loop infrastructure

2006-09-06 Thread Ira Rosen

> Here is the documentation for the data dependence analysis.

I can add a description of data-refs creation/analysis if it is useful.

Ira



Re: Type yielded by TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD hook?

2006-09-19 Thread Ira Rosen
Hi,

Does this patch fix the problem?

Ira

Index: tree-vect-transform.c
===
--- tree-vect-transform.c   (revision 117002)
+++ tree-vect-transform.c   (working copy)
@@ -1916,10 +1916,10 @@ vectorizable_load (tree stmt, block_stmt
   /* Create permutation mask, if required, in loop preheader.  */
   tree builtin_decl;
   params = build_tree_list (NULL_TREE, init_addr);
-  vec_dest = vect_create_destination_var (scalar_dest, vectype);
   builtin_decl = targetm.vectorize.builtin_mask_for_load ();
   new_stmt = build_function_call_expr (builtin_decl, params);
-  new_stmt = build2 (MODIFY_EXPR, vectype, vec_dest, new_stmt);
+  vec_dest = vect_create_destination_var (scalar_dest, TREE_TYPE (new_stmt));
+  new_stmt = build2 (MODIFY_EXPR, TREE_TYPE (vec_dest), vec_dest, new_stmt);
   new_temp = make_ssa_name (vec_dest, new_stmt);
   TREE_OPERAND (new_stmt, 0) = new_temp;
   new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);


Dorit Nuzman/Haifa/IBM wrote on 16/09/2006 12:37:28:

> > I'm trying to add a hook for aligning vectors for loads.
> >
> > I'm using the altivec rs6000 code as a baseline.
> >
> > However, the instruction is like the iwmmxt_walign instruction in the
> > ARM port; it takes
> > a normalish register and uses the bottom bits... it doesn't use a
> > full-width vector.
> >
> > GCC complains when my builtin pointed to by
> > TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD yields a QImode result, because
> > it has no way of converting that to the vector moe it is expecting.  I
>
> Looks like it's a bug in the vectorizer - we treat both the return
> value of the mask_for_load builtin, and the 3rd argument to the
> realign_load stmt (e.g. Altivec's vperm), as variables of type
> 'vectype', instead of obtaining the type from the target machine
> description. All we need to care about is that these two variables
> have the same type. I'll look into that
>
> dorit
>
> > think the altivec side would have a similar problem, as the expected
> > RTX output RTX is:
> >
> > (reg:V8HI 131 [ vect_var_.2540 ])
> >
> > but it changes that to:
> >
> > (reg:V16QI 160)
> >
> > for the VLSR instruction.  V16QImode is what VPERM expects, and I
> > think since V8HI and V16QI mode are the same size everyone is happy.
> >
> > Is there a way to tell GCC what the type of the
> > TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD should be?  Looking at
> > http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gccint/Addressing-Modes.
> > html#Addressing-Modes
> > it reads like it must merely match the last operand of the
> > vec_realign_load_ pattern.
> >
> > --
> > Why are ``tolerant'' people so intolerant of intolerant people?



Re: Type yielded by TARGET_VECTORIZE_BUILTIN_MASK_FOR_LOAD hook?

2006-09-19 Thread Ira Rosen


"Erich Plondke" <[EMAIL PROTECTED]> wrote on 20/09/2006 04:09:14:

> On 9/19/06, Erich Plondke <[EMAIL PROTECTED]> wrote:
> > On 9/19/06, Ira Rosen <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > Does this patch fix the problem?
> >
> > Well... seems pretty good.  I get the instruction generated from the
> > builtin, and it lives outside the body of the loop.
> >
> > GCC then moves the value out of the special register, zero extends it,
> > and moves it back into the special register uselessly.  :-(  But I
> > have the feeling that I have something else in my backend to blame for
> > that.
>
> Yes, it's because the special register is a QI and the PROMOTE_MODE
> macro always
> says to promote the QI to an SI.
>
> So the patch looks great!  Thanks!

Great, I'll prepare a patch for the mainline then.
Ira

>
> --
> Why are ``tolerant'' people so intolerant of intolerant people?



Re: Documentation for loop infrastructure

2006-09-24 Thread Ira Rosen


Sebastian Pop <[EMAIL PROTECTED]> wrote on 08/09/2006 18:04:01:

> Ira Rosen wrote:
> >
> > > Here is the documentation for the data dependence analysis.
> >
> > I can add a description of data-refs creation/analysis if it is useful.
> >
>
> That's a good idea, thanks.
>
> Sebastian

Here it is.
Ira

> The data references are discovered in a particular order during the
> scanning of the loop body: the loop body is analyzed in execution
> order, and the data references of each statement are pushed at the end
> of the data reference array.  Two data references syntactically occur
> in the program in the same order as in the array of data references.
> This syntactic order is important in some classical data dependence
> tests, and mapping this order to the elements of this array avoids
> costly queries to the loop body representation.

Three types of data references are currently handled: ARRAY_REF,
INDIRECT_REF and COMPONENT_REF. The data structure for the data reference
is @code{data_reference}, where @code{data_reference_p} is a name of a
pointer to the data reference structure. The structure contains the
following elements:

@itemize
@item @code{base_object_info}: Provides information about the base object
of the data reference and its access functions. These access functions
represent the evolution of the data reference in the loop relative to
its base, in keeping with the classical meaning of the data reference
access function for the support of arrays. For example, for a reference
@code{a.b[i][j]}, the base object is @code{a.b} and the access functions,
one for each array subscript, are:
@code{@{i_init, +, i_step@}, @{j_init, +, j_step@}}.

@item @code{first_location_in_loop}: Provides information about the first
location accessed by the data reference in the loop and about the access
function used to represent evolution relative to this location. This data
is used to support pointers, and is not used for arrays (for which we
have base objects). Pointer accesses are represented as a one-dimensional
access that starts from the first location accessed in the loop. For
example:

@smallexample
  for i
 for j
  *((int *)p + i + j) = a[i][j];
@end smallexample

The access function of the pointer access is @code{@{0, +, 4B@}} relative
to @code{p + i}. The access functions of the array are
@code{@{i_init, +, i_step@}} and @code{@{j_init, +, j_step@}} relative
to @code{a}.

Usually, the object the pointer refers to is either unknown, or we can’t
prove that the access is confined to the boundaries of a certain object.

Two data references can be compared only if at least one of these two
representations has all its fields filled for both data references.

The current strategy for data dependence tests is as follows:
If both @code{a} and @code{b} are represented as arrays, compare
@code{a.base_object} and @code{b.base_object};
if they are equal, apply dependence tests (use access functions based on
base_objects).
Else if both @code{a} and @code{b} are represented as pointers, compare
@code{a.first_location} and @code{b.first_location};
if they are equal, apply dependence tests (use access functions based on
first location).
However, if @code{a} and @code{b} are represented differently, only try
to prove that the bases are definitely different.

@item Aliasing information.
@item Alignment information.
@end itemize

> The structure describing the relation between two data references is
> @code{data_dependence_relation} and the shorter name for a pointer to
> such a structure is @code{ddr_p}.  This structure contains:

Re: Documentation for loop infrastructure

2006-09-28 Thread Ira Rosen


Sebastian Pop <[EMAIL PROTECTED]> wrote on 26/09/2006 21:24:18:

> It is probably better to include the loop indexes in the example, and
> modify the syntax of the scev for making it more explicit, like:
>
> @smallexample
>   for1 i
>  for2 j
>   *((int *)p + i + j) = a[i][j];
> @end smallexample
>
> and the access function becomes: @code{@{0, +, 4B@}_2}
>

Done.

I guess I'll commit my part as soon as loop.texi (and the dependency
analysis part) is committed.

Ira


> The data references are discovered in a particular order during the
> scanning of the loop body: the loop body is analyzed in execution
> order, and the data references of each statement are pushed at the end
> of the data reference array.  Two data references syntactically occur
> in the program in the same order as in the array of data references.
> This syntactic order is important in some classical data dependence
> tests, and mapping this order to the elements of this array avoids
> costly queries to the loop body representation.

Three types of data references are currently handled: ARRAY_REF,
INDIRECT_REF and COMPONENT_REF. The data structure for the data reference
is @code{data_reference}, where @code{data_reference_p} is a name of a
pointer to the data reference structure. The structure contains the
following elements:

@itemize
@item @code{base_object_info}: Provides information about the base object
of the data reference and its access functions. These access functions
represent the evolution of the data reference in the loop relative to
its base, in keeping with the classical meaning of the data reference
access function for the support of arrays. For example, for a reference
@code{a.b[i][j]}, the base object is @code{a.b} and the access functions,
one for each array subscript, are:
@code{@{i_init, +, i_step@}, @{j_init, +, j_step@}}.

@item @code{first_location_in_loop}: Provides information about the first
location accessed by the data reference in the loop and about the access
function used to represent evolution relative to this location. This data
is used to support pointers, and is not used for arrays (for which we
have base objects). Pointer accesses are represented as a one-dimensional
access that starts from the first location accessed in the loop. For
example:

@smallexample
  for1 i
 for2 j
  *((int *)p + i + j) = a[i][j];
@end smallexample

The access function of the pointer access is @code{@{0, +, 4B@}_2}
relative to @code{p + i}. The access functions of the array are
@code{@{i_init, +, i_step@}_1} and @code{@{j_init, +, j_step@}_2}
relative to @code{a}.

Usually, the object the pointer refers to is either unknown, or we can’t
prove that the access is confined to the boundaries of a certain object.

Two data references can be compared only if at least one of these two
representations has all its fields filled for both data references.

The current strategy for data dependence tests is as follows:
If both @code{a} and @code{b} are represented as arrays, compare
@code{a.base_object} and @code{b.base_object};
if they are equal, apply dependence tests (use access functions based on
base_objects).
Else if both @code{a} and @code{b} are represented as pointers, compare
@code{a.first_location} and @code{b.first_location};
if they are equal, apply dependence tests (use access functions based on
first location).
However, if @code{a} and @code{b} are represented differently, only try
to prove that the bases are definitely different.

@item Aliasing information.
@item Alignment information.
@end itemize

> The structure describing the relation between two data references is
> @code{data_dependence_relation} and the shorter name for a pointer to
> such a structure is @code{ddr_p}.  This structure contains:

Re: Documentation for loop infrastructure

2006-10-05 Thread Ira Rosen


Zdenek Dvorak <[EMAIL PROTECTED]> wrote on 28/09/2006
15:04:07:

>
> I have commited the documentation, including the parts from Daniel and
> Sebastian (but not yours) now.
>
> Zdenek

I've committed my part.

Ira



Added myself to MAINTAINERS (write after approval)

2005-02-17 Thread Ira Rosen




Index: MAINTAINERS
===
RCS file: /cvs/gcc/gcc/MAINTAINERS,v
retrieving revision 1.395
diff -c -3 -p -r1.395 MAINTAINERS
*** MAINTAINERS 14 Feb 2005 11:21:09 -  1.395
--- MAINTAINERS 17 Feb 2005 08:50:31 -
*** Volker Reichelt
[EMAIL PROTECTED]
*** 287,292 
--- 287,293 
  Tom Rix   [EMAIL PROTECTED]
  Craig Rodrigues   [EMAIL PROTECTED]
  Gavin Romig-Koch  [EMAIL PROTECTED]
+ Ira Rosen   [EMAIL PROTECTED]
  Ira Ruben [EMAIL PROTECTED]
  Douglas Rupp  [EMAIL PROTECTED]
  Matthew Sachs [EMAIL PROTECTED]



Re: Mainline is now regression and documentation fixes only

2008-01-24 Thread Ira Rosen


Dorit Nuzman/Haifa/IBM wrote on 23/01/2008 21:49:51:

> There are however a couple of small cost-model changes that were
> going to be submitted this week for the Cell SPU - it's unfortunate
> if these cannot get into 4.3.

It's indeed unfortunate. However, those changes are not crucial and there
is still some more work to be done (check on additional benchmarks, etc.).
So, I guess, it will have to wait for 4.4.

Ira

>
> dorit
>



Re: Memory leaks in compiler

2008-01-29 Thread Ira Rosen

(I am resending this, since some of the addresses got corrupted. My
apologies.)

Hi,

[EMAIL PROTECTED] wrote on 16/01/2008 15:20:00:

> > When a loop is vectorized, some statements are removed from the basic
> > blocks, but the vectorizer information attached to these BBs is never
> > freed.
>
> Sebastian, thanks for bringing this to our attention. I'll look into
this.
> I hope that removing stmts from a BB can be easily localized.
> -- Victor
>

The attached patch, mainly written by Victor, fixes memory leaks in the
vectorizer that were found with the help of valgrind and by examining the
code.

Bootstrapped with vectorization enabled and tested on vectorizer testsuite
on ppc-linux. I still have to perform full regtesting.

Is it O.K. for 4.3? Or will it wait for 4.4?

Thanks,
Victor and Ira

ChangeLog:

  * tree-vectorizer.c (free_stmt_vec_info): New function.
  (destroy_loop_vec_info): Move code to free_stmt_vec_info().
  Call free_stmt_vec_info(). Free LOOP_VINFO_STRIDED_STORES.
  * tree-vectorizer.h (free_stmt_vec_info): Declare.
  * tree-vect-transform.c (vectorizable_conversion): Free
  vec_oprnds0 if it was allocated.
  (vect_permute_store_chain): Remove unused VECs.
  (vectorizable_store): Free VECs that are allocated in the
  function.
  (vect_transform_strided_load, vectorizable_load): Likewise.
  (vect_remove_stores): Simplify the code.
  (vect_transform_loop): Move code to vect_remove_stores().
  Call vect_remove_stores() and free_stmt_vec_info().


(See attached file: memleaks.txt)

Index: tree-vectorizer.c
===
--- tree-vectorizer.c   (revision 131899)
+++ tree-vectorizer.c   (working copy)
@@ -1558,6 +1558,22 @@ new_stmt_vec_info (tree stmt, loop_vec_i
 }
 
 
+/* Free stmt vectorization related info.  */
+
+void
+free_stmt_vec_info (tree stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  if (!stmt_info)
+return;
+
+  VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
+  free (stmt_info);
+  set_stmt_info (stmt_ann (stmt), NULL);
+}
+
+
 /* Function bb_in_loop_p
 
Used as predicate for dfs order traversal of the loop bbs.  */
@@ -1714,21 +1730,13 @@ destroy_loop_vec_info (loop_vec_info loo
 {
   basic_block bb = bbs[j];
   tree phi;
-  stmt_vec_info stmt_info;
 
   for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
-{
-  stmt_ann_t ann = stmt_ann (phi);
-
-  stmt_info = vinfo_for_stmt (phi);
-  free (stmt_info);
-  set_stmt_info (ann, NULL);
-}
+free_stmt_vec_info (phi);
 
   for (si = bsi_start (bb); !bsi_end_p (si); )
{
  tree stmt = bsi_stmt (si);
- stmt_ann_t ann = stmt_ann (stmt);
  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 
  if (stmt_info)
@@ -1746,9 +1754,7 @@ destroy_loop_vec_info (loop_vec_info loo
}

  /* Free stmt_vec_info.  */
- VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
- free (stmt_info);
- set_stmt_info (ann, NULL);
+ free_stmt_vec_info (stmt);
 
  /* Remove dead "pattern stmts".  */
  if (remove_stmt_p)
@@ -1767,6 +1773,7 @@ destroy_loop_vec_info (loop_vec_info loo
   for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
 vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
   VEC_free (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo));
+  VEC_free (tree, heap, LOOP_VINFO_STRIDED_STORES (loop_vinfo));
 
   free (loop_vinfo);
   loop->aux = NULL;
Index: tree-vectorizer.h
===
--- tree-vectorizer.h   (revision 131899)
+++ tree-vectorizer.h   (working copy)
@@ -667,6 +667,7 @@ extern bool supportable_narrowing_operat
 extern loop_vec_info new_loop_vec_info (struct loop *loop);
 extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern stmt_vec_info new_stmt_vec_info (tree stmt, loop_vec_info);
+extern void free_stmt_vec_info (tree stmt);
 
 
 /** In tree-vect-analyze.c  **/
Index: tree-vect-transform.c
===
--- tree-vect-transform.c   (revision 131899)
+++ tree-vect-transform.c   (working copy)
@@ -3638,6 +3638,9 @@ vectorizable_conversion (tree stmt, bloc
   *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
 }
 
+  if (vec_oprnds0)
+VEC_free (tree, heap, vec_oprnds0); 
+
   return true;
 }
 
@@ -4589,11 +4592,8 @@ vect_permute_store_chain (VEC(tree,heap)
   tree scalar_dest, tmp;
   int i;
   unsigned int j;
-  VEC(tree,heap) *first, *second;
   
   scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
-  first = VEC_alloc (tree, heap, length/2);
-  second = VEC_alloc (tree, heap, length/2);
 
   /* Check that the operation is supported.  */
   if (!vect_strided_store_supported (vectype

Re: Memory leaks in compiler

2008-01-29 Thread Ira Rosen

Hi,

[EMAIL PROTECTED] wrote on 16/01/2008 15:20:00:

> > When a loop is vectorized, some statements are removed from the basic
> > blocks, but the vectorizer information attached to these BBs is never
> > freed.
>
> Sebastian, thanks for bringing this to our attention. I'll look into
this.
> I hope that removing stmts from a BB can be easily localized.
> -- Victor
>

The attached patch, mainly written by Victor, fixes memory leaks in the
vectorizer, that were found with the help of valgrind and by examining the
code.

Bootstrapped with vectorization enabled and tested on vectorizer testsuite
on ppc-linux. I still have to perform full regtesting.

Is it O.K. for 4.3? Or will it wait for 4.4?

Thanks,
Victor and Ira

ChangeLog:

  * tree-vectorizer.c (free_stmt_vec_info): New function.
  (destroy_loop_vec_info): Move code to free_stmt_vec_info().
  Call free_stmt_vec_info(). Free LOOP_VINFO_STRIDED_STORES.
  * tree-vectorizer.h (free_stmt_vec_info): Declare.
  * tree-vect-transform.c (vectorizable_conversion): Free
  vec_oprnds0 if it was allocated.
  (vect_permute_store_chain): Remove unused VECs.
  (vectorizable_store): Free VECs that are allocated in the
  function.
  (vect_transform_strided_load, vectorizable_load): Likewise.
  (vect_remove_stores): Simplify the code.
  (vect_transform_loop): Move code to vect_remove_stores().
  Call vect_remove_stores() and free_stmt_vec_info().


(See attached file: memleaks.txt)





Index: tree-vectorizer.c
===
--- tree-vectorizer.c   (revision 131899)
+++ tree-vectorizer.c   (working copy)
@@ -1558,6 +1558,22 @@ new_stmt_vec_info (tree stmt, loop_vec_i
 }
 
 
+/* Free stmt vectorization related info.  */
+
+void
+free_stmt_vec_info (tree stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  if (!stmt_info)
+return;
+
+  VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
+  free (stmt_info);
+  set_stmt_info (stmt_ann (stmt), NULL);
+}
+
+
 /* Function bb_in_loop_p
 
Used as predicate for dfs order traversal of the loop bbs.  */
@@ -1714,21 +1730,13 @@ destroy_loop_vec_info (loop_vec_info loo
 {
   basic_block bb = bbs[j];
   tree phi;
-  stmt_vec_info stmt_info;
 
   for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
-{
-  stmt_ann_t ann = stmt_ann (phi);
-
-  stmt_info = vinfo_for_stmt (phi);
-  free (stmt_info);
-  set_stmt_info (ann, NULL);
-}
+free_stmt_vec_info (phi);
 
   for (si = bsi_start (bb); !bsi_end_p (si); )
{
  tree stmt = bsi_stmt (si);
- stmt_ann_t ann = stmt_ann (stmt);
  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 
  if (stmt_info)
@@ -1746,9 +1754,7 @@ destroy_loop_vec_info (loop_vec_info loo
}

  /* Free stmt_vec_info.  */
- VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
- free (stmt_info);
- set_stmt_info (ann, NULL);
+ free_stmt_vec_info (stmt);
 
  /* Remove dead "pattern stmts".  */
  if (remove_stmt_p)
@@ -1767,6 +1773,7 @@ destroy_loop_vec_info (loop_vec_info loo
   for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
 vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
   VEC_free (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo));
+  VEC_free (tree, heap, LOOP_VINFO_STRIDED_STORES (loop_vinfo));
 
   free (loop_vinfo);
   loop->aux = NULL;
Index: tree-vectorizer.h
===
--- tree-vectorizer.h   (revision 131899)
+++ tree-vectorizer.h   (working copy)
@@ -667,6 +667,7 @@ extern bool supportable_narrowing_operat
 extern loop_vec_info new_loop_vec_info (struct loop *loop);
 extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern stmt_vec_info new_stmt_vec_info (tree stmt, loop_vec_info);
+extern void free_stmt_vec_info (tree stmt);
 
 
 /** In tree-vect-analyze.c  **/
Index: tree-vect-transform.c
===
--- tree-vect-transform.c   (revision 131899)
+++ tree-vect-transform.c   (working copy)
@@ -3638,6 +3638,9 @@ vectorizable_conversion (tree stmt, bloc
   *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
 }
 
+  if (vec_oprnds0)
+VEC_free (tree, heap, vec_oprnds0); 
+
   return true;
 }
 
@@ -4589,11 +4592,8 @@ vect_permute_store_chain (VEC(tree,heap)
   tree scalar_dest, tmp;
   int i;
   unsigned int j;
-  VEC(tree,heap) *first, *second;
   
   scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
-  first = VEC_alloc (tree, heap, length/2);
-  second = VEC_alloc (tree, heap, length/2);
 
   /* Check that the operation is supported.  */
   if (!vect_strided_store_supported (vectype))
@@ -4976,6 +4976,11 @@ vectorizable_store (tree stmt, block_stm
}
 

Re: Memory leaks in compiler

2008-01-29 Thread Ira Rosen

(I am resending this, since some of the addresses got corrupted. My
apologies.)

Hi,

[EMAIL PROTECTED] wrote on 16/01/2008 15:20:00:

> > When a loop is vectorized, some statements are removed from the basic
> > blocks, but the vectorizer information attached to these BBs is never
> > freed.
>
> Sebastian, thanks for bringing this to our attention. I'll look into
this.
> I hope that removing stmts from a BB can be easily localized.
> -- Victor
>

The attached patch, mainly written by Victor, fixes memory leaks in the
vectorizer, that were found with the help of valgrind and by examining the
code.

Bootstrapped with vectorization enabled and tested on vectorizer testsuite
on ppc-linux. I still have to perform full regtesting.

Is it O.K. for 4.3? Or will it wait for 4.4?

Thanks,
Victor and Ira

ChangeLog:

  * tree-vectorizer.c (free_stmt_vec_info): New function.
  (destroy_loop_vec_info): Move code to free_stmt_vec_info().
  Call free_stmt_vec_info(). Free LOOP_VINFO_STRIDED_STORES.
  * tree-vectorizer.h (free_stmt_vec_info): Declare.
  * tree-vect-transform.c (vectorizable_conversion): Free
  vec_oprnds0 if it was allocated.
  (vect_permute_store_chain): Remove unused VECs.
  (vectorizable_store): Free VECs that are allocated in the
  function.
  (vect_transform_strided_load, vectorizable_load): Likewise.
  (vect_remove_stores): Simplify the code.
  (vect_transform_loop): Move code to vect_remove_stores().
  Call vect_remove_stores() and free_stmt_vec_info().


(See attached file: memleaks.txt)

Index: tree-vectorizer.c
===
--- tree-vectorizer.c   (revision 131899)
+++ tree-vectorizer.c   (working copy)
@@ -1558,6 +1558,22 @@ new_stmt_vec_info (tree stmt, loop_vec_i
 }
 
 
+/* Free stmt vectorization related info.  */
+
+void
+free_stmt_vec_info (tree stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  if (!stmt_info)
+return;
+
+  VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
+  free (stmt_info);
+  set_stmt_info (stmt_ann (stmt), NULL);
+}
+
+
 /* Function bb_in_loop_p
 
Used as predicate for dfs order traversal of the loop bbs.  */
@@ -1714,21 +1730,13 @@ destroy_loop_vec_info (loop_vec_info loo
 {
   basic_block bb = bbs[j];
   tree phi;
-  stmt_vec_info stmt_info;
 
   for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
-{
-  stmt_ann_t ann = stmt_ann (phi);
-
-  stmt_info = vinfo_for_stmt (phi);
-  free (stmt_info);
-  set_stmt_info (ann, NULL);
-}
+free_stmt_vec_info (phi);
 
   for (si = bsi_start (bb); !bsi_end_p (si); )
{
  tree stmt = bsi_stmt (si);
- stmt_ann_t ann = stmt_ann (stmt);
  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
 
  if (stmt_info)
@@ -1746,9 +1754,7 @@ destroy_loop_vec_info (loop_vec_info loo
}

  /* Free stmt_vec_info.  */
- VEC_free (dr_p, heap, STMT_VINFO_SAME_ALIGN_REFS (stmt_info));
- free (stmt_info);
- set_stmt_info (ann, NULL);
+ free_stmt_vec_info (stmt);
 
  /* Remove dead "pattern stmts".  */
  if (remove_stmt_p)
@@ -1767,6 +1773,7 @@ destroy_loop_vec_info (loop_vec_info loo
   for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
 vect_free_slp_tree (SLP_INSTANCE_TREE (instance));
   VEC_free (slp_instance, heap, LOOP_VINFO_SLP_INSTANCES (loop_vinfo));
+  VEC_free (tree, heap, LOOP_VINFO_STRIDED_STORES (loop_vinfo));
 
   free (loop_vinfo);
   loop->aux = NULL;
Index: tree-vectorizer.h
===
--- tree-vectorizer.h   (revision 131899)
+++ tree-vectorizer.h   (working copy)
@@ -667,6 +667,7 @@ extern bool supportable_narrowing_operat
 extern loop_vec_info new_loop_vec_info (struct loop *loop);
 extern void destroy_loop_vec_info (loop_vec_info, bool);
 extern stmt_vec_info new_stmt_vec_info (tree stmt, loop_vec_info);
+extern void free_stmt_vec_info (tree stmt);
 
 
 /** In tree-vect-analyze.c  **/
Index: tree-vect-transform.c
===
--- tree-vect-transform.c   (revision 131899)
+++ tree-vect-transform.c   (working copy)
@@ -3638,6 +3638,9 @@ vectorizable_conversion (tree stmt, bloc
   *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
 }
 
+  if (vec_oprnds0)
+VEC_free (tree, heap, vec_oprnds0); 
+
   return true;
 }
 
@@ -4589,11 +4592,8 @@ vect_permute_store_chain (VEC(tree,heap)
   tree scalar_dest, tmp;
   int i;
   unsigned int j;
-  VEC(tree,heap) *first, *second;
   
   scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
-  first = VEC_alloc (tree, heap, length/2);
-  second = VEC_alloc (tree, heap, length/2);
 
   /* Check that the operation is supported.  */
   if (!vect_strided_store_supported (vectype))

Re: Optimizations documentation

2008-02-17 Thread Ira Rosen
Hi,

Dorit Nuzman/Haifa/IBM wrote on 14/02/2008 17:02:45:

> This is an old debt: A while back Tim had sent me a detailed report
> off line showing which C++ tests (originally from the Dongara loops
> suite) were vectorized by current g++ or icpc, or both, as well as
> when the vectorization by icpc required a pragma, or was partial. I
> went over the loops that were reported to be vectorized by icc but
> not by gcc, to see which features we are missing. There are 23 such
> loops (out of a total of 77). They fall into the following 7 categories:
>
> (1) scalar evolution analysis fails with "evolution of base is not
affine".
> This happens in the 3 loops in lines 4267, 4204 and 511.
> Here an example:
>  for (i__ = 1; i__ <= i__2; ++i__)
> {
>   a[i__] = (b[i__] + b[im1] + b[im2]) * .333f;
>   im2 = im1;
>   im1 = i__;
> }
> Missed optimization PR to be opened.

I opened PR35224.

>
> (2) Function calls inside a loop. These are calls to the math
> functions sin/cos, which I expect would be vectorized if the proper
> simd math lib was available.
> This happens in the loop in line 6932.
> I think there's an open PR for this one (at least for
> powerpc/Altivec?) - need to look/open.

There is PR6.
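
For reference, the kind of loop this category covers looks like the sketch
below (a made-up testcase, not one of the Dongarra loops); the libm call
keeps it from being vectorized until a SIMD version of sin/cos is available
to the compiler:

   #include <math.h>

   void
   apply_sin (double *a, const double *b, int n)
   {
     int i;
     for (i = 0; i < n; i++)
       a[i] = sin (b[i]);   /* scalar call - no vectorized sin to map it to */
   }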

>
> (3) This one is the most dominant missed optimization: if-conversion
> is failing to if-convert, most likely due to the very limited
> handling of loads/stores (i.e. load/store hoisting/sinking is required).
> This happens in the 13 loops in lines 4085, 4025, 3883, 3818, 3631,
> 355, 3503, 2942, 877, 6740, 6873, 5191, 7943.
> There is on going work towards addressing this issue - see http:
> //gcc.gnu.org/ml/gcc/2007-07/msg00942.html, http://gcc.gnu.
> org/ml/gcc/2007-09/msg00308.html. (I think Victor Kaplansky is
> currently working on this).
>
> (4) A scalar variable, whose address is taken outside the loop (in
> an enclosing outer-loop) is analyzed by the data-references
> analysis, which fails because it is invariant.
> Here's an example:
>   for (nl = 1; nl <= i__1; ++nl)
> {
>   sum = 0.f;
>   for (i__ = 1; i__ <= i__2; ++i__)
> {
>   a[i__] = c__[i__] + d__[i__];
>   b[i__] = c__[i__] + e[i__];
>   sum += a[i__] + b[i__];
> }
>   dummy_ (ld, n, &a[1], &b[1], &c__[1], &d__[1], &e[1], &aa
[aa_offset],
>   &bb[bb_offset], &cc[cc_offset], &sum);
> }
> (Analysis of 'sum' fails with "FAILED as dr address is invariant".)
> This happens in the 2 loops in lines 5053 and 332.
> I think there is a missed optimization PR for this one already. need
> to look/open.
>

The related PRs are PR33245 and PR33244. Also there is a FIXME comment in
tree-data-ref.c before the failure with "FAILED as dr address is invariant"
error:

  /* FIXME -- data dependence analysis does not work correctly for objects
     with invariant addresses.  Let us fail here until the problem is
     fixed.  */


> (5) Reduction and induction that involve multiplication (i.e. 'prod
> *= CST' or 'prod *= a[i]') are currently not supported by the
> vectorizer. It should be trivial to add support for this feature
> (for reduction, it shouldn't be much more than adding a case for
> MULT_EXPR in tree-vectorizer.c:reduction_code_for_scalar_code, I think).
> This happens in the 2 loops in lines 4921 and 4632.
> A missed-optimization PR to be opened.

Opened PR35226.

>
> (6) loop distribution is required to break a dependence. This may
> already be handled by Sebastian's loop-distribution pass that will
> be incorporated in 4.4.
> Here is an example:
>  for (i__ = 2; i__ <= i__2; ++i__)
> {
>   a[i__] += c__[i__] * d__[i__];
>   b[i__] = a[i__] + d__[i__] + b[i__ - 1];
> }
> This happens in the loop in line 2136.
> Need to check if we need to open a missed optimization PR for this.

I don't think that this is a loop distribution issue. The dependence
between the store to a[i] and the load from a[i] doesn't prevent
vectorization. The problematic one is between the store to b[i] and the
load from b[i-1] in the second statement.
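
For illustration (my annotation, not part of the original report), the
cross-iteration dependence that does prevent vectorization is the one on b:

   b[i__]     = a[i__]     + d__[i__]     + b[i__ - 1];   /* iteration i writes b[i]   */
   b[i__ + 1] = a[i__ + 1] + d__[i__ + 1] + b[i__];       /* iteration i+1 reads it back */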

>
> (7) A dependence, similar to such that would be created by
> predictive commoning (or even PRE), is present in the loop:
>  for (i__ = 1; i__ <= i__2; ++i__)
> {
>   a[i__] = (b[i__] + x) * .5f;
>   x = b[i__];
> }
> This happens in the loop in line 3003.
> The vectorizer needs to be extended to handle such cases.
> A missed optimization PR to be opened (if doesn't exist already).

I opened a new PR - 35229. (PR33244 is somewhat related).

Ira



Re: Optimizations documentation

2008-02-17 Thread Ira Rosen


Dorit Nuzman/Haifa/IBM wrote on 18/02/2008 09:40:37:

> Thanks a lot for tracking down / opening the relevant PRs.
>
> about:
>
> > > (6) loop distribution is required to break a dependence. This may
> > > already be handled by Sebastian's loop-distribution pass that will
> > > be incorporated in 4.4.
> > > Here is an example:
> > >  for (i__ = 2; i__ <= i__2; ++i__)
> > > {
> > >   a[i__] += c__[i__] * d__[i__];
> > >   b[i__] = a[i__] + d__[i__] + b[i__ - 1];
> > > }
> > > This happens in the loop in line 2136.
> > > Need to check if we need to open a missed optimization PR for this.
> >
> > I don't think that this is a loop distribution issue. The dependence
> > between the store to a[i] and the load from a[i] doesn't prevent
> > vectorization.
>
> right,
>
> > The problematic one is between the store to b[i] and
> > the load from b[i-1] in the second statement.
>
> ...which is exactly why loop distribution could make this loop
> (partially) vectorizable - separating the first and second
> statements into separate loops would allow vectorizing the first of
> the two resulting loops (which is probably what icc does - icc
> reports that this loop is partially vectorizable).

Yes, I see now.
I applied Sebastian's patch (
http://gcc.gnu.org/ml/gcc-patches/2007-12/msg00215.html) and got
"FIXME: Loop 1 not distributed: failed to build the RDG."

Ira

>
> dorit
>



Re: vectorizer default in 4.3.0 changes document missing

2008-03-10 Thread Ira Rosen
Hi Andi,

[EMAIL PROTECTED] wrote on 10/03/2008 18:32:35:

>
> I noticed the gcc 4.3.0 changes document on the website does not
> mention that the vectorizer is now on by default in -O3.
> Perhaps that should be added? It seems like an important noteworthy
> change to me.

Thanks for pointing this out. The vectorizer's website has not been updated
for a while. I am going to do that.

>
> I'm not sure it applies to all architectures, but it applies to
> x86 at least.

Vectorization (-ftree-vectorize) is on by default in -O3 on all platforms,
but many architectures require additional flags to actually apply it, like
-maltivec on PowerPC.
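
For example (my illustration; the exact defaults depend on how the compiler
was configured):

   gcc -O3 -maltivec file.c   # PowerPC: Altivec must be enabled explicitly
   gcc -O3 -msse2 file.c      # 32-bit x86: SSE2 is usually not the default
   gcc -O3 file.c             # x86_64: SSE2 is part of the base ABI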

Thanks,
Ira

>
> -Andi



Re: Auto-vectorization: need to know what to expect

2008-03-18 Thread Ira Rosen

[EMAIL PROTECTED] wrote on 17/03/2008 19:33:23:

> I have looked more closely at the messages generated by the gcc 4.3
> vectorizer
> and it seems that they fall into two categories:
>
> 1) complaining about aligmnent.
>
> For example:
>
> Unknown alignment for access: D.33485
> Unknown alignment for access: m

These do not necessarily mean that the loop can't be vectorized - we can
handle unknown alignment with loop peeling and loop versioning.
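
The sketch below shows the peeling idea by hand (an illustration only, not
the vectorizer's actual output): scalar iterations are peeled off until the
store target reaches a 16-byte boundary, the main loop then works on aligned
stores, and a scalar epilogue handles the leftover iterations. Versioning
instead emits both a vector and a scalar copy of the loop and picks one at
run time based on an alignment test.

   #include <stdint.h>

   void
   add_arrays (double *a, const double *b, const double *c, int n)
   {
     int i = 0;

     /* Scalar prologue: peel until &a[i] is 16-byte aligned.  */
     while (i < n && (((uintptr_t) &a[i]) & 15) != 0)
       {
         a[i] = b[i] + c[i];
         i++;
       }

     /* Main loop: the stores through 'a' are now 16-byte aligned, so each
        pair of iterations can map to an aligned vector operation.  */
     for (; i + 2 <= n; i += 2)
       {
         a[i]     = b[i]     + c[i];
         a[i + 1] = b[i + 1] + c[i + 1];
       }

     /* Scalar epilogue for the remaining iterations.  */
     for (; i < n; i++)
       a[i] = b[i] + c[i];
   }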

>
> I don't understand, as all my data is statically allocated doubles
> (no dynamic
> memory allocation) and I am using -malign-double. What more can I do?
>
> 2) complaining about "possible dependence" between some data and itself
>
> Example:
>
> not vectorized, possible dependence between data-refs
> m.m_storage.m_data[D.43225_112] and m.m_storage.m_data[D.43225_112]

These two data-refs are probably a store and a load to the same place, not
the same data-ref.
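
For illustration only (a made-up shape with a placeholder 'compute_index',
not the poster's code): a statement of the form below contains one load and
one store of the same element, so the dependence test compares two data-refs
that print identically; whether it can prove the dependence harmless depends
on how much it knows about the subscript.

   for (i = 0; i < n; i++)
     {
       j = compute_index (i);      /* subscript computed at run time */
       m.m_storage.m_data[j] = m.m_storage.m_data[j] * 2.0;
     }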

As has already been said, the best thing to do is to open a PR with a
testcase, so we can fully analyze it and answer all the questions.

Ira

>
>
> I am wondering what to do about all that? Surely there must be
documentation
> about the vectorizer and its messages somewhere but I can't find it?
>
> Cheers,
> Benoit
>
>
> On Monday 17 March 2008 15:59:21 Richard Guenther wrote:
> > On Mon, Mar 17, 2008 at 3:45 PM, Benoît Jacob <[EMAIL PROTECTED]>
wrote:
> > > Dear All,
> > >
> > >  I am currently (co-)developing a Free (GPL/LGPL) C++ library for
> > > vector/matrix math.
> > >
> > >  A major decision that we need to take is, what to do regarding
> > > vectorization instructions (SSE). Either we rely on GCC to
> > > auto-vectorize, or we control explicitly the vectorization using
GCC's
> > > special primitives. The latter solution is of course more difficult,
and
> > > would to some degree obfuscate our source code, so we wish to know
> > > whether or not it's really necessary.
> > >
> > >  GCC 4.3.0 does auto-vectorize our loops, but the resulting code has
> > > worse performance than a version with unrolled loops and no
> > > vectorization. By contrast, ICC auto-vectorizes the same loops in a
way
> > > that makes them significantly faster than the unrolled-loops
> > > non-vectorized version.
> > >
> > >  If you want to know, the loops in question typically look like:
> > >  for(int i = 0; i < COMPILE_TIME_CONSTANT; i++)
> > >  {
> > > // some abstract c++ code with deep recursive templates and
> > > // deep recursive inline functions, but resulting in only a
> > > // few assembly instructions
> > > a().b().c().d(i) = x().y().z(i);
> > >  }
> > >
> > >  As said above, it's crucial for us to be able to get an idea of what
to
> > >  expect, because design decisions depend on that. Should we expect
large
> > >  improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5 ?
> >
> > In general GCCs autovectorization capabilities are quite good, cases
> > where we miss opportunities do of course exist.  There were
improvements
> > regarding autovectorization capabilities in every GCC release and I
expect
> > that to continue for future releases (though I cannot promise anything
> > as GCC is a volunteer driven project - but certainly testcases where we
> > miss optimizations are welcome - often we don't know of all corner
cases).
> >
> > If you require to get the absolute most out of your CPU I recommend to
> > provide special routines tuned for the different CPU families and I
> > recommend the use of the standard intrinsics headers (*mmintr.h) for
> this.  Of course this comes at a high cost of maintenance (and initial
> > work), so autovectorization might prove good enough.  Often tuning the
> > source for a given compiler has a similar effect than producing
vectorized
> > code manually.  Looking at GCC tree dumps and knowing a bit about
> > GCC internals helps you here ;)
> >
> > >  A roadmap or a GCC developer sharing his thoughts would be very
helpful.
> >
> > Thanks,
> > Richard.
>
>
> [attachment "signature.asc" deleted by Ira Rosen/Haifa/IBM]



Re: Auto-vectorization: need to know what to expect

2008-03-18 Thread Ira Rosen


[EMAIL PROTECTED] wrote on 17/03/2008 21:08:43:

> It might be nice to think about an option that automatically aligns large
> arrays without having to do the declaration (or even have the vectorizer
> override the alignment for statics/auto).

The vectorizer is already doing this.
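
For completeness, the "declaration" route the quoted message refers to is
GCC's aligned attribute, e.g.:

   static double v[1024] __attribute__ ((aligned (16)));

whereas the vectorizer can already raise the alignment of suitable static
and automatic arrays on its own when that lets it use aligned vector
accesses.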

Ira

>
> --
> Michael Meissner, AMD
> 90 Central Street, MS 83-29, Boxborough, MA, 01719, USA
> [EMAIL PROTECTED]
>



Re: 4.3.0 manual vs changes.html

2008-03-18 Thread Ira Rosen


[EMAIL PROTECTED] wrote on 19/03/2008 06:01:19:

> The web page
>
> http://gcc.gnu.org/gcc-4.3/changes.html
>
> states that "The -ftree-vectorize option is now on by default under -
> O3.", but on
>
> http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Optimize-Options.html
>
> -ftree-vectorize is not listed as one of the options enabled by -O3.
>
> Is the first statement correct?

Yes, -ftree-vectorize is on by default under -O3.
The latter doc should be updated. I am preparing a patch.
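
In other words, with a 4.3 compiler the following two invocations enable the
vectorizer equivalently:

   gcc -O3 file.c                     # -ftree-vectorize implied by -O3
   gcc -O2 -ftree-vectorize file.c    # enabled explicitly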

Thanks for pointing this out,
Ira

>
> Brad



Re: auto vectorization - should this work ?

2007-05-06 Thread Ira Rosen

Yes, this should get vectorized. The problem is in the data dependence
analysis. We fail to prove that s_5->a[i_16] and s_5->a[i_16] access the
same memory location. I think, it happens since when we compare the bases
of the data references (s_5->a and s_5->a) in base_object_differ_p(), we do
that by comparing the trees (which are pointers) and not their content.

I'll look into this and, I hope, I will submit a fix for that soon (I guess
using operand_equal_p instead).
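
The direction, roughly, is to replace pointer identity on the two base trees
with a structural comparison (a sketch with my own naming, not the eventual
patch):

   /* Compare the base objects of two data references by structure rather
      than by tree-node identity.  */
   static bool
   same_base_object_p (tree base_a, tree base_b)
   {
     return operand_equal_p (base_a, base_b, 0);
   }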

Thanks,
Ira



Re: auto vectorization - should this work ?

2007-05-06 Thread Ira Rosen


Toon Moene <[EMAIL PROTECTED]> wrote on 06/05/2007 15:33:38:

> I'd be willing to test out your solution privately, if you prefer such a
> round first ...
>

Thanks. I'll send you a patch when it's ready.

Ira





Re: auto vectorization - should this work ?

2007-05-06 Thread Ira Rosen


"Richard Guenther" <[EMAIL PROTECTED]> wrote on 06/05/2007
16:17:05:

> On 5/6/07, Ira Rosen <[EMAIL PROTECTED]> wrote:
> >
> > Yes, this should get vectorized. The problem is in data dependencies
> > analysis. We fail to prove that s_5->a[i_16] and s_5->a[i_16] access
the
> > same memory location. I think, it happens since when we compare the
bases
> > of the data references (s_5->a and s_5->a) in base_object_differ_p(),
we do
> > that by comparing the trees (which are pointers) and not their content.
> >
> > I'll look into this and, I hope, I will submit a fix for that soon (I
guess
> > using operand_equal_p instead).
>
> Duh, that function looks interesting, indeed ;)
>
> It should probably use get_base_address () to get at the base object
> and then operand_equal_p to compare them.  Note that it strips outer
> variable offset as well, like for a[i].b[j] you will get 'a' as the
> base object.
> If data-ref cannot handle this well, just copy get_base_address () and
> stop at the first ARRAY_REF you come along.  But maybe
> base_object_differ_p is only called from contexts that are well-defined
> in this regard.

base_object_differ_p is called after the data-refs analysis. So we really
compare base objects here, and no further peeling is needed at this stage.
At least, that was our intention.

Thanks,
Ira

>
> Richard.



Re: Some thoughts about steering committee work

2007-06-17 Thread Ira Rosen

"Daniel Berlin" <[EMAIL PROTECTED]> wrote on 16/06/2007:

> On 6/16/07, Dorit Nuzman <[EMAIL PROTECTED]> wrote:
>
> > Do you have specific examples where SLP helps performance out of loops?
>
> hash calculations.
>
> For md5, you can get a 2x performance improvement by straight-line
> vectorizing it
> sha1 is about 2-2.5x
>
> (This assumes you do good pack/unpack placement using something like
> lazy code motion)
>
> See, for example, http://arctic.org/~dean/crypto/sha1.html
>
> (The page is out of date, the technique they explain where they are
> doing straight line computation of the hash in parallel, is exactly
> what SLP would provide out of loops)

I looked at the above page (and also at MD5 and SHA1 implementations). I
found only computations inside loops.
Could you please explain what exactly you refer to as SLP out of loops in
this benchmark?

Thanks,
Ira



Re: Optimizations documentation

2008-01-02 Thread Ira Rosen
Hi,

[EMAIL PROTECTED] wrote on 01/01/2008 22:00:11:

> some time ago I listened that GCC supports vectorization,
> but still can't find anything about it, how can I use it in my programs.

Here is the link to the vectorizer's documentation:
http://gcc.gnu.org/projects/tree-ssa/vectorization.html

Ira