Re: More of a Loop distribution.

2015-08-14 Thread Richard Biener
On Fri, Aug 14, 2015 at 8:13 AM, Ajit Kumar Agarwal
 wrote:
>
>
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Friday, August 14, 2015 11:30 AM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
> Hunsigida; Nagaraju Mekala
> Subject: RE: More of a Loop distribution.
>
> On August 14, 2015 4:59:07 AM GMT+02:00, Ajit Kumar Agarwal 
>  wrote:
>>
>>
>>-Original Message-
>>From: Richard Biener [mailto:richard.guent...@gmail.com]
>>Sent: Thursday, August 13, 2015 3:23 PM
>>To: Ajit Kumar Agarwal
>>Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta;
>>Vidhumouli Hunsigida; Nagaraju Mekala
>>Subject: Re: More of a Loop distribution.
>>
>>On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal
>> wrote:
>>> All:
>>>
>>> Loop distribution considers the DDG to decide on distributing loops.
>>> Loops with control statements like IF-THEN-ELSE can also be
>>> distributed. Instead of the data dependency graph alone, the control
>>> dependence graph should be considered in order to distribute loops in
>>> the presence of control statements.
>>>
>>> Also, loops with multiple exits can be considered for the loop
>>> distribution transformation.
>>>
>>> The above transformation helps in the libquantum benchmark from
>>> SPEC 2006.
>>>
>>> The following articles look interesting to me:
>>>
>>> "Loop Distribution in the Presence of Arbitrary Control Flow",
>>> Ken Kennedy et al.
>>> "Loop Distribution in the Presence of Multiple Exits",
>>> Bor-Ming Hsieh et al.
>>>
>>>
>>> I don't think loop distribution in the presence of control flow is
>>> implemented in GCC/LLVM.
>>>
>>> I think it is feasible to consider the above for implementation
>>> in GCC.
>>
>>It's true that loop distribution does not try to distribute based on
>>any control structure heuristics but only considers data locality.
>>It does however already compute the control dependence graph (and
>>uses it to add control edges to the DDG to properly add data dependence
>>edges to uses of control statements necessary in partitions).
>>
>>So it should be a matter of specifying the proper set of starting
>>statements it tries separating.
>>
>>Thanks.
>>
>>Not sure which kind of distribution you are after, can you give an
>>example?
>>
>>I would like to distribute a loop containing control flow.
>>For example:
>>
>>for (i = 2; i < N; i++)
>>  {
>>    if (condition)
>>      {
>>        S1: A[i] = ...
>>        S2: D[i] = A[i-1] ...
>>      }
>>  }
>>
>>The above loop can be distributed into two loops: one loop with S1
>>inside the IF and another loop with S2 inside the IF.
>>Two scenarios are possible.
>>
>>1. The condition inside the IF checks A[i] and is therefore dependent
>>on S1. In this case the distribution is difficult. The above article
>>from Ken Kennedy et al. stores the partial results of the comparison
>>in a temporary array, and that array is tested inside the IF
>>condition. This makes the loop distributable. This is what I was
>>looking for and found in the above article.
>>
>>2. The condition inside the IF in the above loop is not dependent on
>>S1 or S2; in this case the loop can be distributed.
>>
>>In the above two scenarios GCC can't distribute the loop, as the
>>control dependence graph (control structure) is not used.
>
>>>The above loop can be distributed by gcc just fine.
>
> Does the existing loop distribution implementation in GCC distribute
> the above loop in both of the above scenarios?

GCC cannot do the trick of computing the condition into a temporary
array.  What it does (and may then reject for dependence reasons) is
to have the original condition (including an eventual reference to
A[i]) in both the partition containing S1 and the one containing S2.
So if the exact thing looks like

  for (i = 0; i < N; i++)
    {
      if (A[i])
        {
          A[i] = ...;
          D[i] = A[i-1] ...;
        }
    }

then I see no dependence that makes distributing into

  for (i = 0; i < N; i++)
    {
      if (A[i])
        D[i] = A[i-1] ...;
    }
  for (i = 0; i < N; i++)
    {
      if (A[i])
        A[i] = ...;
    }

invalid.  But as the testcase isn't complete or compilable I can't
actually verify what GCC does here.  I would think loop distribution's
cost metric would reject this distribution because A is accessed in
both loops, but as said, only real code examples will show.
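For concreteness, the temporary-array technique from the Kennedy et al. article referred to above can be sketched in plain C roughly as follows. This is a hypothetical illustration, not GCC output; the statement bodies for S1/S2, the bound N, and the sample data are all invented:

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Scenario 1, fused: the IF condition reads A[i], so it depends on
   S1's writes to A. */
static void fused(int *A, int *D)
{
    for (int i = 1; i < N; i++)
        if (A[i] > 0) {            /* condition reads A */
            A[i] += 1;             /* S1 */
            D[i] = A[i - 1] * 2;   /* S2: may read an A[i-1] updated by S1 */
        }
}

/* Distributed via a temporary array: the comparison results are
   materialized first (safe here because no iteration writes the A[i]
   that a later test reads), then S1 and S2 run as separate loops. */
static void distributed(int *A, int *D)
{
    int cond[N];

    for (int i = 1; i < N; i++)
        cond[i] = A[i] > 0;        /* partial comparison results */
    for (int i = 1; i < N; i++)
        if (cond[i])
            A[i] += 1;             /* S1 partition */
    for (int i = 1; i < N; i++)
        if (cond[i])
            D[i] = A[i - 1] * 2;   /* S2 partition sees S1's updates */
}

/* Both versions produce identical results on a sample input. */
static int versions_agree(void)
{
    int A1[N] = {1, 2, -1, 3, 0, 5, -2, 4}, D1[N] = {0};
    int A2[N], D2[N];

    memcpy(A2, A1, sizeof A1);
    memcpy(D2, D1, sizeof D1);
    fused(A1, D1);
    distributed(A2, D2);
    return memcmp(A1, A2, sizeof A1) == 0
        && memcmp(D1, D2, sizeof D1) == 0;
}
```

Note the loop ordering in the distributed version: S2 reads A[i-1], which S1 may have written, so the S1 partition must run before the S2 partition.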

All I wanted to say is that the underlying support is there in loop
distribution.  Where current loop distribution is lacking is a)
determining the desired partitioning (by determining the stmts to
distribute) and b) the cost model that fuses back partitions that are
related.

Oh, and of course CSE gets in the way of both cost modeling and actual
dependences.  Loop distribution doesn't know how to reload an
expression from memory if it sees

   tem = B[i] + C[i];
   A[i] = tem;   (1)
   D[i] = tem;   (2)

then the partition
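The reload in question might look roughly like this (a hypothetical sketch; the array names follow the snippet above, but the sizes and data are invented). Distributing (1) and (2) into separate partitions means the partition holding (2) can no longer reuse tem and must read the value back through A[i]:

```c
#include <assert.h>

#define LEN 4

/* Fused form: (1) and (2) share the CSEd temporary. */
static void fused(const int *B, const int *C, int *A, int *D)
{
    for (int i = 0; i < LEN; i++) {
        int tem = B[i] + C[i];
        A[i] = tem;    /* (1) */
        D[i] = tem;    /* (2) */
    }
}

/* Distributed form: the partition holding (2) no longer has tem in
   scope and must reload the value from memory through A[i]. */
static void distributed(const int *B, const int *C, int *A, int *D)
{
    for (int i = 0; i < LEN; i++)
        A[i] = B[i] + C[i];   /* (1) */
    for (int i = 0; i < LEN; i++)
        D[i] = A[i];          /* (2), reloaded from A */
}

static int versions_agree(void)
{
    const int B[LEN] = {1, 2, 3, 4}, C[LEN] = {10, 20, 30, 40};
    int A1[LEN], D1[LEN], A2[LEN], D2[LEN];

    fused(B, C, A1, D1);
    distributed(B, C, A2, D2);
    for (int i = 0; i < LEN; i++)
        if (A1[i] != A2[i] || D1[i] != D2[i])
            return 0;
    return 1;
}
```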

Re: About loop unrolling and optimize for size

2015-08-14 Thread Richard Biener
On Thu, Aug 13, 2015 at 6:26 PM, sa...@hederstierna.com
 wrote:
> Hi
> I'm using an ARM Thumb cross compiler for embedded systems and always
> optimize for small size with -Os.
>
> Though I've experimented with optimization flags, and loop unrolling.
>
> Normally loop unrolling is always bad for size: code is duplicated
> and size increases.
>
> Though I discovered that in some special cases where the number of
> iterations is very small, e.g. a loop of 2-3 times, unrolling could
> make the code size smaller - e.g. by freeing up registers used for
> loop indices etc.
>
> For example, when I use the flag "-fpeel-loops" together with -Os I
> get smaller code size for the ARM Thumb target in 99% of the cases.
>
> So my question is how unrolling works with -Os: is it always totally
> disabled, or are there some cases where it could be attempted, e.g.
> with a small number of iterations, so the loop can be eliminated?
>
> Could e.g. "-fpeel-loops" be enabled by default for -Os perhaps? Now
> it's only enabled for -O2 and above, I think.

Complete peeling is already enabled with -Os; it is just restricted to
those cases where GCC's cost modeling of the unrolling operation
determines that the code size shrinks.  If you enable -fpeel-loops
then the cost model allows the code size to grow - something not
(always) intended with -Os.
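A minimal example of the kind of loop where complete peeling can shrink code (a sketch; the function and trip count are invented for illustration):

```c
#include <assert.h>

/* A loop whose trip count is a small compile-time constant: complete
   peeling turns it into straight-line code, dropping the index
   register and the compare-and-branch entirely. */
static int sum3(const int *p)
{
    int s = 0;

    for (int i = 0; i < 3; i++)   /* known to run exactly 3 times */
        s += p[i];
    return s;
    /* after complete peeling this is effectively
       return p[0] + p[1] + p[2];  */
}
```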

The solution is of course to improve the cost modeling and GCC's idea
of followup optimization opportunities.  I do have some incomplete
patches to improve that and hope to get back to it for GCC 6.

If you have (small) testcases that show code size improvements with
-Os -fpeel-loops over plain -Os and you are confident they are caused
by unrolling, please open a bugzilla containing them.

Thanks,
Richard.

> Thanks and Best Regards
> Fredrik


Re: Using libbacktrace in libgfortran: some questions

2015-08-14 Thread FX
> Patch tested and committed with this ChangeLog entry.
> 
> 2015-08-13  Ian Lance Taylor  
> 
> * dwarf.c (read_function_entry): Add vec_inlined parameter.
> Change all callers.


Thanks, this is great! I am going to submit the libgfortran patch to use 
libbacktrace today.

Cheers,
FX

Re: [powerpc64le] seq_cst memory order possibly not honored

2015-08-14 Thread Jonathan Wakely
On 14 August 2015 at 01:37, Andrey Semashev wrote:
> 1. Is my test valid or is there a flaw that I'm missing?

The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ shows
that there are consistent executions where (x==0 && y==0) is true. I
used this code:

int main() {
  atomic_int a = 0;
  atomic_int b = 0;
  int x, y;
  {{{
{
  a.store(1,mo_seq_cst);
  a.load(mo_relaxed);
  x = b.load(mo_relaxed);
}
  |||
{
  b.store(1, mo_seq_cst);
  b.load(mo_relaxed);
  y = a.load(mo_relaxed);
}
  }}}
  return 0;
}


Re: Finding insns to reorder using dataflow

2015-08-14 Thread Kyrill Tkachov

Hi Jeff,

On 13/08/15 17:20, Jeff Law wrote:

On 08/13/2015 05:06 AM, Kyrill Tkachov wrote:

Hi all,

I'm implementing a target-specific reorg pass, and one thing that I
want to do is, for a given insn in the stream, to find an instruction
in the stream that I can swap it with, without violating any dataflow
dependencies. The candidate instruction could be earlier or later in
the stream.

I'm stuck on finding an approach to do this. It seems that using some
of the dataflow infrastructure is the right way to go, but I can't
figure out the details. can_move_insns_across looks relevant, but it
seems too heavyweight, with quite a lot of arguments.

I suppose somehow constructing regions of interchangeable instructions
would be the way to go, but I'm not sure how clean/cheap that would be
outside the scheduler.

Any ideas would be appreciated.

I think you want all the dependency analysis done by the scheduler.

Which leads to the question, can you model what you're trying to do in
the various scheduler hooks -- in particular walking through the ready
list seems appropriate.


The problem I'm trying to solve can be expressed in this way:
"An insn that satisfies predicate pred_p (insn) cannot appear exactly N insns
apart from another insn 'insn2' that satisfies pred_p (insn2). N is a constant".
So, the problem here is that this restriction is not expressed in
terms of cycles or DFA states, but rather as a distance in the
instruction stream.

I don't think I can do this reliably during sched2 because there is
still splitting that can be done, which will create more insns and
invalidate any book-keeping that I do there.

However, during TARGET_MACHINE_DEPENDENT_REORG I can first split all
insns and then call schedule_insns () to do another round of
scheduling. However, I'm a bit confused by all the different scheduler
hooks and when each one is called in relation to the others.

I'd need to keep some kind of bitfield recording, for the previous N
instructions in the stream, whether they satisfy pred_p. Where would I
record that?
Can I just do everything in TARGET_SCHED_REORDER? i.e.
given a ready list check that no pred_p insns in it appear N insns apart from
another such insn (using my bitfield as a lookup helper), reorder insns as
appropriate and then record the order of the pred_p insns in the bitfield.
Would the scheduler respect the order of the insns that was set by 
TARGET_SCHED_REORDER
and not do any further reordering?

Thanks,
Kyrill



jeff





RE: About loop unrolling and optimize for size

2015-08-14 Thread sa...@hederstierna.com
I think I found explanation, the -fpeel-loops trigger some extra flags:

from "toplev.c":

  /* web and rename-registers help when run after loop unrolling.  */
  if (flag_web == AUTODETECT_VALUE)
flag_web = flag_unroll_loops || flag_peel_loops;

  if (flag_rename_registers == AUTODETECT_VALUE)
flag_rename_registers = flag_unroll_loops || flag_peel_loops;

actually its -frename-registers that causes the code size to decrease.
This flags seems to be set when enable -fpeel-loops.

Maybe this flag could be enabled in -Os, shouldn't have any downside besides 
makes possibly debugging harder?

Thanks/Fredrik





Re: [powerpc64le] seq_cst memory order possibly not honored

2015-08-14 Thread Andrey Semashev

On 14.08.2015 11:51, Jonathan Wakely wrote:

On 14 August 2015 at 01:37, Andrey Semashev wrote:

1. Is my test valid or is there a flaw that I'm missing?


The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ shows
that there are consistent executions where (x==0 && y==0) is true. I
used this code:

int main() {
   atomic_int a = 0;
   atomic_int b = 0;
   int x, y;
   {{{
 {
   a.store(1,mo_seq_cst);
   a.load(mo_relaxed);
   x = b.load(mo_relaxed);
 }
   |||
 {
   b.store(1, mo_seq_cst);
   b.load(mo_relaxed);
   y = a.load(mo_relaxed);
 }
   }}}
   return 0;
}


I'm not familiar with that tool and not sure how to interpret its
output, but I think it misses one point that is present in the
original test and that I failed to mention: the Boost.Atomic test uses
Boost.Thread barriers to ensure that both threads have executed the
store-load-load region before x and y are checked.


Otherwise I cannot see how (x==0 && y==0) could happen. The last load
in each thread is sequenced after the first seq_cst store, and both
stores are ordered with respect to each other, so one of the threads
should produce 1.




Re: [powerpc64le] seq_cst memory order possibly not honored

2015-08-14 Thread Jonathan Wakely
On 14 August 2015 at 10:54, Andrey Semashev  wrote:
> On 14.08.2015 11:51, Jonathan Wakely wrote:
>>
>> On 14 August 2015 at 01:37, Andrey Semashev wrote:
>>>
>>> 1. Is my test valid or is there a flaw that I'm missing?
>>
>>
>> The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ shows
>> that there are consistent executions where (x==0 && y==0) is true. I
>> used this code:
>>
>> int main() {
>>atomic_int a = 0;
>>atomic_int b = 0;
>>int x, y;
>>{{{
>>  {
>>a.store(1,mo_seq_cst);
>>a.load(mo_relaxed);
>>x = b.load(mo_relaxed);
>>  }
>>|||
>>  {
>>b.store(1, mo_seq_cst);
>>b.load(mo_relaxed);
>>y = a.load(mo_relaxed);
>>  }
>>}}}
>>return 0;
>> }
>
>
> I'm not familiar with that tool and not sure how to interpret its output,

Then you should study it :-)

> but I think it misses one point that is present in the original test and I
> failed to mention. The Boost.Atomic test uses Boost.Thread barriers to
> ensure that both threads have executed the store-load-load region before
> checking for x and y.


You can add

  int xx = x;
  int yy = y;

just before the return from main(), which executes after joining both
threads doing the store-load-load. You will still see that xx==0 &&
yy==0 is possible.

> Otherwise I cannot see how (x==0 && y==0) could happen. The last load in
> each thread is sequenced after the first seq_cst store and both stores are
> ordered with respect to each other, so one of the threads should produce 1.

The tool evaluates the possible executions according to a formal model
of the C++ memory model, so is invaluable for answering questions like
this.

It shows that there is no synchronizes-with (shown as "sw")
relationship between the seq_cst store and the relaxed load for each
atomic object. There is a total order of sequentially consistent
operations, but the loads are not sequentially consistent and do not
synchronize with the stores.


Re: [powerpc64le] seq_cst memory order possibly not honored

2015-08-14 Thread Andrey Semashev

On 14.08.2015 13:19, Jonathan Wakely wrote:

On 14 August 2015 at 10:54, Andrey Semashev  wrote:


Otherwise I cannot see how (x==0 && y==0) could happen. The last load in
each thread is sequenced after the first seq_cst store and both stores are
ordered with respect to each other, so one of the threads should produce 1.


The tool evaluates the possible executions according to a formal model
of the C++ memory model, so is invaluable for answering questions like
this.

It shows that there is no synchronizes-with (shown as "sw")
relationship between the seq_cst store and the relaxed load for each
atomic object. There is a total order of sequentially consistent
operations, but the loads are not sequentially consistent and do not
synchronize with the stores.


Thank you Jonathan, you are correct. I've changed the test to use 
seq_cst on loads as well and also removed the first load as it doesn't 
really matter for the test. I'll see if it helps the tester.
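The revised shape of the test - seq_cst on the loads as well - can be written in portable C11 roughly like this (a sketch with invented names, not the actual Boost.Atomic test; with seq_cst on all accesses, x == 0 && y == 0 is no longer a permitted outcome):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int a, b;
static int x, y;

static void *thread1(void *arg)
{
    (void) arg;
    atomic_store_explicit(&a, 1, memory_order_seq_cst);
    x = atomic_load_explicit(&b, memory_order_seq_cst); /* was relaxed */
    return NULL;
}

static void *thread2(void *arg)
{
    (void) arg;
    atomic_store_explicit(&b, 1, memory_order_seq_cst);
    y = atomic_load_explicit(&a, memory_order_seq_cst); /* was relaxed */
    return NULL;
}

/* With seq_cst on all four accesses there is a single total order over
   them, so at least one load must observe the other thread's store.
   Returns 1 when that guarantee held for this run. */
static int run_once(void)
{
    pthread_t t1, t2;

    atomic_store(&a, 0);
    atomic_store(&b, 0);
    x = y = 0;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x == 1 || y == 1;
}
```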


I'm still not entirely sure the missing 'sync' instruction is, let's
say, desirable code from a practical point of view. I understand that
the seq_cst load will generate an extra 'sync' which will ensure the
stored 1 is visible to the other thread. However, if there is no
second instruction, i.e. thread 1 performs a store(seq_cst) and thread
2 performs a load(seq_cst) of the same atomic variable, the second
thread may not observe the stored value until thread 1 performs
another instruction involving 'sync' (or the CPU flushes the cache
line for some reason). This increases the latency of inter-thread
communication.




Re: [powerpc64le] seq_cst memory order possibly not honored

2015-08-14 Thread David Edelsohn
On Fri, Aug 14, 2015 at 8:20 AM, Andrey Semashev
 wrote:
> On 14.08.2015 13:19, Jonathan Wakely wrote:
>>
>> On 14 August 2015 at 10:54, Andrey Semashev 
>> wrote:
>>
>>> Otherwise I cannot see how (x==0 && y==0) could happen. The last load in
>>> each thread is sequenced after the first seq_cst store and both stores
>>> are
>>> ordered with respect to each other, so one of the threads should produce
>>> 1.
>>
>>
>> The tool evaluates the possible executions according to a formal model
>> of the C++ memory model, so is invaluable for answering questions like
>> this.
>>
>> It shows that there is no synchronizes-with (shown as "sw")
>> relationship between the seq_cst store and the relaxed load for each
>> atomic object. There is a total order of sequentially consistent
>> operations, but the loads are not sequentially consistent and do not
>> synchronize with the stores.
>
>
> Thank you Jonathan, you are correct. I've changed the test to use seq_cst on
> loads as well and also removed the first load as it doesn't really matter
> for the test. I'll see if it helps the tester.
>
> I'm still not entirely sure if the missing 'sync' instruction is a, let's
> say, desirable code, from practical point of view. I understand that the
> seq_cst load will generate an extra 'sync' which will ensure the stored 1 is
> visible to the other thread. However, if there is no second instruction,
> i.e. thread 1 performs a store(seq_cst) and thread 2 performs a
> load(seq_cst) of the same atomic variable, the second thread may not observe
> the stored value until thread 1 performs another instruction involving
> 'sync' (or the CPU flushes the cache line for some reason). This increases
> latencies of inter-thread communication.

I believe that you are misunderstanding the "sync" instruction.
"sync" does not mean "flush".

Thanks, David


Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}

2015-08-14 Thread Alan Lawrence

Abe wrote:

Dear all,

Overall, I think the WIP new if-converter is holding up relatively
well, but there is obviously opportunity to do better, at least if the
numbers mean what they look like they mean, i.e. the old converter's
code was fully OK and so is the new one's. By "fully OK" I mean e.g.
that no crashing bugs were introduced by the conversion.


In the following, all the integers over 1000 are loop-vectorized counts.



Interesting, thanks. For what kind of architecture are these -
specifically: with/without masked/gathering loads/stores?


Cheers, Alan



Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}

2015-08-14 Thread Abe Skolnik
[Alan wrote:]

> Interesting, thanks.  For what kind of architecture are these -

You are welcome.

You raised 2 or 3 good points, I think.

First: the numbers are all from builds on and for the AMD64 ISA, AKA "x86_64".  
My apologies for forgetting to mention that vital fact.

Second: I did not tell the SPEC build system to use any particular
"-march" or "-mtune" flags, and AFAIK it [SPEC] did not add any that I
didn't specify. In other words, those compiler-tuning values were
almost certainly at their GCC defaults.

[A question about the preceding: is "-march=native" the default nowadays, at 
least on GNU/Linux?  AFAIK the default GCC behavior is (still?) to generate 
code for the most generic form of the target ISA unless an explicit flag 
overrides this.]

Third: the server in question has Intel silicon for its CPUs. If an
implicit "-march=native" or similar is suspected of having been a
factor, then please let me know and I'll report back on the specifics.
[I am at home right now, so I have no easy way of getting that data.]


> specifically: with/out masked/gathering loads/stores ??


TTBOMK, generic AMD64/x86_64 does _not_ have the gathering stuff and the very 
latest from Intel _does_.

Sorry, but I don't know about the masked form[s]. If that's important
to know, then please tell me and I will investigate.

Regards,

Abe


RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-14 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> The following macro determines the statement cost that is added to
> the vectorization cost.
>
> #define TARGET_VECTORIZE_ADD_STMT_COST.
>
> In the implementation of the above macro, the following is done for
> many architectures that support vectorization, like i386 and ARM.
>
> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
> count *= 50;  /* FIXME.  */
>
> I have the  following questions.
>
> 1. Why is the multiplication factor of 50 chosen?

>>It's a wild guess.  See 
>>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop, relative to the loop
> being vectorized, is given more weight. If more weight is added to
> the inner loop for the loop being vectorized, the chances of
> vectorizing the inner loop decrease. Why is the inner loop cost
> increased relative to the loop being vectorized?

>>In fact adding more weight to the inner loop increases the chance of
>>vectorizing it (if vectorizing the inner loop is profitable).
>>Both scalar and vector cost get biased by a factor of 50 (we assume
>>50 iterations of the inner loop for one iteration of the outer
>>loop), so non-profitable vectorization in the outer loop can be
>>offset by profitable inner loop vectorization.

>>Yes, '50' can be improved if we actually know the iteration count of the 
>>inner loop or if we have profile-feedback.

Instead of biasing the vector and scalar cost by a factor of 50, can
the benefit of vectorization be calculated as follows?

benefit = scalar_cost - vector_cost / VF;
cost = 0;
for (i = 1; i < N; i++)
  cost = cost + (final_value - initial_value) / steps;
benefit = benefit * cost;

where
  N             = number of levels of the loop nest;
  final_value   = final iteration count of the loop;
  initial_value = initial iteration count of the loop;
  steps         = step of the iteration for the loop;
  VF            = vectorization factor.

Thus an increase in the number of loop levels increases the benefit of
vectorization. Also, if the scalar cost is greater than the vector
cost, then scalar_cost - vector_cost / VF increases for the same
vectorization factor, again increasing the benefit of vectorization.
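For intuition, the factor-of-50 biasing discussed above can be sketched as a toy model (invented costs, not GCC's actual cost code): biasing both sides by the same factor makes the inner-loop terms dominate, so an unprofitable outer body can be offset by a profitable inner body.

```c
#include <assert.h>

/* Toy model: the whole-nest cost is the outer-body cost plus 50 times
   the inner-body cost, on both the scalar and the vector side. */
static int nest_cost(int outer_stmt_cost, int inner_stmt_cost)
{
    return outer_stmt_cost + 50 * inner_stmt_cost;
}

/* Outer-loop vectorization is deemed profitable when the biased
   vector cost beats the biased scalar cost for the whole nest. */
static int profitable(int scalar_outer, int scalar_inner,
                      int vector_outer, int vector_inner)
{
    return nest_cost(vector_outer, vector_inner)
           < nest_cost(scalar_outer, scalar_inner);
}
```

For example, a vector outer body costing 6 against a scalar 4 is a loss in isolation, but an inner body improving from 2 to 1 dominates after the factor-of-50 bias.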

Thanks & Regards
Ajit



Re: Finding insns to reorder using dataflow

2015-08-14 Thread Jeff Law

On 08/14/2015 03:05 AM, Kyrill Tkachov wrote:


The problem I'm trying to solve can be expressed in this way: "An
insn that satisfies predicate pred_p (insn) cannot appear exactly N
insns apart from another insn 'insn2' that satisfies pred_p (insn2).
N is a constant". So, the problem here is that this restriction is
not something expressed in terms of cycles or DFA states, but rather
distance in the instruction stream.

I wasn't really suggesting to model it in DFA states, but instead to
use the dependency analysis + hooks.  The dependency analysis in
particular tells you when it's safe to interchange two insns.

Given the additional information, I think you'd want to note when an
insn fires and satisfies pred_p, and associate a counter with each
firing.  The active counters are bumped (decremented?) at each firing
(so you can track how many insns appear after the one that satisfied
pred_p).  Note that for insns which generate multiple assembly
instructions, you need to decrement the counter by the number of
assembly instructions they emit.

Then when sorting the ready list, if you have an insn that satisfies
pred_p and an active counter has just reached zero, make sure some other
insn fires (what if there aren't any other ready insns?  Is this a
correctness or performance issue?)
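The counter scheme above might be book-kept roughly like this (a hypothetical sketch; FORBIDDEN_DIST, the limits, and the data structures are invented):

```c
#define FORBIDDEN_DIST 3   /* the constant N from the restriction */
#define MAXREC 32

static long issued;             /* assembly instructions issued so far */
static long pred_pos[MAXREC];   /* issue positions of pred_p insns */
static int  npred;

/* Would issuing a pred_p insn right now put it exactly
   FORBIDDEN_DIST instructions after an earlier pred_p insn? */
static int pred_p_ok(void)
{
    for (int i = 0; i < npred; i++)
        if (issued - pred_pos[i] == FORBIDDEN_DIST)
            return 0;
    return 1;
}

/* Record an issued insn; 'emitted' is the number of assembly
   instructions it expands to (the insn-length point above). */
static void note_issue(int is_pred_p, int emitted)
{
    if (is_pred_p && npred < MAXREC)
        pred_pos[npred++] = issued;
    issued += emitted;
}
```

TARGET_SCHED_REORDER would consult pred_p_ok() when sorting the ready list and call note_issue() for every insn that issues.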






I don't think I can do this reliably during sched2 because there is
still splitting that can be done that will create more insns that
will invalidate any book keeping that I do there.

Right.  You need everything split and you need accurate insn length
information for every insn in the backend that isn't split.  If this is
a correctness issue, then you also have to deal with final deleting
insns behind your back as well.

Many years ago I did something which required 100% accurate length
information from the backend.  It was painful, very very painful.
Ultimately it didn't work out and the code was scrapped.





However, during TARGET_MACHINE_DEPENDENT_REORG I can first split all
 insns and then call schedule_insns () to do another round of
scheduling. However, I'm a bit confused by all the different
scheduler hooks and when each one is called in relation to the
other.
You'll have to work through them -- I haven't kept close tabs on the 
various hooks we have, I just know we have them.



I'd need to keep some kind of bitfield recording for the previous N
instructions in the stream whether they satisfy pred_p. Where would I
record that? Can I just do everything in TARGET_SCHED_REORDER? i.e.
given a ready list check that no pred_p insns in it appear N insns
apart from another such insn (using my bitfield as a lookup helper),
reorder insns as appropriate and then record the order of the pred_p
insns in the bitfield. Would the scheduler respect the order of the
insns that was set by TARGET_SCHED_REORDER and not do any further
reordering?
The problem I see is that once one of these insns fires, other new insns
will be added to the ready list.  So you have to keep some kind of state
about how many instructions back one of these insns fired and consult
that data when making a decision about the next instruction to fire.


All this will fall apart if this is a correctness issue since you'd have 
to issue a nop or somesuch.  Though I guess you might be able to arrange 
to get a nop into the scheduled stream.  If this is a correctness issue, 
tackling it in the assembler may make more sense.


Jeff


Re: Finding insns to reorder using dataflow

2015-08-14 Thread Kyrill Tkachov



Thanks. This is not a correctness issue.
I got some code to add nops and it was easy enough,
but I'm using it only to investigate the effectiveness
of a proper scheduling approach (the more effective the
scheduling approach, the fewer nops we emit).

I'll try out the ideas you suggested.

Thanks,
Kyrill



Jeff





RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-14 Thread Richard Biener
On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal 
 wrote:

Sure.  But the number of iterations may only be available
symbolically, so the cost would only be useful for a dynamic check at
runtime.  A better static estimate would also be useful.

Richard.
