Re: More of a Loop distribution.
On Fri, Aug 14, 2015 at 8:13 AM, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Friday, August 14, 2015 11:30 AM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta;
> Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: RE: More of a Loop distribution.
>
> On August 14, 2015 4:59:07 AM GMT+02:00, Ajit Kumar Agarwal wrote:
>>
>> -----Original Message-----
>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>> Sent: Thursday, August 13, 2015 3:23 PM
>> To: Ajit Kumar Agarwal
>> Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta;
>> Vidhumouli Hunsigida; Nagaraju Mekala
>> Subject: Re: More of a Loop distribution.
>>
>> On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal wrote:
>>> All:
>>>
>>> Loop distribution considers the DDG to decide on distributing loops.
>>> Loops with control statements like IF-THEN-ELSE can also be
>>> distributed. Instead of the data dependence graph, the control
>>> dependence graph should be considered in order to distribute loops
>>> in the presence of control statements.
>>>
>>> The presence of multiple exits in the loop can also be considered
>>> for the loop distribution transformation.
>>>
>>> The above transformations help in the libquantum benchmarks of
>>> SPEC 2006.
>>>
>>> The following articles look interesting to me:
>>>
>>> "Loop Distribution in presence of arbitrarily control flow", Ken
>>> Kennedy et al.
>>> "Loop Distribution in presence of Multiple Exits", Bor-Ming Hsieh
>>> et al.
>>>
>>> I don't think loop distribution in the presence of control flow is
>>> implemented in GCC/LLVM. I think it is feasible to consider the
>>> above for implementation in GCC.
>>
>> It's true that loop distribution does not try to distribute based on
>> any control structure heuristics; it only considers data locality.
>> It does however already compute the control dependence graph (and
>> uses it to add control edges to the DDG to properly add data
>> dependence edges to uses of control statements necessary in
>> partitions). So it should be a matter of specifying the proper set
>> of starting statements it tries separating.
>>
>> Thanks.
>>
>> Not sure which kind of distribution you are after, can you give an
>> example?
>>
>> I would like to have a distribution of a loop having control flow.
>> For example:
>>
>>   for (i = 2; i < N; i++)
>>     {
>>       if (condition)
>>         {
>>           S1: A[i] = ...;
>>           S2: D[i] = A[i-1] ...;
>>         }
>>     }
>>
>> The above loop can be distributed into two loops, one with S1 inside
>> the IF and another with S2 inside the IF. Two scenarios are possible:
>>
>> 1. The condition inside the IF checks A[i] and therefore depends on
>> S1. In this case the distribution is difficult. The article by Ken
>> Kennedy et al. stores the partial results of the comparison in a
>> temporary array, and that array is tested in the IF condition. This
>> makes the loop distributable. This is what I was looking for and
>> found in the above article.
>>
>> 2. The condition inside the IF does not depend on S1 or S2; in this
>> case the loop can be distributed.
>>
>> In these two scenarios GCC can't distribute the loop, as the control
>> dependence graph (control structure) is not used.
>
> The above loop can be distributed by gcc just fine.
>
> Existing loop distribution implementation in GCC distributes the
> above loop in both of the above scenarios?
GCC cannot do the trick of computing the condition into a temporary
array. What it does (and may then reject for dependence reasons) is to
have the original condition (including an eventual reference to A[i])
in both partitions containing S1 and S2. So if the exact thing looks
like

  for (i = 0; i < N; i++)
    {
      if (A[i])
        {
          A[i] = ...;
          D[i] = A[i-1] ...;
        }
    }

then I see no dependence that makes distributing into

  for (i = 0; i < N; i++)
    {
      if (A[i])
        D[i] = A[i-1] ...;
    }
  for (i = 0; i < N; i++)
    {
      if (A[i])
        A[i] = ...;
    }

invalid. But as the testcase isn't complete or compilable I can't
actually verify what GCC does here. I would think loop distribution's
cost metric would reject this distribution because A is accessed in
both loops, but as said, only real code examples will show.

All I wanted to say is that the underlying support is there in loop
distribution - where current loop distribution is lacking is
a) determining the desired partitioning (by determining the stmts to
distribute) and b) the cost model that fuses back partitions that are
related.

Oh, and of course CSE gets in the way of both cost modeling and actual
dependences. Loop distribution doesn't understand how to reload an
expression from memory, so if it sees

  tem = B[i] + C[i];
  A[i] = tem;   (1)
  D[i] = tem;   (2)

then the partitions containing (1) and (2) are tied together by the
dependence on tem, even though (2) could instead reload the value from
A[i].
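To make the technique under discussion concrete, here is a minimal
sketch of the temporary-condition-array transformation from the
Kennedy et al. paper; the concrete statements, the condition and the
array C are invented for illustration and are not from the thread:

  /* Original loop: the guard reads A[], so naively distributing S1
     and S2 would let the S1 loop change the values the guard of the
     S2 loop sees.  */
  for (i = 2; i < N; i++)
    if (A[i] > 0)         /* condition involving A[] */
      {
        A[i] = B[i];      /* S1 */
        D[i] = A[i-1];    /* S2 */
      }

  /* Kennedy-style distribution: capture the guard, as evaluated in
     the original execution order, into a temporary array C[].  */
  for (i = 2; i < N; i++)
    {
      C[i] = A[i] > 0;    /* recorded before S1 overwrites A[i] */
      if (C[i])
        A[i] = B[i];      /* S1 */
    }
  for (i = 2; i < N; i++)
    if (C[i])
      D[i] = A[i-1];      /* S2: reads the values S1 produced, as in
                             the original interleaved execution */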
Re: About loop unrolling and optimize for size
On Thu, Aug 13, 2015 at 6:26 PM, sa...@hederstierna.com wrote:
> Hi
>
> I'm using an ARM thumb cross compiler for embedded systems and always
> optimize for small size with -Os.
>
> I've experimented with optimization flags and loop unrolling.
> Normally loop unrolling is bad for size: code is duplicated and size
> increases.
>
> But I discovered that in some special cases where the number of
> iterations is very small, e.g. a loop of 2-3 times, unrolling can
> make the code size smaller - e.g. by freeing up registers used for
> loop indices.
>
> For example, when I use the flag "-fpeel-loops" together with -Os I
> get smaller code size for the ARM thumb target in 99% of the cases.
>
> So my question is how unrolling works with -Os: is it always totally
> disabled, or are there cases where it is tried, e.g. with a small
> number of iterations, so the loop can be eliminated?
>
> Could "-fpeel-loops" perhaps be enabled by default for -Os? Now it is
> only enabled for -O2 and above, I think.

Complete peeling is already enabled with -Os, it is just restricted to
those cases where GCC's cost modeling of the unrolling operation
determines that the code size shrinks. If you enable -fpeel-loops then
the cost model allows the code size to grow - something not (always)
intended with -Os.

The solution is of course to improve the cost modeling and GCC's idea
of followup optimization opportunities. I do have some incomplete
patches to improve that and hope to get back to it for GCC 6.

If you have (small) testcases that show code size improvements with
-Os -fpeel-loops over -Os and you are confident they are caused by
unrolling, please open a bugzilla containing them.

Thanks,
Richard.

> Thanks and Best Regards
> Fredrik
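For illustration, the kind of small testcase being asked for might
look like the following (hypothetical, not from the thread): a loop
with a tiny compile-time-constant trip count where complete peeling
should shrink -Os code by deleting the induction variable and branch:

  unsigned char buf[3];

  void
  clear3 (void)
  {
    /* Trip count is known to be 3 at compile time; fully peeled this
       becomes three stores with no counter, compare or branch.  */
    for (int i = 0; i < 3; i++)
      buf[i] = 0;
  }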
Re: Using libbacktrace in libgfortran: some questions
> Patch tested and committed with this ChangeLog entry.
>
> 2015-08-13  Ian Lance Taylor
>
> 	* dwarf.c (read_function_entry): Add vec_inlined parameter.
> 	Change all callers.

Thanks, this is great! I am going to submit the libgfortran patch to
use libbacktrace today.

Cheers,
FX
Re: [powerpc64le] seq_cst memory order possibly not honored
On 14 August 2015 at 01:37, Andrey Semashev wrote:
> 1. Is my test valid or is there a flaw that I'm missing?

The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ shows
that there are consistent executions where (x==0 && y==0) is true. I
used this code:

  int main() {
    atomic_int a = 0;
    atomic_int b = 0;
    int x, y;
    {{{ {
          a.store(1, mo_seq_cst);
          a.load(mo_relaxed);
          x = b.load(mo_relaxed);
        }
    ||| {
          b.store(1, mo_seq_cst);
          b.load(mo_relaxed);
          y = a.load(mo_relaxed);
        }
    }}}
    return 0;
  }
Re: Finding insns to reorder using dataflow
Hi Jeff,

On 13/08/15 17:20, Jeff Law wrote:
> On 08/13/2015 05:06 AM, Kyrill Tkachov wrote:
>> Hi all,
>>
>> I'm implementing a target-specific reorg pass, and one thing that I
>> want to do is, for a given insn in the stream, to find an
>> instruction in the stream that I can swap it with, without violating
>> any dataflow dependencies. The candidate instruction could be
>> earlier or later in the stream.
>>
>> I'm stuck on finding an approach to do this. It seems that using
>> some of the dataflow infrastructure is the right way to go, but I
>> can't figure out the details. can_move_insns_across looks relevant,
>> but it seems too heavyweight, with quite a lot of arguments. I
>> suppose somehow constructing regions of interchangeable instructions
>> would be the way to go, but I'm not sure how clean/cheap that would
>> be outside the scheduler.
>>
>> Any ideas would be appreciated.
>
> I think you want all the dependency analysis done by the scheduler.
>
> Which leads to the question, can you model what you're trying to do
> in the various scheduler hooks -- in particular walking through the
> ready list seems appropriate.
>
> Jeff

The problem I'm trying to solve can be expressed in this way:
"An insn that satisfies predicate pred_p (insn) cannot appear exactly
N insns apart from another insn 'insn2' that satisfies pred_p (insn2).
N is a constant."

So the problem here is that this restriction is not expressed in terms
of cycles or DFA states, but rather as distance in the instruction
stream.

I don't think I can do this reliably during sched2 because there is
still splitting that can be done that will create more insns that will
invalidate any bookkeeping I do there. However, during
TARGET_MACHINE_DEPENDENT_REORG I can first split all insns and then
call schedule_insns () to do another round of scheduling. However, I'm
a bit confused by all the different scheduler hooks and when each one
is called in relation to the others.

I'd need to keep some kind of bitfield recording, for the previous N
instructions in the stream, whether they satisfy pred_p. Where would I
record that? Can I just do everything in TARGET_SCHED_REORDER? That
is, given a ready list, check that no pred_p insns in it appear N
insns apart from another such insn (using my bitfield as a lookup
helper), reorder insns as appropriate, and then record the order of
the pred_p insns in the bitfield. Would the scheduler respect the
order of the insns that was set by TARGET_SCHED_REORDER and not do any
further reordering?

Thanks,
Kyrill
RE: About loop unrolling and optimize for size
I think I found the explanation: -fpeel-loops triggers some extra
flags, from "toplev.c":

  /* web and rename-registers help when run after loop unrolling.  */
  if (flag_web == AUTODETECT_VALUE)
    flag_web = flag_unroll_loops || flag_peel_loops;
  if (flag_rename_registers == AUTODETECT_VALUE)
    flag_rename_registers = flag_unroll_loops || flag_peel_loops;

Actually it is -frename-registers that causes the code size to
decrease. This flag is set when -fpeel-loops is enabled. Maybe this
flag could be enabled in -Os? It shouldn't have any downside besides
possibly making debugging harder.

Thanks/Fredrik

From: Richard Biener [richard.guent...@gmail.com]
Sent: Friday, August 14, 2015 09:28
To: sa...@hederstierna.com
Cc: gcc@gcc.gnu.org
Subject: Re: About loop unrolling and optimize for size

On Thu, Aug 13, 2015 at 6:26 PM, sa...@hederstierna.com wrote:
> [...]
> Could "-fpeel-loops" perhaps be enabled by default for -Os? Now it is
> only enabled for -O2 and above, I think.

Complete peeling is already enabled with -Os, it is just restricted to
those cases where GCC's cost modeling of the unrolling operation
determines that the code size shrinks. [...]

If you have (small) testcases that show code size improvements with
-Os -fpeel-loops over -Os and you are confident they are caused by
unrolling, please open a bugzilla containing them.

Thanks,
Richard.
Re: [powerpc64le] seq_cst memory order possibly not honored
On 14.08.2015 11:51, Jonathan Wakely wrote:
> On 14 August 2015 at 01:37, Andrey Semashev wrote:
>> 1. Is my test valid or is there a flaw that I'm missing?
>
> The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ shows
> that there are consistent executions where (x==0 && y==0) is true. I
> used this code:
>
> [...]

I'm not familiar with that tool and not sure how to interpret its
output, but I think it misses one point that is present in the
original test and that I failed to mention. The Boost.Atomic test uses
Boost.Thread barriers to ensure that both threads have executed the
store-load-load region before checking x and y.

Otherwise I cannot see how (x==0 && y==0) could happen. The last load
in each thread is sequenced after the first seq_cst store, and both
stores are ordered with respect to each other, so one of the threads
should produce 1.
Re: [powerpc64le] seq_cst memory order possibly not honored
On 14 August 2015 at 10:54, Andrey Semashev wrote:
> On 14.08.2015 11:51, Jonathan Wakely wrote:
>> On 14 August 2015 at 01:37, Andrey Semashev wrote:
>>> 1. Is my test valid or is there a flaw that I'm missing?
>>
>> The cppmem tool at http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
>> shows that there are consistent executions where (x==0 && y==0) is
>> true. I used this code:
>>
>> [...]
>
> I'm not familiar with that tool and not sure how to interpret its
> output,

Then you should study it :-)

> but I think it misses one point that is present in the original test
> and that I failed to mention. The Boost.Atomic test uses Boost.Thread
> barriers to ensure that both threads have executed the
> store-load-load region before checking x and y.

You can add

  int xx = x;
  int yy = y;

just before the return from main(), which executes after joining both
threads doing the store-load-load. You will still see that
xx==0 && yy==0 is possible.

> Otherwise I cannot see how (x==0 && y==0) could happen. The last load
> in each thread is sequenced after the first seq_cst store, and both
> stores are ordered with respect to each other, so one of the threads
> should produce 1.

The tool evaluates the possible executions according to a formal model
of the C++ memory model, so it is invaluable for answering questions
like this.

It shows that there is no synchronizes-with (shown as "sw")
relationship between the seq_cst store and the relaxed load for each
atomic object. There is a total order of sequentially consistent
operations, but the loads are not sequentially consistent and do not
synchronize with the stores.
Re: [powerpc64le] seq_cst memory order possibly not honored
On 14.08.2015 13:19, Jonathan Wakely wrote:
> The tool evaluates the possible executions according to a formal
> model of the C++ memory model, so it is invaluable for answering
> questions like this.
>
> It shows that there is no synchronizes-with (shown as "sw")
> relationship between the seq_cst store and the relaxed load for each
> atomic object. There is a total order of sequentially consistent
> operations, but the loads are not sequentially consistent and do not
> synchronize with the stores.

Thank you Jonathan, you are correct. I've changed the test to use
seq_cst on the loads as well and also removed the first load, as it
doesn't really matter for the test. I'll see if it helps the tester.

I'm still not entirely sure the missing 'sync' instruction is, let's
say, desirable from a practical point of view. I understand that the
seq_cst load will generate an extra 'sync' which will ensure the
stored 1 is visible to the other thread. However, if there is no
second instruction, i.e. thread 1 performs a store(seq_cst) and
thread 2 performs a load(seq_cst) of the same atomic variable, the
second thread may not observe the stored value until thread 1 performs
another instruction involving 'sync' (or the CPU flushes the cache
line for some reason). This increases the latency of inter-thread
communication.
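The exact Boost.Atomic test isn't shown in the thread, but the revised
version presumably amounts to the classic store-buffering test with
all operations seq_cst - a sketch in C11 atomics, with illustrative
names:

  #include <stdatomic.h>

  atomic_int a, b;
  int x, y;

  void thread1 (void)
  {
    atomic_store_explicit (&a, 1, memory_order_seq_cst);
    x = atomic_load_explicit (&b, memory_order_seq_cst);
  }

  void thread2 (void)
  {
    atomic_store_explicit (&b, 1, memory_order_seq_cst);
    y = atomic_load_explicit (&a, memory_order_seq_cst);
  }

  /* Run thread1 and thread2 concurrently, join both, then check:
     with every operation seq_cst there is a single total order over
     all four operations, so at least one of x, y must be 1.  */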
Re: [powerpc64le] seq_cst memory order possibly not honored
On Fri, Aug 14, 2015 at 8:20 AM, Andrey Semashev wrote:
> [...]
> I'm still not entirely sure the missing 'sync' instruction is, let's
> say, desirable from a practical point of view. I understand that the
> seq_cst load will generate an extra 'sync' which will ensure the
> stored 1 is visible to the other thread. However, if there is no
> second instruction, i.e. thread 1 performs a store(seq_cst) and
> thread 2 performs a load(seq_cst) of the same atomic variable, the
> second thread may not observe the stored value until thread 1
> performs another instruction involving 'sync' (or the CPU flushes
> the cache line for some reason). This increases the latency of
> inter-thread communication.

I believe that you are misunderstanding the "sync" instruction.
"sync" does not mean "flush".

Thanks,
David
Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}
Abe wrote:
> Dear all,
>
> Overall, I think the WIP new if-converter is holding up relatively
> well, but there is obviously opportunity to do better, at least if
> the numbers mean what they look like they mean, i.e. the old
> converter's code was fully OK and so is the new one's. By "fully OK"
> I mean e.g. no crashing bugs were introduced by the conversion.
>
> In the following, all the integers over 1000 are loops-vectorized
> counts.

Interesting, thanks. For what kind of architecture are these -
specifically: with/out masked/gathering loads/stores?

Cheers,
Alan
Re: Results from SPEC2006 FP analysis done at Richard`s request {late July / early August}
[Alan wrote:]
> Interesting, thanks. For what kind of architecture are these -

You are welcome. You raised two or three good points, I think.

First: the numbers are all from builds on and for the AMD64 ISA, AKA
"x86_64". My apologies for forgetting to mention that vital fact.

Second: I did not tell the SPEC build system to use any particular
"-march" or "-mtune" flags, and AFAIK it [SPEC] did not add any that I
didn't specify. In other words, those compiler-tuning values were
almost certainly at their GCC defaults.

[A question about the preceding: is "-march=native" the default
nowadays, at least on GNU/Linux? AFAIK the default GCC behavior is
(still?) to generate code for the most generic form of the target ISA
unless an explicit flag overrides this.]

Third: the server in question has Intel silicon for its CPUs. If an
implicit "-march=native" or similar is suspected of having been a
factor, then please let me know and I'll report back on the specifics.
[I am at home right now, so I have no easy way of getting that data at
the moment.]

> specifically: with/out masked/gathering loads/stores?

TTBOMK, generic AMD64/x86_64 does _not_ have the gathering stuff and
the very latest from Intel _does_. Sorry, but I don't know about the
masked form[s]. If that's important to know, then please tell me and I
will investigate.

Regards,
Abe
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta;
Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal wrote:
> All:
>
> Consider the definition of the following macro that determines the
> statement cost that is added to the vectorization cost:
>
>   #define TARGET_VECTORIZE_ADD_STMT_COST
>
> In the implementation of the above macro, the following is done for
> many architectures that support vectorization, like i386 and ARM:
>
>   if (where == vect_body && stmt_info
>       && stmt_in_inner_loop_p (stmt_info))
>     count *= 50;  /* FIXME.  */
>
> I have the following questions.
>
> 1. Why is the multiplication factor of 50 chosen?

>> It's a wild guess. See
>> tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop relative to the loop
> being vectorized is given more weight. If more weight is added to the
> inner loop for the loop being vectorized, the chances of vectorizing
> the inner loop decrease. Why is the inner loop cost increased
> relative to the loop being vectorized?

>> In fact adding more weight to the inner loop increases the chance of
>> vectorizing it (if vectorizing the inner loop is profitable). Both
>> scalar and vector cost get biased by a factor of 50 (we assume 50
>> iterations of the inner loop for one iteration of the outer loop),
>> so non-profitable vectorization in the outer loop can be offset by
>> profitable inner-loop vectorization.
>>
>> Yes, '50' can be improved if we actually know the iteration count of
>> the inner loop or if we have profile feedback.

Instead of the vector and scalar cost being biased by a factor of 50,
can the benefit of vectorization be calculated as follows?

  Benefit = scalar_cost - vector_cost / VF;
  Cost = 0;
  for (I = 1; I < N; I++)
    Cost = Cost + (Final_value - Initial_value) / Steps;
  Benefit = Benefit * Cost;

where

  N             = number of levels of the loop nest;
  Final_value   = final iteration count of the loop;
  Initial_value = initial iteration count of the loop;
  Steps         = step of the iteration for the loop;
  VF            = vectorization factor.

Thus an increase in the number of loop levels increases the benefit of
vectorization. Also, if the scalar cost is more than the vector cost,
then scalar_cost - vector_cost / VF increases for the same
vectorization factor, increasing the benefit of vectorization.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
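One literal reading of the proposed heuristic, transcribed as code for
clarity (all names invented; this restates the proposal above and is
not existing GCC code):

  static double
  proposed_vect_benefit (double scalar_cost, double vector_cost,
                         int vf, int nest_levels,
                         const double *final_value,
                         const double *initial_value,
                         const double *steps)
  {
    double benefit = scalar_cost - vector_cost / vf;
    double cost = 0.0;

    /* Accumulate an estimated trip count over the loop-nest levels.  */
    for (int i = 0; i < nest_levels; i++)
      cost += (final_value[i] - initial_value[i]) / steps[i];

    return benefit * cost;
  }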
Re: Finding insns to reorder using dataflow
On 08/14/2015 03:05 AM, Kyrill Tkachov wrote:
> The problem I'm trying to solve can be expressed in this way:
> "An insn that satisfies predicate pred_p (insn) cannot appear exactly
> N insns apart from another insn 'insn2' that satisfies pred_p
> (insn2). N is a constant."
>
> So the problem here is that this restriction is not expressed in
> terms of cycles or DFA states, but rather as distance in the
> instruction stream.

I wasn't really suggesting to model it in DFA states, but instead to
use the dependency analysis + hooks. The dependency analysis in
particular tells you when it's safe to interchange two insns.

Given the additional information, I think you'd want to note when an
insn fires and satisfies pred_p, and associate a counter with each
firing. The active counters are bumped (decremented?) at each firing
(so you can track how many insns appear after the one that satisfied
pred_p). Note that for insns which generate multiple assembly
instructions, you need to decrement the counter by the number of
assembly instructions they emit.

Then when sorting the ready list, if you have an insn that satisfies
pred_p and an active counter has just reached zero, make sure some
other insn fires (what if there aren't any other ready insns? Is this
a correctness or performance issue?)

> I don't think I can do this reliably during sched2 because there is
> still splitting that can be done that will create more insns that
> will invalidate any bookkeeping that I do there.

Right. You need everything split and you need accurate insn length
information for every insn in the backend that isn't split. If this is
a correctness issue, then you also have to deal with final deleting
insns behind your back as well.

Many years ago I did something which required 100% accurate length
information from the backend. It was painful, very very painful.
Ultimately it didn't work out and the code was scrapped.

> However, during TARGET_MACHINE_DEPENDENT_REORG I can first split all
> insns and then call schedule_insns () to do another round of
> scheduling. However, I'm a bit confused by all the different
> scheduler hooks and when each one is called in relation to the
> others.

You'll have to work through them -- I haven't kept close tabs on the
various hooks we have, I just know we have them.

> I'd need to keep some kind of bitfield recording, for the previous N
> instructions in the stream, whether they satisfy pred_p. Where would
> I record that? Can I just do everything in TARGET_SCHED_REORDER?
> [...]

The problem I see is that once one of these insns fires, other new
insns will be added to the ready list. So you have to keep some kind
of state about how many instructions back one of these insns fired and
consult that data when making a decision about the next instruction to
fire.

All this will fall apart if this is a correctness issue since you'd
have to issue a nop or some such. Though I guess you might be able to
arrange to get a nop into the scheduled stream. If this is a
correctness issue, tackling it in the assembler may make more sense.

Jeff
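As a concrete (and entirely hypothetical) illustration of the
bookkeeping described above: pred_p and N are the placeholders from
this thread, and for brevity only the most recent pred_p firing is
tracked, rather than the full window of counters Jeff describes. Insns
that expand to multiple assembly instructions are also ignored here.

  static int issue_position;          /* insns issued so far */
  static int last_pred_position = -1; /* position of last pred_p insn */

  /* TARGET_SCHED_VARIABLE_ISSUE: record each insn as it issues.  */
  static int
  myport_sched_variable_issue (FILE *file ATTRIBUTE_UNUSED,
                               int verbose ATTRIBUTE_UNUSED,
                               rtx_insn *insn, int more)
  {
    if (NONDEBUG_INSN_P (insn))
      {
        if (pred_p (insn))
          last_pred_position = issue_position;
        issue_position++;
      }
    return more - 1;
  }

  /* TARGET_SCHED_REORDER: the scheduler issues from the *end* of the
     ready list, so if issuing a pred_p insn now would place it
     exactly N insns after the previous one, swap a non-pred_p insn
     into the last slot.  */
  static int
  myport_sched_reorder (FILE *file ATTRIBUTE_UNUSED,
                        int verbose ATTRIBUTE_UNUSED,
                        rtx_insn **ready, int *n_readyp,
                        int clock ATTRIBUTE_UNUSED)
  {
    int n = *n_readyp;

    if (last_pred_position >= 0
        && issue_position - last_pred_position == N
        && pred_p (ready[n - 1]))
      for (int i = n - 2; i >= 0; i--)
        if (!pred_p (ready[i]))
          {
            rtx_insn *tmp = ready[i];
            ready[i] = ready[n - 1];
            ready[n - 1] = tmp;
            break;
          }
    /* If every ready insn satisfies pred_p we cannot help here -
       Jeff's question about needing a nop applies.  */
    return 1;
  }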
Re: Finding insns to reorder using dataflow
On 14/08/15 16:31, Jeff Law wrote:
> On 08/14/2015 03:05 AM, Kyrill Tkachov wrote:
>> [...]
> [...]
>
> All this will fall apart if this is a correctness issue since you'd
> have to issue a nop or some such. Though I guess you might be able to
> arrange to get a nop into the scheduled stream. If this is a
> correctness issue, tackling it in the assembler may make more sense.

Thanks. This is not a correctness issue. I got some code to add nops
and it was easy enough, but I'm using it only to investigate the
effectiveness of a proper scheduling approach (the more effective the
scheduling approach, the fewer nops we emit).

I'll try out the ideas you suggested.

Thanks,
Kyrill
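A minimal sketch of the kind of nop-padding fallback described above
(names hypothetical; this assumes the backend provides a "nop" pattern
so gen_nop () is available, that each insn emits exactly one assembly
instruction, and that N <= 63), run late in machine-dependent reorg
after scheduling:

  /* Walk the insn stream; if an insn satisfying pred_p would land
     exactly N insns after a previous pred_p insn, pad with a nop.  */
  static void
  pad_pred_p_insns (void)
  {
    /* Bit k set: the insn k+1 positions back satisfied pred_p.  */
    unsigned HOST_WIDE_INT window = 0;
    rtx_insn *insn;

    for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
      {
        if (!NONDEBUG_INSN_P (insn))
          continue;
        if (pred_p (insn)
            && (window & ((unsigned HOST_WIDE_INT) 1 << (N - 1))))
          {
            emit_insn_before (gen_nop (), insn); /* distance now N+1 */
            window <<= 1;                        /* count the nop */
          }
        window = (window << 1) | (unsigned HOST_WIDE_INT) pred_p (insn);
      }
  }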
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal wrote:
> [...]
>
> Instead of the vector and scalar cost being biased by a factor of 50,
> can the benefit of vectorization be calculated as follows?
>
>   Benefit = scalar_cost - vector_cost / VF;
>   Cost = 0;
>   for (I = 1; I < N; I++)
>     Cost = Cost + (Final_value - Initial_value) / Steps;
>   Benefit = Benefit * Cost;
>
> [...]
>
> Thus an increase in the number of loop levels increases the benefit
> of vectorization. Also, if the scalar cost is more than the vector
> cost, then scalar_cost - vector_cost / VF increases for the same
> vectorization factor, increasing the benefit of vectorization.

Sure. But the number of iterations may only be available symbolically,
thus the cost would only be useful for the dynamic check at runtime.
A better static estimate would also be useful.

Richard.

> Thanks & Regards
> Ajit