Re: On-Demand range technology [1/5] - Executive Summary

2019-06-19 Thread Kugan Vivekanandarajah
Hi Andrew,

Thanks for working on this.

The patch to enable elimination of zext/sext with VRP
(https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) had to be
reverted due to the need for value ranges in PROMOTED_MODE precision
for at least one test case on alpha.

Playing with the ranger suggests that it is not possible to get value
ranges in PROMOTED_MODE precision on demand. Or is there any way we
can use the on-demand ranger here?

Thanks,
Kugan

On Thu, 23 May 2019 at 11:28, Andrew MacLeod  wrote:
>
> Now that stage 1 has reopened, I’d like to reopen a discussion about the
> technology and experiences we have from the Ranger project I brought up
> last year. https://gcc.gnu.org/ml/gcc/2018-05/msg00288.html .  (The
> original wiki pages are now out of date, and I will work on updating
> them soon.)
>
> The Ranger is designed to evaluate ranges on-demand rather than through
> a top-down approach. This means you can ask for a range from anywhere,
> and it walks back through the IL, satisfying any preconditions and doing the
> required calculations.  It utilizes a cache to avoid re-doing work. If
> ranges are processed in a forward dominator order, it’s not much
> different than what we do today. Due to its nature, the order you
> process things in has minimal impact on the overall time… You can do it
> in reverse dominator order and get similar times.
>
> It requires no outside preconditions (such as dominators) to work, and
> has a very simple API… Simply query the range of an ssa_name at any
> point in the IL and all the details are taken care of.
>
> We have spent much of the past 6 months refining the prototype (branch
> “ssa-range”) and adjusting it to share as much code with VRP as
> possible. They are currently using a common code base for extracting
> ranges from statements, as well as simplifying statements.
>
> The Ranger deals with just ranges. The other aspects of VRP are
> intended to be follow on work that integrates tightly with it, but are
> also independent and would be available for other passes to use.  These
> include:
> - Equivalency tracking
> - Relational processing
> - Bitmask tracking
>
> We have implemented a VRP pass that duplicates the functionality of EVRP
> (other than the bits mentioned above), as well as converted a few other
> passes to use the interface. I do not anticipate those missing bits
> having a significant impact on the results.
>
> The prototype branch is quite stable and can successfully build and test
> an entire Fedora distribution (9174 packages). There is an issue with
> switches I will discuss later whereby the constant range of a switch
> edge is not readily available and is exponentially expensive to
> calculate. We have a design to address that problem, and in the common
> case we are about 20% faster than EVRP is.
>
> When utilized in passes which only require ranges for a small number of
> ssa-names we see significant improvements.  The sprintf warning pass for
> instance allows us to remove the calculations of dominators and the
> resulting forced walk order. We see a 95% speedup (yes, 1/20th of the
> overall time!).  This is primarily due to no additional overhead and
> only calculating the few things that are actually needed.  The walloca
> and wrestrict passes are a similar model, but as they have not been
> converted to use EVRP ranges yet, we don’t see similar speedups there.
>
> That is the executive summary.  I will go into more details of each
> major thing mentioned in follow on notes so that comments and
> discussions can focus on one thing at a time.
>
> We think this approach is very solid and has many significant benefits
> to GCC. We’d like to address any concerns you may have, and work towards
> finding a way to integrate this model with the code base during this
> stage 1.
>
> Comments and feedback always welcome!
> Thanks
> Andrew


Re: On-Demand range technology [1/5] - Executive Summary

2019-06-21 Thread Kugan Vivekanandarajah
Hi Andrew,

Thanks for looking into it and my apologies for not being clear.

My proposal was to use value ranges when expanding gimple to RTL and
eliminate redundant zero/sign extensions. I.e., if we know the value
generated by some gimple operation is already in the (zero/sign)
extended form based on our VR analysis, we could skip the SUBREG and
ZERO/SIGN_EXTEND (or set SRP_SIGNED_AND_UNSIGNED and the like).

However, the problem is that RTL operations are done in PROMOTE_MODE
precision while the gimple value range is in the natural type. This
can be a problem when the type wraps (and shows up mainly for targets
where PROMOTE_MODE is DImode, like alpha).

For example, as Uros pointed out with the reverted patch, for alpha-linux we had

FAIL: libgomp.fortran/simd7.f90   -O2  execution test
FAIL: libgomp.fortran/simd7.f90   -Os  execution test

The reason is that values wrap, and in the VR calculation we only
record the type precision (which is what matters for gimple), but in
order to eliminate the zero/sign extension we need the full precision
of the PROMOTE_MODE.

Extract from the testcase failing:
_343 = ivtmp.179_52 + 2147483645; [0x8004, 0x80043]
_344 = _343 * 2; [0x8, 0x86]
_345 = (integer(kind=4)) _344; [0x8, 0x86]

With the above VR of [0x8, 0x86] (in promoted precision which is
[0x10008, 0x10086]), my patch was setting
SRP_SIGNED_AND_UNSIGNED, which was wrong and caused the error
(eliminating an extension which was not redundant). If we had the VR
in promoted precision, the patch would be correct and could be used to
eliminate redundant zero/sign extensions.
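
To make the wrap-around issue concrete, here is a rough sketch of my
own (not the failing simd7.f90 case, and with made-up ranges) of why
the natural-type range is not enough on a PROMOTE_MODE == DImode
target:

/* Sketch only: illustrates the concept, not the actual failing case.
   Assume the analysis knows i is in [0x80000004, 0x80000043].  */
unsigned int f (unsigned int i)
{
  /* In the 32-bit type the multiplication wraps, so the gimple VR is
     the small range [0x8, 0x86].  On a target whose PROMOTE_MODE is
     DImode the operation is actually performed in a 64-bit register,
     where the unwrapped result is [0x100000008, 0x100000086].  A
     zero-extension judged "redundant" from the 32-bit range is
     therefore still needed.  */
  unsigned int x = i * 2;
  return x;
}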

Please let me know if my explanation is not clear and I will show it
with more examples.

Thanks,
Kugan


On Fri, 21 Jun 2019 at 23:27, Andrew MacLeod  wrote:
>
> On 6/19/19 11:04 PM, Kugan Vivekanandarajah wrote:
>
> Hi Andrew,
>
> Thanks for working on this.
>
> The patch to enable elimination of zext/sext with VRP
> (https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00672.html) had to be
> reverted due to the need for value ranges in PROMOTED_MODE precision
> for at least one test case on alpha.
>
> Playing with the ranger suggests that it is not possible to get value
> ranges in PROMOTED_MODE precision on demand. Or is there any way we
> can use the on-demand ranger here?
>
> Thanks,
> Kugan
>
>
>
> I took a look at the thread, but I think I'm still missing some context.
>
> Let's go back to the beginning.  What is an example of the case we aren't
> getting that you want to get?
>
> I'm going to guess to start :-)
>
> short foo(unsigned char c)
>  {
>c = c & (unsigned char)0x0F;
>if( c > 7 )
>  return((short)(c - 5));
>else
>  return(( short )c);
>  }
>
>
>
> A run of this through the ranger shows me code that looks like (on x86 anyway):
>
>  === BB 2 
> c_4(D)  [0, 255] unsigned char
>  <bb 2> :
> c_5 = c_4(D) & 15;
> _9 = c_4(D) & 8;
> if (_9 != 0)
>   goto <bb 3>; [INV]
> else
>   goto <bb 4>; [INV]
>
> c_5 : [0, 15] unsigned char
> _9 : [0, 0][8, 8] unsigned char
> 2->3  (T)*c_5 : [0, 15] unsigned char
> 2->3  (T) _9 :  [8, 8] unsigned char
> 2->4  (F)*c_5 : [0, 15] unsigned char
> 2->4  (F) _9 :  [0, 0] unsigned char
>
> === BB 3 
> c_5 [0, 15] unsigned char
>  <bb 3> :
> _1 = (unsigned short) c_5;
> _2 = _1 + 65531;
> _7 = (short int) _2;
> // predicted unlikely by early return (on trees) predictor.
> goto ; [INV]
>
> _1 : [0, 15] unsigned short
> _2 : [0, 10][65531, 65535] unsigned short
> _7 : [-5, 10] short int
>
> === BB 4 
> c_5 [0, 15] unsigned char
>  <bb 4> :
> _6 = (short int) c_5;
> // predicted unlikely by early return (on trees) predictor.
>
>
> I think I see.  We aren't adjusting the range of c_5 going into blocks 3 and
> 4.  It's obvious from the original source, where the code says > 7, but once
> it's been "bitmasked" that info becomes obfuscated.
>
> If you were to see a range in bb3 of c_5 = [8,15], and a range in bb4 of c_5
> = [0,7], would that solve your problem?
>
> so in bb3 , we'd then see ranges that look like:
>
> _1 : [8, 15] unsigned short
> _2 : [3, 10] unsigned short
> _7 : [3, 10] short int
>
> and then later on we'd see there is no negative/wrap value, and you could
> just drop the extension then?
>
> SO.
>
> Yes, this is fixable, but is it easy? :-)
>
> We're in the process of replacing the range extraction code with the
> range-ops/gori-computes components from the ranger.  This is the part which
> figures ranges out from individual statements on exit to a block.
>
> We have implemented mostly the same func

Re: Duplicating loops and virtual phis

2017-05-21 Thread Kugan Vivekanandarajah
Hi Bin and Steve,

On 17 May 2017 at 19:41, Bin.Cheng  wrote:
> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
>  wrote:
>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey  
>> wrote:
>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
 On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey >>> om> wrote:
 >
 > (Short version of this email, is there a way to recalculate/rebuild
 > virtual
 > phi nodes after modifying the CFG.)
 >
 > I have a question about duplicating loops and virtual phi nodes.
 > I am trying to implement the following optimization as a pass:
 >
 > Transform:
 >
 >   for (i = 0; i < n; i++) {
 >A[i] = A[i] + B[i];
 >C[i] = C[i-1] + D[i];
 >   }
 >
 > Into:
 >
 >   if (noalias between A&B, A&C, A&D)
 >for (i = 0; i < 100; i++)
 >A[i] = A[i] + B[i];
 >for (i = 0; i < 100; i++)
 >C[i] = C[i-1] + D[i];
 >   else
 >for (i = 0; i < 100; i++) {
 >A[i] = A[i] + B[i];
 >C[i] = C[i-1] + D[i];
 >}
 >
 > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>>be
 > vectorized so it gives up and does not vectorize the loop.  If we
>>>split
 > up the loop into two loops then the vector add with A[i] could be
 > vectorized
 > even if the one with C[i] could not.
 Loop distribution does this transform but it doesn't know about
 versioning for unknown dependences.

>>>
>>>Yes, I looked at loop distribution.  But it only works with global
>>>arrays and not with pointer arguments where it doesn't know the size of
>>>the array being pointed at.  I would like to be able to have it work
>>>with pointer arguments.  If I call a function with 2 or
>>>more integer pointers, and I have a loop that accesses them with
>>>offsets between 0 and N where N is loop invariant then I should have
>>>enough information (at runtime) to determine if there are overlapping
>>>memory accesses through the pointers and determine whether or not I can
>>>distribute the loop.
>>
>> Not sure where you got that from. Loop distribution works with our data 
>> reference / dependence analysis.  The cost model might be more restricted 
>> but that can be fixed.
>>
>>>The loop splitting code seemed like a better template since it already
>>>knows how to split a loop based on a runtime determined condition. That
>>>part seems to be working for me, it is when I try to
>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>that I am running into the problem with virtual PHIs.
>>
>> There's mark_virtual*for_renaming (sp?).
>>
>> But as said you are performing loop distribution so please enhance the 
>> existing pass rather than writing a new one.
> I happen to be working on loop distribution now (if I guess correctly,
> to get hmmer fixed).  So far my idea is to fuse the finest distributed
> loop in two passes, in the first pass, we merge all SCCs due to "true"
> data dependence; in the second one we identify all SCCs and breaks
> them on dependent edges due to possible alias.  Breaking SCCs with
> minimal edge set can be modeled as Feedback arc set problem which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms.  OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

I am also looking into vectorizing the hmmer loop. Glad to know you are
also looking at this problem, and I am looking forward to seeing the patches.

I have some experimental patches where I add the data references that
need runtime checking to a list:

static int
pg_add_dependence_edges (struct graph *rdg, vec loops, int dir,
  vec drs1,
- vec drs2)
+ vec drs2,
+ vec &ddrs,
+ bool runtime_alias_check)

Then I am versioning the main loop based on the condition generated
from the runtime check. I have borrowed the logic from the vectorizer
(like pruning the list and generating the condition). I have neither
verified nor benchmarked it enough yet.
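
For illustration, the kind of versioned, distributed code I am aiming
for looks roughly like the sketch below (my own simplification of
Steve's example above; the real check is built from the pruned DDR
list and compares address ranges rather than raw pointers):

/* Sketch only: distributed loops guarded by a runtime alias check.  */
void f (int n, double *a, double *b, double *c, double *d)
{
  /* Conceptual non-overlap test for the accessed ranges.  */
  int no_alias = (a + n <= b || b + n <= a)
              && (a + n <= c || c + n <= a)
              && (a + n <= d || d + n <= a);

  if (no_alias)
    {
      for (int i = 0; i < n; i++)      /* vectorizable */
        a[i] = a[i] + b[i];
      for (int i = 0; i < n; i++)      /* carries a dependence, left scalar */
        c[i] = c[i-1] + d[i];
    }
  else
    for (int i = 0; i < n; i++)
      {
        a[i] = a[i] + b[i];
        c[i] = c[i-1] + d[i];
      }
}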

As I understand it, we should also have some form of cost model where
we are able to see the data access patterns and decide whether the
distributed loops can be vectorized.

The cost model in similar_memory_accesses also needs to be relaxed
based on the ability to vectorize the distributed loops.

Thanks,
Kugan


> Thanks,
> bin
>>
>> Richard.
>>
>>>Steve Ellcey
>>>sell...@cavium.com
>>


Loop reversal

2017-07-12 Thread Kugan Vivekanandarajah
I am looking into reversing loops for increased efficiency. There is
already PR22041 for this, and an old patch
https://gcc.gnu.org/ml/gcc-patches/2006-01/msg01851.html by Zdenek
which never made it to mainline.

For a constant loop count, the ivcanon pass adds a reverse IV, but it
is not selected by ivopts.

For example:

void copy (unsigned int N, double *a, double *c)
{
  for (int i = 0; i < 800; ++i)
  c[i] = a[i];
}
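
The reversed form I am after would look roughly like this (just a
sketch to show the idea; a single IV counts down so the exit test is a
comparison against zero and the same IV indexes both arrays):

/* Sketch of the reversed loop.  */
void copy_reversed (unsigned int N, double *a, double *c)
{
  for (int i = 799; i >= 0; --i)
    c[i] = a[i];
}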

The ivcanon pass added a canonical IV to loop 1 (799 iterations):
ivtmp_14 = ivtmp_15 - 1;

In ivopts, it selects candidate 10:

Candidate 10:
Var befor: ivtmp.11
Var after: ivtmp.11
Incr POS: before exit test
IV struct:
Type: sizetype
Base: 0
Step: 8
Biv: N

If we look at the group :

Group 0:
Type: ADDRESS
Use 0.0:
At stmt: _5 = *_3;
At pos: *_3
IV struct:
Type: double *
Base: a_9(D)
Step: 8
Object: (void *) a_9(D)
Biv: N
Overflowness wrto loop niter: Overflow

Group 1:
Type: ADDRESS
Use 1.0:
At stmt: *_4 = _5;
At pos: *_4
IV struct:
Type: double *
Base: c_10(D)
Step: 8
Object: (void *) c_10(D)
Biv: N
Overflowness wrto loop niter: Overflow

Group 2:
Type: COMPARE
Use 2.0:
At stmt: if (ivtmp_14 != 0)
At pos: ivtmp_14
IV struct:
Type: unsigned int
Base: 799
Step: 4294967295
Biv: Y
Overflowness wrto loop niter: Overflow

The ivopts cost model assumes that groups 0 and 1 will have infinite
cost for the IV added by the ivcanon pass, because of that IV's lower
precision.

If I change the example to:

void copy (unsigned int N, double *a, double *c)
{
 for (long i = 0; i < 800; ++i)
 c[i] = a[i];
}

It still has a higher cost for groups 0 and 1 due to the negative step.
I think this can be improved. My questions are:

1. For the case where the loop count is not constant, can we make
ivcanon add a reverse IV with the current implementation? Can ivopts
be taught to select the reverse IV?

2. Or is the patch by Zdenek a better option? I am rebasing it for trunk.

Thanks,
Kugan


Re: [RFC] type promotion pass

2017-09-17 Thread Kugan Vivekanandarajah
Hi Richard,

On 16 September 2017 at 06:12, Richard Biener
 wrote:
> On September 15, 2017 6:56:04 PM GMT+02:00, Jeff Law  wrote:
>>On 09/15/2017 10:19 AM, Segher Boessenkool wrote:
>>> On Fri, Sep 15, 2017 at 09:18:23AM -0600, Jeff Law wrote:
 WORD_REGISTER_OPERATIONS works with PROMOTE_MODE.  The reason you
>>can't
 define WORD_REGISTER_OPERATIONS on aarch64 is because that the
>>implicit
 promotion is sometimes to 32 bits and sometimes to 64 bits.
 WORD_REGISTER_OPERATIONS can't really describe that.
>>>
>>> WORD_REGISTER_OPERATIONS isn't well-defined.
>>>
>>> """
>>> @defmac WORD_REGISTER_OPERATIONS
>>> Define this macro to 1 if operations between registers with integral
>>mode
>>> smaller than a word are always performed on the entire register.
>>> Most RISC machines have this property and most CISC machines do not.
>>> @end defmac
>>> """
>>>
>>> Exactly what operations?  For almost all targets it isn't true for
>>*all*
>>> operations.  Or no targets even, if you include rotate, etc.
>>>
>>> For targets that have both 32-bit and 64-bit operations it is never
>>true
>>> either.
>>>
 And I'm also keen on doing something with type promotion -- Kai did
>>some
 work in this space years ago which I found interesting, even if the
>>work
 didn't go forward.  It showed a real weakness.  So I'm certainly
 interested in looking at Prathamesh's work -- with the caveat that
>>if it
 stumbles across the same issues as Kai's work that it likely
>>wouldn't be
 acceptable in its current form.
>>>
>>> Doing type promotion too aggressively reduces code quality.  "Just"
>>find
>>> a sweet spot :-)
>>>
>>> Example: on Power, an AND of QImode with 0xc3 is just one insn, which
>>> actually does a SImode AND with 0xffc3.  This is what we do
>>currently.
>>> A SImode AND with 0x00c3 is two insns, or one if we allow it to
>>write
>>> to CR0 as well ("andi."); same for DImode, except there isn't a way
>>to do
>>> an AND with 0xffc3 in one insn at all.
>>>
>>> unsigned char a;
>>> void f(void) { a &= 0xc3; };
>>Yes, these are some of the things we kicked around.  One of the most
>>interesting conclusions was that for these target issues we'd really
>>like a target.pd file to handle this class of transformations just
>>prior
>>to rtl expansion.
>>
>>Essentially early type promotion/demotion would be concerned with cases
>>where we can eliminate operations in a target independent manner and
>>narrow operands as much as possible.  Late promotion/demotion would
>>deal
>>with stuff like the target's desire to work on specific sized hunks in
>>specific contexts.
>>
>>I'm greatly oversimplifying here.  Type promotion/demotion is fairly
>>complex to get right.
>
> I always thought we should start with those promotions that are done by RTL 
> expansion according to PROMOTE_MODE and friends. The complication is that 
> those promotions also apply to function calls and arguments and those are 
> difficult to break apart from other ABI specific details.
>
> IIRC the last time we went over this patch I concluded a better first step 
> would be to expose call ABI details on GIMPLE much earlier. But I may 
> misremember here.

I think this would be very useful. Some of the regressions in type
promotion come from parameters/return values. The ABI in some cases
guarantees that they are properly extended, but during type promotion
we promote (or extend) anyway, leading to additional extensions.
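
As a purely hypothetical illustration (not tied to any particular
target's ABI): on a target whose calling convention already passes
narrow arguments zero-extended to register width, a promotion pass
that re-extends the incoming parameter just duplicates what the ABI
already guarantees:

/* Hypothetical target: the ABI passes 'unsigned char' arguments
   already zero-extended.  After promotion the body is done in
   'unsigned int', and the explicit re-extension of 'c' below is
   redundant on such a target.  */
unsigned int g (unsigned char c)
{
  unsigned int c_promoted = (unsigned int) c;   /* redundant extension */
  return c_promoted + 1;
}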

We might also need some way of having gimple statements that can
convert (or promote to the wider type without emitting extensions),
just to keep the gimple type system happy.

Thanks,
Kugan

>
> Basically we couldn't really apply all promotions RTL expansion applies. One 
> of my ideas with doing them early also was to simplify RTL expansion and 
> especially promotion issues during SSA coalescing.
>
> Richard.
>
>>jeff
>


Re: [RFC] type promotion pass

2017-09-18 Thread Kugan Vivekanandarajah
Hi Steve,

On 19 September 2017 at 05:45, Steve Ellcey  wrote:
> On Mon, 2017-09-18 at 23:29 +0530, Prathamesh Kulkarni wrote:
>>
>> Hi Steve,
>> The patch is currently based on r249469. I will rebase it on ToT and
>> look into the build failure.
>> Thanks for pointing it out.
>>
>> Regards,
>> Prathamesh
>
> OK, I applied it to that version successfully.  The thing I wanted to
> check was to see if this helped with PR target/77729.  It does not,
> so I think even with this patch we would need my patch to address the
> issue of having GCC recognize that ldrb/ldhb zero out the top of a
> register and thus we do not need to mask it out later.
>
> https://gcc.gnu.org/ml/gcc-patches/2017-09/msg00929.html

I tried the testcases you have in the patch with type promotion. It
looks like forwprop is reversing the promotion there. I haven't looked
in detail yet, but -fno-tree-forwprop seems to remove six "and"
instructions from the test case. I have a slightly different version
from what Prathamesh has posted and hope that there isn't any
difference here.

Thanks,
Kugan


Handling prefetcher tag collisions while allocating registers

2017-10-23 Thread Kugan Vivekanandarajah
Hi All,

I am wondering if there is any way we can prefer certain registers in
register allocation. That is, I want to have some way of recording
register allocation decisions (for loads in a loop that are accessed in
strides) and use this to influence the register allocation of other
loads (again, ones that are accessed in strides).

This is for architectures (like Falkor AArch64) that use hardware
prefetchers which use signatures of the loads to lock onto streams and
tune prefetching parameters. Ideally, if the loads are from the same
stream, they should have the same signature, and if they are from
different streams, they should have different signatures. The
destination, base register and offset are used in the signature.
Therefore, selecting different registers can influence this.

In LLVM, this is implemented as a machine-specific pass that runs
after register allocation. It then inserts mov instructions with
scratch registers to manage this. We could do a machine-reorg pass in
GCC, but detecting strided loads at that stage is not easy.

I am trying to implement this in GCC and am wondering what the
preferred and acceptable way to implement it is. Any thoughts?

Thanks,
Kugan


Re: Handling prefetcher tag collisions while allocating registers

2017-10-24 Thread Kugan Vivekanandarajah
Hi Bin,

On 24 October 2017 at 18:29, Bin.Cheng  wrote:
> On Tue, Oct 24, 2017 at 12:44 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi All,
>>
>> I am wondering if there is any way we can prefer certain registers in
>> register allocation. That is, I want to have some way of recording
>> register allocation decisions (for loads in a loop that are accessed in
>> strides) and use this to influence the register allocation of other
>> loads (again, ones that are accessed in strides).
>>
>> This is for architectures (like Falkor AArch64) that use hardware
>> prefetchers which use signatures of the loads to lock onto streams and
>> tune prefetching parameters. Ideally, if the loads are from the same
>> stream, they should have the same signature, and if they are from
>> different streams, they should have different signatures. The
>> destination, base register and offset are used in the signature.
>> Therefore, selecting different registers can influence this.
> I wonder why the destination register is used in the signature.  In an
> extreme case, a load in a loop can be unrolled and then allocated to
> different dest registers.  Forcing the same dest register could be too
> restrictive.

My description is very simplified. The signature is based on part of
the register number, so two registers can have the same signature. What
we don't want is collisions between two different memory streams. So
this is not an issue.

Thanks,
Kugan

>
> Thanks,
> bin
>
>>
>> In LLVM, this is implemented as a machine-specific pass that runs
>> after register allocation. It then inserts mov instructions with
>> scratch registers to manage this. We could do a machine-reorg pass in
>> GCC, but detecting strided loads at that stage is not easy.
>>
>> I am trying to implement this in GCC and am wondering what the
>> preferred and acceptable way to implement it is. Any thoughts?
>>
>> Thanks,
>> Kugan


Re: Global analysis of RTL

2017-10-25 Thread Kugan Vivekanandarajah
Hi,

On 26 October 2017 at 14:13, R0b0t1  wrote:
> On Thu, Oct 19, 2017 at 8:46 AM, Geoff Wozniak  wrote:
>> R0b0t1  writes:
>>>
>>> When I first looked at the GCC codebase, it seemed to me that most
>>> operations should be done on the GIMPLE representation as it contains the
>>> most information. Is there any reason you gravitated towards RTL?
>>
>>
>> Naiveté, really.
>>
>> My team and I didn’t know much about the code base when we started looking
>> at the problem, although we knew a little about the intermediate formats.
>> GIMPLE makes the analysis more complicated, although not impossible, and it
>> can make the cost model difficult to pin down. Raw assembly/machine code is
>> ideal, but then we have to deal with different platforms and would likely
>> have to do all the work in the linker. RTL is sufficiently low-level enough
>> (as far as we know) to start counting instructions, and platform independent
>> enough that we don’t have to parse machine code.
>>
>> Essentially, working with RTL makes the implementation a little easier but
>> we didn’t know that the pass infrastructure wasn’t in our favour.
>>
>> It’s likely we’ll turn our attention to GIMPLE and assembler/machine code,
>> unless we can come up with something (or anyone has a suggestion).
>>
>
> Admittedly I do not know much about compiler design, but your response
> has put some of what I read about analysis of RTL into context. It
> is hard to be sure, but I think analysis of RTL has fallen out of
> favor and has been replaced with the analysis of intermediate
> languages. For example, compare clang and llvm's operation.

It is not really being replaced (at least I am not aware of it). It is
true that more and more of the high-level optimisations have moved to
gimple. When we move from a high-level intermediate format to a
lower-level one, we tend to lose some information and get closer to the
machine representation. An obvious example is that in RTL the sign is
not represented. Furthermore in RTL, after reload, we have a one-to-one
mapping from RTL to actual machine instructions (i.e., closer to asm).
In short, GCC goes from GENERIC to GIMPLE to RTL as statements are
lowered from high-level languages to asm.

Thanks,
Kugan

>
> The missing link is that you seem to be right about cost calculation.
> Cost calculation is difficult for high level operations. Would online
> analysis of the produced machine code be sufficient? That seems to be
> a popular solution from what I have read.
>
> Thanks for the response, and best of luck to you.
>
> Cheers,
>  R0b0t1.


Re: Problems in IPA passes

2017-10-29 Thread Kugan Vivekanandarajah
Hi Jeff,

On 28 October 2017 at 18:28, Jeff Law  wrote:
>
> Jan,
>
> What's the purpose behind calling vrp_meet and
> extract_range_from_unary_expr from within the IPA passes?

This is used when we have an argument to a function for which we know
the VR, and this in turn is passed (possibly through a unary operation)
as a parameter to another function. For example:

void foo (int i)
{
...
  bar (unary_op (i))
...
}

This is mainly to share what is done in tree-vrp.
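
As a made-up illustration of what that sharing gives us:

/* Illustration only: if IPA analysis knows foo() is only ever called
   with i in [1, 100], evaluating the unary expression -i over that
   range (the job extract_range_from_unary_expr does in tree-vrp) lets
   IPA-VRP record [-100, -1] for bar()'s parameter.  */
void bar (int j);

void foo (int i)
{
  bar (-i);
}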
>
> AFAICT that is not safe to do.  Various paths through those routines
> will access static objects within tree-vrp.c which may not be
> initialized when IPA runs (vrp_equiv_obstack, vr_value).

IPA-VRP does not track equivalence and vr_value is not used.

Thanks,
Kugan

>
> While this seems to be working today, it's a failure waiting to happen.
>
> Is there any way you can avoid using those routines?  I can't believe
> you really need all the complexity of those routines, particularly
> extract_range_from_unary_expr.  Plus it's just downright fugly from a
> modularity standpoint.
>
>
> ?
>
> Jeff


Re: Problems in IPA passes

2017-10-31 Thread Kugan Vivekanandarajah
Hi Jeff,

On 31 October 2017 at 14:47, Jeff Law  wrote:
> On 10/29/2017 03:54 PM, Kugan Vivekanandarajah wrote:
>> Hi Jeff,
>>
>> On 28 October 2017 at 18:28, Jeff Law  wrote:
>>>
>>> Jan,
>>>
>>> What's the purpose behind calling vrp_meet and
>>> extract_range_from_unary_expr from within the IPA passes?
>>
>> This is used when we have an argument to a function for which we know
>> the VR, and this in turn is passed (possibly through a unary operation)
>> as a parameter to another function. For example:
>>
>> void foo (int i)
>> {
>> ...
>>   bar (unary_op (i))
>> ...
>> }
>>
>> This is mainly to share what is done in tree-vrp.
> Presumably you never have equivalences or anything like that, which
> probably helps with not touching vrp_bitmap_obstack which isn't
> initialized when you run the IPA bits.
>
>>>
>>> AFAICT that is not safe to do.  Various paths through those routines
>>> will access static objects within tree-vrp.c which may not be
>>> initialized when IPA runs (vrp_equiv_obstack, vr_value).
>>
>> IPA-VRP does not track equivalence and vr_value is not used.
> But there's no enforcement and I'd be hard pressed to believe that all
> the paths through the routines you use in tree-vrp aren't going to touch
> vr_value, or vrp_bitmap_obstack.  vrp_bitmap_obstack turns out to be
> incredibly tangled into the implementations within tree-vrp.c :(
>

I looked into the usage and it does not seem to be using vr_value,
unless I am missing something. There are two overloaded functions
here:

extract_range_from_unary_expr (value_range *vr,
   enum tree_code code, tree type,
   value_range *vr0_, tree op0_type)
is safe as this works with value_range and does not use
get_value_range to access vr_value.

extract_range_from_unary_expr (value_range *vr, enum tree_code code,
   tree type, tree op0)
This is not safe, as it takes a tree as an argument and gets the
value_range by calling get_value_range. Maybe we should change the
names to reflect this.

Thanks,
Kugan


Re: poly_uint64 / TYPE_VECTOR_SUBPARTS question

2018-02-08 Thread Kugan Vivekanandarajah
Hi,

On 9 February 2018 at 09:08, Steve Ellcey  wrote:
> I have a question about the poly_uint64 type and the TYPE_VECTOR_SUBPARTS
> macro.  I am trying to copy some code from i386.c into my aarch64 build
> that is basically:
>
> int n;
> n = TYPE_VECTOR_SUBPARTS (type_out);
>
> And it is not compiling for me, I get:
>
> /home/sellcey/gcc-vectmath/src/gcc/gcc/config/aarch64/aarch64-builtins.c:1504:37:
>  error: cannot convert ‘poly_uint64’ {aka ‘poly_int<2, long unsigned int>’} 
> to ‘int’ in assignment
>n = TYPE_VECTOR_SUBPARTS (type_out);

AFAIK, you could use to_constant () if the value is known to be a
compile-time constant.
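
Something along these lines (untested sketch; only valid where the
element count really is a compile-time constant):

/* Sketch: TYPE_VECTOR_SUBPARTS returns a poly_uint64; to_constant ()
   asserts the value is not symbolic, so only use it for fixed-length
   vectors.  */
unsigned HOST_WIDE_INT n = TYPE_VECTOR_SUBPARTS (type_out).to_constant ();

/* Or check first and bail out when the count is variable:
   unsigned HOST_WIDE_INT n;
   if (!TYPE_VECTOR_SUBPARTS (type_out).is_constant (&n))
     return NULL_TREE;  */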

Thanks,
Kugan
>
> My first thought was that I was missing a header file but I put
> all the header includes that are in i386.c into aarch64-builtins.c
> and it still does not compile.  It works on the i386 side.  It looks
> like poly-int.h and poly-int-types.h are included by coretypes.h
> and I include that header file so I don't understand why this isn't
> compiling and what I am missing.  Any help?
>
> Steve Ellcey
> sell...@cavium.com


Generating gimple assign stmt that changes sign

2018-05-21 Thread Kugan Vivekanandarajah
Hi,

I am looking to introduce ABSU_EXPR and that would create:

unsigned short res = ABSU_EXPR (short);

Note that the argument is signed and result is unsigned. As per the
review, I have a match.pd entry to generate this as:
(simplify (abs (convert @0))
 (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
  (convert (absu @0))))
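
For context, this is roughly the behaviour ABSU_EXPR is meant to have
(my own illustration in C; the interesting case is the most negative
value, whose absolute value only fits in the unsigned type):

/* Illustration: an "unsigned abs" of a signed short.  For x == SHRT_MIN
   (-32768) the result 32768 does not fit in short, but it does fit in
   unsigned short, which is why the operand and result types differ.  */
unsigned short absu_short (short x)
{
  return x < 0 ? (unsigned short) (0u - (unsigned int) x)
               : (unsigned short) x;
}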


Now, when gimplifying the converted tree, how do we tell that ABSU_EXPR
takes a signed argument and returns unsigned? I will have other
match.pd entries, so this will be generated in gimple passes too.
Should I add new functions in gimple.[h|c] for this?

Are there any examples I can refer to? Conversion expressions seem to
be the only place where the sign can change in a gimple assignment, but
they are very specific.

Thanks,
Kugan


Re: Generating gimple assign stmt that changes sign

2018-05-21 Thread Kugan Vivekanandarajah
Hi Jeff,

Thanks for the prompt reply.

On 22 May 2018 at 09:10, Jeff Law  wrote:
> On 05/21/2018 04:50 PM, Kugan Vivekanandarajah wrote:
>> Hi,
>>
>> I am looking to introduce ABSU_EXPR and that would create:
>>
>> unsigned short res = ABSU_EXPR (short);
>>
>> Note that the argument is signed and result is unsigned. As per the
>> review, I have a match.pd entry to generate this as:
>> (simplify (abs (convert @0))
>>  (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0)))
>>   (convert (absu @0))))
>>
>>
>> Now when gimplifying the converted tree, how do we tell that ABSU_EXPR
>> will take a signed arg and return unsigned. I will have other match.pd
>> entries so this will be generated while in gimple.passes too. Should I
>> add new functions in gimple.[h|c] for this.
>>
>> Is there any examples I can refer to. Conversion expressions seems to
>> be the only place where sign can change in gimple assignment but they
>> are very specific.
> What's the value in representing ABSU vs a standard ABS followed by a
> conversion?

It is based on PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64946.
Specifically, comment 13.

>
> You'll certainly want to do verification of the type signedness in the
> gimple verifier.
I am doing it and it is failing now.

>
> In general the source and destination types have to be the same.
> Conversions are the obvious exception.  There's a few other nodes that
> have more complex type rules (MEM_REF, COND_EXPR and a few others).  But
> I don't think they're necessarily going to be helpful.


Thanks,
Kugan
>
> jeff


Sched1 stability issue

2018-07-03 Thread Kugan Vivekanandarajah
Hi,

We noticed a difference in the code generated by aarch64 GCC 7.2
hosted on Linux vs. mingw. AFAIK, we are supposed to produce the same
output.

For the testcase we have (quite large, and I am trying to reduce it),
the difference comes from the sched1 pass. If I disable sched1, the
difference goes away.

Is this a known issue? Attached is the sched1 dump snippet where the
difference appears.

Thanks,
Kugan


 verify found no changes in insn with uid = 41.
 starting the processing of deferred insns
 ending the processing of deferred insns
 df_analyze called

 Pass 0 for finding pseudo/allocno costs


   r84 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2
FP_REGS:2 ALL_REGS:2 MEM:8000
   r83 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:2
FP_REGS:2 ALL_REGS:2 MEM:8000
   r80 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:1
FP_REGS:1 ALL_REGS:1 MEM:8000
   r79 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000
FP_REGS:4000 ALL_REGS:1 MEM:8000
   r78 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:4000
FP_REGS:4000 ALL_REGS:1 MEM:8000
   r77 costs: CALLER_SAVE_REGS:0 GENERAL_REGS:0 FP_LO_REGS:9000
FP_REGS:9000 ALL_REGS:1 MEM:8000


 Pass 1 for finding pseudo/allocno costs

 r86: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r85: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r84: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r83: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r82: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r81: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r80: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r79: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r78: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r77: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r76: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r75: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r74: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r73: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r72: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r71: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r70: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r69: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r68: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
 r67: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS

   r84 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2
ALL_REGS:2 MEM:8000
   r83 costs: GENERAL_REGS:0 FP_LO_REGS:2 FP_REGS:2
ALL_REGS:2 MEM:8000
   r80 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r79 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r78 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000
   r77 costs: GENERAL_REGS:0 FP_LO_REGS:1 FP_REGS:1
ALL_REGS:1 MEM:8000

 ;;   ==
 ;;   -- basic block 2 from 3 to 48 -- before reload
 ;;   ==

 ;;  0--> b  0: i  24 r77=ap-0x40
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
 ;;  0--> b  0: i  26 r78=0xffc8
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
 ;;  1--> b  0: i  25 [sfp-0x10]=r77
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0)
--
-;;  1--> b  0: i   9 [ap-0x8]=x7
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(-1)@FP_REGS+0(0)
--
-;;  2--> b  0: i  22 [sfp-0x20]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
+;;  1--> b  0: i  22 [sfp-0x20]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(0)
 ;;  2--> b  0: i  23 [sfp-0x18]=ap
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(0)FP_REGS+0(0)
-;;  3--> b  0: i  27 [sfp-0x8]=r78
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
+;;  2--> b  0: i  27 [sfp-0x8]=r78
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:GENERAL_REGS+0(-1)FP_REGS+0(0)
 ;;  3--> b  0: i  28 r79=0xff80
:cortex_a53_slot_any:GENERAL_REGS+1(1)FP_REGS+0(0)
-;;  4--> b  0: i  10 [ap-0xc0]=v0
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
+;;  3--> b  0: i  10 [ap-0xc0]=v0
:(cortex_a53_slot_any+cortex_a53_ls_agen),cortex_a53_store:@GENERAL_REGS+0(0)@FP_REGS+0(-1)
 ;;  4--> b  0: i  29 [sfp-0x4]=r79
:(cortex_a53_slot_