Re: Modifying ARM code generator for elimination of 8bit writes - need help

2006-06-02 Thread Rask Ingemann Lambertsen
On Fri, Jun 02, 2006 at 08:23:49AM +0200, Wolfgang Mües wrote:
> Rask,
> 
> (_only_ adding the clobber statement),
> I get

> >0/newlib/libc/argz/argz_create_sep.c:60: error: unrecognizable insn:
> (insn 192 21 24 0 (set (reg:QI 1 r1) (reg:QI 4 r4)) -1
>  (nil) (nil))
> 
> What do you mean with
> 
> > You will also have to modify any code which 
> > expands this pattern accordingly.

The rest of the ARM backend presently assumes that the pattern has the form

(set (operand:QI 0) (operand:QI 1))

but now we've changed it to

(parallel [(set (operand:QI 0) (operand:QI 1))
   (clobber (operand:QI 2))
])

so that's why you get "unrecognizable insn" errors now. Any place which
intended to generate an *arm_movqi_insn has to add a clobber also. For a
start, this means the "movqi" pattern.

There may be a faster way of seeing if the modification is going to work for
the DS at all. I noticed from the output template "swp%?b\\t%1, %1, [%M0]"
that "swp" takes three operands. I don't know ARM assembler, but you may be
able to choose to always clobber a specific register. Make it a fixed register
(see FIXED_REGISTERS), refer to this register directly in the output template
and don't add a clobber to the movqi patterns. IMHO, that's an acceptable hack
at an experimental stage. If the resulting code runs correctly on the DS, you
can then undo the FIXED_REGISTERS change and add the clobber statements.

-- 
Rask Ingemann Lambertsen


[patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang
Hi, all

This patch yields a performance increase of 4% for SPECfp2000 and 13% for the
NAS benchmark suite on an Itanium-2 system. Further gains may be possible by
tuning the parameters further and improving the prefetch algorithm at the tree
level.

Details of NAS benchmarks are listed below.


GCC options: -O3 -fprefetch-loop-arrays
Target: Itanium-2 1.6GHz; L2 Cache 256K, L3 Cache 6M
Execution times in seconds

          -this patch   +this patch
bt.W         14.43         14.17
cg.A         13.76          6.86
ep.W          7.83          7.79
ft.A         18.73         20.15
is.B         11.85         10.94
lu.W         20.55         20.27
mg.A         15.09         11.86
sp.W         37.11         35.49
geomean      15.84         13.94
speedup 13.68%
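
The geomean and speedup rows can be rechecked from the per-benchmark times.
A minimal sketch in C, assuming only the eight times listed in the table; the
helper names (newton_sqrt, geomean8, speedup_percent) are made up for
illustration, and the eighth root is taken as three nested square roots so
that nothing from libm is needed:

```c
#include <stddef.h>

/* Execution times in seconds, copied from the table
   (bt, cg, ep, ft, is, lu, mg, sp). */
const double before_patch[8] = {14.43, 13.76, 7.83, 18.73, 11.85, 20.55, 15.09, 37.11};
const double after_patch[8]  = {14.17,  6.86, 7.79, 20.15, 10.94, 20.27, 11.86, 35.49};

/* Square root by Newton's iteration; hypothetical helper, not a library call. */
double newton_sqrt(double x)
{
    double r;
    int i;

    if (x <= 0.0)
        return 0.0;
    r = x > 1.0 ? x : 1.0;
    for (i = 0; i < 100; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Geometric mean of exactly eight positive values: 8th root of the product. */
double geomean8(const double t[8])
{
    double product = 1.0;
    size_t i;

    for (i = 0; i < 8; i++)
        product *= t[i];
    return newton_sqrt(newton_sqrt(newton_sqrt(product)));
}

/* Speedup in percent when the time drops from `before` to `after`. */
double speedup_percent(double before, double after)
{
    return (before / after - 1.0) * 100.0;
}
```

Calling speedup_percent (geomean8 (before_patch), geomean8 (after_patch))
gives the overall figure; the per-benchmark ratios show that cg.A and mg.A
account for most of it, while ft.A gets slower.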


2006-06-02  Canqun Yang  <[EMAIL PROTECTED]>

 * config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Define to 18.
 (PREFETCH_BLOCK): Define to 128.
 (PREFETCH_LATENCY): Define to 400.

Index: ia64.h
===
--- ia64.h (revision 114307)
+++ ia64.h (working copy)
@@ -1985,13 +1985,18 @@
??? This number is bogus and needs to be replaced before the value is
actually used in optimizations.  */
 
-#define SIMULTANEOUS_PREFETCHES 6
+#define SIMULTANEOUS_PREFETCHES 18
 
 /* If this architecture supports prefetch, define this to be the size of
the cache line that is prefetched.  */
 
-#define PREFETCH_BLOCK 32
+#define PREFETCH_BLOCK 128
 
+/* A number that should roughly correspond to the number of instructions
+   executed before the prefetch is completed.  */
+
+#define PREFETCH_LATENCY 400
+
 #define HANDLE_SYSV_PRAGMA 1
 
 /* A C expression for the maximum number of instructions to execute via


Canqun Yang




Re: [wwwdocs] releases.html v/s develop.html

2006-06-02 Thread Gerald Pfeifer
On Thu, 1 Jun 2006, Joe Buck wrote:
> Let's just add the info to the table.  Here is a proposed patch.
> Note that I resorted by date and added an explanation.  I think
> that the attempt to sort by release number became increasingly
> untenable after 3.4, because we now have heavy overlapping.  Better
> to just state it and explain it.

Sounds good.  I applied your patch right away and will take care of 
updating the release manager documentation over the weekend.

Thanks!

Gerald


Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Andrey Belevantsev

Canqun Yang wrote:

> Hi, all
>
> This patch yields a performance increase of 4% for SPECfp2000 and 13% for the
> NAS benchmark suite on an Itanium-2 system. Further gains may be possible by
> tuning the parameters further and improving the prefetch algorithm at the tree
> level.


Hi Canqun,

It's great news that you continued to work on prefetching tuning for 
ia64!  Do you plan to port your other changes for the old RTL 
prefetching to the tree level?



> @@ -1985,13 +1985,18 @@
> ??? This number is bogus and needs to be replaced before the value is
> actually used in optimizations.  */


I suggest removing this comment, as it has become outdated with your
patch.  Instead you might say how you chose this particular value
(and PREFETCH_BLOCK too).  Just my 2c.


Andrey



Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Steven Bosscher

On 6/2/06, Canqun Yang <[EMAIL PROTECTED]> wrote:

> This patch yields a performance increase of 4% for SPECfp2000 and 13% for the
> NAS benchmark suite on an Itanium-2 system. Further gains may be possible by
> tuning the parameters further and improving the prefetch algorithm at the tree
> level.


Bravo.


> --- ia64.h (revision 114307)
> +++ ia64.h (working copy)
> @@ -1985,13 +1985,18 @@
>    ??? This number is bogus and needs to be replaced before the value is
>    actually used in optimizations.  */
>
> -#define SIMULTANEOUS_PREFETCHES 6
> +#define SIMULTANEOUS_PREFETCHES 18


Is the number still bogus as the comment suggests, or is there a
rationale for 18?  It looks quite high.


> +/* A number that should roughly correspond to the number of instructions
> +   executed before the prefetch is completed.  */
> +
> +#define PREFETCH_LATENCY 400


Likewise.  Is 400 cycles the memory latency on itanium-2?

Gr.
Steven


Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang
--- Andrey Belevantsev <[EMAIL PROTECTED]>:

> Canqun Yang wrote:
> > Hi, all
> > 
> > This patch yields a performance increase of 4% for SPECfp2000 and 13% for the
> > NAS benchmark suite on an Itanium-2 system. Further gains may be possible by
> > tuning the parameters further and improving the prefetch algorithm at the
> > tree level.
> 
> Hi Canqun,
> 
> It's great news that you continued to work on prefetching tuning for 
> ia64!  Do you plan to port your other changes for the old RTL 
> prefetching to the tree level?
> 

Yes, but I do not have much time for it right now; I am busy with other things.

> > @@ -1985,13 +1985,18 @@
> > ??? This number is bogus and needs to be replaced before the value is
> > actually used in optimizations.  */
> 
> I suggest removing this comment, as it has become outdated with your
> patch.  Instead you might say how you chose this particular value
> (and PREFETCH_BLOCK too).  Just my 2c.
> 
> Andrey
> 
> 

Please refer to my previous mail and the attached paper.

Canqun Yang



addressability checks in the gimplifier

2006-06-02 Thread Olivier Hainque
Hello,

First a short description of a problem we are seeing, then a couple
of related questions on addressability checks in the gimplifier.

From a simple Ada testcase which I can provide if need be, the front-end
is producing a MODIFY_EXPR with a lhs of the following shape when we get
to gimplify_modify_expr: 

 

arg 1 
bit offset 

arg 1 

in short, a variable size array_range_ref within a bitfield record component.

The lhs remains of similar shape after gimplification, the rhs is of
variable size as well, and we end up at this point in gimplify_modify_expr:

  /* If we've got a variable sized assignment between two lvalues (i.e. does
     not involve a call), then we can make things a bit more straightforward
     by converting the assignment to memcpy or memset.  */
  if (TREE_CODE (*from_p) == WITH_SIZE_EXPR)
    {
      tree from = TREE_OPERAND (*from_p, 0);
      tree size = TREE_OPERAND (*from_p, 1);

      if (TREE_CODE (from) == CONSTRUCTOR)
        return gimplify_modify_expr_to_memset (expr_p, size, want_value);
      if (is_gimple_addressable (from))
        {
          *from_p = from;
          return gimplify_modify_expr_to_memcpy (expr_p, size, want_value);
        }
    }

We get down into gimplify_modify_expr_to_memcpy, which builds ADDR_EXPRs
for both operands, which ICEs later on from expand_expr_addr_expr_1 because
the operand sketched above is not byte-aligned.

The first puzzle to me is that there is no check made that the target
is a valid argument for an ADDR_EXPR.  AFAICS, it has been gimplified
with is_gimple_lvalue/fb_lvalue as the predicate/fallback pair, but
this currently doesn't imply the required properties.

I first thought that a is_gimple_addressable (*to_p) addition to the
outer condition would help, but it actually does not because the
predicate is shallow and only checks a very restricted set of
conditions (e.g.  any ARRAY_RANGE_REF or COMPONENT_REF is considered
"addressable"). This is actually the reason why the gimplified lhs
tree is considered is_gimple_lvalue, from:

   bool
   is_gimple_lvalue (tree t)
   {
     return (is_gimple_addressable (t)
             || TREE_CODE (t) == WITH_SIZE_EXPR
             /* These are complex lvalues, but don't have addresses, so they
                go here.  */
             || TREE_CODE (t) == BIT_FIELD_REF);
   }

Assuming that the initial tree is valid GENERIC, it would seem that a more
sophisticated addressability checker (recursing down some inner refs and
checking DECL_BIT_FIELD on field decls in COMPONENT_REFs) might be required.
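
The deeper check suggested here could look roughly like the following. This
is a toy model, not GCC code: the toy_node struct, the toy_code enum and
toy_addressable_p are invented for illustration and only mimic the shape of
ARRAY_RANGE_REF / COMPONENT_REF trees and the DECL_BIT_FIELD flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a few GENERIC tree codes. */
enum toy_code {
    TOY_VAR_DECL,
    TOY_COMPONENT_REF,
    TOY_ARRAY_REF,
    TOY_ARRAY_RANGE_REF,
    TOY_BIT_FIELD_REF
};

struct toy_node {
    enum toy_code code;
    const struct toy_node *base;  /* operand 0: the referenced object, or NULL */
    bool field_is_bit_field;      /* models DECL_BIT_FIELD on a COMPONENT_REF's field */
};

/* Walk from the outermost reference down to the base object and return
   true only if no step selects a bit-field, i.e. the whole reference
   could be given a byte address.  */
bool toy_addressable_p(const struct toy_node *t)
{
    for (; t != NULL; t = t->base) {
        switch (t->code) {
        case TOY_BIT_FIELD_REF:
            return false;
        case TOY_COMPONENT_REF:
            if (t->field_is_bit_field)
                return false;
            break;
        case TOY_ARRAY_REF:
        case TOY_ARRAY_RANGE_REF:
        case TOY_VAR_DECL:
            break;
        }
    }
    return true;
}
```

The point of the sketch is only that the answer depends on the inner
references, which a shallow look at the outermost tree code cannot see.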

I'm unclear whether this could/should be is_gimple_addressable, as
comments from http://gcc.gnu.org/ml/gcc/2004-07/msg01255.html indicate
that it is not designed for this sort of operation.

I'm pretty sure I'm missing implicit assumptions and/or bits of design
intents in various places, so would appreciate input on the case and
puzzles described above.

Thanks very much in advance for your help,

With Kind Regards,

Olivier





Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Thu, 2006-06-01 at 16:05, Mark Shinwell wrote:
> Mark Mitchell wrote:
> > Mark Shinwell wrote:
> >> As for the remaining problem, I suggest that we could:
> >>
> >> (i) always return the hard frame pointer, and disable FP elimination in
> >> the current function; or
> >>
> >> (iii) ...the same as option (i), but allow targets to define another macro
> >> that will cause the default code to use the soft frame pointer rather than
> >> the hard frame pointer, and hence allow FP elimination.  (If such a macro
> >> were set by a particular target, am I right in thinking that it would be
> >> safe to use the soft frame pointer even in the count >= 1 cases?)
> > 
> >> I tend to think that option (iii) might be best, although perhaps it
> >> is overkill and option (i) would do.  But I'm not entirely sure;
> >> still being a gcc novice I have to admit to not being quite thoroughly
> >> clear on this myself at this stage.  So any advice or comments would be
> >> appreciated!
> > 
> > I agree that option (iii) is best, as it provides the ability to
> > optimize on platforms where that is feasible, and still provides a
> > working default elsewhere.  I will review and approve a suitable patch
> > to implement (iii), assuming that there are no objections from Jim or
> > others.
> 
> This having been discussed some more, and my understanding improved,
> I now believe that option (i) is in fact the correct thing to do.  The
> attached patch implements this, which basically amounts to the same logic
> that is currently in the compiler save for the removal of the special
> case when count == 0.
> 
> OK for mainline?
> 

I'm not keen on this.  On some machines a frame pointer is *very*
expensive, both in terms of the code required to set it up, and the
resulting loss of a register which affects code quality (in addition, on
Thumb, the frame pointer can access much less data on the stack than the
stack pointer can, so code quality is affected even more).

I can see no argument for a frame pointer being *required* for getting
the return address.  We didn't have to do this in the past, so I think
it is wrong to require that we do it now.

R.
> Mark
> 
> --
> 
> 2006-06-01  Mark Shinwell  <[EMAIL PROTECTED]>
> 
>   * gcc/builtins.c (expand_builtin_return_addr): Always use
>   hard_frame_pointer_rtx and prevent frame pointer elimination
>   if INITIAL_FRAME_ADDRESS_RTX isn't set.




Re: Expansion of __builtin_frame_address

2006-06-02 Thread Mark Shinwell

Richard Earnshaw wrote:

> I'm not keen on this.  On some machines a frame pointer is *very*
> expensive, both in terms of the code required to set it up, and the
> resulting loss of a register which affects code quality (in addition, on
> Thumb, the frame pointer can access much less data on the stack than the
> stack pointer can, so code quality is affected even more).


Do you have anything in mind that would be a better default?  Is there
something that could be used instead of hard_frame_pointer_rtx that
will later expand to the correct frame address, but not necessarily force
use of a frame pointer, for example?  (As far as I can tell,
frame_pointer_rtx will not do at least in the ARM case, because it doesn't
yield the same value.)

If the hard frame pointer is forced by default, then targets which are
particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
Since such targets would presumably not have to force reload to keep
the frame pointer, then definitions of such macros would not need to
be side-effecting (in the way described earlier in this thread) and thus
be satisfactory.


> I can see no argument for a frame pointer being *required* for getting
> the return address.  We didn't have to do this in the past, so I think
> it is wrong to require that we do it now.


Currently, the code does require a frame pointer in all except the
count == 0 case, and as far as that particular case goes I get the
impression that it would have been treated in the same way had this
glibc backtrace function been noticed last year.  This may be a mistaken
impression though.

Mark


Re: Expansion of __builtin_frame_address

2006-06-02 Thread David Edelsohn
> Mark Shinwell writes:

Mark> If the hard frame pointer is forced by default, then targets which are
Mark> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
Mark> Since such targets would presumably not have to force reload to keep
Mark> the frame pointer, then definitions of such macros would not need to
Mark> be side-effecting (in the way described earlier in this thread) and thus
Mark> be satisfactory.

PowerPC also does not need hard_frame_pointer_rtx for all cases.
It seems like a bad idea to force every port to define
INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.  Why can't whatever port
needs this change define INITIAL_FRAME_ADDRESS_RTX to
hard_frame_pointer_rtx?   One could add "count" to the macro so that the
port can optimize further and avoid hard_frame_pointer_rtx, if possible.

David



Re: [wwwdocs] releases.html v/s develop.html

2006-06-02 Thread Ranjit Mathew

On 6/2/06, Gerald Pfeifer <[EMAIL PROTECTED]> wrote:


Mind to send/commit a patch to complete releases.html with 4.x
releases and add a step to releasing.html?  (Basically you just
need to revert revision 1.26 of that file.)


Joe Buck beat me to it and you applied it for him. Thanks to
both of you.

Thanks,
Ranjit.

--
Ranjit Mathew  Email: rmathew AT gmail DOT com

Bangalore, INDIA.Web: http://rmathew.com/


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Mark Shinwell

David Edelsohn wrote:

Mark Shinwell writes:


Mark> If the hard frame pointer is forced by default, then targets which are
Mark> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
Mark> Since such targets would presumably not have to force reload to keep
Mark> the frame pointer, then definitions of such macros would not need to
Mark> be side-effecting (in the way described earlier in this thread) and thus
Mark> be satisfactory.

PowerPC also does not need hard_frame_pointer_rtx for all cases.
It seems like a bad idea to force every port to define
INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.  Why can't whatever port
needs this change define INITIAL_FRAME_ADDRESS_RTX to
hard_frame_pointer_rtx?   One could add "count" to the macro so that the
port can optimize further and avoid hard_frame_pointer_rtx, if possible.


OK, here is what I think is a better suggestion.  First note that
expand_builtin_return_addr is used for both __builtin_frame_address and
__builtin_return_address.  The behaviour for the return address
case seems to be for target-specific code to override the result of this
function in the case when count == 0; thus, it does indeed not matter what
we return from expand_builtin_return_addr in that case.  (I hadn't
realised this before.)  The new patch, below, thus has the same behaviour
for __builtin_return_address.

However when dealing with __builtin_frame_address, we must return the
correct value from this function no matter what the value of count.  This
patch therefore forces use of a hard FP in such situations.
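
The rule described above can be written down as a small predicate. A toy
sketch, assuming only what this message states; the enum and function names
are invented here and are not the identifiers used in builtins.c:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two builtins that share the expander. */
enum toy_builtin { TOY_FRAME_ADDRESS, TOY_RETURN_ADDRESS };

/* True when the expansion must use the hard frame pointer (and so
   disable frame pointer elimination) under the proposed rule.  */
bool toy_needs_hard_frame_pointer(enum toy_builtin which, unsigned count)
{
    /* Only __builtin_return_address (0) may keep the soft frame pointer. */
    return !(count == 0 && which == TOY_RETURN_ADDRESS);
}
```

So the only behavioural change relative to the old count == 0 test is the
__builtin_frame_address (0) case.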

Is that more satisfactory?

Mark
Index: builtins.c
===
--- builtins.c  (revision 114325)
+++ builtins.c  (working copy)
@@ -496,12 +496,16 @@ expand_builtin_return_addr (enum built_i
 #else
   rtx tem;
 
-  /* For a zero count, we don't care what frame address we return, so frame
- pointer elimination is OK, and using the soft frame pointer is OK.
- For a non-zero count, we require a stable offset from the current frame
- pointer to the previous one, so we must use the hard frame pointer, and
+  /* For a zero count with __builtin_return_address, we don't care what
+ frame address we return, because target-specific definitions will
+ override us.  Therefore frame pointer elimination is OK, and using
+ the soft frame pointer is OK.
+
+ For a non-zero count, or a zero count with __builtin_frame_address,
+ we require a stable offset from the current frame pointer to the
+ previous one, so we must use the hard frame pointer, and
  we must disable frame pointer elimination.  */
-  if (count == 0)
+  if (count == 0 && fndecl_code == BUILT_IN_RETURN_ADDRESS)
 tem = frame_pointer_rtx;
   else 
 {


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Fri, 2006-06-02 at 14:57, Mark Shinwell wrote:
> Richard Earnshaw wrote:
> > I'm not keen on this.  On some machines a frame pointer is *very*
> > expensive, both in terms of the code required to set it up, and the
> > resulting loss of a register which affects code quality (in addition, on
> > Thumb, the frame pointer can access much less data on the stack than the
> > stack pointer can, so code quality is affected even more).
> 
> Do you have anything in mind that would be a better default?  Is there
> something that could be used instead of hard_frame_pointer_rtx that
> will later expand to the correct frame address, but not necessarily force
> use of a frame pointer, for example?  (As far as I can tell,
> frame_pointer_rtx will not do at least in the ARM case, because it doesn't
> yield the same value.)
> 

Well in the past the ARM prologue code would copy the return address
into a pseudo if the body of the function invoked
__builtin_return_address, then the body of the code just used the
pseudo.  Somebody found a reason to change this, but I can't remember
why.

> If the hard frame pointer is forced by default, then targets which are
> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
> Since such targets would presumably not have to force reload to keep
> the frame pointer, then definitions of such macros would not need to
> be side-effecting (in the way described earlier in this thread) and thus
> be satisfactory.
> 
> > I can see no argument for a frame pointer being *required* for getting
> > the return address.  We didn't have to do this in the past, so I think
> > it is wrong to require that we do it now.
> 
> Currently, the code does require a frame pointer in all except the
> count == 0 case, and as far as that particular case goes I get the
> impression that it would have been treated in the same way had this
> glibc backtrace function been noticed last year.  This may be a mistaken
> impression though.

__builtin_frame_address(n) is not required to work for any value other
than n=0.  It's not clear what it means anyway on a function that
eliminates the frame pointer.
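
For reference, the n=0 forms under discussion are used like this. A minimal
sketch, assuming GCC or a compatible compiler; the wrapper names are made up,
and noinline is only there so that each wrapper keeps a frame of its own:

```c
#include <stddef.h>

/* Frame address of this wrapper itself (argument 0). */
__attribute__((noinline)) void *current_frame_address(void)
{
    return __builtin_frame_address(0);
}

/* Address this wrapper will return to in its caller (argument 0). */
__attribute__((noinline)) void *current_return_address(void)
{
    return __builtin_return_address(0);
}
```

Anything with a non-zero argument is outside what this sketch relies on.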

On ARM you *cannot* walk the stack frames without additional
information.  Frames are private to each function and there is no
defined format for the layout.  In particular the layout (and the frame
pointer register) is different for ARM and Thumb code, but the two can
be freely intermixed.

The only chance you have for producing a backtrace() is to have
unwinding information similar to that provided for exception unwinding. 
This would describe to the unwinder how that frame's code is laid out so
that it can unpick it.

R.



Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Fri, 2006-06-02 at 15:30, Mark Shinwell wrote:

> However when dealing with __builtin_frame_address, we must return the
> correct value from this function no matter what the value of count.  This
> patch therefore forces use of a hard FP in such situations.

Eh?  The manual explicitly says that __builtin_frame_address is only
required to work if count=0.  You simply cannot walk up arbitrary
numbers of frames on some CPUs since code isn't compiled to support it.

R.



RE: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Davis, Mark
Canqun,

Nice job getting this ready for the current version of gcc!

Question: does gcc now know the difference between prefetching to cache L1 via 
"lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?  For 
floating point data, the latter is the only interesting case because float 
loads only access the L2.  Thus using "lfetch" for floating point arrays will 
unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only seems to generate 
"lfetch", which is why I ask...)

Thanks,
Mark 

-Original Message-
From: Canqun Yang [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 02, 2006 5:14 AM
To: gcc@gcc.gnu.org; [EMAIL PROTECTED]
Subject: [patch] Improve loop array prefetch for IA-64

Hi, all

This patch yields a performance increase of 4% for SPECfp2000 and 13% for the
NAS benchmark suite on an Itanium-2 system. Further gains may be possible by
tuning the parameters further and improving the prefetch algorithm at the tree
level.


Canqun Yang




Re: Expansion of __builtin_frame_address

2006-06-02 Thread Daniel Jacobowitz
On Fri, Jun 02, 2006 at 04:20:21PM +0100, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 15:30, Mark Shinwell wrote:
> 
> > However when dealing with __builtin_frame_address, we must return the
> > correct value from this function no matter what the value of count.  This
> > patch therefore forces use of a hard FP in such situations.
> 
> Eh?  The manual explicitly says that __builtin_frame_address is only
> required to work if count=0.  You simply cannot walk up arbitrary
> numbers of frames on some CPUs since code isn't compiled to support it.

Right - it's the result of __builtin_frame_address (0) we're looking at
here.

Mark's latest change seems logical to me: the user has asked for the
frame address, so hadn't we better arrange that there's a frame?

-- 
Daniel Jacobowitz
CodeSourcery


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Paul Brook
> __builtin_frame_address(n) is not required to work for any value other
> than n=0.  It's not clear what it means anyway on a function that
> eliminates the frame pointer.
>
> On ARM you *cannot* walk the stack frames without additional
> information.  Frames are private to each function and there is no
> defined format for the layout.  In particular the layout (and the frame
> pointer register) is different for ARM and Thumb code, but the two can
> be freely intermixed.
>
> The only chance you have for producing a backtrace() is to have
> unwinding information similar to that provided for exception unwinding.
> This would describe to the unwinder how that frame's code is laid out so
> that it can unpick it.

I agree that in general you need ancillary information to get a backtrace
on Arm. However, if you assume only Arm code and -fno-omit-frame-pointer,
then you can walk the frames. Given this assumption I think it makes sense to
have _b_f_a(0) force the use of a frame pointer.

If you're implementing "proper" unwinding then I'd say you want to use an 
assembly function stub to determine the initial frame, rather than relying on 
a rather ill-defined gcc builtin function. In other words 
__builtin_frame_address is effectively useless, so we may as well make it 
consistent with historical use than try to optimize it.

The background to this problem is that we have a client who was upset when
backtrace() "broke" with gcc4. For this particular client,
-marm -fno-omit-frame-pointer -mapcs-frame is an acceptable price to pay for
making backtrace() work.

Paul


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> I agree that in general you need ancillary information to get a backtrace
> on Arm. However, if you assume only Arm code and -fno-omit-frame-pointer,
> then you can walk the frames. Given this assumption I think it makes sense to
> have _b_f_a(0) force the use of a frame pointer.
> 

No, in the general case you can't.  Because ARM and Thumb frames are
laid out differently.  In ARM code the frame pointer is in r11 (when not
eliminated); in thumb code it is in r7 (because r11 can't be used in
memory insns).

R.



Re: Expansion of __builtin_frame_address

2006-06-02 Thread Daniel Jacobowitz
On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > I agree that in general you need ancillary information to get a backtrace
> > on Arm. However, if you assume only Arm code and -fno-omit-frame-pointer,
> > then you can walk the frames. Given this assumption I think it makes sense
> > to have _b_f_a(0) force the use of a frame pointer.
> > 
> 
> No, in the general case you can't.  Because ARM and Thumb frames are
> laid out differently.  In ARM code the frame pointer is in r11 (when not
> eliminated); in thumb code it is in r7 (because r11 can't be used in
> memory insns).

I'm reading these two paragraphs and the two of you seem to be in
violent agreement.  Paul assumed ARM code only.

-- 
Daniel Jacobowitz
CodeSourcery


Bug in gnu assembler?

2006-06-02 Thread jacob navia

How to reproduce this problem
-

1) Take some C file. I used for instance dwarf.c from
  the new binutils distribution.
2) Generate an assembler listing of this file
3) Using objdump -s dwarf.o I dump all the
  sections of the executable in hexadecimal format.
  Put the result of this dump in some file, I used
  "vv" as name.
4) Dump the contents of the eh_frame section in
  symbolic form. You should use readelf -W. Put
  the result in some file, say, "dwarf.framedump"

---

OK Now let's start. I go to the assembly listing
(dwarf.s) and search for "eh_frame" in the editor.
I arrive at:
   .section .debug_frame,"",@progbits

This section consists of a CIE (Common Information Entry
in GNU terminology) that is generated as follows
in the assembly listing

.Lframe0:
   .long   .LECIE0-.LSCIE0
.LSCIE0:
   .long   0x
   .byte   0x1
   .string ""
   .uleb128 0x1
   .sleb128 -8
   .byte   0x10
   .byte   0xc
   .uleb128 0x7
   .uleb128 0x8
   .byte   0x90
   .uleb128 0x1
   .align 8
.LECIE0:
---
This corresponds to a symbolic listing like this:
(file dwarf.framedump)

The section .debug_frame contains:

 0014  CIE
 Version:   1
 Augmentation:  ""
 Code alignment factor: 1
 Data alignment factor: -8
 Return address column: 16

 DW_CFA_def_cfa: r7 ofs 8
 DW_CFA_offset: r16 at cfa-8
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop

This means that this entry starts at offset 0 and goes
for 20+4 bytes (the length field is 4 bytes).
Our binary dump of the contents of the first 96 bytes
(0x60) looks like this:
Contents of section .eh_frame:
 1400  01000178 100c0708  ...x
0010 9001  1c00 1c00  
0020   5900   Y...
0030 410e1083 0200 1c00 3c00  A...<...
0040   6800   h...
0050 410e1083 0200 1400 5c00  A...\...
0060   4e00   N...
We eliminate the first 24 (0x18) bytes and we obtain:
0018 1c00 1c00  
0020   5900   Y...
0030 410e1083 0200 1c00 3c00  A...<...
This is an FDE, or Frame Description Entry in GNU terminology.
We have first a 32 bit length field represented
by the difference .LEFDE0 - .LASFDE0. This is 1c00 above

Then we have another .long instruction, (32 bits)
that corresponds to the second 1c00 above.

Then we have two .quad instructions that correspond to
the line
0020   5900   Y...
above

AND NOW IT BECOMES VERY INTERESTING:
We have the instructions
   .byte   0x4
   .long   .LCFI0 - .LFB50
   .byte   0xe
   .uleb128 0x10
   .byte   0x83
   .uleb128 0x2
   .align 8

And we find in the hexadecimal dump the line
0030 410e1083 0200 1c00 3c00  A...<...

The 4 and the 1 are in the same byte, followed by the correct
0xe byte, the correct 0x10 byte (the uleb128 0x10), the correct
0x83 byte, and the correct 0x02 byte.
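
One way to read the byte 0x41 is through the DWARF call-frame opcode layout:
when the top two bits of an instruction byte are non-zero they select
DW_CFA_advance_loc (0x40), DW_CFA_offset (0x80) or DW_CFA_restore (0xc0), and
the low six bits carry the operand. Under that reading 0x41 is
DW_CFA_advance_loc with a delta of 1, which would be a one-byte encoding of
the same advance that ".byte 0x4" followed by ".long" (DW_CFA_advance_loc4)
spells out in five bytes. Whether the assembler performed that compaction
here is an assumption, not something established in this message. A small
sketch, with invented type and function names:

```c
#include <stdint.h>

/* Primary opcodes live in the top two bits of the byte. */
enum cfa_primary {
    CFA_EXTENDED    = 0x00,  /* low six bits are an extended opcode */
    CFA_ADVANCE_LOC = 0x40,  /* low six bits are the code delta */
    CFA_OFFSET      = 0x80,  /* low six bits are the register number */
    CFA_RESTORE     = 0xc0   /* low six bits are the register number */
};

struct cfa_byte {
    enum cfa_primary primary;
    unsigned operand;        /* value of the low six bits */
};

/* Split one call-frame instruction byte into primary opcode and operand. */
struct cfa_byte split_cfa_byte(uint8_t b)
{
    struct cfa_byte r;

    r.primary = (enum cfa_primary)(b & 0xc0);
    r.operand = b & 0x3fu;
    return r;
}
```

With this split, 0x83 0x02 reads as DW_CFA_offset for register 3 with a
factored offset of 2, and 0x0e 0x10 as the extended opcode
DW_CFA_def_cfa_offset with operand 16.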

WHERE AM I WRONG ?

I am getting CRAZY with this

Here is the full assembly listing of the FDE:

.LSFDE0:
   .long   .LEFDE0-.LASFDE0 /* first field 1c00 */
.LASFDE0:
   .long   .Lframe0 
   .quad   .LFB50

   .quad   .LFE50-.LFB50
   .byte   0x4
   .long   .LCFI0-.LFB50
   .byte   0xe
   .uleb128 0x10
   .byte   0x83
   .uleb128 0x2
   .align 8



Re: Expansion of __builtin_frame_address

2006-06-02 Thread Mark Mitchell
Richard Earnshaw wrote:

> The only chance you have for producing a backtrace() is to have
> unwinding information similar to that provided for exception unwinding. 
> This would describe to the unwinder how that frame's code is laid out so
> that it can unpick it.

I'd suggest we leave backtrace() aside, and just talk about
__builtin_frame_address(0), which does have well-defined semantics.
_b_f_a(0) is currently broken on ARM, and we all agree we should fix it.

I mildly disagree with David's comment that:

> It seems like a bad idea to force every port to define
> INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.

in that I think the default should be working code, and Mark's change
accomplishes that.

Of course, if _b_f_a(0) can be implemented more efficiently on some
target, there should be a hook to do that.  And, I think it's reasonable
to ask Mark to go through and add that optimization to ports that
already work that way, so that his patch doesn't regress any target.
(I'm not actually sure how _b_f_a(0) works on other targets, but not on
ARM.)

But, scrapping about the default probably isn't very productive.  The
important thing is to work out how _b_f_a(0) can be made to work on ARM.

Richard, I can't tell from your comments how you think _b_f_a(0) should
be implemented on ARM.  We could use Mark's logic (forcing a hard frame
pointer), but stuff it into INITIAL_FRAME_ADDRESS_RTX.  We could also
try to reimplement the thing you mentioned about using a pseudo, though
I guess we'd need to work out why that was thought a bad idea before.
What option do you suggest?

-- 
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Fri, 2006-06-02 at 16:35, Daniel Jacobowitz wrote:
> On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> > On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > > I agree that in general you need ancillary information to get a
> > > backtrace on Arm. However, if you assume only Arm code and
> > > -fno-omit-frame-pointer, then you can walk the frames. Given this
> > > assumption I think it makes sense to have _b_f_a(0) force the use of a
> > > frame pointer.
> > > 
> > 
> > No, in the general case you can't.  Because ARM and Thumb frames are
> > laid out differently.  In ARM code the frame pointer is in r11 (when not
> > eliminated); in thumb code it is in r7 (because r11 can't be used in
> > memory insns).
> 
> I'm reading these two paragraphs and the two of you seem to be in
> violent agreement.  Paul assumed ARM code only.

Well, that's a pretty limiting assumption given that ARM and thumb code
can be freely intermixed.  Indeed, I've often wondered if -Os should
default to Thumb code on CPUs that can support it (and thumb code can
corrupt the ARM frame register since it doesn't consider it to be
special in any way -- it's just a call-saved register).  I've also
pondered making the compiler ignore -f[no-]omit-frame-pointer and to
only use one in cases where the stack is dynamically adjustable.

R.



Re: Expansion of __builtin_frame_address

2006-06-02 Thread Paul Brook
On Friday 02 June 2006 16:44, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 16:35, Daniel Jacobowitz wrote:
> > On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> > > On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > > > I agree that in general you need ancillary information to get a
> > > > backtrace on Arm. However if you assume only Arm code and
> > > > -fno-omit-frame-pointer then you can walk the frames. Given this
> > > > assumption I think it makes sense to have _b_f_a(0) force the use of a
> > > > frame pointer.
> > >
> > > No, in the general case you can't.  Because ARM and Thumb frames are
> > > laid out differently.  In ARM code the frame pointer is in r11 (when
> > > not eliminated); in thumb code it is in r7 (because r11 can't be used
> > > in memory insns).
> >
> > I'm reading these two paragraphs and the two of you seem to be in
> > violent agreement.  Paul assumed ARM code only.
>
> Well, that's a pretty limiting assumption given that ARM and thumb code
> can be freely intermixed.  Indeed, I've often wondered if -Os should
> default to Thumb code on CPUs that can support it (and thumb code can
> corrupt the ARM frame register since it doesn't consider it to be
> special in any way -- it's just a call-saved register).  I've also
> pondered making the compiler ignore -f[no-]omit-frame-pointer and to
> only use one in cases where the stack is dynamically adjustable.

Ok, let me put it a different way.

How is __builtin_frame_address(0) useful if you don't make these assumptions, 
and what would it be used for?

For the record I agree that __builtin_return_address(0) has use and should not 
force a frame pointer.
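
As a point of comparison, here is a small sketch of __builtin_return_address(0), which only needs the caller's return address and not a walkable frame chain. The function name is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: returns the address the current call will
   return to, i.e. a location inside whichever function called us.
   noinline ensures there is a real call to return from.  */
__attribute__((noinline))
void *who_called_me(void)
{
    return __builtin_return_address(0);
}
```

A typical use is tagging log or allocation records with the call site, which does not require the frame layout to be known.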

Paul


Re: Expansion of __builtin_frame_address

2006-06-02 Thread Richard Earnshaw
On Fri, 2006-06-02 at 16:46, Mark Mitchell wrote:
> Richard, I can't tell from your comments how you think _b_f_a(0) should
> be implemented on ARM.  We could use Mark's logic (forcing a hard frame
> pointer), but stuff it into INITIAL_FRAME_ADDRESS_RTX.  We could also
> try to reimplement the thing you mentioned about using a pseudo, though
> I guess we'd need to work out why that was thought a bad idea before.
> What option do you suggest?

I think I need to understand first what _b_f_a(0) would be used for. 
Until I understand that I can't really say how best it should be
implemented.  One _possible_ implementation that would be reasonable
would be the dwarf CFA value for the function: but that's very different
from both the current ARM r11 value or the Thumb r7 value in functions
that use a frame register.  However, it is well defined in both ARM and
Thumb code.

Note that in ARM code r11 points near to the top of the frame, but in
Thumb code r7 points to the bottom of the frame (in gcc-4.2 or later,
since you can't use negative offsets in memory addresses).

R.





Re: Bug in gnu assembler?

2006-06-02 Thread Andrew Pinski


On Jun 2, 2006, at 8:46 AM, jacob navia wrote:


> How to reproduce this problem
> -
>
> WHERE AM I WRONG ?


You should write to [EMAIL PROTECTED] if you want
a high probability of your question about the assembler
being answered.

-- Pinski


Re: GCC SC request about ecj

2006-06-02 Thread Per Bothner

Richard Stallman wrote last night:

 I agree to the use of the Eclipse front end to generate
 Java byte codes.

Note this does not mean importing Eclipse code into the gcc source or
release tree.  We need to decide on a practical way to have people
grab a compatible version of ecj.
--
--Per Bothner
[EMAIL PROTECTED]   http://per.bothner.com/


Re: comparing DejaGNU results

2006-06-02 Thread James Lemke
I took a quick pass at implementing the comparisons in a more suitable
language.  Run time is now a few seconds on both platforms.  About the
same as compare_tests on my old ibook/OSX and much faster on FC3.
Trials show the same results as before.

For anyone interested, the new version is attached.
Jim.

-- 
Jim Lemke   [EMAIL PROTECTED]   Orillia, Ontario


dg-cmp-results.sh
Description: application/shellscript


Re: GCC SC request about ecj

2006-06-02 Thread Joe Buck
On Fri, Jun 02, 2006 at 10:59:58AM -0700, Per Bothner wrote:
> Richard Stallman wrote last night:
> 
>  I agree to the use of the Eclipse front end to generate
>  Java byte codes.
> 
> Note this does not mean importing Eclipse code into the gcc source or
> release tree.  We need to decide on a practical way to have people
> grab a compatible version of ecj.

Treat it like GMP, I guess; it's an external dependency.  Tell people
where to get it; have configure test for its presence and refuse to build
anything that depends on it if it isn't found.


gcc 4.1.1 build reports: 3 gnu/linux flavors, HP/UX, Solaris 2.8

2006-06-02 Thread Joe Buck
Hi,

Here are some gcc 4.1.1 build reports.

#1: i686-pc-linux-gnu, Red Hat EL3: C, C++, ObjC, and Java.
"Native" tools were used.
Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00019.html

#2: ia64-unknown-linux-gnu, Red Hat Advanced Workstation 2.1AW. C, C++, ObjC.
binutils 2.16.1 were used.
Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00065.html

#3: x86_64-unknown-linux-gnu, Red Hat EL3: C, C++, ObjC, and Java.
"Native" tools were used.
Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00061.html

#4: hppa2.0w-hp-hpux11.00, using GNU as version 2.15, native C compiler
for bootstrap, native linker.  C, C++, ObjC.
Results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00122.html

I had to install a new makeinfo to keep the build from bombing, even
though the failure seems to be in fastjar, and Java won't build on
this platform.

#5: sparc-sun-solaris2.8; as and ld from binutils 2.16.1
Build failure while building the Java library.  I'll send a separate
message on this.






Solaris 2.8 build failure for 4.1.1 (libtool/libjava)

2006-06-02 Thread Joe Buck
I haven't tried to build Java on Solaris in quite a while because it
takes so long.

My attempt to build on Solaris 2.8 with binutils 2.16.1 died with

/bin/ksh ./libtool --tag=GCJ --mode=link 
/remote/atg2/jbuck/solaris.tmp/411/gcc/gcj 
-B/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/ 
-B/remote/atg2/jbuck/solaris.tmp/411/gcc/ 
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava -g -O2  -o 
jv-convert --main=gnu.gcj.convert.Convert -rpath /u/jbuck/cvs.sol2/4.1.1/lib 
-shared-libgcc   
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/.libs 
libgcj.la 
/remote/atg2/jbuck/solaris.tmp/411/gcc/gcj 
-B/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/ 
-B/remote/atg2/jbuck/solaris.tmp/411/gcc/ -g -O2 -o .libs/jv-convert 
--main=gnu.gcj.convert.Convert -shared-libgcc  
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava 
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/.libs 
./.libs/libgcj.so 
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libstdc++-v3/src 
-L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libstdc++-v3/src/.libs
 -lpthread -lrt -ldl -L/remote/atg2/jbuck/solaris.tmp/411/./gcc -L/usr/ccs/lib 
-lgcc_s -lgcc_s -Wl,--rpath -Wl,/u/jbuck/cvs.sol2/4.1.1/lib
/u/jbuck/gnu.sol2/bin/ld: unrecognized option '-Wl,-rpath'
/u/jbuck/gnu.sol2/bin/ld: use the --help option for usage information
collect2: ld returned 1 exit status

It's GNU ld version 2.16.1.  This is strange; I would have expected the
linker to get just -rpath: -Wl should tell gcj to pass the following
option to the linker.
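
To make the expectation concrete: the driver is supposed to strip the "-Wl," prefix and hand each comma-separated piece to the linker as a separate argument. The sketch below illustrates that splitting only; split_wl_option is a hypothetical helper written for this note, not code from gcc, gcj or libtool.

```c
#include <string.h>

/* Hypothetical helper: split "-Wl,a,b" into "a" and "b".
   Returns the number of pieces stored in out, or -1 if arg does not
   start with "-Wl," or a piece does not fit in 63 characters.
   Pieces beyond max_out are silently dropped.  */
int split_wl_option(const char *arg, char out[][64], int max_out)
{
    const char *p;
    int n = 0;

    if (strncmp(arg, "-Wl,", 4) != 0)
        return -1;

    p = arg + 4;
    while (*p != '\0' && n < max_out) {
        size_t len = strcspn(p, ",");   /* length up to next comma */

        if (len >= 64)
            return -1;
        memcpy(out[n], p, len);
        out[n][len] = '\0';
        n++;
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return n;
}
```

With this model, "-Wl,-rpath,/some/dir" would reach the linker as the two arguments "-rpath" and "/some/dir". An ld error that still shows the "-Wl," prefix suggests the option reached ld without going through that step.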

Any clues?


Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Steven Bosscher

On 6/2/06, Davis, Mark <[EMAIL PROTECTED]> wrote:

> Question: does gcc now know the difference between prefetching to cache L1 via
> "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?


The ia64 backend knows the difference, see the prefetch pattern in ia64.md.

But ia64 is the only backend that supports this kind of explicit
locality parameter. And since no-one from the ia64 community cared
much about gcc until recently, gcc's prefetching pass (which is
limited anyway) does not generate lfetch.nt1 or other prefetches with
explicit locality parameters.



> For floating point data, the latter is the only interesting case because float
> loads only access the L2.  Thus using "lfetch" for floating point arrays will
> unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only seems to generate
> "lfetch", which is why I ask...)


You could experiment with this for ia64 by hacking issue_prefetch_ref
in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
point types.
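
At the source level the same experiment can be approximated by hand with __builtin_prefetch, whose third argument is the locality hint. This is a sketch only: the function name and the prefetch distance of 16 elements are arbitrary choices for illustration, and whether locality 1 actually selects lfetch.nt1 on ia64 depends on the backend's prefetch pattern.

```c
#include <stddef.h>

/* Arbitrary prefetch distance chosen for this sketch.  */
enum { PREFETCH_AHEAD = 16 };

/* Sum a float array while prefetching ahead with a reduced locality
   hint (rw = 0 for read, locality = 1 for low temporal locality).  */
float sum_floats_prefetch_l2(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Only form addresses that are inside the array.  */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

Both the rw and locality arguments must be compile-time constants, so a per-type choice has to be made in the compiler (or via separate code paths), not at run time.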

Gr.
Steven


Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Steven Bosscher

On 6/3/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:

> > For floating point data, the latter is the only interesting case because
> > float loads only access the L2.  Thus using "lfetch" for floating point
> > arrays will unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only
> > seems to generate "lfetch", which is why I ask...)
>
> You could experiment with this for ia64 by hacking issue_prefetch_ref
> in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
> point types.


E.g. something like this, which is (needless to say) untested but
something you could play with.

Gr.
Steven
Index: tree-ssa-loop-prefetch.c
===
--- tree-ssa-loop-prefetch.c	(revision 114315)
+++ tree-ssa-loop-prefetch.c	(working copy)
@@ -816,7 +816,7 @@ static void
 issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
 {
   HOST_WIDE_INT delta;
-  tree addr, addr_base, prefetch, params, write_p;
+  tree addr, addr_base, prefetch, params, write_p, locality;
   block_stmt_iterator bsi;
   unsigned n_prefetches, ap;
 
@@ -838,11 +838,21 @@ issue_prefetch_ref (struct mem_ref *ref,
 			  addr_base, build_int_cst (ptr_type_node, delta));
   addr = force_gimple_operand_bsi (&bsi, unshare_expr (addr), true, NULL);
 
-  /* Create the prefetch instruction.  */
+  /* Create the prefetch instruction.  Do this by building a call to
+ `void __builtin_prefetch (const void *ADDR, int RW, int LOCALITY)'.
+
+	 ??? The `locality' parameter is a shameless, untested hack to
+	 force lfetch.nt1 -- hopefully.  */
   write_p = ref->write_p ? integer_one_node : integer_zero_node;
-  params = tree_cons (NULL_TREE, addr,
-			  tree_cons (NULL_TREE, write_p, NULL_TREE));
- 
+  locality = FLOAT_TYPE_P (TREE_TYPE (ref->mem))
+		 ? integer_one_node : integer_zero_node;
+  params = tree_cons (NULL_TREE,
+			  addr,
+			  tree_cons (NULL_TREE,
+ write_p,
+ tree_cons (NULL_TREE,
+		locality,
+		NULL_TREE)));
   prefetch = build_function_call_expr (built_in_decls[BUILT_IN_PREFETCH],
 	   params);
   bsi_insert_before (&bsi, prefetch, BSI_SAME_STMT);


gcc-4.1-20060602 is now available

2006-06-02 Thread gccadmin
Snapshot gcc-4.1-20060602 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.1-20060602/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.1 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_1-branch 
revision 114329

You'll find:

gcc-4.1-20060602.tar.bz2  Complete GCC (includes all of below)

gcc-core-4.1-20060602.tar.bz2 C front end and core compiler

gcc-ada-4.1-20060602.tar.bz2  Ada front end and runtime

gcc-fortran-4.1-20060602.tar.bz2  Fortran front end and runtime

gcc-g++-4.1-20060602.tar.bz2  C++ front end and runtime

gcc-java-4.1-20060602.tar.bz2 Java front end and runtime

gcc-objc-4.1-20060602.tar.bz2 Objective-C front end and runtime

gcc-testsuite-4.1-20060602.tar.bz2  The GCC testsuite

Diffs from 4.1-20060526 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.1
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: comparing DejaGNU results

2006-06-02 Thread Mike Stump

On Jun 2, 2006, at 11:08 AM, James Lemke wrote:

> I took a quick pass at implementing the comparisons in a more suitable
> language.  Run time is now a few seconds on both platforms.  About the
> same as compare_tests on my old ibook/OSX and much faster on FC3.


Since Ben and I seem interested in this, I think we should check in  
this version.  It seems portable enough and useful enough.  Any  
objections from the crowd?


RE: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang

--- "Davis, Mark" <[EMAIL PROTECTED]>:

> Canqun,
> 
> Nice job getting this ready for the current version of gcc!
> 
> Question: does gcc now know the difference between prefetching to cache L1
> via "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?
> For floating point data, the latter is the only interesting case because
> float loads only access the L2.  Thus using "lfetch" for floating point
> arrays will unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only
> seems to generate "lfetch", which is why I ask...)
> 

Yes, GCC does. I have tried this with the old prefetch implementation at the
RTL level and the new one at the tree level, but saw no significant
performance difference for the SPECfp2000 and NAS benchmarks.
Nevertheless, it is worth taking more time to investigate.

Canqun Yang


> Thanks,
> Mark 
> 
> -Original Message-
> From: Canqun Yang [mailto:[EMAIL PROTECTED] 
> Sent: Friday, June 02, 2006 5:14 AM
> To: gcc@gcc.gnu.org; [EMAIL PROTECTED]
> Subject: [patch] Improve loop array prefetch for IA-64
> 
> Hi, all
> 
> This patch results a performance increase of 4% for SPECfp2000 and 13% for
> NAS benchmark suite on Itanium-2 system, respectively. More performance
> increase is hopeful by further tuning the parameters and improving the
> prefetch algorithm at tree level.
> 
> 
> Canqun Yang
> 
> 



gen_lowpart vs big endian insv

2006-06-02 Thread DJ Delorie

h8300 has an HImode insv pattern.  If you try to use it with an SImode
argument, expmed.c uses gen_lowpart to force it into the desired mode.
However, gen_lowpart eventually fails for pseudos on big endian:

rtx
gen_rtx_SUBREG (enum machine_mode mode, rtx reg, int offset)
{
  gcc_assert (validate_subreg (mode, GET_MODE (reg), reg, offset));
  return gen_rtx_raw_SUBREG (mode, reg, offset);
}

validate_subreg refuses to use a subreg to change the address of a
pseudo that could be in memory (i.e. SI->HI on big endian).
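
The endianness dependence comes from where the low-order part of a wider value sits in memory. A minimal sketch of that arithmetic follows; lowpart_byte_offset is a hypothetical name used here for illustration, and the sketch ignores the case where word order and byte order differ.

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of the low-order outer_bytes of a
   value that occupies inner_bytes in memory.  On little endian the
   low part starts at offset 0; on big endian it is at the end.  */
size_t lowpart_byte_offset(size_t inner_bytes, size_t outer_bytes,
                           int big_endian)
{
    if (outer_bytes >= inner_bytes)
        return 0;               /* not a narrowing access */
    return big_endian ? inner_bytes - outer_bytes : 0;
}
```

For SI->HI (4 bytes to 2 bytes) this gives offset 0 on little endian and offset 2 on big endian, so taking the low part of a value that may live in memory means changing its address on big-endian targets such as h8300.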

So... where is the bug or false assumption here?  The test case is
h8300-elf vs gcc.dg/20040310-1.c with "-O1 -msx"

Thanks,
DJ


Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Andi Kleen
"Steven Bosscher" <[EMAIL PROTECTED]> writes:

> On 6/2/06, Davis, Mark <[EMAIL PROTECTED]> wrote:
> > Question: does gcc now know the difference between prefetching to cache L1
> > via "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?
> 
> The ia64 backend knows the difference, see the prefetch pattern in ia64.md.
> 
> But ia64 is the only backend that supports this kind of explicit
> locality parameter. And since no-one from the ia64 community cared
> much about gcc until recently, gcc's prefetching pass (which is
> limited anyway) does not generate lfetch.nt1 or other prefetches with
> explicit locality parameters.

Actually SSE X86 has prefetches with different locality hints (T0, T1, T2, NTA)

However x86 always needs to have the items in L1 cache to do anything
with them even for FP data so it might not be very useful to do this
particular optimization for it.

T0 vs NTA is useful though and at least AMD K8 can make use of them - when
data is streamed and not reused and there is a lot of it then NTA is a good
idea.

> > For floating point data, the latter is the only interesting case because
> > float loads only access the L2.  Thus using "lfetch" for floating point
> > arrays will unnecessarily wipe out the contents of L1.  (gcc 3.2.3 only
> > seems to generate "lfetch", which is why I ask...)
> 
> You could experiment with this for ia64 by hacking issue_prefetch_ref
> in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
> point types.

Perhaps it could generate different prefetches based on the array size being
worked on?

I guess e.g. for a 1MB array walk NTA is probably a good idea (with the 1MB
being a tunable)
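
A rough sketch of that size-based choice, written against __builtin_prefetch. The names and the threshold constant are assumptions made for illustration; locality 0 means "no temporal locality" and 3 means "keep in all cache levels" in GCC's builtin, and how each value maps to an actual instruction is up to the target backend.

```c
#include <stddef.h>

/* Hypothetical tunable, set to the 1MB figure suggested above.  */
enum { STREAMING_THRESHOLD_BYTES = 1024 * 1024 };

/* Pick a locality hint from the size of the array being walked:
   large, streamed arrays get 0 (non-temporal), small ones get 3.  */
int prefetch_locality_for(size_t array_bytes)
{
    return array_bytes >= STREAMING_THRESHOLD_BYTES ? 0 : 3;
}

/* __builtin_prefetch needs constant rw/locality arguments, so the
   run-time choice is turned into two constant call sites.  */
void prefetch_with_locality(const void *addr, int locality)
{
    if (locality == 0)
        __builtin_prefetch(addr, 0, 0);
    else
        __builtin_prefetch(addr, 0, 3);
}
```

Inside the compiler the array size is often known statically, in which case the choice could be made once per loop and not per access.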

-Andi