Re: SSA alias representation

2008-02-26 Thread Fran Baena
>
> When there are no symbols in the pointer's points-to set.
>

I am beginning to understand. But something still remains unclear to me:

1) Once the SSA renaming has been done, the alias analysis begins
(points-to sets). Does each version of a base pointer have its own
associated points-to set? Like:

if ()
  t_1 = & a_1;      t_1 points-to { a_1 }
else
  if ()
    t_2 = & b_1;    t_2 points-to { b_1 }
  else
    t_3 = & c_1;    t_3 points-to { c_1 }

#  t_4 = PHI <t_1, t_2, t_3>
p_1 = t_4;          p_1 and t_4 point-to { a_1, b_1, c_1 }

2) A virtual operand is inserted for each symbol found in a points-to
set, and the virtual operands are inserted after the SSA renaming.
So how are virtual operands versioned to keep a consistent version
state between real and virtual operands?  For instance, given the
following code

if ()
  t_1 = & a_1;      t_1 points-to { a_1 }
else
  if ()
    t_2 = & b_1;    t_2 points-to { b_1 }
  else
    t_3 = & c_1;    t_3 points-to { c_1 }

#  t_4 = PHI <t_1, t_2, t_3>
p_1 = t_4;          p_1 and t_4 point-to { a_1, b_1, c_1 }

*p_2 = 5;           *p_2 -> NMT that is associated to the points-to
                    set { a_1, b_1, c_1 }

will the virtual operands be placed and versioned like this?

if ()
  t_1 = & a_1;      t_1 points-to { a_1 }
else
  if ()
    t_2 = & b_1;    t_2 points-to { b_1 }
  else
    t_3 = & c_1;    t_3 points-to { c_1 }

#  t_4 = PHI <t_1, t_2, t_3>
# VUSE <a_1>
# VUSE <b_1>
# VUSE <c_1>
p_1 = t_4;          p_1 and t_4 point-to { a_1, b_1, c_1 }


# a_2 = VDEF <a_1>
# b_2 = VDEF <b_1>
# c_2 = VDEF <c_1>
*p_2 = 5;           *p_2 -> NMT that is associated to the points-to
                    set { a_1, b_1, c_1 }


Thank you very much for your help.

Fran


Re: optimizing predictable branches on x86

2008-02-26 Thread Jan Hubicka
Hi,
> Core2 follows a similar pattern, although it's not seeing any
> slowdown in the "no deps, predictable, jmp" case like K8 does.
> 
> Any comments? (please cc me) Should gcc be using conditional jumps
> more often, eg. in the case of __builtin_expect()?

The problem is that in general GCC's branch prediction algorithms are
very poor at predicting the predictability of a branch: they are pretty
good at guessing the outcome, but that is it.

The only cases where we do so quite reliably IMO are:
  1) loop branches that are not interesting for cmov conversion
  2) branches leading to noreturn calls, also not interesting
  3) the builtin_expect case mentioned
  4) when profile feedback is around, to some degree (ie we know when the
  branch is very likely or very unlikely; we don't simulate what the
  hardware will do on it).

I guess we can implement the machinery for 3 and 4 (in fact I once
played with adding an EDGE_PREDICTABLE_P predicate that basically tested
whether the estimated probability of the branch is <5% or >95%), but I
never got really noticeable improvements out of it and gave up.

That was before Core2 times, so it might help now.  But it needs
updating for the backend cost interface, as ifcvt is a bit inflexible
in this regard.  I had BRANCH_COST and PREDICTABLE_BRANCH_COST macros.
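
To make case 3 concrete, a minimal illustration (the unlikely() wrapper
is my own shorthand here, in the Linux style):

/* The programmer asserts the branch is almost always false, so the
   compiler can treat it as predictable when choosing between a
   conditional jump and a cmov. */
#define unlikely(x) __builtin_expect (!!(x), 0)

int process (int err)
{
  if (unlikely (err))   /* predicted not taken */
    return -1;
  return 0;
}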

Honza


Re: optimizing predictable branches on x86

2008-02-26 Thread Jan Hubicka
> Hi,
> > Core2 follows a similar pattern, although it's not seeing any
> > slowdown in the "no deps, predictable, jmp" case like K8 does.
> > 
> > Any comments? (please cc me) Should gcc be using conditional jumps
> > more often, eg. in the case of __builtin_expect()?
> 
> The problem is that in general GCC's branch prediction algorithms are
> very poor at predicting the predictability of a branch: they are pretty
> good at guessing the outcome, but that is it.
> 
> The only cases where we do so quite reliably IMO are:
>   1) loop branches that are not interesting for cmov conversion
>   2) branches leading to noreturn calls, also not interesting
>   3) the builtin_expect case mentioned
>   4) when profile feedback is around, to some degree (ie we know when the
>   branch is very likely or very unlikely; we don't simulate what the
>   hardware will do on it).
> 
> I guess we can implement the machinery for 3 and 4 (in fact I once
> played with adding an EDGE_PREDICTABLE_P predicate that basically tested
> whether the estimated probability of the branch is <5% or >95%), but I
> never got really noticeable improvements out of it and gave up.

Just for those who might be interested, I found the old patch.
I will try to find time to update it to mainline, but if someone beats
me to it, I definitely won't complain.

Index: expr.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/expr.h,v
retrieving revision 1.171
diff -c -3 -p -r1.171 expr.h
*** expr.h  8 Sep 2004 18:44:56 -   1.171
--- expr.h  25 Sep 2004 13:22:22 -
*** Software Foundation, 59 Temple Place - S
*** 38,43 ****
--- 38,49 ----
  #ifndef BRANCH_COST
  #define BRANCH_COST 1
  #endif
+ #ifndef PREDICTABLE_BRANCH_COST
+ #define PREDICTABLE_BRANCH_COST BRANCH_COST
+ #endif
+ #ifndef COLD_BRANCH_COST
+ #define COLD_BRANCH_COST BRANCH_COST
+ #endif
  
  /* This is the 4th arg to `expand_expr'.
 EXPAND_STACK_PARM means we are possibly expanding a call param onto
Index: ifcvt.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/ifcvt.c,v
retrieving revision 1.165
diff -c -3 -p -r1.165 ifcvt.c
*** ifcvt.c 17 Sep 2004 05:32:36 -  1.165
--- ifcvt.c 25 Sep 2004 13:22:22 -
*** cond_exec_process_if_block (ce_if_block_
*** 608,613 ****
--- 608,614 ----
  struct noce_if_info
  {
basic_block test_bb;
+   int branch_cost;
rtx insn_a, insn_b;
rtx x, a, b;
rtx jump, cond, cond_earliest;
*** noce_try_store_flag_constants (struct no
*** 869,888 ****
normalize = 0;
else if (ifalse == 0 && exact_log2 (itrue) >= 0
   && (STORE_FLAG_VALUE == 1
!  || BRANCH_COST >= 2))
normalize = 1;
else if (itrue == 0 && exact_log2 (ifalse) >= 0 && can_reverse
!  && (STORE_FLAG_VALUE == 1 || BRANCH_COST >= 2))
normalize = 1, reversep = 1;
else if (itrue == -1
   && (STORE_FLAG_VALUE == -1
!  || BRANCH_COST >= 2))
normalize = -1;
else if (ifalse == -1 && can_reverse
!  && (STORE_FLAG_VALUE == -1 || BRANCH_COST >= 2))
normalize = -1, reversep = 1;
!   else if ((BRANCH_COST >= 2 && STORE_FLAG_VALUE == -1)
!  || BRANCH_COST >= 3)
normalize = -1;
else
return FALSE;
--- 870,889 ----
normalize = 0;
else if (ifalse == 0 && exact_log2 (itrue) >= 0
   && (STORE_FLAG_VALUE == 1
!  || if_info->branch_cost >= 2))
normalize = 1;
else if (itrue == 0 && exact_log2 (ifalse) >= 0 && can_reverse
!  && (STORE_FLAG_VALUE == 1 || if_info->branch_cost >= 2))
normalize = 1, reversep = 1;
else if (itrue == -1
   && (STORE_FLAG_VALUE == -1
!  || if_info->branch_cost >= 2))
normalize = -1;
else if (ifalse == -1 && can_reverse
!  && (STORE_FLAG_VALUE == -1 || if_info->branch_cost >= 2))
normalize = -1, reversep = 1;
!   else if ((if_info->branch_cost >= 2 && STORE_FLAG_VALUE == -1)
!  || if_info->branch_cost >= 3)
normalize = -1;
else
return FALSE;
*** noce_try_addcc (struct noce_if_info *if_
*** 1014,1020 ****
  
/* If that fails, construct conditional increment or decrement using
 setcc.  */
!   if (BRANCH_COST >= 2
  && (XEXP (if_info->a, 1) == const1_rtx
  || XEXP (if_info->a, 1) == constm1_rtx))
  {
--- 1015,1021 ----
  
/* If that fails, construct conditional increment or decrement using
 setcc.  */
!   if (if_info->branch_cost >= 2
  && (XEXP (if_info->a, 1) == const1_rtx
  || XEXP (if_info->a, 1) == constm1_rtx))
  {
*** noce_try_store_flag_mask (struct noce_if
*** 1066,1072 ****
  
reversep = 0;
if (! no_new_pseudos
!   && (BRANCH_COST >= 2
  || STORE_FLAG_VALUE == -1)
  

Re: -mfmovd enabled by default for SH2A but not for SH4

2008-02-26 Thread Kaz Kojima
> --- ORIG/trunk/gcc/config/sh/sh.h 2007-12-07 09:11:38.0 +0900
> +++ LOCAL/trunk/gcc/config/sh/sh.h2008-02-25 19:09:48.0 +0900
> @@ -553,7 +553,7 @@ do {  
> \
>  {
> \
>sh_cpu = CPU_SH2A; \
>if (TARGET_SH2A_DOUBLE)
> \
> -target_flags |= MASK_FMOVD;  \
> +target_flags |= MASK_FMOVD | MASK_ALIGN_DOUBLE;  
> \

I've played with this patch and realized that it's not that
simple, unfortunately.  -mdalign changes not only the alignment
of doubles but also the calling conventions.  If you compile

void foo () { bar (1, 0x12345678abcdLL); }

with/without -mdalign, for example, you can see that effect.
This behavior with -mdalign has been consistent since 3.4.
It seems that such a change is too big for SH2A users.  Perhaps
this was one reason for not defaulting to -mdalign for SH2A.

Regards,
kaz


A question regarding -fwrapv flag

2008-02-26 Thread Revital1 Eres

Hello,

I am running the attached testcase (inspired from vect/vect-reduc-3.c
testcase) with -O3 -fwrapv on powerpc64-linux with trunk 4.4.

Here is a snippet from the testcase:

...

  unsigned short ub[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
  unsigned short uc[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  unsigned short udiff;


  udiff = 0;
  for (i = 0; i < n; i++) {
    udiff += (ub[i] - uc[i]);
  }

 ...

It seems that the use of the -fwrapv flag causes some of the variables
to be converted to short and I do not understand why.

Snippet from the .gimple dump file:

  i.2 = i;
  D.1654 = ub[i.2];
  i.2 = i;
  D.1655 = uc[i.2];
  D.1656 = D.1654 - D.1655;
  D.1657 = (short int) D.1656;
  udiff.3 = (short int) udiff;
  D.1659 = D.1657 + udiff.3;
  udiff = (short unsigned int) D.1659;

Thanks,
Revital

(See attached file: test-2.c)



Re: Bootstrap failure on x86_64

2008-02-26 Thread Richard Guenther
On Mon, 25 Feb 2008, Richard Guenther wrote:

> On Mon, 25 Feb 2008, H.J. Lu wrote:
> 
> > Uros failed with --enable-checking=release and I failed with
> > --enable-checking=assert.
> > You can try either one.
> 
> That reproduces it.  I have reverted the patch for now.

It turns out that PCH and ggc_free apparently don't play along well.
Either making ggc_free a NOP or disabling the ggc_free call in
cp/decl.c:2157 fixes the problem for me (that is, in duplicate_decls).

So we need to make sure to remove the decl from the global hashtable
before ggc_freeing it; otherwise another object may become live in
the freed memory and the global map will still point to it.
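
A self-contained illustration of the pattern with plain malloc/free
(the names here are made up; this is not the actual GCC code):

#include <stdlib.h>

void *global_map_entry;            /* stands in for the global hashtable */

int main (void)
{
  void *decl = malloc (32);
  global_map_entry = decl;         /* the map now points at decl */
  free (decl);                     /* like ggc_free: memory may be reused */
  void *other = malloc (32);       /* may land in the freed slot */
  /* global_map_entry may now alias 'other'; removing the map entry
     before the free is the fix described above. */
  free (other);
  return 0;
}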

Richard.


Re: A question regarding -fwrapv flag

2008-02-26 Thread Richard Guenther
On Tue, Feb 26, 2008 at 12:12 PM, Revital1 Eres <[EMAIL PROTECTED]> wrote:
>
>  Hello,
>
>  I am running the attached testcase (inspired from vect/vect-reduc-3.c
>  testcase) with -O3 -fwrapv on powerpc64-linux with trunk 4.4.
>
>  Here is a snippet from the testcase:
>
>  ...
>
>   unsigned short ub[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
>   unsigned short uc[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
>   unsigned short udiff;
>
>
>   udiff = 0;
>   for (i = 0; i < n; i++) {
>     udiff += (ub[i] - uc[i]);
>   }
>

This is done by fold (see the .original dump).  Both operands are
promoted to int according to the C rules, so fold sees

  (unsigned short)((int)ub[i] - (int)uc[i] + (int)udiff)

and thinks that

  (short unsigned int)((short int)(ub[i] - uc[i]) + (short int)udiff)

is a better way to compute it.
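
A small sketch of why the two forms agree for all inputs (my own
example, assuming 16-bit shorts and the modulo-wrapping semantics that
-fwrapv guarantees):

unsigned short wide (unsigned short a, unsigned short b, unsigned short acc)
{
  /* compute in int, truncate once at the end */
  return (unsigned short) ((int) a - (int) b + (int) acc);
}

unsigned short narrow (unsigned short a, unsigned short b, unsigned short acc)
{
  /* truncate early; with wrapping arithmetic the low 16 bits match */
  return (unsigned short) ((short) (a - b) + (short) acc);
}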

Richard.


Re: optimizing predictable branches on x86

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 21:14, Jan Hubicka wrote:
> Hi,
>
> > Core2 follows a similar pattern, although it's not seeing any
> > slowdown in the "no deps, predictable, jmp" case like K8 does.
> >
> > Any comments? (please cc me) Should gcc be using conditional jumps
> > more often, eg. in the case of __builtin_expect()?
>
> The problem is that in general GCC's branch prediction algorithms are
> very poor at predicting the predictability of a branch: they are pretty
> good at guessing the outcome, but that is it.

Yes, I guess this would be tricky. I wonder if there would be any use
in having a __builtin_predictable() type of thing. I know there
are cases where we could use it in Linux (eg. we have a lot of
tunable things, but usually they aren't changed often). Then again,
maybe the benefit of doing these annotations would be too small
to bother about.

Linux generally has reasonable likely/unlikely annotations, so I
guess we can first wait to see if predictable branch optimization
gives any benefit there. (or for those doing benchmarks with
profile feedback)

> The only cases where we do so quite reliably IMO are:
>   1) loop branches that are not interesting for cmov conversion
>   2) branches leading to noreturn calls, also not interesting
>   3) the builtin_expect case mentioned
>   4) when profile feedback is around, to some degree (ie we know when the
>   branch is very likely or very unlikely; we don't simulate what the
>   hardware will do on it).

At least on x86 it should also be a good idea to know which way the
branch is going to go: x86 doesn't have explicit branch hints, so you
really want to be able to optimize for the cold branch predictor case
when converting from cmov to conditional branches.


> I guess we can implement the machinery for 3 and 4 (in fact I once
> played with adding an EDGE_PREDICTABLE_P predicate that basically tested
> whether the estimated probability of the branch is <5% or >95%), but I
> never got really noticeable improvements out of it and gave up.
>
> That was before Core2 times, so it might help now.  But it needs
> updating for the backend cost interface, as ifcvt is a bit inflexible
> in this regard.  I had BRANCH_COST and PREDICTABLE_BRANCH_COST macros.

cmov performance seems to be pretty robust (I was surprised it is so
good), so it definitely seems like the right thing to do by default.
It will be hard to beat, but I hope there is room for some improvement.

Thanks,
Nick



Re: SSA alias representation

2008-02-26 Thread Diego Novillo
On Tue, Feb 26, 2008 at 04:44, Fran Baena <[EMAIL PROTECTED]> wrote:

>  if ()
>    t_1 = & a_1;      t_1 points-to { a_1 }
>  else
>    if ()
>      t_2 = & b_1;    t_2 points-to { b_1 }
>    else
>      t_3 = & c_1;    t_3 points-to { c_1 }
>
>  #  t_4 = PHI <t_1, t_2, t_3>
>  p_1 = t_4;          p_1 and t_4 point-to { a_1, b_1, c_1 }

Not quite: points-to sets always contain symbols, not SSA names.  Use
-fdump-tree-salias-all-vops to see how things are renamed.

Symbols with their address taken are only renamed when they appear as
virtual operands.  So, if you have:

p_3 = (i_5 > 10) ? &a : &b
a = 4

notice that 'a' is never renamed in the LHS of the assignment.  It's
renamed as a virtual operand:

p_3 = (i_5 > 10) ? &a : &b

# a_9 = VDEF <a_8>
a = 4
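
For anyone who wants to reproduce this, a minimal test case of my own
(the dump flag is the one mentioned above):

/* gcc -O1 -fdump-tree-salias-all-vops test.c */
int a, b;

int *choose (int i)
{
  int *p = (i > 10) ? &a : &b;  /* p's points-to set is { a, b } */
  a = 4;                        /* 'a' is renamed only as a virtual operand */
  return p;
}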


Diego.


Re: optimizing predictable branches on x86

2008-02-26 Thread J.C. Pizarro
Compiling and executing the code of Nick Piggin at
http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html

in my old Athlon64 Venice 3200+ 2.0 GHz,
3 GiB DDR400, 32-bit kernel, gcc 3.4.6, I got:

$ gcc -O3 -falign-functions=64 -falign-loops=64 -falign-jumps=64 -falign-labels=64 -march=i686 foo.c -o foo
$ ./foo
 no deps,   predictable -- C    code took  10.08ns per iteration
 no deps,   predictable -- cmov code took  11.07ns per iteration
 no deps,   predictable -- jmp  code took  11.25ns per iteration
has deps,   predictable -- C    code took  26.66ns per iteration
has deps,   predictable -- cmov code took  35.44ns per iteration
has deps,   predictable -- jmp  code took  18.89ns per iteration
 no deps, unpredictable -- C    code took  10.17ns per iteration
 no deps, unpredictable -- cmov code took  11.07ns per iteration
 no deps, unpredictable -- jmp  code took  22.51ns per iteration
has deps, unpredictable -- C    code took 104.02ns per iteration
has deps, unpredictable -- cmov code took 107.19ns per iteration
has deps, unpredictable -- jmp  code took 176.18ns per iteration
$

On this machine the conclusion is (> means slightly better than, >> means better):
1. jmp >> C >> cmov when it's predictable and has data dependencies.
2. C > cmov > jmp when it's predictable and has not data dependencies.
3. C > cmov >> jmp when it's unpredictable and has not data dependencies.
4. C > cmov >> jmp when it's unpredictable and has not data dependencies.

* Be careful, jmp is the worst when it's unpredictable
 (with or without data dependencies).
* But conditional jmp is the best when it's
 predictable AND has data dependencies.

   ;)


[tuples] Updated wiki with build instructions

2008-02-26 Thread Diego Novillo
Since the branch still does not bootstrap cleanly, we have to jump
through some hoops to do testing.  I've added configuration and build
instructions to the tuples wiki (http://gcc.gnu.org/wiki/tuples).


Diego.


Re: optimizing predictable branches on x86

2008-02-26 Thread J.C. Pizarro
On 2008/2/26, J.C. Pizarro <[EMAIL PROTECTED]>, I wrote:
>  4. C > cmov >> jmp when it's unpredictable and has not data dependencies.

I'm sorry for my typo; the correct version is (without the "not"):
4. C > cmov >> jmp when it's unpredictable and has data dependencies.

and my forgotten 3rd annotation:
* cmov is the worst when it's
predictable AND has data dependencies.


Re: optimizing predictable branches on x86

2008-02-26 Thread J.C. Pizarro
A final summary for good performance on the tested machines:

  + unpredictable:             * don't use conditional jmp (the worst).
                               * use cmov or the C version.

  + predictable:  + no deps:   * use cmov or the C version.

                  + has deps:  * don't use cmov (the worst).
                               * use conditional jmp (the best).


Re: optimizing predictable branches on x86

2008-02-26 Thread J.C. Pizarro
On Tuesday 26 February 2008 21:14, Jan Hubicka wrote:
> The only cases where we do so quite reliably IMO are:
>   1) loop branches that are not interesting for cmov conversion
>   2) branches leading to noreturn calls, also not interesting
>   3) the builtin_expect case mentioned
>   4) when profile feedback is around, to some degree (ie we know when the
>   branch is very likely or very unlikely; we don't simulate what the
>   hardware will do on it).

Without a profiler, we can blindly estimate that simple loop branches
are predictable, on the assumption that loops of many iterations are
potential enemies of the CPU (they consume many cycles).

For example, for this simple loop, human prediction is easy even
without a profiler:

for (;;) {  /* predictably does not branch to the end of the loop */
start:
  ...       /* hundreds of iterations (e.g. >99% branching inside,
               <1% branching outside) */
}           /* or a predictable branch to the start of the loop,
               depending on code generation */
end:

But for these complex loops (mutually nested), predictability without
a profiler is very hard!

loop1:
   ...
   if (cond1) then goto loop2 else loop3 endif  // to loop2 or loop3?
                                                // prediction is very hard
   ...
loop2:
   ...
   if (cond2) then goto loop1 else loop2 endif  // to loop1 or loop2?
                                                // prediction is very hard
   ...
loop3:
   ...
   if (cond3) then goto loop1 else loop3 endif  // to loop1 or loop3?
                                                // prediction is very hard
   ...
   goto loop1

There are branches that a human can predict, but there are also
branches that even a human cannot!

   Sincerely yours ;)


optimizing predictable branches (Was: ... on x86)

2008-02-26 Thread Joern Rennecke
This is also interesting for the ARC700 processor.

There is an issue if the flag for the conditionalized instruction is
set in the immediately preceding instruction and the result of the
conditionalized instruction is required in the immediately following
instruction.  When using a conditional branch with a short offset,
there is also the opportunity to combine a comparison or bit test
with the branch.

Moreover, since the ARCompact architecture has a lot more registers
than x86, if you don't use a frame pointer, there are also realistic
opportunities to use conditional function returns.

Already back when I was an SH maintainer, I was annoyed that there is
only one BRANCH_COST.  We should really have different ones for
predictable and unpredictable/mispredicted branches.

Also, it would make sense if the cost could be modified according to
whether the compiler thinks it will be able to schedule a delay slot
instruction.

Ideally alignment could also be taken into account, but that would
require doing register allocation first, so there appears to be no
viable pass ordering within the gcc infrastructure to make this work.

For an exact modeling, we should actually have three branch costs,
distinguishing the cost of having no prediction from having a wrong
prediction.
However, in 'hot' code we can assume we have some prediction - either
right or wrong - and 'cold' code would typically not matter, unless you
have a humongous program with very poor locality.

However, for these reasons I think that COLD_BRANCH_COST is a misnomer,
and could also prompt port writers to put the wrong value there,
since it's the mispredicted branches we are interested in.
MISPREDICTED_BRANCH_COST would be more descriptive.
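
As a concrete sketch, a port under such a scheme might define something
like the following (PREDICTABLE_BRANCH_COST matches Honza's patch above;
MISPREDICTED_BRANCH_COST is the rename proposed here; the values are
invented for illustration):

#define BRANCH_COST 2                 /* no prediction information */
#define PREDICTABLE_BRANCH_COST 1     /* branch expected to predict well */
#define MISPREDICTED_BRANCH_COST 10   /* branch expected to mispredict */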


decl_constant_value_for_broken_optimization

2008-02-26 Thread Ian Lance Taylor
I just spent a couple of hours looking into what turned out to be an
issue with decl_constant_value_for_broken_optimization, and I wanted
to record the notes.


The function was introduced here:
http://gcc.gnu.org/ml/gcc-patches/2000-10/msg00649.html

At the time Joseph added this comment:

/* Return either DECL or its known constant value (if it has one), but
   return DECL if pedantic or DECL has mode BLKmode.  This is for
   bug-compatibility with the old behavior of decl_constant_value
   (before GCC 3.0); every use of this function is a bug and it should
   be removed before GCC 3.1.  It is not appropriate to use pedantic
   in a way that affects optimization, and BLKmode is probably not the
   right test for avoiding misoptimizations either.  */


Needless to say, the function was not removed before gcc 3.1, and
indeed has not yet been removed seven years later.  As the comment
says, the existence of this function means that in some cases we
generate different code merely because -pedantic is specified.  This
led to a recent question on gcc-help.
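
For illustration, a guess at the kind of code affected (not the actual
example from gcc-help):

int f (void)
{
  const int v = 100;
  /* decl_constant_value_for_broken_optimization may substitute 100
     for 'v' here, but not when -pedantic is given, so -pedantic
     changes the generated code. */
  return v + 1;
}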


Joseph's patch removed the checking of pedantic from
decl_constant_value.  I tracked back to see where pedantic was added
to that function, and found it here:

Wed Feb 15 01:59:15 1989  Richard Stallman  (rms at sugar-bombs.ai.mit.edu)

* c-typeck.c (decl_constant_value): Disable opt. if pedantic or
outside functions, so that validity of program is never affected.

So it appears that the test for pedantic was added in the first place
purely to disable the optimization, perhaps because RMS perceived it
to be unsafe.


In the current compiler, it seems very likely that every call to
decl_constant_value_for_broken_optimization can simply be removed.
The constant propagation passes should implement the optimization.


I don't plan to work on this in the immediate future.

Ian


Draft SH uClinux FDPIC ABI

2008-02-26 Thread Joseph S. Myers
Here is a draft FDPIC ABI for SH uClinux, based on the FR-V FDPIC ABI.  
Please send any comments; CodeSourcery will be implementing the final ABI 
version in GCC and Binutils.


   The SH FDPIC ABI

 Joseph Myers

  CodeSourcery, Inc.
  February 25, 2008
 Version 0.1

Based on FR-V FDPIC ABI Version 1.0a by Kevin Buettner, Alexandre
Oliva and Richard Henderson, adapted for SH by Joseph Myers.
   
Introduction


This document describes extensions (and some minor changes) to the
existing SH ELF ABI (as used on GNU/Linux) required to support the
implementation of shared libraries on a system whose OS (and hardware)
require that processes share a common address space.  This document
will also attempt to explore the motivations behind and the
implications of these extensions.

One of the primary goals in using shared libraries is to reduce the
memory requirements of the overall system.  Thus, if two processes use
the same library, the hope is that at least some of the memory pages
will be shared between the two processes resulting in an overall
savings.  To realize these savings, tools used to build a program and
library must identify which sections may be shared and which must not
be shared.  The shared sections, when grouped together, are commonly
referred to as the "text segment" whereas the non-shared (grouped)
sections are commonly referred to as the "data segment".  The text
segment is read-only and is usually comprised of executable code and
read-only data.  The data segment must be writable and it is this fact
which makes it non-sharable.

Systems which utilize disjoint address spaces for their processes are
free to group the text and data segments in such a way that they
may always be loaded with fixed relative positions of the text
and data segments.  I.e., for a given load object, the offset from
the start of the text segment to the start of the data segment is
constant.  This property greatly simplifies the design of the
shared library machinery.

The design of the shared library mechanism described in this document
does not (and cannot) have this property.  Due to the fact that all
processes share a common address space, the text and data segments
will be placed at arbitrary locations relative to each other and will
therefore need a mechanism whereby executable code will always be able
to find its corresponding data.  One of the CPU's registers is
typically dedicated to hold the base address of the data segment. 
This register will be called the "FDPIC register" in this document. 
Such a register is sometimes used in systems with disjoint address
spaces too, but this is for efficiency rather than necessity.

The fact that the locations of the text and data segments are at
non-constant offsets with respect to each other also complicates
function pointer representation.  As noted above, executable code
must be able to find its corresponding data segment.  When making an
indirect function call, it is therefore important that both the
address of the function and the base address of the data segment are
available.  This means that a function pointer needs to be represented as
the address of a "function descriptor" which contains the address of
the actual code to execute as well as the corresponding data (FDPIC
register) address.
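
Purely as an illustration (this struct is not part of the ABI text, and
the field names are invented), a function descriptor can be pictured as:

struct fdpic_funcdesc
{
  void *entry;  /* address of the code to execute */
  void *got;    /* data (FDPIC register) address for its load module */
};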


FDPIC Register
--

The FDPIC register is used as a base register for accessing the global
offset table (GOT) and function descriptors.  Since both code and data
are relocatable, executable code may not contain any instruction
sequences which directly encode a pointer's value.  Instead, pointers
to global data are indirectly referenced via the global offset table. 
At load time, pointers contained in the global offset table are
relocated by the dynamic linker to point at the correct locations.

Register R12 is used as the FDPIC register; in this specification it
is caller-save, not callee-save, to avoid problems with PLT entries
needing to save the register.

Upon entry to a function, the caller-saved register R12 is the FDPIC
register.  As described above, it contains the GOT address for that
function.  R12 obtains its value in one of three ways:

1) By being inherited from the calling function in the case
   of a direct call to a function within the same load module.

2) By being set either in a PLT entry or in inlined PLT code.

3) By being set from a function descriptor as part of an
   indirect call.

The specifics associated with each of these cases are covered in
greater detail in "Procedure Linkage Table (PLT)" and "Function
Calls", below.

The prologue code of a non-leaf function should save R12 either on
the stack or in one of the callee-saved registers.  After each
function call, R12 must be restored if it is needed later on in the
function.  Direct calls to functions in 

Re: [PATCH] linux/fs.h - Convert debug functions declared inline __attribute__((format (printf,x,y) to statement expression macros

2008-02-26 Thread Matthew Wilcox
On Tue, Feb 26, 2008 at 08:02:27PM -0800, Joe Perches wrote:
> Converting inline __attribute__((format (printf,x,y) functions
> to macros or statement expressions produces smaller objects
> 
> before:
> $ size vmlinux
>    text    data     bss     dec     hex filename
> 4716770  474560  618496 5809826  58a6a2 vmlinux
> after:
> $ size vmlinux
>    text    data     bss     dec     hex filename
> 4716706  474560  618496 5809762  58a662 vmlinux

> -static inline void __attribute__((format(printf, 1, 2)))
> -__simple_attr_check_format(const char *fmt, ...)
> -{
> - /* don't do anything, just let the compiler check the arguments; */
> -}
> +/* don't do anything, just let the compiler check the arguments; */
> +
> +#define __simple_attr_check_format(fmt, args...) \
> + do { if (0) printk(fmt, ##args); } while (0)

That's very interesting.  It's only 64 bytes, but still, it's not
supposed to have any different effect.  Could you distill a test case
for the GCC folks and file it in their bugzilla?
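
A hypothetical distillation of the two variants being compared (my own
sketch, not the test case that was actually filed):

#include <stdio.h>

/* variant 1: empty inline function, exists only so the compiler
   checks the format arguments */
static inline void __attribute__((format(printf, 1, 2)))
check_fmt_inline(const char *fmt, ...)
{
}

/* variant 2: statement-expression style macro; the if (0) call is
   type-checked, then folded away */
#define check_fmt_macro(fmt, args...) \
	do { if (0) printf(fmt, ##args); } while (0)

int main(void)
{
	check_fmt_inline("%d\n", 42);
	check_fmt_macro("%d\n", 42);
	return 0;
}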

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


Re: [PATCH] linux/fs.h - Convert debug functions declared inline __attribute__((format (printf,x,y) to statement expression macros

2008-02-26 Thread Joe Perches
On Tue, 2008-02-26 at 21:13 -0700, Matthew Wilcox wrote:
> That's very interesting.  It's only 64 bytes, but still, it's not
> supposed to have any different effect.

Especially because __simple_attr_check_format is not even used
or called in an x86 defconfig.  It's powerpc/cell specific.

> Could you distill a test case for the GCC folks

I'll play around with it.




Re: [PATCH] linux/fs.h - Convert debug functions declared inline __attribute__((format (printf,x,y) to statement expression macros

2008-02-26 Thread David Rientjes
On Tue, 26 Feb 2008, Matthew Wilcox wrote:

> On Tue, Feb 26, 2008 at 08:02:27PM -0800, Joe Perches wrote:
> > Converting inline __attribute__((format (printf,x,y) functions
> > to macros or statement expressions produces smaller objects
> > 
> > before:
> > $ size vmlinux
> >    text    data     bss     dec     hex filename
> > 4716770  474560  618496 5809826  58a6a2 vmlinux
> > after:
> > $ size vmlinux
> >    text    data     bss     dec     hex filename
> > 4716706  474560  618496 5809762  58a662 vmlinux
> 
> > -static inline void __attribute__((format(printf, 1, 2)))
> > -__simple_attr_check_format(const char *fmt, ...)
> > -{
> > -   /* don't do anything, just let the compiler check the arguments; */
> > -}
> > +/* don't do anything, just let the compiler check the arguments; */
> > +
> > +#define __simple_attr_check_format(fmt, args...) \
> > +   do { if (0) printk(fmt, ##args); } while (0)
> 
> That's very interesting.  It's only 64 bytes, but still, it's not
> supposed to have any different effect.  Could you distill a test case
> for the GCC folks and file it in their bugzilla?
> 

I'm not seeing any change in text size with allyesconfig after applying 
this patch with latest git:

       text    data     bss      dec     hex filename
   32696210 5021759 6735572 44453541 2a64ea5 vmlinux.before
   32696210 5021759 6735572 44453541 2a64ea5 vmlinux.after

Joe, what version of gcc are you using?

David


Re: [PATCH] linux/fs.h - Convert debug functions declared inline __attribute__((format (printf,x,y) to statement expression macros

2008-02-26 Thread Joe Perches
On Tue, 2008-02-26 at 21:44 -0800, David Rientjes wrote:
> I'm not seeing any change in text size with allyesconfig after applying 
> this patch with latest git:

This is just x86 defconfig

> Joe, what version of gcc are you using?

$ gcc --version
gcc (GCC) 4.2.2 20071128 (prerelease) (4.2.2-3.1mdv2008.0)

It's definitely odd.
The .o size changes are inconsistent.
Some get bigger, some get smaller.

The versioning ones I understand, but I have no idea why the changes
in drivers/ or mm/ or net/ exist.

I think it's gcc optimization changes, but dunno...
Any good ideas?

$ git reset --hard
HEAD is now at 7704a8b... Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
$ make mrproper ; make defconfig ; make > /dev/null
$ size vmlinux
   text    data     bss     dec     hex filename
4716770  474560  618496 5809826  58a6a2 vmlinux
$ size $(find -type f -print | grep "\.o$" | grep -vP "(vmlinux|built-in|piggy|allsyms.)\.o$") > size.default
$ patch -p1 < inline/fs.h.d
$ make > /dev/null
$ size vmlinux
   text    data     bss     dec     hex filename
4716706  474560  618496 5809762  58a662 vmlinux
$ size $(find -type f -print | grep "\.o$" | grep -vP "(vmlinux|built-in|piggy|allsyms.)\.o$") > size.inline_fs
$ diff --unified=0 size.default size.inline_fs
--- size.default2008-02-26 22:18:33.0 -0800
+++ size.inline_fs  2008-02-26 22:33:27.0 -0800
@@ -21 +21 @@
-     79       0       0      79      4f ./arch/x86/boot/version.o
+     85       0       0      85      55 ./arch/x86/boot/version.o
@@ -335 +335 @@
-   5206      72      12    5290    14aa ./drivers/base/core.o
+   5201      72      12    5285    14a5 ./drivers/base/core.o
@@ -374 +374 @@
-  18192     104    1648   19944    4de8 ./drivers/char/tty_io.o
+  18184     104    1648   19936    4de0 ./drivers/char/tty_io.o
@@ -390 +390 @@
-   4293     560      24    4877    130d ./drivers/char/hpet.o
+   4287     560      24    4871    1307 ./drivers/char/hpet.o
@@ -473 +473 @@
-  38914      32     341   39287    9977 ./drivers/message/fusion/mptbase.o
+  38922      32     341   39295    997f ./drivers/message/fusion/mptbase.o
@@ -492 +492 @@
-  81665    2613       4   84282   1493a ./drivers/net/tg3.o
+  81659    2613       4   84276   14934 ./drivers/net/tg3.o
@@ -544 +544 @@
-  17508     845     552   18905    49d9 ./drivers/scsi/aic7xxx/aic79xx_osm.o
+  17510     845     552   18907    49db ./drivers/scsi/aic7xxx/aic79xx_osm.o
@@ -581 +581 @@
-     74    4480       0    4554    11ca ./drivers/scsi/scsi_wait_scan.mod.o
+     80    4480       0    4560    11d0 ./drivers/scsi/scsi_wait_scan.mod.o
@@ -774 +774 @@
-   1924       4       4    1932     78c ./fs/proc/kcore.o
+   1922       4       4    1930     78a ./fs/proc/kcore.o
@@ -776 +776 @@
-  41462     652      80   42194    a4d2 ./fs/proc/proc.o
+  41458     652      80   42190    a4ce ./fs/proc/proc.o
@@ -828 +828 @@
-   9583      80       0    9663    25bf ./fs/locks.o
+   9571      80       0    9651    25b3 ./fs/locks.o
@@ -870 +870 @@
-    277     396       4     677     2a5 ./init/version.o
+    281     396       4     681     2a9 ./init/version.o
@@ -926 +926 @@
-   8379     460       8    8847    228f ./kernel/sys.o
+   8381     460       8    8849    2291 ./kernel/sys.o
@@ -954 +954 @@
-  13337     188      73   13598    351e ./kernel/module.o
+  13341     188      73   13602    3522 ./kernel/module.o
@@ -1044 +1044 @@
-   1845       0       0    1845     735 ./mm/mremap.o
+   1841       0       0    1841     731 ./mm/mremap.o
@@ -1052 +1052 @@
-   8781      44    2196   11021    2b0d ./mm/swapfile.o
+   8777      44    2196   11017    2b09 ./mm/swapfile.o
@@ -1065 +1065 @@
-   2630       0       0    2630     a46 ./net/core/datagram.o
+   2631       0       0    2631     a47 ./net/core/datagram.o
@@ -1101 +1101 @@
-  13190      24       0   13214    339e ./net/ipv4/tcp_output.o
+  13192      24       0   13216    33a0 ./net/ipv4/tcp_output.o
@@ -1109 +1109 @@
-   6244     468       0    6712    1a38 ./net/ipv4/arp.o
+   6239     468       0    6707    1a33 ./net/ipv4/arp.o
@@ -1138 +1138 @@
-   4660     132      44    4836    12e4 ./net/ipv6/ip6_fib.o
+   4644     132      44    4820    12d4 ./net/ipv6/ip6_fib.o
@@ -1146 +1146 @@
-  16397      24       4   16425    4029 ./net/ipv6/mcast.o
+  16399      24       4   16427    402b ./net/ipv6/mcast.o
@@ -1159 +1159 @@
- 143799    7424    3036  154259   25a93 ./net/ipv6/ipv6.o
+ 143787    7424    3036  154247   25a87 ./net/ipv6/ipv6.o
@@ -1202 +1202 @@
-   2109     600       0    2709     a95 ./net/xfrm/xfrm_algo.o
+   2111     600       0    2711     a97 ./net/xfrm/xfrm_algo.o




Re: [PATCH] linux/fs.h - Convert debug functions declared inline __attribute__((format (printf,x,y) to statement expression macros

2008-02-26 Thread David Rientjes
On Tue, 26 Feb 2008, Joe Perches wrote:

> > I'm not seeing any change in text size with allyesconfig after applying 
> > this patch with latest git:
> 
> This is just x86 defconfig
> 

allyesconfig should be able to capture any text savings that this patch 
offers.

> > Joe, what version of gcc are you using?
> 
> $ gcc --version
> gcc (GCC) 4.2.2 20071128 (prerelease) (4.2.2-3.1mdv2008.0)
> 

My x86_64 defconfig with gcc 4.0.3 had no difference in text size after 
applying your patch, yet the same config on gcc 4.1.2 did:

   text    data     bss     dec     hex filename
5386112  846328  719560 6952000  6a1440 vmlinux.before
5386048  846328  719560 6951936  6a1400 vmlinux.after