Unroller gone wild

2010-03-08 Thread Piotr Wyderski
I have the following code:

    struct bounding_box {

        pack4sf m_Mins;
        pack4sf m_Maxs;

        void set(__v4sf v_mins, __v4sf v_maxs) {

            m_Mins = v_mins;
            m_Maxs = v_maxs;
        }
    };

    struct bin {

        bounding_box m_Box[3];
        pack4si      m_NL;
        pack4sf      m_AL;
    };

    static const std::size_t bin_count = 16;
    bin aBins[bin_count];

    for(std::size_t i = 0; i != bin_count; ++i) {

        bin& b = aBins[i];

        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
        b.m_NL = __v4si{ 0, 0, 0, 0 };
    }

where pack4sf/si are union-based wrappers for __v4sf/si, roughly of this
shape (simplified sketch):
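
    // assuming the usual GCC vector typedef, e.g.:
    // typedef float __v4sf __attribute__((vector_size(16)));
    union pack4sf {
        __v4sf v;
        float  f[4];

        pack4sf() {}
        pack4sf(__v4sf x) : v(x) {}
        pack4sf& operator=(__v4sf x) { v = x; return *this; }
    };

    // pack4si is analogous, wrapping __v4si / int[4].
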
GCC 4.5 on Core i7/Cygwin with

-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer

completely unrolled the loop into 112 movdqa instructions
(7 stores per iteration x 16 iterations), which is "a bit" too
aggressive. Should I file a bug report?
The processor has an 18-instruction prefetch queue and the loop is
perfectly predictable by the built-in branch prediction circuitry,
so translating it as-is would greatly reduce the fetch/decode
bandwidth consumed. Is there something like "#pragma nounroll"
to selectively disable this optimization?

Best regards
Piotr Wyderski


Re: Unroller gone wild

2010-03-08 Thread Richard Guenther
On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski  wrote:
> I have the following code:
>
>    struct bounding_box {
>
>        pack4sf m_Mins;
>        pack4sf m_Maxs;
>
>        void set(__v4sf v_mins, __v4sf v_maxs) {
>
>            m_Mins = v_mins;
>            m_Maxs = v_maxs;
>        }
>    };
>
>    struct bin {
>
>        bounding_box m_Box[3];
>        pack4si      m_NL;
>        pack4sf      m_AL;
>    };
>
>    static const std::size_t bin_count = 16;
>    bin aBins[bin_count];
>
>    for(std::size_t i = 0; i != bin_count; ++i) {
>
>        bin& b = aBins[i];
>
>        b.m_Box[0].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[1].set(g_VecInf, g_VecMinusInf);
>        b.m_Box[2].set(g_VecInf, g_VecMinusInf);
>        b.m_NL = __v4si{ 0, 0, 0, 0 };
>    }
>
> where pack4sf/si are union-based wrappers for __v4sf/si.
> GCC 4.5 on Core i7/Cygwin with
>
> -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
> -fomit-frame-pointer
>
> completely unrolled the loop into 112 movdqa instructions,
> which is "a bit" too agressive. Should I file a bug report?
> The processor has an 18 instructions long prefetch queue
> and the loop is perfectly predictable by the built-in branch
> prediction circuitry, so translating it as is would result in huge
> fetch/decode bandwidth reduction. Is there something like
> "#pragma nounroll" to selectively disable this optimization?

No, only --param max-completely-peel-times (which is 16)
or --param max-completely-peeled-insns (which probably should
then be way lower than the current 400).
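
E.g. something like

  --param max-completely-peel-times=1

added to your flags should keep that 16-iteration loop from being
completely peeled (untested, and note that it affects every loop in the
translation unit).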

Richard.

> Best regards
> Piotr Wyderski
>


Re: How to make 'long int' type be a PDImode?

2010-03-08 Thread Frank Isamov
On Mon, Mar 8, 2010 at 8:29 AM, Joern Rennecke
 wrote:
> Quoting Frank Isamov :
>
>> Hi,
>>
>> I'd like to make a backend which would have 48 bits for 'long' type.
>> (32 for int and 64 for long long).
>>
>> I have tried to define:
>> #define LONG_TYPE_SIZE  48
>
> That's not a partial integer mode; PDImode would have the same size as
> DImode,
> just not all bits would be significant.
>
>> and one of:
>> INT_MODE (PDI, 6);
>
> And that wouldn't be PDImode, more like THImode (three-halves integer mode,
> going by the precedent of TQFmode - three-quarter float mode - of the 1750a
> port in GCC prior to version 3.1)
>


I am sorry, I still can conclude how to make the backend use a certain
mode for the 'long' type.
My architecture implements 32- and 48-bit registers and instructions
operating on these registers. That is why my intention is to use 'int'
for 32-bit and 'long' for 48-bit operations. My attempts lead to the
backend ignoring the 48-bit path in spite of the LONG_TYPE_SIZE
definition, and 'long' is still processed as a 32-bit value. I think I
am missing something, so I am asking for help. Any piece of information
would be useful.
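
For reference, what I have at the moment is essentially just (simplified):

    /* in <target>.h */
    #define LONG_TYPE_SIZE 48

    /* in <target>-modes.def, one of the variants tried */
    INT_MODE (PDI, 6);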

Thank you,
Frank


Re: How to make 'long int' type be a PDImode?

2010-03-08 Thread Frank Isamov
On Mon, Mar 8, 2010 at 4:27 PM, Frank Isamov  wrote:
> On Mon, Mar 8, 2010 at 8:29 AM, Joern Rennecke
>  wrote:
>> Quoting Frank Isamov :
>>
>>> Hi,
>>>
>>> I'd like to make a backend which would have 48 bits for 'long' type.
>>> (32 for int and 64 for long long).
>>>
>>> I have tried to define:
>>> #define LONG_TYPE_SIZE  48
>>
>> That's not a partial integer mode; PDImode would have the same size as
>> DImode,
>> just not all bits would be significant.
>>
>>> and one of:
>>> INT_MODE (PDI, 6);
>>
>> And that wouldn't be PDImode, more like THImode (three-halves integer mode,
>> going by the precedent of TQFmode - three-quarter float mode - of the 1750a
>> port in GCC prior to version 3.1)
>>
>
>
> I am sorry, I still can conclude how to make the backend use a certain
> mode for 'long' type.
> My architecture implements 32- and 48- bit registers and instructions
> operating on these registers. That it why my intention is to use 'int'
> for 32 and 'long' for 48 bit operations. My attempts lead to the
> backend ignore 48 bits path inspite of the LONG_TYPE_SIZE definition
> and 'long' is still processed as 32 bit value. I think I am missing
> something, so I am applying for help. Any piece of information would
> be useful.
>
> Thank you,
> Frank
>

Correction: "I still can conclude " should be read as "I still cannot conclude"


(un)aligned accesses on x86 platform.

2010-03-08 Thread Paweł Sikora

hi,

during development of a cross-platform application on an x86 workstation
i've enabled alignment checking [1] to catch possible erroneous
code before it appears on a client's sparc/arm cpu with a sigbus ;)

it works pretty well and catches alignment violations, but Jakub Jelinek
told me (on the glibc bugzilla) that gcc on x86 can still dereference
an unaligned pointer (except for vector insns).
i suppose this means that gcc can in some cases emit e.g. a movl to
access a short int (or other such scenarios) and violate the cpu
alignment rules.

so, is it possible to instruct gcc-x86 to always use suitably-sized
loads/stores, like on sparc/arm?
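
for reference, the checking i enable is roughly this (a simplified
32-bit sketch; on x86-64 the flags push/pop and the stack register
differ):

    #include <string.h>

    /* set the AC (alignment check) flag, bit 18 of EFLAGS; the kernel
       must have CR0.AM enabled for it to take effect, which linux does */
    static void enable_alignment_check(void)
    {
        __asm__ __volatile__(
            "pushfl\n\t"
            "orl $0x40000, (%%esp)\n\t"
            "popfl"
            ::: "cc", "memory");
    }

    int main(void)
    {
        char buf[16];
        memset(buf, 0, sizeof buf);

        enable_alignment_check();

        /* deliberately misaligned access: with AC set this raises a
           sigbus instead of being silently handled by the hardware */
        int *p = (int *)(buf + 1);
        *p = 42;
        return 0;
    }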

[1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)

BR,
Pawel.


IRA conflict graph, multiple alternatives and commutative operands

2010-03-08 Thread Jie Zhang
I'm looking at PR 42258. I have a question about the IRA conflict graph
and multiple alternatives.


Below is an RTL insn just before register allocation pass:

(insn 7 6 12 2 pr42258.c:2 (set (reg:SI 136)
(mult:SI (reg:SI 137)
(reg/v:SI 135 [ x ]))) 33 {*thumb_mulsi3})

IRA generates the following conflict graph for r135, r136 and r137:

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a1(r135,l0) conflicts: a0(r136,l0) a2(r137,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a2(r137,l0) conflicts: a0(r136,l0) a1(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:

  regions=1, blocks=3, points=5
allocnos=3, copies=0, conflicts=0, ranges=3

Apparently this conflict graph is not optimal for any of the three
alternatives in the following instruction pattern:


(define_insn "*thumb_mulsi3"
  [(set (match_operand:SI  0 "register_operand" "=&l,&l,&l")
(mult:SI (match_operand:SI 1 "register_operand" "%l,*h,0")
 (match_operand:SI 2 "register_operand" "l,l,l")))]
  ...)

This conflict graph looks like a merge of the conflict graphs of the
three alternatives. Ideally, for the first and second alternatives we
should have


;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)

For the third alternative, it would be better to have

;; a0(r136,l0) conflicts: a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
  cp0:a0(r136)<->a2(r137)@1000:constraint

And the register allocator would use one of these more specific conflict
graphs for coloring. If we take the commutative operands into account,
we also have to add the following conflict graph as a candidate.


;; a0(r136,l0) conflicts: a2(r137,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)
  cp0:a0(r136)<->a1(r135)@1000:constraint

(Actually, this conflict graph gives the optimal result for the test
case in PR 42258.)


Now the problem is: when and how do we choose the alternative from which
the register allocator computes the conflict graph?


Yes, I have read the thread:

http://gcc.gnu.org/ml/gcc/2009-02/msg00215.html

This question does not seem easy. So, is there any practical method to
make the register allocator pick the third alternative and apply the
commutation before or during register allocation?


Thanks,
--
Jie Zhang
CodeSourcery
(650) 331-3385 x735


Re: (un)aligned accesses on x86 platform.

2010-03-08 Thread Andrew Pinski
You can define STRICT_ALIGNMENT to be 1 in i386.h, or provide an option
to turn it on/off like the rs6000 target does.
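
A minimal sketch of the first approach (untested; it means rebuilding
the compiler, and the stock i386.h defines the macro to 0):

    /* in gcc/config/i386/i386.h */
    #undef  STRICT_ALIGNMENT
    #define STRICT_ALIGNMENT 1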


Thanks,
Andrew Pinski

Sent from my iPhone

On Mar 8, 2010, at 7:37 AM, Paweł Sikora  wrote:


hi,

during development a cross platform appliacation on x86 workstation
i've enabled an alignemnt checking [1] to catch possible erroneous
code before it appears on client's sparc/arm cpu with sigbus ;)

it works pretty fine and catches alignment violations but Jakub Jelinek
had told me (on glibc bugzilla) that gcc on x86 can still dereference
an unaligned pointer (except for vector insns).
i suppose it means that gcc can emit e.g. movl for access a short int
(or maybe others scenarios) in some cases and violates cpu alignment rules.

so, is it possible to instruct gcc-x86 to always use suitable loads/stores
like on sparc/arm?

[1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)

BR,
Pawel.


Re: (un)aligned accesses on x86 platform.

2010-03-08 Thread Richard Guenther
2010/3/8 Paweł Sikora :
> hi,
>
> during development a cross platform appliacation on x86 workstation
> i've enabled an alignemnt checking [1] to catch possible erroneous
> code before it appears on client's sparc/arm cpu with sigbus ;)
>
> it works pretty fine and catches alignment violations but Jakub Jelinek
> had told me (on glibc bugzilla) that gcc on x86 can still dereference
> an unaligned pointer (except for vector insns).
> i suppose it means that gcc can emit e.g. movl for access a short int
> (or maybe others scenarios) in some cases and violates cpu alignment rules.
>
> so, is it possible to instruct gcc-x86 to always use suitable loads/stores
> like on sparc/arm?

Only by re-building it and setting STRICT_ALIGNMENT.

Richard.

> [1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)
>
> BR,
> Pawel.
>


Re: IRA conflict graph, multiple alternatives and commutative operands

2010-03-08 Thread Vladimir Makarov

Jie Zhang wrote:
I'm looking at PR 42258. I have a question on IRA conflict graph and 
multiple alternatives.


Below is an RTL insn just before register allocation pass:

(insn 7 6 12 2 pr42258.c:2 (set (reg:SI 136)
(mult:SI (reg:SI 137)
(reg/v:SI 135 [ x ]))) 33 {*thumb_mulsi3})

IRA generates the following conflict graph for r135, r136 and r137:

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a1(r135,l0) conflicts: a0(r136,l0) a2(r137,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a2(r137,l0) conflicts: a0(r136,l0) a1(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:

  regions=1, blocks=3, points=5
allocnos=3, copies=0, conflicts=0, ranges=3

Apparently this conflict graph is not an optimized one for any of the 
three alternatives in the following instruction pattern:


(define_insn "*thumb_mulsi3"
  [(set (match_operand:SI  0 "register_operand" "=&l,&l,&l")
(mult:SI (match_operand:SI 1 "register_operand" "%l,*h,0")
 (match_operand:SI 2 "register_operand" "l,l,l")))]
  ...)

This conflict graph seems like a merge of conflict graphs of the three 
alternatives. Ideally for the first and second alternatives, we should 
have


;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)

For the third alternative, we'd better have

;; a0(r136,l0) conflicts: a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
  cp0:a0(r136)<->a2(r137)@1000:constraint

And register allocator would use one of these more specific conflict 
graphs for coloring. If we take the commutative operands into count, 
we have to add the following conflict graph for choosing.


;; a0(r136,l0) conflicts: a2(r137,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)
  cp0:a0(r136)<->a1(r135)@1000:constraint

(Actually, this conflict graph will result in an optimal result for 
the test case in PR 42258.)


Now the problem is when and how to choose the alternative for register 
allocator to calculate the conflict graph?


Yes, I have read the thread:

http://gcc.gnu.org/ml/gcc/2009-02/msg00215.html

This question seems not easy. So is there any practical method to make 
register allocator pick up the third alternative and do commutation 
before or during register allocation?

I do not know of such a practical method.  IRA should create a correct
allocation whatever alternative reload chooses for the insns; otherwise
reload might generate wrong code.  As for code telling IRA (and reload)
to consider only certain alternatives, it is doable to implement, but
the question again is how to choose which alternatives to consider.


In your particular example, it might seem that the 1st alternative is
always worse than the third one if the first operand dies.  But even
such a check is not easy to implement.  Also, that might not be true
for some global register allocation results.




Re: Phil's regression hunter: How does it work?

2010-03-08 Thread Janis Johnson
On Sat, 2010-03-06 at 11:44 +, Peter Maier wrote:
> The gcc developers seem to have nice tool referred to as "Phil's regression 
> hunter". Where can I find documentation on it? I'm interested to know how it 
> works and its abilities. Is it maybe even available for download?
> 
> - Peter

Long ago Phil kept install trees from daily GCC builds and had tools
to identify on which day a bug was introduced.

I wrote a set of regression-hunt tools that report the revision that
introduced a failure.  You supply separate scripts to check out the
sources for a particular revision or time, to build GCC (or any other
product), and to run a test that reports whether it passed, failed, or
something unexpected happened.  See contrib/reghunt in the GCC source
tree.  I think there's a web page about it, but I can't find it at the
moment.

Janis 



Re: (un)aligned accesses on x86 platform.

2010-03-08 Thread Paweł Sikora
On Monday 08 March 2010 16:46:10 Richard Guenther wrote:
> 2010/3/8 Paweł Sikora :
> > hi,
> > 
> > during development a cross platform appliacation on x86 workstation
> > i've enabled an alignemnt checking [1] to catch possible erroneous
> > code before it appears on client's sparc/arm cpu with sigbus ;)
> > 
> > it works pretty fine and catches alignment violations but Jakub Jelinek
> > had told me (on glibc bugzilla) that gcc on x86 can still dereference
> > an unaligned pointer (except for vector insns).
> > i suppose it means that gcc can emit e.g. movl for access a short int
> > (or maybe others scenarios) in some cases and violates cpu alignment
> > rules.
> > 
> > so, is it possible to instruct gcc-x86 to always use suitable
> > loads/stores like on sparc/arm?
> 
> Only by re-building it and setting STRICT_ALIGNMENT.

oops, we have a problem with 4.4.x bootstrap ;)
http://imgbin.org/index.php?page=image&id=1356


Puzzle about mips pipeline description

2010-03-08 Thread Amker.Cheng
Hi All:
  In the GCC internals manual, section 16.19.8, there is a rule about
"define_insn_reservation":
"`condition` defines what RTL insns are described by this construction.
You should remember that you will be in trouble if `condition` for two
or more different `define_insn_reservation` constructions is TRUE for
an insn."

  Meanwhile, in mips.md the pipeline descriptions for each processor are
included along with generic.md, which provides a fallback for processors
without a specific pipeline description.

  Here is the PUZZLE: won't `define_insn_reservation` constructions from
both a specific processor's md file and the generic md file break the
rule mentioned above? For example, it seems the conditions for r3k_load
(from 3000.md) and generic_load (from generic.md) are both TRUE for an
lw insn, as sketched below.
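
(The shapes below are illustrative only, not copied from the real md
files; the unit names and latencies are made up.)

    ;; generic.md style: no cpu test, so it matches under any -mtune
    (define_insn_reservation "generic_load" 3
      (eq_attr "type" "load,fpload")
      "alu")

    ;; 3000.md style: guarded by the cpu attribute
    (define_insn_reservation "r3k_load" 2
      (and (eq_attr "cpu" "r3000,r3900")
           (eq_attr "type" "load,fpload"))
      "alu")

    ;; with -mtune=r3000, both conditions appear to be TRUE for an lw insn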

Furthermore, in the md files for specific processors it is said that
these descriptions are supposed to override parts of the generic md
file, but I don't know how that works without reading the code in
genautomata.c.

Please help me out, Thanks very much.

-- 
Best Regards.