Unroller gone wild
I have the following code:

struct bounding_box {

    pack4sf m_Mins;
    pack4sf m_Maxs;

    void set(__v4sf v_mins, __v4sf v_maxs) {

        m_Mins = v_mins;
        m_Maxs = v_maxs;
    }
};

struct bin {

    bounding_box m_Box[3];
    pack4si m_NL;
    pack4sf m_AL;
};

static const std::size_t bin_count = 16;
bin aBins[bin_count];

for(std::size_t i = 0; i != bin_count; ++i) {

    bin& b = aBins[i];

    b.m_Box[0].set(g_VecInf, g_VecMinusInf);
    b.m_Box[1].set(g_VecInf, g_VecMinusInf);
    b.m_Box[2].set(g_VecInf, g_VecMinusInf);
    b.m_NL = __v4si{ 0, 0, 0, 0 };
}

where pack4sf/si are union-based wrappers for __v4sf/si.
GCC 4.5 on Core i7/Cygwin with

-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer

completely unrolled the loop into 112 movdqa instructions,
which is "a bit" too aggressive. Should I file a bug report?
The processor has an 18-instruction prefetch queue and the loop
is perfectly predictable by the built-in branch prediction circuitry,
so translating it as is would result in a huge fetch/decode bandwidth
reduction. Is there something like "#pragma nounroll" to selectively
disable this optimization?

Best regards
Piotr Wyderski
Re: Unroller gone wild
On Mon, Mar 8, 2010 at 9:49 AM, Piotr Wyderski wrote:
> I have the following code:
>
> struct bounding_box {
>
>     pack4sf m_Mins;
>     pack4sf m_Maxs;
>
>     void set(__v4sf v_mins, __v4sf v_maxs) {
>
>         m_Mins = v_mins;
>         m_Maxs = v_maxs;
>     }
> };
>
> struct bin {
>
>     bounding_box m_Box[3];
>     pack4si m_NL;
>     pack4sf m_AL;
> };
>
> static const std::size_t bin_count = 16;
> bin aBins[bin_count];
>
> for(std::size_t i = 0; i != bin_count; ++i) {
>
>     bin& b = aBins[i];
>
>     b.m_Box[0].set(g_VecInf, g_VecMinusInf);
>     b.m_Box[1].set(g_VecInf, g_VecMinusInf);
>     b.m_Box[2].set(g_VecInf, g_VecMinusInf);
>     b.m_NL = __v4si{ 0, 0, 0, 0 };
> }
>
> where pack4sf/si are union-based wrappers for __v4sf/si.
> GCC 4.5 on Core i7/Cygwin with
>
> -O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
> -fomit-frame-pointer
>
> completely unrolled the loop into 112 movdqa instructions,
> which is "a bit" too aggressive. Should I file a bug report?
> The processor has an 18-instruction prefetch queue and the loop
> is perfectly predictable by the built-in branch prediction circuitry,
> so translating it as is would result in a huge fetch/decode bandwidth
> reduction. Is there something like "#pragma nounroll" to selectively
> disable this optimization?

No, only --param max-completely-peel-times (which is 16) or
--param max-completely-peeled-insns (which probably should then be
way lower than the current 400).

Richard.

> Best regards
> Piotr Wyderski
Re: How to make 'long int' type be a PDImode?
On Mon, Mar 8, 2010 at 8:29 AM, Joern Rennecke wrote:
> Quoting Frank Isamov :
>
>> Hi,
>>
>> I'd like to make a backend which would have 48 bits for 'long' type.
>> (32 for int and 64 for long long).
>>
>> I have tried to define:
>> #define LONG_TYPE_SIZE 48
>
> That's not a partial integer mode; PDImode would have the same size as
> DImode, just not all bits would be significant.
>
>> and one of:
>> INT_MODE (PDI, 6);
>
> And that wouldn't be PDImode, more like THImode (three-halves integer mode,
> going by the precedent of TQFmode - three-quarter float mode - of the 1750a
> port in GCC prior to version 3.1)

I am sorry, I still can conclude how to make the backend use a certain
mode for 'long' type. My architecture implements 32- and 48-bit
registers and instructions operating on these registers. That is why my
intention is to use 'int' for 32- and 'long' for 48-bit operations. My
attempts lead the backend to ignore the 48-bit path in spite of the
LONG_TYPE_SIZE definition, and 'long' is still processed as a 32-bit
value. I think I am missing something, so I am asking for help. Any
piece of information would be useful.

Thank you,
Frank
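[Editorial note, not part of the original thread: for readers unfamiliar with GCC backends, a sketch of where such declarations usually live. The file names are placeholders for the port in question, the THI spelling follows Joern's reply, and this is an untested illustration, not the fix the thread is looking for -- Frank reports that roughly this setup still leaves 'long' at 32 bits.]

```c
/* <target>-modes.def -- declare a 6-byte integer mode.  Per the reply
 * above this would be THImode ("three-halves integer"), not PDImode,
 * since all 48 bits are significant.  Sketch only; untested. */
INT_MODE (THI, 6);

/* <target>.h -- map C types onto machine modes. */
#define INT_TYPE_SIZE       32   /* int       -> SImode */
#define LONG_TYPE_SIZE      48   /* long      -> intended to be THImode */
#define LONG_LONG_TYPE_SIZE 64   /* long long -> DImode */
```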
Re: How to make 'long int' type be a PDImode?
On Mon, Mar 8, 2010 at 4:27 PM, Frank Isamov wrote:
> On Mon, Mar 8, 2010 at 8:29 AM, Joern Rennecke wrote:
>> Quoting Frank Isamov :
>>
>>> Hi,
>>>
>>> I'd like to make a backend which would have 48 bits for 'long' type.
>>> (32 for int and 64 for long long).
>>>
>>> I have tried to define:
>>> #define LONG_TYPE_SIZE 48
>>
>> That's not a partial integer mode; PDImode would have the same size as
>> DImode, just not all bits would be significant.
>>
>>> and one of:
>>> INT_MODE (PDI, 6);
>>
>> And that wouldn't be PDImode, more like THImode (three-halves integer mode,
>> going by the precedent of TQFmode - three-quarter float mode - of the 1750a
>> port in GCC prior to version 3.1)
>
> I am sorry, I still can conclude how to make the backend use a certain
> mode for 'long' type. My architecture implements 32- and 48-bit
> registers and instructions operating on these registers. That is why my
> intention is to use 'int' for 32- and 'long' for 48-bit operations. My
> attempts lead the backend to ignore the 48-bit path in spite of the
> LONG_TYPE_SIZE definition, and 'long' is still processed as a 32-bit
> value. I think I am missing something, so I am asking for help. Any
> piece of information would be useful.
>
> Thank you,
> Frank

Correction: "I still can conclude" should be read as "I still cannot conclude"
(un)aligned accesses on x86 platform.
hi,

during development of a cross-platform application on an x86 workstation
i've enabled alignment checking [1] to catch possibly erroneous code
before it appears on a client's sparc/arm cpu with sigbus ;)

it works pretty fine and catches alignment violations, but Jakub Jelinek
told me (on the glibc bugzilla) that gcc on x86 can still dereference
an unaligned pointer (except for vector insns).
i suppose it means that gcc can emit e.g. movl to access a short int
(or maybe other scenarios) in some cases and violate cpu alignment rules.

so, is it possible to instruct gcc-x86 to always use suitable loads/stores
like on sparc/arm?

[1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)

BR,
Pawel.
IRA conflict graph, multiple alternatives and commutative operands
I'm looking at PR 42258. I have a question on the IRA conflict graph
and multiple alternatives. Below is an RTL insn just before the
register allocation pass:

(insn 7 6 12 2 pr42258.c:2 (set (reg:SI 136)
        (mult:SI (reg:SI 137)
            (reg/v:SI 135 [ x ]))) 33 {*thumb_mulsi3})

IRA generates the following conflict graph for r135, r136 and r137:

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a1(r135,l0) conflicts: a0(r136,l0) a2(r137,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a2(r137,l0) conflicts: a0(r136,l0) a1(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:

  regions=1, blocks=3, points=5
    allocnos=3, copies=0, conflicts=0, ranges=3

Apparently this conflict graph is not an optimal one for any of the
three alternatives in the following instruction pattern:

(define_insn "*thumb_mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=&l,&l,&l")
        (mult:SI (match_operand:SI 1 "register_operand" "%l,*h,0")
                 (match_operand:SI 2 "register_operand" "l,l,l")))]
  ...)

This conflict graph seems like a merge of the conflict graphs of the
three alternatives. Ideally, for the first and second alternatives, we
should have

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)

For the third alternative, we'd better have

;; a0(r136,l0) conflicts: a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
cp0:a0(r136)<->a2(r137)@1000:constraint

And the register allocator would use one of these more specific
conflict graphs for coloring. If we take the commutative operands into
account, we have to add the following conflict graph for choosing:

;; a0(r136,l0) conflicts: a2(r137,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)
cp0:a0(r136)<->a1(r135)@1000:constraint

(Actually, this conflict graph will result in an optimal result for
the test case in PR 42258.)

Now the problem is when and how to choose the alternative for the
register allocator to calculate the conflict graph. Yes, I have read
the thread:

http://gcc.gnu.org/ml/gcc/2009-02/msg00215.html

This question seems not easy. So is there any practical method to make
the register allocator pick the third alternative and do commutation
before or during register allocation?

Thanks,

--
Jie Zhang
CodeSourcery
(650) 331-3385 x735
Re: (un)aligned accesses on x86 platform.
You define STRICT_ALIGNMENT to be 1 in i386.h or provide an option to
turn that on/off like the rs6000 target does.

Thanks,
Andrew Pinski

Sent from my iPhone

On Mar 8, 2010, at 7:37 AM, Paweł Sikora wrote:

hi,

during development of a cross-platform application on an x86 workstation
i've enabled alignment checking [1] to catch possibly erroneous code
before it appears on a client's sparc/arm cpu with sigbus ;)

it works pretty fine and catches alignment violations, but Jakub Jelinek
told me (on the glibc bugzilla) that gcc on x86 can still dereference
an unaligned pointer (except for vector insns).
i suppose it means that gcc can emit e.g. movl to access a short int
(or maybe other scenarios) in some cases and violate cpu alignment rules.

so, is it possible to instruct gcc-x86 to always use suitable loads/stores
like on sparc/arm?

[1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)

BR,
Pawel.
Re: (un)aligned accesses on x86 platform.
2010/3/8 Paweł Sikora :
> hi,
>
> during development of a cross-platform application on an x86 workstation
> i've enabled alignment checking [1] to catch possibly erroneous code
> before it appears on a client's sparc/arm cpu with sigbus ;)
>
> it works pretty fine and catches alignment violations, but Jakub Jelinek
> told me (on the glibc bugzilla) that gcc on x86 can still dereference
> an unaligned pointer (except for vector insns).
> i suppose it means that gcc can emit e.g. movl to access a short int
> (or maybe other scenarios) in some cases and violate cpu alignment rules.
>
> so, is it possible to instruct gcc-x86 to always use suitable loads/stores
> like on sparc/arm?

Only by re-building it and setting STRICT_ALIGNMENT.

Richard.

> [1] "AC" bit - http://en.wikipedia.org/wiki/FLAGS_register_(computing)
>
> BR,
> Pawel.
Re: IRA conflict graph, multiple alternatives and commutative operands
Jie Zhang wrote:

I'm looking at PR 42258. I have a question on the IRA conflict graph
and multiple alternatives. Below is an RTL insn just before the
register allocation pass:

(insn 7 6 12 2 pr42258.c:2 (set (reg:SI 136)
        (mult:SI (reg:SI 137)
            (reg/v:SI 135 [ x ]))) 33 {*thumb_mulsi3})

IRA generates the following conflict graph for r135, r136 and r137:

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a1(r135,l0) conflicts: a0(r136,l0) a2(r137,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a2(r137,l0) conflicts: a0(r136,l0) a1(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:

  regions=1, blocks=3, points=5
    allocnos=3, copies=0, conflicts=0, ranges=3

Apparently this conflict graph is not an optimal one for any of the
three alternatives in the following instruction pattern:

(define_insn "*thumb_mulsi3"
  [(set (match_operand:SI 0 "register_operand" "=&l,&l,&l")
        (mult:SI (match_operand:SI 1 "register_operand" "%l,*h,0")
                 (match_operand:SI 2 "register_operand" "l,l,l")))]
  ...)

This conflict graph seems like a merge of the conflict graphs of the
three alternatives. Ideally, for the first and second alternatives, we
should have

;; a0(r136,l0) conflicts: a2(r137,l0) a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)

For the third alternative, we'd better have

;; a0(r136,l0) conflicts: a1(r135,l0)
;; a1(r135,l0) conflicts: a0(r136,l0)
cp0:a0(r136)<->a2(r137)@1000:constraint

And the register allocator would use one of these more specific
conflict graphs for coloring. If we take the commutative operands into
account, we have to add the following conflict graph for choosing:

;; a0(r136,l0) conflicts: a2(r137,l0)
;; a2(r137,l0) conflicts: a0(r136,l0)
cp0:a0(r136)<->a1(r135)@1000:constraint

(Actually, this conflict graph will result in an optimal result for
the test case in PR 42258.)

Now the problem is when and how to choose the alternative for the
register allocator to calculate the conflict graph. Yes, I have read
the thread:

http://gcc.gnu.org/ml/gcc/2009-02/msg00215.html

This question seems not easy. So is there any practical method to make
the register allocator pick the third alternative and do commutation
before or during register allocation?

I do not know of such a practical method.

IRA should create a correct allocation whatever alternative reload
chooses for insns. Otherwise reload might generate wrong code.

As for code telling IRA (and reload) which alternatives only to
consider, it is doable to implement, but the question again is how to
choose which alternatives to consider. In your particular example, it
might seem that the 1st alternative is always worse than the third one
if the first operand dies. But even such a check is not easy to
implement. Also, that might not be true for some global register
allocation results.
Re: Phil's regression hunter: How does it work?
On Sat, 2010-03-06 at 11:44 +, Peter Maier wrote:
> The gcc developers seem to have a nice tool referred to as "Phil's
> regression hunter". Where can I find documentation on it? I'm
> interested to know how it works and its abilities. Is it maybe even
> available for download?
>
> - Peter

Long ago Phil kept install trees from daily GCC builds and had tools
to identify on which day a bug was introduced.

I wrote a set of regression hunt tools that, given separate scripts to
check out sources for a particular revision or time, build GCC (or any
other product), and run a test that reports whether it passed, failed,
or something unexpected happened, will report the revision that
introduced the failure. See contrib/reghunt in the GCC source tree.
I think there's a web page about it, but I can't find it at the moment.

Janis
Re: (un)aligned accesses on x86 platform.
On Monday 08 March 2010 16:46:10 Richard Guenther wrote:
> 2010/3/8 Paweł Sikora :
> > hi,
> >
> > during development of a cross-platform application on an x86 workstation
> > i've enabled alignment checking [1] to catch possibly erroneous code
> > before it appears on a client's sparc/arm cpu with sigbus ;)
> >
> > it works pretty fine and catches alignment violations, but Jakub Jelinek
> > told me (on the glibc bugzilla) that gcc on x86 can still dereference
> > an unaligned pointer (except for vector insns).
> > i suppose it means that gcc can emit e.g. movl to access a short int
> > (or maybe other scenarios) in some cases and violate cpu alignment
> > rules.
> >
> > so, is it possible to instruct gcc-x86 to always use suitable
> > loads/stores like on sparc/arm?
>
> Only by re-building it and setting STRICT_ALIGNMENT.

oops, we have a problem with 4.4.x bootstrap ;)

http://imgbin.org/index.php?page=image&id=1356
Puzzle about mips pipeline description
Hi All:

In the GCC internals manual, section 16.19.8, there is a rule about
`define_insn_reservation':

"`condition' defines what RTL insns are described by this construction.
You should remember that you will be in trouble if `condition' for two
or more different `define_insn_reservation' constructions is TRUE for
an insn."

Meanwhile in mips.md, the pipeline descriptions for each processor are
included along with generic.md, which provides a fallback for
processors without a specific pipeline description.

Here is the PUZZLE: won't `define_insn_reservation' constructions from
both a specific processor's md file and the generic md file break the
rule mentioned above? For example, it seems the conditions for
r3k_load (from 3000.md) and generic_load (from generic.md) are both
TRUE for an lw insn.

Furthermore, in the md files for specific processors, it is said that
these descriptions are supposed to override parts of the generic md
file, but I don't know how that works without reading the code in
genautomata.c.

Please help me out. Thanks very much.

--
Best Regards.