Re: inefficient comparison

2006-10-18 Thread Uros Bizjak

Krishna Myneni <[EMAIL PROTECTED]> wrote:


Under "gcc (GCC) 4.1.0 (SUSE Linux)", with CFLAGS = -m32 -S, and under all of
the optimization levels (O0 through O3), the following is produced :
flds .LC2
fldz
fxch %st(1)
fucompp
fnstsw %ax
:



The FXCH instruction is unnecessary if the FLDS and FLDZ instructions were
ordered in reverse. Why does this type of optimization not take place?


This is actually PR target/15492.

The optimization doesn't take place simply because nobody has
implemented it yet. I have a prototype patch that catches these
simple opportunities.

Uros.


Povray benchmark results

2006-10-29 Thread Uros Bizjak

Hello!

Following are the results of povray 3.6.1 official benchmark 
(benchmark.ini) on:


vendor_id   : AuthenticAMD
cpu family  : 15
model       : 47
model name  : AMD Athlon(tm) 64 Processor 3000+
stepping    : 2
cpu MHz     : 1809.278
cache size  : 512 KB

Base compile flags of "gcc version 4.3.0 20061029" were set to:
-O3 -msse3 -ffast-math -march=k8 -mtune=k8 -minline-all-stringops

Different -mfpmath selections were benchmarked:

a) -mfpmath=sse
user    27m33.848s

b) -mfpmath=387
user    27m42.136s

c) -mfpmath=sse,387
user    26m0.312s

These results were obtained with Richard's SSE rounding functions, top 
of ChangeLog was:

2006-10-29  Richard Guenther  <[EMAIL PROTECTED]>
   * config/i386/i386-protos.h (ix86_expand_trunc): Declare.

Nice to see that there is life in -mfpmath=sse,387 ;) It was faster than 
a) and b) by some 6%.


Uros.



bootstrap broken in libgfortran

2006-11-02 Thread Uros Bizjak

Hello!

Does anybody else get these errors in libgfortran during clean bootstrap:

...

...
.libs/environ.o(.text+0x10d0):/usr/include/stdlib.h:401: first defined here
.libs/in_unpack_generic.o(.text+0x730): In function `atol':
/usr/include/stdlib.h:336: multiple definition of `atol'
.libs/environ.o(.text+0x12a0):/usr/include/stdlib.h:336: first defined here
.libs/in_unpack_generic.o(.text+0x740): In function `atoll':
/usr/include/stdlib.h:382: multiple definition of `atoll'
.libs/environ.o(.text+0x1290):/usr/include/stdlib.h:382: first defined here
.libs/in_unpack_generic.o(.text+0x750): In function `atof':
/usr/include/stdlib.h:330: multiple definition of `atof'
.libs/environ.o(.text+0x12b0):/usr/include/stdlib.h:330: first defined here
collect2: ld returned 1 exit status
gmake[3]: *** [libgfortran.la] Error 1
gmake[3]: Leaving directory 
`/home/uros/gcc-build/x86_64-unknown-linux-gnu/libgfortran'

gmake[2]: *** [all] Error 2
gmake[2]: Leaving directory 
`/home/uros/gcc-build/x86_64-unknown-linux-gnu/libgfortran'

gmake[1]: *** [all-target-libgfortran] Error 2
gmake[1]: Leaving directory `/home/uros/gcc-build'
gmake: *** [bootstrap] Error 2

This happens on x86_64-pc-linux-gnu and i686-pc-linux-gnu, FC4.

Uros.


Re: Mapping NAN to ZERO / When does gcc generate MOVcc and FCMOVcc instructions?

2006-11-02 Thread Uros Bizjak

Michael James wrote:


Conceptually, the code is:



double sum = 0;



for(i=0; i



I have tried a half dozen variants at the source level in attempt to
get gcc to do this without branching (and without calling a helper
function isnan). I was not really able to succeed at either of these.


You need to specify an architecture that has cmov instruction; at
least -march=i686.


Concerning the inline evaluation of isnan, I tried using
__builtin_unordered(x,x) which either gets optimized out of existence
when I specify -funsafe-math-optimizations, or causes other gcc math
inlines (specifically log) to not use their inline definitions when I
do not specify -funsafe-math-optimizations. For my particular
problem I have a workaround for this which nonetheless causes the
result of isnan to end up as a condition flag in the EFLAGS register.
(Instead of a test for nan, I use a test for 0 in the domain of the
log.)


This testcase (similar to yours, but it actually compiles):

double test(int n, double a)
{
 double sum = 0.0;
 int i;

 for(i=0; i
[...]
double extensions for x87 math. As you probably don't need math
errno from log(), -fno-math-errno should be added.

Those two flags produce an IMO optimal loop:

.L5:
   pushl   %eax
   fildl   (%esp)
   addl    $4, %esp
   fldln2
   fxch    %st(1)
   fyl2x
   fucomi  %st(0), %st
   fldz
   fcmovnu %st(1), %st
   fstp    %st(1)
   addl    $1, %eax
   cmpl    %edx, %eax
   faddp   %st, %st(1)
   jne     .L5

Uros.


Concerning the use of an unconditional add, followed by a FCMOVcc
instead of a Jcc, I have had no success: I have tried code such as:


Re: Mapping NAN to ZERO / When does gcc generate MOVcc and FCMOVcc instructions?

2006-11-03 Thread Uros Bizjak

Michael James wrote:


And recompiled with the same flags. The assembly code for the loop
portion is identical to the one I posted above. Now though the code is
actually capable of producing NANs.

Just to be sure, I also tested this on my modified loop:

int main(void) {
   printf("test(4, 6, 0) = %f\n", test(4,6,0));
   printf("test(0, 2, 0) = %f\n", test(0,2,0));
   printf("test(-2, 3, 0) = %f\n", test(-2,3,0));
   return 0;
}

[EMAIL PROTECTED]:~/project/cf/util$ /home/james/local/gcc/bin/gcc -O2
-march=i686 -funsafe-math-optimizations -fno-math-errno uros-test.c -o
test

[EMAIL PROTECTED]:~/project/cf/util$ ./test
test(4, 6, 0) = 2.995732
test(0, 2, 0) = -inf
test(-2, 3, 0) = nan

[EMAIL PROTECTED]:~/project/cf/util$ /home/james/local/gcc/bin/gcc -O2
-march=i686 uros-test.c -o test -lm

[EMAIL PROTECTED]:~/project/cf/util$ ./uros
test(4, 6, 0) = 2.995732
test(0, 2, 0) = -inf
test(-2, 3, 0) = -inf

[EMAIL PROTECTED]:~/project/cf/util$ /home/james/local/gcc/bin/gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4-2/configure --prefix=/home/james/local/gcc
Thread model: posix
gcc version 4.2.0 20061103 (prerelease)

Perhaps I have not replicated your working environment closely enough,
or you have a different macro in place of the isnan call. I compiled
all code above both with and without include headers ,
. I get the same results either way.

No, I'm working with gcc version 4.3.0 20061103 (experimental), and the 
results are as expected:
gcc -O2 -funsafe-math-optimizations -fno-math-errno -march=i686 
-mfpmath=387 -m32 mj.c

[-m32 is not needed on 32bit compiler]

test(4, 6, 0) = 2.995732
test(0, 2, 0) = -inf
test(-2, 3, 0) = -inf

This is the exact testcase I used:
--cut here--
double test(int i0, int n, double a)
{
double sum = 0.0;
int i;

for(i=i0; i
[...]
Inspecting the code, there is a fucomi insn present just after the "log" insn 
sequence.


Unfortunately, I have no 4.2 installed here, but it looks like a bug to 
me. Could you perhaps open a bug report for this issue?


Thanks,
Uros.


Re: AVR byte swap optimization

2006-11-26 Thread Uros Bizjak

Denis Vlasenko wrote:


The following macro expands to some rather frightful code on the AVR:

#define BSWAP_16(x) \
 ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))


Sometimes gcc is generating better code if you cast
values instead of masking. Try:

  ( (uint8_t)((x) >> 8) | ((uint8_t)(x)) << 8 )


gcc > 3.4.x always generates better code for casted values. This is PR
middle-end/29749, and it affects i386 linux too.
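
To compare the two spellings side by side, here is a small sketch; the macro names BSWAP_16_MASK and BSWAP_16_CAST and the wrapper functions are made up for illustration. Both forms return the same value for 16-bit inputs, so any difference is only in the code the compiler generates:

```c
#include <stdint.h>

/* Masking form, as in the quoted macro. */
#define BSWAP_16_MASK(x) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))

/* Casting form, as suggested in the reply. */
#define BSWAP_16_CAST(x) ((uint8_t)((x) >> 8) | ((uint8_t)(x)) << 8)

/* Wrappers so the generated code for each form can be inspected
   separately (for example with gcc -O2 -S). */
uint16_t bswap16_mask(uint16_t x) { return (uint16_t)BSWAP_16_MASK(x); }
uint16_t bswap16_cast(uint16_t x) { return (uint16_t)BSWAP_16_CAST(x); }
```

For example, both bswap16_mask(0x1234) and bswap16_cast(0x1234) yield 0x3412.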

Uros.


SPEC CFP2000 and polyhedron runtime scores dropped from 13. november onwards

2006-12-01 Thread Uros Bizjak

Hello!

At least on x86_64 and i686, SPEC [1] and polyhedron [2] scores
dropped noticeably. For SPEC benchmarks, the mgrid, galgel, ammp and
sixtrack tests are affected, and for polyhedron, ac (second regression
in the peak) and protein (?) regressed in that time frame.

[1] http://www.suse.de/~aj/SPEC/amd64/CFP/summary-britten/recent.html
[2] 
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html

Does anybody have any idea what is going on there?

Uros.


Re: Serious SPEC CPU 2006 FP performance regressions on IA32

2006-12-08 Thread Uros Bizjak

H. J. Lu wrote:


Gcc 4.3 revision 119497 has serious SPEC CPU 2006 FP performance
regressions on P4, Pentium M and Core Duo, compared against
gcc 4.2 20060910. With -O2, the typical regressions look like

 

I think that you are looking at the same problem as 
http://gcc.gnu.org/ml/gcc/2006-12/msg00017.html.



Is that related to recent i386 backend FP changes?

 


I _think_ that they should not have such impact.

Uros.




Re: What's the status of autovectorization for MMX and 3DNow!?

2006-12-11 Thread Uros Bizjak

Hello!


I'm particularly interested in this patch
(http://gcc.gnu.org/ml/gcc-patches/2005-07/msg01128.html); pretty nice for
users of Pentium 3 and Athlon. Has it been or will it be integrated into
mainline?


This patch heavily depends on the functionality of the optimize mode
switching pass. Unfortunately, there is currently no way to tell
optimize_mode_switching() which modes are exclusive. Because of the way
the emms switching patch was designed, it expects that only one of MMX or
x87 mode is active at a time, so that it can properly switch between x87
and MMX registers.

PR target/19161 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19161)
comment #17 has an example of the control flow that can block both
register sets  at once. Otherwise, the patch works as expected.

Uros.



Fixme in driver-i386.c

2006-12-22 Thread Uros Bizjak

Hello!

There is a fixme in config/i386/driver-i386.c:

--cut here--
  if (arch)
    {
      /* FIXME: i386 is wrong for 64bit compiler.  How can we tell if
         we are generating 64bit or 32bit code?  */
      cpu = "i386";
    }
 else
--cut here--

Couldn't a simple "sizeof(long)" do the trick here, i.e.:

--cut here--
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int i = sizeof (long);

  switch (i)
    {
    default:
      abort();
    case 4:
    case 8:
      printf ("%i\n", i);
    }
  return 0;
}
--cut here--

gcc -m32
./a.out
4
gcc -m64
./a.out
8
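
As a sketch of how the driver could use this idea (the helper name fallback_cpu and the "x86-64" return value are assumptions for illustration, not the actual driver-i386.c code), the size check can be folded into the fallback choice:

```c
/* Hypothetical helper: choose a fallback CPU name from the width of
   long in the compiler that builds this file. */
const char *fallback_cpu(void)
{
    return sizeof(long) == 8 ? "x86-64" : "i386";
}
```

This only reflects how the driver binary itself was compiled, which is the point of the suggestion. It relies on the LP64 model used on Linux; on an LLP64 target such as 64-bit Windows, long stays 4 bytes, so sizeof(void *) would be the safer probe there.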

Uros.


Re: Jan Hubicka and Uros Bizjak appointed i386 maintainers

2007-01-09 Thread Uros Bizjak

David Edelsohn wrote:


I am pleased to announce that the GCC Steering Committee has
appointed Jan Hubicka and Uros Bizjak as co-maintainers of the i386 port.
 


Thank you!

Please find attached the patch that updates my MAINTAINERS entry.

2007-01-09  Uros Bizjak  <[EMAIL PROTECTED]>

   * MAINTAINERS: Add myself as i386 maintainer.

Uros.

Index: MAINTAINERS
===
--- MAINTAINERS	(revision 120611)
+++ MAINTAINERS	(working copy)
@@ -54,6 +54,7 @@
 hppa port		Dave Anglin		[EMAIL PROTECTED]
 i386 port		Richard Henderson	[EMAIL PROTECTED]
 i386 port		Jan Hubicka		[EMAIL PROTECTED]
+i386 port		Uros Bizjak		[EMAIL PROTECTED]
 ia64 port		Jim Wilson		[EMAIL PROTECTED]
 iq2000 port		Nick Clifton		[EMAIL PROTECTED]
 m32c port		DJ Delorie		[EMAIL PROTECTED]
@@ -243,7 +244,6 @@
 Jan Beulich	[EMAIL PROTECTED]
 David Billinghurst[EMAIL PROTECTED]
 Laurynas Biveinis			[EMAIL PROTECTED]
-Uros Bizjak	[EMAIL PROTECTED]
 Eric Blake	[EMAIL PROTECTED]
 Jim Blandy	[EMAIL PROTECTED]
 Phil Blundell	[EMAIL PROTECTED]


No ifcvt during ce1 pass (fails i386/ssefp-2.c)

2007-03-14 Thread Uros Bizjak

Hello!

A recently committed patch breaks the i386/ssefp-2.c testcase, where maxsd is
not generated anymore.

I have looked a bit into this failure and noticed that for some reason
we don't perform ifcvt transformations during the ce1 RTL pass. The second
transformation is still performed during the ce2 pass, but this is too late, as
combine has already combined some patterns into patterns that can't be
split into the maxsd pattern.

Previously, ce1 pass generated:

IF-THEN block found, pass 1, start block 2 [insn 5], then 3 [15], join 4 [17]
Replacing insn 10 by jump 35
Conversion succeeded on pass 1.

1 possible IF blocks searched.
1 IF blocks converted.
2 true changes made.

(insn 34 9 19 2 (set (reg:DF 58 [ iftmp.0 ])
   (unspec:DF [
   (reg:DF 60)
   (reg:DF 58 [ iftmp.0 ])
   ] 52)) 564 {*ieee_smaxdf3} (nil)
   (nil))

but now all relevant insns remain unaffected at the end of ce1 pass:

(insn 9 8 10 2 (set (reg:CCFPU 17 flags)
   (compare:CCFPU (reg:DF 60)
   (reg:DF 58 [ iftmp.0 ]))) 26 {*cmpfp_iu_sse} (nil)
   (nil))

(jump_insn 10 9 14 2 (set (pc)
   (if_then_else (gt (reg:CCFPU 17 flags)
   (const_int 0 [0x0]))
   (label_ref:SI 14)
   (pc))) 365 {*jcc_1} (nil)
   (expr_list:REG_BR_PROB (const_int 4600 [0x11f8])
   (nil)))

Uros.


Re: No ifcvt during ce1 pass (fails i386/ssefp-2.c)

2007-03-15 Thread Uros Bizjak

On 3/15/07, Alexandre Oliva <[EMAIL PROTECTED]> wrote:


> Recent committed patch breaks i386ssefp-2.c testcase, where maxsd is
> not generated anymore.

FWIW, I saw it both before and after the patch for PR 31127.  I've
just tried reverting PR 30643 as well, but the problem remains.  So
it's unrelated.

Have you been able to narrow it down to any other patch?


Yes, bisection found that Steven's patch is guilty as charged ;>

--- gcc/ChangeLog   (revision 122856)
+++ gcc/ChangeLog   (revision 122858)
@@ -1,3 +1,12 @@
+2007-03-12  Steven Bosscher  <[EMAIL PROTECTED]>
+
+   * tree-pass.h (pass_into_cfg_layout_mode,
+   pass_outof_cfg_layout_mode): Declare.
+   * cfglayout.c (into_cfg_layout_mode, outof_cfg_layout_mode,
+   pass_into_cfg_layout_mode, pass_outof_cfg_layout_mode): New.
+   * passes.c (pass_into_cfg_layout_mode): Schedule before jump2.
+   (pass_outof_cfg_layout_mode): Schedule after pass_rtl_ifcvt.
+

Steven, I have been looking into this failure a bit, and found that
ifcvt during first pass is rejected in find_if_block(), at:

 /* The THEN block of an IF-THEN combo must have exactly one predecessor,
other than any || blocks which jump to the THEN block.  */
 if ((EDGE_COUNT (then_bb->preds) - ce_info->num_or_or_blocks) != 1)
   return FALSE;

For some reason, EDGE_COUNT (then_bb->preds) returns 2 (where
ce_info->n_o_o_blocks = 0). Strangely, bb3 ("else" block that fails)
has:

Basic block 3 , prev 2, next 4, loop_depth 0, count 0, freq 4600, maybe hot.
Predecessors:  2 [46.0%]
Successors:  4 [100.0%]  (fallthru)

The testcase is:

double x;
q()
{
 x=x<5?5:x;
}

compile this with -O2 -msse2 -mfpmath=sse, and this testcase should
compile to maxsd.
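
For reference, here is the same select written as a pure function (the name clamp_min5 is illustrative). Operand order matters for a maxsd match: a NaN input makes x < 5 false, so x itself is returned, and maxsd likewise returns its second source operand when either input is NaN. That is presumably why the dump above shows an order-sensitive unspec pattern (*ieee_smaxdf3) rather than a commutative max:

```c
/* Returns 5 when x < 5, otherwise x (including NaN, since any
   comparison with NaN is false). */
double clamp_min5(double x)
{
    return x < 5 ? 5 : x;
}
```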

Uros.


Re: No ifcvt during ce1 pass (fails i386/ssefp-2.c)

2007-03-15 Thread Uros Bizjak

On 3/15/07, Steven Bosscher <[EMAIL PROTECTED]> wrote:


On 3/15/07, Uros Bizjak <[EMAIL PROTECTED]> wrote:
> compile this with -O2 -msse2 -mfpmath=sse, and this testcase should
> compile to maxsd.

I'll look into it this weekend.


Thanks!

BTW: Your patch also causes

FAIL: gcc.dg/torture/pr25183.c  -O0  (internal compiler error)
FAIL: gcc.dg/torture/pr25183.c  -O0  (test for excess errors)

on i386 [1]:

gcc/cc1 -O0 pr25183.c
error
pr25183.c: In function ‘error’:
pr25183.c:22: internal compiler error: in cfg_layout_merge_blocks, at
cfgrtl.c:2552
Please submit a full bug report,


[1] http://gcc.gnu.org/ml/gcc-testresults/2007-03/msg00672.html

Uros.


Effects of newly introduced -mpcX 80387 precision flag

2007-04-03 Thread Uros Bizjak

Hello!

Here are some effects of the newly introduced -mpcX flag on povray-3.6.1
(as a kind of representative FP-heavy application to measure FP
performance).

The code was compiled using:

-O3 -march=native -funroll-all-loops -ffast-math -D__NO_MATH_INLINES
-malign-double -minline-all-stringops

(32bit code) and final binary was linked using -mpc32, -mpc64 or -mpc80.

The testfile was abyss.pov in default resolution (320 x 240) and the
timings were performed on 'Intel(R) Xeon(TM) CPU 3.60GHz'. Following
are averages of 5 runs:

-mpc80: average runtime 13.273s  (baseline)
-mpc64: average runtime 12.977s  (2.2 % faster than baseline)
-mpc32: average runtime 12.160s  (9.2 % faster than baseline)
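
The percentages depend on which convention is used. As a worked check (the function names are illustrative), the sketch below computes both the runtime reduction, (base - t) / base, and the speedup, base / t - 1, from the averages above:

```c
/* Percent reduction in runtime relative to the baseline. */
double runtime_reduction_pct(double base, double t)
{
    return 100.0 * (base - t) / base;
}

/* Percent speedup (throughput gain) relative to the baseline. */
double speedup_pct(double base, double t)
{
    return 100.0 * (base / t - 1.0);
}
```

For -mpc64 (12.977s against 13.273s) the two measures are about 2.2 % and 2.3 %; for -mpc32 (12.160s) they are about 8.4 % and 9.2 %.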

It should be noted that the rendered picture in the -mpc32 case developed a
strange halo around the rendered object (nice special effect ;), and this
is clearly the effect of too low FP precision. The -mpc64 and -mpc80
rendered pictures were equal, as far as my vision is concerned (well,
not quite a scientific approach).

It should also be noted that ifort defaults to -pc64, and this brings some
2% better runtimes on FP intensive code.

Uros.


Re: Effects of newly introduced -mpcX 80387 precision flag

2007-04-03 Thread Uros Bizjak

On 4/3/07, Richard Guenther <[EMAIL PROTECTED]> wrote:


> It should be noted, that rendered picture in -mpc32 case developed a
> strange halo around rendered object (nice special effect ;), and this
> is clearly the effect of too low FP precision. -mpc64 and -mpc80
> rendered pictures were equal, as far as my vision in concerned (well,
> not quite a scientific approach).
>
> It should be noted, that ifort defaults to -pc64 and this brings some
> 2% better runtimes on FP intensive code.

Is ifort able to provide extended precision arithmetic in that case?  I think
the documentation should mention that 'long double' will no longer work
as expected if you pass -mpc64, likewise for 'double' and -mpc32.


I have added:

...  Note that a change of default precision control may
affect the results returned by some of the mathematical functions.

to the documentation to warn users about this fact.

Uros.


Re: x86 inc/dec on core2

2007-04-07 Thread Uros Bizjak

Hello!


> I was wondering, if:
> 
>   /* X86_TUNE_USE_INCDEC */

>   ~(m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC),
> 
> is correct.  Should it be:
> 
>   /* X86_TUNE_USE_INCDEC */

>   ~(m_PENT4 | m_NOCONA | m_GENERIC),
> 
> ?


inc/dec has the same performance as add/sub on Core 2 Duo. But
inc/dec is shorter.
  


What about partial flag register dependency of inc/dec?

Uros.



Re: x86 inc/dec on core2

2007-04-08 Thread Uros Bizjak

H. J. Lu wrote:


inc/dec has the same performance as add/sub on Core 2 Duo. But
inc/dec is shorter.
 
  

What about partial flag register dependency of inc/dec?



There is no partial flag register dependency on inc/dec.
  


My docs say that "INC/DEC does not change the carry flag". But you have 
better resources than I, so if you think that C2D should be left out of 
X86_TUNE_USE_INCDEC, then the patch is pre-approved for mainline.


Thanks,
Uros.





Re: x86 inc/dec on core2

2007-04-08 Thread Uros Bizjak

Mike Stump wrote:

But you have better resources that I, so if you think that C2D should 
be left out of X86_TUNE_USE_INCDEC, then the patch is pre-approved 
for mainline.


I'm confused again, it isn't that it should be left out, it is that it 
should be included.  My patch adds inc/dec selection for C2D.  I'd 
also like it for generic on darwin, as that makes more sense for us.  
How does the rest of the community feel about inc/dec selection for 
generic?

Just to clear the mess - Yes, C2D should be a part of 
X86_TUNE_USE_INCDEC. Sorry for the confusion.
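
The outcome can be pictured with a small model of such a tuning mask. This is only a sketch: the enum, the macro values and use_incdec() are defined here for illustration and are not the definitions from i386.c. With m_CORE2 dropped from the exclusion list, Core 2 selects inc/dec while Pentium 4, Nocona and generic tuning keep add/sub:

```c
/* Illustrative model of a per-tuning bitmask; the names mirror the
   quoted snippet but are defined here only for the sketch. */
enum cpu { CPU_PENT4, CPU_NOCONA, CPU_CORE2, CPU_GENERIC, CPU_K8 };

#define m_PENT4   (1u << CPU_PENT4)
#define m_NOCONA  (1u << CPU_NOCONA)
#define m_CORE2   (1u << CPU_CORE2)
#define m_GENERIC (1u << CPU_GENERIC)

/* Every CPU not listed in the exclusion set uses inc/dec. */
static const unsigned tune_use_incdec = ~(m_PENT4 | m_NOCONA | m_GENERIC);

/* Returns 1 when inc/dec should be selected for the given tuning. */
int use_incdec(enum cpu tune)
{
    return (tune_use_incdec & (1u << tune)) != 0;
}
```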


Uros.



Re: GCC no longer synthesizing v2sf operations from v4sf operations?

2005-03-21 Thread Uros Bizjak
Hello!
typedef float v4sf __attribute__((vector_size(16)));
void foo(v4sf *a, v4sf *b, v4sf *c)
{
   *a = *b + *c;
}
we no longer (since 4.0) synthesize v2sf (aka sse) operations
for f.i. -march=athlon (not that we were too successful at this
in 3.4 - we generated horrible code instead).  Instead for !sse2
architectures we generate standard i387 FP code (with some
unnecessary temporaries, but reasonably well).
 

SSE _is_ v4sf. 'gcc -O2 -msse -S -fomit-frame-pointer' produces:
foo:
   movl    12(%esp), %eax
   movaps  (%eax), %xmm0
   movl    8(%esp), %eax
   addps   (%eax), %xmm0
   movl    4(%esp), %eax
   movaps  %xmm0, (%eax)
   ret
SSE2 is v2df.
Athlon does not handle SSE insns.
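
A self-contained way to try the quoted function (the add_lane helper is made up for this sketch; vector_size is a GCC extension that Clang also accepts). Compiled with -O2 -msse -S for a 32-bit x86 target it should show the addps form above, while targets without a matching vector unit get the operation lowered to scalar code:

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Same shape as the quoted testcase: one four-lane float addition. */
void foo(v4sf *a, v4sf *b, v4sf *c)
{
    *a = *b + *c;
}

/* Hypothetical helper for trying foo() on plain float arrays. */
float add_lane(const float b[4], const float c[4], int lane)
{
    v4sf va, vb, vc;
    float out[4];

    memcpy(&vb, b, sizeof vb);   /* copying avoids alignment assumptions */
    memcpy(&vc, c, sizeof vc);
    foo(&va, &vb, &vc);
    memcpy(out, &va, sizeof out);
    return out[lane];
}
```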
Uros.


Re: GCC no longer synthesizing v2sf operations from v4sf operations?

2005-03-21 Thread Uros Bizjak
Richard Guenther wrote:
Oh, so we used to expand to 3dnow?  I see with gcc 3.4 produced:
foo:
   pushl   %ebp
   movl    %esp, %ebp
   pushl   %ebx
   subl    $84, %esp
   movl    12(%ebp), %eax
   movl    16(%ebp), %edx
[...]
   movq    -64(%ebp), %mm0
   movl    %ebx, -72(%ebp)
   movl    -36(%ebp), %ebx
   movl    %ebx, -68(%ebp)
   pfadd   -72(%ebp), %mm0
   movq    %mm0, -56(%ebp)
   movl    12(%eax), %eax
etc.
This doesn't happen anymore with 4.0/4.1.
 

IIRC, any generic code that produces MMX or 3DNow! instructions is 
disabled ATM, because gcc doesn't know how/when to insert the emms/femms 
instruction. You don't want to mix 3DNow! insns with x87 insns and use 
the shared 3DNow!/x87 registers without this insn...

Uros.


Re: GCC 4.0.0 fsincos?

2005-04-23 Thread Uros Bizjak
Hello!
If I compile with
$ ~/usr/bin/gcc-4.0.0 -S Com_Code.cc -ffast-math -O2
the relevant generated code section is
#APP
fldln2; fxch; fyl2x
#NO_APP
fmulp   %st, %st(2)
fxch    %st(1)
#APP
fsqrt
#NO_APP
fld     %st(1)
#APP
fsin
#NO_APP
fxch    %st(2)
#APP
fcos
#NO_APP
So after generating R, a separate fsin and fcos seem to be generated. Am I
missing an option or something?
You should use -ffast-math together with -D__NO_MATH_INLINES in your 
compile flags, or use a newer glibc. -D__NO_MATH_INLINES should also be 
used for -mfpmath=sse to prevent generation of x87 instructions from 
mathinline.h header when SSE code is used for FP math operators. 
Otherwise  xmm reg->mem->x87 reg moves will kill your performance.

Uros.


Re: GCC 4.0, Fast Math, and Acovea

2005-04-29 Thread Uros Bizjak
Hello Scott!
Specifically, the -funsafe-math-optimizations flag doesn't work 
correctly on AMD64 because the default on that platform is 
-mfpmath=sse. Without specifying -mfpmath=387, 
-funsafe-math-optimizations does not generate inline processor 
instructions for most floating-point functions.

Let's put it another way: Manually selecting -mfpmath=387 cuts 
run-times by 50% for programs dependent on functions like sin() and 
sqrt(), as compared to -funsafe-math-optimizations by itself.

It was found that moving data from SSE registers to X87 registers (and 
back) only to call an x87 builtin degrades performance. Because of this, 
x87 builtins are disabled for -mfpmath=sse and a normal libcall is 
issued for sin(), etc functions. If someone wants to use x87 builtins, 
then _all_ math operations should be done in x87 registers to avoid 
costly SSE->x87 moves.

BTW: Does adding -D__NO_MATH_INLINES improve performance for 
-mfpmath=sse? That would be PR19602.

Uros.


i387 control word register definition is missing

2005-05-23 Thread Uros Bizjak
Hello!

It looks like the i387 control word register definition is missing from the register
definitions for the i386 processor. Inside i386.h, we have:

#define FIXED_REGISTERS                                     \
/*ax,dx,cx,bx,si,di,bp,sp,st,st1,st2,st3,st4,st5,st6,st7*/  \
{  0, 0, 0, 0, 0, 0, 0, 1, 0,  0,  0,  0,  0,  0,  0,  0,   \
/*arg,flags,fpsr,dir,frame*/                                \
    1,    1,   1,  1,    1,                                 \
/*xmm0,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7*/                 \
     0,   0,   0,   0,   0,   0,   0,   0,                  \
/*mmx0,mmx1,mmx2,mmx3,mmx4,mmx5,mmx6,mmx7*/                 \
     0,   0,   0,   0,   0,   0,   0,   0,                  \
/*  r8,  r9, r10, r11, r12, r13, r14, r15*/                 \
     2,   2,   2,   2,   2,   2,   2,   2,                  \
/*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/           \
     2,   2,    2,    2,    2,    2,    2,    2}

However, there should be another register defined, i387 control word register,
'fpcr' (Please look at chapter 11.2.1.2 and 11.2.1.3 in
http://webster.cs.ucr.edu/AoA/Windows/HTML/RealArithmetic.html). There are two
instructions in i386.md that actually use fpcr:

(define_insn "x86_fnstcw_1"
  [(set (match_operand:HI 0 "memory_operand" "=m")
(unspec:HI [(reg:HI FPSR_REG)] UNSPEC_FSTCW))]
  "TARGET_80387"
  "fnstcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")])

(define_insn "x86_fldcw_1"
  [(set (reg:HI FPSR_REG)
(unspec:HI [(match_operand:HI 0 "memory_operand" "m")] UNSPEC_FLDCW))]
  "TARGET_80387"
  "fldcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")
   (set_attr "athlon_decode" "vector")])

However, the RTL templates for these two instructions state that they use the i387 STATUS
register, but they should use the i387 CONTROL register. To be correct, a new fixed
register should be introduced:

#define FIXED_REGISTERS                                     \
/*ax,dx,cx,bx,si,di,bp,sp,st,st1,st2,st3,st4,st5,st6,st7*/  \
{  0, 0, 0, 0, 0, 0, 0, 1, 0,  0,  0,  0,  0,  0,  0,  0,   \
/*arg,flags,fpsr,fpcr,dir,frame*/                           \
    1,    1,   1,   1,  1,    1,                            \
...

and the above two insn definitions should be changed to use FPCR_REG. Unfortunately,
some changes would be needed throughout the code (mainly to various register masks
and definitions) to fix this issue, so I would like to ask for opinions on this
change before proceeding.

This change would be needed to get i387 control word switching instructions out
of the loops, i.e.:

for ...

Thanks,
Uros.


i387 control word reg definition is missing

2005-05-23 Thread Uros Bizjak
Hello!

It looks like the i387 control word register definition is missing from the register
definitions for the i386 processor. Inside i386.h, we have:

#define FIXED_REGISTERS                                     \
/*ax,dx,cx,bx,si,di,bp,sp,st,st1,st2,st3,st4,st5,st6,st7*/  \
{  0, 0, 0, 0, 0, 0, 0, 1, 0,  0,  0,  0,  0,  0,  0,  0,   \
/*arg,flags,fpsr,dir,frame*/                                \
    1,    1,   1,  1,    1,                                 \
/*xmm0,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7*/                 \
     0,   0,   0,   0,   0,   0,   0,   0,                  \
/*mmx0,mmx1,mmx2,mmx3,mmx4,mmx5,mmx6,mmx7*/                 \
     0,   0,   0,   0,   0,   0,   0,   0,                  \
/*  r8,  r9, r10, r11, r12, r13, r14, r15*/                 \
     2,   2,   2,   2,   2,   2,   2,   2,                  \
/*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/           \
     2,   2,    2,    2,    2,    2,    2,    2}

However, there should be another register defined, i387 control word register,
'fpcr' (Please look at chapter 11.2.1.2 and 11.2.1.3 in
http://webster.cs.ucr.edu/AoA/Windows/HTML/RealArithmetic.html). There are two
instructions in i386.md that actually use fpcr:

(define_insn "x86_fnstcw_1"
  [(set (match_operand:HI 0 "memory_operand" "=m")
(unspec:HI [(reg:HI FPSR_REG)] UNSPEC_FSTCW))]
  "TARGET_80387"
  "fnstcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")])

(define_insn "x86_fldcw_1"
  [(set (reg:HI FPSR_REG)
(unspec:HI [(match_operand:HI 0 "memory_operand" "m")] UNSPEC_FLDCW))]
  "TARGET_80387"
  "fldcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")
   (set_attr "athlon_decode" "vector")])

However, the RTL templates for these two instructions state that they use the i387 STATUS
register, but they should use the i387 CONTROL register. To be correct, a new fixed
register should be introduced:

#define FIXED_REGISTERS                                     \
/*ax,dx,cx,bx,si,di,bp,sp,st,st1,st2,st3,st4,st5,st6,st7*/  \
{  0, 0, 0, 0, 0, 0, 0, 1, 0,  0,  0,  0,  0,  0,  0,  0,   \
/*arg,flags,fpsr,fpcr,dir,frame*/                           \
    1,    1,   1,   1,  1,    1,                            \
...

and the above two insn definitions should be changed to use FPCR_REG. Unfortunately,
some changes would be needed throughout the code (mainly to various register masks
and definitions) to fix this issue, so I would like to ask for opinions on this
matter before proceeding.

This change is needed to get i387 control word switching instructions in
(float)->(int) conversions out of the loops, i.e.:

int i;
double d;

for (x = 0... {
   i[x] = d[x];
}
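
A compilable version of that loop, as a sketch (the function name and signature are illustrative). On x87 without fisttp, each (int) cast needs the rounding mode set to truncation, so the compiler loads a modified control word before the store and restores the previous one afterwards; hoisting those loads out of the loop is the optimization in question:

```c
/* Convert each double to int with C's truncating semantics.
   Returns out so the result can be inspected inline. */
int *truncate_all(int *out, const double *in, int n)
{
    int x;

    for (x = 0; x < n; x++)
        out[x] = (int)in[x];   /* rounds toward zero */
    return out;
}
```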


Thanks,
Uros.


Re: GCC and Floating-Point

2005-05-24 Thread Uros Bizjak
Hello!

> I'm writing an extensive article about floating-point programming on
> Linux, including a focus on GCC's compilers. This is an outgrowth of
> many debates about topics like -ffast-math and -mfpmath=sse|387, and I
> hope it will prove enlightening for myself and others.

  I would like to point out that for applications that crunch data from the real
world (no infinities or NaNs, where precision is not critical), such as various
simulations, -ffast-math is something that can speed up the application a lot.

Regarding i387, -ffast-math does the following:

- because NaNs and infinities are not present, FP compares are simplified, as
there is no need for bypasses and secondary compares (due to the C0, C2 and C3 bits of
the FP status word), and the cmove instruction can be used in all cases.

- simple builtin functions (sqrt, sin and cos, etc) can be implemented with the hardware
fsqrt, fsin and fcos insns instead of calling library functions.
These can handle arguments up to (something)*2^63, so it is quite enough for
normal apps. This way, the call overhead (saving all FP registers, overhead of the
call insn itself) is eliminated, and as an added bonus, sin and cos of the same
argument are combined into an fsincos insn.

- not-so-simple (I won't use the word complex here :)) builtin functions (exp,
asin, etc) are expanded on RTL level and CSE is used to eliminate duplicate
calculations.

- floor and ceil functions are implemented as builtin functions and further 
simplified to direct conversion, for example:
(int)floor(double) -> __builtin_lfloor(double).
__builtin_lfloor (and similar builtins) can be implemented directly in i387
using fist(p) insn with appropriate rounding control bits set in control word.

- in addition to these target-specific effects, various (otherwise unsafe)
transformations are enabled on middle-level when -ffast-math is used.
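
To illustrate why the rounding control bits matter for the floor case in the list above: a plain (int) cast truncates toward zero, whereas floor rounds toward minus infinity, so the two differ for negative non-integers. The sketch below (floor_to_int is a hypothetical helper, valid only for values that fit in int) spells out in portable C the result that a single fist(p) in round-down mode produces directly:

```c
/* Floor to int without libm: truncate, then step down by one when
   truncation moved a negative non-integer up toward zero. */
int floor_to_int(double x)
{
    int i = (int)x;
    return i - (x < (double)i);
}
```

For example, (int)(-2.5) is -2 while floor_to_int(-2.5) is -3.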

  Due to the outdated i386 ABI, where all FP parameters are passed on the stack, SSE
code does not show all its power when used. When a math library function is
called, SSE regs are pushed on the stack and the called math library function (which is
currently implemented again with i387 insns) pulls these values from the stack to
x87 registers. In contrast, the x86_64 ABI specifies that FP values are passed in
SSE registers, so they avoid costly SSE reg->stack moves. Until the i386 ABI
(together with supporting math functions) is changed to something similar to
x86_64, use of -mfpmath=sse won't show all its power. Another fact is that x87
intrinsics are currently disabled for -mfpmath=sse, because it was shown that
SSE math libraries (with SSE ABI) are faster for x86_64. A somewhat annoying fact
is that intrinsics are also disabled for i386, where we are still waiting for
the ABI to change ;) [Please note that use of SSE intrinsic functions does not rely
on -mfpmath=... settings].

  So, for real-world applications, using i387 with -ffast-math could be
substantially faster than using SSE code. However, the problem lies in the math
library headers. These define a lot of inlined asm functions in the mathinline.h
header (included when math.h is used). These functions interfere with gcc's
builtins, so -D__NO_MATH_INLINES is needed to fix this problem. The situation is
even worse when SSE code is used. Asm inlines from mathinline.h are implemented
using i387 instructions, so these instructions force parameters to move from SSE
registers to x87 regs (via the stack) and the result to move back to an SSE reg the
same way. This can be seen when sin(a + 1.0) is compiled with the math.h header
included. With -mfpmath=sse, SSE->mem->FP reg moves are needed to satisfy the
constraints of the inlined sin() code.

  Because SSE->x87 moves are costly, -mfpmath=sse,387 produces suboptimal code.
This option in fact confuses the register allocator, and the wrong register set is
sometimes chosen. As there are no separate resources for SSE and x87
instructions, the use of -mfpmath=sse,387 is a bit questionable. However, with
-mfpmath=sse, x87 registers could be used for temporary storage in MEM->MEM
moves, and they can even do (double)<->(float) conversions on the fly...

  As an example for this writing, try to benchmark povray with a combination
of the following parameters:

-ffast-math
-mfpmath=sse, -mfpmath=387 (and perhaps -mfpmath=387,sse)
-D__NO_MATH_INLINES (this depends on the version of your libc)

Uros.


Re: i387 control word register definition is missing

2005-05-25 Thread Uros Bizjak
Hello!

> Well you really want both the fpcr and the mxcsr registers, since the fpcr
> only controls the x87 and the mxcsr controls the xmm registers.  Note, in
> adding these registers, you are going to have to go through all of the 
> floating
> point patterns to add (use:HI FPCR_REG) and (use:SI MXCSR_REG) to each and
> every pattern so that the optimizer can be told not to move a floating point
> operation past the setting of the control word.

  I think that (use:...) clauses are needed only for (float)->(int) patterns
(fix_trunc.. & co.). For i386, we could calculate new mode word in advance (this
calculation is inserted by LCM), and fldcw insn is inserted just before
fist/frndint.

(define_insn_and_split "fix_trunc_i387_2"
  [(set (match_operand:X87MODEI12 0 "memory_operand" "=m")
(fix:X87MODEI12 (match_operand 1 "register_operand" "f")))
   (use (match_operand:HI 2 "memory_operand" "m"))
   (use (match_operand:HI 3 "memory_operand" "m"))]
  "TARGET_80387 && !TARGET_FISTTP
   && FLOAT_MODE_P (GET_MODE (operands[1]))
   && !SSE_FLOAT_MODE_P (GET_MODE (operands[1]))"
  "#"
  "reload_completed"
  [(set (reg:HI FPCR_REG)
(unspec:HI [(match_dup 3)] UNSPEC_FLDCW))
   (parallel [(set (match_dup 0) (fix:X87MODEI12 (match_dup 1)))
  (use (reg:HI FPCR_REG))])]
  ""
  [(set_attr "type" "fistp")
   (set_attr "i387_cw" "trunc")
   (set_attr "mode" "")])


(define_insn "*fix_trunc_i387"
  [(set (match_operand:X87MODEI12 0 "memory_operand" "=m")
(fix:X87MODEI12 (match_operand 1 "register_operand" "f")))
   (use (reg:HI FPCR_REG))]
  "TARGET_80387 && !TARGET_FISTTP
   && FLOAT_MODE_P (GET_MODE (operands[1]))
   && !SSE_FLOAT_MODE_P (GET_MODE (operands[1]))"
  "* return output_fix_trunc (insn, operands, 0);"
  [(set_attr "type" "fistp")
   (set_attr "i387_cw" "trunc")
   (set_attr "mode" "")])

I'm trying to use MODE_ENTRY and MODE_EXIT macros to insert mode calculations in
proper places. Currently, I have a somewhat working prototype that switches
between 2 modes: MODE_UNINITIALIZED, MODE_TRUNC (and MODE_ANY). The trick here
is that MODE_ENTRY and MODE_EXIT are defined to MODE_UNINITIALIZED. Secondly,
every asm statement and call insn switches to MODE_UNINITIALIZED, and when mode
is switched _from_ MODE_TRUNC _to_ MODE_UNINITIALIZED before these two
statements (or in exit BBs), an UNSPEC_VOLATILE type fldcw is emitted (again via
LCM) that switches fpu to saved mode. [UNSPEC_VOLATILE is needed to prevent
optimizers from removing this pattern]. So, 2 fldcw patterns are defined:

(define_insn "x86_fldcw_1"
  [(set (reg:HI FPCR_REG)
(unspec:HI [(match_operand:HI 0 "memory_operand" "m")]
 UNSPEC_FLDCW))]
  "TARGET_80387"
  "fldcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")
   (set_attr "athlon_decode" "vector")])

(define_insn "x86_fldcw_2"
  [(set (reg:HI FPCR_REG)
(unspec_volatile:HI [(match_operand:HI 0 "memory_operand" "m")]
  UNSPECV_FLDCW))]
  "TARGET_80387"
  "fldcw\t%0"
  [(set_attr "length" "2")
   (set_attr "mode" "HI")
   (set_attr "unit" "i387")
   (set_attr "athlon_decode" "vector")])

By using this approach, testcase:

int test (int *a, double *x) {
int i;

for (i = 10; i; i--) {
 a[i] = x[i];
}

return 0;
}

is compiled (with -O2 -fomit-frame-pointer -fgcse-after-reload) into:

test:
pushl  %ebx
xorl %edx, %edx
subl $4, %esp
fnstcw 2(%esp) <- store current cw
movl 12(%esp), %ebx
movl 16(%esp), %ecx
movzwl 2(%esp), %eax
orw  $3072, %ax
movw %ax, (%esp)   <- store new cw
.p2align 4,,15
.L2:
fldcw  (%esp)  <- hello? gcse-after-reload?
fldl 80(%ecx,%edx,8)
fistpl 40(%ebx,%edx,4)
decl %edx
cmpl $-10, %edx
jne  .L2
fldcw  2(%esp) <- volatile fldcw in exit block (load stored cw)
xorl %eax, %eax
popl %edx
popl %ebx
ret
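
For readers following the listing above: the `orw $3072, %ax` step sets both bits of the x87 rounding-control field (3072 is 0x0C00, bits 10 and 11), which selects round-toward-zero, i.e. the rounding that a C double-to-int conversion requires. A minimal C sketch of that computation follows; the macro and function names are made up for illustration and are not GCC internals.

```c
#include <stdint.h>

/* Rounding-control field of the x87 control word: bits 10 and 11.
   Both bits set selects round-toward-zero (truncation). */
#define X87_CW_RC_MASK 0x0C00u   /* == 3072, the constant in "orw $3072, %ax" */

/* Hypothetical helper: derive the truncating control word from the saved
   one, as the movzwl/orw/movw sequence in the listing does. */
uint16_t x87_trunc_cw(uint16_t saved_cw)
{
    return (uint16_t)(saved_cw | X87_CW_RC_MASK);
}

/* The C semantics that the fldcw/fistpl pair has to implement. */
int double_to_int(double x)
{
    return (int)x;   /* the fractional part is discarded */
}
```

With the usual power-on control word 0x037F, `x87_trunc_cw` yields 0x0F7F; all other control bits are left untouched, which is why the saved word can simply be reloaded afterwards.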

Another testcase, involving call:

extern double (int a);

int test (double a) {
return  (a);
}

is compiled into:

test:
subl $12, %esp
fnstcw 10(%esp)<- store current control word
fldl 16(%esp)
movzwl 10(%esp), %eax
orw  $3072, %ax
movw %ax, 8(%esp)
fldcw  8(%esp) <- switch fpu to new mode
fistpl (%esp)  <- make conversion
fldcw  10(%esp)<- volatile fldcw before call (load stored cw)
call 
fnstcw 10(%esp)<- rewrite stored control word after call
movzwl 10(%esp), %eax
orw  $3072, %ax
movw %ax, 8(%esp)
fldcw  8(%esp) <- load new
fistpl 4(%esp) <- make conversion
movl 4(%esp), %eax
fldcw  10(%esp)<- volatile fldcw in exit block (load stored cw)
addl $12, %esp
ret

Because ABI specifies th

RE: GCC and Floating-Point

2005-05-25 Thread Uros Bizjak

Hello Evandro!

x87 registers. In contrast, x86_64 ABI specifies that FP 
values are passed in SSE registers, so they avoid costly SSE 
reg->stack moves. Until i386 ABI (together with supporting 
math functions) is changed to something similar to x86_64, 
use of -mfpmath=sse won't show all its power.



Actually, in many cases, SSE did help x86 performance as well.  That
happens in FP-intensive applications which spend a lot of time in loops
when the XMM register set can be used more efficiently than the x87 stack.


 There is an annoying piece of code attached to PR19780
(http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19780), a loop that shuffles
registers around a lot:

 int i;

 real v1x, v1y, v1z;
 real v2x, v2y, v2z;
 real v3x, v3y, v3z;

 for (i = 0; i < 1; i++)
   {
 v3x = v1y * v2z - v1z * v2y;
 v3y = v1z * v2x - v1x * v2z;
 v3z = v1x * v2y - v1y * v2x;

 v1x = v2x;
 v1y = v2y;
 v1z = v2z;

 v2x = v3x;
 v2y = v3y;
 v2z = v3z;
   }
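
To time the fragment above under -mfpmath=sse and -mfpmath=387, it has to be wrapped into something callable. The following is only a sketch under assumptions: `real` is taken to be double, and the `struct vec3` wrapper, the function name and the `iterations` parameter are invented here and do not come from the PR.

```c
typedef double real;   /* assumption: the PR's "real" could also be float */

struct vec3 { real x, y, z; };

/* Runs the loop body from the fragment above `iterations` times and
   writes the final v1 and v2 back to the caller. */
void cross_loop(struct vec3 *v1, struct vec3 *v2, long iterations)
{
    real v1x = v1->x, v1y = v1->y, v1z = v1->z;
    real v2x = v2->x, v2y = v2->y, v2z = v2->z;
    real v3x, v3y, v3z;
    long i;

    for (i = 0; i < iterations; i++)
      {
        /* v3 = v1 x v2 */
        v3x = v1y * v2z - v1z * v2y;
        v3y = v1z * v2x - v1x * v2z;
        v3z = v1x * v2y - v1y * v2x;

        /* rotate: v1 <- v2, v2 <- v3 */
        v1x = v2x;  v1y = v2y;  v1z = v2z;
        v2x = v3x;  v2y = v3y;  v2z = v3z;
      }

    v1->x = v1x;  v1->y = v1y;  v1->z = v1z;
    v2->x = v2x;  v2->y = v2y;  v2->z = v2z;
}
```

Passing the vectors through pointers keeps the compiler from folding the whole loop away at compile time, so both code generators have to do the actual register shuffling.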

 This code could be a perfect example of how the XMM register file beats the x87
reg stack. However, contrary to all expectations, x87 code is 20% faster(!!) /on
p4, but it would be interesting to see this comparison on x86_64, or perhaps on
32bit AMD/. The code structure produced with -mfpmath=sse is the same as the
code structure produced with -mfpmath=387, so IMO there are no register
allocator effects in play.

 I was trying to look into this problem, but at first sight, the code seems
optimal to me...

Uros.



Re: i387 control word register definition is missing

2005-05-25 Thread Uros Bizjak
Quoting Jan Hubicka <[EMAIL PROTECTED]>:

> If you make FPCTR/MXCSR real registers, you will need to add use to all
> the arithmetic and move pattern that would consume quite some memory and
> confuse optimizers.  I think you can get better around simply using volatile
> unspecs inserted by LCM pass  (this would limit scheduling, but I don't
> think it is that big deal)

  Ouch... I wrongly assumed that rounding bits affect only (float)->(int)
patterns - thanks for clearing this up for me! (Perhaps adding a "nearest" i387_cw
attribute to arithmetic/move patterns could be used to switch back to default
rounding?)

> > Unfortunately, in first testcase, fldcw is not moved out of the loop,
> > because
> > fix_trunc_i387_2 is split after gcse-after-reload pass (Is this
> > intentional for gcse-after-reload pass?)
> 
> It is intentional for reload pass.  I guess gcse might be run after
> splitting, but not sure what the interferences are.

  I have added split_all_insns call before gcse_after_reload_main in passes.c.
To my surprise, it didn't break anything, but it also didn't get fldcw out of the
loop.

Uros.


Re: Sine and Cosine Accuracy

2005-05-26 Thread Uros Bizjak

Hello!


Fair enough, so with 64 bit floats you have no right to expect an
accurate answer for sin(2^90).  However, you DO have a right to expect
an answer in the range [-1,+1] rather than the 1.2e+27 that Richard
quoted.  I see no words in the description of
-funsafe-math-optimizations to lead me to expect such a result.

 The source operand to fsin, fcos and fsincos x87 insns must be within 
the range of +-2^63, otherwise the C2 flag is set in the FP status word, 
marking insufficient operand reduction. This limited operand range is the 
reason why fsin & friends are enabled only with 
-funsafe-math-optimizations.
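
As a small illustration of the operand limit described above, here is a C sketch of a range check. The names are invented for this example, and treating the bound as strict (excluding exactly +-2^63) is an assumption on my side.

```c
/* Operand limit quoted above for fsin/fcos/fsincos: magnitude below 2^63. */
#define X87_TRIG_LIMIT 0x1p63

/* Returns 1 when x lies inside the range that fsin accepts directly.
   NaN compares false against everything, so it returns 0 here; NaN and
   infinities need separate handling anyway. */
int x87_trig_operand_ok(double x)
{
    return x > -X87_TRIG_LIMIT && x < X87_TRIG_LIMIT;
}
```

For 2^90 the check fails, so such an argument would have to be reduced first.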


 However, the argument to fsin can be reduced to an acceptable range by 
using fmod builtin. Internally, this builtin is implemented as a very 
tight loop that checks for insufficient reduction, and could reduce 
whatever finite value one wishes.


 Out of curiosity, where could sin(2^90) be needed? It looks rather big 
angle to me.


Uros.


What is wrong with Bugzilla? [Was: Re: GCC and Floating-Point]

2005-05-28 Thread Uros Bizjak

Hello Scott!


I do know this: Many, many scientific and mathematical programmers find
GCC frustrating and annoying, and most of those folk know far more about
numbers than I do. I wish more of these people would feel comfortable
posting to the GCC list, rather than sending private e-mails to my
inbox.

At this point, I wonder what is wrong with Bugzilla, that those 
programmers don't file a proper bug report. If there is a problem with 
GCC that is so annoying to somebody, I think that at least developers 
could be informed about it via their standard channels of communication. 
If there is a specific problem, at least it can be analysed properly and 
perhaps some actions could be taken to fix it. If the problem is indeed 
_that_ big (usually, it is not!), then a workaround could be suggested - 
and this bug, together with a workaround, is documented in bugzilla for 
others, until the problem is properly solved (usually with a testcase). 
I guess that these persons don't know that bug reports are extremely 
important for the development of gcc. The users themselves are actually a 
QA department of open source development ;)


The problem is not that Bugzilla is un-intuitive; it is far from that. 
The users don't file bug reports because they are afraid of filing an 
invalid report or a duplicate. I can't speak for gcc bugmasters, but it 
looks to me that dupes and invalid reports are not that big a problem.


Is perhaps some kind of anonymous account needed (as in Slashdot's case) 
to encourage these users to file bug reports? Their knowledge of specific 
problems could help gcc to become better, so it indeed is a win-win 
situation.



However, the atmosphere of GCC development is... well, let's just
say that my investment in asbestos underware has not been wasted. ;)

I would call it an atmosphere of brainstorming. Different opinions and 
different points of view. The only problem is that words can come across 
differently if people sit 3000 km/miles/whatever apart ;)


Uros.



How to check for MMX registers in function call?

2005-06-10 Thread Uros Bizjak
Hello!

A sse function that gets its parameters via xmm regs and returns its result in
xmm reg is defined as:

__m128
func_sse (__m128 x, __m128 y)
{
  __m128 xmm;

  xmm = _mm_add_ss (x, y);
  return xmm;
}

The RTL code that is used to call this function is produced as:

(call_insn/u:HI 30 29 31 0 (set (reg:V4SF 21 xmm0)
(call (mem:QI (symbol_ref:SI ("func_sse") [flags 0x3] ) [0 S1 A8])
(const_int 0 [0x0]))) 529 {*call_value_0} (insn_list:REG_DEP_TRUE 28
(insn_list:REG_DEP_TRUE 29 (nil)))
(insn_list:REG_RETVAL 28 (expr_list:REG_DEAD (reg:V4SF 22 xmm1 [ xmm1 ])
(expr_list:REG_EH_REGION (const_int -1 [0x])
(nil
(expr_list:REG_DEP_TRUE (use (reg:V4SF 21 xmm0 [ xmm0 ]))
(expr_list:REG_DEP_TRUE (use (reg:V4SF 22 xmm1 [ xmm1 ]))
(nil

To implement a LCM pass to switch FPU between MMX and x87 mode (the example
above is a call to SSE function, currently the call to MMX function is wrong,
see PR21981), the type of registers, used to pass parameters is needed for
MODE_NEEDED macro to insert correct mode for the call - that is FPU_MODE_MMX if
there are MMX registers used and FPU_MODE_X87 otherwise:

  if (entity == I387_FPU_MODE)
{
  if (CALL_P (insn))
{
  if ("mmx registers are used")<<< here we should check for MMX regs
return FPU_MODE_MMX;
  else
return FPU_MODE_X87;
}

  mode = get_attr_unit (insn);

  return (mode == UNIT_I387)
? FPU_MODE_X87 : (mode == UNIT_MMX)
? FPU_MODE_MMX : FPU_MODE_ANY;
}


Secondly, if return value is placed in MMX register, MODE_AFTER after call insn
should set mode state to FPU_MODE_MMX, otherwise to FPU_MODE_X87.
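
The intended decision logic can be modelled outside of GCC as plain C. This is only a sketch: the FPU_MODE_* and UNIT_I387/UNIT_MMX names follow the fragment above, while `struct insn_model`, its flag fields, UNIT_OTHER and the treatment of non-call insns in `mode_after` are assumptions made for the illustration. In particular, the two "mmx" flags stand in for exactly the information this mail asks how to obtain.

```c
enum fpu_mode  { FPU_MODE_X87, FPU_MODE_MMX, FPU_MODE_ANY };
enum insn_unit { UNIT_OTHER, UNIT_I387, UNIT_MMX };

/* Hypothetical stand-in for an insn; not a GCC data structure. */
struct insn_model {
    int is_call;              /* nonzero for a call insn */
    int args_in_mmx_regs;     /* call passes parameters in MMX regs */
    int result_in_mmx_reg;    /* call returns its value in an MMX reg */
    enum insn_unit unit;      /* value of the "unit" attribute */
};

/* Mode the FPU must be in before the insn executes. */
enum fpu_mode mode_needed(const struct insn_model *insn)
{
    if (insn->is_call)
        return insn->args_in_mmx_regs ? FPU_MODE_MMX : FPU_MODE_X87;

    return (insn->unit == UNIT_I387) ? FPU_MODE_X87
         : (insn->unit == UNIT_MMX)  ? FPU_MODE_MMX
         : FPU_MODE_ANY;
}

/* Mode the FPU is in after the insn has executed. */
enum fpu_mode mode_after(const struct insn_model *insn, enum fpu_mode before)
{
    enum fpu_mode needed;

    if (insn->is_call)
        return insn->result_in_mmx_reg ? FPU_MODE_MMX : FPU_MODE_X87;

    /* Assumption: a non-call insn leaves the FPU in the mode it needed,
       or in the previous mode if it did not care. */
    needed = mode_needed(insn);
    return needed == FPU_MODE_ANY ? before : needed;
}
```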

To properly implement this switching scheme, I would like to ask, what is the
proper way to check if MMX register is used as a parameter passing register in
the call, and how to check if MMX register is used to hold return value. This
information is needed to properly calculate MODE_NEEDED and MODE_AFTER values
for function call in LCM pass.

In the function itself, we can handle entry and exit mode using
MODE_ENTRY and MODE_EXIT macros. Entry mode would be set to FPU_MODE_MMX if MMX
registers are used for parameter passing and exit mode would be set to
FPU_MODE_MMX if MMX reg is used to hold return value. Otherwise, they would both
be set to FPU_MODE_X87.

I would like to obtain the same information in function itself to
properly set MODE_ENTRY and MODE_EXIT in LCM pass. Is there a recommended
approach on how to get this information?

Thanks in advance,

Uros.


Big differences on SpecFP results for gcc and icc

2005-06-12 Thread Uros Bizjak

Hello!

There is an interesting comparison of SPEC scores between gcc and icc: 
http://people.redhat.com/dnovillo/spec2000.i686/gcc/individual-run-ratio.html 
. A quick look at the graphs shows big differences in achieved scores 
between gcc and icc, mostly in SpecFP tests. I was trying to find some 
information on this matter, but none can be found in the archives on 
gcc's site.


Some interesting examples are:
-177.mesa (this is a C test), where icc is almost 40% faster
-178.galgel, where icc is again 40% faster
-179.art, where llvm is more than 1.5x faster than both gcc and icc
-187.facerec, where icc is 100% faster than gcc
-189.lucas, where icc is 60% faster

I know that these graphs don't show the results of the most aggressive 
optimization options for gcc, but that is also the case with icc (only 
-O2). However, it looks like gcc and icc are not even in the same class 
regarding FP performance. Perhaps there are some critical optimizations 
that are not present in gcc?


I think I'm not the only person who finds these results rather 
"disappointing". As Scott is currently writing a paper on gcc's FP 
performance, perhaps someone has an explanation why gcc's results are 
so low on Pentium4 for these tests?


Uros.


Re: req. help on merging instructions

2005-06-14 Thread Uros Bizjak
Hello!

> Iam trying to merge the following two instructions

> 1. addu r2, r3,r4
> 2. ld   r5 ,mem(r2) # load from address calculated
> in the prev. instruction


> in to one single isntruction.

> 3. ldx  r5 , mem(r3(r4)) # indexed load.


> I managed to do it with a define_peephole pattern
> in the md file. But I want this to happen only in the case
> when the last use of register r2 is in statement2 (i.e it isn't
> live after stmt 2.)otherwise the r2 value would go incorrect

You could use peep2_reg_dead_p() or peep2_regno_dead_p() functions in the
condition of a define_peephole2 pattern. There are many examples of their use
in i386.md.

Uros.


Re: How can I create a const rtx other than 0, 1, 2

2005-07-22 Thread Uros Bizjak
Hello!

> There's const0_rtx, const1_rtx and const2_rtx. How can I create a
> const rtx other than 0, 1, 2? I want to use it in md file, like

> operand[1] = 111.

> I know I must use const rtx here. How can I do it? A simple question,
> but just no idea where to find the answer.

operands[1] = GEN_INT (111);

Uros.


Re: rfa (x86): 387<=>sse moves

2005-07-31 Thread Uros Bizjak

Hello!


With -march=pentium4 -mfpmath=sse -O2, we get an extra move for code like

   double d = atof(foo);
   int i = d;


   call    atof
   fstpl   -8(%ebp)
   movsd   -8(%ebp), %xmm0
   cvttsd2si   %xmm0, %eax


(This is Linux, Darwin is similar.) I think the difficulty is that for


This problem is similar to the problem, described in PR target/19398. 
There is another testcase and a small analysis in the PR that might help 
with this problem.


Uros.


Re: gcc 4.0.x: MMX built-ins regression

2005-08-29 Thread Uros Bizjak
Hello!

> I am using MMX built-ins and gcc-4.0-20050825 and I am experiencing generation
> of uneeded movq (at least I guess so, I am no assembler pro). I don't know
> which gcc snapshot introduced this, but a I know that some pre-release gcc 4.0
> didn't show this bad behaviour. (It's been some time I played with this...)

> Just shout, if you need anything else.

Yes, a bugreport would be nice. Please look at http://gcc.gnu.org/bugs.html .

I have extracted a testcase from your source/assembly mix and with
'gcc version 4.1.0 20050716 (experimental)' the code looks OK to me:

gcc -O3 -march=athlon-xp:

MixAudio16_MMX_T:
pushl       %ebp
movl        %esp, %ebp
movl        8(%ebp), %eax
movq        (%eax), %mm3
movl        m.1485, %eax
movq        %mm3, %mm4
movq        (%eax), %mm1
movq        %mm1, %mm0
movq        (%eax), %mm2
movl        12(%ebp), %eax
pand        %mm3, %mm0
pcmpeqw     %mm2, %mm0
punpcklwd   %mm0, %mm4
punpckhwd   %mm0, %mm3
movq        (%eax), %mm0
pand        %mm0, %mm1
movl        16(%ebp), %eax
pcmpeqw     %mm1, %mm2
movq        %mm0, %mm1
punpckhwd   %mm2, %mm0
punpcklwd   %mm2, %mm1
paddd       %mm3, %mm0
paddd       %mm4, %mm1
packssdw    %mm0, %mm1
movq        %mm1, (%eax)
femms
leave
ret

(Sorry, I have no gcc 4.0.x here.)

Uros.


Re: Regarding bug 22480, vectorisation of shifts

2005-09-02 Thread Uros Bizjak
Hello Paolo!

> Heh, I'm quite at a loss regarding PR22480.  I don't know exactly what
> to do because i386 does not support, e.g. { 2, 4 } << { 1, 2 } (which
> would give {4, 16} as a result).  There is indeed a back-end problem,
> because ashl3 is supposed to have two operands of the same mode,
> not one vector and one SI!
> 
I have changed a bit your proposed patterns:

--cut here--
(define_predicate "vec_shift_operand"
 (and (ior (match_code "reg")
   (match_code "const_vector"))
  (match_test "GET_MODE_CLASS (mode) == MODE_VECTOR_INT"))
{
 unsigned elt = GET_MODE_NUNITS (mode) - 1;
 HOST_WIDE_INT ref;

 if (GET_CODE (op) == CONST_VECTOR)
   {
 ref = INTVAL (CONST_VECTOR_ELT (op, elt));

 while (elt--)
   if (INTVAL (CONST_VECTOR_ELT (op, elt)) != ref)
 return 0;
   }
 return 1;
})

(define_expand "ashl3"
 [(set (match_operand:SSEMODE248 0 "register_operand" "")
   (ashift:SSEMODE248
 (match_operand:SSEMODE248 1 "register_operand" "")
 (match_operand:SSEMODE248 2 "vec_shift_operand" "")))]
 "TARGET_SSE2"
{
  if (GET_CODE (operands[2]) == CONST_VECTOR)
operands[2] = CONST_VECTOR_ELT (operands[2], 0);
  else
operands[2] = gen_lowpart (SImode, operands[2]);
})

(define_insn "sse_psll3"
 [(set (match_operand:SSEMODE248 0 "register_operand" "=x")
   (ashift:SSEMODE248
 (match_operand:SSEMODE248 1 "register_operand" "0")
 (match_operand:SI 2 "nonmemory_operand" "xi")))]
 "TARGET_SSE2"
 "psll\t{%2, %0|%0, %2}"
 [(set_attr "type" "sseishft")
  (set_attr "mode" "TI")])

--cut here--

> 
> This however will not fix "a << b" shifts, which right now should (my
> guess) ICE with something similar to PR22480.

With above patterns, I tried to solve the a << b shifts. The proposed
solution is to generate a SImode lowpart out of SImode vector, to get
only element[0] value, in the hope that all elements are equal (this
is true for pr22480.c testcases).
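
To make the "all elements are equal" assumption concrete, here is a scalar C model of the two pieces involved: the uniformity test on the shift counts and a pslld-style shift that applies one count to every element. It is a sketch only; the function names are invented and nothing here uses GCC's RTL interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every element of counts[0..n-1] holds the same value,
   i.e. when a per-element vector shift can be replaced by a shift with a
   single scalar count. An empty vector is treated as uniform. */
int shift_counts_uniform(const uint32_t *counts, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (counts[i] != counts[0])
            return 0;
    return 1;
}

/* Shifts every element left by the same count, like pslld does.
   Counts above 31 produce zero, matching the SSE2 instruction. */
void shift_left_uniform(uint32_t *v, size_t n, uint32_t count)
{
    size_t i;

    for (i = 0; i < n; i++)
        v[i] = (count < 32) ? (v[i] << count) : 0;
}
```

With counts { 1, 2 } the first function reports non-uniform, which is the { 2, 4 } << { 1, 2 } case that i386 cannot do in a single instruction.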

For following testcase:

void
test_1 (void)
{
  static unsigned bm[16];
  int j;
  for (j = 0; j < 16; j++)
bm[j] <<= 8;
}

void
test_2 (int a)
{
  static unsigned bm[16];
  int j;
  for (j = 0; j < 16; j++)
bm[j] <<= a;
}

I was able to generate following code (gcc -O2 -msse2 -ftree-vectorize
-fomit-frame-pointer):

test_1:
movl    $bm.1591, %eax
movl    $bm.1591, %edx
.p2align 4,,15
.L2:
movdqa  (%eax), %xmm0
addl    $4, %edx
pslld   $8, %xmm0
movdqa  %xmm0, (%eax)
addl    $16, %eax
cmpl    $bm.1591+16, %edx
jne     .L2
ret

test_2:
subl    $28, %esp
movl    $bm.1602, %eax
movl    $bm.1602, %edx
movd    32(%esp), %xmm0
pshufd  $0, %xmm0, %xmm0
movdqa  %xmm0, (%esp)
.p2align 4,,15
.L9:
movdqa  (%eax), %xmm0
movd    (%esp), %xmm1
addl    $4, %edx
pslld   %xmm1, %xmm0
movdqa  %xmm0, (%eax)
addl    $16, %eax
cmpl    $bm.1602+16, %edx
jne     .L9
addl    $28, %esp
ret

A couple of (unrelated) problems can be also seen with above code.
First one is PR target/22479
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22497, where register is
wasted in a vectorised loop, the other problem is that movd is not
pushed out of the loop.

In this case, gcc should be able to optimize:

movd    32(%esp), %xmm0
pshufd  $0, %xmm0, %xmm0
movdqa  %xmm0, (%esp)
...
movd    (%esp), %xmm1

into
movd    32(%esp), %xmm1

BTW: This change breaks gcc.dg/i386-sse-6.c, because a
__builtin_ia32_psllwi128 still calls ashl3 with integer
parameter. Otherwise there are no other regressions.

> By the way, it is time to remove the mmx_ prefix from the MMX insns!

Not yet... emms patch depends heavily on optimize_mode_switching
functionality, however o_m_s should be enhanced not to insert
switching insns into the loops.

Uros.


Re: Should -msse3 enable fisttp?

2005-10-04 Thread Uros Bizjak
Quoting Evan Cheng <[EMAIL PROTECTED]>:

> Let me know what you think. I kind of agree with your argument. But  
> for practical reasons I thinkg -msse3 should enable fisttp. Certainly  
> here in Apple, a few folks have been surprised by this.

  Following simple patch should implement your suggested approach:

-march=prescott            enables fisttp
-msse3                     enables fisttp
-march=prescott -mno-sse3  enables fisttp

Otherwise fisttp is disabled.

2005-10-05  Uros Bizjak  <[EMAIL PROTECTED]>

* config/i386/i386.h (TARGET_FISTTP): Enable for TARGET_SSE3.

Uros.


Re: Should -msse3 enable fisttp?

2005-10-04 Thread Uros Bizjak
Quoting Uros Bizjak <[EMAIL PROTECTED]>:

>   Following simple patch should implement your suggested approach:
> 
> -march=prescott            enables fisttp
> -msse3                     enables fisttp
> -march=prescott -mno-sse3  enables fisttp
> 
> Otherwise fisttp is disabled.

This one also works for -mno-80387 and simplifies insn pattern constraints a 
bit:

2005-10-05  Uros Bizjak  <[EMAIL PROTECTED]>

* config/i386/i386.h (TARGET_FISTTP): Enable also for
TARGET_SSE3 and only for TARGET_80387.
* config/i386/i386.md (fix_trunc_fisttp_i387_1,
(fix_trunc_i387_fisttp, fix_trunc_i387_fisttp_with_temp):
Do not depend on TARGET_80387.

Uros.

fisttp_2.diff
Description: Binary data


[BENCHMARK] runtime impact of fix for target/17390 on i386 targets

2005-10-20 Thread Uros Bizjak
Quoting steven at gcc dot gnu dot org <[EMAIL PROTECTED]>:

> --- Comment #5 from steven at gcc dot gnu dot org  2005-10-19 13:13
> ---
> That patch is yet another example of why we constantly keep having compile
> time
> problems.  Just add more, and more, and more, and more.  And act surprised
> when
> someone notices that gcc 4.1 is four times as slow as 2.95.3 for very little
> benefit on x86 at least.
> 
> Why a new pass over all insns, instead of e.g. teaching postreload cse or
> postreload gcse about this kind of thing?

  The problem is in funny i387 compare FP sequences, that are implemented by a
sequence of AX setting compare and CC setting test before jump insn.

I would like to show the compile time and runtime impact of this patch:

compiling povray-3.6.1 source directory without patch where
CXXFLAGS =  -Wno-multichar -march=i586 -mtune=i686 -O3 -D__NO_MATH_INLINES
-ffast-math -malign-double -minline-all-stringops 

I got following times:

1m37.986
1m37.139
1m38.410

And _with_ patch:

1m37.264
1m37.352
1m37.383

I would say that the difference is buried in noise.

The runtime impact (./povray -display=NONE abyss.pov)

without patch (compile flags as above):
0m16.035
0m16.006
0m15.998

0m16.013

And with patch (compile flags as above):
0m14.551
0m14.543
0m14.566

0m14.553

Yes, when povray-3.6.1 is compiled with patched gcc-4.1, it is slightly more
than 10% (!!!) faster, which is by far the best result I've ever seen here.
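
For anyone who wants to redo the arithmetic on the timings above: the "10%" figure is the ratio of the two averaged run times, while the reduction in wall-clock time is a bit over 9%. A small C sketch of both calculations (the function names are just for this example):

```c
/* Average of three timing runs, in seconds. */
double mean3(double a, double b, double c)
{
    return (a + b + c) / 3.0;
}

/* How much faster the new binary is, as a percentage of the new time. */
double speedup_percent(double t_before, double t_after)
{
    return (t_before / t_after - 1.0) * 100.0;
}

/* How much run time was saved, as a percentage of the old time. */
double time_saved_percent(double t_before, double t_after)
{
    return (1.0 - t_after / t_before) * 100.0;
}
```

Feeding in the three runs without the patch (16.035, 16.006, 15.998) and with it (14.551, 14.543, 14.566) gives a speedup of roughly 10.03% and a time saving of roughly 9.1%.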

In both cases, gcc was patched with a patch for recip problems
(http://gcc.gnu.org/ml/gcc-patches/2005-10/msg00687.html), which fixes PR/23948
and PR/24123 and is a _MUST_ for any -ffast-math compilation.

Any comment on these numbers?

Uros.


Re: [BENCHMARK] runtime impact of fix for target/17390 on i386 targets

2005-10-20 Thread Uros Bizjak
Hello Steven!

> And FWIW, it is IMHO bad practice in general to just add new passes,
> instead of investigating why existing passes don't do the job and how
> they can be enhanced to do the job better.

There is no post-reload cse_condition_code_reg () pass, so perhaps we
have to add one. A cse_condition_code_reg () walks all instructions by
itself, so I'm not sure if some existing post-reload CSE pass could be
enhanced.

> It's also not a particularly great idea to duplicate a lot of code (like
> you did, from the CC-cse pass), and I thought machine-specific
> optimization passes are a no-no unless there really, _really_ is no way
> to do the optimization elsewhere in the shared code.

I don't think a lot of code is duplicated, perhaps only checks for
clobbered registers between two insn (that are necessary). The insn
walking code is quite generic.

Regarding specific i386 implementation: fp_jcc_* patterns are split
to their final sequences somewhere in flow2 pass, so I think the best
place to put a generic CC-cse pass is somewhere between _.flow2 and
_.csa pass.

The benefit of this approach would be correct calculation of register
lives by following passes, and this way we could get rid of extra fstp
instructions that are necessary in post-regstack optimization pass, 
if popping compare insn was removed.

However, proposed code is so much i386 target specific, that I was
under impression that it could be put into machine specific pass. I
think that this code is not of any use to other targets, so perhaps
this code should be implemented as a target hook that is called
somewhere after post-reload split in _.flow2 pass?

(BTW: Is there already an appropriate target hook available that is called
after post-reload split?)

Thanks for your suggestions,
Uros.


Re: [BENCHMARK] runtime impact of fix for target/17390 on i386 targets

2005-10-20 Thread Uros Bizjak

Ian Lance Taylor wrote:

> Uros Bizjak <[EMAIL PROTECTED]> writes:
>
> > There is no post-reload cse_condition_code_reg () pass, so perhaps we
> > have to add one. A cse_condition_code_reg () walks all instructions by
> > itself, so I'm not sure if some existing post-reload CSE pass could be
> > enhanced.
>
> The cse_condition_code_reg pass doesn't walk all the instructions.  It
> walks the basic blocks, and looks at the last instruction in each
> basic block.  When it finds an optimization opportunity, it looks at
> more instructions, but usually only a few more.

Uh, I was trying to say that cse_condition_code_reg pass is not similar 
to any other CSE pass. Following this, existing post-reload CSE passes 
can not be simply enhanced to handle interdependent AX reg and condition 
code reg eliminations.


The proposed patch works like cse_condition_code_reg. It looks at the 
last instruction, and walks up in instruction chain until it hits CC 
setting instruction. It continues up until it hits the AX setting 
compare. Usually, CC setting insn is just a couple of insns above, and 
the walk stops relatively quickly.


Perhaps the problem could be in the successor blocks. We have to walk 
down the insn chain in hope to find matching CC setting or AX setting 
compare instruction, checking if compare args, AX reg or CC reg are 
clobbered by any instruction on the way. If AX reg and CC reg gets 
clobbered, the walk stops as there is no further optimization possible, 
and - if there are no further FP compares - this usually happens quite 
soon down the  insn chain.
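
The downward walk described above can be pictured with a toy model in C. This is not the proposed patch and uses no GCC data structures; `struct chain_insn`, the insn kinds and the integer operand ids are all invented for the illustration. The walk starts at an FP compare and stops as soon as anything clobbers the compare arguments, AX or CC.

```c
enum insn_kind {
    INSN_OTHER,        /* touches neither the compare args, AX nor CC */
    INSN_FP_COMPARE,   /* AX setting compare plus CC setting test */
    INSN_CLOBBER       /* clobbers a compare arg, AX or CC */
};

/* Hypothetical stand-in for an insn in the chain. */
struct chain_insn {
    enum insn_kind kind;
    int op0, op1;      /* compare operands, as opaque ids */
};

/* Walks down the chain from the compare at index `first` and returns the
   index of the next compare with the same operands, or -1 if a clobber
   (or a compare of other operands, which overwrites AX and CC) comes
   first or the chain ends. */
int find_redundant_compare(const struct chain_insn *chain, int n, int first)
{
    int i;

    if (first < 0 || first >= n || chain[first].kind != INSN_FP_COMPARE)
        return -1;

    for (i = first + 1; i < n; i++)
      {
        if (chain[i].kind == INSN_CLOBBER)
            return -1;
        if (chain[i].kind == INSN_FP_COMPARE)
            return (chain[i].op0 == chain[first].op0
                    && chain[i].op1 == chain[first].op1) ? i : -1;
      }
    return -1;
}
```

In the model, as in the description, the cost of the walk is bounded by the distance to the next clobber, which is usually short.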


Since this patch only deletes instructions, it can't make run-time 
performance worse. Also, there is no impact on register lives as only 
truly redundant instructions are deleted.


I agree with Steven that mach-reorg is not a good home for this code 
(after reg-stack pass, we have to emit fstps for eliminated register 
popping compares). Is there a recommended approach (target hooks?) on 
how to call this kind of target-dependent function from RTL passes 
(after first post-reload split)?


Thanks,
Uros.


Re: svn feature request: print URL in diff output

2005-11-08 Thread Uros Bizjak
Paolo Bonzini wrote:

> I would like that svn print the URL of each file in the diff output, like
> CVS's `RCS file'. One of the scripts I use to test GCC (which I have not
> contributed yet because of the svn transition) used it to detect the
> directory in which the patch should apply.

Is it possible also to include a "diff" line (as it was the case with CVS diff)
that shows the diff command used to produce the patch?

BTW: Is there a way to include a C function heading in diff output? I have tried

'svn diff -x -p' to get:
svn: '-p' is not supported

Thanks,
Uros.


'gcc -whatever' unrecognized option returns 0

2005-11-15 Thread Uros Bizjak
Hello!

libgomp's configure checks -pthread option using following commands:

--cut here--
# Check to see if -pthread or -lpthread is needed.  Prefer the former.
XPCFLAGS=""
CFLAGS="$CFLAGS -pthread"
AC_LINK_IFELSE(
 [AC_LANG_PROGRAM(
  [#include 
   void *g(void *d) { return NULL; }],
  [pthread_t t; pthread_create(&t,NULL,g,NULL);])],
 [XPCFLAGS=" -Wc,-pthread"],
 [CFLAGS="$save_CFLAGS" LIBS="-lpthread $LIBS"
  AC_LINK_IFELSE(
   [AC_LANG_PROGRAM(
[#include 
 void *g(void *d) { return NULL; }],
[pthread_t t; pthread_create(&t,NULL,g,NULL);])],
   [],
   [AC_MSG_ERROR([Pthreads are required to build libgomp])])])
--cut here--

On solaris-2.8, this results in:

configure:5636: /home/uros/gcc-build-gomp/./gcc/xgcc
-B/home/uros/gcc-build-gomp/./gcc/ -B/usr/local/sparc-sun-solaris2.8/bin/
-B/usr/local/sparc-sun-solaris2.8/lib/ -isystem
/usr/local/sparc-sun-solaris2.8/include -isystem
/usr/local/sparc-sun-solaris2.8/sys-include -o conftest -O2 -g -O2   -pthread  
conftest.c  >&5
xgcc: unrecognized option '-pthread'
configure:5642: $? = 0
configure:5646: test -z 
 || test ! -s conftest.err
configure:5649: $? = 0
configure:5652: test -s conftest
configure:5655: $? = 0

Please note, that -pthread is not a valid option for gcc on solaris-2.8 target.
However, the check above does not finish in AC_MSG_ERROR. [It looks that
conftest executable remains undeleted from previous checks.] This results in
wrong option '-pthread' passed during libgomp compilation.

The fact that gcc returns 0 for unrecognized option is a bit strange:

bash-2.03$ gcc -whatever p.c
gcc: unrecognized option `-whatever'
bash-2.03$ echo $?
0

Is this intentional?

Thanks,
Uros.



Re: overcoming info build failures

2005-11-24 Thread Uros Bizjak
Hello!

> Mark Mitchell's @file documentation change adds a @set directive to
> gcc-vers.texi in the build directory, but that file only depends on
> DEV-PHASE and BASE-VER, so it will never be correctly rebuilt using
> the new make rule.  Just deleting it will remedy the problem.

Another problem is in the fact that value references in @include commands are
expanded only from texinfo version 4.4 and newer. This means, that version 4.4
or higher of texinfo is _required_ to build the documentation.

Interested people could check texinfo's ChangeLog.46 for the changeset:

2003-01-12<[EMAIL PROTECTED]>

...

* makeinfo/cmds.c (handle_include): call text_expansion on the
filename, so @value constructs are expanded.

* doc/texinfo.txi (verbatiminclude, Using Include Files): mention
@value expansion.

...

Using texinfo 4.2, bootstrap fails with:

if [ xinfo = xinfo ]; then \
makeinfo --split-size=500 --split-size=500 --no-split -I . -I
../../gcc-svn/trunk/gcc/doc \
-I ../../gcc-svn/trunk/gcc/doc/include -o doc/gcc.info
../../gcc-svn/trunk/gcc/doc/gcc.texi; \
fi
../../gcc-svn/trunk/gcc/doc/invoke.texi:1057: @include
[EMAIL PROTECTED]/../libiberty/at-file.texi': No such file or directory.
makeinfo: Removing output file `doc/gcc.info' due to errors; use --force to
preserve.
gmake[2]: *** [doc/gcc.info] Error 2

Attached (untested) diff should update the required version of texinfo.

Uros. 


build.diff
Description: Binary data


contrib/gcc_update does not work

2020-01-14 Thread Uros Bizjak
gcc_update, when called from newly initialized and pulled tree does not work:

--cut here--
$ contrib/gcc_update
Updating GIT tree
There is no tracking information for the current branch.
Please specify which branch you want to rebase against.
See git-pull(1) for details.

git pull  

If you wish to set tracking information for this branch you can do so with:

git branch --set-upstream-to=origin/ master

Adjusting file timestamps
git pull of full tree failed.
--cut here--

I would also appreciate simple step-by-step instructions on how to
set up the local repo and basic workflow with git for "non-power"
users, as was the case with now obsolete instructions for anonymous
SVN r/o access [1] and r/w access [2]. Basically, a push-my-first-patch
example.

[1] https://gcc.gnu.org/svn.html
[2] https://gcc.gnu.org/svnwrite.html

Thanks,
Uros.


Re: contrib/gcc_update does not work

2020-01-14 Thread Uros Bizjak
On Tue, Jan 14, 2020 at 11:34 AM Jonathan Wakely  wrote:
>
> On Tue, 14 Jan 2020 at 09:22, Uros Bizjak  wrote:
> >
> > gcc_update, when called from newly initialized and pulled tree does not 
> > work:
>
> Initialized how?

  mkdir gcc
  cd gcc
  git init
  git pull https://gcc.gnu.org/git/gcc.git

> If you do a 'git clone' then it correctly checks out master and sets
> it to track origin/master.

I see, I'll try this now.

>
> >
> > --cut here--
> > $ contrib/gcc_update
> > Updating GIT tree
> > There is no tracking information for the current branch.
> > Please specify which branch you want to rebase against.
> > See git-pull(1) for details.
> >
> > git pull <remote> <branch>
> >
> > If you wish to set tracking information for this branch you can do so with:
> >
> > git branch --set-upstream-to=origin/<branch> master
> >
> > Adjusting file timestamps
> > git pull of full tree failed.
> > --cut here--
> >
> > I would also appreciate a simple step-by step instructions on how to
> > set-up the local repo and basic workflow with git for "non-power"
> > users, as was the case with now obsolete instructions for anonymous
> > SVN r/o access [1] and r/w access [2]. Basically, push-my-first-patch
> > example.
> >
> > [1] https://gcc.gnu.org/svn.html
> > [2] https://gcc.gnu.org/svnwrite.html
>
> They're still a work in progress:
> https://gcc.gnu.org/git.html
> https://gcc.gnu.org/gitwrite.html

Yes, this is the information I was looking for. Sorry for being impatient ;)

Thanks,
Uros.


New build warning: implicit declaration of function ‘asprintf’ when building lto-plugin.o

2015-06-29 Thread Uros Bizjak
Hello!

A recent commit introduced the following build warning:

/home/uros/gcc-svn/trunk/lto-plugin/lto-plugin.c: In function
‘claim_file_handler’:
/home/uros/gcc-svn/trunk/lto-plugin/lto-plugin.c:930:16: warning:
implicit declaration of function ‘asprintf’
[-Wimplicit-function-declaration]
  t = hi ? asprintf (&objname, "%s@0x%x%08x", file->name, lo, hi)

I speculate that this is due to the recent header reorganisation. Author CC'd.

Uros.


Re: Making __builtin_signbit type-generic

2015-07-06 Thread Uros Bizjak
On Mon, Jul 6, 2015 at 11:49 AM, FX  wrote:

> Many of the floating point-related builtins are type-generic, including
> __builtin_{isfinite,isinf_sign,isinf,isnan,isnormal,isgreater,isless,isunordered}.
> However, __builtin_signbit is not. It would make life easier for the
> implementation of IEEE in Fortran if it were, and probably for some other
> stuff too (PR 36757).
>
> I don’t know where to start and how to achieve that, though. Could someone 
> who knows this middle-end FP stuff help?

Please look at builtins.def, grep for TYPEGENERIC.

Uros.
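
To illustrate the distinction FX describes, here is a minimal sketch, assuming
a GCC-compatible compiler. The wrapper names (is_nan_*, is_negative_*) are
hypothetical and exist only for this example. The classification builtin
accepts any floating-point type, while the sign bit is queried here through
the per-type variants:

```c
#include <stdbool.h>

/* Type-generic builtin: the same name works for float, double and
   long double arguments.  */
bool is_nan_f (float x)        { return __builtin_isnan (x); }
bool is_nan_d (double x)       { return __builtin_isnan (x); }
bool is_nan_ld (long double x) { return __builtin_isnan (x); }

/* Per-type sign-bit builtins: one suffixed variant for each type.
   The wrappers normalize the nonzero result to a bool.  */
bool is_negative_f (float x)        { return __builtin_signbitf (x) != 0; }
bool is_negative_d (double x)       { return __builtin_signbit (x) != 0; }
bool is_negative_ld (long double x) { return __builtin_signbitl (x) != 0; }
```

Note that is_negative_f (-0.0f) is true: the sign bit distinguishes negative
zero, which an ordinary "x < 0" comparison does not.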


Proposal to postpone release of 5.2 for a week [Was: Re: patch to fix PR66782]

2015-07-09 Thread Uros Bizjak
Hello!

> The patch was bootstrapped and tested on x86/x86-64.
>
> Committed as rev. 225618.
>
> 2015-07-09  Vladimir Makarov  
>
> PR rtl-optimization/66782
> * lra-int.h (struct lra_insn_recog_data): Add comment about
> clobbered hard regs for arg_hard_regs.
> * lra.c (lra_set_insn_recog_data): Add clobbered hard regs.
> * lra-lives.c (process_bb_lives): Process clobbered hard regs.
> Add condition for processing used hard regs.
> * lra-constraints.c (update_ebb_live_info, inherit_in_ebb):
> Process clobbered hard regs.

I would like to nominate this patch for the gcc-5.2 release. According to
the downstream bug report [1], gcc-5.1 is unusable for 64-bit Wine:

"Breaks all of wine, no easy workaround -> blocker."

Due to the severity of this bug and the importance of Wine, I'd like to
postpone the 5.2 release for a week, so that the fix gets some testing in
the mainline before it is backported to the gcc-5 branch.

[1] https://bugs.winehq.org/show_bug.cgi?id=38653

Uros.


C++11 support still experimental?

2015-11-21 Thread Uros Bizjak
[1] still says in its third paragraph:

--q--
Important: GCC's support for C++11 is still experimental. Some
features were implemented based on early proposals, and no attempt
will be made to maintain backward compatibility when they are updated
to match the final C++11 standard.
--/q--

[1] https://gcc.gnu.org/projects/cxx0x.html

Uros.


Re: Question on TARGET_MMX and X86_TUNE_GENERAL_REGS_SSE_SPILL

2016-04-27 Thread Uros Bizjak
On Wed, Apr 27, 2016 at 4:26 PM, Ilya Enkovich  wrote:

>>> >> > X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE
>>> >> > regs
>>> >> instead of memory.
>>> >> >
>>> >> > I tried enabling the above tuning with -march=bdver4 -Ofast -mtune-
>>> >> ctrl=general_regs_sse_spill.
>>> >> > I did not find any code differences.
>>> >> >
>>> >> > Looking at the below code to enable this tune,  mmx ISA needs to be
>>> >> > turned
>>> >> off.
>>> >> >
>>> >> > static reg_class_t
>>> >> > ix86_spill_class (reg_class_t rclass, machine_mode mode) {
>>> >> >   if (TARGET_SSE && TARGET_GENERAL_REGS_SSE_SPILL && !
>>> >> TARGET_MMX
>>> >> >   && (mode == SImode || (TARGET_64BIT && mode == DImode))
>>> >> >   && rclass != NO_REGS && INTEGER_CLASS_P (rclass))
>>> >> > return ALL_SSE_REGS;
>>> >> >   return NO_REGS;
>>> >> > }
>>> >> >
>>> >> > All processor variants enable MMX by default  and why we need to
>>> >> > switch
>>> >> off mmx?
>>> >>
>>> >> That really looks weird to me.  I ran SPEC2006 on Ofast + LTO with
>>> >> and without -mno-mmx and -mno-mmx gives (Haswell machine):
>>> >>
>>> >> SPEC2006INT :+0.30%
>>> >> SPEC2006FP  :+0.60%
>>> >> SPEC2006ALL :+0.48%
>>> >>
>>> >> Which is quite surprising for disabling a hardware feature hardly
>>> >> used anywhere now.
>>> >
>>> > As I said without mmx (-mno-mmx), the tune
>>> X86_TUNE_GENERAL_REGS_SSE_SPILL may be active now.
>>> > Not sure if there are any other reason.
>>>
>>> Surely that should be the main reason I see performance gain.
>>> So I want to ask the same question as you did: why does this important
>>> performance feature requires disabled MMX.  This restriction exists from the
>>> very start of X86_TUNE_GENERAL_REGS_SSE_SPILL existence (at least in
>>> trunk) and no comments on why we have this restriction.
>>
>> I was told by Uros,  that using TARGET_MMX is to prevent intreg <-> MMX 
>> moves that clobber stack registers.
>
> ix86_spill_class is supposed to return a register class to be used
> to store general purpose registers.  It returns ALL_SSE_REGS which
> doesn't intersect with MMX_REGS class.  So I don't see why
> intreg <-> MMX moves may appear.  And if those moves appear we should
> fix it, not disable the whole feature.
>
> @Uros, do you have a comment here?

Looking at the implementation of ix86_spill_class, the TARGET_MMX check
really looks too restrictive. However, we need to check TARGET_SSE2
and TARGET_INTERUNIT_MOVES instead, otherwise the movq xmm <-> intreg
pattern gets disabled.

This change should be OK then, but just in case, an SSE2-enabled
-mfpmath=387 32-bit SPEC run should uncover unwanted MMX instructions.

Uros.


Re: Question on TARGET_MMX and X86_TUNE_GENERAL_REGS_SSE_SPILL

2016-04-27 Thread Uros Bizjak
On Wed, Apr 27, 2016 at 4:39 PM, Uros Bizjak  wrote:
> On Wed, Apr 27, 2016 at 4:26 PM, Ilya Enkovich  wrote:
>
>>>> >> > X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE
>>>> >> > regs
>>>> >> instead of memory.
>>>> >> >
>>>> >> > I tried enabling the above tuning with -march=bdver4 -Ofast -mtune-
>>>> >> ctrl=general_regs_sse_spill.
>>>> >> > I did not find any code differences.
>>>> >> >
>>>> >> > Looking at the below code to enable this tune,  mmx ISA needs to be
>>>> >> > turned
>>>> >> off.
>>>> >> >
>>>> >> > static reg_class_t
>>>> >> > ix86_spill_class (reg_class_t rclass, machine_mode mode) {
>>>> >> >   if (TARGET_SSE && TARGET_GENERAL_REGS_SSE_SPILL && !
>>>> >> TARGET_MMX
>>>> >> >   && (mode == SImode || (TARGET_64BIT && mode == DImode))
>>>> >> >   && rclass != NO_REGS && INTEGER_CLASS_P (rclass))
>>>> >> > return ALL_SSE_REGS;
>>>> >> >   return NO_REGS;
>>>> >> > }
>>>> >> >
>>>> >> > All processor variants enable MMX by default  and why we need to
>>>> >> > switch
>>>> >> off mmx?
>>>> >>
>>>> >> That really looks weird to me.  I ran SPEC2006 on Ofast + LTO with
>>>> >> and without -mno-mmx and -mno-mmx gives (Haswell machine):
>>>> >>
>>>> >> SPEC2006INT :+0.30%
>>>> >> SPEC2006FP  :+0.60%
>>>> >> SPEC2006ALL :+0.48%
>>>> >>
>>>> >> Which is quite surprising for disabling a hardware feature hardly
>>>> >> used anywhere now.
>>>> >
>>>> > As I said without mmx (-mno-mmx), the tune
>>>> X86_TUNE_GENERAL_REGS_SSE_SPILL may be active now.
>>>> > Not sure if there are any other reason.
>>>>
>>>> Surely that should be the main reason I see performance gain.
>>>> So I want to ask the same question as you did: why does this important
>>>> performance feature requires disabled MMX.  This restriction exists from 
>>>> the
>>>> very start of X86_TUNE_GENERAL_REGS_SSE_SPILL existence (at least in
>>>> trunk) and no comments on why we have this restriction.
>>>
>>> I was told by Uros,  that using TARGET_MMX is to prevent intreg <-> MMX 
>>> moves that clobber stack registers.
>>
>> ix86_spill_class is supposed to return a register class to be used
>> to store general purpose registers.  It returns ALL_SSE_REGS which
>> doesn't intersect with MMX_REGS class.  So I don't see why
>> intreg <-> MMX moves may appear.  And if those moves appear we should
>> fix it, not disable the whole feature.
>>
>> @Uros, do you have a comment here?
>
> Looking at the implementation of ix86_spill_class, TARGET_MMX check
> really looks too restrictive. However, we need to check TARGET_SSE2
> and TARGET_INTERUNIT_MOVES instead, otherwise movq xmm <-> intreg
> pattern gets disabled

I'm testing the following patch:

--cut here--
Index: i386.c
===================================================================
--- i386.c  (revision 235516)
+++ i386.c  (working copy)
@@ -53560,9 +53560,12 @@
 static reg_class_t
 ix86_spill_class (reg_class_t rclass, machine_mode mode)
 {
-  if (TARGET_SSE && TARGET_GENERAL_REGS_SSE_SPILL && ! TARGET_MMX
+  if (TARGET_GENERAL_REGS_SSE_SPILL
+  && TARGET_SSE2
+  && TARGET_INTER_UNIT_MOVES_TO_VEC
+  && TARGET_INTER_UNIT_MOVES_FROM_VEC
   && (mode == SImode || (TARGET_64BIT && mode == DImode))
-  && rclass != NO_REGS && INTEGER_CLASS_P (rclass))
+  && INTEGER_CLASS_P (rclass))
 return ALL_SSE_REGS;
   return NO_REGS;
 }
--cut here--

Uros.


Testing _Complex varargs passing [was: Alpha, ABI change: pass SFmode and SCmode varargs by reference]

2016-09-04 Thread Uros Bizjak
On Fri, Sep 2, 2016 at 2:11 PM, Jakub Jelinek  wrote:
> On Fri, Sep 02, 2016 at 12:09:30PM +, Joseph Myers wrote:
>> On Fri, 2 Sep 2016, Uros Bizjak wrote:
>>
>> >  argument.  Passing _Complex float as a variable argument never
>> >  worked on alpha.  Thus, we have no backward compatibility issues
>>
>> Presumably there should be an architecture-independent execution test of
>> passing _Complex float in variable arguments - either new, or a
>> pre-existing one whose XFAIL or skip for alpha can be removed.  (That is,
>> one in the GCC testsuite rather than relying on a libffi test to test
>> GCC.)
>
> And if it is in g*.dg/compat/, it can even test ABI compatibility between
> different compilers or their versions.

What bothers me is the comment in
testsuite/gcc.dg/compat/scalar-by-value-4_main.c, which explicitly
states that this test covers _Complex types that cannot be used in
vararg lists:


/* Test passing scalars by value.  This test includes _Complex types
   whose real and imaginary parts cannot be used in variable-length
   argument lists.  */


It looks like the different handling of _Complex char, _Complex short and
_Complex float is there on purpose. Is (was?) there a limitation in the
C language standard that prevents passing these arguments as
varargs?

Please note that there are no additional FAILs with mainline gcc in the
x86_64-linux-gnu {,-m32} test run.

Uros.
Index: gcc.dg/compat/scalar-by-value-4_x.c
===================================================================
--- gcc.dg/compat/scalar-by-value-4_x.c (revision 239943)
+++ gcc.dg/compat/scalar-by-value-4_x.c (working copy)
@@ -13,6 +13,7 @@
 TYPE x05, TYPE x06, TYPE x07, TYPE x08,\
 TYPE x09, TYPE x10, TYPE x11, TYPE x12,\
 TYPE x13, TYPE x14, TYPE x15, TYPE x16);   \
+extern void testva##NAME (int n, ...); \
\
 void   \
 check##NAME (TYPE x, TYPE v)   \
@@ -62,6 +63,81 @@
  g13##NAME, g14##NAME, g15##NAME, g16##NAME);  \
   DEBUG_NL;\
   DEBUG_FPUTS (#NAME); \
+  DEBUG_FPUTS (" testva:");\
+  DEBUG_NL;\
+  testva##NAME (1, \
+   g01##NAME); \
+  DEBUG_NL;\
+  testva##NAME (2, \
+   g01##NAME, g02##NAME);  \
+  DEBUG_NL;\
+  testva##NAME (3, \
+   g01##NAME, g02##NAME, g03##NAME);   \
+  DEBUG_NL;\
+  testva##NAME (4, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME);\
+  DEBUG_NL;\
+  testva##NAME (5, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME); \
+  DEBUG_NL;\
+  testva##NAME (6, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME);  \
+  DEBUG_NL;\
+  testva##NAME (7, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME, g07##NAME);   \
+  DEBUG_NL;\
+  testva##NAME (8, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME, g07##NAME, g08##NAME);\
+  DEBUG_NL;\
+  testva##NAME (9, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME, g07##NAME, g08##NAME, \
+   g09##NAME); \
+  DEBUG_NL;\
+  testva##NAME (10,\
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME, g07##NAME, g08##NAME, \
+   g09##NAME, g10##NAME); 

[PATCH, testsuite]: Test compat _Complex varargs passing

2016-09-08 Thread Uros Bizjak
On Mon, Sep 5, 2016 at 1:45 PM, Joseph Myers  wrote:
> On Sun, 4 Sep 2016, Uros Bizjak wrote:
>
>> It looks that different handling of _Complex char, _Complex short and
>> _Complex float is there on purpose. Is (was?) there a limitation in a
>> c language standard that prevents passing of these arguments as
>> varargs?
>
> Well, ISO C doesn't define complex integers at all.  But it's deliberate
> (see DR#206) that _Complex float doesn't promote to _Complex double in
> variable arguments.  And there is nothing in ISO C to stop _Complex float
> being passed in variable arguments.
>
> For all these types including the complex integer ones: given that the
> front end doesn't promote them, they should be usable in variable
> arguments.

The attached patch adds various _Complex variable-argument tests to the
scalar-by-value-4 and scalar-return-4 tests. These tests previously
claimed, erroneously, that these argument types are unsupported as
variable arguments.
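
As a minimal illustration of the point made above (it is not taken from the
patch), _Complex float values can be read back from a variable argument list
with their own type, since they are not promoted to _Complex double. The
helper name sum_complex_float is hypothetical:

```c
#include <stdarg.h>

/* Hypothetical helper: sums N _Complex float values passed as
   variable arguments.  Each value is fetched as float _Complex,
   the type it was passed with.  */
float _Complex
sum_complex_float (int n, ...)
{
  va_list ap;
  float _Complex sum = 0.0f;

  va_start (ap, n);
  for (int i = 0; i < n; i++)
    sum += va_arg (ap, float _Complex);
  va_end (ap);

  return sum;
}
```

A caller would pass the values directly, for example
sum_complex_float (2, a, b) with a and b declared as float _Complex.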

2016-09-08  Uros Bizjak  

* gcc.dg/compat/scalar-by-value-4_x.c: Also test passing of
variable arguments.
* gcc.dg/compat/scalar-by-value-4_y.c (testva##NAME): New.
* gcc.dg/compat/scalar-by-value-4_main.c: Update description comment.
* gcc.dg/compat/scalar-return-4_x.c: Also test returning of
variable argument.
* gcc.dg/compat/scalar-return-4_y.c (testva##NAME): New.
* gcc.dg/compat/scalar-return-4_main.c: Update description comment.

Tested on x86_64-linux-gnu {,-m32}.

OK for mainline?

Uros.
diff --git a/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_main.c b/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_main.c
index 8164b44..bd024c0 100644
--- a/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_main.c
+++ b/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_main.c
@@ -1,5 +1,5 @@
 /* Test passing scalars by value.  This test includes _Complex types
-   whose real and imaginary parts cannot be used in variable-length
+   whose real and imaginary parts can be used in variable-length
argument lists.  */
 
 extern void scalar_by_value_4_x (void);
diff --git a/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_x.c b/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_x.c
index a4e73c9..a36a060 100644
--- a/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_x.c
+++ b/gcc/testsuite/gcc.dg/compat/scalar-by-value-4_x.c
@@ -13,6 +13,7 @@ test##NAME (TYPE x01, TYPE x02, TYPE x03, TYPE x04,   \
 TYPE x05, TYPE x06, TYPE x07, TYPE x08,\
 TYPE x09, TYPE x10, TYPE x11, TYPE x12,\
 TYPE x13, TYPE x14, TYPE x15, TYPE x16);   \
+extern void testva##NAME (int n, ...); \
\
 void   \
 check##NAME (TYPE x, TYPE v)   \
@@ -62,6 +63,81 @@ testit##NAME (void)  \
  g13##NAME, g14##NAME, g15##NAME, g16##NAME);  \
   DEBUG_NL;\
   DEBUG_FPUTS (#NAME); \
+  DEBUG_FPUTS (" testva:");\
+  DEBUG_NL;\
+  testva##NAME (1, \
+   g01##NAME); \
+  DEBUG_NL;\
+  testva##NAME (2, \
+   g01##NAME, g02##NAME);  \
+  DEBUG_NL;\
+  testva##NAME (3, \
+   g01##NAME, g02##NAME, g03##NAME);   \
+  DEBUG_NL;\
+  testva##NAME (4, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME);\
+  DEBUG_NL;\
+  testva##NAME (5, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME); \
+  DEBUG_NL;\
+  testva##NAME (6, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME);  \
+  DEBUG_NL;\
+  testva##NAME (7, \
+   g01##NAME, g02##NAME, g03##NAME, g04##NAME, \
+   g05##NAME, g06##NAME, g07##NAME);   \
+  DEBUG_NL;\
+  testva##NAME (8,  

Re: Identify IEE

2014-07-03 Thread Uros Bizjak
On Wed, Jul 2, 2014 at 11:13 PM, FX  wrote:
> I’ve recently added IEEE support for the Fortran front-end and library. As 
> part of that, the front-end should be able to determine which of the 
> available floating-point types are IEEE-conforming [1]. Right now, I’ve taken 
> a conservative approach and only considered the target’s float_type_node and 
> double_type_node as IEEE modes, but I’d like to improve that (for example, to 
> include long double and binary128 modes on x86).
>
> How can I determine, from a “struct real_format”, whether it is an IEEE 
> format or not? I’ve looked through gcc/real.{h,c} but could find no clear 
> solution. If there is none, would it be okay to add a new bool field to the 
> structure, named “ieee” or “ieee_format”, to discriminate?

The TARGET_FLOAT_FORMAT macro was removed here [1], and it is expected
that the HONOR_* macros are used instead.

Perhaps the easiest way is to use the
TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P target hook. You may
need to extend it so that a mode is passed to the hook. This way, the
target will be able to signal whether the mode is supported in the most
flexible way.

[1] https://gcc.gnu.org/ml/gcc-patches/2008-08/msg00645.html

Uros.


Re: Identify IEE

2014-07-03 Thread Uros Bizjak
On Thu, Jul 3, 2014 at 9:14 AM, Uros Bizjak  wrote:
> On Wed, Jul 2, 2014 at 11:13 PM, FX  wrote:
>> I’ve recently added IEEE support for the Fortran front-end and library. As 
>> part of that, the front-end should be able to determine which of the 
>> available floating-point types are IEEE-conforming [1]. Right now, I’ve 
>> taken a conservative approach and only considered the target’s 
>> float_type_node and double_type_node as IEEE modes, but I’d like to improve 
>> that (for example, to include long double and binary128 modes on x86).
>>
>> How can I determine, from a “struct real_format”, whether it is an IEEE 
>> format or not? I’ve looked through gcc/real.{h,c} but could find no clear 
>> solution. If there is none, would it be okay to add a new bool field to the 
>> structure, named “ieee” or “ieee_format”, to discriminate?
>
> The TARGET_FLOAT_FORMAT macro was removed here [1], and it is expected
> that HONOR_* macros are used instead.
>
> Perhaps the easiest way is to use
> TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P target hook. Maybe you
> need to extend it to pass a mode to this hook. This way, target will
> be able to signal if the mode is supported in the most flexible way.

Maybe a new hook should be introduced instead: TARGET_IEEE_FORMAT_P
(mode). For some targets, even soft-fp supports the required rounding
modes and can generate exceptions.
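
For comparison only, here is a user-level sketch of identifying a format by
its parameters, using the standard <float.h> macros. This is not the target
hook proposed above and it is not GCC-internal code. The function names are
hypothetical, and the check covers only the format layout, not whether the
target supports IEEE exceptions or rounding modes for that type:

```c
#include <float.h>
#include <stdbool.h>

/* True if float has the radix, precision and exponent range of
   IEEE binary32.  */
bool
float_looks_like_binary32 (void)
{
  return FLT_RADIX == 2 && FLT_MANT_DIG == 24
         && FLT_MIN_EXP == -125 && FLT_MAX_EXP == 128;
}

/* True if double has the parameters of IEEE binary64.  */
bool
double_looks_like_binary64 (void)
{
  return FLT_RADIX == 2 && DBL_MANT_DIG == 53
         && DBL_MIN_EXP == -1021 && DBL_MAX_EXP == 1024;
}

/* True if long double has the parameters of IEEE binary128.  The
   result depends on the target.  */
bool
long_double_looks_like_binary128 (void)
{
  return FLT_RADIX == 2 && LDBL_MANT_DIG == 113
         && LDBL_MIN_EXP == -16381 && LDBL_MAX_EXP == 16384;
}
```

A per-mode hook on the GCC side could answer the same question for every
floating-point mode, including whether exceptions and rounding are usable,
which these layout checks cannot tell.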

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-07-07 Thread Uros Bizjak
On Mon, Jul 7, 2014 at 1:47 PM, Jakub Jelinek  wrote:
> On Mon, Jul 07, 2014 at 03:35:06PM +0400, Evgeny Stupachenko wrote:
>> The key problem here is that EBX is not used in register allocation.
>> If we relax the restriction on EBX the performance is back, but there
>> are several fails.
>> Some of them could be fixed.
>> However I don't like that way as EBX register is uninitialized at
>> register allocation.
>
> That is nothing wrong.  The magic registers are to be assumed live from the
> beginning until the prologue is emitted.
>
>> Initialization (SET_GOT) appeared only at: "217r.pro_and_epilogue" phase.
>>
>> The key point in 2 suggestions is to set EBX register only prior to a
>> call (as it is required by ABI). In all other cases it could be any
>> other register.
>
> You could use special call insn patterns for calls that need to have ebx
> set, where there would be a
> (use (match_operand:SI NN "register_operand" "b"))
> and pass in the lgot pseudo and leave the register allocator to do its job.
> You'd need to remember in which hard register (or memory) the register
> allocator wants lgot to be at the start of the first basic block (so that
> when prologue is expanded you know where to store it).

You can probably use get_hard_reg_initial_val for this.

Uros.


gmake-4.0 and multiple jobs (-j X) testing

2014-08-20 Thread Uros Bizjak
Hello!

It looks like gmake-4.0 terminates the gcc testrun immediately after one
of the jobs fails. Does anybody else see this behavior? Do I need to
update the gmake invocation, or is "gmake -j 4 -k check" from the toplevel
build directory still OK?

Uros.


Re: gmake-4.0 and multiple jobs (-j X) testing

2014-08-25 Thread Uros Bizjak
On Thu, Aug 21, 2014 at 8:18 AM, Uros Bizjak  wrote:

> It looks that gmake-4.0 terminates gcc testrun immediately after one
> of the jobs fails. Does anybody else see this behavior? Do I need to
> update gmake invocation or is "gmake -j 4 -k check" from the toplevel
> build directory still OK?

Please disregard this message. Something was wrong with my dejagnu installation.

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-08-28 Thread Uros Bizjak
On Thu, Aug 28, 2014 at 10:37 AM, Ilya Enkovich  wrote:
> 2014-08-28 1:39 GMT+04:00 Jeff Law :
>> On 08/26/14 15:42, Ilya Enkovich wrote:
>>>
>>> diff --git a/gcc/calls.c b/gcc/calls.c
>>> index 4285ec1..85dae6b 100644
>>> --- a/gcc/calls.c
>>> +++ b/gcc/calls.c
>>> @@ -1122,6 +1122,14 @@ initialize_argument_information (int num_actuals
>>> ATTRIBUTE_UNUSED,
>>>   call_expr_arg_iterator iter;
>>>   tree arg;
>>>
>>> +if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
>>> +  {
>>> +   gcc_assert (pic_offset_table_rtx);
>>> +   args[j].tree_value = make_tree (ptr_type_node,
>>> +   pic_offset_table_rtx);
>>> +   j--;
>>> +  }
>>> +
>>>   if (struct_value_addr_value)
>>> {
>>> args[j].tree_value = struct_value_addr_value;
>>
>> So why do you need this?  Can't this be handled in the call/call_value
>> expanders or what about attaching the use to CALL_INSN_FUNCTION_USAGE from
>> inside ix86_expand_call?  Basically I'm not seeing the need for another
>> target hook here.  I think that would significantly simply the patch as
>> well.
>
> GOT base address become an additional implicit arg with EBX relaxed
> and I handled it as all other args. I can move EBX initialization into
> ix86_expand_call. Would still need some hint from target to init
> pic_offset_table_rtx with proper value in the beginning of function
> expand.

Maybe you can use get_hard_reg_initial_val for this?

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-08-28 Thread Uros Bizjak
On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich  wrote:
> Hi,
>
> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 
> 32bit PIC mode.  It was decided that the best approach would be to not fix 
> ebx register, use speudo register for GOT base address and let allocator do 
> the rest.  This should be similar to how clang and icc work with GOT base 
> address.  I've been working for some time on such patch and now want to share 
> my results.

+#define PIC_OFFSET_TABLE_REGNUM                                        \
+  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC                       \
+                     || TARGET_PECOFF))                                \
+   || !flag_pic ? INVALID_REGNUM                                       \
+   : X86_TUNE_RELAX_PIC_REG ? (pic_offset_table_rtx ? INVALID_REGNUM   \
+                               : REAL_PIC_OFFSET_TABLE_REGNUM)         \
+   : reload_completed ? REGNO (pic_offset_table_rtx)                   \
    : REAL_PIC_OFFSET_TABLE_REGNUM)

I'd like to avoid X86_TUNE_RELAX_PIC_REG and always treat EBX as an
allocatable register. This way, we can avoid all the mess with implicit
xchgs in atomic_compare_and_swap_doubleword. Also, having an
allocatable EBX would allow us to introduce a __builtin_cpuid builtin
and clean up cpuid.h.


Re: Enable EBX for x86 in 32bits PIC code

2014-08-28 Thread Uros Bizjak
On Thu, Aug 28, 2014 at 2:54 PM, Ilya Enkovich  wrote:

> diff --git a/gcc/calls.c b/gcc/calls.c
> index 4285ec1..85dae6b 100644
> --- a/gcc/calls.c
> +++ b/gcc/calls.c
> @@ -1122,6 +1122,14 @@ initialize_argument_information (int num_actuals
> ATTRIBUTE_UNUSED,
>   call_expr_arg_iterator iter;
>   tree arg;
>
> +if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
> +  {
> +   gcc_assert (pic_offset_table_rtx);
> +   args[j].tree_value = make_tree (ptr_type_node,
> +   pic_offset_table_rtx);
> +   j--;
> +  }
> +
>   if (struct_value_addr_value)
> {
> args[j].tree_value = struct_value_addr_value;

 So why do you need this?  Can't this be handled in the call/call_value
 expanders or what about attaching the use to CALL_INSN_FUNCTION_USAGE from
 inside ix86_expand_call?  Basically I'm not seeing the need for another
 target hook here.  I think that would significantly simply the patch as
 well.
>>>
>>> GOT base address become an additional implicit arg with EBX relaxed
>>> and I handled it as all other args. I can move EBX initialization into
>>> ix86_expand_call. Would still need some hint from target to init
>>> pic_offset_table_rtx with proper value in the beginning of function
>>> expand.
>>
>> Maybe you can you use get_hard_reg_initial_val for this?
>
> Actually there is no input hard reg holding GOT address.  Currently I
> use initialization with ebx with following ebx initialization in
> prolog_epilog pass.  But this is a temporary workaround.  It is
> inefficient because always uses callee save reg to get GOT address.  I
> suppose we should generate pseudo reg for pic_offset_table_rtx and
> also set_got with this register as a destination in expand pass.
> After register allocation set_got may be transformed into get_pc_thunk
> call with proper hard reg.  But some target hook has to be used for
> this.

Let me expand my idea a bit. IIRC, get_hard_reg_initial_val and
friends will automatically emit the initialization of a pseudo from
the pic_offset_table_rtx hard reg. After reload, the real initialization
of the pic_offset_table_rtx hard reg is emitted in the pro_and_epilogue
pass. I don't know if this works with the current implementation of
dynamic pic_offset_table_rtx selection, though.

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-08-28 Thread Uros Bizjak
On Thu, Aug 28, 2014 at 3:29 PM, Ilya Enkovich  wrote:

>>> diff --git a/gcc/calls.c b/gcc/calls.c
>>> index 4285ec1..85dae6b 100644
>>> --- a/gcc/calls.c
>>> +++ b/gcc/calls.c
>>> @@ -1122,6 +1122,14 @@ initialize_argument_information (int num_actuals
>>> ATTRIBUTE_UNUSED,
>>>   call_expr_arg_iterator iter;
>>>   tree arg;
>>>
>>> +if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
>>> +  {
>>> +   gcc_assert (pic_offset_table_rtx);
>>> +   args[j].tree_value = make_tree (ptr_type_node,
>>> +   pic_offset_table_rtx);
>>> +   j--;
>>> +  }
>>> +
>>>   if (struct_value_addr_value)
>>> {
>>> args[j].tree_value = struct_value_addr_value;
>>
>> So why do you need this?  Can't this be handled in the call/call_value
>> expanders or what about attaching the use to CALL_INSN_FUNCTION_USAGE 
>> from
>> inside ix86_expand_call?  Basically I'm not seeing the need for another
>> target hook here.  I think that would significantly simply the patch as
>> well.
>
> GOT base address become an additional implicit arg with EBX relaxed
> and I handled it as all other args. I can move EBX initialization into
> ix86_expand_call. Would still need some hint from target to init
> pic_offset_table_rtx with proper value in the beginning of function
> expand.

 Maybe you can you use get_hard_reg_initial_val for this?
>>>
>>> Actually there is no input hard reg holding GOT address.  Currently I
>>> use initialization with ebx with following ebx initialization in
>>> prolog_epilog pass.  But this is a temporary workaround.  It is
>>> inefficient because always uses callee save reg to get GOT address.  I
>>> suppose we should generate pseudo reg for pic_offset_table_rtx and
>>> also set_got with this register as a destination in expand pass.
>>> After register allocation set_got may be transformed into get_pc_thunk
>>> call with proper hard reg.  But some target hook has to be used for
>>> this.
>>
>> Let me expand my idea a bit. IIRC, get_hard_reg_initial_val and
>> friends will automatically emit intialization of a pseudo from
>> pic_offset_table_rtx hard reg. After reload, real initialization of
>> pic_offset_table_rtx hard reg is emitted in pro_and_epilogue pass. I
>> don't know if this works with current implementation of dynamic
>> pic_offset_table_rtx selection, though.
>
> That means you should choose some hard reg early before register
> allocation to be used for PIC reg initialization.  I do not like we
> have to do this and want to just generate set_got with pseudo reg and
> do not involve any additional hard reg. That would look like
>
> (insn/f 168 167 169 2 (parallel [
> (set (reg:SI 127)
> (unspec:SI [
> (const_int 0 [0])
> ] UNSPEC_SET_GOT))
> (clobber (reg:CC 17 flags))
> ]) test.cc:42 -1
>  (expr_list:REG_CFA_FLUSH_QUEUE (nil)
> (nil)))
>
> after expand pass.  r127 is pic_offset_table_rtx here. And after
> reload it would become:
>
> (insn/f 168 167 169 2 (parallel [
> (set (reg:SI 3 bx)
> (unspec:SI [
> (const_int 0 [0])
> ] UNSPEC_SET_GOT))
> (clobber (reg:CC 17 flags))
> ]) test.cc:42 -1
>  (expr_list:REG_CFA_FLUSH_QUEUE (nil)
> (nil)))
>
> And no additional actions are required on pro_and_epilogue.  Also it
> simplifies analysis whether we should generate set_got at all.
> Current we check hard reg is ever live which is wrong with not fixed
> ebx because any usage of hard reg used to init GOT doesn't mean GOT
> usage.  And with my proposed scheme unused GOT would mean DCE just
> removes useless set_got.

Yes, this is better. I was under the impression that you wanted to retain
the current initialization insertion in expand_prologue.

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-08-28 Thread Uros Bizjak
On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich  wrote:

> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 
> 32bit PIC mode.  It was decided that the best approach would be to not fix 
> ebx register, use speudo register for GOT base address and let allocator do 
> the rest.  This should be similar to how clang and icc work with GOT base 
> address.  I've been working for some time on such patch and now want to share 
> my results.

>  (define_insn "*pushtf"
>[(set (match_operand:TF 0 "push_operand" "=<,<")
> -   (match_operand:TF 1 "general_no_elim_operand" "x,*roF"))]
> +   (match_operand:TF 1 "nonimmediate_no_elim_operand" "x,*roF"))]

Can you please explain the reason for this change (and a couple of
similar changes to the push patterns)?

Uros.


Re: Account creation disabled on GCC Bugzilla

2014-09-01 Thread Uros Bizjak
Hello!

> 311 bugs have been created on GCC Bugzilla since yesterday. Only 2 are
> valid bugs. The remaining 309 ones are all spam and have been moved into
> the 'spam' component and marked as INVALID.

We can also avoid archiving bugs with the "spam" component to the gcc-bugs@ ML.

Uros.


Spam again

2014-09-02 Thread Uros Bizjak
> I again disabled account creation on GCC Bugzilla due to spammers being
> still very active. 117 user accounts have been created since yesterday.


Please immediately disable account creation on Bugzilla until an
effective solution to prevent spam is found. There is another spam
attack going on, where thousands of users are created automatically.
Without some CAPTCHA or approval process, GCC Bugzilla is a sitting
duck.

Uros.


Fwd: Spam again

2014-09-02 Thread Uros Bizjak
Hello!

> I added code to GCC Bugzilla last night to collect IP addresses from
> requests for new accounts. 80% - 90% of requests are coming from the
> following IP ranges:
>
> 62.122.72.x - 62.122.79.x
> 91.229.229.x
> 185.2.32.x
> 185.44.77.x - 185.44.79.x
> 188.72.126.x - 188.72.127.x
> 188.72.96.x
> 193.105.154.x
> 194.29.185.x
> 195.34.78.x - 195.34.79.x
> 195.78.108.x - 195.78.109.x
>
> All of them asked for a @wowring.ru account. If some of you want to play
> with these IP ranges, I would be curious to know where they are coming
> from. Maybe Russia?

Please find GeoIP data for these IPs at [1], including location,
ISP, organization and domain.

[1] 
https://drive.google.com/file/d/0BzMiXQxzb9IOOHlFN3JPTE5jcmR6R1o4WlY4Y09BeUlGczZz/edit?usp=sharing

Uros.


Re: Compare Elimination problems

2014-09-03 Thread Uros Bizjak
Hello!

> While I'm here, in i386.md some of the flag setting operations specify a mode 
> and some don't . Eg
>
> (define_expand "cmp_1"
>   [(set (reg:CC FLAGS_REG)
> (compare:CC (match_operand:SWI48 0 "nonimmediate_operand")
>
>
> (define_insn "*add_3"
>   [(set (reg FLAGS_REG)
> (compare

The mode of the mode-less FLAGS_REG is checked with ix86_match_ccmode
in the insn condition. Several modes are compatible with the required
mode, depending also on the input operands.

Uros.


Re: non-reproducible g++.dg/ubsan/align-2.C -Os execution failure

2014-09-04 Thread Uros Bizjak
Hello!

> I ran into this non-reproducible failure while testing a non-bootstrap build 
> on x86_64:
>
> ...
> PASS: g++.dg/ubsan/align-2.C   -Os  (test for excess errors)

I found the same problem on x86_64 CentOS 5.10 when testing with -m32:

gcc unix/-m32:

FAIL: c-c++-common/ubsan/align-2.c   -Os  execution test

g++ unix/-m32:

FAIL: c-c++-common/ubsan/align-2.c   -O2  execution test
FAIL: c-c++-common/ubsan/align-2.c   -O3 -fomit-frame-pointer  execution test
FAIL: c-c++-common/ubsan/align-2.c   -O3 -g  execution test
FAIL: c-c++-common/ubsan/align-2.c   -O2 -flto -flto-partition=none
execution test
FAIL: c-c++-common/ubsan/align-2.c   -O2 -flto  execution test
FAIL: c-c++-common/ubsan/align-4.c   -O1  execution test

The call to f4 in the following line triggers the failure:

  if (f2 (&v.u.e) + f3 (&v.u.e, 4) + f4 (&v.u.f.b) != 0)
__builtin_abort ();

I find it interesting that adding a dummy char c; after long long b in:

struct T { char a; long long b; };

solves the problem for me.

However, running the executable under valgrind shows nothing interesting.
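
To see why the dummy member matters, it can help to compare the layout of the two variants. The following is a minimal sketch, assuming only the struct definition quoted above; `struct T2`, `print_layouts` and `trailing_member_grows_struct` are hypothetical names that do not appear in the testsuite. It shows that the trailing char changes the size of the struct (and therefore what is placed after it in an enclosing object), which may or may not be related to the failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The struct from the test case, and a variant with the dummy trailing
   member (T2 is a made-up name for this sketch).  */
struct T  { char a; long long b; };
struct T2 { char a; long long b; char c; };

/* Print sizes and member offsets so that -m32 and -m64 builds can be
   compared side by side.  */
void print_layouts (void)
{
  printf ("T : size %zu, offsetof(b) %zu\n",
          sizeof (struct T), offsetof (struct T, b));
  printf ("T2: size %zu, offsetof(b) %zu, offsetof(c) %zu\n",
          sizeof (struct T2), offsetof (struct T2, b),
          offsetof (struct T2, c));
}

/* Nonzero if the trailing char makes the struct larger.  */
int trailing_member_grows_struct (void)
{
  return sizeof (struct T2) > sizeof (struct T);
}
```

Calling print_layouts () from a small main () built with and without -m32 gives the numbers for both ABIs.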

For reference:

$ /lib/libc.so.6
GNU C Library stable release version 2.5, by Roland McGrath et al.

Uros.


Re: Enable EBX for x86 in 32bits PIC code

2014-09-23 Thread Uros Bizjak
On Tue, Sep 23, 2014 at 3:54 PM, Ilya Enkovich  wrote:

> Here is a patch which combines results of my and Vladimir's work on EBX 
> enabling.
>
> It works OK for SPEC2000 and SPEC2006 on -Ofast + LTO.  It passes bootstrap 
> but there are few new failures in make check.
>
> gcc.target/i386/pic-1.c fails because it doesn't expect we can use EBX in 
> 32bit PIC mode
> gcc.target/i386/pr55458.c fails due to the same reason
> gcc.target/i386/pr23098.c fails because compiler fails to use float constant 
> as an immediate and loads it from GOT instead
>
> Do we have the final decision about having a compiler flag to control 
> enabling of pseudo PIC register?  I think we should have a possibility to use 
> fixed EBX at least until we make sure pseudo PIC doesn't harm debug info 
> generation. If we have such option then gcc.target/i386/pic-1.c and 
> gcc.target/i386/pr55458.c should be modified, otherwise these tests should be 
> removed.

I think having this flag would be dangerous. In effect, this flag
would be a hidden -ffixed-bx, with unwanted consequences on asm code
that handles ebx. As an example, please see config/i386/cpuid.h - ATM,
we handle ebx in a special way when __PIC__ is defined. With your
patch, we will have to handle it in a special way when new flag is in
effect, which is impossible, unless another compiler-generated define
is emitted.
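
To illustrate the kind of asm code that is affected, here is a minimal sketch of an ebx-preserving cpuid wrapper. It is not the actual contents of config/i386/cpuid.h; `cpuid_sketch` is a hypothetical name, and the non-x86 fallback exists only so the sketch compiles everywhere. The point is that the special path is selected purely by __PIC__, so a new flag that fixes ebx independently of __PIC__ would have no define to key on.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 and fills the outputs on x86,
   returns 0 on other targets.  */
int cpuid_sketch (unsigned int leaf, unsigned int *a, unsigned int *b,
                  unsigned int *c, unsigned int *d)
{
#if defined(__i386__) && defined(__PIC__)
  /* %ebx may hold the GOT pointer: swap it out around cpuid and
     restore it afterwards.  */
  __asm__ ("xchgl %%ebx, %1\n\t"
           "cpuid\n\t"
           "xchgl %%ebx, %1"
           : "=a" (*a), "=&r" (*b), "=c" (*c), "=d" (*d)
           : "0" (leaf), "2" (0));
  return 1;
#elif defined(__i386__) || defined(__x86_64__)
  /* No special register: ebx can be named directly as an output.  */
  __asm__ ("cpuid"
           : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
           : "0" (leaf), "2" (0));
  return 1;
#else
  (void) leaf; (void) a; (void) b; (void) c; (void) d;
  return 0;
#endif
}
```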

So, I vote to change PIC reg to a pseudo unconditionally and adjust
testsuite for all (expected) fall-out.

Uros.


Denormals and underflow control (gradual vs. abrupt) in soft-fp library

2014-09-23 Thread Uros Bizjak
Hello!

Joseph, is there any support for underflow control in soft-fp library?
From a private correspondence with FX about implementing gfortran IEEE
support for extended modes, soft-fp that implements 128bit support on
x86 could read this setting from FPU control registers and handle
denormals accordingly.

This would complement existing SSE handling of 32bit and 64bit FP
values and would probably have non-negligible effect on soft-fp
performance.

Uros.


Re: Denormals and underflow control (gradual vs. abrupt) in soft-fp library

2014-09-24 Thread Uros Bizjak
On Tue, Sep 23, 2014 at 7:13 PM, Joseph S. Myers
 wrote:

>> Joseph, is there any support for underflow control in soft-fp library?
>> From a private correspondence with FX about implementing gfortran IEEE
>> support for extended modes, soft-fp that implements 128bit support on
>> x86 could read this setting from FPU control registers and handle
>> denormals accordingly.
>
> My current series of soft-fp patches pending review on libc-alpha includes
> one for control of handling of *input* subnormals
> , because that
> feature is in the kernel version (and David Miller volunteered to help get
> the Linux kernel using the current version of soft-fp, given the features
> from the kernel version added to it
> ).
>
> As in the kernel version, that implementation sets the denormal operand
> exception in this case, although it appears SSE does not do that (so
> further conditioning would need to be added for this to be an accurate x86
> emulation; different architectures do different things in this case).
>
> Neither version has anything to control underflow on *output*.  A
> flush-to-zero mode as on x86 appears to be different in directed rounding
> modes from abruptUnderflow as defined in IEEE 754-2008 (flush-to-zero
> always produces zero, abruptUnderflow can produce the smallest normal
> result depending on the rounding mode).  Such flush-to-zero or
> abruptUnderflow control could of course be implemented; there might be a
> need for further conditions on whether the underflow and inexact exception
> flags are set on flushing to zero (again, there are architecture
> variations).
>
> (Note that such underflow control on output, at least in the architecture
> manuals I checked, means underflow when the architecture-specific tininess
> condition is met - not when the rounding result ends up subnormal - so
> it's not possible to implement it with a late check of the rounded result;
> soft-fp would need to check at the point where it decides whether the
> result is tiny.)
>
> I don't know whether you'd want flush-to-zero emulating architecture
> semantics, abruptUnderflow, or both.

Looking at the standard at [1], the mode is flush-to-zero on output,
which fits SSE as well.
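
The difference between the two output modes discussed in the quoted text can be sketched as follows. This is only an illustration under stated assumptions: `enum round_dir`, `is_tiny`, `flush_to_zero` and `abrupt_underflow` are hypothetical names, the rounding direction is passed in explicitly instead of being read from a control register, and the check is applied to an already rounded double. As noted in the quoted text, a real soft-fp implementation would have to test at the point where it decides whether the result is tiny, and exception flags are ignored here.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical rounding-direction tag for this sketch.  */
enum round_dir { ROUND_NEAREST, ROUND_TOWARD_ZERO, ROUND_UPWARD, ROUND_DOWNWARD };

/* Nonzero and smaller in magnitude than the smallest normal double.  */
int is_tiny (double x)
{
  double mag = x < 0 ? -x : x;
  return mag != 0.0 && mag < DBL_MIN;
}

/* Flush-to-zero: a tiny result always becomes a zero of the same sign.  */
double flush_to_zero (double x)
{
  if (!is_tiny (x))
    return x;
  return x < 0 ? -0.0 : 0.0;
}

/* abruptUnderflow as described for IEEE 754-2008: a tiny result becomes
   zero, or the smallest normal number when the rounding direction
   points away from zero for that sign.  */
double abrupt_underflow (double x, enum round_dir dir)
{
  if (!is_tiny (x))
    return x;
  if (dir == ROUND_UPWARD && x > 0)
    return DBL_MIN;
  if (dir == ROUND_DOWNWARD && x < 0)
    return -DBL_MIN;
  return x < 0 ? -0.0 : 0.0;
}
```

The two functions agree in round-to-nearest and round-toward-zero and differ only in the directed rounding modes.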

> (When used on x86 for operations involving binary128, there's also the
> point that only SSE not x87 has such modes, so you'd need to consider when
> it's correct to apply them to binary128 operations - is it appropriate for
> conversions between TFmode and XFmode or not?  Of course applying to all
> operations is easiest and avoids needing the soft-fp conditionals to
> depend on the operation / types involved, or having different
> sfp-machine.h definitions used in different source files.)

[1] http://j3-fortran.org/doc/year/03/03-131r1.txt

Uros.


Re: Issue with __builtin_remainder expansion on i386

2014-09-29 Thread Uros Bizjak
Hello!

> I have just submitted a patch emitting some new floating-point code from the 
> Fortran front-end,
> improving our IEEE support there: 
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02444.html
>
> However, in one of the cases where we emit a call to __builtin_remainderf(), 
> we get wrong code
> generation on i386. It is peculiar because:
>
>  - the wrong code occurs at all optimization levels, and the following flags 
> (any or all) do not
> affect it:
>  -mfpmath=sse -msse2 -fno-unsafe-math-optimizations -frounding-math 
> -fsignaling-nans
>  - the wrong code does not occur with -ffloat-store
>  - the generated code looks fine in every respect I could try. I could not 
> generate a direct C
>  testcase, unfortunately, but it is equivalent to:

The __builtin_remainderf on x86 expands to x87 fprem1 instruction [1].
According to the table in [1], +inf is not handled, and generates
division-by-zero exception.

IMO, we have to add "&& flag_finite_math_only" to the enable condition
of the remainder{sf,df,xf}3 expanders in i386.md.

[1] http://x86.renejeschke.de/html/file_module_x86_id_108.html

Uros.


RTL infrastructure leaks VALUE expressions into aliasing-detecting functions

2014-10-09 Thread Uros Bizjak
Hello!

I'd like to bring PR 63475 to the attention of RTL maintainers. The
problem in the referred PR exposed the RTL infrastructure problem,
where VALUE expressions are leaked instead of MEM expressions into
various parts of aliasing-detecting support functions.

As an example, please consider following patch for base_alias_check:

--cut here--
Index: alias.c
===
--- alias.c (revision 216025)
+++ alias.c (working copy)
@@ -1824,6 +1824,13 @@ base_alias_check (rtx x, rtx x_base, rtx y, rtx y_
   if (rtx_equal_p (x_base, y_base))
 return 1;

+  if (GET_CODE (x) == VALUE || GET_CODE (y) == VALUE)
+{
+  debug_rtx (x);
+  debug_rtx (y);
+  gcc_unreachable ();
+}
+
   /* The base addresses are different expressions.  If they are not accessed
  via AND, there is no conflict.  We can bring knowledge of object
  alignment into play here.  For example, on alpha, "char a, b;" can
--cut here--

The crosscompiler to alpha-linux-gnu dies immediately on the testcase,
provided in the PR.

One of the checks that trigger this condition in base_alias_check is:

(and:DI (lo_sum:DI (reg/f:DI 13 $13 [72])
(symbol_ref:DI ("aaa") [flags 0x6]  ))
(const_int -8 [0xfff8]))

with

(value:DI 7:3304360 @0x1b619ee0/0x1b614cb0)

and this is the reason why aliasing of aaa and bbb is not detected.
The (VALUE) RTX corresponds to:

cselib value 7:3304360 0x8a84cb0 (and:DI (lo_sum:DI (reg/f:DI 11 $11 [78])
(symbol_ref:DI ("bbb") [flags 0x6]  ))
(const_int -8 [0xfff8]))

so, there is no way for current code to detect aliasing between these two RTXes.


Re: RTL infrastructure leaks VALUE expressions into aliasing-detecting functions

2014-10-10 Thread Uros Bizjak
On Fri, Oct 10, 2014 at 7:56 PM, Jeff Law  wrote:
> On 10/09/14 06:14, Uros Bizjak wrote:
>>
>> Hello!
>>
>> I'd like to bring PR 63475 to the attention of RTL maintainers. The
>> problem in the referred PR exposed the RTL infrastructure problem,
>> where VALUE expressions are leaked instead of MEM expressions into
>> various parts of aliasing-detecting support functions.
>>
>> As an example, please consider following patch for base_alias_check:
>>
>> --cut here--
>> Index: alias.c
>> ===
>> --- alias.c (revision 216025)
>> +++ alias.c (working copy)
>> @@ -1824,6 +1824,13 @@ base_alias_check (rtx x, rtx x_base, rtx y, rtx y_
>> if (rtx_equal_p (x_base, y_base))
>>   return 1;
>>
>> +  if (GET_CODE (x) == VALUE || GET_CODE (y) == VALUE)
>> +{
>> +  debug_rtx (x);
>> +  debug_rtx (y);
>> +  gcc_unreachable ();
>> +}
>> +
>> /* The base addresses are different expressions.  If they are not
>> accessed
>>via AND, there is no conflict.  We can bring knowledge of object
>>alignment into play here.  For example, on alpha, "char a, b;" can
>
> But when base_alias_check  returns, we call memrefs_conflict_p which does
> know how to dig down into a VALUE expression.

IIRC, the problem was that base_alias_check returned 0 due to:

  /* Differing symbols not accessed via AND never alias.  */
  if (GET_CODE (x_base) != ADDRESS && GET_CODE (y_base) != ADDRESS)
return 0;

so, the calling code never reached memrefs_conflict_p down the stream.

It might be that targets without AND addresses are immune to this
issue, but the code that deals with ANDs is certainly not prepared to
handle VALUEs.

(The testcase from the PR can be compiled with a crosscompiler to
alpha-linux-gnu, as outlined in the PR. Two AND addresses should be
detected as aliasing, but they are not - resulting in CSE propagating
aliased read after store in (insn 29).)

Uros.


Re: RTL infrastructure leaks VALUE expressions into aliasing-detecting functions

2014-10-10 Thread Uros Bizjak
On Fri, Oct 10, 2014 at 8:18 PM, Jeff Law  wrote:

>>>> I'd like to bring PR 63475 to the attention of RTL maintainers. The
>>>> problem in the referred PR exposed the RTL infrastructure problem,
>>>> where VALUE expressions are leaked instead of MEM expressions into
>>>> various parts of aliasing-detecting support functions.
>>>>
>>>> As an example, please consider following patch for base_alias_check:
>>>>
>>>> --cut here--
>>>> Index: alias.c
>>>> ===
>>>> --- alias.c (revision 216025)
>>>> +++ alias.c (working copy)
>>>> @@ -1824,6 +1824,13 @@ base_alias_check (rtx x, rtx x_base, rtx y, rtx
>>>> y_
>>>>  if (rtx_equal_p (x_base, y_base))
>>>>return 1;
>>>>
>>>> +  if (GET_CODE (x) == VALUE || GET_CODE (y) == VALUE)
>>>> +{
>>>> +  debug_rtx (x);
>>>> +  debug_rtx (y);
>>>> +  gcc_unreachable ();
>>>> +}
>>>> +
>>>>  /* The base addresses are different expressions.  If they are not
>>>> accessed
>>>> via AND, there is no conflict.  We can bring knowledge of object
>>>> alignment into play here.  For example, on alpha, "char a, b;"
>>>> can
>>>
>>>
>>> But when base_alias_check  returns, we call memrefs_conflict_p which does
>>> know how to dig down into a VALUE expression.
>>
>>
>> IIRC, the problem was that base_alias_check returned 0 due to:
>>
>>/* Differing symbols not accessed via AND never alias.  */
>>if (GET_CODE (x_base) != ADDRESS && GET_CODE (y_base) != ADDRESS)
>>  return 0;
>>
>> so, the calling code never reached memrefs_conflict_p down the stream.
>
> Right.  And my question is what happens if we aren't as aggressive here.
> What happens if before this check we return nonzero if X or Y is a VALUE?
> Do we then get into memrefs_conflict_p and does it do the right thing?

Following patch just after AND detection in base_alias_check fixes the
testcase from PR:

--cut here--
Index: alias.c
===
--- alias.c (revision 216100)
+++ alias.c (working copy)
@@ -1842,6 +1842,8 @@ base_alias_check (rtx x, rtx x_base, rtx y, rtx y_
  || (int) GET_MODE_UNIT_SIZE (x_mode) < -INTVAL (XEXP (y, 1))))
 return 1;

+  return 1;
+
   /* Differing symbols not accessed via AND never alias.  */
   if (GET_CODE (x_base) != ADDRESS && GET_CODE (y_base) != ADDRESS)
 return 0;
--cut here--

I have started a bootstrap on alpha native with this patch (it will
take a day or so) and will report back with the findings.

The results with unpatched gcc are at [1].

[1] https://gcc.gnu.org/ml/gcc-testresults/2014-10/msg01151.html

Uros.


Re: RTL infrastructure leaks VALUE expressions into aliasing-detecting functions

2014-10-12 Thread Uros Bizjak
On Fri, Oct 10, 2014 at 8:37 PM, Uros Bizjak  wrote:

>> Right.  And my question is what happens if we aren't as aggressive here.
>> What happens if before this check we return nonzero if X or Y is a VALUE?
>> Do we then get into memrefs_conflict_p and does it do the right thing?
>
> Following patch just after AND detection in base_alias_check fixes the
> testcase from PR:
>
> --cut here--
> Index: alias.c
> ===
> --- alias.c (revision 216100)
> +++ alias.c (working copy)
> @@ -1842,6 +1842,8 @@ base_alias_check (rtx x, rtx x_base, rtx y, rtx y_
>   || (int) GET_MODE_UNIT_SIZE (x_mode) < -INTVAL (XEXP (y, 1))))
>  return 1;
>
> +  return 1;
> +
>/* Differing symbols not accessed via AND never alias.  */
>if (GET_CODE (x_base) != ADDRESS && GET_CODE (y_base) != ADDRESS)
>  return 0;
> --cut here--
>
> I have started a bootstrap on alpha native with this patch (it will
> take a day or so) and will report back findings

Yes, this "patch" solves the original gfortran problem.

It looks to me that the code that handles AND addresses in
base_alias_check is not prepared to handle VALUES correctly.
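
The control-flow problem discussed in this thread can be modelled with a toy example. This is a heavily simplified sketch and not GCC code: `struct toy_addr`, the `TOY_*` codes and all `toy_*` functions are hypothetical, every access is treated as 8 bytes wide, and the AND rule is reduced to "any visible AND may alias". It only shows how a cheap base check that returns 0 keeps the detailed check (which can look inside a VALUE) from ever running, and how treating a VALUE as "unknown" in the base check, as asked about in the quoted text, avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_code { TOY_SYMBOL, TOY_AND, TOY_VALUE };

struct toy_addr
{
  enum toy_code code;
  int symbol;                    /* object the address is based on */
  uintptr_t addr;                /* numeric address of that object */
  const struct toy_addr *inner;  /* TOY_VALUE only: wrapped expression */
};

/* Follow TOY_VALUE wrappers down to the underlying expression.  */
const struct toy_addr *toy_strip_value (const struct toy_addr *a)
{
  while (a->code == TOY_VALUE && a->inner != NULL)
    a = a->inner;
  return a;
}

/* Cheap filter: looks only at the outermost code, so a TOY_AND hidden
   inside a TOY_VALUE is invisible unless VALUE_IS_UNKNOWN is set.  */
int toy_base_check (const struct toy_addr *x, const struct toy_addr *y,
                    int value_is_unknown)
{
  if (x->symbol == y->symbol)
    return 1;
  if (value_is_unknown && (x->code == TOY_VALUE || y->code == TOY_VALUE))
    return 1;
  if (x->code == TOY_AND || y->code == TOY_AND)
    return 1;
  return 0;  /* differing symbols not accessed via AND never alias */
}

/* Detailed check: digs into VALUEs and compares 8-byte windows.  */
int toy_detailed_check (const struct toy_addr *x, const struct toy_addr *y)
{
  x = toy_strip_value (x);
  y = toy_strip_value (y);
  uintptr_t xa = x->code == TOY_AND ? (x->addr & ~(uintptr_t) 7) : x->addr;
  uintptr_t ya = y->code == TOY_AND ? (y->addr & ~(uintptr_t) 7) : y->addr;
  return xa < ya + 8 && ya < xa + 8;
}

/* The detailed check is reached only if the base check lets us through.  */
int toy_may_alias (const struct toy_addr *x, const struct toy_addr *y,
                   int value_is_unknown)
{
  if (!toy_base_check (x, y, value_is_unknown))
    return 0;
  return toy_detailed_check (x, y);
}
```

With a plain symbol on one side and an AND-masked address hidden behind a VALUE on the other, toy_may_alias () misses the overlap unless value_is_unknown is set.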

Uros.


Re: RTL infrastructure leaks VALUE expressions into aliasing-detecting functions

2014-10-14 Thread Uros Bizjak
On Sun, Oct 12, 2014 at 7:44 PM, Uros Bizjak  wrote:

>>> Right.  And my question is what happens if we aren't as aggressive here.
>>> What happens if before this check we return nonzero if X or Y is a VALUE?
>>> Do we then get into memrefs_conflict_p and does it do the right thing?
>>
>> Following patch just after AND detection in base_alias_check fixes the
>> testcase from PR:

[...]

> It looks to me that the code that handles AND addresses in
> base_alias_check is not prepared to handle VALUES correctly.

The further analysis and proposed patch follows up at [1].

[1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg01209.html

Uros.


Re: Towards GNU11

2014-10-15 Thread Uros Bizjak
Hello!

>> The consensus seems to be to go forward with this change.  I will
>> commit the patch in 24 hours unless I hear objections.
>
> I made the change.  Please report any fallout to me.

i686-linux-gnu testsuite trivially regressed [1]:

FAIL: gcc.dg/20020122-2.c (test for excess errors)
FAIL: gcc.dg/builtin-apply4.c (test for excess errors)
FAIL: gcc.dg/ia64-sync-1.c (test for excess errors)
FAIL: gcc.dg/ia64-sync-2.c (test for excess errors)
FAIL: gcc.dg/ia64-sync-3.c (test for excess errors)
FAIL: gcc.dg/pr32176.c (test for excess errors)
FAIL: gcc.dg/sync-2.c (test for excess errors)
FAIL: gcc.dg/sync-3.c (test for excess errors)
FAIL: gcc.target/i386/20060125-1.c (test for excess errors)
FAIL: gcc.target/i386/20060125-2.c (test for excess errors)
FAIL: gcc.target/i386/980312-1.c (test for excess errors)
FAIL: gcc.target/i386/980313-1.c (test for excess errors)
FAIL: gcc.target/i386/990524-1.c (test for excess errors)
FAIL: gcc.target/i386/avx512f-pr57233.c (test for excess errors)
FAIL: gcc.target/i386/avx512f-typecast-1.c (test for excess errors)
FAIL: gcc.target/i386/builtin-apply-mmx.c (test for excess errors)
FAIL: gcc.target/i386/crc32-2.c (test for excess errors)
FAIL: gcc.target/i386/crc32-3.c (test for excess errors)
FAIL: gcc.target/i386/intrinsics_3.c (test for excess errors)
FAIL: gcc.target/i386/loop-1.c (test for excess errors)
FAIL: gcc.target/i386/memcpy-1.c (test for excess errors)
FAIL: gcc.target/i386/pr26826.c (test for excess errors)
FAIL: gcc.target/i386/pr37184.c (test for excess errors)
FAIL: gcc.target/i386/pr40934.c (test for excess errors)
FAIL: gcc.target/i386/pr44948-2a.c (test for excess errors)
FAIL: gcc.target/i386/pr47564.c (test for excess errors)
FAIL: gcc.target/i386/pr50712.c (test for excess errors)
FAIL: gcc.target/i386/sse-5.c (test for excess errors)
FAIL: gcc.target/i386/stackalign/asm-1.c -mno-stackrealign (test for
excess errors)
FAIL: gcc.target/i386/stackalign/asm-1.c -mstackrealign (test for excess errors)
FAIL: gcc.target/i386/stackalign/return-2.c -mno-stackrealign (test
for excess errors)
FAIL: gcc.target/i386/stackalign/return-2.c -mstackrealign (test for
excess errors)
FAIL: gcc.target/i386/vectorize4.c (test for excess errors)

Mostly:

warning: implicit declaration of function ...

and

warning: return type defaults to 'int'

[1] https://gcc.gnu.org/ml/gcc-regression/2014-10/msg00347.html

Uros.


Recent bootstrap failure on CentOS 5.11, /usr/bin/ld: Dwarf Error: found dwarf version '4' ...

2014-10-16 Thread Uros Bizjak
Hello!

Recent change caused bootstrap failure on CentOS 5.11:

/usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
handles version 2 information.
unwind-dw2-fde-dip_s.o: In function `__pthread_cleanup_routine':
unwind-dw2-fde-dip.c:(.text+0x1590): multiple definition of
`__pthread_cleanup_routine'
/usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
handles version 2 information.
unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
/usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
handles version 2 information.
unwind-sjlj_s.o: In function `__pthread_cleanup_routine':
unwind-sjlj.c:(.text+0x0): multiple definition of `__pthread_cleanup_routine'
unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
/usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
handles version 2 information.
emutls_s.o: In function `__pthread_cleanup_routine':
emutls.c:(.text+0x170): multiple definition of `__pthread_cleanup_routine'
unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
collect2: error: ld returned 1 exit status
gmake[5]: *** [libgcc_s.so] Error 1

$ ld --version
GNU ld version 2.17.50.0.6-26.el5 20061020

Uros.


Re: Recent bootstrap failure on CentOS 5.11, /usr/bin/ld: Dwarf Error: found dwarf version '4' ...

2014-10-16 Thread Uros Bizjak
On Thu, Oct 16, 2014 at 11:25 AM, Uros Bizjak  wrote:

> Recent change caused bootstrap failure on CentOS 5.11:
>
> /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
> handles version 2 information.
> unwind-dw2-fde-dip_s.o: In function `__pthread_cleanup_routine':
> unwind-dw2-fde-dip.c:(.text+0x1590): multiple definition of
> `__pthread_cleanup_routine'
> /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
> handles version 2 information.
> unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
> /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
> handles version 2 information.
> unwind-sjlj_s.o: In function `__pthread_cleanup_routine':
> unwind-sjlj.c:(.text+0x0): multiple definition of `__pthread_cleanup_routine'
> unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
> /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
> handles version 2 information.
> emutls_s.o: In function `__pthread_cleanup_routine':
> emutls.c:(.text+0x170): multiple definition of `__pthread_cleanup_routine'
> unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
> collect2: error: ld returned 1 exit status
> gmake[5]: *** [libgcc_s.so] Error 1
>
> $ ld --version
> GNU ld version 2.17.50.0.6-26.el5 20061020

It looks like a switch-to-c11 fallout. Older glibc versions have
issues with c99 (and c11) conformance [1].

Changing "extern __inline void __pthread_cleanup_routine (...)" in
system /usr/include/pthread.h to

#if __STDC_VERSION__ < 199901L
extern
#endif
__inline__ void __pthread_cleanup_routine (...)

fixes this issue and allows bootstrap to proceed.

However, fixincludes is not yet built in stage1 bootstrap. Is there a
way to fix this issue without changing system headers?

[1] https://gcc.gnu.org/ml/gcc-patches/2006-11/msg01030.html

Uros.


[PATCH, fixincludes]: Add pthread.h to glibc_c99_inline_4 fix

2014-10-21 Thread Uros Bizjak
On Thu, Oct 16, 2014 at 2:05 PM, Jakub Jelinek  wrote:

>> > Recent change caused bootstrap failure on CentOS 5.11:
>> >
>> > /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
>> > handles version 2 information.
>> > unwind-dw2-fde-dip_s.o: In function `__pthread_cleanup_routine':
>> > unwind-dw2-fde-dip.c:(.text+0x1590): multiple definition of
>> > `__pthread_cleanup_routine'
>> > /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
>> > handles version 2 information.
>> > unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
>> > /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
>> > handles version 2 information.
>> > unwind-sjlj_s.o: In function `__pthread_cleanup_routine':
>> > unwind-sjlj.c:(.text+0x0): multiple definition of 
>> > `__pthread_cleanup_routine'
>> > unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
>> > /usr/bin/ld: Dwarf Error: found dwarf version '4', this reader only
>> > handles version 2 information.
>> > emutls_s.o: In function `__pthread_cleanup_routine':
>> > emutls.c:(.text+0x170): multiple definition of `__pthread_cleanup_routine'
>> > unwind-dw2_s.o:unwind-dw2.c:(.text+0x270): first defined here
>> > collect2: error: ld returned 1 exit status
>> > gmake[5]: *** [libgcc_s.so] Error 1
>> >
>> > $ ld --version
>> > GNU ld version 2.17.50.0.6-26.el5 20061020
>>
>> It looks like a switch-to-c11 fallout. Older glibc versions have
>> issues with c99 (and c11) conformance [1].
>>
>> Changing "extern __inline void __pthread_cleanup_routine (...)" in
>> system /usr/include/pthread.h to
>>
>> #if __STDC_VERSION__ < 199901L
>> extern
>> #endif
>> __inline__ void __pthread_cleanup_routine (...)
>>
>> fixes this issue and allows bootstrap to proceed.
>>
>> However, fixincludes is not yet built in stage1 bootstrap. Is there a
>> way to fix this issue without changing system headers?
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2006-11/msg01030.html
>
> Yeah, old glibcs are totally incompatible with -fno-gnu89-inline.
> Not sure if it is easily fixincludable, if yes, then -fgnu89-inline should
> be used for code like libgcc which is built with the newly built compiler
> before it is fixincluded.
> Or we need -fgnu89-inline by default for old glibcs (that is pretty
> much what we do e.g. in Developer Toolset for RHEL5).

At the end of the day, adding pthread.h to glibc_c99_inline_4 fix
fixes the bootstrap. The fix applies __attribute__((__gnu_inline__))
to the declaration:

extern __inline __attribute__ ((__gnu_inline__)) void
__pthread_cleanup_routine (struct __pthread_cleanup_frame *__frame)
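
For readers unfamiliar with the attribute, its effect can be sketched with a stand-alone example. This is only an illustration; `twice` is a hypothetical function and has nothing to do with pthread.h. With gnu_inline, an "extern inline" definition keeps the GNU89 meaning even in C99/C11 mode: the body is used only for inlining and no out-of-line copy is emitted, so several object files can include the header without producing multiple definitions.

```c
#include <assert.h>

/* What a header would contain: used only for inlining, never emitted
   as an out-of-line function, regardless of -std=.  */
extern __inline __attribute__ ((__gnu_inline__)) int
twice (int x)
{
  return 2 * x;
}

/* The single out-of-line definition.  In a library this would normally
   live in its own source file; here it follows in the same translation
   unit, which GCC accepts as a redefinition of an extern inline
   function.  */
int
twice (int x)
{
  return 2 * x;
}
```

Without the attribute, C99 semantics would make the first definition an external one in every file that includes it, which is the "multiple definition" error shown in the bootstrap log.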

2014-10-21  Uros Bizjak  

* inclhack.def (glibc_c99_inline_4): Add pthread.h to files.
* fixincl.x: Regenerate.

Bootstrapped and regression tested on CentOS 5.11 x86_64-linux-gnu {,-m32}.

OK for mainline?

Uros.
Index: fixincl.x
===
--- fixincl.x   (revision 216501)
+++ fixincl.x   (working copy)
@@ -2,11 +2,11 @@
  * 
  * DO NOT EDIT THIS FILE   (fixincl.x)
  * 
- * It has been AutoGen-ed  August 12, 2014 at 02:09:58 PM by AutoGen 5.12
+ * It has been AutoGen-ed  October 21, 2014 at 10:18:16 AM by AutoGen 5.16.2
  * From the definitions    inclhack.def
  * and the template file   fixincl
  */
-/* DO NOT SVN-MERGE THIS FILE, EITHER Tue Aug 12 14:09:58 MSK 2014
+/* DO NOT SVN-MERGE THIS FILE, EITHER Tue Oct 21 10:18:17 CEST 2014
  *
  * You must regenerate it.  Use the ./genfixes script.
  *
@@ -3173,7 +3173,7 @@
  *  File name selection pattern
  */
 tSCC zGlibc_C99_Inline_4List[] =
-  "sys/sysmacros.h\0*/sys/sysmacros.h\0wchar.h\0*/wchar.h\0";
+  "sys/sysmacros.h\0*/sys/sysmacros.h\0wchar.h\0*/wchar.h\0pthread.h\0*/pthread.h\0";
 /*
  *  Machine/OS name selection pattern
  */
Index: inclhack.def
===
--- inclhack.def(revision 216501)
+++ inclhack.def(working copy)
@@ -1687,7 +1687,8 @@
  */
 fix = {
 hackname  = glibc_c99_inline_4;
-files = sys/sysmacros.h, '*/sys/sysmacros.h', wchar.h, '*/wchar.h';
+files = sys/sysmacros.h, '*/sys/sysmacros.h', wchar.h, '*/wchar.h',
+pthread.h, '*/pthread.h';
 bypass= "__extern_inline|__gnu_inline__";
 select= "(^| )extern __inline";
 c_fix = format;


Recent go changes broke alpha bootstrap

2014-10-30 Thread Uros Bizjak
Hello!

Recent go changes broke alpha bootstrap:

/bin/mkdir -p .; files=`echo
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/dir_largefile.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/dir.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/doc.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/env.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/error.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/error_unix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/exec.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/exec_posix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/exec_unix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/file.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/file_posix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/file_unix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/getwd.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/path.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/path_unix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/pipe_linux.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/proc.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/stat_atim.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/str.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/sys_linux.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/sys_unix.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/types.go
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/types_notwin.go
errors.gox io.gox runtime.gox sync/atomic.gox sync.gox syscall.gox
time.gox | sed -e 's/[^ ]*\.gox//g'`; /bin/sh ./libtool --tag GO
--mode=compile /space/homedirs/uros/gcc-build/./gcc/gccgo
-B/space/homedirs/uros/gcc-build/./gcc/
-B/usr/local/alphaev68-unknown-linux-gnu/bin/
-B/usr/local/alphaev68-unknown-linux-gnu/lib/ -isystem
/usr/local/alphaev68-unknown-linux-gnu/include -isystem
/usr/local/alphaev68-unknown-linux-gnu/sys-include  -O2 -g -mieee
-I . -c -fgo-pkgpath=`echo os.lo | sed -e 's/.lo$//' -e 's/-go$//'` -o
os.lo $files
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/stat_atim.go:22:29:
error: reference to undefined field or method ‘Mtim’
   modTime: timespecToTime(st.Mtim),
 ^
/space/homedirs/uros/gcc-svn/trunk/libgo/go/os/stat_atim.go:60:50:
error: reference to undefined field or method ‘Atim’
  return timespecToTime(fi.Sys().(*syscall.Stat_t).Atim)
  ^
Makefile:4579: recipe for target 'os.lo' failed
gmake[4]: *** [os.lo] Error 1
gmake[4]: Leaving directory
'/space/uros/gcc-build/alphaev68-unknown-linux-gnu/libgo'

The relevant part of libgo/sysinfo.log declares:

libgo/sysinfo.go:type Stat_t struct { Dev uint64; Ino uint64; Rdev
uint64; Size int64; Blocks uint64; Mode uint32; Uid uint32; Gid
uint32; Blksize uint32; Nlink uint32; __pad0 int32; Godump_0 struct {
Atim [16]byte; Godump_1_align [0]uint64; }; Godump_2 struct { Mtim
[16]byte; Godump_3_align [0]uint64; }; Godump_4 struct { Ctim
[16]byte; Godump_5_align [0]uint64; }; __glibc_reserved [2+1]int64; }

Uros.


Re: Recent go changes broke alpha bootstrap

2014-10-30 Thread Uros Bizjak
On Thu, Oct 30, 2014 at 8:40 AM, Uros Bizjak  wrote:

> Recent go changes broke alpha bootstrap:

> $files
> /space/homedirs/uros/gcc-svn/trunk/libgo/go/os/stat_atim.go:22:29:
> error: reference to undefined field or method ‘Mtim’
>modTime: timespecToTime(st.Mtim),
>  ^
> /space/homedirs/uros/gcc-svn/trunk/libgo/go/os/stat_atim.go:60:50:
> error: reference to undefined field or method ‘Atim’
>   return timespecToTime(fi.Sys().(*syscall.Stat_t).Atim)
>   ^
> Makefile:4579: recipe for target 'os.lo' failed
> gmake[4]: *** [os.lo] Error 1
> gmake[4]: Leaving directory
> '/space/uros/gcc-build/alphaev68-unknown-linux-gnu/libgo'
>
> The relevant part of libgo/sysinfo.log declares:
>
> libgo/sysinfo.go:type Stat_t struct { Dev uint64; Ino uint64; Rdev
> uint64; Size int64; Blocks uint64; Mode uint32; Uid uint32; Gid
> uint32; Blksize uint32; Nlink uint32; __pad0 int32; Godump_0 struct {
> Atim [16]byte; Godump_1_align [0]uint64; }; Godump_2 struct { Mtim
> [16]byte; Godump_3_align [0]uint64; }; Godump_4 struct { Ctim
> [16]byte; Godump_5_align [0]uint64; }; __glibc_reserved [2+1]int64; }

Looks like "[PATCH 7/9] Gccgo port to s390[x] -- part I" [1,2] broke it.

The definitions from bits/stat.h are:

--cut here--
/* Nanosecond resolution timestamps are stored in a format equivalent to
   'struct timespec'.  This is the type used whenever possible but the
   Unix namespace rules do not allow the identifier 'timespec' to appear
   in the <sys/stat.h> header.  Therefore we have to handle the use of
   this header in strictly standard-compliant sources special.

   Use neat tidy anonymous unions and structures when possible.  */

#if defined __USE_MISC || defined __USE_XOPEN2K8
# if __GNUC_PREREQ(3,3)
#  define __ST_TIME(X)  \
__extension__ union {   \
struct timespec st_##X##tim;\
struct {\
__time_t st_##X##time;  \
unsigned long st_##X##timensec; \
};  \
}
# else
#  define __ST_TIME(X) struct timespec st_##X##tim
#  define st_atime st_atim.tv_sec
#  define st_mtime st_mtim.tv_sec
#  define st_ctime st_ctim.tv_sec
# endif
#else
# define __ST_TIME(X)   \
__time_t st_##X##time;  \
unsigned long st_##X##timensec
#endif


struct stat
  {
__dev_t st_dev; /* Device.  */
#ifdef __USE_FILE_OFFSET64
__ino64_t st_ino;   /* File serial number.  */
#else
__ino_t st_ino; /* File serial number.  */
int __pad0; /* 64-bit st_ino.  */
#endif
__dev_t st_rdev;/* Device number, if device.  */
__off_t st_size;/* Size of file, in bytes.  */
#ifdef __USE_FILE_OFFSET64
__blkcnt64_t st_blocks; /* Nr. 512-byte blocks allocated.  */
#else
__blkcnt_t st_blocks;   /* Nr. 512-byte blocks allocated.  */
int __pad1; /* 64-bit st_blocks.  */
#endif
__mode_t st_mode;   /* File mode.  */
__uid_t st_uid; /* User ID of the file's owner. */
__gid_t st_gid; /* Group ID of the file's group.*/
__blksize_t st_blksize; /* Optimal block size for I/O.  */
__nlink_t st_nlink; /* Link count.  */
int __pad2; /* Real padding.  */
__ST_TIME(a);   /* Time of last access.  */
__ST_TIME(m);   /* Time of last modification.  */
__ST_TIME(c);   /* Time of last status change.  */
long __glibc_reserved[3];
  };
--cut here--
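
As a minimal sketch of why Atim is no longer a direct field of Stat_t in the
dump above: the __GNUC_PREREQ(3,3) branch wraps each timestamp in an anonymous
union, and the dump turns each of those unions into a Godump_N wrapper struct.
The reduced macro below uses hypothetical demo_ names and is not the glibc
definition itself; it only shows that both views share one storage slot.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical reduced form of the anonymous-union branch of __ST_TIME:
   one slot reachable as a timespec or as two scalar fields.  */
#define DEMO_ST_TIME(X)                     \
  __extension__ union {                     \
    struct timespec demo_##X##tim;          \
    struct {                                \
      time_t demo_##X##time;                \
      unsigned long demo_##X##timensec;     \
    };                                      \
  }

struct demo_stat {
  int demo_mode;
  DEMO_ST_TIME(a);   /* access time */
  DEMO_ST_TIME(m);   /* modification time */
};

/* Both names refer to the same offset inside the enclosing struct.  */
static_assert(offsetof(struct demo_stat, demo_atim)
              == offsetof(struct demo_stat, demo_atime),
              "timespec view and scalar view start at the same offset");

/* Store through the timespec view, read back through the scalar view.  */
static time_t demo_roundtrip(time_t sec)
{
  struct demo_stat s = {0};
  s.demo_atim.tv_sec = sec;
  return s.demo_atime;
}
```

In C the union members are promoted into struct demo_stat, so s.demo_atim
works directly; a Go translation that keeps the wrapper struct loses that
promotion.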

[1] https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02977.html
[2] https://gcc.gnu.org/ml/gcc-cvs/2014-10/msg01069.html

Uros.


Re: Recent go changes broke alpha bootstrap

2014-11-04 Thread Uros Bizjak
On Tue, Nov 4, 2014 at 1:00 AM, Ian Taylor  wrote:
> On Mon, Nov 3, 2014 at 9:02 AM, Dominik Vogt  wrote:
>> On Thu, Oct 30, 2014 at 08:05:14AM -0700, Ian Taylor wrote:
>>> On Thu, Oct 30, 2014 at 5:46 AM, Dominik Vogt wrote:
>>> > I'm not quite sure about the best approach.  The attempt to
>>> > represent C unions in the "right" way is ultimately futile as Go
>>> > does not have the types necessary for it.  For example, the
>>> > handling of anonymous bit fields will never be right as it's
>>> > undefined.  On the other hand I could fix the issue at hand by
>>> > changing the way anonymous unions are represented in Go.
>>> >
>>> > Example:
>>> >
>>> >   struct { int8_t x; union { int16_t y; int32_t z; }; };
>>> >
>>> > Was represented (before the patch) as
>>> >
>>> >   struct { X byte; Y int16; }
>>> >
>>> > which had size 4, alignment 2 and y at offset 2 but should have
>>> > had size 8, alignment 4 and y at offset 4.  With the current
>>> > patch the Go layout is
>>> >
>>> >   struct { X byte; artificial_name struct { y [2]byte; align [0]int32; } }
>>> >
>>> > with the proper size, alignment and offset, but y is addressed as
>>> > ".artificial_name.y" instead of just ".y", and y is a byte array
>>> > and not an int16.
>>> >
>>> > I could remove the "artificial_name struct" and add padding before
>>> > and after y instead:
>>> >
>>> >   struct { X byte; pad_0 [3]byte; Y int16; pad_1 [2]byte; align [0]int32; }
>>> >
>>> > What do you think?
>>>
>>> Sounds good to me.  Basically the fields of the anonymous union should
>>> be promoted to become fields of the struct.  We can't do it in
>>> general, but we can do it for the first field.  That addresses the
>>> actual uses of anonymous unions.
>>
>> The attached patch fixes this, at least if the first element of the
>> union is not a bitfield.
>>
>> Bitfields can really not be represented properly in Go (think about
>> constructs like "struct { int : 1; int bf : 1; }"), I'd rather not
>> try to represent them in a predictable way.  The patched code may
>> or may not give them a name, and reserves the proper amount of
>> padding where required in structs.  If a union begins with an
>> anonymous bitfield (which makes no sense), that is ignored.  If a
>> union begins with a named bitfield (possibly after unnamed ones),
>> this may or may not be used as the (sole) element of the Go
>> struct.
>
>
> Thanks.  I committed your patch.

I have checked that the patch fixes alpha bootstrap with libgo.
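
For reference, the layout numbers quoted above (size 8, alignment 4, y at
offset 4) can be checked on the C side with a minimal sketch like the one
below. It assumes a common ABI where int32_t is 4-byte aligned; the struct and
function names are hypothetical. The second struct mirrors the proposed padded
Go layout, without the zero-length align member.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* The C struct from the example: an anonymous union after a 1-byte field.  */
struct demo {
  int8_t x;
  union {
    int16_t y;
    int32_t z;
  };
};

static_assert(sizeof(struct demo) == 8, "size 8");
static_assert(alignof(struct demo) == 4, "alignment 4");
static_assert(offsetof(struct demo, y) == 4, "y at offset 4");

/* Flattened mirror of the proposed Go struct: explicit padding before and
   after y.  The Go version also carries `align [0]int32` to raise the
   alignment to 4, which this mirror omits.  */
struct demo_flat {
  uint8_t x;
  uint8_t pad_0[3];
  int16_t y;
  uint8_t pad_1[2];
};

static_assert(sizeof(struct demo_flat) == sizeof(struct demo), "same size");
static_assert(offsetof(struct demo_flat, y) == offsetof(struct demo, y),
              "same offset for y");

/* Runtime accessor so the offset can also be inspected from a test.  */
static size_t demo_y_offset(void)
{
  return offsetof(struct demo, y);
}
```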

Thanks,
Uros.


Re: [ping] Re: proper name of i386/x86-64/etc targets

2015-01-19 Thread Uros Bizjak
On Tue, Jan 20, 2015 at 3:26 AM, Sandra Loosemore
 wrote:

>> I've noticed that the GCC user documentation is quite inconsistent about
>> the name(s) it uses for i386/x86-64/etc targets.  invoke.texi has a
>> section for "i386 and x86-64 Options", but in other places the manual
>> uses x86, X86, i?86, i[34567]86, x86_64 (underscore instead of a dash),
>> etc.
>>
>> I'd be happy to work on a patch to bring the manual to using a common
>> naming convention, but what should it be?  Wikipedia seems to use "x86"
>> (lowercase) to refer to the entire family of architectures (including
>> the original 16-bit variants), "IA-32" for the 32-bit architecture (I
>> believe that is Intel's official name), and "x86-64" (with a dash
>> instead of underscore) for the 64-bit architecture.  But of course the
>> target maintainers should have the final say on what names to use.
>
>
> Ping?  Any thoughts?

Let's ask Intel people ...

Uros.


Re: [ping] Re: proper name of i386/x86-64/etc targets

2015-01-20 Thread Uros Bizjak
On Tue, Jan 20, 2015 at 3:23 PM, H.J. Lu  wrote:

>>> > ia32 is confusing because ia64 (a well known term) sounds related but
>>> > can't be farther away from it, and it's also vendor specific.  Our
>>> > traditional i386 seems better to me (although it has its own problems,
>>> > but I'm not aware of any better abbreviation in the wild that's vendor
>>> > neutral and specifically means the 32bit incarnation of the x86
>>> > architecture).
>>> >
>>>
>>> The problem with i386 is it is a real processor.  When someone says
>>> i386, it isn't clear if it means the processor or 32-bit x86.
>>
>> That's what I meant with its own problems :)  But ia32 seems worse to me
>> than this IMO.
>>
>
> At least, IA-32 is clear, although IA-64 may be confusing :-).  FWIW,
> i386 is also vendor specific.

Wikipedia agrees [1]:

--q--
IA-32 (short for "Intel Architecture, 32-bit", sometimes also called
i386[1][2] through metonymy)[3] is the third generation of the x86
architecture, first implemented in the Intel 80386 microprocessors in
1985. It was the first incarnation of x86 to support 32-bit
computing.[4] As such, "IA-32" may be used as a metonym to refer to
all x86 versions that support 32-bit computing.[5][6]
--/q--

IMO, comparing IA-32 and i386, IA-32 looks better.

[1] http://en.wikipedia.org/wiki/IA-32

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich  wrote:

> I was looking into PR65105 and tried to generate SSE computation for a
> simple 64bit  a + b + c sequence. Having no scalar integer instructions in
> SSE I have to use vector variants.

Is this approach really better than having two add/addc instructions?
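
To make the comparison concrete, here is a minimal C sketch of what the
integer alternative amounts to on a 32-bit target: each 64-bit addition is one
add of the low halves plus one add-with-carry of the high halves. The type and
function names are hypothetical, and the code only models the arithmetic; it
says nothing about what the compiler actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, like a DImode register pair.  */
struct u64_pair { uint32_t lo, hi; };

static_assert(sizeof(struct u64_pair) == 8, "two 32-bit halves");

static struct u64_pair pair_split(uint64_t v)
{
  struct u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
  return p;
}

static uint64_t pair_join(struct u64_pair p)
{
  return ((uint64_t)p.hi << 32) | p.lo;
}

static struct u64_pair add_pair(struct u64_pair a, struct u64_pair b)
{
  struct u64_pair r;
  r.lo = a.lo + b.lo;              /* "add": low halves, may wrap */
  uint32_t carry = r.lo < a.lo;    /* carry out of the low addition */
  r.hi = a.hi + b.hi + carry;      /* add-with-carry: high halves */
  return r;
}

/* a + b + c computed as two chained pair additions.  */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c)
{
  return pair_join(add_pair(add_pair(pair_split(a), pair_split(b)),
                            pair_split(c)));
}
```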

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 11:45 AM, Uros Bizjak  wrote:
> On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich wrote:
>
>> I was looking into PR65105 and tried to generate SSE computation for a
>> simple 64bit  a + b + c sequence. Having no scalar integer instructions in
>> SSE I have to use vector variants.
>
> Is this approach really better than having two add/addc instructions?

FYI, V1DI mode was introduced because XMM shift insn were used to
shift DImode values. The cost of moves from/to integer DImode reg pair
was disastrous.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 12:09 PM, Ilya Enkovich  wrote:

>>>> I was looking into PR65105 and tried to generate SSE computation for a
>>>> simple 64bit  a + b + c sequence. Having no scalar integer instructions in
>>>> SSE I have to use vector variants.
>>>
>>> Is this approach really better than having two add/addc instructions?
>>
>> FYI, V1DI mode was introduced because XMM shift insn were used to
>> shift DImode values. The cost of moves from/to integer DImode reg pair
>> was disastrous.
>>
>> Uros.
>
> Does it mean I have to add V1DI instructions for all opcodes I want to
> transform (add,sub,mul,or,and, etc.)?

No.

Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
pseudo). IIRC, there is some functionality in the compiler that is
able to tell if the highpart of the paradoxical register is zeroed.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 12:14 PM, Uros Bizjak  wrote:

>>>>> I was looking into PR65105 and tried to generate SSE computation for a
>>>>> simple 64bit  a + b + c sequence. Having no scalar integer instructions in
>>>>> SSE I have to use vector variants.
>>>>
>>>> Is this approach really better than having two add/addc instructions?
>>>
>>> FYI, V1DI mode was introduced because XMM shift insn were used to
>>> shift DImode values. The cost of moves from/to integer DImode reg pair
>>> was disastrous.
>>>
>>> Uros.
>>
>> Does it mean I have to add V1DI instructions for all opcodes I want to
>> transform (add,sub,mul,or,and, etc.)?
>
> No.
>
> Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
> pseudo). IIRC, there is some functionality in the compiler that is
> able to tell if the highpart of the paradoxical register is zeroed.

Probably you can even generate a paradoxical V2DImode subreg of DImode.
I'm not sure whether in this case the register allocator degenerates the
mode of the resulting hard register to DImode, but it is worth a try.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-07 Thread Uros Bizjak
On Thu, May 7, 2015 at 6:24 PM, Richard Henderson  wrote:
> On 04/24/2015 06:32 PM, Jan Hubicka wrote:
>> Also I believe it was kind of Richard's design decision to avoid use of
>> (paradoxical) subregs for vector conversions because these have funny
>> implications.
>
> Yes indeed.
>
>> The code for handling upper parts of paradoxical subregs is controlled by
>> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
>> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
>> probably try to produce zero extensions)
>>
>> When we are on SSE instructions, it would be great to finally teach
>> copy_by_pieces/store_by_pieces to use vector instructions (these are more
>> compact and either equally fast or faster on some CPUs). I hope to get into
>> this, but it would be great if someone beat me to it.
>
> Well, I think it would be worthwhile to teach the i386 backend how to do 64-bit
> vectors in SSE registers.  First, this would aid portability with other targets
> who may have GCC generic vectors written only for 8 byte quantities.  Since we
> do have zero-extending 8 byte load/store insns for SSE, we don't actually need
> paradoxical regs, just additional macro-ization of the existing patterns.

If we consider SSE operations as DImode operations, we will lose the
ability to precisely specify which operation (SSE vs. general reg) we
want. I'm afraid that in DImode case, combine will choose FLAG-less
pattern that will mandate moves from general regs to SSE regs and
back. This was the reason to invent V1DImode/V1TImode "vectors" to
avoid moving double-mode values to MMX/SSE regs for double-mode
shifts.

The alternative would be RA that is able to select between alternative
instructions, not only between alternative register classes.
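
For readers unfamiliar with the "GCC generic vectors written only for 8 byte
quantities" mentioned in the quoted text, a minimal sketch using the
vector_size attribute follows. The type and function names are hypothetical,
and which registers the compiler picks for such code is exactly the open
question in this thread, so the example shows only the source-level form.

```c
#include <assert.h>
#include <stdint.h>

/* An 8-byte generic vector: two 32-bit lanes.  */
typedef int32_t v2si __attribute__((vector_size(8)));

static_assert(sizeof(v2si) == 8, "8-byte vector");

/* Element-wise addition using the generic vector operators.  */
static v2si add_v2si(v2si a, v2si b)
{
  return a + b;
}

/* Adds {1, 2} and {10, 20} and returns lane 1 of the result.  */
static int32_t demo_sum_lane1(void)
{
  v2si a = { 1, 2 };
  v2si b = { 10, 20 };
  v2si r = add_v2si(a, b);
  return r[1];
}
```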

Uros.


Re: Broken test gcc.target/i386/sibcall-2.c

2015-05-14 Thread Uros Bizjak
On Wed, May 13, 2015 at 8:17 PM, Alexander Monakov  wrote:

> Can you also tell me why ..._pop call and sibcall instructions are predicated
> on !TARGET_64BIT?

Because "  /* None of the 64-bit ABIs pop arguments.  */ ".

Please see call_pop documentation and ix86_return_pops_args from
config/i386/i386.c

Uros.


Re: parameters to _mm_mwait intrinsic

2015-06-03 Thread Uros Bizjak
On Wed, Jun 3, 2015 at 2:47 PM, Kumar, Venkataramanan
 wrote:
> Hi,
>
> I was going through the "monitor" and "mwait" builtin implementation.
> I need clarification on the parameters passed to _mm_mwait intrinsic.
>

> Should the constraint be swapped for the operands in the pattern?

Please swap the constraints in the pattern.

Patch is pre-approved for mainline and release branches.

Thanks,
Uros.


Re: [RFC] split of i386.c

2019-03-13 Thread Uros Bizjak
On Tue, Mar 12, 2019 at 9:54 PM Jeff Law  wrote:
>
> On 3/12/19 2:50 PM, Eric Gallager wrote:
> > On 3/12/19, Martin Liška  wrote:
> >> Hi.
> >>
> >> I've been thinking about the file split for quite some time, mainly
> >> in context of PR84402. I would like to discuss if it's fine for
> >> maintainers of the target to make such split and into which logical
> >> components can the file be split?
> >>
> >> I'm suggesting something like:
> >> - option-related and attribute-related stuff (i386-options.c - as seen in
> >> patch)
> >> - built-in related functions
> >> - expansion/gen functions - still quite a lot of functions, would make
> >>   sense to split into:
> >>   - scalar
> >>   - vector
> >> - prologue/epilogue, GOT, PLT, symbol emission
> >> - misc extensions like STV, TLS, CET, retpolines, multiversioning, ..
> >> - helpers - commonly used functions, print_reg, ix86_print_operand, ..
> >>
> >> I am volunteering to make the split, hopefully early in the next stage1.
> >>
> >> Thoughts?
> >>
> >> Thanks,
> >> Martin
> >>
> >
> > I'm not a maintainer, but just as an onlooker I highly support this
> > move; i386.c is way too long as it is. 7 pieces sounds like a good
> > number of new files to split it into, too.
> I trust your judgment on where/how to split and fully support the goals
> behind splitting.  Uros is the key person you need to get on board.

I'm OK with the split, the file is getting huge and I think the
suggested split would be beneficial.

Uros.


Re: mfentry and Darwin.

2019-05-21 Thread Uros Bizjak
On Tue, May 21, 2019 at 6:15 PM Iain Sandoe  wrote:
>
> Hi Uros,
>
> It seems to me that (even if it was working “properly”, which it isn't)  
> ‘-mfentry’ would break ABI on Darwin for both 32 and 64b - which require 
> 16byte stack alignment at call sites.
>
> For Darwin, the dynamic loader enforces the requirement when it can and will 
> abort a program that tries to make a DSO linkage with the stack in an 
> incorrect alignment.  We previously had a bug against profiling caused by 
> exactly this issue (but when the mcount call was in the post-prologue 
> position).
>
> Actually, I’m not sure why it’s not an issue for other 64b platforms that use 
> the psABI (AFAIR,  it’s only the 32b case that’s Darwin-specific).

The __fentry__ in glibc is written as a wrapper around the call to
__mcount_internal, in such a way that it compensates for stack
misalignment in the call to __mcount_internal. __fentry__ itself survives
stack misalignment, since no xmm regs are saved to the stack in the
function.

> Anyway, my current plan is to disable mfentry (for Darwin) - the alternative 
> might be some kind of “almost at the start of the function, but needing some 
> stack alignment change”,
>
> I’m interested in if you know of any compelling use-cases that would make it 
> worth finding some work-around instead of disabling.

Unfortunately, not off the top of my head...

Uros.


Re: [PATCH] x86: fix CVT{,T}PD2PI insns

2019-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2019 at 11:10 AM Jan Beulich  wrote:
>
> >>> On 27.06.19 at 11:03,  wrote:
> > With just an "m" constraint misaligned memory operands won't be forced
> > into a register, and hence cause #GP. So far this was guaranteed only
> > in the case that CVT{,T}PD2DQ were chosen (which looks to be the case on
> > x86-64 only).
> >
> > Instead of switching the second alternative to Bm, use just m on the
> > first and replace nonimmediate_operand by vector_operand.
>
> While doing this and the others where I'm also replacing Bm by uses of
> vector_operand, I've started wondering whether Bm couldn't (and then
> shouldn't) be dropped altogether, replacing it everywhere by "m"
> combined with vector_operand (or vector_memory_operand when
> register operands aren't allowed anyway).

No. Register allocator will propagate unaligned memory in non-AVX
case, which is not allowed with vector_operand.

Uros.

> Furthermore there's an issue with Bm and vector_memory_operand:
> Whether alignment gets enforced depends on TARGET_AVX. However,
> just like the two insns in question here, the SHA ones too don't have
> VEX-encoded equivalents, and hence require alignment enforced even
> with -mavx. Together with the above I wonder whether Bm shouldn't
> be re-purposed to express this special requirement.
>
> Jan
>
>


Re: [PATCH] x86: fix CVT{,T}PD2PI insns

2019-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2019 at 12:47 PM Jan Beulich  wrote:
>
> >>> On 27.06.19 at 12:22,  wrote:
> > On Thu, Jun 27, 2019 at 11:10 AM Jan Beulich  wrote:
> >>
> >> >>> On 27.06.19 at 11:03,  wrote:
> >> > With just an "m" constraint misaligned memory operands won't be forced
> >> > into a register, and hence cause #GP. So far this was guaranteed only
> >> > in the case that CVT{,T}PD2DQ were chosen (which looks to be the case on
> >> > x86-64 only).
> >> >
> >> > Instead of switching the second alternative to Bm, use just m on the
> >> > first and replace nonimmediate_operand by vector_operand.
> >>
> >> While doing this and the others where I'm also replacing Bm by uses of
> >> vector_operand, I've started wondering whether Bm couldn't (and then
> >> shouldn't) be dropped altogether, replacing it everywhere by "m"
> >> combined with vector_operand (or vector_memory_operand when
> >> register operands aren't allowed anyway).
> >
> > No. Register allocator will propagate unaligned memory in non-AVX
> > case, which is not allowed with vector_operand.
>
> I'm afraid I don't understand: Unaligned SIMD memory accesses will
> generally fault in non-AVX mode, so such propagation would seem
> wrong to me and hence would seem to be correctly not allowed.
> Furthermore both vector_operand and Bm resolve to the same
> vector_memory_operand. The TARGET_AVX check actually is inside
> vector_memory_operand, i.e. affects both the same way.

"Bm" *prevents* propagation of unaligned access for non-AVX targets.
As said, the register allocator does not care about operand predicates (it
only looks at operand constraints), so it will propagate unaligned
access with an "m" operand. To avoid propagation, "Bm" should and does
use the vector_memory_operand predicate internally.
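
The hazard itself is a legacy (non-VEX) SSE instruction reading a memory
operand that is not 16-byte aligned. As a minimal illustration of that
precondition only (it does not model the constraint or the predicate, and the
function names are hypothetical):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* True when the address satisfies the 16-byte alignment that legacy SSE
   memory operands require.  */
static int is_aligned16(const void *p)
{
  return ((uintptr_t)p & 15u) == 0;
}

/* An aligned buffer is acceptable; the same buffer offset by 8 bytes is the
   kind of address that must not be propagated into such an instruction.  */
static int demo_misaligned_detected(void)
{
  alignas(16) unsigned char buf[32] = { 0 };
  assert(is_aligned16(buf));
  return !is_aligned16(buf + 8);
}
```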

Uros.

