Re: [PATCH] Don't assume const/pure calls are total (fix PR tree-optimization/19828)

2005-02-19 Thread Zdenek Dvorak



Project submission for GCC 4.1 - AltiVec rewrite

2005-02-19 Thread Paolo Bonzini
I had already submitted this to Mark, but since I have improved a few 
rough spots in the code I think it's better to make it public.

* Project Title
AltiVec rewrite.
* Project Contributors
Paolo Bonzini
* Dependencies
none
* Delivery Date
March 15th or earlier (the implementation is complete and has no
regressions).

* Description
The project reimplements the AltiVec vector primitives in a saner way,
without putting the burden on the preprocessor and instead processing
the "overloading" in the C front-end.
This would benefit compilation speed on AltiVec vector code, and move
the big table of AltiVec overloading pairs from an installed header
file to the compiler (an 11000-line header file is reduced to 500 lines
plus 2500 in the compiler).
The changes are so far self-contained in the PowerPC backend, but I
expect that a hack I am using will have to be changed upon review.
Unfortunately, a previous RFC I posted on the gcc mailing list got
(almost) no answer.
I plan to take a look at apple-ppc-branch, which supposedly does not
need this hack, or to ask for feedback when I submit the project.

The new implementation improves on the existing one in that
anything but predicates will accept unparenthesized literals even in C.
This line:
  vec_add (x, (vector unsigned int) {1, 2, 3, 4})
currently fails in C and works in C++, but with the new implementation it
would work in C as well.  On the other hand, using a predicate like this

  vec_all_eq (x, (vector unsigned int) {1, 2, 3, 4})
will still not work in C (it will *not* be a regression in C++, where it 
will be okay both without and with my changes).  It would have to be 
written as

  vec_all_eq (x, ((vector unsigned int) {1, 2, 3, 4}))
exactly as in the current implementation.
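
One plausible reason the unparenthesized form fails in C today, assuming the header implements the primitives as function-like macros, is that the preprocessor splits macro arguments at commas that are not protected by parentheses, and braces do not protect them. The sketch below shows the effect with a hypothetical macro SUM_FIRST and a hypothetical struct vec4 standing in for the AltiVec names, so it builds on any C99 compiler:

```c
/* Hypothetical stand-in for a vector type; not an AltiVec type.  */
struct vec4 { unsigned int e[4]; };

/* Hypothetical stand-in for a primitive implemented as a macro.  */
#define SUM_FIRST(a, b) ((a).e[0] + (b).e[0])

static unsigned int
sum_first_demo (void)
{
  struct vec4 x = { { 10, 20, 30, 40 } };

  /* SUM_FIRST (x, (struct vec4) { { 1, 2, 3, 4 } })
     would be rejected: the commas inside the braces are not inside
     parentheses, so the preprocessor sees five arguments, not two.  */

  /* The extra parentheses keep the literal together as one argument.  */
  return SUM_FIRST (x, ((struct vec4) { { 1, 2, 3, 4 } }));
}
```

Calling sum_first_demo () adds the first elements of the two operands.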
Paolo


MMX built-ins performance oddities

2005-02-19 Thread Prakash Punnoor
Hi,
I noticed something strange when I use GCC's builtins for MMX:
I defined some unions:
typedef int v4hi __attribute__ ((__mode__(__V4HI__)));
typedef int v2si __attribute__ ((__mode__(__V2SI__)));
typedef int di __attribute__ ((__mode__(__DI__)));
typedef union
{
  v4hi v;
  short s[4];
  int i[2];
}  _v4hi;
typedef union
{
v2si v;
int i[2];
}  _v2si;
And now the strange thing: if I use those unions (e.g. _v4hi var) in my code
and pass the vector to the MMX builtin (e.g. var.v), gcc produces faster code
than if I use the v4hi type directly. In my tests the latter variant was 10%
slower. I'd expect identical results, and even identical object files as far
as the scheduling of the assembler goes, but that was not the case.
Is this a known issue with gcc-3.4.3? I compiled the code using -O2
-march=athlon-xp -g3. If you want a smaller test case, I could try to make one.
Right now I just didn't want to waste my time in case this is a known issue or
I did something stupid...
I also tried using (Intel style?) intrinsics via mmintrin.h, and there the
times are nearly the same using unions or vectors, but both are as slow as the
vector variant above.
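
To make the two variants being compared concrete, here is a minimal sketch. It assumes GCC's generic vector extension (the vector_size attribute and the + operator on vector types) instead of the MMX builtins, so that it also builds on non-x86 targets; the type and function names are hypothetical. It only shows that both styles compute the same values and does not reproduce the timing difference.

```c
#include <string.h>

/* Four 16-bit lanes, declared with the generic vector extension.  */
typedef short v4hi_t __attribute__ ((vector_size (8)));

/* Union wrapper in the style of _v4hi above.  */
typedef union
{
  v4hi_t v;
  short s[4];
} v4hi_union;

/* Variant A: operands live in unions, the .v member is the operand.  */
static void
add_via_union (const short a[4], const short b[4], short out[4])
{
  v4hi_union x, y, r;

  memcpy (x.s, a, sizeof x.s);
  memcpy (y.s, b, sizeof y.s);
  r.v = x.v + y.v;
  memcpy (out, r.s, sizeof r.s);
}

/* Variant B: operands are plain vector variables.  */
static void
add_direct (const short a[4], const short b[4], short out[4])
{
  v4hi_t x, y, r;

  memcpy (&x, a, sizeof x);
  memcpy (&y, b, sizeof y);
  r = x + y;
  memcpy (out, &r, sizeof r);
}
```

Both functions take two arrays of four shorts and store the lane-wise sums in out.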

The function I used was (using the unions):
/* Code for use in OpenAL; LGPL license; Copyright 2005 by Prakash Punnor */
/* prepare sign-extension from 16bit to 32 bit for stream ST */
#define GET_SIGNMASK(ST) \
indata.v   = *(v4hi*)(entries[ST].data + offset); \
signmask.v = (v4hi)__builtin_ia32_pand((di)indata.v, (di)m->v); \
signmask.v = (v4hi)__builtin_ia32_pcmpeqw(signmask.v, m->v);
/* mix stream 0 */
#define MIX_ST0 \
GET_SIGNMASK (0);\
\
loout.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v);\
hiout.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v);
/* sign-extension and mix stream ST */
#define MIX(ST) \
GET_SIGNMASK(ST) \
temp.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v); \
loout.v = __builtin_ia32_paddd(loout.v, temp.v); \
temp.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v); \
hiout.v = __builtin_ia32_paddd(hiout.v, temp.v);
/* manual saturation to dst */
#define SATURATE(OFFSET) \
if (sample == (short)sample) dst[OFFSET] = sample; \
else { \
if(sample > 0 ) \
dst[OFFSET] = max_audioval; \
else \
dst[OFFSET] = min_audioval; \
}\
/* manually mix samples of mod_len */
#define MIX_MOD \
for (offset=0; offset
/* Mix all remaining and write to dst */
#define LOOP_MIX \
while (st
__attribute__((aligned(16))) static const short sm[4] =
{0x8000,0x8000,0x8000,0x8000};
__attribute__((aligned(16))) static const _v4hi *m = (_v4hi*)sm;
typedef struct _alMixEntry {
ALvoid *data;
ALint bytes;
} alMixEntry;
void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
{
int len = entries[0].bytes;
int mod_len = len % (4 * sizeof(ALshort));
int offset;
int st;
_v4hi indata;
_v4hi signmask;
_v2si loout;
_v2si hiout;
_v2si temp;
MIX_MOD;
for (offset=0; offset
MIX_ST0;
st = 1;
LOOP_MIX;
}
__builtin_ia32_emms();
return;
}
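
As a scalar illustration of what GET_SIGNMASK plus the punpck builtins do for a single 16-bit sample, and of the SATURATE clamp, here is a minimal sketch. The function names are hypothetical, and the limits 32767 and -32768 are an assumption, since the values of max_audioval and min_audioval are not shown above. It models the arithmetic only, not the MMX code.

```c
#include <stdint.h>

/* Sign-extend one 16-bit sample the way the macros do it:
   pand isolates the sign bit, pcmpeqw turns it into an all-ones or
   all-zeros word, and punpck[lh]wd puts that word above the sample.  */
static int32_t
sign_extend_via_mask (uint16_t sample)
{
  const uint16_t sign_bit = 0x8000;
  uint16_t masked = sample & sign_bit;                 /* pand */
  uint16_t mask = (masked == sign_bit) ? 0xFFFF : 0;   /* pcmpeqw */
  uint32_t widened = ((uint32_t) mask << 16) | sample; /* punpck[lh]wd */

  /* Relies on the usual two's complement conversion.  */
  return (int32_t) widened;
}

/* Clamp a mixed 32-bit sample to the 16-bit range, as SATURATE does.
   The limits are assumed to be the int16_t extremes.  */
static int16_t
saturate16_model (int32_t sample)
{
  if (sample >= INT16_MIN && sample <= INT16_MAX)
    return (int16_t) sample;
  return sample > 0 ? INT16_MAX : INT16_MIN;
}
```

For example, sign_extend_via_mask (0xFFFF) gives -1, and saturate16_model (40000) gives 32767.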
I attached the objdumps:
old.dump - using unions -> fast
n3.dump - using vectors directly -> 10% slower on my athlon-xp, even though
the generated asm seems to be shorter
BTW, the buffers were 16-byte aligned.
--
Prakash Punnoor
formerly known as Prakash K. Cheemplavam

mixaudio16.o: file format elf32-i386

Disassembly of section .text:

 :
__attribute__((aligned(16))) static const short sm[4] = 
{0x8000,0x8000,0x8000,0x8000};
__attribute__((aligned(16))) static const v4hi *m = (v4hi*)sm;

void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
{
   0:   55          push   %ebp
   1:   89 e5       mov    %esp,%ebp
   3:   57          push   %edi
   4:   56          push   %esi
   5:   53          push   %ebx
   6:   83 ec 0c    sub    $0xc,%esp
int len = entries[0].bytes;
int mod_len = len % (4 * sizeof(ALshort));
int offset;
int st;

v4hi indata;
v4hi signmask;

v2si loout;
v2si hiout;

v2si temp;

MIX_MOD;
   9:   31 db       xor    %ebx,%ebx
   b:   8b 75 0c    mov    0xc(%ebp),%esi
   e:   8b 7d 10    mov    0x10(%ebp),%edi
  11:   8b 46 04    mov    0x4(%esi),%eax
  14:   89 45 f0    mov    %eax,0xfff0(%ebp)
  17:   83 e0 07    and    $0x7,%eax
  1a:   39 c3       cmp    %eax,%ebx
  1c:   89 45 ec    mov    %eax,0xffec(%ebp)
  1f:   7d 47       jge    68
  21:   eb 0d       jmp    30
  23:   90          nop
 

Re: MMX built-ins performance oddities

2005-02-19 Thread Andrew Pinski
On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:
> Is this a known issue with gcc-3.4.3? I compiled the code using -O2
> -march=athlon-xp -g3. If you want a smaller test case, I could try to
> do so.
> Right now I just didn't want to waste my time in case this is a known
> issue or I did something stupid...

Yes the builtins are known to be a little stupid in 3.4.x.  Could you
try a snapshot of 4.0.0?

-- Pinski


Re: MMX built-ins performance oddities

2005-02-19 Thread Prakash Punnoor
Andrew Pinski wrote:
> On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:
>> Is this a known issue with gcc-3.4.3? I compiled the code using -O2
>> -march=athlon-xp -g3. If you want a smaller test case, I could try to
>> do so.
>> Right now I just didn't want to waste my time in case this is a known
>> issue or I did something stupid...
>
> Yes the builtins are known to be a little stupid in 3.4.x.  Could you try
> a snapshot of 4.0.0?

I'll try tomorrow, as I guess a new one will come out then, and I read that
the last one had trouble compiling itself.
--
Prakash Punnoor
formerly known as Prakash K. Cheemplavam



moving v16sf reg with multiple sub-regs

2005-02-19 Thread Dylan Cuthbert
Hi there,

I have implemented a move of a v16sf type like this because it is held by 4 
v4sf registers:

--- snip ---

(define_expand "movv16sf"
  [(set (match_operand:V16SF 0 "nonimmediate_operand" "")
 (match_operand:V16SF 1 "general_operand" ""))]
  ""
  "  if ((reload_in_progress | reload_completed) == 0
  && !register_operand (operands[0], V16SFmode)
  && !nonmemory_operand (operands[1], V16SFmode))
operands[1] = force_reg (V16SFmode, operands[1]);

 move_v16sf( operands );
 DONE;
 ")

--- end snip ---


and in the config's .c file:


--- snip ---

void
move_v16sf (operands)
     rtx operands[];
{
  rtx op0 = operands[0];
  rtx op1 = operands[1];
  enum rtx_code code0 = GET_CODE (operands[0]);
  enum rtx_code code1 = GET_CODE (operands[1]);
  int subreg_offset0 = 0;
  int subreg_offset1 = 0;
  enum delay_type delay = DELAY_NONE;

  if (code0 == REG)
    {
      int regno0 = REGNO (op0) + subreg_offset0;

      if (code1 == REG)
        {
          int regno1 = REGNO (op1) + subreg_offset1;

          /* Just in case, don't do anything for assigning a register
             to itself, unless we are filling a delay slot.  */
          if (regno0 == regno1 && set_nomacro == 0)
            return;

          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 0),
                          gen_rtx_SUBREG (V4SFmode, op1, 0));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 16),
                          gen_rtx_SUBREG (V4SFmode, op1, 16));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 32),
                          gen_rtx_SUBREG (V4SFmode, op1, 32));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 48),
                          gen_rtx_SUBREG (V4SFmode, op1, 48));
        }
      else if (code1 == MEM)
        {
          rtx src_reg;

          src_reg = copy_addr_to_reg (XEXP (op1, 0));

          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 0),
                          gen_rtx_MEM (V4SFmode, src_reg));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 16),
                          gen_rtx_MEM (V4SFmode,
                                       plus_constant (src_reg, 16)));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 32),
                          gen_rtx_MEM (V4SFmode,
                                       plus_constant (src_reg, 32)));
          emit_move_insn (gen_rtx_SUBREG (V4SFmode, op0, 48),
                          gen_rtx_MEM (V4SFmode,
                                       plus_constant (src_reg, 48)));
        }
    }
  else if (code0 == MEM)
    {
      if (code1 == REG)
        {
          rtx dest_reg;

          dest_reg = copy_addr_to_reg (XEXP (op0, 0));

          emit_move_insn (gen_rtx_MEM (V4SFmode, dest_reg),
                          gen_rtx_SUBREG (V4SFmode, op1, 0));
          emit_move_insn (gen_rtx_MEM (V4SFmode,
                                       plus_constant (dest_reg, 16)),
                          gen_rtx_SUBREG (V4SFmode, op1, 16));
          emit_move_insn (gen_rtx_MEM (V4SFmode,
                                       plus_constant (dest_reg, 32)),
                          gen_rtx_SUBREG (V4SFmode, op1, 32));
          emit_move_insn (gen_rtx_MEM (V4SFmode,
                                       plus_constant (dest_reg, 48)),
                          gen_rtx_SUBREG (V4SFmode, op1, 48));
        }
    }
}
--- end snip ---
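
Stripped of the RTL machinery, each case in the function above amounts to four 16-byte copies at byte offsets 0, 16, 32 and 48. The following standalone model illustrates just that splitting. It is not GCC code: the type v16sf_model and the function move_v16sf_model are hypothetical, and the pointer comparison only loosely mirrors the regno0 == regno1 early exit.

```c
#include <stddef.h>
#include <string.h>

enum { V4SF_BYTES = 16, V16SF_BYTES = 64 };

/* Hypothetical model of one V16SF value as 64 raw bytes.  */
typedef struct
{
  unsigned char bytes[V16SF_BYTES];
} v16sf_model;

/* Copy a V16SF value as four V4SF-sized pieces.  */
static void
move_v16sf_model (v16sf_model *dst, const v16sf_model *src)
{
  size_t offset;

  /* Nothing to do when source and destination are the same object.  */
  if (dst == src)
    return;

  for (offset = 0; offset < V16SF_BYTES; offset += V4SF_BYTES)
    memcpy (dst->bytes + offset, src->bytes + offset, V4SF_BYTES);
}
```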


This works OK, but it produces inefficient code. Here is some sample source
code:

--- snip ---

typedef int v4 __attribute__((mode(V4SF)));
typedef int m4 __attribute__((mode(V16SF)));

v4 vec1, vec2;
m4 frog;

int main( int argc, char* argv[] )
{
 m4 blob;

 asm( "some_instruction %0,%1,%2,%3" : "=&j" (blob): "j" (vec1), "j" (vec2), 
"j" (frog) );
 asm( "some_instruction2 %0,%1" : "=&j" (frog) : "j" (blob) );

 return 0;
}

--- end snip ---

where j is the register class for v4sf and v16sf types.
This produces a move of the v16sf type between the two asm instructions
when it doesn't need to. Does anyone have any idea why this move isn't
eliminated?

 #APP
some_instruction r10,r22,r20,r00
 #NO_APP
move r00,r10
move r01,r11
move r02,r12
move r03,r13
 #APP
some_instruction2 r10, r00


r10 doesn't need to be preserved (it isn't written out), but the compiler
seems to be making a copy anyway.  Worse, if "blob" is defined in global space
like "frog", then it also writes r10 out to memory when it shouldn't.


Any ideas appreciated.

Regards

Re: Shipping gmp and mpfr with gcc-4.0?

2005-02-19 Thread Bradley Lucier
On Feb 16, 2005, at 2:13 AM, Eric Botcazou wrote:
>> I tried this evening to install gmp-4.1.4 and mpfr-2.1.0 on my Solaris
>> machines and I failed on the first try.  (I think the default install
>> for gmp on my machines is a 64-bit version, but the default for mpfr
>> and gcc is 32-bit, so I'm going to have to figure out how to configure
>> everything to match.)
>
> ./configure sparc-sun-solaris2.9 --prefix=xxx --enable-mpfr

After explicitly specifying --build=sparc-sun-solaris2.9 with  
gmp-4.1.4, downloading a more recent mpfr and building it with  
--build=sparc-sun-solaris2.9, specifying

../configure --host=sparc-sun-solaris2.9 --build=sparc-sun-solaris2.9  
--target=sparc-sun-solaris2.9  
--prefix=/export/users/lucier/local/gcc-mainline  
--with-gmp=/pkgs/gmp-4.1.4 --with-mpfr=/pkgs/gmp-4.1.4 ; make -j 1  
bootstrap >& build.log

the build failed the first time gfortran tried to compile something  
with the error

/homes/lucier/programs/gcc/mainline/objdir/gcc/gfortran  
-B/homes/lucier/programs/gcc/mainline/objdir/gcc/  
-B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/bin/  
-B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/lib/  
-isystem  
/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/include  
-isystem  
/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/sys-include  
-Wall -fno-repack-arrays -fno-underscoring -c  
../../../libgfortran/intrinsics/selected_int_kind.f90  -fPIC -DPIC -o  
.libs/selected_int_kind.o
ld.so.1: /homes/lucier/programs/gcc/mainline/objdir/gcc/f951: fatal:  
libgmp.so.3: open failed: No such file or directory
gfortran: Internal error: Killed (program f951)
Please submit a full bug report.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.
make[3]: *** [selected_int_kind.lo] Error 1
make[3]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
make[1]: *** [all-target-libgfortran] Error 2
make[1]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir'
make: *** [bootstrap] Error 2

So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4  
to my LD_LIBRARY_PATH?  Find another configure option for GCC that I  
overlooked?

This is supposed to be straightforward?
Brad


Re: Shipping gmp and mpfr with gcc-4.0?

2005-02-19 Thread Eric Botcazou
> So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4
> to my LD_LIBRARY_PATH?

The latter.

> This is supposed to be straightforward?

I guess so. :-)

-- 
Eric Botcazou


Will people install gfortran in 4.0? [was Re: Shipping gmp and mpfr with gcc-4.0?]

2005-02-19 Thread Bradley Lucier
On Feb 19, 2005, at 11:18 AM, Eric Botcazou wrote:
>> So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4
>> to my LD_LIBRARY_PATH?
>
> The latter.

Well, I can't really require people using the compiler to have 
/pkgs/gcc-4.0/lib, /pkgs/gcc-4.0/lib/sparcv9, *and* /pkgs/gmp-4.1.4 in 
their LD_LIBRARY_PATH, and I think my systems people would balk at 
adding /pkgs/gmp-4.1.4 to the crle path, so perhaps I'll just find out 
how to link the gmp libraries in statically.

But I think that in many installations people simply won't dance 
through these hoops and gfortran will not be installed in 4.0.

Brad