date:20050219

Project submission for GCC 4.1 - AltiVec rewrite

I had already submitted this to Mark, but since I have improved a few 
rough spots in the code I think it's better to make it public.

* Project Title
AltiVec rewrite.
* Project Contributors
Paolo Bonzini
* Dependencies
none
* Delivery Date
March 15th or earlier (the implementation is complete and has no
regressions).

* Description
The project reimplements the AltiVec vector primitives in a saner way,
without putting the burden on the preprocessor and instead processing
the "overloading" in the C front-end.
This would benefit compilation speed on AltiVec vector code, and move
the big table of AltiVec overloading pairs from an installed header
file to the compiler (an 11000-line header file is reduced to 500 lines
plus 2500 in the compiler).
The changes are so far self-contained in the PowerPC backend, but I
expect that a hack I am using will have to be changed upon review.
Unfortunately, a previous RFC I posted on the gcc mailing list
got (almost) no answer.
I plan to take a look at apple-ppc-branch, which supposedly does not
need this hack, or to ask for feedback when I submit the project.

The new implementation improves on the existing one in that anything
but predicates will accept unparenthesized literals even in C.
This line:

  vec_add (x, (vector unsigned int) {1, 2, 3, 4})

currently fails in C and works in C++, but with the new implementation
would work in C as well.  On the other hand, using a predicate like this

  vec_all_eq (x, (vector unsigned int) {1, 2, 3, 4})

will still not work in C (it will *not* be a regression in C++, where it
will be okay both without and with my changes).  It would have to be
written as

  vec_all_eq (x, ((vector unsigned int) {1, 2, 3, 4}))

exactly as in the current implementation.
Paolo

MMX built-ins performance oddities

2005-02-19 Thread Prakash Punnoor

Hi,
I noticed something strange when I use GCC's builtins for MMX:
I defined some unions:
typedef int v4hi __attribute__ ((__mode__(__V4HI__)));
typedef int v2si __attribute__ ((__mode__(__V2SI__)));
typedef int di __attribute__ ((__mode__(__DI__)));
typedef union
{
  v4hi v;
  short s[4];
  int i[2];
} _v4hi;
typedef union
{
  v2si v;
  int i[2];
} _v2si;
The strange thing is this: if I use those unions (e.g. _v4hi var) in my
code and pass the vector to the MMX builtin (e.g. var.v), gcc produces
faster code than if I use the v4hi type directly. In my tests the latter
was 10% slower. I'd expect identical results, and even identical object
files as far as the scheduling of the assembler is concerned, but that
was not the case.
Is this a known issue with gcc-3.4.3? I compiled the code using -O2
-march=athlon-xp -g3. If you want a smaller test case, I could try to
make one. Right now I just didn't want to waste my time in case this is
a known issue or I did something stupid...
I also tried using (Intel-style?) intrinsics via mmintrin.h, and there
the times are nearly the same using unions or vectors, but both are as
slow as the builtin version using vectors.

The function I used was (using the unions):
/* Code for use in OpenAL; LGPL license; Copyright 2005 by Prakash Punnoor */
/* prepare sign-extension from 16bit to 32 bit for stream ST */
#define GET_SIGNMASK(ST) \
indata.v   = *(v4hi*)(entries[ST].data + offset); \
signmask.v = (v4hi)__builtin_ia32_pand((di)indata.v, (di)m->v); \
signmask.v = (v4hi)__builtin_ia32_pcmpeqw(signmask.v, m->v);
/* mix stream 0 */
#define MIX_ST0 \
GET_SIGNMASK (0);\
\
loout.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v);\
hiout.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v);
/* sign-extension and mix stream ST */
#define MIX(ST) \
GET_SIGNMASK(ST) \
temp.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v); \
loout.v = __builtin_ia32_paddd(loout.v, temp.v); \
temp.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v); \
hiout.v = __builtin_ia32_paddd(hiout.v, temp.v);
/* manual saturation to dst */
#define SATURATE(OFFSET) \
if (sample == (short)sample) dst[OFFSET] = sample; \
else { \
  if (sample > 0) \
    dst[OFFSET] = max_audioval; \
  else \
    dst[OFFSET] = min_audioval; \
}
/* manually mix samples of mod_len */
#define MIX_MOD \
for (offset=0; offset
/* Mix all remaining and write to dst */
#define LOOP_MIX \
while (st
__attribute__((aligned(16))) static const short sm[4] =
{0x8000,0x8000,0x8000,0x8000};
__attribute__((aligned(16))) static const _v4hi *m = (_v4hi*)sm;
typedef struct _alMixEntry {
ALvoid *data;
ALint bytes;
} alMixEntry;
void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
{
int len = entries[0].bytes;
int mod_len = len % (4 * sizeof(ALshort));
int offset;
int st;
_v4hi indata;
_v4hi signmask;
_v2si loout;
_v2si hiout;
_v2si temp;
MIX_MOD;
for (offset=0; offset
MIX_ST0;
st = 1;
LOOP_MIX;
}
__builtin_ia32_emms();
return;
}
I attached the objdumps:
old.dump - using unions -> fast
n3.dump - using vectors directly -> 10% slower on my athlon-xp, even though
the generated asm seems to be shorter
BTW, the buffers were 16-byte aligned.
--
Prakash Punnoor
formerly known as Prakash K. Cheemplavam

mixaudio16.o: file format elf32-i386

Disassembly of section .text:

 :
__attribute__((aligned(16))) static const short sm[4] = 
{0x8000,0x8000,0x8000,0x8000};
__attribute__((aligned(16))) static const v4hi *m = (v4hi*)sm;

void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   57                      push   %edi
   4:   56                      push   %esi
   5:   53                      push   %ebx
   6:   83 ec 0c                sub    $0xc,%esp
int len = entries[0].bytes;
int mod_len = len % (4 * sizeof(ALshort));
int offset;
int st;

v4hi indata;
v4hi signmask;

v2si loout;
v2si hiout;

v2si temp;

MIX_MOD;
   9:   31 db                   xor    %ebx,%ebx
   b:   8b 75 0c                mov    0xc(%ebp),%esi
   e:   8b 7d 10                mov    0x10(%ebp),%edi
  11:   8b 46 04                mov    0x4(%esi),%eax
  14:   89 45 f0                mov    %eax,0xfff0(%ebp)
  17:   83 e0 07                and    $0x7,%eax
  1a:   39 c3                   cmp    %eax,%ebx
  1c:   89 45 ec                mov    %eax,0xffec(%ebp)
  1f:   7d 47                   jge    68
  21:   eb 0d                   jmp    30
  23:   90                      nop

Re: MMX built-ins performance oddities

2005-02-19 Thread Andrew Pinski

On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:
> Is this a known issue with gcc-3.4.3? I compiled the code using -O2
> -march=athlon-xp -g3. If you want a smaller test case, I could try to
> do so.
> Right now I just didn't want to waste my time in case this is a known
> issue or I did something stupid...

Yes, the builtins are known to be a little stupid in 3.4.x.  Could you
try a snapshot of 4.0.0?

-- Pinski

Re: MMX built-ins performance oddities

2005-02-19 Thread Prakash Punnoor

Andrew Pinski wrote:
> On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:
>> Is this a known issue with gcc-3.4.3? I compiled the code using -O2
>> -march=athlon-xp -g3. If you want a smaller test case, I could try to
>> do so.
>> Right now I just didn't want to waste my time in case this is a known
>> issue or I did something stupid...
>
> Yes the builtins are known to be a little stupid in 3.4.x.  Could you try
> a snapshot of 4.0.0?

I'll try tomorrow, as I guess a new one will come out, and I read that the
last one had trouble compiling itself.
--
Prakash Punnoor
formerly known as Prakash K. Cheemplavam



moving v16sf reg with multiple sub-regs

2005-02-19 Thread Dylan Cuthbert

Hi there,

I have implemented a move of a v16sf type like this, because it is held by 4
v4sf registers:

--- snip ---

(define_expand "movv16sf"
  [(set (match_operand:V16SF 0 "nonimmediate_operand" "")
        (match_operand:V16SF 1 "general_operand" ""))]
  ""
  "  if ((reload_in_progress | reload_completed) == 0
      && !register_operand (operands[0], V16SFmode)
      && !nonmemory_operand (operands[1], V16SFmode))
    operands[1] = force_reg (V16SFmode, operands[1]);

  move_v16sf( operands );
  DONE;
  ")

--- end snip ---


and in the config's .c file:


--- snip ---

void
move_v16sf (operands )
     rtx operands[];
{
  rtx op0 = operands[0];
  rtx op1 = operands[1];
  enum rtx_code code0 = GET_CODE (operands[0]);
  enum rtx_code code1 = GET_CODE (operands[1]);
  int subreg_offset0 = 0;
  int subreg_offset1 = 0;
  enum delay_type delay = DELAY_NONE;

  if (code0 == REG)
    {
      int regno0 = REGNO (op0) + subreg_offset0;

      if (code1 == REG)
        {
          int regno1 = REGNO (op1) + subreg_offset1;

          /* Just in case, don't do anything for assigning a register
             to itself, unless we are filling a delay slot.  */
          if (regno0 == regno1 && set_nomacro == 0) return;

          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0  ),
                          gen_rtx_SUBREG( V4SFmode, op1, 0   ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ),
                          gen_rtx_SUBREG( V4SFmode, op1, 16  ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ),
                          gen_rtx_SUBREG( V4SFmode, op1, 32  ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ),
                          gen_rtx_SUBREG( V4SFmode, op1, 48  ) );
        }
      else if (code1 == MEM)
        {
          rtx src_reg;

          src_reg = copy_addr_to_reg ( XEXP (op1,0) );

          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0  ),
                          gen_rtx_MEM( V4SFmode, src_reg ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ),
                          gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 16 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ),
                          gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 32 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ),
                          gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 48 ) ) );
        }

    }

  else if (code0 == MEM)
    {
      if (code1 == REG)
        {
          rtx dest_reg;

          dest_reg = copy_addr_to_reg ( XEXP (op0,0) );

          emit_move_insn( gen_rtx_MEM( V4SFmode, dest_reg ),
                          gen_rtx_SUBREG (V4SFmode, op1, 0  ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 16) ),
                          gen_rtx_SUBREG (V4SFmode, op1, 16 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 32) ),
                          gen_rtx_SUBREG (V4SFmode, op1, 32 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 48) ),
                          gen_rtx_SUBREG (V4SFmode, op1, 48 ) );
        }
    }

}
--- end snip ---


This works ok, but it produces inefficient code. Here is some sample source
code:

--- snip ---

typedef int v4 __attribute__((mode(V4SF)));
typedef int m4 __attribute__((mode(V16SF)));

v4 vec1, vec2;
m4 frog;

int main( int argc, char* argv[] )
{
  m4 blob;

  asm( "some_instruction %0,%1,%2,%3" : "=&j" (blob): "j" (vec1), "j" (vec2),
       "j" (frog) );
  asm( "some_instruction2 %0,%1" : "=&j" (frog) : "j" (blob) );

  return 0;
}

--- end snip ---

where j is the register class for v4sf and v16sf types.
This produces a move of the v16sf type between the two asm instructions
when it doesn't need to. Does anyone have any ideas why this move isn't
eliminated?

 #APP
    some_instruction r10,r22,r20,r00
 #NO_APP
    move r00,r10
    move r01,r11
    move r02,r12
    move r03,r13
 #APP
    some_instruction2 r10, r00


r10 doesn't need to be preserved (it isn't written out), but it seems to be
making a copy anyway.  Worse, if "blob" is defined in global space like
"frog", then it also writes out r10 to memory when it shouldn't.


Any ideas appreciated.

Regards

Re: Shipping gmp and mpfr with gcc-4.0?

2005-02-19 Thread Bradley Lucier

On Feb 16, 2005, at 2:13 AM, Eric Botcazou wrote:
>> I tried this evening to install gmp-4.1.4 and mpfr-2.1.0 on my Solaris
>> machines and I failed on the first try.  (I think the default install
>> for gmp on my machines is a 64-bit version, but the default for mpfr
>> and gcc is 32-bit, so I'm going to have to figure out how to configure
>> everything to match.)
>
> ./configure sparc-sun-solaris2.9 --prefix=xxx --enable-mpfr

After explicitly specifying --build=sparc-sun-solaris2.9 with  
gmp-4.1.4, downloading a more recent mpfr and building it with  
--build=sparc-sun-solaris2.9, specifying

../configure --host=sparc-sun-solaris2.9 --build=sparc-sun-solaris2.9  
--target=sparc-sun-solaris2.9  
--prefix=/export/users/lucier/local/gcc-mainline  
--with-gmp=/pkgs/gmp-4.1.4 --with-mpfr=/pkgs/gmp-4.1.4 ; make -j 1  
bootstrap >& build.log

the build failed the first time gfortran tried to compile something  
with the error

/homes/lucier/programs/gcc/mainline/objdir/gcc/gfortran
-B/homes/lucier/programs/gcc/mainline/objdir/gcc/
-B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/bin/
-B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/lib/
-isystem /export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/include
-isystem /export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/sys-include
-Wall -fno-repack-arrays -fno-underscoring -c
../../../libgfortran/intrinsics/selected_int_kind.f90  -fPIC -DPIC -o
.libs/selected_int_kind.o
ld.so.1: /homes/lucier/programs/gcc/mainline/objdir/gcc/f951: fatal:
libgmp.so.3: open failed: No such file or directory
gfortran: Internal error: Killed (program f951)
Please submit a full bug report.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.
make[3]: *** [selected_int_kind.lo] Error 1
make[3]: Leaving directory
`/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
make[2]: *** [all] Error 2
make[2]: Leaving directory
`/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
make[1]: *** [all-target-libgfortran] Error 2
make[1]: Leaving directory
`/export/users/lucier/programs/gcc/mainline/objdir'
make: *** [bootstrap] Error 2

So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4  
to my LD_LIBRARY_PATH?  Find another configure option for GCC that I  
overlooked?

This is supposed to be straightforward?
Brad

Re: Shipping gmp and mpfr with gcc-4.0?

2005-02-19 Thread Eric Botcazou

> So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4
> to my LD_LIBRARY_PATH?

The latter.

> This is supposed to be straightforward?

I guess so. :-)

-- 
Eric Botcazou

Will people install gfortran in 4.0? [was Re: Shipping gmp and mpfr with gcc-4.0?]

2005-02-19 Thread Bradley Lucier

On Feb 19, 2005, at 11:18 AM, Eric Botcazou wrote:
>> So now what?  Not build shared libraries for gmp?  Add /pkgs/gmp-4.1.4
>> to my LD_LIBRARY_PATH?
>
> The latter.

Well, I can't really require people using the compiler to have 
/pkgs/gcc-4.0/lib, /pkgs/gcc-4.0/lib/sparcv9, *and* /pkgs/gmp-4.1.4 in 
their LD_LIBRARY_PATH, and I think my systems people would balk at 
adding /pkgs/gmp-4.1.4 to the crle path, so perhaps I'll just find out 
how to link the gmp libraries in statically.

But I think that in many installations people simply won't dance 
through these hoops and gfortran will not be installed in 4.0.

Brad
