Re: [PATCH] Don't assume const/pure calls are total (fix PR tree-optimization/19828)
Project submission for GCC 4.1 - AltiVec rewrite
I had already submitted this to Mark, but since I have improved a few rough spots in the code I think it's better to make it public.

* Project Title

  AltiVec rewrite.

* Project Contributors

  Paolo Bonzini

* Dependencies

  none

* Delivery Date

  March 15th or earlier (the implementation is complete and has no regressions).

* Description

  The project reimplements the AltiVec vector primitives in a saner way, without putting the burden on the preprocessor and instead processing the "overloading" in the C front-end. This would benefit compilation speed on AltiVec vector code, and move the big table of AltiVec overloading pairs from an installed header file to the compiler (an 11000-line header file is reduced to 500 lines plus 2500 in the compiler).

  The changes are so far self-contained in the PowerPC backend, but I expect that a hack I am using will need to change upon review. Unfortunately, a previous RFC I posted on the gcc mailing list got (almost) no answer. I plan to take a look at apple-ppc-branch, which supposedly does not need this hack, or to ask for feedback when I submit the project.

  The new implementation improves on the existing one in that anything but predicates will accept unparenthesized literals even in C. This line:

      vec_add (x, (vector unsigned int) {1, 2, 3, 4})

  currently fails in C and works in C++, but with the new implementation it would work in C as well. On the other hand, using a predicate like this

      vec_all_eq (x, (vector unsigned int) {1, 2, 3, 4})

  will still not work in C (it will *not* be a regression in C++, where it is okay both without and with my changes). It would have to be written as

      vec_all_eq (x, ((vector unsigned int) {1, 2, 3, 4}))

  exactly as in the current implementation.

Paolo
MMX built-ins performance oddities
Hi,

I noticed something strange when I use GCC's builtins for MMX. I defined some unions:

    typedef int v4hi __attribute__ ((__mode__(__V4HI__)));
    typedef int v2si __attribute__ ((__mode__(__V2SI__)));
    typedef int di __attribute__ ((__mode__(__DI__)));

    typedef union { v4hi v; short s[4]; int i[2]; } _v4hi;
    typedef union { v2si v; int i[2]; } _v2si;

The strange thing is this: if I use those unions (e.g. _v4hi var) in my code and pass the vector member to the MMX builtin (e.g. var.v), gcc produces faster code than if I use the v4hi type directly. In my tests the latter was 10% slower. I'd expect identical results, even identical object files as far as instruction scheduling is concerned, but that was not the case.

Is this a known issue with gcc-3.4.3? I compiled the code using -O2 -march=athlon-xp -g3. If you want a smaller test case, I could try to make one. Right now I just didn't want to waste my time in case this is a known issue or I did something stupid...

I also tried using (Intel style?) intrinsics via mmintrin.h. There the times are nearly the same with unions or vectors, but both are as slow as the builtin version using vectors.
The function I used was (using the unions):

    /* Code for use in OpenAL; LGPL license; Copyright 2005 by Prakash Punnor */

    /* prepare sign-extension from 16bit to 32 bit for stream ST */
    #define GET_SIGNMASK(ST) \
        indata.v = *(v4hi*)(entries[ST].data + offset); \
        signmask.v = (v4hi)__builtin_ia32_pand((di)indata.v, (di)m->v); \
        signmask.v = (v4hi)__builtin_ia32_pcmpeqw(signmask.v, m->v);

    /* mix stream 0 */
    #define MIX_ST0 \
        GET_SIGNMASK (0);\
        \
        loout.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v);\
        hiout.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v);

    /* sign-extension and mix stream ST */
    #define MIX(ST) \
        GET_SIGNMASK(ST) \
        temp.v = (v2si)__builtin_ia32_punpcklwd(indata.v, signmask.v); \
        loout.v = __builtin_ia32_paddd(loout.v, temp.v); \
        temp.v = (v2si)__builtin_ia32_punpckhwd(indata.v, signmask.v); \
        hiout.v = __builtin_ia32_paddd(hiout.v, temp.v);

    /* manual saturation to dst */
    #define SATURATE(OFFSET) \
        if (sample == (short)sample) dst[OFFSET] = sample; \
        else { \
            if(sample > 0 ) \
                dst[OFFSET] = max_audioval; \
            else \
                dst[OFFSET] = min_audioval; \
        }\

    /* manually mix samples of mod_len */
    #define MIX_MOD \
        for (offset=0; offset

    /* Mix all remaining and write to dst */
    #define LOOP_MIX \
        while (st

    __attribute__((aligned(16))) static const short sm[4] = {0x8000,0x8000,0x8000,0x8000};
    __attribute__((aligned(16))) static const _v4hi *m = (_v4hi*)sm;

    typedef struct _alMixEntry {
        ALvoid *data;
        ALint bytes;
    } alMixEntry;

    void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
    {
        int len = entries[0].bytes;
        int mod_len = len % (4 * sizeof(ALshort));
        int offset;
        int st;
        _v4hi indata;
        _v4hi signmask;
        _v2si loout;
        _v2si hiout;
        _v2si temp;

        MIX_MOD;
        for (offset=0; offset
            MIX_ST0;
            st = 1;
            LOOP_MIX;
        }
        __builtin_ia32_emms();
        return;
    }

I attached the objdumps:

    old.dump - using unions -> fast
    n3.dump  - using vectors directly -> 10% slower on my athlon-xp, even though the generated asm seems to be shorter

BTW, the buffers were 16-byte aligned.
--
Prakash Punnoor

formerly known as Prakash K. Cheemplavam

mixaudio16.o:     file format elf32-i386

Disassembly of section .text:

__attribute__((aligned(16))) static const short sm[4] = {0x8000,0x8000,0x8000,0x8000};
__attribute__((aligned(16))) static const v4hi *m = (v4hi*)sm;

void MixAudio16_MMX_MOD0(ALshort *dst, alMixEntry *entries, int streams)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   57                      push   %edi
   4:   56                      push   %esi
   5:   53                      push   %ebx
   6:   83 ec 0c                sub    $0xc,%esp
    int len = entries[0].bytes;
    int mod_len = len % (4 * sizeof(ALshort));
    int offset;
    int st;
    v4hi indata;
    v4hi signmask;
    v2si loout;
    v2si hiout;
    v2si temp;

    MIX_MOD;
   9:   31 db                   xor    %ebx,%ebx
   b:   8b 75 0c                mov    0xc(%ebp),%esi
   e:   8b 7d 10                mov    0x10(%ebp),%edi
  11:   8b 46 04                mov    0x4(%esi),%eax
  14:   89 45 f0                mov    %eax,0xfff0(%ebp)
  17:   83 e0 07                and    $0x7,%eax
  1a:   39 c3                   cmp    %eax,%ebx
  1c:   89 45 ec                mov    %eax,0xffec(%ebp)
  1f:   7d 47                   jge    68
  21:   eb 0d                   jmp    30
  23:   90                      nop
Re: MMX built-ins performance oddities
On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:

> Is this a known issue with gcc-3.4.3? I compiled the code using -O2 -march=athlon-xp -g3. If you want a smaller test case, I could try to do so. Right now I just didn't want to waste my time in case this is a known issue or I did something stupid...

Yes, the builtins are known to be a little stupid in 3.4.x. Could you try a snapshot of 4.0.0?

-- Pinski
Re: MMX built-ins performance oddities
Andrew Pinski wrote:

> On Feb 19, 2005, at 8:21 AM, Prakash Punnoor wrote:
>
>> Is this a known issue with gcc-3.4.3? I compiled the code using -O2 -march=athlon-xp -g3. If you want a smaller test case, I could try to do so. Right now I just didn't want to waste my time in case this is a known issue or I did something stupid...
>
> Yes, the builtins are known to be a little stupid in 3.4.x. Could you try a snapshot of 4.0.0?

I'll try tomorrow, as I guess a new one will come out, and I read that the last one had trouble compiling itself.

--
Prakash Punnoor

formerly known as Prakash K. Cheemplavam
moving v16sf reg with multiple sub-regs
Hi there,

I have implemented a move of a v16sf type like this because it is held by 4 v4sf registers:

--- snip ---
(define_expand "movv16sf"
  [(set (match_operand:V16SF 0 "nonimmediate_operand" "")
        (match_operand:V16SF 1 "general_operand" ""))]
  ""
  "
  if ((reload_in_progress | reload_completed) == 0
      && !register_operand (operands[0], V16SFmode)
      && !nonmemory_operand (operands[1], V16SFmode))
    operands[1] = force_reg (V16SFmode, operands[1]);

  move_v16sf( operands );
  DONE;
  ")
--- end snip ---

and in the config's .c file:

--- snip ---
void
move_v16sf (operands )
     rtx operands[];
{
  rtx op0 = operands[0];
  rtx op1 = operands[1];
  enum rtx_code code0 = GET_CODE (operands[0]);
  enum rtx_code code1 = GET_CODE (operands[1]);
  int subreg_offset0 = 0;
  int subreg_offset1 = 0;
  enum delay_type delay = DELAY_NONE;

  if (code0 == REG)
    {
      int regno0 = REGNO (op0) + subreg_offset0;

      if (code1 == REG)
        {
          int regno1 = REGNO (op1) + subreg_offset1;

          /* Just in case, don't do anything for assigning a register
             to itself, unless we are filling a delay slot.  */
          if (regno0 == regno1 && set_nomacro == 0)
            return;

          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0 ), gen_rtx_SUBREG( V4SFmode, op1, 0 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ), gen_rtx_SUBREG( V4SFmode, op1, 16 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ), gen_rtx_SUBREG( V4SFmode, op1, 32 ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ), gen_rtx_SUBREG( V4SFmode, op1, 48 ) );
        }
      else if (code1 == MEM)
        {
          rtx src_reg;

          src_reg = copy_addr_to_reg ( XEXP (op1,0) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 0 ), gen_rtx_MEM( V4SFmode, src_reg ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 16 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 16 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 32 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 32 ) ) );
          emit_move_insn( gen_rtx_SUBREG (V4SFmode, op0, 48 ), gen_rtx_MEM( V4SFmode, plus_constant( src_reg, 48 ) ) );
        }
    }
  else if (code0 == MEM)
    {
      if (code1 == REG)
        {
          rtx dest_reg;

          dest_reg = copy_addr_to_reg ( XEXP (op0,0) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, dest_reg ), gen_rtx_SUBREG (V4SFmode, op1, 0 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 16) ), gen_rtx_SUBREG (V4SFmode, op1, 16 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 32) ), gen_rtx_SUBREG (V4SFmode, op1, 32 ) );
          emit_move_insn( gen_rtx_MEM( V4SFmode, plus_constant( dest_reg, 48) ), gen_rtx_SUBREG (V4SFmode, op1, 48 ) );
        }
    }
}
--- end snip ---

This works OK, but it produces inefficient code. Here is some sample source code:

--- snip ---
typedef int v4 __attribute__((mode(V4SF)));
typedef int m4 __attribute__((mode(V16SF)));

v4 vec1, vec2;
m4 frog;

int main( int argc, char* argv[] )
{
  m4 blob;

  asm( "some_instruction %0,%1,%2,%3" : "=&j" (blob): "j" (vec1), "j" (vec2), "j" (frog) );
  asm( "some_instruction2 %0,%1" : "=&j" (frog) : "j" (blob) );
  return 0;
}
--- end snip ---

where j is the register class for v4sf and v16sf types.

This produces a move of the v16sf type between the two asm instructions when it doesn't need to. Does anyone have any ideas why this move isn't eliminated?

    #APP
    some_instruction r10,r22,r20,r00
    #NO_APP
    move r00,r10
    move r01,r11
    move r02,r12
    move r03,r13
    #APP
    some_instruction2 r10, r00

r10 doesn't need to be preserved (it isn't written out), but the compiler seems to be making a copy anyway. Worse, if "blob" is defined in global space like "frog", then it also writes r10 out to memory when it shouldn't.

Any ideas appreciated.

Regards
Re: Shipping gmp and mpfr with gcc-4.0?
On Feb 16, 2005, at 2:13 AM, Eric Botcazou wrote:

I tried this evening to install gmp-4.1.4 and mpfr-2.1.0 on my Solaris machines and I failed on the first try. (I think the default install for gmp on my machines is a 64-bit version, but the default for mpfr and gcc is 32-bit, so I'm going to have to figure out how to configure everything to match.)

    ./configure sparc-sun-solaris2.9 --prefix=xxx --enable-mpfr

After explicitly specifying --build=sparc-sun-solaris2.9 with gmp-4.1.4, downloading a more recent mpfr and building it with --build=sparc-sun-solaris2.9, specifying

    ../configure --host=sparc-sun-solaris2.9 --build=sparc-sun-solaris2.9 --target=sparc-sun-solaris2.9 --prefix=/export/users/lucier/local/gcc-mainline --with-gmp=/pkgs/gmp-4.1.4 --with-mpfr=/pkgs/gmp-4.1.4 ; make -j 1 bootstrap >& build.log

the build failed the first time gfortran tried to compile something with the error

    /homes/lucier/programs/gcc/mainline/objdir/gcc/gfortran -B/homes/lucier/programs/gcc/mainline/objdir/gcc/ -B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/bin/ -B/export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/lib/ -isystem /export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/include -isystem /export/users/lucier/local/gcc-mainline/sparc-sun-solaris2.9/sys-include -Wall -fno-repack-arrays -fno-underscoring -c ../../../libgfortran/intrinsics/selected_int_kind.f90 -fPIC -DPIC -o .libs/selected_int_kind.o
    ld.so.1: /homes/lucier/programs/gcc/mainline/objdir/gcc/f951: fatal: libgmp.so.3: open failed: No such file or directory
    gfortran: Internal error: Killed (program f951)
    Please submit a full bug report.
    See http://gcc.gnu.org/bugs.html for instructions.
    make[3]: *** [selected_int_kind.lo] Error 1
    make[3]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
    make[2]: *** [all] Error 2
    make[2]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir/sparc-sun-solaris2.9/libgfortran'
    make[1]: *** [all-target-libgfortran] Error 2
    make[1]: Leaving directory `/export/users/lucier/programs/gcc/mainline/objdir'
    make: *** [bootstrap] Error 2

So now what? Not build shared libraries for gmp? Add /pkgs/gmp-4.1.4 to my LD_LIBRARY_PATH? Find another configure option for GCC that I overlooked?

This is supposed to be straightforward?

Brad
Re: Shipping gmp and mpfr with gcc-4.0?
> So now what? Not build shared libraries for gmp? Add /pkgs/gmp-4.1.4
> to my LD_LIBRARY_PATH?

The latter.

> This is supposed to be straightforward?

I guess so. :-)

--
Eric Botcazou
Will people install gfortran in 4.0? [was Re: Shipping gmp and mpfr with gcc-4.0?]
On Feb 19, 2005, at 11:18 AM, Eric Botcazou wrote:

>> So now what? Not build shared libraries for gmp? Add /pkgs/gmp-4.1.4 to my LD_LIBRARY_PATH?
>
> The latter.

Well, I can't really require people using the compiler to have /pkgs/gcc-4.0/lib, /pkgs/gcc-4.0/lib/sparcv9, *and* /pkgs/gmp-4.1.4 in their LD_LIBRARY_PATH, and I think my systems people would balk at adding /pkgs/gmp-4.1.4 to the crle path, so perhaps I'll just find out how to link the gmp libraries in statically.

But I think that in many installations people simply won't dance through these hoops and gfortran will not be installed in 4.0.

Brad