IVopts bug?
Hi all,

I found that IVopts rewrites a memory access using a weird IV candidate,
which makes it lose its original memory attribute. The base pointer of a
non-local memory access was rewritten into a local one, and the store was
then deleted in pass_cd_dce because it was recognized as a local memory
access.

Here is the case I simplified from a decoder source:

foo1(unsigned char* pSrcLeft,
     unsigned char* pSrcAbove,
     unsigned char* pSrcAboveLeft,
     unsigned char* pDst,
     int dstStep,
     int leftStep)
{
    signed int x, y, s;
    unsigned char p1[5], p2[5], p3;

    p1[0] = *pSrcAboveLeft;
    p2[0] = p1[0];
    p2[1] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[2] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[3] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[4] = pSrcLeft[0];

    p1[1] = pSrcAbove[0];
    p1[2] = pSrcAbove[1];
    p1[3] = pSrcAbove[2];
    p1[4] = pSrcAbove[3];

    p3 = (unsigned char)(((signed int)p1[1] + (signed int)p2[1] +
                          (signed int)p1[0] + (signed int)p1[0] + 2 ) >> 2 );

    for( y=0; y<4; y++, pDst += dstStep ) {
        for( x=y+1; x<4; x++ ) {
            s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2;
            pDst[x] = (unsigned char)s;
        }

        pDst[y] = p3;    <- this memory access
    }
}

Before IVopts:

D.6508_65 = pDst_88 + y.6_64;
*D.6508_65 = p3_37;

After IVopts it was rewritten to

MEM[symbol: p1, index: ivtmp.161_200, offset: 0B] = p3_37,

by

candidate 15
  depends on 3
  var_before ivtmp.161
  var_after ivtmp.161
  incremented before exit test
  type unsigned int
  base (unsigned int) pDst_39(D) - (unsigned int) &p1
  step (unsigned int) (pretmp.28_118 + 1)

So the address is still &p1 + pDst - &p1 + step = pDst + step, but in
pass_cd_dce, is_hidden_global_store () returns false for this store
because it thinks the statement only accesses the local array p1.
gcc version r180694

Configured with: /home/croseadu/android/_src/src/gcc-src/configure
--host=i486-linux-gnu --build=i486-linux-gnu
--target=arm-none-linux-gnueabi
--prefix=/home/croseadu/android/_src/install/arm-none-linux-gnueabi
--enable-threads --disable-libmudflap --disable-libssp
--disable-libstdcxx-pch --with-gnu-as --with-gnu-ld
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu
--enable-__cxa_atexit
--with-specs='%{funwind-tables|fno-unwind-tables|mabi=*|ffreestanding|nostdlib:;:-funwind-tables}'
--disable-nls --enable-lto
--with-sysroot=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/libc
--with-build-sysroot=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/libc
--with-gmp=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
--with-mpfr=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
--with-ppl=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--with-cloog=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
--enable-cloog-backend=isl
--with-mpc=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
--enable-poison-system-directories --disable-libquadmath --enable-lto
--enable-libgomp
--with-build-time-tools=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/bin
--with-cpu=cortex-a8 --with-float=soft

compile flags:
-O3 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-double

Do I need to file a bug?
Yuehai Du

#include
#define N 10

__attribute__ ((noinline)) void
foo1(unsigned char* pSrcLeft,
     unsigned char* pSrcAbove,
     unsigned char* pSrcAboveLeft,
     unsigned char* pDst,
     int dstStep,
     int leftStep)
{
    signed int x, y, s;
    unsigned char p1[5], p2[5], p3;

    p1[0] = *pSrcAboveLeft;
    p2[0] = p1[0];
    p2[1] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[2] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[3] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[4] = pSrcLeft[0];

    p1[1] = pSrcAbove[0];
    p1[2] = pSrcAbove[1];
    p1[3] = pSrcAbove[2];
    p1[4] = pSrcAbove[3];

    p3 = (unsigned char)(((signed int)p1[1] + (signed int)p2[1] +
                          (signed int)p1[0] + (signed int)p1[0] + 2 ) >> 2 );

    for( y=0; y<4; y++, pDst += dstStep ) {
        for( x=y+1; x<4; x++ ) {
            s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2;
            pDst[x] = (unsigned char)s;
        }

        pDst[y] = p3;
    }
}

__attribute__ ((noinline)) void
foo2(unsigned char* pSrcLeft,
     unsigned char* pSrcAbove,
     unsigned char* pSrcAboveLeft,
     unsigned char* pDst,
     int dstStep,
     int leftStep)
{
    signed int x, y, s;
    unsigned char p1[5], p2[5], p3;

    p1[0] = *pSrcAboveLeft;
    p2[0] = p1[0];
    p2[1] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[2] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[3] = pSrcLeft[0];
    pSrcLeft += leftStep;
    p2[4] = pSrcLeft[0];

    p1[1] = pSrcAbove[0];
    p1[2] = pSrcAbove[1];
    p1[3] = pSrcAbove[2];
    p1[4] = pSrcAbove[3];

    p3 = (unsigned char)(((signed
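A driver along the following lines could check whether the diagonal stores survive optimization. This is only a minimal sketch and not the truncated attachment above: the helper name count_missing_diagonal_stores, the input values, and the 4x4 destination layout are assumptions made for illustration; foo1 is copied from the report.

```c
#include <string.h>

/* foo1 as posted in the report; noinline keeps it a separate function so
   its loops are optimized on their own. */
__attribute__ ((noinline)) static void
foo1 (unsigned char *pSrcLeft, unsigned char *pSrcAbove,
      unsigned char *pSrcAboveLeft, unsigned char *pDst,
      int dstStep, int leftStep)
{
  signed int x, y, s;
  unsigned char p1[5], p2[5], p3;

  p1[0] = *pSrcAboveLeft;
  p2[0] = p1[0];
  p2[1] = pSrcLeft[0];
  pSrcLeft += leftStep;
  p2[2] = pSrcLeft[0];
  pSrcLeft += leftStep;
  p2[3] = pSrcLeft[0];
  pSrcLeft += leftStep;
  p2[4] = pSrcLeft[0];

  p1[1] = pSrcAbove[0];
  p1[2] = pSrcAbove[1];
  p1[3] = pSrcAbove[2];
  p1[4] = pSrcAbove[3];

  p3 = (unsigned char) (((signed int) p1[1] + (signed int) p2[1]
                         + (signed int) p1[0] + (signed int) p1[0] + 2) >> 2);

  for (y = 0; y < 4; y++, pDst += dstStep)
    {
      for (x = y + 1; x < 4; x++)
        {
          s = (p1[x - y - 1] + p1[x - y] + p1[x - y] + p1[x - y + 1] + 2) >> 2;
          pDst[x] = (unsigned char) s;
        }
      pDst[y] = p3;   /* the store that the report says gets deleted */
    }
}

/* Hypothetical helper: run foo1 on a sentinel-filled 4x4 block and count
   the diagonal bytes that were NOT overwritten with p3. */
static int
count_missing_diagonal_stores (void)
{
  unsigned char left[4] = { 12, 13, 14, 15 };
  unsigned char above[4] = { 10, 20, 30, 40 };
  unsigned char above_left = 8;
  unsigned char dst[16];
  int y, missing = 0;
  /* Same formula as p3 in foo1, computed independently of it. */
  unsigned char expected
    = (unsigned char) ((above[0] + left[0] + 2 * above_left + 2) >> 2);

  memset (dst, 0xFF, sizeof dst);   /* sentinel differs from expected */
  foo1 (left, above, &above_left, dst, 4, 1);

  for (y = 0; y < 4; y++)
    if (dst[y * 4 + y] != expected)
      missing++;
  return missing;
}
```

A correctly compiled foo1 should make the helper return 0; a non-zero result at -O3 but not at -O0 would point at the kind of dropped store described in the report.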
Re: Adding official support into the main tree for SPARC Leon
I'll send new patches as a reply.

Eric Botcazou wrote:
> [CCing David Miller, the SPARC binutils maintainer]
> OK, so you're proposing a new 'leon' sub-architecture for binutils.

Yes.

>> The appended 2 patches do:
>> 1. 0001-sparc-leon-Use-Aleon-assembler-switch-for-mcpu-leon-.patch
>>    Append "-Aleon" to the assembler
>
> This looks incomplete. Don't you also want to enable the instructions?

The [casa,smac,umac] instructions are used in inline assembler only.

>> 2. 0001-sparc-leon-add-leon-architecture-to-GAS.patch
>>    Define new "leon" processor type in GAS + enable for "leon"
>>    umac/smac and "casa".
>
> The configure.tgt change looks useless to me.

I have removed it; if gcc's "-Aleon" is added, it is not needed.

> Other nits:
>
> @@ -1668,9 +1671,8 @@ EFPOP2_2 ("efcmpes",0x055, "e,f"),
> { "cpop2", F3(2, 0x37, 0), F3(~2, ~0x37, ~1), "[1+2],d", F_ALIAS, v6notv9 },
>
> /* sparclet specific insns */
> -
> -COMMUTEOP ("umac", 0x3e, sparclet),
> -COMMUTEOP ("smac", 0x3f, sparclet),
> +COMMUTEOP ("umac", 0x3e, sparclet|MASK_LEON),
> +COMMUTEOP ("smac", 0x3f, sparclet|MASK_LEON),
> COMMUTEOP ("umacd", 0x2e, sparclet),
> COMMUTEOP ("smacd", 0x2f, sparclet),
> COMMUTEOP ("umuld", 0x09, sparclet),
>
> sparclet|leon
>
> -{ "casa",F3(3, 0x3c, 0), F3(~3, ~0x3c, ~0), "[1]A,2,d", 0, v9 },
> -{ "casa",F3(3, 0x3c, 1), F3(~3, ~0x3c, ~1), "[1]o,2,d", 0, v9 },
> +{ "casa",F3(3, 0x3c, 0), F3(~3, ~0x3c, ~0), "[1]A,2,d", 0, v9|MASK_LEON },
> +{ "casa",F3(3, 0x3c, 1), F3(~3, ~0x3c, ~1), "[1]o,2,d", 0, v9|MASK_LEON },
>
> v9|leon
>
> +{ "cas", F3(3, 0x3c, 0)|ASI(0x80), F3(~3, ~0x3c, ~0)|ASI(~0x80), "[1],2,d",
> F_ALIAS, v9|MASK_LEON }, /* casa [rs1]ASI_P,rs2,rd */
> +{ "casl",F3(3, 0x3c, 0)|ASI(0x88), F3(~3, ~0x3c, ~0)|ASI(~0x88), "[1],2,d",
> F_ALIAS, v9|MASK_LEON }, /* casa [rs1]ASI_P_L,rs2,rd */
>
> Likewise.

I fixed that.
Intro
Here are the new patches that add -Aleon to binutils and make -Aleon the
default assembler switch in gcc:

- [PATCH 1/1] sparc leon: add -Aleon architecture to GAS:
  binutils patch
- [PATCH 1/1] sparc leon: Use -Aleon assembler switch for -mcpu=leon arch:
  gcc patch
[PATCH 1/1] sparc leon: add -Aleon architecture to GAS
Add -Aleon architecture selection to GAS. -Aleon supports
[umac,smac] and [casa,casl].

Signed-off-by: Konrad Eisele
---
 gas/config/tc-sparc.c  |    3 ++-
 include/opcode/sparc.h |    1 +
 opcodes/sparc-opc.c    |   16 +---
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/gas/config/tc-sparc.c b/gas/config/tc-sparc.c
index 77fda56..47f4386 100644
--- a/gas/config/tc-sparc.c
+++ b/gas/config/tc-sparc.c
@@ -221,7 +221,7 @@ static void output_insn (const struct sparc_opcode *, struct sparc_it *);
    for this use. That table is for opcodes only. This table is for opcodes
    and file formats. */

-enum sparc_arch_types {v6, v7, v8, sparclet, sparclite, sparc86x, v8plus,
+enum sparc_arch_types {v6, v7, v8, leon, sparclet, sparclite, sparc86x, v8plus,
                        v8plusa, v9, v9a, v9b, v9_64};

 static struct sparc_arch {
@@ -246,6 +246,7 @@ static struct sparc_arch {
   { "sparcima", "v9b", v9, 0, 1, F_MUL32|F_DIV32|F_FSMULD|F_POPC|F_VIS|F_VIS2|F_FMAF|F_IMA },
   { "sparcvis3", "v9b", v9, 0, 1, F_MUL32|F_DIV32|F_FSMULD|F_POPC|F_VIS|F_VIS2|F_FMAF|F_VIS3|F_HPC },
   { "sparcvis3r", "v9b", v9, 0, 1, F_MUL32|F_DIV32|F_FSMULD|F_POPC|F_VIS|F_VIS2|F_FMAF|F_VIS3|F_HPC|F_RANDOM|F_TRANS|F_FJFMAU },
+  { "leon", "leon", leon, 32, 1, F_MUL32|F_DIV32|F_FSMULD },
   { "sparclet", "sparclet", sparclet, 32, 1, F_MUL32|F_DIV32|F_FSMULD },
   { "sparclite", "sparclite", sparclite, 32, 1, F_MUL32|F_DIV32|F_FSMULD },
   { "sparc86x", "sparclite", sparc86x, 32, 1, F_MUL32|F_DIV32|F_FSMULD },
diff --git a/include/opcode/sparc.h b/include/opcode/sparc.h
index 7ae3641..2283a93 100644
--- a/include/opcode/sparc.h
+++ b/include/opcode/sparc.h
@@ -42,6 +42,7 @@ enum sparc_opcode_arch_val
   SPARC_OPCODE_ARCH_V6 = 0,
   SPARC_OPCODE_ARCH_V7,
   SPARC_OPCODE_ARCH_V8,
+  SPARC_OPCODE_ARCH_LEON,
   SPARC_OPCODE_ARCH_SPARCLET,
   SPARC_OPCODE_ARCH_SPARCLITE,
   /* V9 variants must appear last. */
diff --git a/opcodes/sparc-opc.c b/opcodes/sparc-opc.c
index a2096c5..f467588 100644
--- a/opcodes/sparc-opc.c
+++ b/opcodes/sparc-opc.c
@@ -33,6 +33,7 @@
 #define MASK_V6 SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_V6)
 #define MASK_V7 SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_V7)
 #define MASK_V8 SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_V8)
+#define MASK_LEON SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_LEON)
 #define MASK_SPARCLET SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_SPARCLET)
 #define MASK_SPARCLITE SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_SPARCLITE)
 #define MASK_V9 SPARC_OPCODE_ARCH_MASK (SPARC_OPCODE_ARCH_V9)
@@ -56,6 +57,7 @@
    recognizes all v8 insns. */
 #define v8 (MASK_V8 | MASK_SPARCLET | MASK_SPARCLITE \
             | MASK_V9 | MASK_V9A | MASK_V9B)
+#define leon (MASK_LEON)
 #define sparclet (MASK_SPARCLET)
 #define sparclite (MASK_SPARCLITE)
 #define v9 (MASK_V9 | MASK_V9A | MASK_V9B)
@@ -76,6 +78,7 @@ const struct sparc_opcode_arch sparc_opcode_archs[] =
   { "v6", MASK_V6 },
   { "v7", MASK_V6 | MASK_V7 },
   { "v8", MASK_V6 | MASK_V7 | MASK_V8 },
+  { "leon", MASK_V6 | MASK_V7 | MASK_V8 | MASK_LEON },
   { "sparclet", MASK_V6 | MASK_V7 | MASK_V8 | MASK_SPARCLET },
   { "sparclite", MASK_V6 | MASK_V7 | MASK_V8 | MASK_SPARCLITE },
   /* ??? Don't some v8 priviledged insns conflict with v9? */
@@ -1668,9 +1671,8 @@ EFPOP2_2 ("efcmpes", 0x055, "e,f"),
 { "cpop2", F3(2, 0x37, 0), F3(~2, ~0x37, ~1), "[1+2],d", F_ALIAS, v6notv9 },

 /* sparclet specific insns */
-
-COMMUTEOP ("umac", 0x3e, sparclet),
-COMMUTEOP ("smac", 0x3f, sparclet),
+COMMUTEOP ("umac", 0x3e, sparclet|leon),
+COMMUTEOP ("smac", 0x3f, sparclet|leon),
 COMMUTEOP ("umacd", 0x2e, sparclet),
 COMMUTEOP ("smacd", 0x2f, sparclet),
 COMMUTEOP ("umuld", 0x09, sparclet),
@@ -1721,8 +1723,8 @@ SLCBCC("cbnefr", 15),
 #undef SLCBCC2
 #undef SLCBCC

-{ "casa", F3(3, 0x3c, 0), F3(~3, ~0x3c, ~0), "[1]A,2,d", 0, v9 },
-{ "casa", F3(3, 0x3c, 1), F3(~3, ~0x3c, ~1), "[1]o,2,d", 0, v9 },
+{ "casa", F3(3, 0x3c, 0), F3(~3, ~0x3c, ~0), "[1]A,2,d", 0, v9|leon },
+{ "casa", F3(3, 0x3c, 1), F3(~3, ~0x3c, ~1), "[1]o,2,d", 0, v9|leon },
 { "casxa", F3(3, 0x3e, 0), F3(~3, ~0x3e, ~0), "[1]A,2,d", 0, v9 },
 { "casxa", F3(3, 0x3e, 1), F3(~3, ~0x3e, ~1), "[1]o,2,d", 0, v9 },

@@ -1732,8 +1734,8 @@ SLCBCC("cbnefr", 15),
 { "signx", F3(2, 0x27, 0), F3(~2, ~0x27, ~0)|(1<<12)|ASI(~0)|RS2_G0, "r", F_ALIAS, v9 }, /* sra rd,%g0,rd */
 { "clruw", F3(2, 0x26, 0), F3(~2, ~0x26, ~0)|(1<<12)|ASI(~0)|RS2_G0, "1,d", F_ALIAS, v9 }, /* srl rs1,%g0,rd */
 { "clruw", F3(2, 0x26, 0), F3(~2, ~0x26, ~0)|(1<<12)|ASI(~0)|RS2_G0, "r", F_ALIAS, v9 }, /* srl rd,%g0,rd */
-{ "cas", F3(3, 0x3c, 0)|ASI(0x80), F3(~3, ~0x3c, ~0)|ASI(~0x80), "[1],2,d", F_ALIAS, v9 }, /* casa [rs1]ASI_P,rs2,rd */
-{ "casl", F3(3, 0x3c, 0)|ASI(0x88), F3(~3, ~0x3c, ~0)|ASI(~0x88), "[1],2,d", F_ALIAS, v9 },
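The effect of the mask changes in this patch can be illustrated with a small stand-alone model. This is a simplified sketch, not the GAS code: all toy_* and TOY_* names are hypothetical, only a few architectures are modelled, and it assumes an instruction is accepted when its architecture mask intersects the supported mask of the selected -A architecture.

```c
/* Toy model of the architecture masks touched by the patch.  The real
   definitions live in include/opcode/sparc.h and opcodes/sparc-opc.c. */
enum toy_arch
{
  TOY_V6, TOY_V7, TOY_V8, TOY_LEON, TOY_SPARCLET, TOY_V9, TOY_ARCH_COUNT
};

#define TOY_MASK(arch) (1u << (arch))
#define TOY_V8_SET (TOY_MASK (TOY_V6) | TOY_MASK (TOY_V7) | TOY_MASK (TOY_V8))

/* Per selected architecture: which architectures' instructions it accepts,
   mirroring the shape of the sparc_opcode_archs[] entries in the patch. */
static const unsigned int toy_supported[TOY_ARCH_COUNT] = {
  [TOY_V6] = TOY_MASK (TOY_V6),
  [TOY_V7] = TOY_MASK (TOY_V6) | TOY_MASK (TOY_V7),
  [TOY_V8] = TOY_V8_SET,
  [TOY_LEON] = TOY_V8_SET | TOY_MASK (TOY_LEON),
  [TOY_SPARCLET] = TOY_V8_SET | TOY_MASK (TOY_SPARCLET),
  [TOY_V9] = TOY_V8_SET | TOY_MASK (TOY_V9),
};

/* Instruction masks after the patch (v9 reduced to one bit here). */
#define TOY_INSN_UMAC  (TOY_MASK (TOY_SPARCLET) | TOY_MASK (TOY_LEON))
#define TOY_INSN_CASA  (TOY_MASK (TOY_V9) | TOY_MASK (TOY_LEON))
#define TOY_INSN_CASXA (TOY_MASK (TOY_V9))

/* Assumed acceptance test: the two masks must share at least one bit. */
static int
toy_accepts (enum toy_arch selected, unsigned int insn_archs)
{
  return (toy_supported[selected] & insn_archs) != 0;
}
```

Under this model, -Aleon accepts umac and casa but not casxa, and plain v8 still rejects casa, which matches the intent stated in the patch description.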
[PATCH 1/1] sparc leon: Use -Aleon assembler switch for -mcpu=leon arch
Use -Aleon to enable binutils sparc-leon architecture. The leon-arch
binutils GAS has umac/smac and casa enabled.

Signed-off-by: Konrad Eisele
---
 gcc/config/sparc/sparc.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/sparc/sparc.h b/gcc/config/sparc/sparc.h
index 65b4527..bbadeb2 100644
--- a/gcc/config/sparc/sparc.h
+++ b/gcc/config/sparc/sparc.h
@@ -236,7 +236,7 @@ extern enum cmodel sparc_cmodel;

 #if TARGET_CPU_DEFAULT == TARGET_CPU_leon
 #define CPP_CPU32_DEFAULT_SPEC "-D__leon__ -D__sparc_v8__"
-#define ASM_CPU32_DEFAULT_SPEC ""
+#define ASM_CPU32_DEFAULT_SPEC "-Aleon"
 #endif

 #endif
@@ -324,7 +324,7 @@ extern enum cmodel sparc_cmodel;

 /* Override in target specific files. */
 #define ASM_CPU_SPEC "\
-%{mcpu=sparclet:-Asparclet} %{mcpu=tsc701:-Asparclet} \
+%{mcpu=sparclet:-Asparclet} %{mcpu=leon:-Aleon} %{mcpu=tsc701:-Asparclet} \
 %{mcpu=sparclite:-Asparclite} \
 %{mcpu=sparclite86x:-Asparclite} \
 %{mcpu=f930:-Asparclite} %{mcpu=f934:-Asparclite} \
--
1.6.4.1
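Each %{mcpu=X:-AY} entry in ASM_CPU_SPEC passes -AY to the assembler when -mcpu=X is given. The following sketch mirrors only the entries visible in the hunk above as a lookup table; the function name asm_arch_flag is hypothetical, and returning an empty string for any other CPU is an assumption of this sketch, since the spec continues beyond the excerpt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map an -mcpu value to the assembler -A switch,
   following the ASM_CPU_SPEC entries shown in the patch hunk. */
static const char *
asm_arch_flag (const char *mcpu)
{
  static const struct { const char *cpu; const char *flag; } map[] = {
    { "sparclet", "-Asparclet" },
    { "leon", "-Aleon" },          /* the entry added by the patch */
    { "tsc701", "-Asparclet" },
    { "sparclite", "-Asparclite" },
    { "sparclite86x", "-Asparclite" },
    { "f930", "-Asparclite" },
    { "f934", "-Asparclite" },
  };
  size_t i;

  for (i = 0; i < sizeof map / sizeof map[0]; i++)
    if (strcmp (mcpu, map[i].cpu) == 0)
      return map[i].flag;
  return "";   /* assumption: nothing is added for CPUs not listed here */
}
```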
Re: [PATCH 1/1] sparc leon: add -Aleon architecture to GAS
Please post binutils patches with the binutils development list CC:'d.
Re: [PATCH 1/1] sparc leon: Use -Aleon assembler switch for -mcpu=leon arch
GCC patches are to be posted to gcc-patches, not gcc.
Re: [PATCH 1/1] sparc leon: add -Aleon architecture to GAS
David Miller wrote:
>
> Please post binutils patches with the binutils development list CC:'d.
>

Is the binutils development list bug-binut...@gnu.org ?
Re: [PATCH 1/1] sparc leon: Use -Aleon assembler switch for -mcpu=leon arch
David Miller wrote:
>
> GCC patches are to be posted to gcc-patches, not gcc.
>

I have sent it there.
Re: [PATCH 1/1] sparc leon: add -Aleon architecture to GAS
From: Konrad Eisele
Date: Tue, 01 Nov 2011 10:19:04 +0100

> David Miller wrote:
>>
>> Please post binutils patches with the binutils development list CC:'d.
>>
>
> Is the binutils development list bug-binut...@gnu.org ?

No, it's binut...@sourceware.org
Re: [PATCH 1/1] sparc leon: add -Aleon architecture to GAS
David Miller wrote:
> From: Konrad Eisele
> Date: Tue, 01 Nov 2011 10:19:04 +0100
>
>> David Miller wrote:
>>>
>>> Please post binutils patches with the binutils development list CC:'d.
>>>
>>
>> Is the binutils development list bug-binut...@gnu.org ?
>
> No, it's binut...@sourceware.org

Ok, I've sent it there.
Re: Potentially merging the transactional-memory branch into mainline.
On Mon, Oct 31, 2011 at 11:33 PM, Aldy Hernandez wrote:
> This is somewhat of a me-too message for the transactional-memory work. We
> would also like it to be considered for merging with mainline before the end
> of stage1.
>
> We have a kept a wiki here:
>
> http://gcc.gnu.org/wiki/TransactionalMemory
>
> What it is
> ==========
>
> From the wiki...
>
> Transactional memory is intended to make programming with threads simpler.
> As with databases, a transaction is a unit of work that either completes in
> its entirety or has no effect at all. Further, transactions are isolated
> from each other such that each transaction sees a consistent view of memory.
>
> Transactional memory comes in two forms: a Software Transactional Memory
> (STM) system uses locks or other standard atomic instructions to do its job.
> A Hardware Transactional Memory (HTM) system uses features of the cpu to
> implement the requirements of the transaction directly (e.g. Rock
> processor). Most HTM systems are best effort, which means that the
> transaction can fail for unrelated reasons. Thus almost all systems that
> incorporate HTM also have a STM component and are thus termed Hybrid
> Transactional Memory systems.
>
> The transactional memory system to be implemented in GCC provides single
> lock atomicity semantics. That is, a program behaves as if a single global
> lock guards each transaction.
>
> What it involves
> ================
>
> We have implemented the latest spec from the multi-vendor transactional
> memory group that includes AMD, Intel, Oracle, and others. The last
> official spec is what is in the wiki above, yet there are some minor changes
> to the keywords that are currently being finalized in the final document
> (but have already been agreed upon), and will be published shortly.
>
> It is my understanding (Torvald, correct me if I'm wrong), that the current
> implementation is what has been agreed to by the committee, and has been
> given a favorable nod by various members of the C++ standardization
> committee. Most importantly, the keywords are agreed upon.
>
> There are changes to the C and C++ front-end, and a software library
> (libitm) to go along with it. The library works on x86-64, x86-32, and
> Richard's favorite, Alpha :-). Porting to other architectures should be a
> straightforward affair.
>
> Status
> ======
>
> The current implementation runs the common TM benchmarks correctly, albeit
> there is still work to be done to improve performance.
>
> There are a handful of failed compiler tests on the included transactional
> memory testsuite (g*.dg/tm/*), but they are all missed optimizations, which
> we hope to have fixed after the merge.
>
> What's left
> ===========
>
> Torvald is working on some recent changes to noexcept, and we should have
> this working in a few days.
>
> I will be removing the cancel-throw construct which didn't make it in the
> final spec. I should have that done tomorrow.
>
> The final word
> ==============
> Seeing that a global maintainer has been lead on this for a while, I suspect
> there isn't much to review formally. I believe the only bits that Richard
> isn't directly responsible for are the C++ front-end changes.
>
> So what is the opinion/consensus on merging the branch? It would be nice to
> get this infrastructure in place as well so we can get people to start using
> it, and then we can work out any issues that arise.
>
> I have no idea how this happened, but apparently I'm on the hook for merging
> both the cxx-mem-model and this branch (if/when one/both get approved). If
> this gets approved, I'd prefer to get the cxx-mem-model branch merged first,
> and the transactional-memory branch later during the week. I will be
> partially available during the weekend, and definitely during next week.

Given that you only recently merged with trunk again are you really
sure this is a great idea at this point in time? Does the GCC 4.7 user
community benefit from this in any way (or rather how much percentage
of it)?

Thus, please consider merging early during GCC 4.8 stage1 instead.

Thanks,
Richard.
Re: IVopts bug?
2011/11/1 杜越海:
> Hi all
>
> I found IVopts rewrite a memory access with a weird iv candidate,
> which make it lost its original memory attribute.
> a non-local memory access' base pointer was rewrite into a local one,
> and it was deleted in pass_cd_dce since
> it was recognized as a local memory access.

[snip]

> need file a bug?

Yes, it definitely should not do this kind of stupid (and invalid) thing.

Richard.

>
> Yuehai Du
>
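The arithmetic behind the report can be checked with plain unsigned integers. The sketch below is an illustration only, with hypothetical function names: it shows that an IV whose base is "destination minus local symbol", added back to the local symbol, still yields an address in the destination buffer, so classifying the store by its symbol alone is what goes wrong.

```c
#include <stdint.h>

/* Base of the IV candidate from the dump:
   (unsigned int) pDst - (unsigned int) &p1.
   Unsigned subtraction wraps, so this is well defined even when the
   destination lies below the local array. */
static uintptr_t
iv_candidate_base (uintptr_t dst, uintptr_t local_sym)
{
  return dst - local_sym;
}

/* Address formed by MEM[symbol: p1, index: ivtmp]: symbol plus index. */
static uintptr_t
mem_ref_address (uintptr_t local_sym, uintptr_t iv)
{
  return local_sym + iv;
}
```

With a local symbol at 0x1000 and a destination at 0x8000, an IV value of iv_candidate_base (0x8000, 0x1000) + 3 gives mem_ref_address (0x1000, ...) equal to 0x8003, which is inside the destination and not inside the local array.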
SLP vectorizer on non-loop?
Hello,
I have one example with two very similar loops. The cunrolli pass unrolls
one loop completely but not the other, based on slightly different cost
estimations. The loop that is not unrolled gets SLP-vectorized and then
unrolled by the "cunroll" pass, whereas the other, unrolled loop cannot be
vectorized since it is not a loop any more. In the end, there is a big
difference in performance between the two loops.

My question is why SLP vectorization has to be performed on a loop (it is
a sub-pass under pass_tree_loop). Conceptually, can it not be done on any
basic block? Our port is still stuck at 4.5, but I checked 4.7 and it
seems to be the same. I also checked the functions in tree-vect-slp.c.
They use a lot of loop_vinfo structures, but in some places the code
checks whether loop_vinfo exists and uses it or some alternative. I tried
to add an extra SLP pass after pass_tree_loop, but it didn't work. I
wonder how easy it would be to make SLP work for non-loop code.

Thanks,
Bingfeng Mei

Broadcom UK

void foo (int *__restrict__ temp_hist_buffer,
          int * __restrict__ p_hist_buff,
          int *__restrict__ p_input)
{
  int i;
  for(i=0;i<4;i++)
    temp_hist_buffer[i]=p_hist_buff[i];

  for(i=0;i<4;i++)
    temp_hist_buffer[i+4]=p_input[i];

}
Re: SLP vectorizer on non-loop?
gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:

> Hello,
> I have one example with two very similar loops. cunrolli pass
> unrolls one loop completely
> but not the other based on slightly different cost estimations. The
> not-unrolled loop
> get SLP-vectorized, then unrolled by "cunroll" pass, whereas the
> other unrolled loop cannot
> be vectorized since it is not a loop any more. In the end, there is
> big difference of
> performance between two loops.
>

Here is what I see with the current trunk on x86_64 with -O3 (with the two
loops split into different functions):

The first loop, the one that doesn't get unrolled by cunrolli, gets loop
vectorized with -fno-vect-cost-model. With the cost model the vectorization
fails because the number of iterations is not sufficient (the vectorizer
tries to apply loop peeling in order to align the accesses), the loop gets
later unrolled by cunroll and the basic block gets vectorized by SLP.

The second loop, unrolled by cunrolli, also gets vectorized by SLP.

The *.optimized dumps look similar:

:
  vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
  MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
  return;

:
  vect_var_.7_57 = MEM[(int *)p_input_10(D)];
  MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
  return;

> My question is why SLP vectorization has to be performed on loop (it
> is a sub-pass under
> pass_tree_loop). Conceptually, cannot it be done on any basic block?
> Our port are still
> stuck at 4.5. But I checked 4.7, it seems still the same. I also
> checked functions in
> tree-vect-slp.c. They use a lot of loop_vinfo structures. But in
> some places it checks
> whether loop_vinfo exists to use it or other alternative. I tried to
> add an extra SLP
> pass after pass_tree_loop, but it didn't work. I wonder how easy to
> make SLP works for
> non-loop.

SLP vectorization works both on loops (in vectorize pass) and on basic
blocks (in slp-vectorize pass).

Ira

>
> Thanks,
> Bingfeng Mei
>
> Broadcom UK
>
> void foo (int *__restrict__ temp_hist_buffer,
>           int * __restrict__ p_hist_buff,
>           int *__restrict__ p_input)
> {
>   int i;
>   for(i=0;i<4;i++)
>     temp_hist_buffer[i]=p_hist_buff[i];
>
>   for(i=0;i<4;i++)
>     temp_hist_buffer[i+4]=p_input[i];
>
> }
>
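To make the distinction in this thread concrete, the sketch below shows one of the two loops in its rolled form and in the straight-line form it has after complete unrolling. The function names are hypothetical. The second form is a single basic block of four adjacent, independent copies, which is the shape basic-block SLP looks for; whether a given GCC version and target actually vectorize it is not something this sketch can show.

```c
/* Rolled form: a candidate for the loop vectorizer (or loop-aware SLP). */
static void
copy_hist_loop (int *__restrict__ temp_hist_buffer,
                int *__restrict__ p_hist_buff)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i] = p_hist_buff[i];
}

/* The same work after complete unrolling: no loop is left, only four
   isomorphic statements on consecutive addresses in one basic block. */
static void
copy_hist_unrolled (int *__restrict__ temp_hist_buffer,
                    int *__restrict__ p_hist_buff)
{
  temp_hist_buffer[0] = p_hist_buff[0];
  temp_hist_buffer[1] = p_hist_buff[1];
  temp_hist_buffer[2] = p_hist_buff[2];
  temp_hist_buffer[3] = p_hist_buff[3];
}
```

Both functions compute the same result; comparing their vectorizer dumps on a given compiler is one way to see which of the two passes picked them up.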
RE: SLP vectorizer on non-loop?
Ira,
Thank you very much for the quick answer. I will check 4.7 on x86-64
to see the difference from our port. Is there a significant change
between 4.5 and 4.7 regarding SLP?

Cheers,
Bingfeng

> -----Original Message-----
> From: Ira Rosen [mailto:i...@il.ibm.com]
> Sent: 01 November 2011 11:13
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: SLP vectorizer on non-loop?

[snip]
RE: SLP vectorizer on non-loop?
"Bingfeng Mei" wrote on 01/11/2011 01:25:14 PM:

> Ira,
> Thank you very much for quick answer. I will check 4.7 x86-64
> to see difference from our port. Is there significant change
> between 4.5 & 4.7 regarding SLP?

Yes, I think so. 4.5 can't SLP the data accesses with unknown alignment
that you have here.

Ira

[snip]
Re: approaches to carry-flag modelling in RTL
On 01/11/11 02:43, Hans-Peter Nilsson wrote:
> Not obvious or maybe I was unclear as to what I alluded?
> In the below insn-bodies, "sub" is the insn that sets cc0 as a
> side-effect.
>
> Supposed canonical form:
>
> (parallel
>  [(set cc_reg (compare ...))
>   (set destreg (sub ...))])
>
> and:
>
> (parallel
>  [(set destreg (sub ...))
>   (clobber cc_reg)])
>
> But IMHO it'd be easier (for most values of "easier") to
> combine both patterns with that non-existing mechanism (and no,
> I don't count match_parallel) if we instead canonicalized on the
> CC_REG set being the same as the clobber position:
>
> (parallel
>  [(set destreg (sub ...))
>   (set cc_reg (compare ...))])
>
> with:
>
> (parallel
>  [(set destreg (sub ...))
>   (clobber cc_reg)])
>
> brgds, H-P

That is very strange, because if you look into RX or MN10300, they all
have the set of REG_CC as the last element in the parallel. I wonder if it
has anything to do with the fact that in these backends the set of the
REG_CC only shows up after reload.

--
PMatos
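For readers less familiar with RTL, the two elements of the parallel describe two results produced by one instruction from the same operands. The C sketch below is only an illustration of that idea, with hypothetical type and field names: the difference corresponds to the set of destreg, and the flag fields stand in for what the set of cc_reg records. In the clobber variant the flag results are simply discarded.

```c
#include <stdint.h>

/* Hypothetical model of a flag-setting subtract. */
struct sub_result
{
  uint32_t dest;   /* (set destreg (sub ...)) */
  int borrow;      /* flag bits derived from the same operands, */
  int zero;        /* i.e. what (set cc_reg (compare ...)) captures */
};

static struct sub_result
sub_setting_flags (uint32_t a, uint32_t b)
{
  struct sub_result r;
  r.dest = a - b;      /* wraps modulo 2^32 */
  r.borrow = a < b;    /* unsigned borrow out of the subtraction */
  r.zero = a == b;     /* result is zero exactly when operands match */
  return r;
}
```

The order of the two sets inside the parallel does not change what is computed; the discussion above is about which order is treated as canonical so that patterns can be matched and combined consistently.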
Re: Potentially merging the transactional-memory branch into mainline.
On Tue, 2011-11-01 at 10:49 +0100, Richard Guenther wrote: > On Mon, Oct 31, 2011 at 11:33 PM, Aldy Hernandez wrote: > > This is somewhat of a me-too message for the transactional-memory work. We > > would also like it to be considered for merging with mainline before the end > > of stage1. [snip] > > The final word > > == > > Seeing that a global maintainer has been lead on this for a while, I suspect > > there isn't much to review formally. I believe the only bits that Richard > > isn't directly responsible for are the C++ front-end changes. > > > > So what is the opinion/consensus on merging the branch? It would be nice to > > get this infrastructure in place as well so we can get people to start using > > it, and then we can work out any issues that arise. > > > > I have no idea how this happened, but apparently I'm on the hook for merging > > both the cxx-mem-model and this branch (if/when one/both get approved). If > > this gets approved, I'd prefer to get the cxx-mem-model branch merged first, > > and the transactional-memory branch later during the week. I will be > > partially available during the weekend, and definitely during next week. > > Given that you only recently merged with trunk again are you really > sure this is a great > idea at this point in time? Yes, for the reasons outlined below. > Does the GCC 4.7 user community benefit from this > in any way (or rather how much percentage of it)? Yes, we think so. Transactional Memory (TM) is a very easy-to-use synchronization mechanism, which does not burden the programmer with having to consider issues such as deadlocks or having to rely on conventions regarding which locks cover which data. This complements the recent efforts for low-level synchronization in GCC (ie, C++11 atomics) and other threading-related features in C++11. It is a new feature that isn't yet available in other mainstream compiler products, but there is wide industry interest in TM. 
The TM language specification for C++ that we have implemented in the branch is the output of a cross-industry working group consisting of people from HP, IBM, Intel, Oracle, and Red Hat. This group, including C++11 and synchronization experts such as Hans Boehm, has been working since at least 2009 on this specification, and we are pretty confident that we have a good understanding of the matter, and which programming abstractions we can and should offer. We have presented it to and discussed it with several other affected parties (e.g., Boost folks, academia, ...), and we hope to present it to the C++ standard community in February.

On the hardware side, there clearly is interest too. For example, IBM BlueGene/Q chips have hardware support for TM, and AMD released a proposal for such support for x86 (AMD's Advanced Synchronization Facility).

Thus, we are not investing in some wild and crazy idea here. On the contrary, because other mainstream compilers do not have this feature yet (although some have it in preview versions, e.g., in an ICC what-if prototype), it is actually an opportunity for GCC to offer improvements to its users before other compilers do. Parallelization and synchronization are of interest to many GCC users, I would argue, so giving them another reason to use GCC is definitely good. Also, this is an area that will become even more important in the future.

> Thus, please consider merging early during GCC 4.8 stage1 instead.

I do think that merging now is definitely better than waiting another cycle:

- It does improve GCC for programmers who have to build concurrent code. We know that the percentage of these programmers will increase.
- TM does not negatively affect any other features of GCC from the perspective of users, because TM as we have implemented it smoothly embeds into the C++11 memory model (but without actually being dependent on its implementation or presence) and does not create other dependencies. Do you see any examples of negative effects on other features?
- The TM code in GCC is also pretty isolated from anything else; while there is front-end support, most of the implementation and logic is in an isolated runtime library (libitm).
- We (here meaning my colleagues and myself) definitely have the expertise to maintain this, and we are willing to invest time in this in the future.

Overall, this looks to me like much benefit for very little cost. The sooner we make this available in mainline, the earlier users can benefit from it.

Torvald
Re: Potentially merging the transactional-memory branch into mainline.
On 11/01/2011 01:52 PM, Torvald Riegel wrote: > Yes, we think so. Transactional Memory (TM) is a very easy-to-use > synchronization mechanism, which does not burden the programmer with > having to consider issues such as deadlocks or having to rely on > conventions regarding which locks cover which data. This complements the > recent efforts for low-level synchronization in GCC (ie, C++11 atomics) > and other threading-related features in C++11. Speaking as someone not involved in the project, I have to agree with this. TM is something that's been kicking around in academe for a while now, and exposure in gcc is potentially a significant benefit for both people who want to experiment with TM and the gcc community. The promise of TM for scalability is so great that we'd be fools not to include it. Andrew.
Re: Potentially merging the transactional-memory branch into mainline.
On 11/01/11 03:49, Richard Guenther wrote: > > Given that you only recently merged with trunk again are you > really sure this is a great idea at this point in time? Does the > GCC 4.7 user community benefit from this in any way (or rather how > much percentage of it)? > > Thus, please consider merging early during GCC 4.8 stage1 instead. This stuff is fairly isolated in terms of what it touches and I'm sure if anything goes wrong, Aldy, Richard & Torvald will be available to fix it. The request to merge came in before the end of stage1, I don't see a reason to delay things another 6-9 months. This isn't like asking to pull in a whole new register allocator at the end of stage1 :-) Additionally, I believe we have a small window where we can position GCC to be the compiler of choice for those working with TM; waiting 6-9 months for GCC 4.8 will miss that window. Jeff
Re: implementation of std::thread::hardware_concurrency()
> Er, the macro _GLIBCXX_NPROCS already handles
> the case sysconf(_SC_NPROCESSORS_ONLN).
> It looks like you actually want to remove the macro
> _GLIBCXX_NPROCS completely.

Fixed.

diff --git a/libstdc++-v3/src/thread.cc b/libstdc++-v3/src/thread.cc
index 09e7fc5..6feda4d 100644
--- a/libstdc++-v3/src/thread.cc
+++ b/libstdc++-v3/src/thread.cc
@@ -112,10 +112,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 unsigned int
 thread::hardware_concurrency() noexcept
 {
-int __n = _GLIBCXX_NPROCS;
-if (__n < 0)
- __n = 0;
-return __n;
+int count = 0;
+#if defined(PTW32_VERSION) || \
+ (defined(__MINGW64_VERSION_MAJOR) && defined(_POSIX_THREADS)) || \
+ defined(__hpux)
+count = pthread_num_processors_np();
+#elif defined(__APPLE__) || defined(__FreeBSD__)
+size_t size = sizeof(count);
+sysctlbyname("hw.ncpu", &count, &size, NULL, 0);
+#elif defined(_GLIBCXX_USE_GET_NPROCS) || \
+ defined(_GLIBCXX_USE_SC_NPROCESSORS_ONLN)
+count = _GLIBCXX_NPROCS;
+#endif
+return (count > 0) ? count : 0;
 }
 _GLIBCXX_END_NAMESPACE_VERSION

> Do you have already a Copyright assignment in place?

No. For public domain.
Re: implementation of std::thread::hardware_concurrency()
I've put gcc-patches@ back in the CC list and removed gcc@

On 1 November 2011 15:35, niXman wrote:
>> Er, the macro _GLIBCXX_NPROCS already handles
>> the case sysconf(_SC_NPROCESSORS_ONLN).
>> It looks like you actually want to remove the macro
>> _GLIBCXX_NPROCS completely.
>
> Fixed.

No, this still isn't acceptable. I do not want to see preprocessor tests like

+#elif defined(__APPLE__) || defined(__FreeBSD__)

in the body of thread::hardware_concurrency(). The configure script should determine what is available on the platform and set an appropriate macro. Look at the definition of _GLIBCXX_NPROCS and adjust that to do

#define _GLIBCXX_NPROCS pthread_num_processors_np()

for the relevant platforms. For the platforms using sysctlbyname there could be an inline function that calls it, and _GLIBCXX_NPROCS could be defined to call that, so that thread::hardware_concurrency() can still be defined as it is today.

Please read the code you're changing and understand how it works today before making changes.
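The structure suggested here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the libstdc++ code: the DEMO_* macros stand in for whatever the configure script would define, and demo_sysctl_ncpu and demo_hardware_concurrency are hypothetical names. The platform tests live where the macro is defined, and the function body stays as simple as the existing one.

```cpp
#include <cstddef>
#include <unistd.h>

// Hypothetical: pretend configure found sysconf(_SC_NPROCESSORS_ONLN).
#define DEMO_USE_SC_NPROCESSORS_ONLN 1

#if defined(DEMO_USE_SYSCTL_HW_NCPU)
// Hypothetical configure result for platforms that provide sysctlbyname.
# include <sys/types.h>
# include <sys/sysctl.h>
static inline int demo_sysctl_ncpu()
{
    int count = 0;
    size_t size = sizeof(count);
    // Report 0 ("unknown") if the query fails.
    if (sysctlbyname("hw.ncpu", &count, &size, NULL, 0) != 0)
        return 0;
    return count;
}
# define DEMO_NPROCS demo_sysctl_ncpu()
#elif defined(DEMO_USE_SC_NPROCESSORS_ONLN)
# define DEMO_NPROCS sysconf(_SC_NPROCESSORS_ONLN)
#else
# define DEMO_NPROCS 0
#endif

// The function body contains no platform tests; it only clamps the result.
unsigned int demo_hardware_concurrency() noexcept
{
    int n = DEMO_NPROCS;
    if (n < 0)
        n = 0;
    return n;
}
```

Adding another platform then only touches the macro selection, which matches the point that the configure machinery, not the function body, should know about platform differences.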
Re: Potentially merging the transactional-memory branch into mainline.
On 11-11-01 11:23, Jeff Law wrote:
> This stuff is fairly isolated in terms of what it touches and I'm sure if anything goes wrong, Aldy, Richard & Torvald will be available to fix it. The request to merge came in before the end of stage1, I don't see a reason to delay things another 6-9 months. This isn't like asking to pull in a whole new register allocator at the end of stage1 :-)
>
> Additionally, I believe we have a small window where we can position GCC to be the compiler of choice for those working with TM; waiting 6-9 months for GCC 4.8 will miss that window.

I agree as well. I have not been following the TM work very closely, but I've been interested in TM for a Long Time. Having it available in 4.7 would be a huge benefit to GCC. I don't think it really is a risk from a release standpoint. TM is an optional component and should not affect standard codegen paths. Additionally, given that Aldy and Richard H. are involved in it, I'm sure they will be very quick in addressing any problems that crop up.

Aldy, Richard, is there a patchset or master patch I could read?

Diego.
Re: Potentially merging the transactional-memory branch into mainline.
On Tue, Nov 1, 2011 at 5:49 AM, Richard Guenther wrote: > Given that you only recently merged with trunk again are you really > sure this is a great > idea at this point in time? Does the GCC 4.7 user community benefit from this > in any way (or rather how much percentage of it)? GCC has a history of merging and exposing technology previews. Why should the bar be placed higher for this feature? The feature is isolated and does not appear that it will interfere with other parts of GCC. Aldy, RTH, Torvald and Red Hat appear ready to address any problems promptly. - David
Re: Potentially merging the transactional-memory branch into mainline.
On 11/1/2011 12:59 PM, David Edelsohn wrote: On Tue, Nov 1, 2011 at 5:49 AM, Richard Guenther wrote: Given that you only recently merged with trunk again are you really sure this is a great idea at this point in time? Does the GCC 4.7 user community benefit from this in any way (or rather how much percentage of it)? GCC has a history of merging and exposing technology previews. Why should the bar be placed higher for this feature? The feature is isolated and does not appear that it will interfere with other parts of GCC. Aldy, RTH, Torvald and Red Hat appear ready to address any problems promptly. I think this is an important feature, and support Richard's viewpoint on this.
Re: Potentially merging the transactional-memory branch into mainline.
On Tue, Nov 1, 2011 at 17:19, Robert Dewar wrote: > On 11/1/2011 12:59 PM, David Edelsohn wrote: >> >> On Tue, Nov 1, 2011 at 5:49 AM, Richard Guenther >> wrote: >> >>> Given that you only recently merged with trunk again are you really >>> sure this is a great >>> idea at this point in time? Does the GCC 4.7 user community benefit from >>> this >>> in any way (or rather how much percentage of it)? >> >> GCC has a history of merging and exposing technology previews. Why >> should the bar be placed higher for this feature? The feature is >> isolated and does not appear that it will interfere with other parts >> of GCC. >> >> Aldy, RTH, Torvald and Red Hat appear ready to address any problems >> promptly. > > I think this is an important feature, and support Richard's viewpoint on > this. Richard who? There are two Richards in this thread, and they seem to have opposing views. Diego.
Re: Potentially merging the transactional-memory branch into mainline.
> Richard who? There are two Richards in this thread, and they seem to have opposing views.
>
> Diego.

I am confused by the multiple levels of quotes, I think (the feature in mailers of easily allowing you to include an entire earlier thread is evil! :-) Anyway, I support merging this in ...
Re: Potentially merging the transactional-memory branch into mainline.
Aldy, Richard, is there a patchset or master patch I could read? I have made current diff as of today: http://quesejoda.com/tm-branch-latest.bz2
Re: Potentially merging the transactional-memory branch into mainline.
On 11-11-01 14:44 , Aldy Hernandez wrote: Aldy, Richard, is there a patchset or master patch I could read? I have made current diff as of today: http://quesejoda.com/tm-branch-latest.bz2 Thanks. Diego.
Re: Potentially merging the transactional-memory branch into mainline.
On 11/01/11 12:44, Aldy Hernandez wrote: > >> Aldy, Richard, is there a patchset or master patch I could read? > > I have made current diff as of today: > > http://quesejoda.com/tm-branch-latest.bz2 Umm, have you looked at those diffs? There's a fair amount of unrelated crud in there... It might help to break the blob into more easily understood hunks for actual submissions, i.e., runtime bits (libitm), changes to existing runtime stuff, compiler proper, testsuite bits, etc. Obviously folks will want to look at changes to existing runtime and the compiler proper bits the closest. Jeff
Re: Potentially merging the transactional-memory branch into mainline.
> Have you looked at those diffs, there's a fair amount of unrelated
> crud in there...

Will clean up.

> It might help to break the blob into more easily
> understood hunks for actual submissions. ie, runtime bits (libitm),
> changes to existing runtime stuff, compiler proper, testsuite bits, etc.

Will do.
_mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics
Hi!

As the vgather* insns are designed to support both unconditional and conditional gather loads, the current patterns consume the previous content of the destination register, so we end up with code like:

vmovaps .LC0(%rip), %ymm0
vmovdqa .LC1(%rip), %ymm5
vmovdqa .LC2(%rip), %ymm4
.p2align 4,,10
.p2align 3
.L6:
vmovdqa k(%rax,%rax), %ymm1
vmovaps %ymm0, %ymm6
vmovaps %ymm0, %ymm2
vmovdqa k+32(%rax,%rax), %ymm3
vgatherdps %ymm6, vf1(,%ymm1,4), %ymm2
vmovaps %ymm0, %ymm1
vmovaps %ymm0, %ymm6
vcvttps2dq %ymm2, %ymm2
vpshufb %ymm5, %ymm2, %ymm2
vgatherdps %ymm6, vf1(,%ymm3,4), %ymm1
...

Note: each vgather* is usually preceded by two vmovaps, one copying the all-ones mask (usually computed or loaded before the loop) and the other initializing the destination register. But with a mask of all ones the whole destination register is overwritten unless there is a segfault, so IMNSHO at least for autovectorization it would be nice to just leave the content of the destination register undefined in case of a segfault. The only way users can see a difference is if a segfault happens and in a segfault handler they inspect the destination register or transfer control to the next insn from the segfault handler.

My question is about the avx2intrin.h intrinsics: in the AVX2 manual the insns are well documented, but there are no details about the intrinsics. There are two kinds of intrinsics for gather, one without mask/src operands, one with them. So, my question is, for the intrinsics without mask/src operands, is it supposed to be well defined what the dest register will contain after a segfault? Currently we load zeros into src, but would it be a valid optimization to just leave that register undefined in case of segfault? And what about the other intrinsics if the mask is known to be all ones? Can the compiler optimize this and assume the destination is just overwritten rather than being an in/out operand?

What could be done is, during expansion, to check whether the mask has all high bits set and if so, just use different insn patterns that wouldn't consume the register with the "0" constraint. Or have a second set of compiler builtins that wouldn't have src/mask arguments. On large testcases (like Toon's weather forecast routine, which has over 260 vgather* insns) this would allow us to get rid of one extra insn per vgather* insn.

Jakub
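The observation behind the proposed optimization can be illustrated with a small scalar model. This is a deliberately simplified sketch, not the intrinsic or the instruction itself: model_mask_gather_ps is a hypothetical name, the scale factor is folded into plain array indexing, and faulting behaviour and the clearing of the mask register are not modelled. It only shows that a lane takes the loaded value when the high bit of its mask element is set and keeps the src value otherwise, so an all-ones mask makes the result independent of src.

```cpp
#include <cstdint>

// Simplified scalar model of an 8-lane masked gather of floats
// (an illustration only; hardware details are intentionally left out).
void model_mask_gather_ps(float dst[8],
                          const float src[8],
                          const float *base,
                          const std::int32_t index[8],
                          const std::uint32_t mask[8])
{
    for (int i = 0; i < 8; ++i) {
        // Only the most significant bit of each mask element selects the load.
        const bool load_lane = (mask[i] & 0x80000000u) != 0;
        dst[i] = load_lane ? base[index[i]] : src[i];
    }
}
```

With every mask element having its high bit set, calling the model with two different src arrays yields identical results, which is why the extra instruction initializing the destination is only observable in the faulting case described above.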
Re: Potentially merging the transactional-memory branch into mainline.
On Tue, Nov 1, 2011 at 5:59 PM, David Edelsohn wrote: > On Tue, Nov 1, 2011 at 5:49 AM, Richard Guenther > wrote: > >> Given that you only recently merged with trunk again are you really >> sure this is a great >> idea at this point in time? Does the GCC 4.7 user community benefit from >> this >> in any way (or rather how much percentage of it)? > > GCC has a history of merging and exposing technology previews. Why > should the bar be placed higher for this feature? The feature is > isolated and does not appear that it will interfere with other parts > of GCC. I remember at least seeing middle-end pieces in alias analysis. > Aldy, RTH, Torvald and Red Hat appear ready to address any problems promptly. Sure, I was just asking for a good reason to merge it now, given that I had the impression the desire to merge for 4.7 is a bit rushed (given that the branch wasn't kept up-to-date with trunk until very recently and trunk regressions were still being fixed). I'd like to see some breakdown into subsystem patches. Can someone provide those together with changelog entries? Thanks, Richard. > - David >
Re: approaches to carry-flag modelling in RTL
Please, when replying, also send to me, not just the list. On Tue, 1 Nov 2011, Paulo J. Matos wrote: > On 01/11/11 02:43, Hans-Peter Nilsson wrote: > > > > Not obvious or maybe I was unclear as to what I alluded? > > In the below insn-bodies, "sub" is the insn that sets cc0 as a > > side-effect. > > > > Supposed canonical form : > > > > (parallel > > [(set cc_reg) (compare ...)) > >(set destreg) (sub ...))]) > > and: > > (parallel > > [(set destreg) (sub ...)) > >(clobber cc_reg)]) > That is very strange because if you look into RX or MN10300, they all have the > set REG_CC as the last in the parallel. That'd be a good reason to flip the default...except that the i386 has it the other way round i.e. as shown above. I think the main reason is that it just seemed right to those port authors. > I wonder if it has anything to do with > the fact that in these backends the set of the REG_CC only shows up after > reload. Right, it'd only matter where (also) GCC cooks up combinations (which IIRC it doesn't if the register is only exposed post-reload), not where only the port emits them. N.B., it *could* very well be that I misremember about the canonical form, but it seems neither of us bother to search the archives, so never mind. ;) ...oh wait, see the comments at combine.c:2824 and 3030 r180744. I can't find anything in the docs, but that might just be my grep-fu failing. I'm still thinking of a generic md iterator mechanism (one that doesn't restrict the form of the expansion in ways getting in the way with expanding to both a clobber and a set, and in swapped locations as above), to make the troubles go away... But maybe expanding them by a pass through e.g. m4 would be better than cooking up something new there. brgds, H-P
Re: Potentially merging the transactional-memory branch into mainline.
I'd like to see some breakdown into subsystem patches. Can someone provide those together with changelog entries? I am doing another merge from trunk->branch, and will post a series of patches by subsystem. I will do so after the merge is complete and tested.
gcc-4.4-20111101 is now available
Snapshot gcc-4.4-20111101 is now available on
ftp://gcc.gnu.org/pub/gcc/snapshots/4.4-20111101/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.4 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_4-branch revision 180747

You'll find:

gcc-4.4-20111101.tar.bz2 Complete GCC
MD5=ada84cede36790f97da8a772e17dd211
SHA1=962a08327a57dec5e3205821de336bd320707b4d

Diffs from 4.4-20111025 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.4 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: Need help resolving PR target/50906
On Mon, Oct 31, 2011 at 10:58:03AM -0500, Moffett, Kyle D wrote: > I have not yet been able to figure out if it's a libgcc issue or an > actual compiler issue. It is a gcc bug. I've added a comment to the PR. -- Alan Modra Australia Development Lab, IBM
printed versions of GCC Internals book?
While I really like machine-readable (and searchable) text online for the GCC internals, there's still an atavistic streak in me that wants hard copy that I can put post-it notes on, run a highlighter over relevant passages or read when I'm not near a computer screen. I have two bound hard-copies (but the newer one is GCC 2.95) and laser-printed newer editions, but I've decided I really miss the bound-book format.

Anybody have any experience with using one of the print-on-demand services to produce a recent version of the gccint manual? I was actually kind of surprised that the FSF hasn't taken advantage of this as a fund-raising opportunity. After the initial setup costs, it looks like the per-book price for the 700pg gccint would be about $20, but the setup fees (at least here http://www.harvard.com/on_our_shelves/in_store_book_printing/books_on_demand/ ) would be ~$100.

So, unless someone has already done this, is there anyone else who'd want to buy a printed copy at a price that would recover my investment in the setup costs and postage? I'd be happy to turn over the whole project to the FSF so they could end up with an ongoing revenue stream once I break even on the deal. I'd guess that with 10 copies, we'd be looking at ~$35/copy, which is about as high a price as I'd be willing to pay if I was reading this email instead of writing it.

So, are there 10 people out there who'd like a reasonably current version of the Internals book, or is there someone else who'd like to drive?

-- Al Lehotsky
Re: IVopts bug?
2011/11/1 Richard Guenther : > 2011/11/1 杜越海 : >> Hi all >> >> I found IVopts rewrite a memory access with a weird iv candidate, >> which make it lost its original memory attribute. >> a non-local memory access' base pointer was rewrite into a local one, >> and it was deleted in pass_cd_dce since >> it was recognized as a local memory access. >> >> here is the case i simplified from a decoder source >> >> foo1(unsigned char* pSrcLeft, >> unsigned char* pSrcAbove, >> unsigned char* pSrcAboveLeft, >> unsigned char* pDst, >> int dstStep, >> int leftStep) >> { >> signed int x, y, s; >> unsigned char p1[5], p2[5], p3; >> >> p1[0] = *pSrcAboveLeft; >> p2[0] = p1[0]; >> p2[1] = pSrcLeft[0]; >> pSrcLeft += leftStep; >> p2[2] = pSrcLeft[0]; >> pSrcLeft += leftStep; >> p2[3] = pSrcLeft[0]; >> pSrcLeft += leftStep; >> p2[4] = pSrcLeft[0]; >> >> p1[1] = pSrcAbove[0]; >> p1[2] = pSrcAbove[1]; >> p1[3] = pSrcAbove[2]; >> p1[4] = pSrcAbove[3]; >> >> p3 = (unsigned char)(((signed int)p1[1] + (signed int)p2[1] + >> (signed int)p1[0] >>+(signed int)p1[0] + 2 ) >> 2 ); >> >> for( y=0; y<4; y++, pDst += dstStep ) { >>for( x=y+1; x<4; x++ ) { >>s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> >> 2; >>pDst[x] = (unsigned char)s; >>} >> >>pDst[y] = p3; -This memory access >> } >> } >> >> before IVopts >> >> D.6508_65 = pDst_88 + y.6_64; >> *D.6508_65 = p3_37; >> >> after IVopts >> it was rewrite to >> MEM[symbol: p1, index: ivtmp.161_200, offset: 0B] = p3_37 , >> >> by >> candidate 15 >> depends on 3 >> var_before ivtmp.161 >> var_after ivtmp.161 >> incremented before exit test >> type unsigned int >> base (unsigned int) pDst_39(D) - (unsigned int) &p1 >> step (unsigned int) (pretmp.28_118 + 1) >> >> so it still is &p1+ pDst - &p1 + step = pDst + step, >> and in pass_cd_dce, is_hidden_global_store () return false for this memory >> since it think this stmt only access local array p1. 
>>
>> gcc version r180694
>>
>> Configured with: /home/croseadu/android/_src/src/gcc-src/configure
>> --host=i486-linux-gnu --build=i486-linux-gnu
>> --target=arm-none-linux-gnueabi
>> --prefix=/home/croseadu/android/_src/install/arm-none-linux-gnueabi
>> --enable-threads --disable-libmudflap --disable-libssp
>> --disable-libstdcxx-pch --with-gnu-as --with-gnu-ld
>> --enable-languages=c,c++ --enable-shared --enable-symvers=gnu
>> --enable-__cxa_atexit
>> --with-specs='%{funwind-tables|fno-unwind-tables|mabi=*|ffreestanding|nostdlib:;:-funwind-tables}'
>> --disable-nls --enable-lto
>> --with-sysroot=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/libc
>> --with-build-sysroot=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/libc
>> --with-gmp=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
>> --with-mpfr=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
>> --with-ppl=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
>> --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic
>> -lm'
>> --with-cloog=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
>> --enable-cloog-backend=isl
>> --with-mpc=/home/croseadu/android/_src/objs/arm-none-linux-gnueabi/obj/host-libs-/usr
>> --enable-poison-system-directories --disable-libquadmath --enable-lto
>> --enable-libgomp
>> --with-build-time-tools=/home/croseadu/android/_src/install/arm-none-linux-gnueabi/arm-none-linux-gnueabi/bin
>> --with-cpu=cortex-a8 --with-float=soft
>>
>> compile flags:
>> -O3 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-double
>>
>> need file a bug?
>
> Yes, it definitely should not do this kind of stupid (and invalid) thing.
>
> Richard.
>
>>
>> Yuehai Du
>>

I filed a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50955. Could somebody help me fix this PR? Thank you very much.

Yuehai Du
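The address arithmetic in the quoted report can be checked with a few lines. This is only a sketch of the reasoning, not of IVopts itself: rewritten_address is a hypothetical helper, and std::uintptr_t is used in place of the 32-bit unsigned int of the ARM target. It computes the address in the same form as the candidate, with the local array as the nominal base and the difference between the two pointers folded into the index.

```cpp
#include <cstdint>

// Address as the rewritten access would form it:
//   &p1 + ((unsigned) pDst - (unsigned) &p1) + index
// The subtraction may wrap around; unsigned arithmetic makes that harmless.
std::uintptr_t rewritten_address(const unsigned char *pDst,
                                 const unsigned char *local_base,
                                 std::uintptr_t index)
{
    const std::uintptr_t symbol = reinterpret_cast<std::uintptr_t>(local_base);
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(pDst) - symbol;
    return symbol + base + index;
}
```

The result always equals the address of pDst[index], so the store still lands in the caller's buffer even though the memory reference names the local array as its symbol, which is why treating it as a purely local store is wrong.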
Re: scalar vector shift expansion problem on 64-bit
From: David Miller
Date: Fri, 28 Oct 2011 01:05:54 -0400 (EDT)

> So should expand_vector_broadcast() really provide this invariant to
> the vec_init expander, or does the vec_init expander need to tidy
> things up with gen_lowpart() etc. calls?

Richard, I don't know if you had a chance to look into this at all yet, but I wanted to make a comment about vec_init in general.

I've come to find that I want the compiler to do as little as possible with the expressions that get put into vector initializers. I don't want it to modify the mode of the individual inner elements in the assignment. I also don't want it to force mems into registers. In fact, the less it does the better.

I want to make use of the special VIS load instructions that can take a QImode or HImode value in memory and load it zero extended into a 64-bit float register. For example:

int x;
__v8qi test_v8qi(void)
{
  __v8qi ret = { x, x, x, x, x, x, x, x };
  return ret;
}

I want to be able to generate:

test_v8qi:
  sethi %hi(x + 3), %g1
  or %g1, %lo(x + 3), %g1
  ldda [%g1] ASI_FL8_P, %f2
  sethi %hi(0x), %g2
  or %g2, %lo(0x), %g2
  bmask %g2, %g0, %g0
  retl
  bshuffle %f2, %f2, %f0

but I can't because the vec_init expander sees:

(parallel:V8QI [
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
  (reg:QI 110 [ D.2249 ])
])

in operands[1].
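For readers who want to try the broadcast on another target, here is a self-contained variant of the test case using the GCC vector extension directly. It is a sketch under assumptions: demo_v8qi is a hypothetical typedef standing in for __v8qi, and the explicit conversion to signed char is added so that the initializer also compiles as C++. Nothing is claimed about the code a given port generates for it.

```cpp
// Hypothetical stand-in for __v8qi: eight 8-bit lanes in a 64-bit vector.
typedef signed char demo_v8qi __attribute__((vector_size(8)));

int x;

demo_v8qi test_v8qi()
{
    // Only the low byte of x is used, as with the QImode elements above.
    const signed char c = static_cast<signed char>(x);
    demo_v8qi ret = { c, c, c, c, c, c, c, c };
    return ret;
}
```

Inspecting the RTL expand dump for this function shows what the vec_init expander receives on the port being examined.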
Re: # of unexpected failures 768 ?
On 10/31/11 19:20, Jonathan Wakely wrote: > On 31 October 2011 17:38, Rainer Orth wrote: >> Dennis Clarke writes: >> > I'm uncertain if Solaris 8/x86 still supports bare i386 machines, so it > might be better to keep the default of pentiumpro instead. Solaris 8 won't run on anything less than pentium, I recently convinced someone else to stop building GCC for i386 on Solaris: http://gcc.gnu.org/ml/gcc-help/2011-10/msg5.html > > Quite. In fact there are *very* good reasons not to configure for > 80386: libstdc++'s configure uses the default arch being configured > for, and disables a number of features on i386 because it doesn't > support the required atomic ops. > > So by configuring for i386 you will distribute a GCC package that is > missing useful features, but supports an ancient architecture that > Solaris doesn't even run on. > > You should configure for pentium-pc-solaris2.8 or use --with-arch-32=pentium When not configuring with '--host=i386-pc-solaris2.8', it is config.guess that detects 'i386-pc-solaris2.8', just tried here with most recent config.guess on i86pc Solaris2.10, result is 'i386-pc-solaris2.10'. Actually, it is uname showing the 'i386' on Solaris: $ uname -p # Prints the current host's ISA or processor type. i386 $ uname -i # Prints the name of the platform. i86pc So I'd wonder if '--host=i386-pc-solaris2.8' actually does make any difference here. Just my 2 cents. /haubi/