RE: option -mprfchw on 2 different Opteron cpus
Hi, > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > NightStrike > Sent: Monday, May 2, 2016 1:55 AM > To: gcc@gcc.gnu.org > Cc: Jan Hubicka ; Jakub Jelinek > Subject: option -mprfchw on 2 different Opteron cpus > > Reposting from here: > https://gcc.gnu.org/ml/gcc-help/2016-05/msg3.html > > Not sure if this applies: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54210 > > If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw > listed in the options in -fverbose-asm. In the assembly, I see this: > > prefetcht0 (%rax) # ivtmp.1160 > prefetcht0 304(%rcx) # > prefetcht0 (%rax) # ivtmp.1160 In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA support. (Snip) CPUID Fn8000_0001_ECX Feature Identifiers Bit 8 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and “PREFETCHW” in APM3 Ref: http://support.amd.com/TechDocs/25481.pdf (Snip) Can you please confirm what this CPUID flag returns on your k8 machine ?. I believe this ISA is not available on k8 machine so when -march=native is added you don’t see -mprfchw in verbose. > > If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to > target the older system), I do see it listed in the options in -fverbose-asm. > In > the assembly, I see this: K8 has 3dnow support and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch). https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html So when you add -march=k8 you see -mprfchw getting listed in verbose. > > prefetcht0 (%rax) # ivtmp.1160 > prefetcht0 304(%rcx) # > prefetchw (%rax) # ivtmp.1160 > > (The third line is the only difference) > This is my guess without seeing the test case, when write prefetching is requested "prefetchw" is generated. 3dnow (TARGET_3DNOW) ISA has support for it. 
(Snip) Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR Fn8000_0001_EDX[3DNow] = 1. (Snip) Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf > In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248? > > Also, FWIW: > > 1) The march=native version that uses prefetcht0 is very repeatably faster by > about 15% in the particular test case I'm looking at. > > 2) The compilers in both instances are not just the same version, they are the > same compiler binary installed on an NFS mount and shared to both > computers. As per GCC4.9.3 source. (Snip) (define_expand "prefetch" [(prefetch (match_operand 0 "address_operand") (match_operand:SI 1 "const_int_operand") (match_operand:SI 2 "const_int_operand"))] "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1" { bool write = INTVAL (operands[1]) != 0; int locality = INTVAL (operands[2]); gcc_assert (IN_RANGE (locality, 0, 3)); /* Use 3dNOW prefetch in case we are asking for write prefetch not supported by SSE counterpart or the SSE prefetch is not available (K6 machines). Otherwise use SSE prefetch as it allows specifying of locality. */ if (TARGET_PREFETCHWT1 && write && locality <= 2) operands[2] = const2_rtx; else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE)) operands[2] = GEN_INT (3); else operands[1] = const0_rtx; }) (Snip) Write prefetch may be requested (either by auto prefetcher or builtins) but on -march=native, the below check could have become false. else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE)) TARGET_PRFCHW is off on native. So there are two issues here. (1) ISA flags enabled with -march=k8 is different from -march=native on k8 machine. (2) Need to check why GCC middle end requested write prefetch for the test case with -march=k8 . Regards, Venkat.
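The selection logic in the quoted "prefetch" expander can be sketched as a plain C function, which makes the difference between the two builds easier to see. This is only a model under stated assumptions: `struct target_flags`, `enum prefetch_insn` and `select_prefetch` are hypothetical names, the two flag sets are read off this thread (TARGET_PRFCHW off with -march=native on the Opteron 248, on with -march=k8), and the mapping of locality 0..3 to prefetchnta/prefetcht2/prefetcht1/prefetcht0 is my understanding of the SSE insn pattern rather than something quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the target macros tested by the expander. */
struct target_flags {
    bool prefetch_sse;  /* TARGET_PREFETCH_SSE */
    bool prfchw;        /* TARGET_PRFCHW */
    bool prefetchwt1;   /* TARGET_PREFETCHWT1 */
};

enum prefetch_insn {
    PREFETCHNTA, PREFETCHT2, PREFETCHT1, PREFETCHT0, /* SSE, locality 0..3 */
    PREFETCH_3DNOW,                                  /* 3DNow! read prefetch */
    PREFETCHW,
    PREFETCHWT1
};

/* Flag sets as described in the thread (assumed, not measured here). */
static const struct target_flags native_k8 = { true, false, false };
static const struct target_flags march_k8  = { true, true,  false };

/* Mirrors the if/else chain of the expander: a write request only
   survives when a write-capable prefetch is enabled for the target. */
static enum prefetch_insn
select_prefetch(struct target_flags t, bool write, int locality)
{
    assert(locality >= 0 && locality <= 3);
    if (t.prefetchwt1 && write && locality <= 2)
        return PREFETCHWT1;
    if (t.prfchw && (write || !t.prefetch_sse))
        return write ? PREFETCHW : PREFETCH_3DNOW;
    /* SSE fallback: the write request is dropped, locality is kept. */
    return (enum prefetch_insn) locality;
}
```

With these assumed flags, `select_prefetch(native_k8, true, 3)` gives `PREFETCHT0` and `select_prefetch(march_k8, true, 3)` gives `PREFETCHW`, which matches the one-line difference reported in the assembly.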
GCC 6.1 Hard-coded C++ header paths and relocation problem on Windows
This is a cross-post from gcc-help, as there have been no replies there in the two days since I posted. I hope someone can help.
```
I have built GCC from gcc-6-branch in MSYS2 with the mingw-w64 CRT on Windows today. Now I have a relocation problem:

Assume the mingw-w64 headers are located in the following directory, which is the native_system_header_dir:
> C:/MinGW/MSYS2/mingw32/lib/gcc/i686-w64-mingw32/6.1.1/include
I have built GCC and it has that path hard-coded. When I compile something using g++ -v, the headers are searched in the following paths:
```
ignoring nonexistent directory "/mingw32/include"
ignoring duplicate directory "C:/MinGW/MSYS2/mingw32/i686-w64-mingw32/include"
#include "..." search starts here:
#include <...> search starts here:
 C:/MinGW/MSYS2/mingw32/include/c++/6.1.1
 C:/MinGW/MSYS2/mingw32/include/c++/6.1.1/i686-w64-mingw32
 C:/MinGW/MSYS2/mingw32/include/c++/6.1.1/backward
 C:/MinGW/MSYS2/mingw32/lib/gcc/i686-w64-mingw32/6.1.1/include
 C:/MinGW/MSYS2/mingw32/lib/gcc/i686-w64-mingw32/6.1.1/../../../../include
 C:/MinGW/MSYS2/mingw32/lib/gcc/i686-w64-mingw32/6.1.1/include-fixed
 C:/MinGW/MSYS2/mingw32/lib/gcc/i686-w64-mingw32/6.1.1/../../../../i686-w64-mingw32/include
End of search list.
```
The C++ headers are searched before any mingw-w64 headers, which is just fine.
However, if I move gcc to another directory, let's say, C:/this_is_a_new_directory/mingw32/, then re-compile the same program with g++ -v, the headers are searched in the following paths:
```
ignoring duplicate directory "C:/this_is_a_new_directory/mingw32/lib/gcc/../../lib/gcc/i686-w64-mingw32/6.1.1/include"
ignoring nonexistent directory "C:/MinGW/MSYS2/mingw32/include"
ignoring nonexistent directory "/mingw32/include"
ignoring duplicate directory "C:/this_is_a_new_directory/mingw32/lib/gcc/../../lib/gcc/i686-w64-mingw32/6.1.1/include-fixed"
ignoring duplicate directory "C:/this_is_a_new_directory/mingw32/lib/gcc/../../lib/gcc/i686-w64-mingw32/6.1.1/../../../../i686-w64-mingw32/include"
ignoring nonexistent directory "C:/MinGW/MSYS2/mingw32/i686-w64-mingw32/include"
#include "..." search starts here:
#include <...> search starts here:
 C:/this_is_a_new_directory/mingw32/bin/../lib/gcc/i686-w64-mingw32/6.1.1/include
 C:/this_is_a_new_directory/mingw32/bin/../lib/gcc/i686-w64-mingw32/6.1.1/../../../../include
 C:/this_is_a_new_directory/mingw32/bin/../lib/gcc/i686-w64-mingw32/6.1.1/include-fixed
 C:/this_is_a_new_directory/mingw32/bin/../lib/gcc/i686-w64-mingw32/6.1.1/../../../../i686-w64-mingw32/include
 C:/this_is_a_new_directory/mingw32/lib/gcc/../../include/c++/6.1.1
 C:/this_is_a_new_directory/mingw32/lib/gcc/../../include/c++/6.1.1/i686-w64-mingw32
 C:/this_is_a_new_directory/mingw32/lib/gcc/../../include/c++/6.1.1/backward
End of search list.
```
This time the C++ headers are searched after the mingw-w64 headers, which causes the following error:
```
In file included from C:/MinGW/mingw32/include/c++/6.1.1/ext/string_conversions.h:41:0,
                 from C:/MinGW/mingw32/include/c++/6.1.1/bits/basic_string.h:5402,
                 from C:/MinGW/mingw32/include/c++/6.1.1/string:52,
                 from C:/MinGW/mingw32/include/c++/6.1.1/bits/locale_classes.h:40,
                 from C:/MinGW/mingw32/include/c++/6.1.1/bits/ios_base.h:41,
                 from C:/MinGW/mingw32/include/c++/6.1.1/ios:42,
                 from C:/MinGW/mingw32/include/c++/6.1.1/ostream:38,
                 from C:/MinGW/mingw32/include/c++/6.1.1/iostream:39,
                 from test.cpp:1:
C:/MinGW/mingw32/include/c++/6.1.1/cstdlib:75:25: fatal error: stdlib.h: No such file or directory
 #include_next <stdlib.h>
                         ^
compilation terminated.
```
Do you know how to solve this problem (modifications to gcc source code are expected)? Thanks in advance.

--
Best regards,
lh_mouse
2016-05-02
Re: GCC 6.1 Hard-coded C++ header paths and relocation problem on Windows
I did some investigating yesterday, and here is the result:
```
Diff'ing gcc/libstdc++-v3/include/c_global/cstdlib from gcc-5-branch and gcc-6-branch gives the following result:
(git diff gcc-5-branch gcc-6-branch -- libstdc++-v3/include/c_global/cstdlib)
```
@@ -69,7 +69,11 @@ namespace std
 #else

-#include <stdlib.h>
+// Need to ensure this finds the C library's <stdlib.h> not a libstdc++
+// wrapper that might already be installed later in the include search path.
+#define _GLIBCXX_INCLUDE_NEXT_C_HEADERS
+#include_next <stdlib.h>
+#undef _GLIBCXX_INCLUDE_NEXT_C_HEADERS

 // Get rid of those macros defined in <stdlib.h> in lieu of real functions.
 #undef abort
```
Replacing #include_next with #include fixes the problem.
However, I am not sure whether these headers (cstdlib and cmath currently, there might be more) are the real problem.
In my view, the real problem is the inversion of the C and C++ header search paths.

--
Best regards,
lh_mouse
2016-05-02
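To illustrate why the order of the search directories matters for #include_next, here is a small model of the lookup in C. It is a sketch under assumptions, not the preprocessor's implementation: the three-entry search paths are a simplification of the g++ -v listings, and `struct dir`, `find_header`, `resolve_include` and `resolve_include_next` are hypothetical names. The only rule modelled is that #include searches from the start of the list, while #include_next searches only the directories after the one in which the current file was found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dir {
    const char *name;
    const char *headers[4];   /* NULL-terminated list of headers present */
};

/* Simplified search order of the original installation. */
static const struct dir installed[] = {
    { "include/c++/6.1.1",        { "cstdlib", NULL } },
    { "lib/gcc/.../include",      { "stddef.h", NULL } },
    { "i686-w64-mingw32/include", { "stdlib.h", NULL } },
};

/* Simplified search order after moving the toolchain. */
static const struct dir relocated[] = {
    { "lib/gcc/.../include",      { "stddef.h", NULL } },
    { "i686-w64-mingw32/include", { "stdlib.h", NULL } },
    { "include/c++/6.1.1",        { "cstdlib", NULL } },
};

/* Index of the first directory at or after `start` holding `h`, or -1. */
static int find_header(const struct dir *path, int n, int start, const char *h)
{
    for (int i = start; i < n; i++)
        for (size_t j = 0; j < 4 && path[i].headers[j] != NULL; j++)
            if (strcmp(path[i].headers[j], h) == 0)
                return i;
    return -1;
}

/* #include <h>: search the whole list. */
static int resolve_include(const struct dir *path, int n, const char *h)
{
    return find_header(path, n, 0, h);
}

/* #include_next <h> from a file found in directory `from`:
   search only the directories that come after it. */
static int resolve_include_next(const struct dir *path, int n, int from,
                                const char *h)
{
    assert(from >= 0 && from < n);
    return find_header(path, n, from + 1, h);
}
```

In the `installed` order, cstdlib is found at index 0 and `resolve_include_next` locates stdlib.h at index 2. In the `relocated` order, cstdlib sits at index 2, so `resolve_include_next` returns -1 (the "No such file or directory" case), while `resolve_include` still finds stdlib.h at index 1.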
Re: option -mprfchw on 2 different Opteron cpus
On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan wrote: >> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw >> listed in the options in -fverbose-asm. In the assembly, I see this: >> >> prefetcht0 (%rax) # ivtmp.1160 >> prefetcht0 304(%rcx) # >> prefetcht0 (%rax) # ivtmp.1160 > > In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA > support. > > (Snip) > CPUID Fn8000_0001_ECX Feature Identifiers > Bit 8 > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and > “PREFETCHW” in APM3 > Ref: http://support.amd.com/TechDocs/25481.pdf > (Snip) > > Can you please confirm what this CPUID flag returns on your k8 machine ?. > I believe this ISA is not available on k8 machine so when -march=native is > added you don’t see -mprfchw in verbose. Looks like zero? This was generated with the cpuid program from http://www.etallen.com/cpuid.html CPU 0: 0x 0x00: eax=0x0001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65 0x0001 0x00: eax=0x0f58 ebx=0x0800 ecx=0x edx=0x078bfbff 0x8000 0x00: eax=0x8018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65 0x8001 0x00: eax=0x0f58 ebx=0x0405 ecx=0x edx=0xe1d3fbff 0x8002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74 0x8003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x0038 0x8004 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140 0x8006 0x00: eax=0x ebx=0x42004200 ecx=0x04008140 edx=0x 0x8007 0x00: eax=0x ebx=0x ecx=0x edx=0x0009 0x8008 0x00: eax=0x3028 ebx=0x ecx=0x edx=0x 0x8009 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800a 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800b 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800c 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800d 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800e 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x800f 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8010 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8011 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8012 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8013 0x00: eax=0x 
ebx=0x ecx=0x edx=0x 0x8014 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8015 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8016 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8017 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8018 0x00: eax=0x ebx=0x ecx=0x edx=0x 0x8086 0x00: eax=0x ebx=0x ecx=0x edx=0x 0xc000 0x00: eax=0x ebx=0x ecx=0x edx=0x CPU: vendor_id = "AuthenticAMD" version information (1/eax): processor type = primary processor (0) family = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15) model = 0x5 (5) stepping id = 0x8 (8) extended family = 0x0 (0) extended model = 0x0 (0) (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um miscellaneous (1/ebx): process local APIC physical ID = 0x0 (0) cpu count = 0x0 (0) CLFLUSH line size = 0x8 (8) brand index= 0x0 (0) brand id = 0x00 (0): unknown feature information (1/edx): x87 FPU on chip= true virtual-8086 mode enhancement = true debugging extensions = true page size extensions = true time stamp counter = true RDMSR and WRMSR support= true physical address extensions= true machine check exception= true CMPXCHG8B inst.= true APIC on chip = true SYSENTER and SYSEXIT = true memory type range registers= true PTE global bit = true machine check architecture = true conditional move/compare instruction = true page attribute table = true page size extension= true processor serial number= f
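The AMD rule quoted earlier in this thread (PREFETCH/PREFETCHW are usable if ECX[3DNowPrefetch], EDX[LM] or EDX[3DNow] of CPUID Fn8000_0001 is set) can be applied to the register values from a dump like the one above. The sketch below is illustrative only: the function names are hypothetical, the EDX bit positions (29 for LM, 31 for 3DNow!) are taken from AMD's CPUID documentation rather than from this thread, and because the ECX value is truncated in the dump, a value of 0 is assumed in the usage note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits of CPUID Fn8000_0001. */
#define ECX_3DNOWPREFETCH (UINT32_C(1) << 8)   /* ECX[3DNowPrefetch] */
#define EDX_LM            (UINT32_C(1) << 29)  /* EDX[LM], long mode */
#define EDX_3DNOW         (UINT32_C(1) << 31)  /* EDX[3DNow] */

/* The dedicated 3DNowPrefetch feature bit on its own. */
static bool has_3dnowprefetch_bit(uint32_t ecx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0;
}

/* AMD's documented rule: any one of the three bits is sufficient. */
static bool prefetchw_usable(uint32_t ecx, uint32_t edx)
{
    return (ecx & ECX_3DNOWPREFETCH) != 0
        || (edx & (EDX_LM | EDX_3DNOW)) != 0;
}
```

With edx=0xe1d3fbff from the dump and an assumed ecx of 0, `has_3dnowprefetch_bit` is false while `prefetchw_usable` is true: the dedicated feature bit is clear, but the instruction is still covered by the 3DNow! and long-mode bits.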
Re: (R5900) Implementing Vector Support
On 04/29/2016 07:54 AM, Liu Woon Yung wrote: I've done something like that, but GCC still doesn't select the pattern to use: (define_insn "vec_cmp" Because you've used the wrong name. The patterns are: OPTAB_CD(vec_cmp_optab, "vec_cmp$a$b") OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b") I see where the confusion is though. These: i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmpv2div2di" i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmp" i386/sse.md:(define_expand "vec_cmpu" i386/sse.md:(define_expand "vec_cmpu" i386/sse.md:(define_expand "vec_cmpu" i386/sse.md:(define_expand "vec_cmpu" i386/sse.md:(define_expand "vec_cmpuv2div2di" are the only usage examples within the gcc tree. All of the other "vec_cmp" stuff that you're seeing are internal to the rs6000 and s390 ports, for implementing builtins and/or vcond. rs6000 doesn't implement bare comparisons, but only implements the "vcond" conditional move upon which uses the comparison. Many of the other targets do the same thing. Is there a reason why implementing only vcond is preferred? I believe that's just history. IIRC, only vcond was present originally. Amusingly, I believe that was because vcond was designed to handle one of the other MIPS vector extensions (MDMX?) wherein the comparison results are placed in (a set of) condition code registers, and thus producing a per-element {0,-1} vector result requires extra instructions. r~
r235766 incomplete?
Hi Jan,

I just noticed the compilation errors in the attached file with
the latest trunk.  It seems as though your recent patch below may
be incomplete:

   commit 46e5dccc6f188bd0fd5af4e9778f547ab63c9cae
   Author: hubicka
   Date:   Mon May 2 16:55:56 2016 +

The following change causes compilation errors due to
ipa_find_agg_cst_for_param taking just three arguments, while it
is being called with four.  (I haven't looked into the other error.)

Regards
Martin

--- a/gcc/ipa-inline-analysis.c
+++ b/gcc/ipa-inline-analysis.c
@@ -850,7 +850,8 @@ evaluate_conditions_for_known_args (struct cgraph_node *node,
        if (known_aggs.exists ())
          {
            agg = known_aggs[c->operand_num];
-           val = ipa_find_agg_cst_for_param (agg, c->offset, c->by_ref);
+           val = ipa_find_agg_cst_for_param (agg, known_vals[c->operand_num],
+                                             c->offset, c->by_ref);

/src/gcc/66561/gcc/ipa-inline-analysis.c: In function ‘clause_t evaluate_conditions_for_known_args(cgraph_node*, bool, vec, vec)’:
/src/gcc/66561/gcc/ipa-inline-analysis.c:854:27: error: invalid conversion from ‘tree_node*’ to ‘long int’ [-fpermissive]
        c->offset, c->by_ref);
                           ^
/src/gcc/66561/gcc/ipa-inline-analysis.c:854:27: error: too many arguments to function ‘tree_node* ipa_find_agg_cst_for_param(ipa_agg_jump_function*, long int, bool)’
In file included from /src/gcc/66561/gcc/ipa-inline-analysis.c:90:0:
/src/gcc/66561/gcc/ipa-prop.h:639:6: note: declared here
 tree ipa_find_agg_cst_for_param (struct ipa_agg_jump_function *, HOST_WIDE_INT,
      ^
/src/gcc/66561/gcc/ipa-inline.c: In function ‘bool can_inline_edge_p(cgraph_edge*, bool, bool, bool)’:
/src/gcc/66561/gcc/ipa-inline.c:341:55: error: ‘CIF_THUNK’ was not declared in this scope
       e->inline_failed = e->caller->thunk.thunk_p ? CIF_THUNK : CIF_MISMATCHED_ARGUMENTS;
                                                       ^
Re: r235766 incomplete?
On Mon, 2016-05-02 at 11:50 -0600, Martin Sebor wrote:
> Hi Jan,
>
> I just noticed the compilation errors in the attached file with
> the latest trunk.  It seems as though your recent patch below may
> be incomplete:
>
>    commit 46e5dccc6f188bd0fd5af4e9778f547ab63c9cae
>    Author: hubicka
>    Date:   Mon May 2 16:55:56 2016 +
>
> The following change causes compilation errors due to
> ipa_find_agg_cst_for_param taking just three arguments, while it
> is being called with four.  (I haven't looked into the other error.)
>
> Regards
> Martin
>
> --- a/gcc/ipa-inline-analysis.c
> +++ b/gcc/ipa-inline-analysis.c
> @@ -850,7 +850,8 @@ evaluate_conditions_for_known_args (struct cgraph_node *node,
>        if (known_aggs.exists ())
>          {
>            agg = known_aggs[c->operand_num];
> -          val = ipa_find_agg_cst_for_param (agg, c->offset, c->by_ref);
> +          val = ipa_find_agg_cst_for_param (agg, known_vals[c->operand_num],
> +                                            c->offset, c->by_ref);

I saw this too (with r235766).  I believe it's fixed by r235770 and
r235771:

2016-05-02  Jan Hubicka

	* cif-code.def (CIF_THUNK): Add.
	* ipa-inline-analsysis.c (evaluate_conditions_for_known_args): Revert
	accidental change.

(albeit with a typo in that second filename)

r235771 works for me, FWIW.

Dave
determining reassociation width
So, my first cut at the function to select reassociation width for
power was modeled after what I saw i386 and aarch64 doing, which is to
return something based on the number of that kind of op we can do at
the same time:

static int
rs6000_reassociation_width (unsigned int opc, enum machine_mode mode)
{
  switch (rs6000_cpu)
    {
    case PROCESSOR_POWER8:
    case PROCESSOR_POWER9:
      if (VECTOR_MODE_P (mode))
	return 2;
      if (INTEGRAL_MODE_P (mode))
	{
	  if (opc == MULT_EXPR)
	    return 2;
	  return 6; /* correct for all integral modes? */
	}
      if (FLOAT_MODE_P (mode))
	return 2;
      /* decimal float gets default 1 */
      break;
    default:
      break;
    }
  return 1;
}

However, the reality of the situation is a bit more complicated I
think.

* If we want maximum parallelism, we should really base this on the
  number of units times the latency. I.e. for float on p8 we have 2
  units and 6 cycles latency so we would want to issue up to 12 fadd or
  fmul in parallel, then the result from the first one would be ready
  for the next series of dependent ops.
* Of course this may cause massive register spills and so we can't
  really make things that wide. So, reassociation ought to be aware of
  how much register pressure it is creating and how much has been
  created by things that want to be live across this bb.
* Ideally we would also be aware of whether we are reassociating a tree
  of fp additions whose terms are fp multiplies because now we have
  fused multiply-adds to consider. See PR 70912 for more on this.

Suggestions?

Thanks,
   Aaron

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
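The "units times latency" idea from the first bullet, combined with a limit for register pressure from the second, could be sketched as below. This is only an illustration of the arithmetic: `reassoc_width_estimate` and its `free_regs` parameter are hypothetical and not part of GCC, and capping the width at the number of free registers is an assumed, deliberately crude heuristic.

```c
#include <assert.h>

/* Estimate how many independent operations to expose.
   units:     number of functional units for this kind of op
   latency:   result latency in cycles
   free_regs: registers assumed to be available for the new temporaries */
static int reassoc_width_estimate(int units, int latency, int free_regs)
{
    assert(units > 0 && latency > 0);

    /* Enough ops in flight to keep every unit busy until the first
       result comes back. */
    int width = units * latency;

    /* Crude pressure cap: never ask for more parallel chains than
       there are registers to hold them, and never go below 1. */
    if (free_regs < 1)
        free_regs = 1;
    if (width > free_regs)
        width = free_regs;
    return width;
}
```

For the float case on p8 described above (2 units, 6 cycles), `reassoc_width_estimate(2, 6, 32)` yields 12, while a tighter budget such as `reassoc_width_estimate(2, 6, 8)` yields 8.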
Re: r235766 incomplete?
> On Mon, 2016-05-02 at 11:50 -0600, Martin Sebor wrote:
> > Hi Jan,
> >
> > I just noticed the compilation errors in the attached file with
> > the latest trunk.  It seems as though your recent patch below may
> > be incomplete:
> >
> >    commit 46e5dccc6f188bd0fd5af4e9778f547ab63c9cae
> >    Author: hubicka
> >    Date:   Mon May 2 16:55:56 2016 +
> >
> > The following change causes compilation errors due to
> > ipa_find_agg_cst_for_param taking just three arguments, while it
> > is being called with four.  (I haven't looked into the other error.)
> >
> > Regards
> > Martin
> >
> > --- a/gcc/ipa-inline-analysis.c
> > +++ b/gcc/ipa-inline-analysis.c
> > @@ -850,7 +850,8 @@ evaluate_conditions_for_known_args (struct cgraph_node *node,
> >        if (known_aggs.exists ())
> >          {
> >            agg = known_aggs[c->operand_num];
> > -          val = ipa_find_agg_cst_for_param (agg, c->offset, c->by_ref);
> > +          val = ipa_find_agg_cst_for_param (agg, known_vals[c->operand_num],
> > +                                            c->offset, c->by_ref);
>
> I saw this too (with r235766).  I believe it's fixed by r235770 and
> r235771:
>
> 2016-05-02  Jan Hubicka
>
> 	* cif-code.def (CIF_THUNK): Add.
> 	* ipa-inline-analsysis.c (evaluate_conditions_for_known_args): Revert
> 	accidental change.
>
> (albeit with a typo in that second filename)

Uh, thanks.  Will fix that with the next commit.  I managed to accidentally
bundle unrelated changes into the patch.  My apologies for that.  I will try
to keep my commit tree clean.

Honza
Re: Is MODES_TIEABLE_P transitive?
On Mon, Apr 25, 2016 at 11:04:01AM -0600, Jeff Law wrote: > On 04/21/2016 01:53 PM, Michael Meissner wrote: > >As I start to allow integer modes into vector registers, I need to revisit > >MODES_TIEABLE_P. I'm wondering if MODES_TIEABLE_P is transitive? > I don't recall a need for it to be transitive. The only really > special thing I remember about MODES_TIEABLE_P was its relation to > HARD_REGNO_MODE_OK and the need for them to be consistent. > > > > >What I'd like to do, in a Power8 context, is to allow these to return true > >(after allowing SImode to go in VSX registers): > > > > MODES_TIEABLE_P (SImode, DFmode) > > MODES_TIEABLE_P (SImode, QImode) > > > >but, the following would return false for power8: > > > > MODES_TIEABLE_P (QImode, DFmode) > > > >In a power9 context, since there are loads/stores of 8/16-bit items it would > >return true > So what this kindof setup would allow would be subregs more > aggressively between SImode/DFmode and SImode/QImode, but would > restrict QImode/DFmode. > > You may need to twiddle CANNOT_CHANGE_MODE_CLASS along the way. > > > >So the question is whether we might need to support MODES_TIEABLE_P being > >transitive (i.e. it would return true a lot less of the time). I would prefer > >to not have to worry about odd corner cases where another type is tieable > >with > >one of the arguments, but not tieable with the other. And does it matter > >whether we are using RELOAD or IRA? > IIRC MODES_TIEABLE_P was largely related to reloads [in]ability to > handle certain subreg extractions -- like trying to extract a QImode > subreg out of FP hard register and the like. Yeah, this is getting rather complex. I recall trying to change MODES_TIEABLE_P in the past, and back then having all sorts of reload issues. I kind of want my cake and eat it too. 
On one hand, I want the primary integer types to be tieable, and on the other
hand, I want 32/64-bit ints to be tieable with floating point (things like IBM
extended double and vectors can never be tieable due to the extended double
using 2 64-bit parts in 2 separate registers, and vectors using a single
128-bit part in 1 register).

In particular, in power9 there are various instructions for packing and
unpacking floating point types, and it would be natural to want to use them for
unions to help speed up the math library (as well as the ability to do byte and
half-word memory operations).

I will put it on the back burner for now.  Thanks.

--
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
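The transitivity question can be made concrete with a small brute-force check over the three modes mentioned in this thread. The sketch below is a toy model only: `enum mode`, the two tables and `is_transitive` are hypothetical, and the tables merely encode the Power8 and Power9 answers proposed earlier in the thread, not the real MODES_TIEABLE_P definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mode { QI, SI, DF, NMODES };

/* Proposed Power8 relation: SI ties with both QI and DF,
   but QI and DF do not tie with each other. */
static const bool tieable_p8[NMODES][NMODES] = {
    /* QI */ { true,  true, false },
    /* SI */ { true,  true, true  },
    /* DF */ { false, true, true  },
};

/* Power9 relation as described: QI and DF tie as well. */
static const bool tieable_p9[NMODES][NMODES] = {
    /* QI */ { true, true, true },
    /* SI */ { true, true, true },
    /* DF */ { true, true, true },
};

/* True if a~b and b~c always imply a~c. */
static bool is_transitive(const bool rel[NMODES][NMODES])
{
    assert(rel != NULL);
    for (int a = 0; a < NMODES; a++)
        for (int b = 0; b < NMODES; b++)
            for (int c = 0; c < NMODES; c++)
                if (rel[a][b] && rel[b][c] && !rel[a][c])
                    return false;
    return true;
}
```

Here `is_transitive(tieable_p8)` is false (QI ties with SI and SI with DF, yet QI does not tie with DF), while `is_transitive(tieable_p9)` is true.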
RE: option -mprfchw on 2 different Opteron cpus
Hi

> -Original Message-
> From: NightStrike [mailto:nightstr...@gmail.com]
> Sent: Monday, May 2, 2016 10:31 PM
> To: Kumar, Venkataramanan
> Cc: Uros Bizjak (ubiz...@gmail.com) ; lopeziba...@gmail.com; Jan Hubicka ; Jakub Jelinek ; gcc@gcc.gnu.org
> Subject: Re: option -mprfchw on 2 different Opteron cpus
>
> On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan wrote:
> >> If I compile on a k8 Opteron 248 with -march=native, I do not see
> >> -mprfchw listed in the options in -fverbose-asm. In the assembly, I see this:
> >>
> >> prefetcht0 (%rax) # ivtmp.1160
> >> prefetcht0 304(%rcx) #
> >> prefetcht0 (%rax) # ivtmp.1160
> >
> > In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA support.
> >
> > (Snip)
> > CPUID Fn8000_0001_ECX Feature Identifiers Bit 8
> > 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See
> > “PREFETCH” and “PREFETCHW” in APM3
> > Ref: http://support.amd.com/TechDocs/25481.pdf
> > (Snip)
> >
> > Can you please confirm what this CPUID flag returns on your k8 machine ?.
> > I believe this ISA is not available on k8 machine so when -march=native is added you don’t see -mprfchw in verbose.
>
> Looks like zero? This was generated with the cpuid program from
> http://www.etallen.com/cpuid.html
>
> 3DNow! instruction extensions = true
> 3DNow! instructions = true

It has 3DNow support. "prefetchw" is available with 3DNow.

> misaligned SSE mode = false
> 3DNow! PREFETCH/PREFETCHW instructions = false

It does not have 3DNowPrefetch, so enabling the ISA flag -mprfchw is not correct for -march=k8.

> OS visible workaround = false
> instruction based sampling = false

> >> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying
> >> to target the older system), I do see it listed in the options in
> >> -fverbose-asm. In the assembly, I see this:
> >
> > K8 has 3dnow support and there is a patch that replaced 3dnow with prefetchw (3DNowPrefetch).
> > https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html > > So when you add -march=k8 you see -mprfchw getting listed in verbose. > > > >> > >> prefetcht0 (%rax) # ivtmp.1160 > >> prefetcht0 304(%rcx) # > >> prefetchw (%rax) # ivtmp.1160 > >> > >> (The third line is the only difference) > >> > > > > This is my guess without seeing the test case, when write prefetching is > requested "prefetchw" is generated. > > 3dnow (TARGET_3DNOW) ISA has support for it. > > > > (Snip) > > Support for the PREFETCH and PREFETCHW instructions is indicated by > > CPUID Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR > > Fn8000_0001_EDX[3DNow] = 1. > > (Snip) > > Ref: > http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf > > > >> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248? > >> > >> Also, FWIW: > >> > >> 1) The march=native version that uses prefetcht0 is very repeatably > >> faster by about 15% in the particular test case I'm looking at. > >> > >> 2) The compilers in both instances are not just the same version, > >> they are the same compiler binary installed on an NFS mount and > >> shared to both computers. > > > > As per GCC4.9.3 source. > > > > (Snip) > > (define_expand "prefetch" > > [(prefetch (match_operand 0 "address_operand") > > (match_operand:SI 1 "const_int_operand") > > (match_operand:SI 2 "const_int_operand"))] > > "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1" > > { > > bool write = INTVAL (operands[1]) != 0; > > int locality = INTVAL (operands[2]); > > > > gcc_assert (IN_RANGE (locality, 0, 3)); > > > > /* Use 3dNOW prefetch in case we are asking for write prefetch not > > supported by SSE counterpart or the SSE prefetch is not available > > (K6 machines). Otherwise use SSE prefetch as it allows specifying > > of locality. 
*/ > > if (TARGET_PREFETCHWT1 && write && locality <= 2) > > operands[2] = const2_rtx; > > else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE)) > > operands[2] = GEN_INT (3); > > else > > operands[1] = const0_rtx; > > }) > > (Snip) > > > > Write prefetch may be requested (either by auto prefetcher or builtins) but > on -march=native, the below check could have become false. > >else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE)) > > TARGET_PRFCHW is off on native. > > > > So there are two issues here. > > > > (1) ISA flags enabled with -march=k8 is different from -march=native on k8 > machine. I think we need to file bug for this. Need to check with Uros why the flag -mprfchw is shared with 3dnow. To work around this issue you can use -mno-prfchw when building with -march=k8. > > (2) Need to check why GCC middle end requested write prefetch for the > test case with -march=k8 . On "prefetchw" generation it may be the case that GCC auto prefet