[Bug c++/40145] New: structure inside a static function is exported, producing warning
Given the following code:

===
struct EditorInternalCommand { };

static void createCommandMap()
{
    struct CommandEntry
    {
        EditorInternalCommand command;
    };
}
===

The structure createCommandMap()::CommandEntry is exported from a local-scope (static) function. When compiling the code above with -fvisibility=hidden, g++ 4.3 or 4.4 outputs the following warning:

visibility.cpp:5: warning: 'createCommandMap()::CommandEntry' declared with greater visibility than the type of its field 'createCommandMap()::CommandEntry::command'

If I add constructors to both structures so that there are symbols emitted, the ELF symbol table looks like this:

     6: 22 FUNC LOCAL DEFAULT 2 _ZZL16createCommandMapvEN12CommandEntryC1Ev
     7: 0016 16 FUNC LOCAL DEFAULT 2 _ZL16createCommandMapv
    12: 5 FUNC WEAK HIDDEN 6 _ZN21EditorInternalCommandC1Ev

My understanding of the problem is that a "static" function has LOCAL scope but DEFAULT visibility. The inner structure inherits these properties. The outer structure (EditorInternalCommand), on the other hand, has HIDDEN visibility, which triggers the warning. However, since the binding scope is LOCAL, those symbols will not be exported by the linker in the final ELF object anyway, which effectively gives them hidden visibility.

Workarounds: any of the following three actions makes the warning disappear:
1) remove the "static" keyword
2) move the inner structure outside the static function
3) place the outer structure in an anonymous namespace

Actions #2 and #3 above change those properties, making both constructors match as either LOCAL/DEFAULT or WEAK/HIDDEN. Action #1 causes the function to become GLOBAL/HIDDEN but leaves the inner structure unchanged; the warning is gone in that case too.
--
Summary: structure inside a static function is exported, producing warning
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: thiago at kde dot org
GCC build triplet: i586-manbo-linux-gnu
GCC host triplet: i586-manbo-linux-gnu
GCC target triplet: i586-manbo-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40145
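
For readers trying the workarounds from bug 40145, here is a minimal sketch of workaround 3 (outer structure in an anonymous namespace). The `id` member and the return value are hypothetical additions so that the function has something observable to do; they are not part of the original testcase.

```cpp
#include <cassert>

// Workaround 3 from the report: the field type lives in an anonymous
// namespace, so it gets internal linkage just like the static function.
namespace {
struct EditorInternalCommand {
    int id = 0;  // hypothetical member, added only for illustration
};
}

static int createCommandMap()
{
    // The local structure and the type of its field now agree on linkage.
    struct CommandEntry {
        EditorInternalCommand command;
    };

    CommandEntry entry;
    entry.command.id = 7;
    assert(entry.command.id == 7);
    return entry.command.id;
}
```

Whether a given GCC release still prints the warning for the original code is not something this sketch tries to show; it only illustrates the shape of the change.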
[Bug target/96238] New: [i386] cpuid.h header needs include guards
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96238

Bug ID: 96238
Summary: [i386] cpuid.h header needs include guards
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

$ cat x.c
#include
#include
$ gcc -c x.c
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:228:1: error: redefinition of ‘__get_cpuid_max’
  228 | __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
      | ^~~
In file included from :32:
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:228:1: note: previous definition of ‘__get_cpuid_max’ was here
  228 | __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
      | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:283:1: error: redefinition of ‘__get_cpuid’
  283 | __get_cpuid (unsigned int __leaf,
      | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:283:1: note: previous definition of ‘__get_cpuid’ was here
  283 | __get_cpuid (unsigned int __leaf,
      | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:300:1: error: redefinition of ‘__get_cpuid_count’
  300 | __get_cpuid_count (unsigned int __leaf, unsigned int __subleaf,
      | ^
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:300:1: note: previous definition of ‘__get_cpuid_count’ was here
  300 | __get_cpuid_count (unsigned int __leaf, unsigned int __subleaf,
      | ^
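
As an illustration of what an include guard buys in the situation reported in bug 96238, the sketch below mimics a double inclusion inside a single file. The header name my_cpu_helpers.h, the macro MY_CPU_HELPERS_H and the function next_leaf are all hypothetical; this is not the content of cpuid.h.

```cpp
#include <cassert>

// First "inclusion" of a hypothetical header, my_cpu_helpers.h.
#ifndef MY_CPU_HELPERS_H
#define MY_CPU_HELPERS_H
static inline unsigned next_leaf(unsigned leaf)
{
    assert(leaf != ~0u);   // keep the increment from wrapping around
    return leaf + 1;
}
#endif

// Second "inclusion": MY_CPU_HELPERS_H is already defined, so the
// preprocessor skips the body and next_leaf is not redefined.
#ifndef MY_CPU_HELPERS_H
#define MY_CPU_HELPERS_H
static inline unsigned next_leaf(unsigned leaf)
{
    assert(leaf != ~0u);
    return leaf + 1;
}
#endif
```

Without the #ifndef/#define pair, the second copy would produce the same kind of "redefinition" error shown in the transcript above.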
[Bug target/95483] [i386] Missing SIMD functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483

--- Comment #2 from Thiago Macieira ---
Hello Evan

I was about to report that _mm_loadu_epi16 is missing, but I'm glad you've got a more complete listing. FYI, here's a Godbolt link showing ICC and Clang with this intrinsic: https://gcc.godbolt.org/z/8nMcPE. I'll only have to report to Microsoft and will reference this bug report so they check their own implementation.

FYI, for anyone stumbling upon this report when their code failed: most of the missing intrinsics can be worked around by combining one or more existing ones, and the result will be the same code.

(In reply to Evan Nemerson from comment #0)
> Here is the list:
>
> AVX _mm256_cvtsi256_si32
> AVX-512 _mm512_cvtsi512_si32

_mm256_extract_epi32 or _mm_cvtsi128_si32(_mm256_castsi256_si128(x))

Ditto for 512-bit.

> AVX2 _mm_broadcastsd_pd

If using AVX2 is acceptable, one can use _mm_broadcastq_epi64 with suitable casting between __m128i and __m128d.

> AVX2 _mm_broadcastsi128_si256

Looks like a typo; this one exists with the _mm256 prefix, as it should.

> AVX-512 _mm512_storeu_epi16
> AVX-512 _mm512_storeu_epi8
> AVX-512 _mm256_storeu_epi16
> AVX-512 _mm256_storeu_epi8
> AVX-512 _mm_storeu_epi16
> AVX-512 _mm_storeu_epi8
> AVX-512 _mm512_loadu_epi16
> AVX-512 _mm512_loadu_epi8
> AVX-512 _mm256_loadu_epi16
> AVX-512 _mm256_loadu_epi8
> AVX-512 _mm_loadu_epi16
> AVX-512 _mm_loadu_epi8
> AVX-512 _mm256_store_epi32
> AVX-512 _mm_store_epi32
> AVX-512 _mm256_loadu_epi64
> AVX-512 _mm256_loadu_epi32
> AVX-512 _mm_loadu_epi64
> AVX-512 _mm_loadu_epi32
> AVX-512 _mm256_load_epi64
> AVX-512 _mm256_load_epi32
> AVX-512 _mm_load_epi64
> AVX-512 _mm_load_epi32

All of these can be implemented as the mask (for storing) or maskz (for loading) equivalents with a mask of ~0 (UINT64_MAX for the epi8 ones). For example

  _mm256_loadu_epi16(ptr)

becomes

  _mm256_maskz_loadu_epi16(~0, ptr)

> AVX-512 _mm_cvtsd_i32
> AVX-512 _mm_cvtsd_i64
> AVX-512 _mm_cvtss_i32
> AVX-512 _mm_cvtss_i64
> AVX-512 _mm_cvti32_sd
> AVX-512 _mm_cvti64_sd
> AVX-512 _mm_cvti32_ss
> AVX-512 _mm_cvti64_ss

Not sure why those are needed; they generate the same instruction as _mm_cvtsX_siYY. Clang's header even has:

#define _mm_cvtss_i32 _mm_cvtss_si32
#define _mm_cvtsd_i32 _mm_cvtsd_si32
#define _mm_cvti32_sd _mm_cvtsi32_sd
#define _mm_cvti32_ss _mm_cvtsi32_ss
#ifdef __x86_64__
#define _mm_cvtss_i64 _mm_cvtss_si64
#define _mm_cvtsd_i64 _mm_cvtsd_si64
#define _mm_cvti64_sd _mm_cvtsi64_sd
#define _mm_cvti64_ss _mm_cvtsi64_ss
#endif

ICC does the same.

> SSE _mm_storeu_si16
> SSE2 _mm_storeu_si32

With casting of the pointer:

  *dest = _mm_cvtsi128_si16(mm)

If the casting is too scary or triggers aliasing warnings, then:

  uintXX_t val = _mm_cvtsi128_siXX(mm);
  memcpy(dest, &val, sizeof(val));

GCC optimises the memcpy and reg-reg MOVD into a single MOVD into memory.

> SSE _mm_loadu_si16
> SSE2 _mm_loadu_si32

Ditto for the _mm_cvtsiXX_si128.
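
The memcpy variant described above can be spelled out as follows. This is only a sketch, assuming an x86 target with SSE2 enabled; the helper names store_low32, store_low16 and load_low32 are hypothetical, and the 16-bit case is written here as a truncation of _mm_cvtsi128_si32, which is one possible reading of the "siXX" shorthand in the comment.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2; assumes an x86 target where SSE2 is enabled

// Hypothetical stand-in for _mm_storeu_si32: store the low 32 bits of mm.
static inline void store_low32(void *dest, __m128i mm)
{
    assert(dest != nullptr);
    std::uint32_t val = static_cast<std::uint32_t>(_mm_cvtsi128_si32(mm));
    std::memcpy(dest, &val, sizeof(val));  // no alignment or aliasing assumptions
}

// Hypothetical stand-in for _mm_storeu_si16: truncate the low lane to 16 bits.
static inline void store_low16(void *dest, __m128i mm)
{
    assert(dest != nullptr);
    std::uint16_t val = static_cast<std::uint16_t>(_mm_cvtsi128_si32(mm));
    std::memcpy(dest, &val, sizeof(val));
}

// Hypothetical stand-in for _mm_loadu_si32: load 32 bits, zero the upper lanes.
static inline __m128i load_low32(const void *src)
{
    assert(src != nullptr);
    std::int32_t val;
    std::memcpy(&val, src, sizeof(val));
    return _mm_cvtsi32_si128(val);
}
```

The helpers accept unaligned pointers, which is the point of the "u" intrinsics they stand in for.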
[Bug target/90129] Wrong error: inlining failed in call to always_inline ‘_mm256_adds_epi16’: target specific option mismatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90129

Thiago Macieira changed:

What |Removed |Added
CC   |        |thiago at kde dot org

--- Comment #3 from Thiago Macieira ---
Another test:

$ cat test.c
#include <immintrin.h>

__attribute__((target("arch=haswell")))
int hsw_test32(float f)
{
    __m128 m = _mm_set_ss(f);
    m = _mm_cmpeq_ss(m, m);
    return _mm_movemask_ps(m);
}
$ gcc -c test.c
In file included from /usr/lib64/gcc/x86_64-suse-linux/10/include/immintrin.h:29,
                 from test.c:1:
test.c: In function ‘hsw_test32’:
/usr/lib64/gcc/x86_64-suse-linux/10/include/xmmintrin.h:814:1: error: inlining failed in call to ‘always_inline’ ‘_mm_movemask_ps’: target specific option mismatch
[...]
$ clang -c test.c && echo No error
No error
$ gcc -march=haswell -c test.c && echo No error
No error
[Bug c++/92400] New: Incorrect selection of constructor overload for brace list
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92400

Bug ID: 92400
Summary: Incorrect selection of constructor overload for brace list
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Godbolt link: https://gcc.godbolt.org/z/bckr_n

Testcase:

#include

struct A;
struct V {
    V() = default;
    V(const A &);
    A make_a() const;
};
struct A {
    A();
    A(const A &);
    A(A &&);
    A(std::initializer_list);
};

void sink(A &);
void f()
{
    A a{ V().make_a() };
    sink(a);
}

When compiled, GCC generates a call to A::A(std::initializer_list). The three other compilers in the test do not: ICC has a call to A::A(A&&), and Clang can be made to have that call with -std=c++14 -fno-elide-constructors.

See
https://wg21.link/cwg1631 - CWG1631: Incorrect overload resolution for single-element initializer-list
https://wg21.link/cwg1467 - CWG1467: List-initialization of aggregate from same-type object
https://wg21.link/cwg2137 - CWG2137: List-initialization from object of same type
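
For code that has to behave the same across the compilers mentioned in bug 92400, one way to sidestep the question is to use parentheses instead of braces, since an initializer-list constructor is then only viable if the argument itself converts to the initializer_list type. The sketch below is not from the report: the element type V of the initializer list is an assumption (the testcase above lost its template argument), and the counter and the function init_list_calls_with_parens are hypothetical.

```cpp
#include <cassert>
#include <initializer_list>

struct A;
struct V {
    V() = default;
    V(const A &) {}
    A make_a() const;
};

static int g_init_list_calls = 0;  // hypothetical counter, for illustration only

struct A {
    A() {}
    A(const A &) {}
    A(A &&) {}
    // Element type V is an assumption; the original testcase lost it.
    A(std::initializer_list<V>) { ++g_init_list_calls; }
};

A V::make_a() const { return A(); }

// Direct-initialization with parentheses: an A argument cannot convert to
// std::initializer_list<V>, so that constructor is never selected here.
static int init_list_calls_with_parens()
{
    g_init_list_calls = 0;
    A a(V().make_a());
    (void)a;
    return g_init_list_calls;
}
```

With braces, as in the testcase, the outcome depends on the compiler, which is exactly what the report is about; the sketch therefore makes no claim about that form.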
[Bug c++/92400] Incorrect selection of constructor overload for brace list
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92400

Thiago Macieira changed:

What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |DUPLICATE

--- Comment #2 from Thiago Macieira ---
Yes, it's the same.

*** This bug has been marked as a duplicate of bug 85577 ***
[Bug c++/85577] list-initialization chooses initializer-list constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85577

Thiago Macieira changed:

What |Removed |Added
CC   |        |thiago at kde dot org

--- Comment #8 from Thiago Macieira ---
*** Bug 92400 has been marked as a duplicate of this bug. ***
[Bug c++/92855] New: -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

Bug ID: 92855
Summary: -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Related to bug 47877 and bug 45065, but apparently different. Issue probably goes back a long time.

We're compiling with -fvisibility=hidden -fvisibility-inlines-hidden and expect that any inline functions used by libstdc++ to perform its job are hidden and not exported from our library. Unfortunately, GCC is failing to hide some of those functions and they can be seen with eu-readelf -s in the library output, where they appear as "WEAK DEFAULT".

This is currently not expected to be a big problem, since these functions *are* inline and therefore expected to be emitted in any user code that needed to use them. They just cause our symbol table to be bigger than it needs to be.

Testcase:

#include

class QThreadCreateThread
{
public:
    explicit QThreadCreateThread(std::future &&future)
        : m_future(std::move(future))
    {
    }

private:
    virtual void run()
    {
        m_future.get();
    }

    std::future m_future;
};

// QThread *QThread::createThreadImpl(std::future &&future)
QThreadCreateThread *createThreadImpl(std::future &&future)
{
    return new QThreadCreateThread(std::move(future));
}

Compile with -O2 -fvisibility=hidden -fvisibility-inlines-hidden -fno-inline to force no inlining. In the assembly output, there are plenty of inline functions with .hidden and plenty without. For example, see std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():

        .section .text.std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(),"axG",@progbits,std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(),comdat
        .align 2
        .p2align 4
        .weak std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
        .type std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(), @function
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
        [...]

No .hidden present.
[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

--- Comment #3 from Thiago Macieira ---
The symbol in question is inline, therefore -fvisibility-inlines-hidden should trigger and cause it to become hidden too.

Testcase showing that GCC will apply that:

#define VISIBILITY(x) __attribute__((visibility(#x)))
namespace N VISIBILITY(default) {
void other();
inline void f() { other(); }
void g() { f(); }
}

If you compile this with -fno-inline to cause f() to be emitted, you'll see:

        .section .text.N::f(),"axG",@progbits,N::f(),comdat
        .p2align 4
        .weak N::f()
        .hidden N::f()
        .type N::f(), @function
N::f():
        jmp N::other()
        .size N::f(), .-N::f()

See: https://gcc.godbolt.org/z/nW3RbX

So I contend that the symbol should have been hidden and wasn't because of a bug. Please reconsider.
[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

--- Comment #5 from Thiago Macieira ---
(In reply to Alexander Monakov from comment #4)
> (FWIW, making 'f' a template in your example makes it non-hidden)
>
> Can you explain why you expect the command-line option to override the
> attribute on the namespace? GCC usually implements the opposite, i.e.
> attributes prevail over the defaults specified on the command line.
>
> In your sample on Godbolt, Clang also appears to honour the attribute rather
> than the option.

And ICC does the opposite and hides everything. Either way, GCC's behaviour of applying this to templates (which is bug 47877, so you may close as duplicate) is unexpected and seems inconsistent.

I expect the emitted function to be hidden because it's inline and because of -fvisibility-inlines-hidden. From the Texinfo manual:

     The effect of this is that GCC may, effectively, mark inline
     methods with '__attribute__ ((visibility ("hidden")))' so that they
     do not appear in the export table of a DSO and do not require a PLT
     indirection when used within the DSO. Enabling this option can
     have a dramatic effect on load and link times of a DSO as it
     massively reduces the size of the dynamic export table when the
     library makes heavy use of templates.

Since the out-of-line copies of the inline functions will be emitted in every TU that failed to inline them, and thus remain in every DSO, there's no need to export them. Each DSO can call its own, local copy through PC-relative calls and jumps.

For the particular problem at hand, which we're still debugging, see https://bugreports.qt.io/browse/QTBUG-80535. The issue there is that certain non-Qt symbols were exported by the DSO and thus got tagged with the ELF version "Qt_5". That by itself is not a problem, but we've found that some applications began referencing those symbols with that ELF version and we don't understand why. The result is that the internal details of how something was implemented became part of our ABI.
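
To make the manual's wording concrete, this is roughly what "effectively mark inline methods with the hidden attribute" looks like when written by hand. It is a sketch assuming GCC or Clang on an ELF target; the macro EXAMPLE_HIDDEN and the namespace example are hypothetical, and the technique obviously cannot be applied to libstdc++ functions one does not own, which is why the report asks for the option to do it.

```cpp
#include <cassert>

// Assumes GCC or Clang on an ELF target, where this attribute is supported.
#define EXAMPLE_HIDDEN __attribute__((visibility("hidden")))

namespace example {  // hypothetical namespace

int other(int x)
{
    assert(x >= 0);  // illustrative precondition only
    return x + 1;
}

// If an out-of-line copy of f() is emitted, hidden visibility keeps it
// out of the DSO's dynamic symbol table.
EXAMPLE_HIDDEN inline int f(int x) { return other(x) * 2; }

int g(int x) { return f(x); }

}  // namespace example
```

Compiling such a file with -fno-inline and inspecting the assembly for a .hidden directive next to f() is one way to compare against the output quoted in comment #3.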
[Bug c/56446] New: Generate one fewer relocation when calling a checked weakref function
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446

Bug #: 56446
Summary: Generate one fewer relocation when calling a checked weakref function
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

When you have code like:

static int f() __attribute__((weakref("foo")));
void g()
{
    int (*ptr)() = f;
    if (ptr)
        ptr();
}

which is typical for weakref functions, when compiled in PIC/PIE mode, gcc sees through the variable and generates:

        cmpq    $0, f@GOTPCREL(%rip)
        je      .L1
        xorl    %eax, %eax
        jmp     f@PLT
.L1:
        ret

That means there will be two GOT entries for the "foo" symbol: one in the actual GOT and one in the .plt.got (lazily initialised). Since the actual GOT needs to have the address filled in at load time, there's no gain in lazy initialisation; in fact, there's a loss.

GCC could do exactly what the code is suggesting and load the actual address into a register and then use it. This would save one relocation, the indirect PLT jumps and the loss in the lazy resolution.
[Bug tree-optimization/56446] [4.6/4.7/4.8 Regression] Generate one fewer relocation when calling a checked weakref function
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446 --- Comment #3 from Thiago Macieira 2013-02-25 22:27:14 UTC --- This should not be done for non-PIC code. In those, it might be preferable to make the actual call, as opposed to an indirect jump. I also wonder what would happen for a call that resolves back into the current module. In those cases, keeping the indirect call would be unnecessary. However, it also seems like an edge case to me: why is the symbol weak if it's part of the module?
[Bug tree-optimization/56446] [4.6/4.7/4.8 Regression] Generate one fewer relocation when calling a checked weakref function
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446 --- Comment #4 from Thiago Macieira 2013-02-25 22:28:07 UTC --- One more detail: both ICC 13 and Clang 3.0 do the same thing as GCC.
[Bug middle-end/56574] False possibly uninitialized variable warning
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56574

Thiago Macieira changed:

What |Removed |Added
CC   |        |thiago at kde dot org

--- Comment #3 from Thiago Macieira 2013-03-08 21:11:19 UTC ---
Looking at the code that GCC generated (4.7.2 from Fedora and similarly with pristine 4.8 trunk@196249):

        %edi = flag; %eax = value
11      testl   %edi, %edi
12      je      .L12
        .L12 is where the call to get_value() is placed
13 .L2:
14      testl   %edi, %edi
15      sete    %dl
16      testl   %eax, %eax
        Here, EAX might be uninitialised
17      setne   %al
18      testb   %dl, %al
19      jne     .L3
        .L3 is an infinite loop
20      testb   %dl, %dl
21      jne     .L8
        .L8 is the function exit (the loop break)
        fall-through is an infinite loop

In other words, the warning is true: the generated code *is* using an uninitialised variable.

The question is whether that is acceptable. In order for EAX to be uninitialised, we must not have jumped to .L12. Since the JE jump on line 12 was not taken, EDI is non-zero and SETE must have set DL to 0 on line 15. Then we TEST (bitwise AND) AL against DL on line 18: as DL is zero, the result is zero and ZF is set, whichever the value of AL might be. That means the JNE jump on line 19 is not taken. DL is still zero on line 20, so the JNE on line 21 is not taken either and the code proceeds to the infinite loop.

Conclusion: it's just a bogus warning. It is correct from the point of view of the assembly code that was generated, but the uninitialised value is never used in any decisions.
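
The argument in comment #3 of bug 56574 can be checked mechanically with a small model of lines 14 to 19 of the listing. The function names and the sample values below are hypothetical; the model only mirrors the SETE/SETNE/TEST/JNE sequence, not the rest of the function.

```cpp
#include <cassert>

// Models lines 14-19 of the listing: true when the JNE to .L3 is taken.
static bool jumps_to_L3(int flag, int value)
{
    bool dl = (flag == 0);   // line 15: sete %dl
    bool al = (value != 0);  // line 17: setne %al
    return (dl & al) != 0;   // lines 18-19: testb %dl, %al ; jne .L3
}

// value can only be uninitialised when flag != 0 (JE on line 12 not taken).
// This checks whether the branch outcome changes with the content of value.
static bool outcome_depends_on_value(int flag)
{
    const int samples[] = { 0, 1, -1, 42 };  // stand-ins for garbage in EAX
    const bool first = jumps_to_L3(flag, samples[0]);
    for (int v : samples) {
        if (jumps_to_L3(flag, v) != first)
            return true;
    }
    return false;
}
```

For a non-zero flag the branch never depends on value, which matches the conclusion that the uninitialised register does not influence any decision.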
[Bug c/56727] New: [4.7/4.8] [missed-optimization] Recursive call goes through the PLT unnecessarily
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56727

Bug #: 56727
Summary: [4.7/4.8] [missed-optimization] Recursive call goes through the PLT unnecessarily
Classification: Unclassified
Product: gcc
Version: 4.7.2
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

Consider the following code, compiled with -O2 -fPIC either in C or in C++:

===
void f(TYPE b)
{
    f(0);
}
===

If TYPE is a type of 32- or 64-bit width (int, unsigned, long, long long), GCC generates the following code (-m32, -mx32 or -m64):

===
f:
.L2:
        jmp     .L2
===

If TYPE is shorter than 32-bit (bool, _Bool, char, short), GCC generates the following code (-mx32, -m64):

===
f:
        xorl    %edi, %edi
        jmp     f@PLT
===

and much worse code for -m32.

For whatever reason, GCC decided to place the call via the PLT. That's a missed optimisation: if this function was called, then the PLT must resolve back to itself. What's more, since the argument wasn't used, it's also unnecessary to set it. The same output happens without -fPIC, but in that case there is no PLT.

Tested on:
GCC 4.7.2 (as shipped by Fedora 17)
GCC 4.9 (trunk build from 20130318)

This is a contrived example (infinite recursion) that no one in their right mind would write. But it might point to missed optimisations in legitimate recursive functions.
[Bug c++/57064] New: [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064

Bug #: 57064
Summary: [clarification requested] Which overload with ref-qualifier should be called?
Classification: Unclassified
Product: gcc
Version: 4.8.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

I'm not sure this is a bug. I am requesting clarification on the behaviour.

From the C++11 standard (13.3.3.2 [over.ics.rank] p 3):

struct A {
    void p() &;
    void p() &&;
};
void f()
{
    A a;
    a.p();
    A().p();
}

GCC 4.8.1 correctly calls the lvalue-ref overload first, then the rvalue overload second.

Now suppose the following function:

void g(A &&a)
{
    a.p();
}

Which overload should GCC call? This is my request for clarification. I couldn't find anything specific in the standard that would help explain one way or the other. Intuitively, it would be the rvalue overload, but gcc calls the lvalue overload instead. Making it:

    std::move(a).p();

does not help.
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064

--- Comment #1 from Thiago Macieira 2013-04-25 00:45:00 UTC ---
Here's why I'm asking: QString has members like:

    QString arg(int, [other parameters]) const;

Which are used like so:

    return QString("%1 %2 %3 %4").arg(42).arg(47).arg(-42).arg(-47);
    // returns "42 47 -42 -47"

Right now, each call creates a new temporary, which requires a memory allocation. I'd like to avoid the new temporaries by simply reusing the existing ones:

    QString arg(int, [...]) const &;  // returns a new copy
    QString &&arg(int, [...]) &&;     // modifies this object, return *this;

When these two overloads are present, every other call will be to rvalue-ref and the others to lvalue-ref. That is, the first call (right after the constructor) calls arg()&&, which returns an rvalue-ref. The next call will be to arg()&, which returns a temporary, making the third call to arg()&& again.

I can get the desired behaviour by using the overloads:

    QString arg(int, [...]) const &;  // returns a new copy
    QString arg(int, [...]) &&;       // returns a moved temporary via return std::move(*this);

However, the side-effect of that is that we still have 4 temporaries too many, albeit empty (moved-out) ones. You can see this by counting the number of calls to the destructor:

$ ~/gcc4.8/bin/g++ -fverbose-asm -fno-exceptions -fPIE -std=c++11 -S -o - -I$QTOBJDIR/include /tmp/test.cpp | grep -B1 call.*QStringD
        movq    %rax, %rdi      # tmp82,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp83,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp84,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp85,
        call    _ZN7QStringD1Ev@PLT     #
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064 --- Comment #2 from Thiago Macieira 2013-04-25 00:45:39 UTC --- This was a self-compiled, pristine GCC gcc version 4.8.1 20130420 (prerelease) (GCC) trunk at 198107
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064

--- Comment #3 from Thiago Macieira 2013-04-25 00:53:20 UTC ---
One more note. Given:

void p(A &);
void p(A &&);
void f(A &&a)
{
    p(a);
}

like the member function case, this calls p(A &). It's slightly surprising at first glance, but is a known and documented case. Unlike the member function case, if you do

    p(std::move(a));

it will call p(A &&).
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064

--- Comment #6 from Thiago Macieira 2013-04-25 06:51:33 UTC ---
void f(A &&a)
{
    std::move(a).p();
}

_Z1fO1A:
        .cfi_startproc
        jmp     _ZNR1A1pEv@PLT  #
        .cfi_endproc

Then this looks like a bug in 4.8.1.

But then are we in agreement that a.p() in that function above should call the lvalue-ref overload? It does make the feature slightly less useful for me. It would require writing:

    return std::move(std::move(std::move(std::move(QString("%1 %2 %3 %4")
        .arg(42))
        .arg(47))
        .arg(-42))
        .arg(-47));
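
The cases discussed so far in bug 57064 can be lined up in one small sketch. It shows the behaviour the standard calls for, i.e. assuming a compiler without the 4.8.1 problem described above. The return values 'R' and 'O' are hypothetical tags chosen to echo the R and O letters in the mangled names; the wrapper function names are hypothetical too.

```cpp
#include <cassert>
#include <utility>

struct A {
    char p() &  { return 'R'; }  // lvalue overload (mangled with R)
    char p() && { return 'O'; }  // rvalue overload (mangled with O)
};

static char free_p(A &)  { return 'R'; }
static char free_p(A &&) { return 'O'; }

// "a" is a named rvalue reference, so the expression `a` is an lvalue;
// only std::move(a) or a temporary selects the rvalue overloads.
static char member_named(A &&a) { return a.p(); }
static char member_moved(A &&a) { return std::move(a).p(); }
static char free_named(A &&a)   { return free_p(a); }
static char free_moved(A &&a)   { return free_p(std::move(a)); }
static char member_temporary()  { return A().p(); }
```

The member and free-function cases behave alike: the named reference picks the lvalue overload, and std::move or a temporary picks the rvalue one.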
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064

--- Comment #8 from Thiago Macieira 2013-04-25 07:13:44 UTC ---
Hmm... this might be an effect of the same bug. Can you try this on 4.9?

struct A {
    A p() const &;
    A &&p() &&;
};

void f()
{
    A().p().p();
}

I get:

        leaq    15(%rsp), %rdi  #, tmp60
        call    _ZNO1A1pEv@PLT  #
        movq    %rax, %rdi      # D.69575,
        call    _ZNKR1A1pEv@PLT #

Is this second call supposed to be to R? If it's to O, it's exactly what I need to make the feature useful.
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064 --- Comment #10 from Thiago Macieira 2013-04-25 07:34:21 UTC --- Great! That changes everything. Now I can provide a mutating arg() overload. I'll just need some #ifdef and build magic to add the R, O overloads without removing the overloads that already exist (binary compatibility). It would have been nicer if the lvalue ref overload didn't get extra decoration.
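
A minimal sketch of the mutating arg() idea, assuming the overload pair from comment #1 (const & returning a copy, && returning an rvalue reference to the same object). Text is a hypothetical stand-in for QString, arg() here merely appends the number instead of replacing a %N placeholder, and g_copies is a hypothetical counter used to observe how many copies are made.

```cpp
#include <cassert>
#include <string>
#include <utility>

static int g_copies = 0;  // hypothetical counter of copies made by arg() const &

// Hypothetical stand-in for QString; arg() simply appends " <number>".
struct Text {
    std::string s;

    // lvalue overload: leaves *this alone and returns a modified copy
    Text arg(int v) const &
    {
        ++g_copies;
        Text copy = *this;
        copy.s += ' ' + std::to_string(v);
        return copy;
    }

    // rvalue overload: modifies the temporary in place and hands it back
    Text &&arg(int v) &&
    {
        s += ' ' + std::to_string(v);
        return std::move(*this);
    }
};
```

Chaining on a temporary, as in Text{"x"}.arg(42).arg(47), should stay on the && overload for every call, while chaining on a const lvalue copies once and then continues on the returned temporary. Note that the returned reference is only valid until the end of the full expression, so the result has to be stored by value.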
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064 --- Comment #14 from Thiago Macieira 2013-04-26 06:16:04 UTC --- Understood. The idea is that one would write: QString str = QString("%1 %2").arg(42).arg(43);
[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064 --- Comment #16 from Thiago Macieira 2013-04-26 13:45:35 UTC --- Thanks for the hint. However, returning an rvalue, even if moved-onto, will generate code for the destructor. It's not a matter of efficiency, just of code size. Anyway, I'll do some benchmarks, after I figure out how to work around the binary compatibility break imposed by having the & in the function that already existed.
[Bug other/57202] New: Please make the intrinsics headers like immintrin.h be usable without compiler flags
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202

Bug #: 57202
Summary: Please make the intrinsics headers like immintrin.h be usable without compiler flags
Classification: Unclassified
Product: gcc
Version: 4.8.1
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

Please make all headers for intrinsics be includable without special compiler flags. In other words, I want the following to work:

$ gcc -fsyntax-only -include smmintrin.h -xc /dev/null
In file included from :0:0:
/usr/lib/gcc/x86_64-redhat-linux/4.7.2/include/smmintrin.h:31:3: error: #error "SSE4.1 instruction set not enabled"

Note it works with ICC:

$ icc -fsyntax-only -include smmintrin.h -xc /dev/null && echo works
works

Not only that, please make all the intrinsics functions be defined and ready to be used. This is necessary so that the following source file could compile even if -msse4.1 is not passed on the command-line (adapted from http://gcc.gnu.org/gcc-4.8/changes.html):

#include

__attribute__ ((target ("default")))
int foo(void)
{
    return 1;
}

__attribute__ ((target ("sse4.2")))
int foo(void)
{
    __m128i v;
    _mm_blendv_epi8(v, v, v);
    return 2;
}

There are several reasons for that, number one among them that it makes the GCC 4.8 feature above actually useful for non-trivial code. Also, passing extra options on the command-line is simply not an option for C++ code (where the feature above is useful) if that code is moderately complex and uses inline functions, and those options cannot be used if LTO is to be used (bug 54231).
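
The kind of code the request in bug 57202 is meant to enable might look like the sketch below: one function compiled for SSE4.1 through the target attribute, a baseline fallback, and a run-time choice between them. This assumes an x86 target and a compiler recent enough to accept the intrinsic inside a target("sse4.1") function without -msse4.1 on the command line, which is precisely what the report says GCC 4.8 did not allow. All function names are hypothetical.

```cpp
#include <cassert>
#include <immintrin.h>  // assumes x86 and a compiler that allows this without -msse4.1

// Compiled for SSE4.1 regardless of the command-line flags.
__attribute__((target("sse4.1")))
static int low_lane_blend_sse41(int a, int b, bool take_b)
{
    __m128i va = _mm_cvtsi32_si128(a);
    __m128i vb = _mm_cvtsi32_si128(b);
    __m128i mask = _mm_set1_epi8(take_b ? -1 : 0);
    // blendv picks bytes from vb where the mask byte has its top bit set
    return _mm_cvtsi128_si32(_mm_blendv_epi8(va, vb, mask));
}

// Baseline version for CPUs without SSE4.1.
static int low_lane_blend_generic(int a, int b, bool take_b)
{
    return take_b ? b : a;
}

// Hypothetical dispatcher: choose the implementation at run time.
static int low_lane_blend(int a, int b, bool take_b)
{
    if (__builtin_cpu_supports("sse4.1"))
        return low_lane_blend_sse41(a, b, take_b);
    return low_lane_blend_generic(a, b, take_b);
}
```

Both paths return the same value, so callers do not need to know which one ran.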
[Bug target/57202] Please make the intrinsics headers like immintrin.h be usable without compiler flags
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202

Thiago Macieira changed:

What   |Removed             |Added
Target |x86_64-*-* i?86-*-* |x86_64-*-* i?86-*-* arm-*-*

--- Comment #1 from Thiago Macieira 2013-05-08 07:03:26 UTC ---
This also applies to arm_neon.h.
[Bug c/54202] New: Overeager warning about freeing non-heap objects
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

Bug #: 54202
Summary: Overeager warning about freeing non-heap objects
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

GCC 4.7 has a warning about freeing non-heap objects that is way too eager. Compiling the following code in C or C++:

==
#include
typedef struct Data
{
    int refcount;
} Data;
extern const Data shared_null;

Data *allocate()
{
    return (Data *)(&shared_null);
}

void dispose(Data *d)
{
    if (d->refcount == 0)
        free(d);
}

void f()
{
    Data *d = allocate();
    dispose(d);
}

Produces the following warning:

test.c: In function 'f'
test.c:17:13: warning: attempt to free a non-heap object 'shared_null' [-Wfree-nonheap-object]

The warning is overeager because it says "attempt to free" without indicating that it's only a possibility. GCC cannot prove that the call to free() will happen with that particular pointer, as the value of shared_null.refcount is not known.

The warning should either:
a) be modified to indicate it's only a possibility and the compiler can't prove it;
b) be issued only when the compiler is sure that the free will happen on non-heap objects.

Or both, by having two warnings: one for when it's sure and one for when it isn't.
[Bug c/54202] Overeager warning about freeing non-heap objects
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202 --- Comment #2 from Thiago Macieira 2012-08-08 14:21:59 UTC --- To be honest, I don't want false-positive warnings. The code and data are constructed so that it never frees the non-heap object (it has a reference count of -1). If the driver to this warning can't be improved to be certain, I'd recommend at least changing the text, like the -Wuninitialized one: 'varname' may be used uninitialized in this function When GCC warnings are assertive, like the "will break strict aliasing" one, we go an extra mile to try and fix them.
[Bug c/54202] Overeager warning about freeing non-heap objects
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

--- Comment #4 from Thiago Macieira 2012-08-08 14:53:13 UTC ---
(In reply to comment #3)
> Note that even for the uninitialized use case we warn for functions
> that may be never executed at runtime. So - are you happy with the
> definitive warning if the free () call happens unconditionally when
> the function is entered?

I'm not sure I follow your reasoning. Please bear with me.

If GCC can prove that the function will be called with a non-heap object, print the warning, even if the function in question never gets executed. That is, after inlining, code like:

extern Data shared_null;
void dispose()
{
    free(&shared_null);
}

*should* print the warning, regardless of whether dispose() ever gets run.

My point was that the code that GCC was seeing, after inlining, was:

void f()
{
    if (shared_null.refcount == 0)
        free(&shared_null);
}

In which case, the call to free() isn't unconditional. In this case, the warning should either be suppressed, or indicate that it's only a possibility instead of being assertive.
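
The data layout that makes the free() in bug 54202 unreachable for the static object can be sketched as follows. The reference count of -1 for the shared object comes from comment #2; the helper names allocate_shared and allocate_heap and the boolean return value of dispose() are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <cstdlib>

struct Data {
    int refcount;
};

// Static sentinel; its reference count of -1 never reaches 0,
// so dispose() never passes it to free().
static Data shared_null = { -1 };

static Data *allocate_shared() { return &shared_null; }

// Hypothetical heap path, with the initial count chosen by the caller.
static Data *allocate_heap(int refcount)
{
    Data *d = static_cast<Data *>(std::malloc(sizeof(Data)));
    assert(d != nullptr);
    d->refcount = refcount;
    return d;
}

// Same test as in the report; returns true if the object was freed.
static bool dispose(Data *d)
{
    if (d->refcount == 0) {
        std::free(d);
        return true;
    }
    return false;
}
```

Depending on the GCC version and optimisation level, a sketch like this may itself draw the -Wfree-nonheap-object warning discussed in the report, even though the free() call cannot be reached for shared_null.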
[Bug c/54231] New: LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

Bug #: 54231
Summary: LTO generates code for the wrong CPU if different options used
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

Created attachment 27992
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27992
Makefile

Summary:
Given the following code:
=
#include
void BZERO(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i*)ptr, zero);
        ptr += 16;
    }
}
=
When compiled twice, once for SSE2 and once for AVX (so we get VEX-prefixed code), under LTO gcc will generate both cases using VEX. See the attached Makefile.

Long description:
A library or program that attempts to determine at runtime whether certain CPU features, like AVX support, are available may need to compile different compilation units with different compiler flags. The example I am providing is a simple function that zeroes out a segment of memory aligned to 16 bytes. It's provided by the same compilation unit, which is compiled twice, but that does not seem to be relevant.

The idea is that each of these two functions would be called by a dispatcher function, after verifying the result of CPUID.

However, if you compile the code with LTO (e.g., by make CFLAGS=-flto with the attached Makefile), GCC will apply the highest CPU setting to all compilation units. This defeats the runtime detection technique: in this example, both functions will contain AVX code, which would end up being run on computers without AVX support.

This might be intentional. If so, please close this bug report.

However, I would recommend that the behaviour be fixed: the ability to use LTO with different CPU settings would allow for better inlining of the functions and suppressing unnecessary function calls. The bzero example is a good one.
[Bug c/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231 --- Comment #1 from Thiago Macieira 2012-08-11 22:30:50 UTC --- Created attachment 27993 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27993 main.c
[Bug c/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #2 from Thiago Macieira 2012-08-11 22:33:31 UTC ---
When adding the following source file to the library build:

#include <stddef.h>
void bzero_sse2(char *, size_t);
void bzero_avx(char *, size_t);
extern int avx_supported;
void my_bzero(char *ptr, size_t n)
{
    if (avx_supported)
        bzero_avx(ptr, n);
    else
        bzero_sse2(ptr, n);
}

and compiling everything with -O2 -flto, GCC produces the following function:

02e0 <my_bzero>:
 2e0:   mov    0x200171(%rip),%rax        # 200458
 2e7:   mov    (%rax),%eax
 2e9:   test   %eax,%eax
 2eb:   jne    310
 2ed:   test   %rsi,%rsi
 2f0:   vpxor  %xmm0,%xmm0,%xmm0
 2f4:   je     30e
 2f6:   nopw   %cs:0x0(%rax,%rax,1)
 300:   vmovntdq %xmm0,(%rdi)
 304:   add    $0x10,%rdi
 308:   sub    $0x1,%rsi
 30c:   jne    300
 30e:   repz retq
 310:   test   %rsi,%rsi
 313:   je     30e
 315:   vpxor  %xmm0,%xmm0,%xmm0
 319:   nopl   0x0(%rax)
 320:   vmovntdq %xmm0,(%rdi)
 324:   add    $0x10,%rdi
 328:   sub    $0x1,%rsi
 32c:   jne    320
 32e:   repz retq

As can be seen, VEX-prefixed instructions were used in both cases.
[Bug c/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #3 from Thiago Macieira 2012-08-11 22:36:20 UTC ---
Another note: it appears the Intel compiler has the same bug. It produces the following code when compiling with -O2 -ipo:

0340 <my_bzero>:
 340:   dec    %rsi
 343:   mov    0x2001ae(%rip),%rax        # 2004f8 <_DYNAMIC+0xe0>
 34a:   vpxor  %xmm0,%xmm0,%xmm0
 34e:   cmpl   $0x0,(%rax)
 351:   je     36c
 353:   cmp    $0x,%rsi
 357:   je     383
 359:   dec    %rsi
 35c:   vmovntdq %xmm0,(%rdi)
 360:   add    $0x10,%rdi
 364:   cmp    $0x,%rsi
 368:   jne    359
 36a:   jmp    383
 36c:   cmp    $0x,%rsi
 370:   je     383
 372:   dec    %rsi
 375:   vmovntdq %xmm0,(%rdi)
 379:   add    $0x10,%rdi
 37d:   cmp    $0x,%rsi
 381:   jne    372
 383:   retq
 384:   nopl   0x0(%rax,%rax,1)
 389:   nopl   0x0(%rax)

Note, additionally, that there's an instruction-scheduling issue: a VPXOR instruction was scheduled before the test of the CPU features.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #6 from Thiago Macieira 2012-08-11 23:23:39 UTC ---
(In reply to comment #5)
> "Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
> are just wrappers around builtin functions. It may work correctly if you put
> the bzero functions in two separate files or call the builtins directly (a
> variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
> defined, I don't think it will ever work.

They *are* in separate files already. Calling the builtin directly instead of the intrinsic wrapper might work, but I did not test it because it's not acceptable, as the code would be GCC-specific.

> Have you considered using ifunc?

IFUNC is also irrelevant: in order to use it, I need to have two separate source files which are compiled with different compiler settings, so we end up where we started: the bzero_sse2() function will have AVX code.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #9 from Thiago Macieira 2012-08-13 09:44:51 UTC ---
(In reply to comment #8)
> If you do something like
>
> gcc -c t1.c -mavx -flto
> gcc -c t2.c -msse2 -flto
> gcc t1.o t2.o -flto
>
> then the link step will use -mavx -msse2, that is, target options are
> concatenated.

Indeed. What I'm asking for is that each source file be compiled with its own target options. I realise this is a request for enhancement, though.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #10 from Thiago Macieira 2012-08-13 09:53:32 UTC ---
Another test:

$ cat main_avx.c
#define BZERO bzero_avx
#pragma GCC target ("avx")
#include "main.c"

$ cat main_sse2.c
#define BZERO bzero_sse2
#pragma GCC target ("sse2")
#include "main.c"

$ cat main.c
#include <immintrin.h>
void BZERO(char *ptr, size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {
        _mm_stream_si128((__m128i*)ptr, zero);
        ptr += 16;
    }
}

$ gcc -flto -O2 -shared -o libtest.so main_avx.c main_sse2.c
$ objdump -Cdr --no-show-raw-insn libtest.so
[...]
0650 :
 650:   test   %rsi,%rsi
 653:   pxor   %xmm0,%xmm0
 657:   je     66e
 659:   nopl   0x0(%rax)
 660:   movntdq %xmm0,(%rdi)
 664:   add    $0x10,%rdi
 668:   sub    $0x1,%rsi
 66c:   jne    660
 66e:   repz retq

0670 :
 670:   test   %rsi,%rsi
 673:   pxor   %xmm0,%xmm0
 677:   je     68e
 679:   nopl   0x0(%rax)
 680:   movntdq %xmm0,(%rdi)
 684:   add    $0x10,%rdi
 688:   sub    $0x1,%rsi
 68c:   jne    680
 68e:   repz retq
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #11 from Thiago Macieira 2012-08-13 10:12:48 UTC ---
Attaching __attribute__((target("xxx"))) to the function does help. It generates the following with the my_bzero function from comment 2:

02e0 <bzero_avx>:
 2e0:   test   %rsi,%rsi
 2e3:   vpxor  %xmm0,%xmm0,%xmm0
 2e7:   je     2fe
 2e9:   nopl   0x0(%rax)
 2f0:   vmovntdq %xmm0,(%rdi)
 2f4:   add    $0x10,%rdi
 2f8:   sub    $0x1,%rsi
 2fc:   jne    2f0
 2fe:   repz retq

0300 <my_bzero>:
 300:   mov    0x200171(%rip),%rax        # 200478
 307:   mov    (%rax),%eax
 309:   test   %eax,%eax
 30b:   jne    330
 30d:   test   %rsi,%rsi
 310:   pxor   %xmm0,%xmm0
 314:   je     332
 316:   nopw   %cs:0x0(%rax,%rax,1)
 320:   movntdq %xmm0,(%rdi)
 324:   add    $0x10,%rdi
 328:   sub    $0x1,%rsi
 32c:   jne    320
 32e:   repz retq
 330:   jmp    2e0
 332:   repz retq

This workaround might be useful for me in a few places where the code inlining provided by LTO was desired (even though, in this example, the AVX variant is exactly what it would be if no LTO had been used). But it won't work without major changes to the code if I have 400+ functions in a file, plus possibly inlines from headers, to be compiled.
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #13 from Thiago Macieira 2012-08-13 12:13:40 UTC ---
(In reply to comment #12)
> Yes, there are similar option-related bugs for this. Note somebody needs
> to sit down and document the desired semantics of combining translation
> units T1 and T2, compiled with different options OP1 and OP2, at link-time
> with options OP3. Desired semantics including which cross-file optimizations
> (inlining?) are possible.

From my (admittedly restricted) point of view, inlining should be possible, provided the following conditions hold:

- when inlining a function with a "lower" optimisation / target setting, apply the outer scope's setting to the inlined code

- when inlining a function with a higher target requirement, inlining should be done only in the sense of partial function splitting, prologue, epilogues, constant propagation, etc.

In the case that I pasted, for example, I'd like GCC to realise that it has already tested if the counter variable is 0, then forego that test in the inlined, inner function.

Worst case scenario, simply forego inlining completely. Then the code would simply be no worse than the non-LTO case.
[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172 --- Comment #4 from Thiago Macieira 2012-08-30 07:52:31 UTC --- I'll post today. I haven't yet looked up which mailing list you're even talking about.
[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172 --- Comment #7 from Thiago Macieira 2012-09-01 08:26:05 UTC --- I posted the patches on Thursday, three patches because I found one more issue, to both lists. Will they be picked up from there and applied to the source tree?
[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

--- Comment #9 from Thiago Macieira 2012-09-04 17:47:08 UTC ---
(In reply to comment #8)
> (In reply to comment #7)
> > I posted the patches on Thursday, three patches because I found one more
> > issue, to both lists.
>
> I haven't seen anything from you arrive on gcc-patches.
> But I will say that the patch attached here looks good.

http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02026.html
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02027.html
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02028.html
[Bug c++/54485] g++ should diagnose default arguments in out-of-line definitions for template class member functions
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54485

Thiago Macieira changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiago at kde dot org

--- Comment #1 from Thiago Macieira 2012-09-05 07:27:08 UTC ---
FYI

$ icpc -c a.cc
[Bug lto/54231] LTO generates code for the wrong CPU if different options used
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #14 from Thiago Macieira 2012-09-12 13:02:23 UTC ---
From GCC's own manual (node "Function attributes"):

    On the 386/x86_64 and PowerPC backends, the inliner will not inline a function that has different target options than the caller, unless the callee has a subset of the target options of the caller. For example a function declared with `target("sse3")' can inline a function with `target("sse2")', since `-msse3' implies `-msse2'.
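
The subset rule quoted from the manual can be modelled with a few bit operations. The sketch below is only a model of the rule as quoted, not GCC's implementation: the feature bits, their values, and the names can_inline and check_manual_example are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature bits; names and values are illustrative only.
enum : std::uint32_t {
    FEATURE_SSE2 = 1u << 0,
    FEATURE_SSE3 = 1u << 1,
    FEATURE_AVX  = 1u << 2,
};

// The callee may be inlined only if every feature it requires is also
// enabled in the caller, i.e. the callee's set is a subset of the caller's.
constexpr bool can_inline(std::uint32_t caller, std::uint32_t callee)
{
    return (callee & ~caller) == 0;
}

// The manual's own example: -msse3 implies -msse2.
inline void check_manual_example()
{
    const std::uint32_t sse2 = FEATURE_SSE2;
    const std::uint32_t sse3 = FEATURE_SSE2 | FEATURE_SSE3;
    assert(can_inline(sse3, sse2));     // sse3 caller may inline sse2 callee
    assert(!can_inline(sse2, sse3));    // the reverse is refused
}
```

Under this model, an SSE2-only dispatcher could never inline an AVX callee, which matches the behaviour seen in comment 11 once the target attribute is attached per function.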
[Bug c++/54988] fpmath=sse target pragma causes inlining failure because of target specific option mismatch
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54988

--- Comment #3 from Thiago Macieira 2012-10-22 14:43:11 UTC ---
This might be as I pointed out in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231, quoting the manual (node "Function attributes"):

    On the 386/x86_64 and PowerPC backends, the inliner will not inline a function that has different target options than the caller, unless the callee has a subset of the target options of the caller. For example a function declared with `target("sse3")' can inline a function with `target("sse2")', since `-msse3' implies `-msse2'.

My guess was that we were forcing the inlining (via always_inline) of a function that has different target options. But I guess that doesn't explain why it happens only in C++ and only in optimising mode. Does always_inline inline on -O0 too?
[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247

--- Comment #10 from Thiago Macieira 2010-12-22 10:35:23 UTC ---
This is still not fixed. I can reproduce now with a different testcase, in 4.5.1. However, this time, the same code works fine in 4.4.

The reason is again accessing an array out-of-bounds for elements that we know to be there. Pay attention to the way operator== is implemented in the following code. If I compile it with -O1, it prints "true" as it should. If I compile it with -O2, it prints "false".

If I compile it with -O1 -finline-small-functions -finline -findirect-inlining -fstrict-overflow and compare the disassembly with -O2 and a suitable list of -fno-*, the code is exactly identical, except for some instructions that should perform the copy of half of m1's data into m3. So in the end the comparison fails due to comparing to garbage.

=== code ===
#include <stdio.h>

template <int N, int M, typename T>
class QGenericMatrix
{
public:
    QGenericMatrix();
    QGenericMatrix(const QGenericMatrix<N, M, T>& other);
    explicit QGenericMatrix(const T *values);

    bool operator==(const QGenericMatrix<N, M, T>& other) const;

private:
    T m[N][M];    // Column-major order to match OpenGL.

    QGenericMatrix(int) {}       // Construct without initializing identity matrix
};

template <int N, int M, typename T>
QGenericMatrix<N, M, T>::QGenericMatrix(const QGenericMatrix<N, M, T>& other)
{
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < M; ++row)
            m[col][row] = other.m[col][row];
}

template <int N, int M, typename T>
QGenericMatrix<N, M, T>::QGenericMatrix(const T *values)
{
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < M; ++row)
            m[col][row] = values[row * N + col];
}

template <int N, int M, typename T>
bool QGenericMatrix<N, M, T>::operator==(const QGenericMatrix<N, M, T>& other) const
{
    // Walks all N * M elements through m[0], i.e. past the end of the first row.
    for (int index = 0; index < N * M; ++index) {
        if (m[0][index] != other.m[0][index])
            return false;
    }
    return true;
}

typedef double qreal;
typedef QGenericMatrix<2, 2, qreal> QMatrix2x2;

int main(int , char**)
{
    qreal m1Data[] = {0.0, 0.0, 0.0, 0.0};
    QMatrix2x2 m1(m1Data);
    QMatrix2x2 m3 = m1;
    puts((m1 == m3) ? "true" : "false");
}
=== code ===

common args:
  -fno-exceptions -fno-rtti -fverbose-asm -march=core2 -mfpmath=sse
  (though x87 math also shows the same problem)

prints "true" with:
  -O1 -finline-small-functions -finline -findirect-inlining -fstrict-overflow

prints "false" with:
  -O2 -fno-align-functions -fno-align-jumps -fno-align-labels -fno-caller-saves -fno-tree-switch-conversion -fno-tree-vrp -fno-crossjumping -fno-cse-follow-jumps -fno-expensive-optimizations -fno-gcse -fno-ipa-cp -fno-ipa-sra -fno-optimize-register-move -fno-optimize-sibling-calls -fno-peephole2 -fno-regmove -fno-reorder-blocks -fno-reorder-functions -fno-rerun-cse-after-loop -fno-schedule-insns2 -fno-strict-aliasing -fno-strict-aliasing -fno-thread-jumps -fno-tree-builtin-call-dce -fno-tree-pre
[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247

--- Comment #12 from Thiago Macieira 2010-12-22 19:55:38 UTC ---
(In reply to comment #11)
> >The reason is again accessing an array out-of-bounds for elements that we
> >know to be there.
>
> No that is undefined and different from the original testcase.

Ok. Shall I open a new report with the new information?
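
For comparison, a comparison operator that stays within the bounds of each array dimension could look like the sketch below. The class name Matrix and the helper copies_compare_equal are hypothetical; this is a reduced stand-in for the test case, not the Qt source.

```cpp
#include <cassert>

// Hypothetical, reduced stand-in for the matrix class in the test case.
template <int N, int M, typename T>
class Matrix
{
public:
    explicit Matrix(const T *values)
    {
        for (int col = 0; col < N; ++col)
            for (int row = 0; row < M; ++row)
                m[col][row] = values[row * N + col];
    }

    bool operator==(const Matrix &other) const
    {
        // Both indices stay inside their own bounds, so no element of
        // m[0] is ever addressed past its M entries.
        for (int col = 0; col < N; ++col)
            for (int row = 0; row < M; ++row)
                if (m[col][row] != other.m[col][row])
                    return false;
        return true;
    }

private:
    T m[N][M];    // column-major, as in the original
};

// Mirrors main() from the test case: a copy must compare equal.
inline bool copies_compare_equal()
{
    const double data[] = {0.0, 0.0, 0.0, 0.0};
    Matrix<2, 2, double> m1(data);
    Matrix<2, 2, double> m3 = m1;
    assert(m1 == m1);
    return m1 == m3;
}
```

With the nested loops the optimiser has no out-of-bounds access to reason about, so the result should not depend on the optimisation level.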
[Bug c++/57854] New: Would like to have a warning for virtual overrides without C++11 "override" keyword
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57854

Bug ID: 57854
Summary: Would like to have a warning for virtual overrides without C++11 "override" keyword
Product: gcc
Version: 4.8.1
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org

I would like a new (optional) warning that would point out every C++ virtual override that is done without the C++11 keyword that indicates an override. By necessity, this warning would only be permitted in C++11 mode.

The keyword was added so that developers would let the compiler know when an override is intended. However, the [[base_check]] attribute was dropped from C++11 prior to standardisation, so there's no way (currently) to ask the compiler to let us know which classes are doing overrides without the keyword.

This warning should be printed for the following, otherwise perfectly correct, code:

struct Base {
    virtual void v();
};
struct Derived: Base {
    virtual void v(); // warning happens here
};

This warning should not be in -Wall. It should be in -Weffc++. I'll leave it up to you whether it's in -Wextra.
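
For reference, this is roughly what the code looks like once the keyword is used, which is the form the requested warning would steer developers towards. The return values and the helper call_through_base are additions for illustration only.

```cpp
#include <cassert>

struct Base
{
    virtual ~Base() = default;
    virtual int v() const { return 1; }
};

struct Derived : Base
{
    // With 'override', the compiler rejects this declaration if it ever
    // stops matching a virtual function in Base (for example, if the
    // const qualifier is dropped on one side only).
    int v() const override { return 2; }
};

// Calls v() through the base interface to show that dispatch reaches Derived.
inline int call_through_base(const Base &b)
{
    return b.v();
}

inline void check_dispatch()
{
    Derived d;
    assert(call_through_base(d) == 2);
}
```

Without the keyword, a signature mismatch would silently declare a new, unrelated virtual function instead of an override.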
[Bug libstdc++/54172] New: [4.7 Regression] __cxa_guard_acquire thread-safety issue
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

Bug #: 54172
Summary: [4.7 Regression] __cxa_guard_acquire thread-safety issue
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: libstdc++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org

Created attachment 27936 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27936
Proposed fix.

In commit 184110, the __cxa_guard_acquire implementation in libsupc++/guard.cc has been updated to use the new __atomic_* intrinsics instead of the old __sync_* ones. I believe this has introduced a regression due to a race condition.

== Proof ==

While debugging a program, I set a hardware watchpoint on a guard variable and set gdb to continue execution upon stop. The output was:

Hardware watchpoint 1: _ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 0
New value = 256
0x00381205f101 in __cxxabiv1::__cxa_guard_acquire (g=0x77dc9a60) at ../../../../libstdc++-v3/libsupc++/guard.cc:254
254         if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,

Hardware watchpoint 1: _ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 256
New value = 1
__cxxabiv1::__cxa_guard_release (g=0x77dc9a60) at ../../../../libstdc++-v3/libsupc++/guard.cc:376
376         if ((old & waiting_bit) != 0)
[Switching to Thread 0x7fffebfff700 (LWP 113412)]

Hardware watchpoint 1: _ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 1
New value = 256
0x00381205f101 in __cxxabiv1::__cxa_guard_acquire (g=0x77dc9a60) at ../../../../libstdc++-v3/libsupc++/guard.cc:254
254         if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
[New Thread 0x70a2d700 (LWP 113413)]

Hardware watchpoint 1: _ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 256
New value = 1
__cxxabiv1::__cxa_guard_release (g=0x77dc9a60) at ../../../../libstdc++-v3/libsupc++/guard.cc:376
376         if ((old & waiting_bit) != 0)

As can be seen from the output, the guard variable transitioned from 0 -> 256 -> 1 -> 256 -> 1.

== Analysis ==

The code in guard.cc is:

    int expected(0);
    const int guard_bit = _GLIBCXX_GUARD_BIT;
    const int pending_bit = _GLIBCXX_GUARD_PENDING_BIT;
    const int waiting_bit = _GLIBCXX_GUARD_WAITING_BIT;

    while (1) {
        if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED)) {
            // This thread should do the initialization.
            return 1;
        }

        if (expected == guard_bit) {
            // Already initialized.
            return 0;
        }

        if (expected == pending_bit) {
            int newv = expected | waiting_bit;
            if (!__atomic_compare_exchange_n(gi, &expected, newv, false,
                                             __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
                continue;

            expected = newv;
        }

        syscall (SYS_futex, gi, _GLIBCXX_FUTEX_WAIT, expected, 0);
    }

We have two threads running and they both reach __cxa_guard_acquire more or less at the same time. On one thread, the execution is the expected path: the first CAS succeeds and that transitions the guard variable from 0 to 256. That thread will initialise the static.

In the second thread, the CAS fails, so it will proceed to the second CAS, trying to replace 256 with 768 (to indicate it's going to sleep). In the meantime, the first thread calls __cxa_guard_release, which exchanges the 256 with a 1.

Therefore, on the second thread, the second CAS fails and now expected == 1 (it got updated). The continue makes it return to the first CAS with expected == 1 and that one succeeds, replacing 1 with 256, which is wrong.

== Solution ==

This issue appears to be caused by the new atomic intrinsics updating the expected variable, combined with the looping. If the second CAS fails, the code needs to inspect the value stored in expected to determine what to do next. The possible values are:

  0: the other thread aborted, we should try again -> continue
  1: initialisation completed, we should return 0
  (256: can't happen)
  768: yet another thread succeeded in setting the waiting bit, we should sleep

The attached patch is a proposed solution to the problem, but I have not been able to test it yet.
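
The state handling outlined in the Solution section can be sketched with std::atomic as below. This is a simplified model, not the attached patch and not the libstdc++ code: the function name guard_acquire_model is hypothetical, the bit values 1, 256 and 512 are taken from the transitions shown in the report, the futex wait is replaced by a plain yield, and the failure memory order is acquire rather than relaxed as a conservative choice for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

constexpr int guard_bit   = 1;      // initialisation finished
constexpr int pending_bit = 256;    // some thread is initialising
constexpr int waiting_bit = 512;    // at least one thread is waiting

// Returns 1 if the caller must run the initialiser, 0 if it is already done.
inline int guard_acquire_model(std::atomic<int> &gi)
{
    for (;;) {
        int expected = 0;   // reset every iteration: a failed CAS overwrites it
        if (gi.compare_exchange_strong(expected, pending_bit,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire))
            return 1;       // this thread should do the initialisation

        if (expected == guard_bit)
            return 0;       // already initialised

        if (expected == pending_bit) {
            const int newv = expected | waiting_bit;
            if (!gi.compare_exchange_strong(expected, newv,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire)) {
                // Inspect what the failed CAS observed before looping.
                if (expected == guard_bit)
                    return 0;       // initialisation completed meanwhile
                if (expected == 0)
                    continue;       // the other thread aborted: try again
                // otherwise the waiting bit is already set: wait below
            }
        }

        std::this_thread::yield();  // placeholder for the futex wait
    }
}

// Single-threaded sanity check of the two terminal paths.
inline void check_terminal_paths()
{
    std::atomic<int> fresh(0);
    assert(guard_acquire_model(fresh) == 1);
    std::atomic<int> done(guard_bit);
    assert(guard_acquire_model(done) == 0);
}
```

The important property is that a guard value of 1 can never be fed back into the first compare-and-swap as the expected value, so the 1 -> 256 transition seen in the watchpoint log cannot occur in this model.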
[Bug c/58889] New: GCC 4.9 fails to compile certain functions with intrinsics with __attribute__((target))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58889

Bug ID: 58889
Summary: GCC 4.9 fails to compile certain functions with intrinsics with __attribute__((target))
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org

Source:

$ cat t.c
#include <immintrin.h>
__attribute__((target("avx2")))
int f(void *ptr)
{
    return _mm256_movemask_epi8(_mm256_loadu_si256((__m256i*)ptr));
}

Works:

$ ~/gcc4.9/bin/g++ -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=core2 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=core2 -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=nocona -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=nocona -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=prescott -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=pentium4 -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=pentium3 -m32 -S -O3 -o /dev/null t.c

Fails:

$ ~/gcc4.9/bin/g++ -march=pentium2 -m32 -S -O3 -o /dev/null t.c
avxintrin.h: In function ‘int f(void*)’:
avxintrin.h:890:1: error: inlining failed in call to always_inline ‘__m256i _mm256_loadu_si256(const __m256i*)’: target specific option mismatch
 _mm256_loadu_si256 (__m256i const *__P)
 ^
[...]
g++: internal compiler error: Segmentation fault (program cc1plus)
0x409614 execute
        /home/thiago/src/gcc/gcc/gcc.c:2864

$ ~/gcc4.9/bin/g++ -march=pentium -m32 -S -O3 -o /dev/null t.c
avxintrin.h: In function ‘int f(void*)’:
avxintrin.h:890:1: error: inlining failed in call to always_inline ‘__m256i _mm256_loadu_si256(const __m256i*)’: target specific option mismatch
 _mm256_loadu_si256 (__m256i const *__P)
 ^
[...]
[no segfault]

This is an unpatched, pristine GCC, built from trunk@203862.
System: Linux 64-bit (Fedora 17)
Configure options: --enable-lang=c,c++
[Bug c/58889] GCC 4.9 fails to compile certain functions with intrinsics with __attribute__((target))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58889 --- Comment #1 from Thiago Macieira --- This problem also happens with other combinations of functions in use and compiler options. My original problem happened on a 64-bit build with -march=corei7-avx and a function with __attribute__((target("avx2"))).
[Bug target/59539] New: Missed optimisation: VEX-prefixed operations don't need aligned data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

Bug ID: 59539
Summary: Missed optimisation: VEX-prefixed operations don't need aligned data
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org

Consider the following code:

#include <immintrin.h>
int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}

If compiled with -O2 -mavx, it produces the following code with GCC 4.9 (current trunk):

f:
        vmovdqu (%rdi), %xmm0
        vmovdqu (%rsi), %xmm1
        vpcmpeqw        %xmm1, %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret

One of the two VMOVDQU instructions is unnecessary, since the VEX-prefixed VPCMPEQW instruction can do unaligned loads without faulting.

The Intel Software Developer's Manual Volume 1, Chapter 14 says in 14.9 "Memory alignment":

> With the exception of explicitly aligned 16 or 32 byte SIMD load/store
> instructions, most VEX-encoded, arithmetic and data processing instructions
> operate in a flexible environment regarding memory address alignment, i.e.
> VEX-encoded instruction with 32-byte or 16-byte load semantics will support
> unaligned load operation by default. Memory arguments for most instructions
> with VEX prefix operate normally without causing #GP(0) on any
> byte-granularity alignment (unlike Legacy SSE instructions). The instructions
> that require explicit memory alignment requirements are listed in Table 14-22.

Clang and ICC have already implemented this optimisation. Clang 3.3 produces:

f:                                      # @f
        vmovdqu (%rsi), %xmm0
        vpcmpeqw        (%rdi), %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret

Similarly, ICC 14 produces:

f:
        vmovdqu   (%rdi), %xmm0
        vpcmpeqw  (%rsi), %xmm0, %xmm1
        vpmovmskb %xmm1, %eax
        ret
[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #2 from Thiago Macieira ---
I have to use _mm_loadu_si128 because non-VEX SSE requires explicit unaligned loads.

Here's more food for thought:

    __m128i result = _mm_cmpeq_epi16(*(__m128i*)p1, *(__m128i*)p2);

For non-VEX code, so far the compiler emitted one MOVDQA and one PCMPEQW if it could, enforcing that both sources needed to be aligned. With VEX, VPCMPEQW can do unaligned, so should the other load also be changed to VMOVDQU instead of VMOVDQA?

Similarly, if I use _mm_load_si128 (not loadu), can the compiler combine one load into the next instruction? Performance-wise, the execution will be the same, with one fewer instruction to be retired (so, better); but it will not cause an unaligned fault if the pointer isn't aligned.
[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539 --- Comment #12 from Thiago Macieira --- Thanks, rebuilding!
[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #13 from Thiago Macieira ---
I can't confirm. trunk@206091:

$ ~/gcc4.9/bin/gcc -mavx -S -o - -O3 -xc - <<<'#include <immintrin.h>
int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}
'
        .file   ""
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .text
.LHOTB0:
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB1073:
        .cfi_startproc
        vmovdqu (%rdi), %xmm0
        vmovdqu (%rsi), %xmm1
        vpcmpeqw        %xmm1, %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret
        .cfi_endproc
.LFE1073:
        .size   f, .-f
        .section        .text.unlikely
.LCOLDE0:
        .text
.LHOTE0:
        .ident  "GCC: (GNU) 4.9.0 20131121 (experimental)"
        .section        .note.GNU-stack,"",@progbits
[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #14 from Thiago Macieira ---
*facepalm* I had forgotten to make install!

It works:

$ ~/gcc4.9/bin/gcc -mavx -S -o - -O3 -xc - <<<'#include <immintrin.h>
int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}
'
        .file   ""
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
        .text
.LHOTB0:
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB1073:
        .cfi_startproc
        vmovdqu (%rsi), %xmm0
        vpcmpeqw        (%rdi), %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret
        .cfi_endproc
.LFE1073:
        .size   f, .-f
        .section        .text.unlikely
.LCOLDE0:
        .text
.LHOTE0:
        .ident  "GCC: (GNU) 4.9.0 20131218 (experimental)"
        .section        .note.GNU-stack,"",@progbits
[Bug target/19520] protected function pointer doesn't work right
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520 --- Comment #23 from Thiago Macieira 2012-01-16 14:56:50 UTC --- I've changed my opinion on this matter. I think GCC is generating the proper code (most efficient). It's ld that should accept this decision.
[Bug target/19520] protected function pointer doesn't work right
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520

--- Comment #26 from Thiago Macieira 2012-01-18 13:28:05 UTC ---
ld *can* link, it just chooses not to.

$ cat > foo.c
__attribute__((visibility("protected")))
void * foo (void) { return (void *)foo; }
$ gcc -fPIC -shared foo.c
/usr/bin/ld: /tmp/cclrufLV.o: relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object
/usr/bin/ld: final link failed: Bad value
collect2: ld returned 1 exit status
$ gcc -Wl,-Bsymbolic-functions -fPIC -shared foo.c && echo success
success
$ cat > empty.dynlist
{ "__this_symbol_isnt_present__"; };
$ gcc -Wl,--dynamic-list,empty.dynlist -fPIC -shared foo.c && echo success
success

I also cannot confirm that icc does anything different:

$ icc -fPIC -shared foo.c
ld: /tmp/iccf15gTK.o: relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object
ld: final link failed: Bad value
$ icc -O3 -S -o /dev/stdout -fPIC -shared foo.c | grep -A4 foo:
foo:
..B1.1:                         # Preds ..B1.0
..___tag_value_foo.1:                                           #2.19
        lea       foo(%rip), %rax                               #2.36
        ret                                                     #2.36

What's more, if you actually do compile the following program into a shared library, it succeeds:

$ cat > foo.S
        .text
        .globl foo
        .protected foo
        .type foo, @function
foo:
        movq foo@GOTPCREL(%rip), %rax
        ret
$ gcc -shared foo.S && echo success
success

But the resulting shared object has the following (extracted from eu-readelf):

Relocation section [ 5] '.rela.dyn' for section [ 0] '' at offset 0x230 contains 1 entry:
  Offset      Type             Value   Addend Name
  0x00200330  X86_64_GLOB_DAT  0x0248  +0     foo

    2: 0248      0 FUNC    GLOBAL PROTECTED        6 foo

Now we introduce a third component to this discussion: the dynamic linker. What will it do?

This has become a decision, not a bug: what should the compiler do when taking the address of a function when said function is under protected visibility? Both solutions are technically correct and would load the same function address under the correct circumstances.

The compiler is also taking the "protected" visibility to the letter (at least, according to its own definition):

    "protected"
        Protected visibility is like default visibility except that it indicates that references within the defining module will bind to the definition in that module. That is, the declared entity cannot be overridden by another module.

Since the symbol was marked as "protected" in the symbol table, it's expected that the linker and dynamic linker will bind it locally. That being the case, the compiler can optimise for that fact. It can calculate what value would be placed in the GOT entry and load that instead. That's the LEA instruction.

The linker, however, mandates that the address of the symbol should not be loaded directly, but only through the GOT. This is necessary because the psABI requires that the function address resolve to the PLT entry found in the position-dependent executable. If the executable takes the address of this global (but protected) symbol, it will hardcode the address to its own address space, forcing other ELF modules to follow suit.

Finally, what does the dynamic linker do when an "entity (that) cannot be overridden by another module" is overridden by another module? The glibc 2.14 loader will resolve the GOT entry's relocation to the executable's PLT stub, even if the symbol in question has protected visibility. Other loaders might work differently.

As it stands, the psABI requires that the address of a protected function be loaded through the GOT, even though the compiler thinks it knows what the address will be.

However, I would really like the compiler *not* to change its behaviour for PIC code, but instead to change its behaviour for ELF position-dependent executables. I am asking for a change in the psABI and requesting that the loading of function addresses for "default" visibility symbols (not protected!) should be done via the GOT. In other words, I'm asking that we optimise for shared libraries, not for executables.

Versions:
  GCC: 4.6.0
  ld: 2.21.51.0.6-6.fc15 20110118
  ICC: 12.1.0 20111011
[Bug target/19520] protected function pointer doesn't work right
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520 --- Comment #30 from Thiago Macieira 2012-01-19 18:52:57 UTC --- This does solve the problem. It's just unfortunate that it does so by creating more work for the library even if no executable ever takes the address of this protected function. It would have been preferable to somehow tell the compiler when compiling an executable that this function it's taking the address of is protected elsewhere, so it should use the GOT too.
[Bug target/83562] broken destructors of thread_local objects on i686 mingw targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83562 --- Comment #3 from Thiago Macieira --- This can easily be fixed by way of a trampoline that adjusts the parameter.
[Bug c++/88475] New: -E -fdirectives-only clashes with raw strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475

Bug ID: 88475
Summary: -E -fdirectives-only clashes with raw strings
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Source:

$ cat test.cpp
extern const char str[] = R"(
#define FOO 1
#NOSORT
)";

Compiles just fine:

$ g++ -c test.cpp; echo $?
0

Preprocessed output looks correct:

$ g++ -E test.cpp
# 1 "test.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "test.cpp"
extern const char str[] = R"(
#define FOO 1
#NOSORT
)";

But in the presence of -fdirectives-only (which icecream uses), it produces an error and preprocesses incorrectly:

$ g++ -E -fdirectives-only test.cpp | tail -5
test.cpp:3:2: error: invalid preprocessing directive #NOSORT
 #NOSORT
  ^~
# 1 "test.cpp"
extern const char str[] = R"(
#define FOO 1
)";

According to strace, cc1plus is the preprocessor, not /lib/cpp.
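
The expected behaviour in a normal compile can be checked with a small snippet like the one below: everything between the raw-string delimiters is ordinary string data, so the '#' lines must neither define FOO nor be diagnosed. The names raw_text, raw_text_contains and check_raw_text are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstring>

// Everything between R"( and )" is literal text, including the lines that
// look like preprocessor directives.
static const char raw_text[] = R"(
#define FOO 1
#NOSORT
)";

// If the preprocessor had acted on the text above, FOO would now be defined.
#ifdef FOO
#error "FOO must not be defined: the #define sits inside a raw string"
#endif

inline bool raw_text_contains(const char *needle)
{
    return std::strstr(raw_text, needle) != nullptr;
}

inline void check_raw_text()
{
    assert(raw_text_contains("#define FOO 1"));
    assert(raw_text_contains("#NOSORT"));
}
```

This only demonstrates the behaviour of a regular compile; the bug reported here concerns the separate -E -fdirectives-only path.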
[Bug sanitizer/89124] New: __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124 Bug ID: 89124 Summary: __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx))) Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: sanitizer Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org, jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at gcc dot gnu.org Target Milestone: --- $ cat test.cpp #include <immintrin.h> #ifdef __GNUC__ __attribute__((target("avx2"), no_sanitize_address)) #endif void f(void *ptr) { _mm256_loadu_si256((__m256i *)ptr); } $ gcc -c test.cpp && echo ok ok $ gcc -c -fsanitize=address test.cpp In file included from /opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/immintrin.h:41, from <source>:1: /opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/avxintrin.h: In function 'void f(void*)': /opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/avxintrin.h:919:1: error: inlining failed in call to always_inline '__m256i _mm256_loadu_si256(const __m256i_u*)': function attribute mismatch _mm256_loadu_si256 (__m256i_u const *__P) ^~ <source>:8:23: note: called from here _mm256_loadu_si256((__m256i *)ptr); ~~^~~~ Works fine in Clang. Godbolt link: https://godbolt.org/z/rg5kUD
[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124 --- Comment #1 from Thiago Macieira --- Worse: $ cat test.cpp #include <immintrin.h> #ifdef __GNUC__ __attribute__((no_sanitize_address)) #endif void f(void *ptr) { _mm256_loadu_si256((__m256i *)ptr); } $ gcc -c -mavx2 test.cpp [same errors]
[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124 --- Comment #2 from Thiago Macieira --- -fsanitize=address missing from the command-line in the previous comment. It should be: gcc -c -mavx2 -fsanitize=address test.cpp
[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124 --- Comment #4 from Thiago Macieira --- Or permit the inlining if the function is also __artificial__. It's documented, but I don't see anyone needing to use that besides gcc's own headers.
[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #19 from Thiago Macieira --- And Qt has stopped complaining about this. https://codereview.qt-project.org/227296
[Bug target/89445] New: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445 Bug ID: 89445 Summary: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- Created attachment 45793 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45793&action=edit example showing segmentation fault In the following code: void daxpy(size_t n, double a, double const* __restrict x, double* __restrict y) { const __m512d v_a = _mm512_broadcastsd_pd(_mm_set_sd(a)); const __mmask16 final = (1U << (n % 8u)) - 1; __mmask16 mask = 65535u; for (size_t i = 0; i < n * sizeof(double); i += 8 * sizeof(double)) { if (i + 8 * sizeof(double) > n * sizeof(double)) mask = final; __m512d v_x = _mm512_maskz_loadu_pd(mask, (char const *)x + i); __m512d v_y = _mm512_maskz_loadu_pd(mask, (char const *)y + i); __m512d tmp = _mm512_fmadd_pd(v_x, v_a, v_y); _mm512_mask_storeu_pd((char *)y + i, mask, tmp); } } When compiled with GCC 8, the loop looks like .L5: cmpq %rax, %r10 cmovb %r9d, %r8d movzbl %r8b, %ecx kmovd %ecx, %k1 leaq (%rdx,%rax), %rcx vmovapd (%rsi,%rax), %zmm1{%k1}{z} vmovapd (%rcx), %zmm2{%k1}{z} vfmadd132pd %zmm0, %zmm2, %zmm1 vmovupd %zmm1, (%rcx){%k1} addq $64, %rax cmpq %rdi, %rax jb .L5 Whereas GCC trunk (as of r269073) generates: .L5: vmovapd (%rsi,%rax), %zmm1 cmpq %rax, %r9 vfmadd213pd (%rdx,%rax), %zmm0, %zmm1 cmovb %r8d, %ecx kmovb %ecx, %k1 vmovupd %zmm1, (%rdx,%rax){%k1} addq $64, %rax cmpq %rdi, %rax jb .L5 Godbolt link: https://gcc.godbolt.org/z/2ys7ZO Since neither of the memory loads is masked, the resulting registers can contain garbage and trigger FP exceptions. The loads can also cause segmentation faults if portions of the source are not in mapped regions. The attached example forces the operation on a page boundary where half the 64 bytes addressed by the second load are unmapped. When run, the example will crash.
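To make the required semantics concrete, here is a portable scalar model of the loop. It is only a sketch and uses no AVX-512 intrinsics: lane_mask(), maskz_load() and daxpy_model() are hypothetical names introduced for illustration, and the tail mask is computed from the remaining element count rather than from n % 8 as in the report. The point it illustrates is that a zero-masking load must not read the memory of inactive lanes at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mask for a full vector of eight doubles: one bit per lane.
constexpr std::uint8_t kAllLanes = 0xff;

// Mask covering only the lanes still inside an n-element array for the
// 8-lane block that starts at element index `first` (requires first < n).
constexpr std::uint8_t lane_mask(std::size_t first, std::size_t n)
{
    return first + 8 <= n ? kAllLanes
                          : static_cast<std::uint8_t>((1u << (n - first)) - 1);
}

// Scalar model of a zero-masking load: inactive lanes become 0.0 and their
// memory is never read, so an unmapped tail cannot fault.
inline void maskz_load(double (&dst)[8], const double *src, std::uint8_t mask)
{
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = ((mask >> lane) & 1) ? src[lane] : 0.0;
}

// Scalar model of the daxpy loop from the report: y = a * x + y.
void daxpy_model(std::size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; i += 8) {
        const std::uint8_t mask = lane_mask(i, n);
        double vx[8], vy[8];
        maskz_load(vx, x + i, mask);
        maskz_load(vy, y + i, mask);
        for (int lane = 0; lane < 8; ++lane)
            if ((mask >> lane) & 1)   // masked store: skip inactive lanes
                y[i + lane] = vx[lane] * a + vy[lane];
    }
}
```

For n = 11 the second block uses mask 0x07, so only elements 8, 9 and 10 are touched. The code generated by trunk corresponds to calling maskz_load() with kAllLanes regardless of the computed mask.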
[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445 --- Comment #6 from Thiago Macieira --- (In reply to Jakub Jelinek from comment #4) > vmovupd (%rsi,%rax), %zmm1{%k1}{z} > addq %rdx, %rax > vmovupd (%rax), %zmm2{%k1}{z} > vfmadd132pd %zmm0, %zmm2, %zmm1 > vmovupd %zmm1, (%rax){%k1} > isn't optimal btw, it would be nice if we could merge that masking into the > vfmadd132pd instruction, like: > vmovupd (%rsi,%rax), %zmm1{%k1}{z} > addq %rdx, %rax > vfmadd132pd (%rax), %zmm2, %zmm1{%k1}{z} > vmovupd %zmm1, (%rax){%k1} > but not really sure how to achieve that. It would be nice. It would be even nicer not to have that "addq". That's actually what ICC generates (click on the godbolt link and change one of the compilers to ICC 19): ..B1.3: # Preds ..B1.3 ..B1.2 cmpq %rax, %r8 #12.13 cmova %r10d, %r9d #12.13 kmovw %r9d, %k1 #13.20 vmovupd (%r8,%rsi), %zmm1{%k1}{z} #13.20 vfmadd213pd (%r8,%rdx), %zmm0, %zmm1{%k1}{z} #15.20 vmovupd %zmm1, (%r8,%rdx){%k1} #16.9 addq $64, %r8 #10.48 cmpq %rcx, %r8 #10.32 jb ..B1.3 # Prob 82% #10.32 There's one more simplification here: ICC lacks the movzbl instruction which GCC inserted but is completely superfluous. First, we've already calculated the proper 32-bit pattern and stored it in %r9d, so there was no need to zero-extend it. Second, when operating on 512-bit packed doubles, there are 8 lanes, so only the low 8 bits of the mask register will be considered in the first place. (Arguably, the intrinsic should have used __mmask8, but that wasn't added until AVX512DQ and this is F.) That reduces the number of instructions and will save you a couple of uops per loop. Depending on how long your loop is, it may help you fit in the DSB and help the Loop Stream Detector. I'm not at all knowledgeable about those details, so I'll just link to https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of#answer-39940932. 
For this particular loop, if run long enough, I don't think there's any effect, but this is an area for improvement for longer loops. The number of instructions is also significant for short-lived loops, which happens to me often when using SIMD for strings (tens of bytes of length, so the loop is run once or twice only).
[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445 --- Comment #7 from Thiago Macieira --- Comment on attachment 45800 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45800 gcc9-pr89445.patch Tested and works on my machine. The movzbl that GCC 8 generated is also gone, but it inserted moves *from* the OpMask register: .L4: movq %rcx, %rax addq $64, %rcx cmpq %rdi, %rcx kmovw %k1, %r9d cmova %r8d, %r9d kmovw %r9d, %k1 vmovupd (%rsi,%rax), %zmm1{%k1}{z} addq %rdx, %rax vmovupd (%rax), %zmm2{%k1}{z} vfmadd132pd %zmm0, %zmm2, %zmm1 vmovupd %zmm1, (%rax){%k1} cmpq %rdi, %rcx jb .L4 Seems like it forgot the GPR that used to contain the mask, so it needed to reload from %k1. The end detection is also slightly worse. Yesterday, when I benchmarked with GCC 8, it ran 1000 iterations over 10 million doubles in roughly 11.9 ms, with 10 million instructions. Today, I am getting 11.8 ms at 16 million instructions (the increase of instructions/cycle is roughly equal to the decrease in instructions per iteration, proving that memory bandwidth is the bottleneck).
[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445 --- Comment #8 from Thiago Macieira --- Sorry, in editing I ended up removing an important point: GCC 8 also generates the move *from* OpMask when I put it in the benchmark loop. So that's not a regression, per se.
[Bug target/87317] New: Missed optimisation: merging VMOVQ with operations that only use the low 8 bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87317 Bug ID: 87317 Summary: Missed optimisation: merging VMOVQ with operations that only use the low 8 bytes Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- Test: #include <immintrin.h> int f(void *ptr) { __m128i data = _mm_loadl_epi64((__m128i *)ptr); data = _mm_cvtepu8_epi16(data); return _mm_cvtsi128_si32(data); } GCC generates (-march=haswell or -march=skylake): vmovq (%rdi), %xmm0 vpmovzxbw %xmm0, %xmm0 vmovd %xmm0, %eax ret Note that the VPMOVZXBW instruction only reads the low 8 bytes from the source, including if it is a memory reference. Both Clang and ICC generate: vpmovzxbw (%rdi), %xmm0 vmovd %xmm0, %eax retq Similarly for: void f(void *dst, void *ptr) { __m128i data = _mm_cvtsi32_si128(*(int*)ptr); data = _mm_cvtepu8_epi32(data); _mm_storeu_si128((__m128i*)dst, data); } GCC: vmovd (%rsi), %xmm0 vpmovzxbd %xmm0, %xmm0 vmovups %xmm0, (%rdi) ret Clang and ICC: vpmovzxbd (%rsi), %xmm0 vmovdqu %xmm0, (%rdi) retq There are other instructions that might benefit from this. AVX-512 memory instructions where the OpMask is a constant might be candidates too.
[Bug target/87522] LTO incorrectly merges target specific options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87522 --- Comment #2 from Thiago Macieira --- In the original case, all sources were compiled with -march=westmere, though some files had -mavx added.
[Bug target/69471] "-march=native" unintentionally breaks further -march/-mtune flags
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69471 Thiago Macieira changed: What|Removed |Added CC||thiago at kde dot org --- Comment #5 from Thiago Macieira --- Same thing here. User passes CFLAGS="-march=native" for their system, but library needs to build one .cpp source with -march=haswell for additional functionality (runtime-checked via CPUID). Unfortunately, -march=native supersedes all other -march options, regardless of order, unlike all other options. Examples: $ gcc -dM -E -xc /dev/null -march=sandybridge -march=haswell | grep AVX #define __AVX__ 1 #define __AVX2__ 1 $ gcc -dM -E -xc /dev/null -march=haswell -march=sandybridge | grep AVX #define __AVX__ 1 $ gcc -dM -E -xc /dev/null -march=sandybridge -march=native | grep AVX #define __AVX__ 1 #define __AVX2__ 1 $ gcc -dM -E -xc /dev/null -march=native -march=sandybridge | grep AVX #define __AVX__ 1 #define __AVX2__ 1 Qt is affected: https://bugreports.qt.io/browse/QTBUG-71564. The problem began when we switched from appending -mavx2 to appending -march=haswell, so we'd get FMA and BMI1/2 in the same file.
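A source file that depends on a particular -march can report, at compile time, which features actually ended up enabled. The following is a small sketch of that idea; avx_macros() and built_with_haswell_features() are hypothetical helper names, and the set of macros treated as "Haswell features" (AVX2, FMA, BMI1/2) is taken from the description above rather than from any Qt source.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: lists the AVX-family macros predefined for this
// translation unit, in the spirit of "gcc -dM -E ... | grep AVX".
std::string avx_macros()
{
    std::string out;
#ifdef __AVX__
    out += "__AVX__ ";
#endif
#ifdef __AVX2__
    out += "__AVX2__ ";
#endif
    return out;
}

// Hypothetical helper: true if this file was compiled with the features the
// comment above expects from -march=haswell (AVX2, FMA, BMI1 and BMI2).
constexpr bool built_with_haswell_features()
{
#if defined(__AVX2__) && defined(__FMA__) && defined(__BMI__) && defined(__BMI2__)
    return true;
#else
    return false;
#endif
}

// Sanity check on the relationship between the two helpers.
void check_feature_report()
{
    const std::string macros = avx_macros();
    assert(!built_with_haswell_features()
           || macros.find("__AVX2__") != std::string::npos);
}
```

A build system could turn built_with_haswell_features() into a static_assert in the one file that needs it, so that a silently overridden -march fails at compile time instead of producing a binary with the wrong instruction set.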
[Bug target/69471] "-march=native" unintentionally breaks further -march/-mtune flags
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69471 --- Comment #6 from Thiago Macieira --- Clang is not affected: $ clang -dM -E -xc /dev/null -march=sandybridge -march=native | grep AVX #define __AVX2__ 1 #define __AVX__ 1 $ clang -dM -E -xc /dev/null -march=native -march=sandybridge | grep AVX #define __AVX__ 1 Instead of enabling the CPU features your CPU has, Clang tries to guess which CPU you have and will apply it. This has side-effects for non-arch-specific items like AES. ICC is similarly affected, despite claiming it isn't: $ icc -dM -E -xc /dev/null -march=sandybridge -march=native | grep AVX icc: command line warning #10121: overriding '-march=sandybridge' with '-march=native' icc: command line warning #10121: overriding '-march=sandybridge' with '-march=native' #define __AVX_I__ 1 #define __AVX__ 1 #define __AVX2__ 1 $ icc -dM -E -xc /dev/null -march=native -march=sandybridge | grep AVX icc: command line warning #10121: overriding '-march=native' with '-march=sandybridge' #define __AVX_I__ 1 #define __AVX__ 1 #define __AVX2__ 1 It says it's overriding, but doesn't override.
[Bug target/87976] New: [i386] Sub-optimal code generation for _mm256_set1_epi64()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976 Bug ID: 87976 Summary: [i386] Sub-optimal code generation for _mm256_set1_epi64() Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- In the following code, Clang and ICC emit a very optimal function that consists of three instructions (including the tail call). MSVC emits a pretty good equivalent with a bit more function overhead, but no memory access. GCC emits a completely unnecessary memory access. Code: #include <immintrin.h> #include <stdint.h> #ifndef _MSC_VER #define __vectorcall #endif void __vectorcall f(__m256i value256); void g(uint64_t value) { f( _mm256_set1_epi64x(value)); } Clang and ICC (optimal) output: g: vmovd %rdi, %xmm0 vpbroadcastq %xmm0, %ymm0 jmp f GCC: g: pushq %r13 leaq 16(%rsp), %r13 andq $-32, %rsp pushq -8(%r13) pushq %rbp movq %rsp, %rbp pushq %r13 movq %rdi, -24(%rbp) vpbroadcastq -24(%rbp), %ymm0 popq %r13 popq %rbp leaq -16(%r13), %rsp popq %r13 jmp f Godbolt link for all compilers: https://gcc.godbolt.org/z/-gNvec
[Bug target/87976] [i386] Sub-optimal code generation for _mm256_set1_epi64()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976 --- Comment #3 from Thiago Macieira --- Workaround: __m128i value64 = _mm_set_epi64x(0, value); // _mm_cvtsi64_si128(value); asm ("" : "+x" (value64)); __m256i value256 = _mm256_broadcastq_epi64(value64);
[Bug c++/69549] New: Named Address Spaces does not compile in C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549 Bug ID: 69549 Summary: Named Address Spaces does not compile in C++ Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- It works in C: $ cat test.c __seg_gs char * ptr; $ gcc -c test.c && echo Success Success But not in C++: $ gcc -xc++ -c test.c test.c:1:1: error: ‘__seg_gs’ does not name a type Even though it's advertised as supported: $ gcc -xc++ -dM -E /dev/null | grep SEG_GS #define __SEG_GS 1
[Bug c++/69549] Named Address Spaces does not compile in C++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549 --- Comment #1 from Thiago Macieira --- Bump? Still happening on 7.0 (built 20160502)
[Bug c++/82081] New: Tail call optimisation of noexcept function leads to exception allowed through
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82081 Bug ID: 82081 Summary: Tail call optimisation of noexcept function leads to exception allowed through Product: gcc Version: 7.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- When a noexcept function gets optimised with tail-call, the frame disappears so the unwinder cannot know that the function was noexcept and thus std::terminate() should be called. Code: $ cat throw.cpp void noexcept_function() noexcept; bool false_condition = false; void will_throw() { throw 1; } void wrapper() { noexcept_function(); if (false_condition) throw 42; } $ cat main.cpp #include <iostream> void will_throw(); // throws int void wrapper(); extern bool false_condition; void noexcept_function() noexcept { will_throw(); } int main() { try { wrapper(); } catch (int v) { std::cout << "Caught " << v; return v; } return 0; } By bouncing around translation units, we prevent inlining. The compiler cannot know that wrapper() calls noexcept_function(), which calls will_throw(). In debug mode, the program behaves as expected: $ g++ -O0 -g throw.cpp main.cpp $ ./a.out terminate called after throwing an instance of 'int' [1]46552 abort (core dumped) ./a.out (gdb) bt #0 0x7f9df0ce1a90 in raise () from /lib64/libc.so.6 #1 0x7f9df0ce30f6 in abort () from /lib64/libc.so.6 #2 0x7f9df1615235 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6 #3 0x7f9df1613026 in ?? () from /usr/lib64/libstdc++.so.6 #4 0x7f9df1611fe9 in ?? () from /usr/lib64/libstdc++.so.6 #5 0x7f9df1612958 in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6 #6 0x7f9df10633a3 in ?? () from /lib64/libgcc_s.so.1 #7 0x7f9df10638b0 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1 #8 0x7f9df16132a6 in __cxa_throw () from /usr/lib64/libstdc++.so.6 #9 0x004009ed in will_throw () at throw.cpp:6 #10 0x00400a2f in noexcept_function () at main.cpp:7 #11 0x004009f6 in wrapper () at throw.cpp:11 #12 0x00400a40 in main () at main.cpp:12 However, when optimised, we see that the exception thrown from will_throw() does pass through and is caught by main(): $ g++ -O2 -g throw.cpp main.cpp $ ./a.out Caught 1 (gdb) disass noexcept_function Dump of assembler code for function noexcept_function(): 0x00400b10 <+0>: jmpq 0x400aa0 I see two possible paths to solving this. 1) forbid tail-call optimisation of a noexcept(false) call in a noexcept function, so that there is a frame in place for the unwinder to find. That is, the noexcept_function should be: sub %rsp, 8 call will_throw() retq (GCC generates this under some conditions, like placing all functions in the same TU but using -fno-inline) 2) wrap the call point of the noexcept function (in this case, wrapper()) with an EH table that enforces that no exceptions should come out of it. The first solution implies a performance penalty due to optimisation that could not be used. If you choose to implement this, please try to disable this correction under -fno-exceptions. The second solution allows the runtime performance at the expense of expanding EH tables around every noexcept function. Neither solution completely solves the problem for mixed-age code in different libraries: solution 1 solves the problem if the callee is recompiled but lets the problem still happen if only the caller is recompiled. Solution 2 is the dual converse: if the caller is recompiled, the problem is solved, but the problem still happens if only the callee is recompiled.
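The guarantee the report relies on can be written out by hand. The sketch below is an illustration only: it collapses the two translation units into one, renames will_throw() to a hypothetical maybe_throw(bool) so the non-throwing path can be exercised without aborting, and spells the noexcept contract as an explicit handler. Whether this explicit form avoids the tail call in the affected GCC versions was not verified.

```cpp
#include <cassert>
#include <exception>

// Stand-in for will_throw() from the report. The flag is an addition made
// here so that the non-throwing path can run without aborting the process.
void maybe_throw(bool do_throw)
{
    if (do_throw)
        throw 1;
}

// What "noexcept" promises, written out explicitly: nothing may escape.
// An exception reaching the handler ends the program via std::terminate().
void noexcept_function(bool do_throw) noexcept
{
    try {
        maybe_throw(do_throw);
    } catch (...) {
        std::terminate();
    }
}

// Caller from the report, reduced to a single translation unit.
int wrapper(bool do_throw)
{
    // The caller can see the promise at compile time through the declaration.
    static_assert(noexcept(noexcept_function(do_throw)),
                  "callee must be declared noexcept");
    noexcept_function(do_throw);
    return 0;
}

// Exercises only the path that does not throw.
void check_non_throwing_path()
{
    assert(wrapper(false) == 0);
}
```

Calling wrapper(true) is expected to end in std::terminate() rather than in a handler further up the stack, which is the behaviour the -O0 build in the report shows.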
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #8 from Thiago Macieira --- (In reply to Peter Cordes from comment #7) > 8B alignment is required for 8B objects to be efficiently lock-free (using > SSE load / store for .load() and .store(), see > https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a- > naturally-aligned-variable-atomic), and to avoid a factor of ~100 slowdown > if lock cmpxchg8b is split across a cache-line boundary. Unfortunately, the issue is not efficiency, but compatibility. The change broke ABI for roughly 50% of structs containing atomic<64bit>. I understand being fast, but not at the expense of silently breaking code at runtime. > alignof(long double) in 32-bit is different from alignof(long double) in > 64-bit. std::atomic<long double> or _Atomic long double should always have > the same alignment as long double. In and out of structs? That's the whole problem: inside structs, the alignment is 4 for historical reasons.
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #10 from Thiago Macieira --- Actually, PR 65146 points out that the problem is not efficiency but correctness. An under-aligned type could cross a cacheline boundary and thus fail to be atomic in the first place. Therefore, it is correct to increase the alignment, even if that causes an ABI change for existing structures. Those structures were disasters waiting to happen. I withdraw my bug report. Close it as INVALID or NOTABUG or whatever is appropriate.
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #12 from Thiago Macieira --- Another problem is that we've now had a couple of years with this issue, so it's probably worse to make a change again.
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #14 from Thiago Macieira --- (In reply to Peter Cordes from comment #13) > If you want a struct with non-atomic members to match the layout of a struct > with atomic members, do something like > > struct foo { > char c; > alignas(atomic<long long>) long long t; > }; > [cut] > IDK what Qt's assert is guarding against. If you're specifically worried > about atomicity, checking that alignof(InStruct) == sizeof(long long) makes > more sense, because that's required on almost any architecture as a > guaranteed way to avoid cache-line splits. (C/C++ don't have a simple way > to express "unaligned is fine except at cache line boundaries" like you get > on Intel specifically (not AMD)). It was trying to guard against exactly what you said above: that the alignment of a QAtomicInteger<T> was exactly the same as the alignment of a plain T inside a struct, so one could replace a previous plain member with an atomic and keep binary compatibility. But it's clear now that atomic types may need more alignment than the plain types. In hindsight, the check is unnecessary and should be removed; people should not expect to replace T with std::atomic<T> or QAtomicInteger<T> and keep ABI.
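The layout question can be inspected directly. This is a minimal sketch with hypothetical struct names (PlainMember, AtomicMember, AlignedPlainMember), assuming a mainstream platform where std::atomic<long long> has the same size as long long. It compares a plain member, an atomic member, and the alignas() form suggested in the quoted comment.

```cpp
#include <atomic>
#include <cassert>

// A plain 8-byte member: on i386 its alignment inside a struct is 4.
struct PlainMember { char c; long long t; };

// The same struct with the member made atomic; it may be more aligned.
struct AtomicMember { char c; std::atomic<long long> t; };

// The suggestion from the quoted comment: keep the member non-atomic but
// give it the alignment of the atomic type.
struct AlignedPlainMember { char c; alignas(std::atomic<long long>) long long t; };

// The atomic variant is never less aligned than the plain one, and the
// alignas() variant follows the atomic one.
static_assert(alignof(AtomicMember) >= alignof(PlainMember),
              "atomic member must not reduce alignment");
static_assert(alignof(AlignedPlainMember) == alignof(AtomicMember),
              "alignas(std::atomic<T>) should reproduce the atomic alignment");

// True if swapping the atomic member for the alignas() member keeps the
// struct size. Assumes sizeof(std::atomic<long long>) == sizeof(long long).
bool layouts_match()
{
    return sizeof(AlignedPlainMember) == sizeof(AtomicMember);
}

// True if replacing the plain member with an atomic one changes the layout,
// which is the ABI break discussed in this bug (expected on 32-bit x86).
bool atomic_changes_layout()
{
    return sizeof(AtomicMember) != sizeof(PlainMember)
        || alignof(AtomicMember) != alignof(PlainMember);
}
```

On x86-64 atomic_changes_layout() is expected to be false because long long is already 8-byte aligned there. The 32-bit case is the one in which it would report a difference.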
[Bug c++/80439] New: __attribute__((target("xxx"))) not applied to lambdas
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80439 Bug ID: 80439 Summary: __attribute__((target("xxx"))) not applied to lambdas Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- Testcase (see also https://godbolt.org/g/H2xjNc for GCC and Clang build): #include <immintrin.h> #include <stdint.h> __attribute__((target("sse4.2"))) unsigned aeshash(const uint8_t *p, size_t len, unsigned seed) { const auto l = [](unsigned data) { __m128i m = _mm_insert_epi32(_mm_setzero_si128(), data, 1); return _mm_extract_epi32(m, 1); }; return l(seed); } In the testcase above, if the source is compiled with base options for x86 (either 32- or 64-bit mode), GCC fails to compile with error: /usr/lib/gcc/x86_64-linux-gnu/6.3.0/include/smmintrin.h:447:1: error: inlining failed in call to always_inline 'int _mm_extract_epi32(__m128i, int)': target specific option mismatch _mm_extract_epi32 (__m128i __X, const int __N) ^ <source>:9:38: note: called from here return _mm_extract_epi32(m, 1); ^ Clang compiles the above just fine. The compilation works if I add __attribute__((target("sse4.2"))) to the lambda.
[Bug target/57202] Please make the intrinsics headers like immintrin.h be usable without compiler flags
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202 --- Comment #10 from Thiago Macieira --- > But that's what this bug report is for - to make the intrinsics always available. I never asked for them to be available in undecorated functions. Yes, that's how both the Intel and Microsoft compilers behave, but I actually find that GCC and Clang's behaviour makes sense too. This allows a clear demarcation of where different instructions may be used by the compiler, so the CPU check code can be sure of no leakage. What's more, it allows the compiler to use other instructions that you didn't specifically use. It's not perfect, but neither is unrestricted use. I've seen code generated by either ICC or MSVC (don't remember which) when using an AVX2 instruction like VPMOVZXBW be surrounded by non-VEX-encoded SSE2 instructions because we never told the compiler it was ok to use VEX.
[Bug c++/80460] New: Non-sensical fallthrough warning after [[noreturn]] function leading to __builtin_unreachable()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80460 Bug ID: 80460 Summary: Non-sensical fallthrough warning after [[noreturn]] function leading to __builtin_unreachable() Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- Testcase: === [[noreturn]] void qt_assert() noexcept; inline void qt_noop() {} void f(int i) { switch (i) { case 0: ((!(!"message")) ? qt_assert() : qt_noop()); case 1: qt_noop(); } } === Prints (under -O2): <source>: In function 'void f(int)': <source>:8:49: warning: this statement may fall through [-Wimplicit-fallthrough=] ((!(!"message")) ? qt_assert() : qt_noop()); ~~~^~ <source>:9:5: note: here case 1: ^~~~ The condition !!"message" is always true, so the [[noreturn]] function qt_assert() will be called. There's no condition under which qt_noop() will be called, so there's no fallthrough possible.
[Bug c++/80460] Incorrect fallthrough warning after [[noreturn]] function inside always-true conditional
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80460 --- Comment #7 from Thiago Macieira --- (In reply to Jakub Jelinek from comment #1) > The warning is done before optimizations (except GENERIC opts), and can > hardly be done much later. I imagined it would be the case. Treat this as low priority. I've added the [[fallthrough]] to the source code where this appeared to silence the warning. Arguably, the author should have used Q_UNREACHABLE() there too, not Q_ASSERT(!"message").
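The workaround mentioned above can be sketched as follows (C++17 for the [[fallthrough]] attribute). The names fail() and noop() are hypothetical stand-ins for qt_assert() and qt_noop(), and fail() throws instead of aborting so the behaviour can be observed. This is an illustration of the pattern, not the actual Qt change.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for qt_assert(): never returns normally.
[[noreturn]] void fail(const char *what) { throw std::logic_error(what); }

// Stand-in for qt_noop().
inline void noop() {}

// Returns how many "case 1" steps ran for the given selector.
int f(int i)
{
    int steps = 0;
    switch (i) {
    case 0:
        // Same shape as the expanded assertion in the report: the condition
        // is always true, so fail() is always the branch taken.
        (!(!"message")) ? fail("message") : noop();
        // The warning is issued before optimisation, so the compiler cannot
        // see that the line above never completes; this silences it.
        [[fallthrough]];
    case 1:
        ++steps;
        break;
    default:
        break;
    }
    return steps;
}

// Exercises the two paths that return normally.
void check_returning_paths()
{
    assert(f(1) == 1);
    assert(f(2) == 0);
}
```

f(0) never reaches case 1 at run time because fail() throws first. The attribute only documents the (unreachable) fallthrough for the benefit of the warning.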
[Bug c/54202] Overeager warning about freeing non-heap objects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202 --- Comment #6 from Thiago Macieira --- ping. If you can't fix GCC so that it can prove that the free is on a non-heap object, then please change the warning to indicate that GCC may be wrong. For example: warning: free() may be called with non-heap object 'name'
[Bug c/80922] New: #pragma diagnostic ignored not honoured with -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80922 Bug ID: 80922 Summary: #pragma diagnostic ignored not honoured with -flto Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- $ cat f1.cpp #pragma GCC diagnostic ignored "-Wfree-nonheap-object" void myfree(void *ptr) { __builtin_free(ptr); } $ cat f2.cpp void myfree(void *); static char c; int main() { myfree(&c); } This code is intentionally bogus just to trigger the warning. The situation that caused this was correct code, with a false positive warning I was trying to suppress. $ gcc -O2 -include f1.cpp f2.cpp [no warning, as expected] $ gcc -O2 -flto f1.cpp f2.cpp In function ‘myfree.constprop’, inlined from ‘main’ at f2.cpp:6:11: f1.cpp:4:19: warning: attempt to free a non-heap object ‘c’ [-Wfree-nonheap-object] __builtin_free(ptr); ^
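For reference, a scoped form of the suppression looks like the sketch below. It is an illustration only: the push/pop scoping and the allocate_and_release() helper are additions made here, and nothing in this sketch changes the -flto behaviour described in the report, where the warning is issued at link time regardless of the pragma.

```cpp
#include <cassert>
#include <cstdlib>

// Limit the suppression to the one function that needs it. The pragma is
// restricted to GCC here because other compilers may not know this warning.
#if defined(__GNUC__) && !defined(__clang__)
#  pragma GCC diagnostic push
#  pragma GCC diagnostic ignored "-Wfree-nonheap-object"
#endif
void myfree(void *ptr)
{
    std::free(ptr);
}
#if defined(__GNUC__) && !defined(__clang__)
#  pragma GCC diagnostic pop
#endif

// Hypothetical helper: allocates a heap block and releases it via myfree().
// Returns false only if the allocation itself failed.
bool allocate_and_release(std::size_t bytes)
{
    assert(bytes > 0);
    void *p = std::malloc(bytes);
    if (p == nullptr)
        return false;
    myfree(p);
    return true;
}
```

Unlike the test case in the report, this sketch only ever frees heap memory, so it is valid code whether or not the warning is suppressed.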
[Bug target/78782] New: [x86] _mm_loadu_si64 intrinsic missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78782 Bug ID: 78782 Summary: [x86] _mm_loadu_si64 intrinsic missing Product: gcc Version: 6.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- See this copy of the Intel manual: https://hjlebbink.github.io/x86doc/html/MOVQ.html (note the typo in the _mm_move_epi64 intrinsic). Clang addition: https://reviews.llvm.org/D21504 However, Microsoft's compiler seems not to have it either. Seems like the functionality can be achieved by way of _mm_loadl_epi64.
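A fallback along those lines could look like the following sketch. load_low64() is a hypothetical helper name, not an intrinsic. It uses _mm_loadl_epi64 when SSE2 is available and a plain memcpy otherwise, and it returns the loaded value as an integer so that the result is easy to inspect.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#  include <emmintrin.h>
#endif

// Hypothetical helper: loads 8 bytes from a possibly unaligned address.
std::uint64_t load_low64(const void *p)
{
    assert(p != nullptr);
    std::uint64_t out;
#if defined(__SSE2__)
    // MOVQ-style load of the low 64 bits; the upper half of v is zeroed.
    const __m128i v = _mm_loadl_epi64(static_cast<const __m128i *>(p));
    // Store the low 64 bits back to memory to obtain an integer result.
    _mm_storel_epi64(reinterpret_cast<__m128i *>(&out), v);
#else
    // Portable fallback with the same result in native byte order.
    std::memcpy(&out, p, sizeof out);
#endif
    return out;
}
```

Code that wants the vector value itself rather than an integer would keep the __m128i returned by _mm_loadl_epi64 and skip the store.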
[Bug c++/82443] New: Would like a way to control emission of vague/weak symbol for inline variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82443 Bug ID: 82443 Summary: Would like a way to control emission of vague/weak symbol for inline variables Product: gcc Version: 7.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- C++17 introduced inline variables and made all static constexpr members be implicitly inline. With C++14, this code: === header.h === struct S { static constexpr int i = 42; }; === tu1.cpp === #include <header.h> constexpr int S::i; === tu2.cpp === #include <header.h> const void *f() { return &S::i; } === Clang 5 and GCC 7.2, when compiled with -std=c++14, emit the S::i symbol in tu1.o and it's not weak. There's no S::i symbol emitted in tu2.o. When compiled with -std=c++17, GCC 7 does not emit the symbol in tu1.o. Clang 5 does. Both compilers emit a weak symbol in tu2.o. ICC 17 with -std=c++14 emits nothing in tu1.o and emits a weak S::i in tu2.o. This inconsistency is fragile. Now add -fvisibility=hidden -fvisibility-inlines-hidden: I'd like a way to make sure that the inline variable is emitted only in my .cpp file. Everywhere else that needs to take the address will not emit a copy and will get it from my .so.
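The language-level difference can be shown in a single file. This is a sketch that merges header.h, tu1.cpp and tu2.cpp from the report into one translation unit, so it says nothing about which object file receives the symbol. The __cplusplus guard and the check_address_is_usable() helper are additions made here.

```cpp
#include <cassert>

struct S {
    static constexpr int i = 42;   // implicitly inline from C++17 onwards
};

#if __cplusplus < 201703L
// C++14 and earlier: taking the address of S::i odr-uses it, so exactly one
// out-of-line definition is required somewhere in the program.
constexpr int S::i;
#endif

// Taking the address forces the compiler to emit (or reference) the symbol.
const void *f() { return &S::i; }

// The address must refer to an object holding the in-class initialiser.
void check_address_is_usable()
{
    assert(*static_cast<const int *>(f()) == 42);
}
```

In C++17 the out-of-line definition is a redundant, deprecated redeclaration, which is why the guard drops it there. The request in the report is for a way to keep a single strong definition even in that mode.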
[Bug c++/77849] New: [regression/4.9] Warning about deprecated enum even when "-Wdeprecated-declarations" is off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77849 Bug ID: 77849 Summary: [regression/4.9] Warning about deprecated enum even when "-Wdeprecated-declarations" is off Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: thiago at kde dot org Target Milestone: --- $ cat test.cpp class C { public: enum __attribute__((__deprecated__("Do not use"))) MyEnum { Foo, Bar }; #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wdeprecated-declarations" __attribute__((__deprecated__("Really, do not use"))) static const MyEnum mySpecialEnum = Foo; #pragma GCC diagnostic pop }; int main() { return C::Foo; } $ gcc-6 -fsyntax-only test.cpp test.cpp:1:7: warning: ‘C::mySpecialEnum’ is deprecated: Really, do not use [-Wdeprecated-declarations] test.cpp:12:79: note: declared here Notes: * no warnings on GCC 4 * warnings on mySpecialEnum in GCC 5 and 6 (not about the actual enum usage, about the actual definition of mySpecialEnum) * no warnings with ICC * warnings in main on clang 3.7-3.9 Sorry, I don't have a GCC trunk (7) build available.
[Bug target/59952] -march=core-avx2 should not enable RTM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952 --- Comment #12 from Thiago Macieira --- GCC 4.9.0 got released with -march=haswell still enabling RTM and HLE, even though there are Haswell parts without TSX.
[Bug target/59952] -march=core-avx2 should not enable RTM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952 --- Comment #15 from Thiago Macieira --- (In reply to H.J. Lu from comment #14) > I think HLE is the part of TSX. It is and should be removed from the list.
[Bug target/59952] -march=core-avx2 should not enable RTM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952 --- Comment #19 from Thiago Macieira --- The prefix can be emitted for any CPU, you don't need a flag for that. However, you cannot emit the XTEST instruction unless the CPU has HLE or RTM.
[Bug c++/43247] Incorrect optimization while declaring array[1]
--- Comment #1 from thiago at kde dot org 2010-03-03 14:41 --- Problem also happens on: gcc 4.4.3 on linux 32-bit gcc 4.4.1 on linux ARM (armel gnueabi) Also reproducible with -O1 -ftree-vrp. -- thiago at kde dot org changed: What|Removed |Added CC||thiago at kde dot org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247
[Bug c++/43247] Incorrect optimization while declaring array[1]
--- Comment #2 from thiago at kde dot org 2010-03-03 14:44 --- Also: -O1 -ftree-vrp -fno-cprop-registers -fno-defer-pop -fno-guess-branch-probability -fno-if-conversion -fno-if-conversion2 -fno-ipa-pure-const -fno-ipa-reference -fno-merge-constants -fno-omit-frame-pointer -fno-split-wide-types -fno-tree-ch -fno-tree-copy-prop -fno-tree-copyrename -fno-tree-dce -fno-tree-dominator-opts -fno-tree-dse -fno-tree-fre -fno-tree-sink -fno-tree-sra -fno-tree-ter However, if I add -fno-tree-ccp, the program starts to work as expected again. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247
[Bug c++/40145] structure inside a static function is exported, producing warning
--- Comment #1 from thiago at kde dot org 2010-03-03 14:46 --- Anyone? This is not a showstopper, but produces unnecessary (and incorrect) warnings. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40145
[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]
--- Comment #6 from thiago at kde dot org 2010-03-26 21:46 --- Is this fix going to be backported to the 4.4.x line? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247
[Bug c/65888] New: Need a way to disable copy relocations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65888

Bug ID: 65888
Summary: Need a way to disable copy relocations
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org

Qt would like to optimise libraries by resolving relocations that loop back into the library in question at link-time, disallowing interposing. The libraries remain position-independent by always resolving symbols via PC-relative addressing or via R_xxx_RELATIVE relocations for whatever pointers need to be stored in memory (such as virtual tables).

To do that, we use -Bsymbolic or -Bsymbolic-functions. Either way, this is not enough: the problem happens when the symbols used from the libraries get used in the main application. Due to copy relocation and position-dependent code generation, those symbols "transfer" to the main application:
* variables are copy-relocated
* functions' entry points are now the PLT location in the application

Since the official address of certain variables or functions changes, the link-time resolving that happened inside the library is now different from what the application and other libraries will resolve.

So far, using -fPIE has been enough to make the main executable not create copy relocations on i386 and x86-64, with GCC 4.9 and earlier, Clang and ICC. GCC 5 breaks that.

Given the relative code size of the application vs the libraries (the libraries are at least 10x larger and more complex), I argue that we're optimising for the wrong thing by using copy relocations. It's a historic mistake that needs fixing in the ABI.

Please provide a way for libraries to be allowed to use -Bsymbolic and -fvisibility=protected by making applications never use copy relocations. Applications should resolve symbols coming from libraries via indirect, position-independent addressing.
We are ok with tagging every symbol in question with a new __attribute__ (they are already all tagged with __attribute__((visibility("default")))).
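The invariant at stake in this report can be sketched in a few lines. This is a single-file illustration only, not code from Qt. In a real build the "library side" would be a shared object linked with -Bsymbolic and the "application side" a separate executable. The names demoCounter, demoCounterAddressSeenByLibrary, demoCounterAddressSeenByApp and checkCounterIdentity are hypothetical. Compiled as one translation unit, the two addresses trivially agree. The situation described above is the one where, after a copy relocation in the executable, they would not.

```cpp
#include <cassert>

// Export macro using the attribute mentioned in the report (GCC/Clang syntax).
#define DEMO_EXPORT __attribute__((visibility("default")))

// ---- library side (would live in a hypothetical libdemo.so) ----

// Exported variable: the kind of symbol an executable may copy-relocate.
DEMO_EXPORT int demoCounter = 0;

// With -Bsymbolic, references from inside the library bind to the
// library's own definition at link time.
DEMO_EXPORT int *demoCounterAddressSeenByLibrary()
{
    return &demoCounter;
}

DEMO_EXPORT void demoIncrement()
{
    ++demoCounter;
}

// ---- application side (would live in the executable) ----

// With a copy relocation, this address would point into the executable's
// own data instead of the library's.
int *demoCounterAddressSeenByApp()
{
    return &demoCounter;
}

// The property the report wants to hold: both sides see a single object.
void checkCounterIdentity()
{
    assert(demoCounterAddressSeenByLibrary() == demoCounterAddressSeenByApp());
}
```

If the two halves were split into a shared library and an executable, calling checkCounterIdentity() from the application would be one way to observe whether the library and the application still agree on the variable's address.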
[Bug target/65886] [5/6 Regression] Copy reloc in PIE incompatible with DSO created by -Wl,-Bsymbolic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886 --- Comment #3 from Thiago Macieira --- Thanks H.J.! Can I ask that -fsymbolic be the default? Otherwise, code with -fPIE MUST add -fsymbolic in GCC 5+, but can't add it prior because the option didn't exist. Please leave that for a release or two so that we can adapt buildsystems.