[Bug c++/40145] New: structure inside a static function is exported, producing warning

2009-05-14 Thread thiago at kde dot org
Given the following code:

===
struct EditorInternalCommand { };

static void createCommandMap()
{
struct CommandEntry { EditorInternalCommand command; };
}
===

The structure createCommandMap()::CommandEntry is exported from a local-scope
(static) function. When compiling the code above with -fvisibility=hidden, g++
4.3 or 4.4 outputs the following warning:

visibility.cpp:5: warning: 'createCommandMap()::CommandEntry' declared with
greater visibility than the type of its field
'createCommandMap()::CommandEntry::command'

If I add constructors to both structures so that there are symbols emitted, the
ELF symbol table looks like this:
 6:      22 FUNC LOCAL  DEFAULT 2 _ZZL16createCommandMapvEN12CommandEntryC1Ev
 7: 0016 16 FUNC LOCAL  DEFAULT 2 _ZL16createCommandMapv
12:       5 FUNC WEAK   HIDDEN  6 _ZN21EditorInternalCommandC1Ev

My understanding of the problem is that a "static" function has LOCAL scope but
DEFAULT visibility. The inner structure inherits these properties. However, the
outer structure (EditorInternalCommand) has HIDDEN visibility, which triggers
the warning.

However, since the binding scope is LOCAL, those symbols will not be exported
by the linker in the final ELF object anyway, which effectively gives them
hidden visibility.


Workarounds:

Any of the following three actions make the warning disappear:
1) remove the "static" keyword
2) move the inner structure outside the static function
3) place the outer structure in an anonymous namespace

Actions #2 and #3 above change those properties, making both constructors match
either LOCAL/DEFAULT or WEAK/HIDDEN. Action #1 causes the function to become
GLOBAL/HIDDEN, but leaves the inner structure unchanged -- however, the warning
is gone too.
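
For illustration, workaround 2 could look like the sketch below. Only the two
structure names and createCommandMap() come from the report; the 'id' member,
the return value and demoCommandMap() are hypothetical additions so that the
snippet has something to check.

```cpp
#include <cassert>

struct EditorInternalCommand { int id = 0; };   // 'id' is a hypothetical member

// Workaround 2: the structure now lives at namespace scope instead of inside
// the static function, so it no longer inherits properties from that function.
struct CommandEntry { EditorInternalCommand command; };

static int createCommandMap()
{
    CommandEntry entry;
    entry.command.id = 7;       // arbitrary value for the demonstration
    return entry.command.id;
}

// Hypothetical helper: the static function is used exactly as before.
static void demoCommandMap()
{
    assert(createCommandMap() == 7);
}
```

Whether the warning actually disappears depends on the compiler version; the
report only covers g++ 4.3 and 4.4.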


-- 
   Summary: structure inside a static function is exported,
producing warning
   Product: gcc
   Version: 4.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: thiago at kde dot org
 GCC build triplet: i586-manbo-linux-gnu
  GCC host triplet: i586-manbo-linux-gnu
GCC target triplet: i586-manbo-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40145



[Bug target/96238] New: [i386] cpuid.h header needs include guards

2020-07-17 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96238

Bug ID: 96238
   Summary: [i386] cpuid.h header needs include guards
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

$ cat x.c
#include <cpuid.h>
#include <cpuid.h>
$ gcc -c x.c
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:228:1: error: redefinition
of ‘__get_cpuid_max’
  228 | __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
  | ^~~
In file included from :32:
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:228:1: note: previous
definition of ‘__get_cpuid_max’ was here
  228 | __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
  | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:283:1: error: redefinition
of ‘__get_cpuid’
  283 | __get_cpuid (unsigned int __leaf,
  | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:283:1: note: previous
definition of ‘__get_cpuid’ was here
  283 | __get_cpuid (unsigned int __leaf,
  | ^~~
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:300:1: error: redefinition
of ‘__get_cpuid_count’
  300 | __get_cpuid_count (unsigned int __leaf, unsigned int __subleaf,
  | ^
/usr/lib64/gcc/x86_64-suse-linux/10/include/cpuid.h:300:1: note: previous
definition of ‘__get_cpuid_count’ was here
  300 | __get_cpuid_count (unsigned int __leaf, unsigned int __subleaf,
  | ^

[Bug target/95483] [i386] Missing SIMD functions

2020-08-04 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483

--- Comment #2 from Thiago Macieira  ---
Hello Evan

I was about to report that _mm_loadu_epi16 is missing, but I'm glad you've got
a more complete listing. FYI, here's a Godbolt link showing ICC and Clang with
this intrinsic: https://gcc.godbolt.org/z/8nMcPE. I'll only have to report to
Microsoft and will reference this bug report so they check their own
implementation.

FYI, for anyone stumbling upon this report when their code failed: most of the
missing intrinsics can be worked around by combining one or more and will
result in the same code.

(In reply to Evan Nemerson from comment #0)
> Here is the list:
> 
>   AVX _mm256_cvtsi256_si32
>   AVX-512 _mm512_cvtsi512_si32

_mm256_extract_epi32
 or
_mm_cvtsi128_si32(_mm256_castsi256_si128(x))

Ditto for 512-bit.

>   AVX2 _mm_broadcastsd_pd

If using AVX2 is acceptable, one can use _mm_broadcastq_epi64 with suitable
casting between __m128i and __m128d.

>   AVX2 _mm_broadcastsi128_si256

Looks like a typo in the list; this intrinsic exists with the _mm256 prefix,
which is how it should be spelled.

>   AVX-512 _mm512_storeu_epi16
>   AVX-512 _mm512_storeu_epi8
>   AVX-512 _mm256_storeu_epi16
>   AVX-512 _mm256_storeu_epi8
>   AVX-512 _mm_storeu_epi16
>   AVX-512 _mm_storeu_epi8
>   AVX-512 _mm512_loadu_epi16
>   AVX-512 _mm512_loadu_epi8
>   AVX-512 _mm256_loadu_epi16
>   AVX-512 _mm256_loadu_epi8
>   AVX-512 _mm_loadu_epi16
>   AVX-512 _mm_loadu_epi8
>   AVX-512 _mm256_store_epi32
>   AVX-512 _mm_store_epi32
>   AVX-512 _mm256_loadu_epi64
>   AVX-512 _mm256_loadu_epi32
>   AVX-512 _mm_loadu_epi64
>   AVX-512 _mm_loadu_epi32
>   AVX-512 _mm256_load_epi64
>   AVX-512 _mm256_load_epi32
>   AVX-512 _mm_load_epi64
>   AVX-512 _mm_load_epi32

All of these can be implemented as the mask (for storing) or maskz (for
loading) equivalents with a mask of ~0 (UINT64_MAX for the epi8 ones). For
example
  _mm256_loadu_epi16(ptr)
becomes
  _mm256_maskz_loadu_epi16(~0, ptr)

>   AVX-512 _mm_cvtsd_i32
>   AVX-512 _mm_cvtsd_i64
>   AVX-512 _mm_cvtss_i32
>   AVX-512 _mm_cvtss_i64
>   AVX-512 _mm_cvti32_sd
>   AVX-512 _mm_cvti64_sd
>   AVX-512 _mm_cvti32_ss
>   AVX-512 _mm_cvti64_ss

Not sure why those are needed; they generate the same instruction as
_mm_cvtsX_siYY. Clang's header is even:

#define _mm_cvtss_i32 _mm_cvtss_si32
#define _mm_cvtsd_i32 _mm_cvtsd_si32
#define _mm_cvti32_sd _mm_cvtsi32_sd
#define _mm_cvti32_ss _mm_cvtsi32_ss
#ifdef __x86_64__
#define _mm_cvtss_i64 _mm_cvtss_si64
#define _mm_cvtsd_i64 _mm_cvtsd_si64
#define _mm_cvti64_sd _mm_cvtsi64_sd
#define _mm_cvti64_ss _mm_cvtsi64_ss
#endif

ICC does the same.

>   SSE _mm_storeu_si16
>   SSE2 _mm_storeu_si32

With casting of the pointer:
*dest = _mm_cvtsi128_si16(mm)

If the casting is too scary or triggers aliasing warnings, then:

  uintXX_t val = _mm_cvtsi128_siXX(mm);
  memcpy(dest, &val, sizeof(val));

GCC optimises the memcpy and reg-reg MOVD into a single MOVD into memory.
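
A minimal sketch of the memcpy variant for the 32-bit case, assuming an x86
target with SSE2. The names store_u32, storeu_si32_sketch and demo_store are
hypothetical; only _mm_cvtsi128_si32 and _mm_cvtsi32_si128 are real intrinsics.
The non-SSE2 branch exists only so the snippet still builds elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: 32-bit store to a possibly unaligned address,
// done with memcpy so that no pointer cast (and no aliasing issue) is needed.
static inline void store_u32(void *dest, std::uint32_t val)
{
    std::memcpy(dest, &val, sizeof(val));
}

#if defined(__SSE2__)
// Hypothetical stand-in for a missing _mm_storeu_si32: extract the low
// 32 bits of the vector, then store them through memcpy.
static inline void storeu_si32_sketch(void *dest, __m128i mm)
{
    store_u32(dest, static_cast<std::uint32_t>(_mm_cvtsi128_si32(mm)));
}
#endif

// Stores into the middle of a byte buffer and checks the neighbours.
static inline bool demo_store()
{
    unsigned char buf[6] = { 0xAA, 0, 0, 0, 0, 0xAA };
#if defined(__SSE2__)
    storeu_si32_sketch(buf + 1, _mm_cvtsi32_si128(0x04030201));
#else
    store_u32(buf + 1, 0x04030201u);
#endif
    std::uint32_t back;
    std::memcpy(&back, buf + 1, sizeof(back));
    return back == 0x04030201u && buf[0] == 0xAA && buf[5] == 0xAA;
}
```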

>   SSE _mm_loadu_si16
>   SSE2 _mm_loadu_si32

Ditto for the _mm_cvtsiXX_si128.

[Bug target/90129] Wrong error: inlining failed in call to always_inline ‘_mm256_adds_epi16’: target specific option mismatch

2020-09-03 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90129

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #3 from Thiago Macieira  ---
Another test:

$ cat test.c
#include <immintrin.h>

__attribute__((target("arch=haswell")))
int hsw_test32(float f)
{
__m128 m = _mm_set_ss(f);
m = _mm_cmpeq_ss(m, m);
return _mm_movemask_ps(m);
}
$ gcc -c test.c 
In file included from
/usr/lib64/gcc/x86_64-suse-linux/10/include/immintrin.h:29,
 from test.c:1:
test.c: In function ‘hsw_test32’:
/usr/lib64/gcc/x86_64-suse-linux/10/include/xmmintrin.h:814:1: error: inlining
failed in call to ‘always_inline’ ‘_mm_movemask_ps’: target specific option
mismatch
[...]
$ clang -c test.c && echo No error
No error
$ gcc -march=haswell -c test.c && echo No error
No error

[Bug c++/92400] New: Incorrect selection of constructor overload for brace list

2019-11-06 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92400

Bug ID: 92400
   Summary: Incorrect selection of constructor overload for brace
list
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Godbolt link: https://gcc.godbolt.org/z/bckr_n
Testcase:

#include <initializer_list>

struct A;
struct V
{
V() = default;
V(const A &);
A make_a() const;
};

struct A
{
A();
A(const A &);
A(A &&);
A(std::initializer_list<V>);
};

void sink(A &);
void f()
{
A a{ V().make_a() };
sink(a);
}

When compiled, GCC generates a call to A::A(std::initializer_list<V>). The
three other compilers in the test do not -- ICC has a call to A::A(A&&) and
Clang can be made to have that call with -std=c++14 -fno-elide-constructors.

See
https://wg21.link/cwg1631 - CWG1631: Incorrect overload resolution for
single-element initializer-list 
https://wg21.link/cwg1467 - CWG1467: List-initialization of aggregate from
same-type object
https://wg21.link/cwg2137 - CWG2137: List-initialization from object of same
type

[Bug c++/92400] Incorrect selection of constructor overload for brace list

2019-11-06 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92400

Thiago Macieira  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #2 from Thiago Macieira  ---
Yes, it's the same.

*** This bug has been marked as a duplicate of bug 85577 ***

[Bug c++/85577] list-initialization chooses initializer-list constructor

2019-11-06 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85577

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #8 from Thiago Macieira  ---
*** Bug 92400 has been marked as a duplicate of this bug. ***

[Bug c++/92855] New: -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions

2019-12-07 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

Bug ID: 92855
   Summary: -fvisibility-inlines-hidden failing to hide
out-of-line copies of certain inline member functions
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Related to bug 47877 and bug 45065, but apparently different. Issue probably
goes back a long time.

We're compiling with -fvisibility=hidden -fvisibility-inlines-hidden and expect
that any inline functions used by libstdc++ to perform its job are hidden and
not exported from our library. Unfortunately, GCC is failing to hide some of
those functions and they can be seen with eu-readelf -s in the library output,
where they appear as "WEAK  DEFAULT".

This is currently not expected to be a big problem, since these functions *are*
inline and therefore expected to be emitted in any user code that needed to use
them. They just cause our symbol table to be bigger than it needs to be.

Testcase:
#include <future>

class QThreadCreateThread
{
public:
explicit QThreadCreateThread(std::future<void> &&future)
: m_future(std::move(future))
{
}

private:
virtual void run()
{
m_future.get();
}

std::future<void> m_future;
};

// QThread *QThread::createThreadImpl(std::future<void> &&future)
QThreadCreateThread *createThreadImpl(std::future<void> &&future)
{
return new QThreadCreateThread(std::move(future));
}

Compile with -O2 -fvisibility=hidden -fvisibility-inlines-hidden -fno-inline to
force no inlining. In the assembly output, there are plenty of inline functions
with .hidden and plenty without. For example, see
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():

.section .text.std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(),"axG",@progbits,std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(),comdat
.align 2
.p2align 4
.weak   std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
.type   std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(), @function
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
[...]

No .hidden present.

[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions

2019-12-08 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

--- Comment #3 from Thiago Macieira  ---
The symbol in question is inline, therefore -fvisibility-inlines-hidden should
trigger and cause it to become hidden too.

Testcase showing that GCC will apply that:

#define VISIBILITY(x) __attribute__((visibility(#x)))

namespace N VISIBILITY(default) {
void other();

inline void f()
{
other();
}

void g() { f(); }
}

If you compile this with -fno-inline to cause f() to be emitted, you'll see:

.section .text.N::f(),"axG",@progbits,N::f(),comdat
.p2align 4
.weak   N::f()
.hidden N::f()
.type   N::f(), @function
N::f():
jmp N::other()
.size   N::f(), .-N::f()

See: https://gcc.godbolt.org/z/nW3RbX

So I contend that the symbol should have been hidden and wasn't because of a
bug. Please reconsider.

[Bug c++/92855] -fvisibility-inlines-hidden failing to hide out-of-line copies of certain inline member functions

2019-12-08 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92855

--- Comment #5 from Thiago Macieira  ---
(In reply to Alexander Monakov from comment #4)
> (FWIW, making 'f' a template in your example makes it non-hidden)
> 
> Can you explain why you expect the command-line option to override the
> attribute on the namespace? GCC usually implements the opposite, i.e.
> attributes prevail over the defaults specified on the command line.
> 
> In your sample on Godbolt, Clang also appears to honour the attribute rather
> than the option.

And ICC does the opposite and hides everything. Either way, GCC's behaviour of
applying this to templates (which is bug 47877, so you may close as duplicate)
is unexpected and seems inconsistent.

I expect the emitted function to be hidden because it's inline and because of
-fvisibility-inlines-hidden. From the TexInfo manual:

 The effect of this is that GCC may, effectively, mark inline
 methods with '__attribute__ ((visibility ("hidden")))' so that they
 do not appear in the export table of a DSO and do not require a PLT
 indirection when used within the DSO.  Enabling this option can
 have a dramatic effect on load and link times of a DSO as it
 massively reduces the size of the dynamic export table when the
 library makes heavy use of templates.

Since the out-of-line copies of the inline functions will be emitted in every
TU that failed to inline them, and thus remain in every DSO, there's no need to
export them. Each DSO can call its own, local copy through PC-relative calls
and jumps.
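
As a sketch of what the manual text describes, the same effect can be requested
explicitly by putting the attribute on the inline function itself.
HIDDEN_INLINE_SKETCH is a hypothetical macro name, and other() is given a body
here only so that the snippet links on its own; the namespace and function
names follow the earlier testcase.

```cpp
#include <cassert>

#if defined(__GNUC__)
#  define HIDDEN_INLINE_SKETCH __attribute__((visibility("hidden")))
#else
#  define HIDDEN_INLINE_SKETCH   // no-op on compilers without the attribute
#endif

namespace N {
int other() { return 42; }

// Any out-of-line copy of f() is requested to be emitted as a hidden symbol.
HIDDEN_INLINE_SKETCH inline int f() { return other(); }

int g() { return f(); }
}
```

This does not help for inline functions that come from libstdc++ headers, which
is the case the report is about; it only makes the expected result explicit.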

For the particular problem at hand, which we're still debugging, see
https://bugreports.qt.io/browse/QTBUG-80535. The issue there is that certain
non-Qt symbols were exported by the DSO and thus got tagged with the ELF
version "Qt_5". That by itself is not a problem, but we've found that some
applications began referencing those symbols with that ELF version and we don't
understand why. The result is that the internal details of how something was
implemented became part of our ABI.

[Bug c/56446] New: Generate one fewer relocation when calling a checked weakref function

2013-02-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446



 Bug #: 56446
   Summary: Generate one fewer relocation when calling a checked
weakref function
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org





When you have code like:

static int f() __attribute__((weakref("foo")));
void g()
{
    int (*ptr)() = f;
    if (ptr)
        ptr();
}

which is typical for weakref functions, when compiled in PIC/PIE mode, gcc sees
through the variable and generates:

        cmpq    $0, f@GOTPCREL(%rip)
        je      .L1
        xorl    %eax, %eax
        jmp     f@PLT
.L1:
        ret

That means there will be two GOT entries for the "foo" symbol: one in the
actual GOT and one in the .plt.got (lazily initialised). Since the actual GOT
needs to have the address filled in at load time, there's no gain in lazy
initialisation -- in fact, there's a loss.

GCC could do exactly what the code is suggesting and load the actual address
onto a register and then use it. This would save one relocation, the indirect
PLT jumps and the loss in the lazy resolution.


[Bug tree-optimization/56446] [4.6/4.7/4.8 Regression] Generate one fewer relocation when calling a checked weakref function

2013-02-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446



--- Comment #3 from Thiago Macieira  2013-02-25 22:27:14 
UTC ---

This should not be done for non-PIC code. In that case, it might be preferable
to make the actual call, as opposed to an indirect jump.

I also wonder what would happen for a call that resolves back into the current
module. In those cases, keeping the indirect call would be unnecessary.
However, it also seems like an edge case to me: why is the symbol weak if it's
part of the module?


[Bug tree-optimization/56446] [4.6/4.7/4.8 Regression] Generate one fewer relocation when calling a checked weakref function

2013-02-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56446



--- Comment #4 from Thiago Macieira  2013-02-25 22:28:07 
UTC ---

One more detail: both ICC 13 and Clang 3.0 do the same thing as GCC.


[Bug middle-end/56574] False possibly uninitialized variable warning

2013-03-08 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56574



Thiago Macieira  changed:



   What|Removed |Added



 CC||thiago at kde dot org



--- Comment #3 from Thiago Macieira  2013-03-08 21:11:19 
UTC ---

Looking at the code that GCC generated (4.7.2 from Fedora and similarly with
pristine 4.8 trunk@196249):

%edi = flag; %eax = value
11          testl   %edi, %edi
12          je      .L12
    .L12 is where the call to get_value() is placed
13  .L2:
14          testl   %edi, %edi
15          sete    %dl
16          testl   %eax, %eax
    Here, EAX might be uninitialised
17          setne   %al
18          testb   %dl, %al
19          jne     .L3
    .L3 is an infinite loop
20          testb   %dl, %dl
21          jne     .L8
    .L8 is the function exit (the loop break)
    fall-through is an infinite loop

In other words, the warning is true: the generated code *is* using an
uninitialised variable.

The question is whether that is acceptable.

In order for EAX to be uninitialised, we must not have jumped to .L12. Since
the JE jump on line 12 was not taken, SETE must have set DL to 0 on line 15.
Then we test AL against DL on line 18: as DL is zero, the test sets ZF,
whatever the value of AL might be. That means the JNE jump on line 19 is not
taken.

The code will then proceed to the infinite loop.

Conclusion: it's just a bogus warning. It is correct from the point of view of
the assembly code that was generated, but the uninitialised value is never used
in any decisions.
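
The argument can be checked with a small model of lines 14 to 19 of the
listing. This is only a sketch of the flag logic as I read the assembly, not
the original source (which is not shown in this comment); the function names
are hypothetical.

```cpp
#include <cassert>

// Models lines 14-19 of the listing: returns true if "jne .L3" is taken.
static inline bool jumpsToL3(int flag, int value)
{
    const bool dl = (flag == 0);    // line 15: sete  %dl
    const bool al = (value != 0);   // line 17: setne %al
    return dl && al;                // lines 18-19: testb %dl, %al ; jne .L3
}

// True if the branch decision is the same for every sampled 'value',
// i.e. a possibly uninitialised EAX cannot influence it.
static inline bool independentOfValue(int flag)
{
    const int samples[] = { 0, 1, -1, 12345 };
    const bool first = jumpsToL3(flag, samples[0]);
    for (int v : samples) {
        if (jumpsToL3(flag, v) != first)
            return false;
    }
    return true;
}
```

With flag != 0 (the only path on which EAX is not set by get_value()), the
decision does not depend on value; with flag == 0 it does, but then EAX has
been initialised.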


[Bug c/56727] New: [4.7/4.8] [missed-optimization] Recursive call goes through the PLT unnecessarily

2013-03-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56727



 Bug #: 56727
   Summary: [4.7/4.8] [missed-optimization] Recursive call goes
through the PLT unnecessarily
Classification: Unclassified
   Product: gcc
   Version: 4.7.2
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org





Consider the following code, compiled with -O2 -fPIC either in C or in C++:

===
void f(TYPE b)
{
    f(0);
}
===

If TYPE is a type of 32- or 64-bit width (int, unsigned, long, long long), GCC
generates the following code (-m32, -mx32 or -m64):

===
f:
.L2:
        jmp     .L2
===

If TYPE is shorter than 32-bit (bool, _Bool, char, short), GCC generates the
following code (-mx32, -m64):

===
f:
        xorl    %edi, %edi
        jmp     f@PLT
===

and much worse code for -m32.

For whatever reason, GCC decided to place the call via the PLT. That's a
missed optimisation: if this function was called, then the PLT must resolve
back to itself. What's more, since the argument wasn't used, it's also
unnecessary to set it.

This output also happens without -fPIC, but in that case there is no PLT.

Tested on:
  GCC 4.7.2 (as shipped by Fedora 17)
  GCC 4.9 (trunk build from 20130318)

This is a contrived example (infinite recursion) that no one in their right
mind would write. But it might point to missed optimisations in legitimate
recursive functions.


[Bug c++/57064] New: [clarification requested] Which overload with ref-qualifier should be called?

2013-04-24 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



 Bug #: 57064
   Summary: [clarification requested] Which overload with
ref-qualifier should be called?
Classification: Unclassified
   Product: gcc
   Version: 4.8.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org





I'm not sure this is a bug. I am requesting clarification on the behaviour.

From the C++11 standard (13.3.3.2 [over.ics.rank] p 3):

struct A {
    void p() &;
    void p() &&;
};

void f()
{
    A a;
    a.p();
    A().p();
}

GCC 4.8.1 correctly calls the lvalue-ref overload first, then the rvalue
overload second.

Now suppose the following function:

void g(A &&a)
{
    a.p();
}

Which overload should GCC call? This is my request for clarification. I
couldn't find anything specific in the standard that would help explain one way
or the other.

Intuitively, it would be the rvalue overload, but gcc calls the lvalue overload
instead. Making it:

    std::move(a).p();

does not help.


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-24 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #1 from Thiago Macieira  2013-04-25 00:45:00 
UTC ---

Here's why I'm asking:

QString has members like:

    QString arg(int, [other parameters]) const;

Which are used like so:

    return QString("%1 %2 %3 %4").arg(42).arg(47).arg(-42).arg(-47);
    // returns "42 47 -42 -47"

Right now, each call creates a new temporary, which is required to do memory
allocation. I'd like to avoid the new temporaries by simply reusing the
existing ones:

    QString arg(int, [...]) const &; // returns a new copy
    QString &&arg(int, [...]) &&; // modifies this object, return *this;

When these two overloads are present, every other call will be to rvalue-ref
and the others to lvalue-ref. That is, the first call (right after the
constructor) calls arg()&&, which returns an rvalue-ref. The next call will be
to arg()&, which returns a temporary, making the third call to arg()&& again.

I can get the desired behaviour by using the overloads:

    QString arg(int, [...]) const &; // returns a new copy
    QString arg(int, [...]) &&; // returns a moved temporary via return std::move(*this);

However, the side-effect of that is that we still have 4 temporaries too many,
albeit empty (moved-out) ones. You can see this by counting the number of calls
to the destructor:

$ ~/gcc4.8/bin/g++ -fverbose-asm -fno-exceptions -fPIE -std=c++11 -S -o - \
    -I$QTOBJDIR/include /tmp/test.cpp | grep -B1 call.*QStringD
        movq    %rax, %rdi      # tmp82,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp83,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp84,
        call    _ZN7QStringD1Ev@PLT     #
--
        movq    %rax, %rdi      # tmp85,
        call    _ZN7QStringD1Ev@PLT     #


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-24 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #2 from Thiago Macieira  2013-04-25 00:45:39 
UTC ---

This was a self-compiled, pristine GCC



gcc version 4.8.1 20130420 (prerelease) (GCC) 

trunk at 198107


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-24 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #3 from Thiago Macieira  2013-04-25 00:53:20 
UTC ---

One more note. Given:

void p(A &);
void p(A &&);

void f(A &&a)
{
    p(a);
}

like the member function case, this calls p(A &). It's slightly surprising at
first glance, but is a known and documented case.

Unlike the member function case, if you do

    p(std::move(a));

it will call p(A &&).
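
A self-contained version of the free-function case, as a sketch: the Picked
enum and the callPlain/callMoved wrappers are hypothetical names added so the
chosen overload can be observed.

```cpp
#include <cassert>
#include <utility>

struct A {};

enum class Picked { LvalueRef, RvalueRef };   // hypothetical result marker

inline Picked p(A &)  { return Picked::LvalueRef; }
inline Picked p(A &&) { return Picked::RvalueRef; }

// 'a' is a named variable, so the expression 'a' is an lvalue
// even though its declared type is A &&.
inline Picked callPlain(A &&a) { return p(a); }

// std::move(a) turns it back into an rvalue (an xvalue).
inline Picked callMoved(A &&a) { return p(std::move(a)); }
```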


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-24 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #6 from Thiago Macieira  2013-04-25 06:51:33 
UTC ---

void f(A &&a)
{
    std::move(a).p();
}

_Z1fO1A:
        .cfi_startproc
        jmp     _ZNR1A1pEv@PLT  #
        .cfi_endproc

Then this looks like a bug in 4.8.1.

But then are we in agreement that a.p() in that function above should call the
lvalue-ref overload? It does make the feature slightly less useful for me. It
would require writing:

    return std::move(std::move(std::move(std::move(QString("%1 %2 %3 %4")
                                                   .arg(42))
                                         .arg(47))
                               .arg(-42))
                     .arg(-47));


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #8 from Thiago Macieira  2013-04-25 07:13:44 
UTC ---

Hmm... this might be an effect of the same bug. Can you try this on 4.9?

struct A {
    A p() const &;
    A &&p() &&;
};

void f()
{
    A().p().p();
}

I get:
        leaq    15(%rsp), %rdi  #, tmp60
        call    _ZNO1A1pEv@PLT  #
        movq    %rax, %rdi      # D.69575,
        call    _ZNKR1A1pEv@PLT #

Is this second call supposed to be to R? If it's to O, it's exactly what I need
to make the feature useful.


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #10 from Thiago Macieira  2013-04-25 
07:34:21 UTC ---

Great! That changes everything. Now I can provide a mutating arg() overload.

I'll just need some #ifdef and build magic to add the R, O overloads without
removing the  overloads that already exist (binary compatibility). It
would have been nicer if the lvalue ref overload didn't get extra decoration.


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-25 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #14 from Thiago Macieira  2013-04-26 
06:16:04 UTC ---

Understood. The idea is that one would write:

  QString str = QString("%1 %2").arg(42).arg(43);
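
To make the idea concrete, here is a toy sketch of the two overloads. Text is a
hypothetical stand-in for QString built on std::string, and its marker
replacement is a simplification (first occurrence of the lowest-numbered
%1..%9); none of this is Qt's actual implementation.

```cpp
#include <cassert>
#include <string>
#include <utility>

class Text   // hypothetical stand-in for QString
{
public:
    explicit Text(std::string s) : data(std::move(s)) {}

    // lvalue overload: leaves *this untouched and returns a modified copy
    Text arg(int value) const &
    {
        Text copy(*this);
        copy.replaceLowest(value);
        return copy;
    }

    // rvalue overload: modifies the temporary in place, no new object
    Text &&arg(int value) &&
    {
        replaceLowest(value);
        return std::move(*this);
    }

    const std::string &str() const { return data; }

private:
    // Replaces the first occurrence of the lowest-numbered marker %1..%9.
    void replaceLowest(int value)
    {
        for (char digit = '1'; digit <= '9'; ++digit) {
            const std::string marker = { '%', digit };
            const std::string::size_type pos = data.find(marker);
            if (pos != std::string::npos) {
                data.replace(pos, marker.size(), std::to_string(value));
                return;
            }
        }
    }

    std::string data;
};

// Both arg() calls operate on the same temporary; 'str' is then
// move-constructed from it before the temporary is destroyed.
inline std::string demoArg()
{
    Text str = Text("%1 %2").arg(42).arg(43);
    return str.str();
}
```

The rvalue overload returns a reference to the temporary, so the result has to
be consumed within the same full expression, as in the line above.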


[Bug c++/57064] [clarification requested] Which overload with ref-qualifier should be called?

2013-04-26 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57064



--- Comment #16 from Thiago Macieira  2013-04-26 
13:45:35 UTC ---

Thanks for the hint.

However, returning an rvalue, even if moved-onto, will generate code for the
destructor. It's not a matter of efficiency, just of code size.

Anyway, I'll do some benchmarks, after I figure out how to work around the
binary compatibility break imposed by having the & in the function that already
existed.


[Bug other/57202] New: Please make the intrinsics headers like immintrin.h be usable without compiler flags

2013-05-07 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202



 Bug #: 57202
   Summary: Please make the intrinsics headers like immintrin.h be
usable without compiler flags
Classification: Unclassified
   Product: gcc
   Version: 4.8.1
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: other
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org





Please make all headers for intrinsics be includable without special compiler
flags.

In other words, I want the following to work:

$ gcc -fsyntax-only -include smmintrin.h -xc /dev/null
In file included from :0:0:
/usr/lib/gcc/x86_64-redhat-linux/4.7.2/include/smmintrin.h:31:3: error: #error
"SSE4.1 instruction set not enabled"

Note it works with ICC:
$ icc -fsyntax-only -include smmintrin.h -xc /dev/null && echo works
works

Not only that, please make all the intrinsics functions be defined and ready to
be used.

This is necessary so that the following source file could compile even if
-msse4.1 is not passed on the command-line (adapted from
http://gcc.gnu.org/gcc-4.8/changes.html):

#include 

__attribute__ ((target ("default")))
int foo(void)
{
  return 1;
}

__attribute__ ((target ("sse4.2")))
int foo(void)
{
  __m128i v;
  _mm_blendv_epi8(v, v, v);
  return 2;
}

There are several reasons for that, number one among them that it makes the GCC
4.8 feature above actually useful for non-trivial code. Also, passing extra
options on the command-line is simply not an option for C++ code (where the
feature above is useful) if that code is moderately complex and uses inline
functions, and those options cannot be used if LTO is to be used (bug 54231).


[Bug target/57202] Please make the intrinsics headers like immintrin.h be usable without compiler flags

2013-05-08 Thread thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202



Thiago Macieira  changed:



   What|Removed |Added



 Target|x86_64-*-* i?86-*-* |x86_64-*-* i?86-*-* arm-*-*



--- Comment #1 from Thiago Macieira  2013-05-08 07:03:26 
UTC ---

This also applies to arm_neon.h.


[Bug c/54202] New: Overeager warning about freeing non-heap objects

2012-08-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

 Bug #: 54202
   Summary: Overeager warning about freeing non-heap objects
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org


GCC 4.7 has a warning about freeing non-heap objects that is way too eager.

Compiling the following code in C or C++:
==
#include <stdlib.h>

typedef struct Data
{ 
int refcount;
} Data; 
extern const Data shared_null;

Data *allocate()
{
return (Data *)(&shared_null);
} 

void dispose(Data *d)
{
if (d->refcount == 0)
free(d);
}

void f()
{
Data *d = allocate();
dispose(d);
}


Produces the following warning:

test.c: In function 'f'
test.c:17:13: warning: attempt to free a non-heap object 'shared_null'
[-Wfree-nonheap-object]

The warning is overeager because it says "attempt to free" without indicating
that it's only a possibility. GCC cannot prove that the call to free() will
happen with that particular pointer, as the value of shared_null.refcount is
not known.

The warning should either:
 a) be modified to indicate it's only a possibility and the compiler can't
prove it;
 b) be issued only when the compiler is sure that the free will happen on
non-heap objects.

Or both, by having two warnings: one for when it's sure and one for when it
isn't.


[Bug c/54202] Overeager warning about freeing non-heap objects

2012-08-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

--- Comment #2 from Thiago Macieira  2012-08-08 14:21:59 
UTC ---
To be honest, I don't want false-positive warnings. The code and data are
constructed so that it never frees the non-heap object (it has a reference
count of -1). If the driver to this warning can't be improved to be certain,
I'd recommend at least changing the text, like the -Wuninitialized one:

  'varname' may be used uninitialized in this function

When GCC warnings are assertive, like the "will break strict aliasing" one, we
go the extra mile to try and fix them.


[Bug c/54202] Overeager warning about freeing non-heap objects

2012-08-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

--- Comment #4 from Thiago Macieira  2012-08-08 14:53:13 
UTC ---
(In reply to comment #3)
> Note that even for the uninitialized use case we warn for functions
> that may be never executed at runtime.  So - are you happy with the
> definitive warning if the free () call happens unconditionally when
> the function is entered?

I'm not sure I follow your reasoning. Please bear with me.

If GCC can prove that the function will be called with a non-heap object, print
the warning, even if the function in question never gets executed. That is,
after inlining, code like:

extern Data shared_null;
void dispose()
{
free(&shared_null);
}

*should* print the warning, regardless of whether dispose() ever gets run.

My point was that the code that GCC was seeing, after inlining, was:

void f()
{
if (shared_null.refcount == 0)
   free(&shared_null);
}

In which case, the call to free() isn't unconditional. In this case, the
warning should either be suppressed, or indicate that it's only a possibility
instead of being assertive.
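
For illustration, a minimal sketch of the kind of design described in comment
#2, where the static object carries a reference count of -1 so that the free()
branch is never taken for it. The helper names (allocate_shared, allocate_heap,
deref) are assumptions made for this example and are not the code that
triggered the warning:

```cpp
#include <cassert>
#include <cstdlib>

struct Data {
    int refcount;
};

// Statically allocated sentinel; a refcount of -1 marks it as "never free".
static Data shared_null = { -1 };

// Hands out the shared static object (hypothetical helper).
Data *allocate_shared()
{
    return &shared_null;
}

// Allocates a heap object owned by a single reference (hypothetical helper).
Data *allocate_heap()
{
    Data *d = static_cast<Data *>(std::malloc(sizeof(Data)));
    if (d)
        d->refcount = 1;
    return d;
}

// Drops one reference. Returns true only if the object was actually freed.
bool deref(Data *d)
{
    assert(d != nullptr);
    if (d->refcount == -1)      // static sentinel: never modified, never freed
        return false;
    if (--d->refcount == 0) {
        std::free(d);
        return true;
    }
    return false;
}
```

After inlining, the compiler sees a free() that is reachable in the source but
guarded by a run-time value, which is the situation where a "may" wording
would be more accurate than an assertive one.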


[Bug c/54231] New: LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

 Bug #: 54231
   Summary: LTO generates code for the wrong CPU if different
options used
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org


Created attachment 27992
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27992
Makefile

Summary:

Given the following code:

=
#include 

void BZERO(char *ptr, size_t count)
{
__m128i zero = _mm_set1_epi8(0);
while (count--) {
_mm_stream_si128((__m128i*)ptr, zero);
ptr += 16;
}
}
=

When compiled twice, once for SSE2 and once for AVX (so we get VEX-prefixed
code), under LTO gcc will generate both cases using VEX. See the attached
Makefile.

Long description:

A library or program that attempts to determine at runtime whether certain CPU
features, like AVX support, are available may need to compile different
compilation units with different compiler flags. The example I am providing is
a simple function that zeroes out a segment of memory aligned to 16 bytes. Both
variants come from the same source file, which is compiled twice, but that does
not seem to be relevant.

The idea is that each of these two functions would be called by a dispatcher
function, after verifying the result of CPUID.

However, if you compile the code with LTO (e.g., by make CFLAGS=-flto with the
attached Makefile), GCC will apply the highest CPU setting to all compilation
units. This defeats the runtime detection technique: in this example, both
functions will contain AVX code, which would end up being run on computers
without AVX support.

This might be intentional. If so, please close this bug report.

However, I would recommend that the behaviour be fixed: the ability to use LTO
with different CPU settings would allow for better inlining of the functions
and suppressing unnecessary function calls. The bzero example is a good one.


[Bug c/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #1 from Thiago Macieira  2012-08-11 22:30:50 
UTC ---
Created attachment 27993
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27993
main.c


[Bug c/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #2 from Thiago Macieira  2012-08-11 22:33:31 
UTC ---
When adding the following source file to the library build:

#include 
void bzero_sse2(char *, size_t);
void bzero_avx(char *, size_t);

extern int avx_supported;

void my_bzero(char *ptr, size_t n)
{
if (avx_supported)
bzero_avx(ptr, n);
else
bzero_sse2(ptr, n);
}


and compiling everything with -O2 -flto, GCC produces the following function:

02e0 :
 2e0:   mov0x200171(%rip),%rax# 200458 
 2e7:   mov(%rax),%eax
 2e9:   test   %eax,%eax
 2eb:   jne310 
 2ed:   test   %rsi,%rsi
 2f0:   vpxor  %xmm0,%xmm0,%xmm0
 2f4:   je 30e 
 2f6:   nopw   %cs:0x0(%rax,%rax,1)
 300:   vmovntdq %xmm0,(%rdi)
 304:   add$0x10,%rdi
 308:   sub$0x1,%rsi
 30c:   jne300 
 30e:   repz retq 
 310:   test   %rsi,%rsi
 313:   je 30e 
 315:   vpxor  %xmm0,%xmm0,%xmm0
 319:   nopl   0x0(%rax)
 320:   vmovntdq %xmm0,(%rdi)
 324:   add$0x10,%rdi
 328:   sub$0x1,%rsi
 32c:   jne320 
 32e:   repz retq 

As can be seen, VEX-prefixed instructions were used in both cases.


[Bug c/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #3 from Thiago Macieira  2012-08-11 22:36:20 
UTC ---
Another note: it appears the Intel compiler has the same bug. It produces the
following code when compiling with -O2 -ipo:


0340 :
 340:   dec%rsi
 343:   mov0x2001ae(%rip),%rax# 2004f8 <_DYNAMIC+0xe0>
 34a:   vpxor  %xmm0,%xmm0,%xmm0
 34e:   cmpl   $0x0,(%rax)
 351:   je 36c 
 353:   cmp$0x,%rsi
 357:   je 383 
 359:   dec%rsi
 35c:   vmovntdq %xmm0,(%rdi)
 360:   add$0x10,%rdi
 364:   cmp$0x,%rsi
 368:   jne359 
 36a:   jmp383 
 36c:   cmp$0x,%rsi
 370:   je 383 
 372:   dec%rsi
 375:   vmovntdq %xmm0,(%rdi)
 379:   add$0x10,%rdi
 37d:   cmp$0x,%rsi
 381:   jne372 
 383:   retq   
 384:   nopl   0x0(%rax,%rax,1)
 389:   nopl   0x0(%rax)

Note, additionally, that there's an instruction-scheduling issue: a VPXOR
instruction was scheduled before the test of the CPU features.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-11 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #6 from Thiago Macieira  2012-08-11 23:23:39 
UTC ---
(In reply to comment #5)
> "Fixing" this in the compiler isn't straight-forward. The _mm_stream functions
> are just wrappers around builtin functions. It may work correctly if you put
> the bzero functions in two separate files or call the builtins directly (a
> variant of __builtin_ia32_movntdq in this case), but the way your BZERO is
> defined, I don't think it will ever work.

They *are* in separate files already. Calling the builtin directly instead of
the intrinsic wrapper might work, but I did not test it because it's not
acceptable, as the code would be GCC-specific.

> Have you considered using ifunc?

IFUNC is also irrelevant: in order to use it, I need to have two separate
source files which are compiled with different compiler settings, so we end up
where we started: the bzero_sse2() function will have AVX code.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #9 from Thiago Macieira  2012-08-13 09:44:51 
UTC ---
(In reply to comment #8)
> If you do something like
> 
>  gcc -c t1.c -mavx -flto
>  gcc -c t2.c -msse2 -flto
>  gcc t1.o t2.o -flto
> 
> then the link step will use -mavx -msse2, that is, target options are
> concatenated.

Indeed.

What I'm asking for is that each source file be compiled with its own target
options. I realise this is a request for enhancement, though.


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #10 from Thiago Macieira  2012-08-13 
09:53:32 UTC ---
Another test:

$ cat main_avx.c
#define BZERO bzero_avx
#pragma GCC target ("avx")
#include "main.c"

$ cat main_sse2.c
#define BZERO bzero_sse2
#pragma GCC target ("sse2")
#include "main.c"

$ cat main.c
#include 

void BZERO(char *ptr, size_t count)
{
__m128i zero = _mm_set1_epi8(0);
while (count--) {
_mm_stream_si128((__m128i*)ptr, zero);
ptr += 16;
}
}

$ gcc -flto -O2 -shared -o libtest.so main_avx.c main_sse2.c
$ objdump -Cdr --no-show-raw-insn libtest.so
[...]

0650 :
 650:   test   %rsi,%rsi
 653:   pxor   %xmm0,%xmm0
 657:   je 66e 
 659:   nopl   0x0(%rax)
 660:   movntdq %xmm0,(%rdi)
 664:   add$0x10,%rdi
 668:   sub$0x1,%rsi
 66c:   jne660 
 66e:   repz retq 

0670 :
 670:   test   %rsi,%rsi
 673:   pxor   %xmm0,%xmm0
 677:   je 68e 
 679:   nopl   0x0(%rax)
 680:   movntdq %xmm0,(%rdi)
 684:   add$0x10,%rdi
 688:   sub$0x1,%rsi
 68c:   jne680 
 68e:   repz retq


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #11 from Thiago Macieira  2012-08-13 
10:12:48 UTC ---
Attaching __attribute__((target("xxx"))) to the function does help.

It generates the following with the my_bzero function from comment 2:

02e0 :
 2e0:   test   %rsi,%rsi
 2e3:   vpxor  %xmm0,%xmm0,%xmm0
 2e7:   je 2fe 
 2e9:   nopl   0x0(%rax)
 2f0:   vmovntdq %xmm0,(%rdi)
 2f4:   add$0x10,%rdi
 2f8:   sub$0x1,%rsi
 2fc:   jne2f0 
 2fe:   repz retq 

0300 :
 300:   mov0x200171(%rip),%rax# 200478 
 307:   mov(%rax),%eax
 309:   test   %eax,%eax
 30b:   jne330 
 30d:   test   %rsi,%rsi
 310:   pxor   %xmm0,%xmm0
 314:   je 332 
 316:   nopw   %cs:0x0(%rax,%rax,1)
 320:   movntdq %xmm0,(%rdi)
 324:   add$0x10,%rdi
 328:   sub$0x1,%rsi
 32c:   jne320 
 32e:   repz retq 
 330:   jmp2e0 
 332:   repz retq 


This workaround might be useful for me in a few places where the code inlining
provided by LTO was desired (even though, in this example, the AVX variant is
exactly what it would be if no LTO had been used). But it won't work without
major changes to the code if I have 400+ functions in a file, plus possibly
inlines from headers, to be compiled.
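
A minimal sketch of the per-function attribute workaround, assuming GCC or
Clang on x86. It swaps the `avx_supported` variable from comment 2 for the
`__builtin_cpu_supports` builtin, which is an assumption of this example and
not what the attached test case uses:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// The target attribute selects the ISA per function, so both variants can
// live in one translation unit instead of relying on per-file -m flags.
__attribute__((target("sse2")))
void bzero_sse2(char *ptr, std::size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {                       // count is in 16-byte blocks
        _mm_stream_si128(reinterpret_cast<__m128i *>(ptr), zero);
        ptr += 16;
    }
}

__attribute__((target("avx")))
void bzero_avx(char *ptr, std::size_t count)
{
    __m128i zero = _mm_set1_epi8(0);
    while (count--) {                       // same source, VEX encoding allowed
        _mm_stream_si128(reinterpret_cast<__m128i *>(ptr), zero);
        ptr += 16;
    }
}

// Dispatcher: picks a variant after checking the CPU at run time.
void my_bzero(char *ptr, std::size_t count)
{
    // The streaming store requires a 16-byte aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0);
    if (__builtin_cpu_supports("avx"))
        bzero_avx(ptr, count);
    else
        bzero_sse2(ptr, count);
}
```

This keeps the dispatcher free to inline the SSE2 variant, but every function
needs its own annotation, which is the scaling problem mentioned above.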


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-08-13 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #13 from Thiago Macieira  2012-08-13 
12:13:40 UTC ---
(In reply to comment #12)
> Yes, there are similar option-related bugs for this.  Note somebody needs
> to sit down and document the desired semantics of combining translation
> units T1 and T2, compiled with different options OP1 and OP2, at link-time
> with options OP3.  Desired semantics including which cross-file optimizations
> (inlining?) are possible.

From my (admittedly restricted) point of view, inlining should be possible,
provided the following conditions hold:
 - when inlining a function with a "lower" optimisation / target setting, apply
the outer scope's setting to the inlined code
 - when inlining a function with a higher target requirement, inlining should
be done only in the sense of partial function splitting, prologue, epilogues,
constant propagation, etc.

In the case that I pasted, for example, I'd like GCC to realise that it has
already tested if the counter variable is 0, then forego that test in the
inlined, inner function.

Worst case scenario, simply forego inlining completely. Then the code would
simply be no worse than the non-LTO case.


[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue

2012-08-30 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

--- Comment #4 from Thiago Macieira  2012-08-30 07:52:31 
UTC ---
I'll post today.

I haven't yet looked up which mailing list you're even talking about.


[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue

2012-09-01 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

--- Comment #7 from Thiago Macieira  2012-09-01 08:26:05 
UTC ---
I posted the patches on Thursday, three patches because I found one more issue,
to both lists.

Will they be picked up from there and applied to the source tree?


[Bug libstdc++/54172] [4.7/4.8 Regression] __cxa_guard_acquire thread-safety issue

2012-09-04 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

--- Comment #9 from Thiago Macieira  2012-09-04 17:47:08 
UTC ---
(In reply to comment #8)
> (In reply to comment #7)
> > I posted the patches on Thursday, three patches because I found one more
> > issue, to both lists.
> 
> I haven't seen anything from you arrive on gcc-patches.
> But I will say that the patch attached here looks good.

http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02026.html
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02027.html
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg02028.html


[Bug c++/54485] g++ should diagnose default arguments in out-of-line definitions for template class member functions

2012-09-05 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54485

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #1 from Thiago Macieira  2012-09-05 07:27:08 
UTC ---
FYI

$ icpc -c a.cc


[Bug lto/54231] LTO generates code for the wrong CPU if different options used

2012-09-12 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231

--- Comment #14 from Thiago Macieira  2012-09-12 
13:02:23 UTC ---
From GCC's own manual:

(Node "Function attributes"):

 On the 386/x86_64 and PowerPC backends, the inliner will not
 inline a function that has different target options than the
 caller, unless the callee has a subset of the target options of
 the caller.  For example a function declared with `target("sse3")'
 can inline a function with `target("sse2")', since `-msse3'
 implies `-msse2'.
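
The subset rule quoted above can be modelled with bit masks. This is only a
sketch of the rule as the manual words it, not GCC's implementation; the
feature bits and the `can_inline` name are hypothetical, and the real
implication chain (for example everything that -mavx implies) is longer:

```cpp
#include <cstdint>

// Hypothetical feature bits for illustration; GCC tracks many more.
constexpr std::uint32_t FEATURE_SSE2 = 1u << 0;
constexpr std::uint32_t FEATURE_SSE3 = 1u << 1;
constexpr std::uint32_t FEATURE_AVX  = 1u << 2;

// Each target option set includes everything it implies (simplified).
constexpr std::uint32_t TARGET_SSE2 = FEATURE_SSE2;
constexpr std::uint32_t TARGET_SSE3 = TARGET_SSE2 | FEATURE_SSE3;
constexpr std::uint32_t TARGET_AVX  = TARGET_SSE3 | FEATURE_AVX;

// Inlining is allowed only if the callee needs nothing the caller lacks,
// i.e. the callee's options are a subset of the caller's.
constexpr bool can_inline(std::uint32_t caller, std::uint32_t callee)
{
    return (callee & ~caller) == 0;
}
```

Under this model an SSE3 caller may inline an SSE2 callee, while an SSE2
caller may not inline an AVX callee, which matches the dispatcher case in this
report.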


[Bug c++/54988] fpmath=sse target pragma causes inlining failure because of target specific option mismatch

2012-10-22 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54988

--- Comment #3 from Thiago Macieira  2012-10-22 14:43:11 
UTC ---
This might be as I pointed out in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54231:

(Node "Function attributes"):

 On the 386/x86_64 and PowerPC backends, the inliner will not
 inline a function that has different target options than the
 caller, unless the callee has a subset of the target options of
 the caller.  For example a function declared with `target("sse3")'
 can inline a function with `target("sse2")', since `-msse3'
 implies `-msse2'.

My guess was that we were forcing the inlining (via always_inline) of a
function that has different target options.

But I guess that doesn't explain why it happens only in C++ and only in
optimising mode. Does always_inline inline on -O0 too?


[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]

2010-12-22 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247

--- Comment #10 from Thiago Macieira  2010-12-22 
10:35:23 UTC ---
This is still not fixed. I can reproduce now with a different testcase, in
4.5.1. However, this time, the same code works fine in 4.4. The reason is again
accessing an array out-of-bounds for elements that we know to be there. Pay
attention to the way operator== is implemented in the following code.

If I compile it with -O1, it prints "true" as it should. If I compile it with
-O2, it prints "false". If I compile it with -O1 -finline-small-functions
-finline -findirect-inlining -fstrict-overflow and compare the disassembly with
-O2 and a suitable list of -fno-*, the code is exactly identical, except for
some instructions that should perform the copy of half of m1's data into m3. So
in the end the comparison fails due to comparing to garbage.

=== code ===
#include <stdio.h>

template <int N, int M, typename T>
class QGenericMatrix
{
public:
    QGenericMatrix();
    QGenericMatrix(const QGenericMatrix<N, M, T>& other);
    explicit QGenericMatrix(const T *values);

    bool operator==(const QGenericMatrix<N, M, T>& other) const;
private:
    T m[N][M];    // Column-major order to match OpenGL.

    QGenericMatrix(int) {}   // Construct without initializing identity matrix
};

template <int N, int M, typename T>
QGenericMatrix<N, M, T>::QGenericMatrix(const QGenericMatrix<N, M, T>& other)
{
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < M; ++row)
            m[col][row] = other.m[col][row];
}

template <int N, int M, typename T>
QGenericMatrix<N, M, T>::QGenericMatrix(const T *values)
{
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < M; ++row)
            m[col][row] = values[row * N + col];
}

template <int N, int M, typename T>
bool QGenericMatrix<N, M, T>::operator==(const QGenericMatrix<N, M, T>& other) const
{
    for (int index = 0; index < N * M; ++index) {
        if (m[0][index] != other.m[0][index])
            return false;
    }
    return true;
}

typedef double qreal;
typedef QGenericMatrix<2, 2, qreal> QMatrix2x2;

int main(int , char**)
{
    qreal m1Data[] = {0.0, 0.0, 0.0, 0.0};
    QMatrix2x2 m1(m1Data);

    QMatrix2x2 m3 = m1;
    puts((m1 == m3) ? "true" : "false");
}
=== code ===

common args: -fno-exceptions -fno-rtti -fverbose-asm -march=core2 -mfpmath=sse
(though x87 math also shows the same problem)

prints "true" with: -O1 -finline-small-functions -finline -findirect-inlining
-fstrict-overflow

prints "false" with: -O2 -fno-align-functions -fno-align-jumps
-fno-align-labels -fno-caller-saves -fno-tree-switch-conversion -fno-tree-vrp
-fno-crossjumping -fno-cse-follow-jumps -fno-expensive-optimizations -fno-gcse
-fno-ipa-cp -fno-ipa-sra -fno-optimize-register-move
-fno-optimize-sibling-calls -fno-peephole2 -fno-regmove -fno-reorder-blocks
-fno-reorder-functions -fno-rerun-cse-after-loop -fno-schedule-insns2
-fno-strict-aliasing -fno-strict-aliasing -fno-thread-jumps
-fno-tree-builtin-call-dce -fno-tree-pre


[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]

2010-12-22 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247

--- Comment #12 from Thiago Macieira  2010-12-22 
19:55:38 UTC ---
(In reply to comment #11)
> >The reason is again accessing an array out-of-bounds for elements that we 
> >know to be there.
> 
> No that is undefined and different from the original testcase.

Ok. Shall I open a new report with the new information?
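
For reference, a sketch of a comparison that stays within the bounds of each
inner array, avoiding the m[0][index] access with index >= M that comment #11
calls undefined. The free function `matrices_equal` is a hypothetical name
used only for this example:

```cpp
// Compares two N x M arrays element by element without ever indexing
// past the end of an inner array.
template <int N, int M, typename T>
bool matrices_equal(const T (&a)[N][M], const T (&b)[N][M])
{
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < M; ++row)
            if (a[col][row] != b[col][row])
                return false;
    return true;
}
```

The same nested loop could replace the body of operator== in the test case,
at the cost of no longer reproducing the optimiser behaviour being reported.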


[Bug c++/57854] New: Would like to have a warning for virtual overrides without C++11 "override" keyword

2013-07-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57854

Bug ID: 57854
   Summary: Would like to have a warning for virtual overrides
without C++11 "override" keyword
   Product: gcc
   Version: 4.8.1
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org

I would like a new (optional) warning that would point out every C++ virtual
override that is done without the C++11 keyword that indicates an override. By
necessity, this warning would only be permitted in C++11 mode.

The keyword was added so that developers would let the compiler know when an
override is intended. However, the [[base_check]] attribute was dropped from
C++11 prior to standardisation, so there's no way (currently) to ask the
compiler to let us know which classes are doing overrides without the keyword.

This warning should be printed for the following, otherwise perfectly correct, code:

struct Base {
virtual void v();
};
struct Derived: Base {
virtual void v(); // warning happens here
};

This warning should not be in -Wall. It should be in -Weffc++. I'll leave it up
to you whether it's in -Wextra.
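
To illustrate why flagging unmarked overrides is useful, here is a small
sketch (all type and function names are hypothetical). Without the keyword, a
signature mismatch silently declares a new virtual function; with it, the same
mismatch would be a compile error:

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }
};

struct Unmarked : Base {
    // Missing `const`: this does NOT override Base::value, it hides it.
    // Adding `override` here would turn the mistake into a compile error.
    virtual int value() { return 2; }
};

struct Marked : Base {
    // The keyword makes the compiler verify that an override really happens.
    int value() const override { return 3; }
};

// Calls through the base interface, as a user of the hierarchy would.
int through_base(const Base &b)
{
    return b.value();
}
```

The requested warning would point at every override that lacks the keyword, so
that a code base can reach the state where mistakes like `Unmarked` cannot
compile.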


[Bug libstdc++/54172] New: [4.7 Regression] __cxa_guard_acquire thread-safety issue

2012-08-04 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54172

 Bug #: 54172
   Summary: [4.7 Regression] __cxa_guard_acquire thread-safety
issue
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: libstdc++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: thi...@kde.org


Created attachment 27936
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27936
Proposed fix.

In commit 184110, the __cxa_guard_acquire implementation in libsupc++/guard.cc
has been updated to use the new __atomic_* intrinsics instead of the old
__sync_* ones. I believe this has introduced a regression due to a race
condition.

== Proof ==
While debugging a program, I set a hardware watchpoint on a guard variable and
set gdb to continue execution upon stop. The output was:

Hardware watchpoint 1:
_ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 0
New value = 256
0x00381205f101 in __cxxabiv1::__cxa_guard_acquire (g=0x77dc9a60)
at ../../../../libstdc++-v3/libsupc++/guard.cc:254
254 if (__atomic_compare_exchange_n(gi, &expected, pending_bit,
false,
Hardware watchpoint 1:
_ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 256
New value = 1
__cxxabiv1::__cxa_guard_release (g=0x77dc9a60) at
../../../../libstdc++-v3/libsupc++/guard.cc:376
376 if ((old & waiting_bit) != 0)
[Switching to Thread 0x7fffebfff700 (LWP 113412)]
Hardware watchpoint 1:
_ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 1
New value = 256
0x00381205f101 in __cxxabiv1::__cxa_guard_acquire (g=0x77dc9a60)
at ../../../../libstdc++-v3/libsupc++/guard.cc:254
254 if (__atomic_compare_exchange_n(gi, &expected, pending_bit,
false,
[New Thread 0x70a2d700 (LWP 113413)]
Hardware watchpoint 1:
_ZGVZN12_GLOBAL__N_121Q_QGS_textCodecsMutex13innerFunctionEvE6holder

Old value = 256
New value = 1
__cxxabiv1::__cxa_guard_release (g=0x77dc9a60) at
../../../../libstdc++-v3/libsupc++/guard.cc:376
376 if ((old & waiting_bit) != 0)

As can be seen by the output, the guard variable transitioned from 0 -> 256 ->
1 -> 256 -> 1.

== Analysis ==

The code in guard.cc is:

int expected(0);
const int guard_bit = _GLIBCXX_GUARD_BIT;
const int pending_bit = _GLIBCXX_GUARD_PENDING_BIT;
const int waiting_bit = _GLIBCXX_GUARD_WAITING_BIT;

while (1)
  {
if (__atomic_compare_exchange_n(gi, &expected, pending_bit, false,
__ATOMIC_ACQ_REL,
__ATOMIC_RELAXED))
  {
// This thread should do the initialization.
return 1;
  }

if (expected == guard_bit)
  {
// Already initialized.
return 0;   
  }
 if (expected == pending_bit)
   {
 int newv = expected | waiting_bit;
 if (!__atomic_compare_exchange_n(gi, &expected, newv, false,
  __ATOMIC_ACQ_REL, 
  __ATOMIC_RELAXED))
   continue;

 expected = newv;
   }

syscall (SYS_futex, gi, _GLIBCXX_FUTEX_WAIT, expected, 0);
  }

We have two threads running and they both reach __cxa_guard_acquire more or
less at the same time. On one thread, the execution is the expected path: the
first CAS succeeds and that transitions the guard variable from 0 to 256. That
thread will initialise the static.

In the second thread, the CAS fails, so it will proceed to the second CAS,
trying to replace 256 with 768 (to indicate it's going to sleep).

In the meantime, the first thread calls __cxa_guard_release, which exchanges
the 256 with a 1.

Therefore, on the second thread, the second CAS fails and now expected == 1 (it
got updated). The continue makes it return to the first CAS with expected == 1
and that one succeeds, by replacing it from 1 to 256, which is wrong.

== Solution ==

This issue appears to be caused by the new atomic intrinsics updating the
expected variable and the looping. If the second CAS fails, the code needs to
inspect the value set there to determine what to do next. The possible values
are:

 0: the other thread aborted, we should try again -> continue
 1: initialisation completed, we should return 0
 (256: can't happen)
 768: yet another thread succeeded in setting the waiting bit, we should sleep

The attached patch is a proposed solution to the problem, but I have not been
able to test it yet.
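
The case analysis above can be sketched as follows. This is not the attached
patch and not the libsupc++ code: it uses std::atomic instead of the
__atomic_* builtins, replaces the futex wait with a spin loop, and all names
in the `guard_sketch` namespace are assumptions made for this example:

```cpp
#include <atomic>
#include <cassert>

namespace guard_sketch {

constexpr int guard_bit   = 1;      // initialisation completed
constexpr int pending_bit = 0x100;  // 256: one thread is initialising
constexpr int waiting_bit = 0x200;  // 768 == pending_bit | waiting_bit

enum class Next { Retry, Done, Sleep };

// What to do when the CAS 256 -> 768 failed and observed another value.
constexpr Next after_failed_wait_cas(int observed)
{
    return observed == 0         ? Next::Retry   // initialiser aborted
         : observed == guard_bit ? Next::Done    // initialisation completed
         :                         Next::Sleep;  // 768: another waiter set the bit
}

// Returns 1 if the caller must run the initialiser, 0 if it is already done.
int acquire(std::atomic<int> &g)
{
    for (;;) {
        int expected = 0;   // reset on every pass: a stale 1 is never reused
        if (g.compare_exchange_strong(expected, pending_bit,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire))
            return 1;
        if (expected == guard_bit)
            return 0;
        if (expected == pending_bit) {
            const int newv = pending_bit | waiting_bit;
            if (g.compare_exchange_strong(expected, newv,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire)) {
                expected = newv;
            } else {
                const Next next = after_failed_wait_cas(expected);
                if (next == Next::Done)
                    return 0;
                if (next == Next::Retry)
                    continue;
                // Next::Sleep: expected is now 768, fall through and wait.
            }
        }
        // Stand-in for the futex wait: spin until the value changes.
        while (g.load(std::memory_order_acquire) == expected) {
        }
    }
}

// Marks the initialisation as completed.
void release(std::atomic<int> &g)
{
    const int old = g.exchange(guard_bit, std::memory_order_acq_rel);
    assert((old & pending_bit) != 0);
    (void)old;
}

} // namespace guard_sketch
```

The key difference from the loop quoted in the analysis is that `expected` is
never carried over into the first compare-and-swap, so a guard that already
reads 1 cannot be flipped back to 256.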


[Bug c/58889] New: GCC 4.9 fails to compile certain functions with intrinsics with __attribute__((target))

2013-10-26 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58889

Bug ID: 58889
   Summary: GCC 4.9 fails to compile certain functions with
intrinsics with __attribute__((target))
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org

Source:

$ cat t.c
#include <immintrin.h>
__attribute__((target("avx2"))) int f(void *ptr)
{ 
  return _mm256_movemask_epi8(_mm256_loadu_si256((__m256i*)ptr)); 
}

Works:

$ ~/gcc4.9/bin/g++ -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=core2 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=core2 -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=nocona -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=nocona -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=prescott -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=pentium4 -m32 -S -O3 -o /dev/null t.c
$ ~/gcc4.9/bin/g++ -march=pentium3 -m32 -S -O3 -o /dev/null t.c

Fails:
$ ~/gcc4.9/bin/g++ -march=pentium2 -m32 -S -O3 -o /dev/null t.c
avxintrin.h: In function ‘int f(void*)’:
avxintrin.h:890:1: error: inlining failed in call to always_inline ‘__m256i
_mm256_loadu_si256(const __m256i*)’: target specific option mismatch
 _mm256_loadu_si256 (__m256i const *__P)
 ^
[...]
g++: internal compiler error: Segmentation fault (program cc1plus)
0x409614 execute
/home/thiago/src/gcc/gcc/gcc.c:2864


$ ~/gcc4.9/bin/g++ -march=pentium -m32 -S -O3 -o /dev/null t.c
avxintrin.h: In function ‘int f(void*)’:
avxintrin.h:890:1: error: inlining failed in call to always_inline ‘__m256i
_mm256_loadu_si256(const __m256i*)’: target specific option mismatch
 _mm256_loadu_si256 (__m256i const *__P)
 ^
[...]
[no segfault]

This is an unpatched, pristine GCC, built from trunk@203862.
System: Linux 64-bit (Fedora 17)
Configure options: --enable-lang=c,c++

[Bug c/58889] GCC 4.9 fails to compile certain functions with intrinsics with __attribute__((target))

2013-10-26 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58889

--- Comment #1 from Thiago Macieira  ---
This problem also happens with other combinations of functions in use and
compiler options.

My original problem happened on a 64-bit build with -march=corei7-avx and a
function with __attribute__((target("avx2"))).


[Bug target/59539] New: Missed optimisation: VEX-prefixed operations don't need aligned data

2013-12-17 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

Bug ID: 59539
   Summary: Missed optimisation: VEX-prefixed operations don't
need aligned data
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org

Consider the following code:

#include 
int f(void *p1, void *p2)
{
__m128i d1 = _mm_loadu_si128((__m128i*)p1);
__m128i d2 = _mm_loadu_si128((__m128i*)p2);
__m128i result = _mm_cmpeq_epi16(d1, d2);
return _mm_movemask_epi8(result);
}

If compiled with -O2 -mavx, it produces the following code with GCC 4.9
(current trunk):
f:
vmovdqu (%rdi), %xmm0
vmovdqu (%rsi), %xmm1
vpcmpeqw%xmm1, %xmm0, %xmm0
vpmovmskb   %xmm0, %eax
ret

One of the two VMOVDQU instructions is unnecessary, since the VEX-prefixed
VPCMPEQW instruction can do unaligned loads without faulting. The Intel
Software Developer's Manual Volume 1, Chapter 14 says in 14.9 "Memory alignment":

> With the exception of explicitly aligned 16 or 32 byte SIMD load/store
> instructions, most VEX-encoded, arithmetic and data processing instructions
> operate in a flexible environment regarding memory address alignment, i.e.
> VEX-encoded instruction with 32-byte or 16-byte load semantics will support
> unaligned load operation by default. Memory arguments for most instructions
> with VEX prefix operate normally without causing #GP(0) on any
> byte-granularity alignment (unlike Legacy SSE instructions). The instructions
> that require explicit memory alignment requirements are listed in Table 14-22.

Clang and ICC have already implemented this optimisation:

Clang 3.3 produces:
f:  # @f
vmovdqu (%rsi), %xmm0
vpcmpeqw(%rdi), %xmm0, %xmm0
vpmovmskb   %xmm0, %eax
ret

Similarly, ICC 14 produces:
f:
vmovdqu   (%rdi), %xmm0
vpcmpeqw  (%rsi), %xmm0, %xmm1
vpmovmskb %xmm1, %eax
ret
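
To make concrete what the function computes regardless of which instruction
sequence the compiler picks, here is a small sketch for x86 with SSE2. The
name `compare_lanes` is hypothetical; the body is the same as f() above with
const-correct casts:

```cpp
#include <immintrin.h>

// Compares eight 16-bit lanes loaded from two possibly unaligned addresses.
// Each equal lane contributes two set bits to the 16-bit result mask.
int compare_lanes(const void *p1, const void *p2)
{
    __m128i d1 = _mm_loadu_si128(static_cast<const __m128i *>(p1));
    __m128i d2 = _mm_loadu_si128(static_cast<const __m128i *>(p2));
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}
```

Because the source uses the unaligned load intrinsic, the result must be the
same for any byte alignment of the inputs, whether the load is a separate
VMOVDQU or folded into the compare as a memory operand.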


[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data

2013-12-18 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #2 from Thiago Macieira  ---
I have to use _mm_loadu_si128 because non-VEX SSE requires explicit unaligned
loads.

Here's more food for thought:

__m128i result = _mm_cmpeq_epi16(*(__m128i*)p1, *(__m128i*)p2);

For non-VEX code, so far the compiler emitted one MOVDQA and one PCMPEQW if it
could, enforcing that both sources needed to be aligned. With VEX, VPCMPEQW can
do unaligned, so should the other load also be changed to VMOVDQU instead of
VMOVDQA?

Similarly, if I use _mm_load_si128 (not loadu), can the compiler combine one
load into the next instruction? Performance-wise, the execution will be the
same, with one fewer instruction to be retired (so, better); but it will not
cause an unaligned fault if the pointer isn't aligned.


[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data

2013-12-18 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #12 from Thiago Macieira  ---
Thanks, rebuilding!


[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data

2013-12-18 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #13 from Thiago Macieira  ---
I can't confirm. trunk@206091:

$ ~/gcc4.9/bin/gcc -mavx -S -o - -O3 -xc - <<<'#include 
int f(void *p1, void *p2)
{
__m128i d1 = _mm_loadu_si128((__m128i*)p1);
__m128i d2 = _mm_loadu_si128((__m128i*)p2);
__m128i result = _mm_cmpeq_epi16(d1, d2);
return _mm_movemask_epi8(result);
}
'
.file   ""
.section.text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl  f
.type   f, @function
f:
.LFB1073:
.cfi_startproc
vmovdqu (%rdi), %xmm0
vmovdqu (%rsi), %xmm1
vpcmpeqw%xmm1, %xmm0, %xmm0
vpmovmskb   %xmm0, %eax
ret
.cfi_endproc
.LFE1073:
.size   f, .-f
.section.text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident  "GCC: (GNU) 4.9.0 20131121 (experimental)"
.section.note.GNU-stack,"",@progbits


[Bug target/59539] Missed optimisation: VEX-prefixed operations don't need aligned data

2013-12-18 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

--- Comment #14 from Thiago Macieira  ---
*facepalm* I had forgotten to make install!

It works:
$ ~/gcc4.9/bin/gcc -mavx -S -o - -O3 -xc - <<<'#include 
int f(void *p1, void *p2)
{
__m128i d1 = _mm_loadu_si128((__m128i*)p1);
__m128i d2 = _mm_loadu_si128((__m128i*)p2);
__m128i result = _mm_cmpeq_epi16(d1, d2);
return _mm_movemask_epi8(result);
}
'
.file   ""
.section.text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl  f
.type   f, @function
f:
.LFB1073:
.cfi_startproc
vmovdqu (%rsi), %xmm0
vpcmpeqw(%rdi), %xmm0, %xmm0
vpmovmskb   %xmm0, %eax
ret
.cfi_endproc
.LFE1073:
.size   f, .-f
.section.text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident  "GCC: (GNU) 4.9.0 20131218 (experimental)"
.section.note.GNU-stack,"",@progbits


[Bug target/19520] protected function pointer doesn't work right

2012-01-16 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520

--- Comment #23 from Thiago Macieira  2012-01-16 
14:56:50 UTC ---
I've changed my opinion on this matter. I think GCC is generating the proper
code (most efficient). It's ld that should accept this decision.


[Bug target/19520] protected function pointer doesn't work right

2012-01-18 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520

--- Comment #26 from Thiago Macieira  2012-01-18 
13:28:05 UTC ---
ld *can* link, it just chooses not to.

$ cat > foo.c
__attribute__((visibility("protected")))
void * foo (void) { return (void *)foo; }

$ gcc -fPIC -shared foo.c   
/usr/bin/ld: /tmp/cclrufLV.o: relocation R_X86_64_PC32 against protected symbol
`foo' can not be used when making a shared object
/usr/bin/ld: final link failed: Bad value
collect2: ld returned 1 exit status

$ gcc -Wl,-Bsymbolic-functions -fPIC -shared foo.c && echo success
success
$ cat > empty.dynlist 
{ "__this_symbol_isnt_present__"; };
$ gcc -Wl,--dynamic-list,empty.dynlist -fPIC -shared foo.c && echo success
success

I also cannot confirm that icc does anything different:
$ icc -fPIC -shared foo.c
ld: /tmp/iccf15gTK.o: relocation R_X86_64_PC32 against protected symbol `foo'
can not be used when making a shared object
ld: final link failed: Bad value
$ icc -O3 -S -o /dev/stdout -fPIC -shared foo.c | grep -A4 foo:
foo:
..B1.1: # Preds ..B1.0
..___tag_value_foo.1:   #2.19
lea   foo(%rip), %rax   #2.36
ret #2.36

What's more, if you actually do compile the following program into a shared
library, it succeeds:
$ cat > foo.S
.text
.globl  foo
.protected  foo
.type   foo, @function
foo:
movq  foo@GOTPCREL(%rip), %rax
ret
$ gcc -shared foo.S && echo success
success

But the resulting shared object has the following (extracted from eu-readelf):
Relocation section [ 5] '.rela.dyn' for section [ 0] '' at offset 0x230
contains 1 entry:
  Offset      Type            Value   Addend Name
  0x00200330  X86_64_GLOB_DAT 0x0248  +0 foo

2: 0248  0 FUNC    GLOBAL PROTECTED  6 foo

Now we introduce a third component to this discussion: the dynamic linker. What
will it do?

This has become a decision, not a bug: what should the compiler do when taking
the address of a function when said function is under protected visibility.
Both solutions are technically correct and would load the same function address
under the correct circumstances. 

The compiler is also taking the "protected" visibility to the letter (at
least, according to its own definition of it):

"protected"
  Protected visibility is like default visibility except that it
  indicates that references within the defining module will
  bind to the definition in that module.  That is, the declared
  entity cannot be overridden by another module.

Since the symbol was marked as "protected" in the symbol table, it's expected
that the linker and dynamic linker will bind it locally. That being the case,
the compiler can optimise for that fact. It can calculate what value would be
placed in the GOT entry and load that instead. That's the LEA instruction.

The linker, however, mandates that the address of the symbol should not be loaded
directly, but only through the GOT. This is necessary because the psABI
requires that the function address resolve to the PLT entry found in the
position-dependent executable. If the executable takes the address of this
global (but protected) symbol, it will hardcode the address to its own address
space, forcing other ELF modules to follow suit.

Finally, what does the dynamic linker do when an "entity (that) cannot be
overridden by another module" is overridden by another module? The glibc 2.14
loader will resolve the GOT entry's relocation to the executable's PLT stub,
even if the symbol in question has protected visibility. Other loaders might
work differently.

As it stands, the psABI requires that the address to a protected function be
loaded through the GOT, even though the compiler thinks it knows what the
address will be.

However, I would really prefer that the compiler *not* change its behaviour for PIC code,
but instead change its behaviour for ELF position-dependent executables. I am
asking for a change in the psABI and requesting that the loading of function
addresses for "default" visibility symbols (not protected!) should be done via
the GOT. In other words, I'm asking that we optimise for shared libraries, not
for executables.

Versions:
GCC: 4.6.0
ld: 2.21.51.0.6-6.fc15 20110118
ICC: 12.1.0 20111011


[Bug target/19520] protected function pointer doesn't work right

2012-01-19 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520

--- Comment #30 from Thiago Macieira  2012-01-19 
18:52:57 UTC ---
This does solve the problem.

It's just unfortunate that it does so by creating more work for the library
even if no executable ever takes the address of this protected function.

It would have been preferable to somehow tell the compiler when compiling an
executable that this function it's taking the address of is protected
elsewhere, so it should use the GOT too.


[Bug target/83562] broken destructors of thread_local objects on i686 mingw targets

2018-12-11 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83562

--- Comment #3 from Thiago Macieira  ---
This can easily be fixed by way of a trampoline that adjusts the parameter.

[Bug c++/88475] New: -E -fdirectives-only clashes with raw strings

2018-12-12 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475

Bug ID: 88475
   Summary: -E -fdirectives-only clashes with raw strings
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Source:
$ cat test.cpp
extern const char str[] = R"( 
#define FOO 1
#NOSORT
)";

Compiles just fine:
$ g++ -c test.cpp; echo $?
0

Preprocessed output looks correct:
$ g++ -E test.cpp 
# 1 "test.cpp"
# 1 ""
# 1 ""
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "" 2
# 1 "test.cpp"
extern const char str[] = R"( 
#define FOO 1
#NOSORT
)";

But in the presence of -fdirectives-only (which icecream uses), it produces an
error and incorrectly preprocesses:
$ g++ -E -fdirectives-only test.cpp | tail -5
test.cpp:3:2: error: invalid preprocessing directive #NOSORT
 #NOSORT
  ^~
# 1 "test.cpp"
extern const char str[] = R"( 
#define FOO 1

)";

According to strace, cc1plus is the preprocessor, not /lib/cpp.
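
A minimal sketch of why the directive-like lines must be left alone (not part of the original report; `raw_text` and `contains` are names made up for this illustration). Everything between `R"(` and `)"` is string data, so a preprocessor should neither act on `#define FOO 1` nor reject `#NOSORT`:

```cpp
#include <cassert>
#include <cstring>

// Everything between R"( and )" is string data, including the lines that
// merely look like preprocessing directives.
const char raw_text[] = R"(
#define FOO 1
#NOSORT
)";

// Hypothetical helper for this sketch: true if needle occurs in raw_text.
bool contains(const char *needle)
{
    assert(needle != nullptr);
    return std::strstr(raw_text, needle) != nullptr;
}

// The "#define FOO 1" above is not a directive, so FOO must stay undefined.
#ifdef FOO
#  error "FOO was defined from inside a raw string literal"
#endif
```

Compiled normally, both directive-like lines should survive as text in `raw_text`; the `-E -fdirectives-only` output shown above drops the `#NOSORT` line instead.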

[Bug sanitizer/89124] New: __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))

2019-01-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124

Bug ID: 89124
   Summary: __attribute__((no_sanitize_address)) interferes with
__attribute__((target(xxx)))
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at 
gcc dot gnu.org
  Target Milestone: ---

$ cat test.cpp
#include <immintrin.h>

#ifdef __GNUC__
__attribute__((target("avx2"), no_sanitize_address))
#endif
void f(void *ptr)
{
_mm256_loadu_si256((__m256i *)ptr);
}
$ gcc -c test.cpp && echo ok
ok
$ gcc -c -fsanitize=address test.cpp
In file included from
/opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/immintrin.h:41,
 from :1:
/opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/avxintrin.h:
In function 'void f(void*)':
/opt/compiler-explorer/gcc-8.2.0/lib/gcc/x86_64-linux-gnu/8.2.0/include/avxintrin.h:919:1:
error: inlining failed in call to always_inline '__m256i
_mm256_loadu_si256(const __m256i_u*)': function attribute mismatch
 _mm256_loadu_si256 (__m256i_u const *__P)
 ^~
:8:23: note: called from here
 _mm256_loadu_si256((__m256i *)ptr);
 ~~^~~~

Works fine in Clang. Godbolt link: https://godbolt.org/z/rg5kUD

[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))

2019-01-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124

--- Comment #1 from Thiago Macieira  ---
Worse:

$ cat test.cpp
#include <immintrin.h>

#ifdef __GNUC__
__attribute__((no_sanitize_address))
#endif
void f(void *ptr)
{
_mm256_loadu_si256((__m256i *)ptr);
}
$ gcc -c -mavx2 test.cpp
[same errors]

[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))

2019-01-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124

--- Comment #2 from Thiago Macieira  ---
-fsanitize=address missing from the command-line in the previous comment. It
should be:

gcc -c -mavx2 -fsanitize=address test.cpp

[Bug sanitizer/89124] __attribute__((no_sanitize_address)) interferes with __attribute__((target(xxx)))

2019-01-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89124

--- Comment #4 from Thiago Macieira  ---
Or permit the inlining if the function is also __artificial__. It's documented,
but I don't see anyone needing to use that besides gcc's own headers.

[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2018-04-24 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #19 from Thiago Macieira  ---
And Qt has stopped complaining about this.
https://codereview.qt-project.org/227296

[Bug target/89445] New: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask

2019-02-21 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

Bug ID: 89445
   Summary: [8 regression] _mm512_maskz_loadu_pd "forgets" to use
the mask
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Created attachment 45793
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45793&action=edit
example showing segmentation fault

In the following code:

void daxpy(size_t n, double a, double const* __restrict x,  double* __restrict y)
{
const __m512d v_a = _mm512_broadcastsd_pd(_mm_set_sd(a));

const __mmask16 final = (1U << (n % 8u)) - 1;
__mmask16 mask = 65535u;
for (size_t i = 0; i < n * sizeof(double); i += 8 * sizeof(double)) {
if (i + 8 * sizeof(double) > n * sizeof(double))
mask = final;
__m512d v_x = _mm512_maskz_loadu_pd(mask, (char const *)x + i);
__m512d v_y = _mm512_maskz_loadu_pd(mask, (char const *)y + i);
__m512d tmp = _mm512_fmadd_pd(v_x, v_a, v_y);
_mm512_mask_storeu_pd((char *)y + i, mask, tmp);
}
}

When compiled with GCC 8, the loop looks like

.L5:
cmpq    %rax, %r10
cmovb   %r9d, %r8d
movzbl  %r8b, %ecx
kmovd   %ecx, %k1
leaq    (%rdx,%rax), %rcx
vmovapd (%rsi,%rax), %zmm1{%k1}{z}
vmovapd (%rcx), %zmm2{%k1}{z}
vfmadd132pd %zmm0, %zmm2, %zmm1
vmovupd %zmm1, (%rcx){%k1}
addq    $64, %rax
cmpq    %rdi, %rax
jb  .L5

Whereas GCC trunk (as of r269073) generates:

.L5:
vmovapd (%rsi,%rax), %zmm1
cmpq    %rax, %r9
vfmadd213pd (%rdx,%rax), %zmm0, %zmm1
cmovb   %r8d, %ecx
kmovb   %ecx, %k1
vmovupd %zmm1, (%rdx,%rax){%k1}
addq    $64, %rax
cmpq    %rdi, %rax
jb  .L5

Godbolt link: https://gcc.godbolt.org/z/2ys7ZO

Since neither memory load is masked, the resulting registers can contain
garbage and trigger FP exceptions. They can also cause segmentation faults if
portions of the source are not mapped regions. The attached example forces the
operation on a page boundary where half the 64 bytes addressed by the second
load are unmapped. When run, the example will crash.
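
To make the expected semantics concrete, here is a scalar model of the loop (a sketch only, not the attached example; `maskz_load8`, `mask_store8` and `daxpy_model` are hypothetical names, and the index counts elements rather than bytes). The point it illustrates is that a lane whose mask bit is clear is never read from memory, so the tail block cannot touch anything past `x[n-1]` or `y[n-1]`:

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of a zero-masking 8-lane load: a lane whose mask bit is clear
// becomes 0.0 and its memory is never read.
static void maskz_load8(double out[8], unsigned mask, const double *src)
{
    for (unsigned lane = 0; lane < 8; ++lane)
        out[lane] = ((mask >> lane) & 1u) ? src[lane] : 0.0;
}

// Scalar model of a masked 8-lane store: only lanes whose mask bit is set
// are written.
static void mask_store8(double *dst, unsigned mask, const double in[8])
{
    for (unsigned lane = 0; lane < 8; ++lane)
        if ((mask >> lane) & 1u)
            dst[lane] = in[lane];
}

// Same control flow as daxpy() in the report, indexed by element.
void daxpy_model(std::size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != nullptr && y != nullptr));
    const unsigned final_mask = (1u << (n % 8u)) - 1u;
    unsigned mask = 0xffu;                  // all 8 lanes active
    for (std::size_t i = 0; i < n; i += 8) {
        if (i + 8 > n)
            mask = final_mask;              // partial last block
        double vx[8], vy[8], tmp[8];
        maskz_load8(vx, mask, x + i);
        maskz_load8(vy, mask, y + i);
        for (unsigned lane = 0; lane < 8; ++lane)
            tmp[lane] = vx[lane] * a + vy[lane];
        mask_store8(y + i, mask, tmp);
    }
}
```

With n = 11, the second block runs with mask 0x7 and only elements 8 to 10 are accessed. The code GCC trunk generates performs the equivalent of an unmasked load there, which is where the fault comes from.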

[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask

2019-02-22 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

--- Comment #6 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #4)
> vmovupd (%rsi,%rax), %zmm1{%k1}{z}
> addq %rdx, %rax
> vmovupd (%rax), %zmm2{%k1}{z}
> vfmadd132pd %zmm0, %zmm2, %zmm1
> vmovupd %zmm1, (%rax){%k1}
> isn't optimal btw, it would be nice if we could merge that masking into the
> vfmadd132pd instruction, like:
> vmovupd (%rsi,%rax), %zmm1{%k1}{z}
> addq %rdx, %rax
> vfmadd132pd (%rax), %zmm2, %zmm1{%k1}{z}
> vmovupd %zmm1, (%rax){%k1}
> but not really sure how to achieve that.

It would be nice. It would be even nicer not to have that "addq". That's
actually what ICC generates (click on the godbolt link and change one of the
compilers to ICC 19):

..B1.3: # Preds ..B1.3 ..B1.2
cmpq  %rax, %r8 #12.13
cmova %r10d, %r9d   #12.13
kmovw %r9d, %k1 #13.20
vmovupd   (%r8,%rsi), %zmm1{%k1}{z} #13.20
vfmadd213pd (%r8,%rdx), %zmm0, %zmm1{%k1}{z}#15.20
vmovupd   %zmm1, (%r8,%rdx){%k1}#16.9
addq  $64, %r8  #10.48
cmpq  %rcx, %r8 #10.32
jb    ..B1.3    # Prob 82%  #10.32

There's one more simplification here: ICC lacks the movzbl instruction which
GCC inserted but is completely superfluous. First, we've already calculated the
proper 32-bit pattern and stored it in %r9d, there was no need to zero extend
it. Second, when operating on 512-bit packed doubles, there are 8 lanes, so
only the low 8 bits of the mask register will be considered in the first place.
(Arguably, the intrinsic should have used __mmask8, but that wasn't added until
AVX512DQ and this is F)

That reduces the number of instructions and will save you a couple of uops per
loop. Depending on how long your loop is, it may help you fit in the DSB and
help the Loop Stream Detector. I'm not at all knowledgeable about those
details, so I'll just link to
https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of#answer-39940932.

For this particular loop, if run long enough, I don't think there's any effect,
but this is an area for improvement for longer loops. The number of
instructions is also significant for short-lived loops, which happens to me
often when using SIMD for strings (tens of bytes of length, so the loop is run
once or twice only).

[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask

2019-02-22 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

--- Comment #7 from Thiago Macieira  ---
Comment on attachment 45800
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45800
gcc9-pr89445.patch

Tested and works on my machine.

The movzbl that GCC 8 generated is also gone, but it inserted moves *from* the
OpMask register:

.L4:
movq    %rcx, %rax
addq    $64, %rcx
cmpq    %rdi, %rcx
kmovw   %k1, %r9d
cmova   %r8d, %r9d
kmovw   %r9d, %k1
vmovupd (%rsi,%rax), %zmm1{%k1}{z}
addq    %rdx, %rax
vmovupd (%rax), %zmm2{%k1}{z}
vfmadd132pd %zmm0, %zmm2, %zmm1
vmovupd %zmm1, (%rax){%k1}
cmpq    %rdi, %rcx
jb  .L4

Seems like it forgot the GPR that used to contain the mask, so it needed to
reload from %k1. The end detection is also slightly worse.

Yesterday, when I benchmarked with GCC 8, it ran 1000 iterations over 10
million doubles in roughly 11.9 ms, with 10 million instructions. Today, I am
getting 11.8 ms at 16 million instructions (the increase of instructions/cycle
is roughly equal to the decrease in instructions per iteration, proving that
memory bandwidth is the bottleneck)

[Bug rtl-optimization/89445] [9 regression] _mm512_maskz_loadu_pd "forgets" to use the mask

2019-02-22 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445

--- Comment #8 from Thiago Macieira  ---
Sorry, in editing I ended up removing an important point: GCC 8 also generates
the move *from* OpMask when I put it in the benchmark loop. So that's not a
regression, per se.

[Bug target/87317] New: Missed optimisation: merging VMOVQ with operations that only use the low 8 bytes

2018-09-14 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87317

Bug ID: 87317
   Summary: Missed optimisation: merging VMOVQ with operations
that only use the low 8 bytes
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Test:

#include 

int f(void *ptr)
{
__m128i data = _mm_loadl_epi64((__m128i *)ptr);
data = _mm_cvtepu8_epi16(data);
return _mm_cvtsi128_si32(data);
}

GCC generates (-march=haswell or -march=skylake):

vmovq   (%rdi), %xmm0
vpmovzxbw   %xmm0, %xmm0
vmovd   %xmm0, %eax
ret

Note that the VPMOVZXBW instruction only reads the low 8 bytes from the source,
including if it is a memory reference. Both Clang and ICC generate:

vpmovzxbw   (%rdi), %xmm0
vmovd   %xmm0, %eax
retq

Similarly for:

void f(void *dst, void *ptr)
{
__m128i data = _mm_cvtsi32_si128(*(int*)ptr);
data = _mm_cvtepu8_epi32(data);
_mm_storeu_si128((__m128i*)dst, data);
}

GCC:

vmovd   (%rsi), %xmm0
vpmovzxbd   %xmm0, %xmm0
vmovups %xmm0, (%rdi)
ret

Clang and ICC:

vpmovzxbd   (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
retq

There are other instructions that might benefit from this.

AVX-512 memory instructions where the OpMask is a constant might be candidates
too.
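
As a reference for what the first test function computes, here is a scalar model (a sketch; `widen_low_model` is a made-up name and a little-endian lane order is assumed, as on x86). It reads exactly 8 bytes, which is why folding the load into VPMOVZXBW is safe:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of f(void *ptr) from the first test: widen 8 bytes to eight
// 16-bit lanes, then return the low 32 bits of the resulting vector.
std::uint32_t widen_low_model(const void *ptr)
{
    assert(ptr != nullptr);
    unsigned char bytes[8];
    std::memcpy(bytes, ptr, sizeof bytes);      // exactly 8 bytes are read
    std::uint16_t lanes[8];
    for (int i = 0; i < 8; ++i)
        lanes[i] = bytes[i];                    // zero extension
    // The low 32 bits of the vector hold lanes 0 and 1.
    return std::uint32_t(lanes[0]) | (std::uint32_t(lanes[1]) << 16);
}
```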

[Bug target/87522] LTO incorrectly merges target specific options

2018-10-04 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87522

--- Comment #2 from Thiago Macieira  ---
In the original case, all sources were compiled with -march=westmere, though
some files had -mavx added.

[Bug target/69471] "-march=native" unintentionally breaks further -march/-mtune flags

2018-11-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69471

Thiago Macieira  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiago at kde dot org

--- Comment #5 from Thiago Macieira  ---
Same thing here. User passes CFLAGS="-march=native" for their system, but
library needs to build one .cpp source with -march=haswell for additional
functionality (runtime-checked via CPUID). Unfortunately, -march=native
supersedes all other -march options, regardless of order, unlike all other
options.

Examples:
$ gcc -dM -E -xc /dev/null -march=sandybridge -march=haswell  | grep AVX 
#define __AVX__ 1
#define __AVX2__ 1
$ gcc -dM -E -xc /dev/null -march=haswell -march=sandybridge  | grep AVX
#define __AVX__ 1

$ gcc -dM -E -xc /dev/null -march=sandybridge -march=native | grep AVX
#define __AVX__ 1
#define __AVX2__ 1
$ gcc -dM -E -xc /dev/null -march=native  -march=sandybridge | grep AVX
#define __AVX__ 1
#define __AVX2__ 1

Qt is affected: https://bugreports.qt.io/browse/QTBUG-71564. The problem began
when we switched from appending -mavx2 to appending -march=haswell, so we'd get
FMA and BMI1/2 in the same file.
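
For illustration, the "one source with extra ISA, checked at runtime" pattern can also be sketched per function with the target attribute instead of a per-file -march option. This is an assumption-laden sketch, not Qt's code: it assumes GCC or Clang, and `sum`, `sum_avx2` and `sum_scalar` are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#  include <immintrin.h>
#  define SKETCH_X86_DISPATCH 1
#endif

// Portable baseline, built with whatever flags the user passed.
static std::uint32_t sum_scalar(const std::uint32_t *p, std::size_t n)
{
    std::uint32_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

#ifdef SKETCH_X86_DISPATCH
// Only this function may use AVX2, whatever -march was on the command line.
__attribute__((target("avx2")))
static std::uint32_t sum_avx2(const std::uint32_t *p, std::size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(
            acc, _mm256_loadu_si256(reinterpret_cast<const __m256i *>(p + i)));
    std::uint32_t lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(lanes), acc);
    std::uint32_t s = 0;
    for (std::uint32_t v : lanes)
        s += v;
    for (; i < n; ++i)                      // scalar tail
        s += p[i];
    return s;
}
#endif

// Runtime dispatch: take the AVX2 path only if the CPU reports support.
std::uint32_t sum(const std::uint32_t *p, std::size_t n)
{
    assert(p != nullptr || n == 0);
#ifdef SKETCH_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(p, n);
#endif
    return sum_scalar(p, n);
}
```

This sidesteps the option-ordering question for a single function, but it does not replace -march=haswell where FMA and BMI1/2 code generation is wanted for a whole file.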

[Bug target/69471] "-march=native" unintentionally breaks further -march/-mtune flags

2018-11-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69471

--- Comment #6 from Thiago Macieira  ---
Clang is not affected:

$ clang -dM -E -xc /dev/null -march=sandybridge -march=native | grep AVX
#define __AVX2__ 1
#define __AVX__ 1
$ clang -dM -E -xc /dev/null -march=native  -march=sandybridge | grep AVX
#define __AVX__ 1

Instead of enabling the CPU features your CPU has, Clang tries to guess which
CPU you have and will apply it. This has side-effects for non-arch-specific
items like AES.

ICC is similarly affected, despite claiming it isn't:

$ icc -dM -E -xc /dev/null -march=sandybridge  -march=native | grep AVX
icc: command line warning #10121: overriding '-march=sandybridge' with
'-march=native'
icc: command line warning #10121: overriding '-march=sandybridge' with
'-march=native'
#define __AVX_I__ 1
#define __AVX__ 1
#define __AVX2__ 1
$ icc -dM -E -xc /dev/null -march=native -march=sandybridge | grep AVX  
icc: command line warning #10121: overriding '-march=native' with
'-march=sandybridge'
#define __AVX_I__ 1
#define __AVX__ 1
#define __AVX2__ 1

It says it's overriding, but doesn't override.

[Bug target/87976] New: [i386] Sub-optimal code generation for _mm256_set1_epi64()

2018-11-11 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976

Bug ID: 87976
   Summary: [i386] Sub-optimal code generation for
_mm256_set1_epi64()
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

In the following code, Clang and ICC emit an optimal function that consists
of three instructions (including the tail call). MSVC emits a pretty good
equivalent with a bit more function overhead, but no memory access.

GCC emits a completely unnecessary memory access.

Code:

#include 
#include 

#ifndef _MSC_VER
#define __vectorcall
#endif
void __vectorcall f(__m256i value256);

void g(uint64_t value)
{
f( _mm256_set1_epi64x(value));
}


Clang and ICC (optimal) output:
g:
vmovd %rdi, %xmm0
vpbroadcastq %xmm0, %ymm0
jmp   f

GCC:
g:
pushq   %r13
leaq    16(%rsp), %r13
andq    $-32, %rsp
pushq   -8(%r13)
pushq   %rbp
movq    %rsp, %rbp
pushq   %r13
movq    %rdi, -24(%rbp)
vpbroadcastq -24(%rbp), %ymm0
popq    %r13
popq    %rbp
leaq    -16(%r13), %rsp
popq    %r13
jmp f

Godbolt link for all compilers: https://gcc.godbolt.org/z/-gNvec

[Bug target/87976] [i386] Sub-optimal code generation for _mm256_set1_epi64()

2018-11-11 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87976

--- Comment #3 from Thiago Macieira  ---
Workaround:
__m128i value64 = _mm_set_epi64x(0, value); // _mm_cvtsi64_si128(value);
asm ("" : "+x" (value64));
__m256i value256 =  _mm256_broadcastq_epi64(value64);

[Bug c++/69549] New: Named Address Spaces does not compile in C++

2016-01-28 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549

Bug ID: 69549
   Summary: Named Address Spaces does not compile in C++
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

It works in C:

$ cat test.c
__seg_gs char * ptr;
$ gcc -c test.c && echo Success
Success

But not in C++:

$ gcc -xc++ -c test.c
test.c:1:1: error: ‘__seg_gs’ does not name a type

Even though it's advertised as supported:

$ gcc -xc++ -dM -E /dev/null | grep SEG_GS   
#define __SEG_GS 1

[Bug c++/69549] Named Address Spaces does not compile in C++

2016-05-11 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549

--- Comment #1 from Thiago Macieira  ---
Bump?

Still happening on 7.0 (built 20160502)

[Bug c++/82081] New: Tail call optimisation of noexcept function leads to exception allowed through

2017-09-01 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82081

Bug ID: 82081
   Summary: Tail call optimisation of noexcept function leads to
exception allowed through
   Product: gcc
   Version: 7.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

When a noexcept function gets optimised with tail-call, the frame disappears so
the unwinder cannot know that the function was noexcept and thus
std::terminate() should be called.

Code:

$ cat throw.cpp
void noexcept_function() noexcept;

bool false_condition = false;
void will_throw()
{
throw 1;
}

void wrapper()
{
noexcept_function();
if (false_condition)
throw 42;
}
$ cat main.cpp
#include <iostream>

void will_throw();  // throws int
void wrapper();
extern bool false_condition;

void noexcept_function() noexcept { will_throw(); }

int main()
{
try {
wrapper();
} catch (int v) {
std::cout << "Caught " << v;
return v;
}
return 0;
}

By bouncing around translation units, we prevent inlining. The compiler cannot
know that wrapper() calls noexcept_function(), which calls will_throw().

In debug mode, the program behaves as expected

$ g++ -O0 -g throw.cpp main.cpp
$ ./a.out
terminate called after throwing an instance of 'int'
[1]46552 abort (core dumped)  ./a.out
(gdb) bt
#0  0x7f9df0ce1a90 in raise () from /lib64/libc.so.6
#1  0x7f9df0ce30f6 in abort () from /lib64/libc.so.6
#2  0x7f9df1615235 in __gnu_cxx::__verbose_terminate_handler() () from
/usr/lib64/libstdc++.so.6
#3  0x7f9df1613026 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x7f9df1611fe9 in ?? () from /usr/lib64/libstdc++.so.6
#5  0x7f9df1612958 in __gxx_personality_v0 () from
/usr/lib64/libstdc++.so.6
#6  0x7f9df10633a3 in ?? () from /lib64/libgcc_s.so.1
#7  0x7f9df10638b0 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8  0x7f9df16132a6 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x004009ed in will_throw () at throw.cpp:6
#10 0x00400a2f in noexcept_function () at main.cpp:7
#11 0x004009f6 in wrapper () at throw.cpp:11
#12 0x00400a40 in main () at main.cpp:12

However, when optimised, we see that the exception thrown from will_throw()
does pass through and is caught by main():

$ g++ -O2 -g throw.cpp main.cpp
$ ./a.out 
Caught 1
(gdb) disass noexcept_function
Dump of assembler code for function noexcept_function():
   0x00400b10 <+0>: jmpq   0x400aa0 


I see two possible paths to solving this.
1) forbid tail-call optimisation of a noexcept(false) call in a noexcept
function, so that there is a frame in place for the unwinder to find. That is,
the noexcept_function should be:
  sub  $8, %rsp
  call will_throw()
  add  $8, %rsp
  retq
(GCC generates this under some conditions, like placing all functions in the
same TU but using -fno-inline)

2) wrap the call point of the noexcept function (in this case, wrapper()) with
an EH table that enforces that no exceptions should come out of it.

The first solution implies a performance penalty due to optimisation that could
not be used. If you choose to implement this, please try to disable this
correction under -fno-exceptions.

The second solution preserves the runtime performance at the expense of expanding
EH tables around every noexcept function.

Neither solution completely solves the problem for mixed-age code in different
libraries: solution 1 solves the problem if the callee is recompiled but lets
the problem still happen if only the caller is recompiled. Solution 2 is the
dual converse: if the caller is recompiled, the problem is solved, but the
problem still happens if only the callee is recompiled.
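
A source-level sketch of what solution 2 amounts to (illustration only; a compiler would do this with EH tables, not a try block, and `call_with_noexcept_barrier` and `barrier_intercepts_throw` are hypothetical names). The caller puts a barrier around the call, so an exception that slips out of the callee is still stopped:

```cpp
#include <exception>

// Caller-side barrier: anything that escapes callee() is handed to
// on_violation instead of propagating further up the stack.
template <typename Callee, typename OnViolation>
void call_with_noexcept_barrier(Callee &&callee, OnViolation &&on_violation)
{
    try {
        callee();
    } catch (...) {
        on_violation();
    }
}

// Form matching the noexcept contract: an escaping exception terminates.
template <typename Callee>
void call_with_noexcept_barrier(Callee &&callee)
{
    call_with_noexcept_barrier(callee, [] { std::terminate(); });
}

// Demonstration with a recording handler instead of std::terminate().
bool barrier_intercepts_throw()
{
    bool intercepted = false;
    call_with_noexcept_barrier([] { throw 1; }, [&] { intercepted = true; });
    bool untouched = true;
    call_with_noexcept_barrier([] {}, [&] { untouched = false; });
    return intercepted && untouched;
}
```

The cost mentioned above is visible here too: every call site of a noexcept function carries the extra handler, whether or not the callee ever misbehaves.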

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #8 from Thiago Macieira  ---
(In reply to Peter Cordes from comment #7)
> 8B alignment is required for 8B objects to be efficiently lock-free (using
> SSE load / store for .load() and .store(), see
> https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-
> naturally-aligned-variable-atomic), and to avoid a factor of ~100 slowdown
> if lock cmpxchg8b is split across a cache-line boundary.

Unfortunately, the issue is not efficiency, but compatibility. The change broke
ABI for roughly 50% of structs containing atomic<64bit>. I understand being
fast, but not at the expense of silently breaking code at runtime.

> alignof(long double) in 32-bit is different from alignof(long double) in
> 64-bit.  std::atomic<long double> or _Atomic long double should always have
> the same alignment as long double.

In and out of structs? That's the whole problem: inside structs, the alignment
is 4 for historical reasons.

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #10 from Thiago Macieira  ---
Actually, PR 65146 points out that the problem is not efficiency but
correctness. An under-aligned type could cross a cacheline boundary and thus
fail to be atomic in the first place.

Therefore, it is correct to increase the alignment, even if that causes an ABI
change for existing structures. Those structures were disasters waiting to
happen.

I withdraw my bug report. Close it as INVALID or NOTABUG or whatever is
appropriate.

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #12 from Thiago Macieira  ---
Another problem is that we've now had a couple of years with this issue, so
it's probably worse to make a change again.

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-09 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #14 from Thiago Macieira  ---
(In reply to Peter Cordes from comment #13)
> If you want a struct with non-atomic members to match the layout of a struct
> with atomic members, do something like
> 
> struct foo {
> char c;
> alignas(atomic<long long>) long long t;
> };
> 
[cut]
> IDK what Qt's assert is guarding against.  If you're specifically worried
> about atomicity, checking that alignof(InStruct) == sizeof(long long) makes
> more sense, because that's required on almost any architecture as a
> guaranteed way to avoid cache-line splits.  (C/C++ don't have a simple way
> to express "unaligned is fine except at cache line boundaries" like you get
> on Intel specifically (not AMD)).

It was trying to guard against exactly what you said above: that the alignment
of a QAtomicInteger<T> was exactly the same as the alignment of a plain T
inside a struct, so one could replace a previous plain member with an atomic
and keep binary compatibility. 

But it's clear now that atomic types may need more alignment than the plain
types. In hindsight, the check is unnecessary and should be removed; people
should not expect to replace T with std::atomic<T> or QAtomicInteger<T> and
keep ABI.
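
A small sketch of the check being discussed (the struct and function names are made up here; this is not Qt's actual assertion). It compares the layout of a struct before and after a plain member is swapped for an atomic one:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct WithPlain  { char c; long long t; };
struct WithAtomic { char c; std::atomic<long long> t; };

// Offset of the atomic member; it has to respect the atomic's own alignment.
std::size_t atomic_member_offset()
{
    const std::size_t off = offsetof(WithAtomic, t);
    assert(off % alignof(std::atomic<long long>) == 0);
    return off;
}

// True when swapping the plain member for the atomic one keeps the layout.
bool swap_keeps_layout()
{
    return offsetof(WithPlain, t) == atomic_member_offset()
        && sizeof(WithPlain) == sizeof(WithAtomic);
}
```

On x86-64 `swap_keeps_layout()` is expected to return true. On 32-bit x86 with the libstdc++ versions discussed in this bug it is expected to return false, because the plain member sits at offset 4 and the atomic one at offset 8.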

[Bug c++/80439] New: __attribute__((target("xxx"))) not applied to lambdas

2017-04-15 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80439

Bug ID: 80439
   Summary: __attribute__((target("xxx"))) not applied to lambdas
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase (see also https://godbolt.org/g/H2xjNc for GCC and Clang build):


#include 
#include 

__attribute__((target("sse4.2")))
unsigned aeshash(const uint8_t *p, size_t len, unsigned seed)
{
const auto l = [](unsigned data) {
__m128i m = _mm_insert_epi32(_mm_setzero_si128(), data, 1);
return _mm_extract_epi32(m, 1);
};
return l(seed);
}

In the testcase above, if the source is compiled with base options for x86
(either 32- or 64-bit mode), GCC fails to compile with error:

/usr/lib/gcc/x86_64-linux-gnu/6.3.0/include/smmintrin.h:447:1: error: inlining
failed in call to always_inline 'int _mm_extract_epi32(__m128i, int)': target
specific option mismatch
 _mm_extract_epi32 (__m128i __X, const int __N)
 ^
:9:38: note: called from here
 return _mm_extract_epi32(m, 1);
  ^

Clang compiles the above just fine.

The compilation works if I add __attribute__((target("sse4.2"))) to the lambda.
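
A sketch of that workaround, assuming GCC or Clang on x86 (`roundtrip` and `roundtrip_sse42` are names invented for this example, and the runtime check is added only so the sketch can run on any x86 CPU). The attribute goes between the lambda's parameter list and its body:

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#  include <immintrin.h>
#  define SKETCH_X86 1
#endif

#ifdef SKETCH_X86
__attribute__((target("sse4.2")))
unsigned roundtrip_sse42(unsigned seed)
{
    // The attribute is repeated on the lambda's call operator, since it is
    // not inherited from the enclosing function in the affected GCC versions.
    const auto l = [](unsigned data) __attribute__((target("sse4.2"))) {
        __m128i m = _mm_insert_epi32(_mm_setzero_si128(), data, 1);
        return static_cast<unsigned>(_mm_extract_epi32(m, 1));
    };
    return l(seed);
}
#endif

// Inserting into lane 1 and extracting lane 1 gives the input back.
unsigned roundtrip(unsigned seed)
{
#ifdef SKETCH_X86
    if (__builtin_cpu_supports("sse4.2"))
        return roundtrip_sse42(seed);
#endif
    return seed;    // fallback for CPUs or targets without SSE4.2
}
```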

[Bug target/57202] Please make the intrinsics headers like immintrin.h be usable without compiler flags

2017-04-17 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57202

--- Comment #10 from Thiago Macieira  ---
> But that's what this bug report is for - to make the intrinsics always
> available.

I never asked for them to be available in undecorated functions. Yes, that's
how both the Intel and Microsoft compilers behave, but I actually find that GCC
and Clang's behaviour makes sense too. This allows a clear demarcation of where
different instructions may be used by the compiler, so the CPU check code can
be sure of no leakage. What's more, it allows the compiler to use other
instructions that you didn't specifically use.

It's not perfect, but neither is unrestricted use. I've seen code generated by
either ICC or MSVC (don't remember which) when using an AVX2 instruction like
VPMOVZXBW be surrounded by non-VEX-encoded SSE2 instructions because we never
told the compiler it was ok to use VEX.

[Bug c++/80460] New: Non-sensical fallthrough warning after [[noreturn]] function leading to __builtin_unreachable()

2017-04-18 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80460

Bug ID: 80460
   Summary: Non-sensical fallthrough warning after [[noreturn]]
function leading to __builtin_unreachable()
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase:

===
[[noreturn]] void qt_assert() noexcept;
inline void qt_noop() {}

void f(int i)
{
switch (i) {
case 0:
((!(!"message")) ? qt_assert() : qt_noop());
case 1:
qt_noop();
}
}
===

Prints (under -O2):

: In function 'void f(int)':
:8:49: warning: this statement may fall through
[-Wimplicit-fallthrough=]
 ((!(!"message")) ? qt_assert() : qt_noop());
  ~~~^~
:9:5: note: here
 case 1:
 ^~~~

The condition !!"message" is always true, so the [[noreturn]] function
qt_assert() will be called. There's no condition under which qt_noop() will be
called, so there's no fallthrough possible.

[Bug c++/80460] Incorrect fallthrough warning after [[noreturn]] function inside always-true conditional

2017-04-19 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80460

--- Comment #7 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #1)
> The warning is done before optimizations (except GENERIC opts), and can
> hardly be done much later.

I imagined it would be the case. Treat this as low priority.

I've added the [[fallthrough]] to the source code where this appeared to
silence the warning. Arguably, the author should have used Q_UNREACHABLE()
there too, not Q_ASSERT(!"message").
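
A minimal sketch of the [[fallthrough]] workaround mentioned above, assuming
a C++17 compiler. The function visit() and its bitmask are hypothetical and
exist only to make the control flow observable; this is not the Qt code where
the warning appeared.

```cpp
// Returns a bitmask recording which case labels were executed.
int visit(int i)
{
    int seen = 0;
    switch (i) {
    case 0:
        seen |= 1;
        [[fallthrough]];   // deliberate: tells -Wimplicit-fallthrough this is intended
    case 1:
        seen |= 2;
        break;
    default:
        break;
    }
    return seen;
}
```

Here visit(0) runs both labels, while visit(1) runs only the second one. The
attribute documents the intent without changing the generated code.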

[Bug c/54202] Overeager warning about freeing non-heap objects

2017-05-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54202

--- Comment #6 from Thiago Macieira  ---
ping.

If you can't fix GCC so that it can prove that the free is on a non-heap
object, then please change the warning to indicate that GCC may be wrong. For
example:

warning: free() may be called with non-heap object 'name'

[Bug c/80922] New: #pragma diagnostic ignored not honoured with -flto

2017-05-30 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80922

Bug ID: 80922
   Summary: #pragma diagnostic ignored  not honoured with -flto
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

$ cat f1.cpp
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
void myfree(void *ptr)
{
__builtin_free(ptr);
}

$ cat f2.cpp
void myfree(void *);

static char c;
int main()
{
myfree(&c);
}

This code is intentionally bogus just to trigger the warning. The situation
that caused this was correct code, with a false positive warning I was trying
to suppress.

$ gcc -O2 -include f1.cpp f2.cpp
[no warning, as expected]

$ gcc -O2 -flto f1.cpp f2.cpp   
In function ‘myfree.constprop’,
inlined from ‘main’ at f2.cpp:6:11:
f1.cpp:4:19: warning: attempt to free a non-heap object ‘c’
[-Wfree-nonheap-object]
 __builtin_free(ptr);
   ^

[Bug target/78782] New: [x86] _mm_loadu_si64 intrinsic missing

2016-12-12 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78782

Bug ID: 78782
   Summary: [x86] _mm_loadu_si64 intrinsic missing
   Product: gcc
   Version: 6.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

See this copy of the Intel manual:
https://hjlebbink.github.io/x86doc/html/MOVQ.html (note the typo in the
_mm_move_epi64 intrinsic).

Clang addition: https://reviews.llvm.org/D21504

However, Microsoft's compiler seems not to have it either. Seems like the
functionality can be achieved by way of _mm_loadl_epi64.
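
A hedged sketch of that substitution. The helper names loadu_si64_compat and
load_low64 are hypothetical. The sketch assumes SSE2 is enabled by default,
as on x86-64, and uses a plain memcpy on other targets so that it still
compiles there.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#  include <emmintrin.h>

// Loads 64 bits from a possibly unaligned address into the low half of an
// XMM register and zeroes the upper half, via _mm_loadl_epi64.
static inline __m128i loadu_si64_compat(const void *p)
{
    return _mm_loadl_epi64(static_cast<const __m128i *>(p));
}
#endif

// Returns the 64 bits stored at p, so the result is easy to inspect.
std::uint64_t load_low64(const void *p)
{
    std::uint64_t out;
#if defined(__SSE2__)
    __m128i v = loadu_si64_compat(p);
    _mm_storel_epi64(reinterpret_cast<__m128i *>(&out), v);
#else
    std::memcpy(&out, p, sizeof(out));
#endif
    return out;
}
```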

[Bug c++/82443] New: Would like a way to control emission of vague/weak symbol for inline variables

2017-10-05 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82443

Bug ID: 82443
   Summary: Would like a way to control emission of vague/weak
symbol for inline variables
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

C++17 introduced inline variables and made all static constexpr members be
implicitly inline. With C++14, this code:

=== header.h ===
struct S
{
static constexpr int i = 42;
};

=== tu1.cpp ===
#include "header.h"
constexpr int S::i;

=== tu2.cpp ===
#include "header.h"
const void *f() { return &S::i; }

==

Clang 5 and GCC 7.2, when compiled with -std=c++14, emit the S::i symbol in
tu1.o and it's not weak. There's no S::i symbol emitted in tu2.o.

When compiled with -std=c++17, GCC 7 does not emit the symbol in tu1.o. Clang 5
does. Both compilers emit a weak symbol in tu2.o.

ICC 17 with -std=c++14 emits nothing in tu1.o and emits a weak S::i in tu2.o.

This inconsistency is fragile.

Now add -fvisibility=hidden -fvisibility-inlines-hidden: I'd like a way to make
sure that the inline variable is emitted only in my .cpp file. Everywhere else
that needs to take the address will not emit a copy and will get it from my
.so.
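
The language change behind this report can be sketched in a single file. The
function name address_of_i is hypothetical, and the sketch says nothing about
which object file a given compiler emits the symbol into; it only shows that
the out-of-class definition is required before C++17 and redundant from C++17
on, where the member is implicitly inline.

```cpp
struct S
{
    static constexpr int i = 42;   // implicitly an inline variable in C++17
};

#if __cplusplus < 201703L
// Before C++17, taking the address of S::i requires exactly one
// out-of-class definition somewhere in the program.
constexpr int S::i;
#endif

// Taking the address odr-uses S::i, which forces a symbol to exist.
const void *address_of_i()
{
    return &S::i;
}
```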

[Bug c++/77849] New: [regression/4.9] Warning about deprecated enum even when "-Wdeprecated-declarations" is off

2016-10-04 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77849

Bug ID: 77849
   Summary: [regression/4.9] Warning about deprecated enum even
when "-Wdeprecated-declarations" is off
   Product: gcc
   Version: 6.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

$ cat test.cpp
class C {
public:
enum __attribute__((__deprecated__("Do not use"))) MyEnum
{
Foo,
Bar
};

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"

__attribute__((__deprecated__("Really, do not use"))) static const MyEnum
mySpecialEnum = Foo;

#pragma GCC diagnostic pop

};

int main() {
  return C::Foo;
}

$ gcc-6 -fsyntax-only test.cpp
test.cpp:1:7: warning: ‘C::mySpecialEnum’ is deprecated: Really, do not use
[-Wdeprecated-declarations]
test.cpp:12:79: note: declared here

Notes:
* no warnings on GCC 4
* warnings on mySpecialEnum in GCC 5 and 6 (not about the actual enum usage,
about the actual definition of mySpecialEnum)
* no warnings with ICC
* warnings in main on clang 3.7-3.9

Sorry, I don't have a GCC trunk (7) build available.

[Bug target/59952] -march=core-avx2 should not enable RTM

2014-05-07 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952

--- Comment #12 from Thiago Macieira  ---
GCC 4.9.0 got released with -march=haswell still enabling RTM and HLE, even
though there are Haswell parts without TSX.


[Bug target/59952] -march=core-avx2 should not enable RTM

2014-05-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952

--- Comment #15 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #14)
> I think HLE is the part of TSX.

It is and should be removed from the list.


[Bug target/59952] -march=core-avx2 should not enable RTM

2014-05-08 Thread thiago at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59952

--- Comment #19 from Thiago Macieira  ---
The prefix can be emitted for any CPU, you don't need a flag for that. However,
you cannot emit the XTEST instruction unless the CPU has HLE or RTM.
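
A sketch of the runtime check this implies, assuming GCC or Clang on x86 and
their <cpuid.h> header. The names TsxSupport, detectTsx and mayUseXtest are
hypothetical. The bit positions are those documented for CPUID leaf 7,
subleaf 0: EBX bit 4 for HLE and EBX bit 11 for RTM.

```cpp
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#  include <cpuid.h>
#  define SKETCH_HAVE_CPUID 1
#endif

struct TsxSupport
{
    bool hle;
    bool rtm;
};

// Queries CPUID leaf 7, subleaf 0. Reports no support when the leaf is
// unavailable or when not building for x86.
TsxSupport detectTsx()
{
    TsxSupport result = { false, false };
#ifdef SKETCH_HAVE_CPUID
    if (__get_cpuid_max(0, nullptr) >= 7) {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        result.hle = ((ebx >> 4) & 1) != 0;
        result.rtm = ((ebx >> 11) & 1) != 0;
    }
#endif
    return result;
}

// XTEST is only valid when the CPU reports HLE or RTM.
bool mayUseXtest()
{
    const TsxSupport tsx = detectTsx();
    return tsx.hle || tsx.rtm;
}
```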


[Bug c++/43247] Icorrect optimization while declaring array[1]

2010-03-03 Thread thiago at kde dot org


--- Comment #1 from thiago at kde dot org  2010-03-03 14:41 ---
Problem also happens on:

gcc 4.4.3 on linux 32-bit
gcc 4.4.1 on linux ARM (armel gnueabi)

Also reproducible with -O1 -ftree-vrp.


-- 

thiago at kde dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thiago at kde dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247



[Bug c++/43247] Icorrect optimization while declaring array[1]

2010-03-03 Thread thiago at kde dot org


--- Comment #2 from thiago at kde dot org  2010-03-03 14:44 ---
Also:
-O1 -ftree-vrp -fno-cprop-registers -fno-defer-pop
-fno-guess-branch-probability -fno-if-conversion -fno-if-conversion2
-fno-ipa-pure-const -fno-ipa-reference -fno-merge-constants
-fno-omit-frame-pointer -fno-split-wide-types -fno-tree-ch -fno-tree-copy-prop
-fno-tree-copyrename -fno-tree-dce -fno-tree-dominator-opts -fno-tree-dse
-fno-tree-fre -fno-tree-sink -fno-tree-sra -fno-tree-ter

However, if I add -fno-tree-ccp, the program starts to work as expected again.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247



[Bug c++/40145] structure inside a static function is exported, producing warning

2010-03-03 Thread thiago at kde dot org


--- Comment #1 from thiago at kde dot org  2010-03-03 14:46 ---
Anyone?

This is not a showstopper, but produces unnecessary (and incorrect) warnings.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40145



[Bug tree-optimization/43247] [4.3/4.4 Regression] Incorrect optimization while declaring array[1]

2010-03-26 Thread thiago at kde dot org


--- Comment #6 from thiago at kde dot org  2010-03-26 21:46 ---
Is this fix going to be backported to the 4.4.x line?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43247



[Bug c/65888] New: Need a way to disable copy relocations

2015-04-25 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65888

Bug ID: 65888
   Summary: Need a way to disable copy relocations
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org

Qt would like to optimise libraries by resolving relocations that loop back
into the library in question at link-time, disallowing interposing. The
libraries remain position-independent by always resolving symbols via
PC-relative addressing or via R_xxx_RELATIVE relocations for what pointers need
to be stored in memory (such as virtual tables).

To do that, we use -Bsymbolic or -Bsymbolic-functions. Either way, this is not
enough:

The problem happens when the symbols used from the libraries get used in the
main application. Due to copy relocation and position-dependent code
generation, those symbols "transfer" to the main application:
 * variables are copy-relocated
 * functions' entry points are now the PLT location in the application

Since the official address of certain variables or functions changes, the
link-time resolving that happened inside the library is now different from what
the application and other libraries will resolve.

So far, using -fPIE has been enough to make the main executable not create copy
relocations on i386 and x86-64, with GCC 4.9 and earlier, Clang and ICC. GCC 5
breaks that.

Given the relative code size of the application vs the libraries (the libraries
are at least 10x larger and more complex), I argue that we're optimising for
the wrong thing by using copy relocations. It's a historic mistake that needs
fixing in the ABI.

Please provide a way for libraries to be allowed to use -Bsymbolic and
-fvisibility=protected by making applications never use copy relocations.
Applications should resolve symbols coming from libraries via indirect,
position-independent addressing. We are ok with tagging every symbol in
question with a new __attribute__ (they are already all tagged with
__attribute__((visibility("default")))).
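
For illustration, tagging of that kind usually looks like the sketch below.
The macro MYLIB_EXPORT and the mylib_* symbols are hypothetical, not Qt's
actual names. Built as a single program, the two addresses trivially agree.
The problem in this report only arises once the variable lives in a shared
library and the executable gets a copy relocation for it.

```cpp
#if defined(__GNUC__) || defined(__clang__)
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#else
#  define MYLIB_EXPORT
#endif

// Exported variable: the kind of symbol that may be copy-relocated into
// a position-dependent executable.
MYLIB_EXPORT int mylib_counter = 0;

// Exported function that refers to the variable from inside the library.
MYLIB_EXPORT int mylib_increment()
{
    return ++mylib_counter;
}

// The address as seen from inside the library. With -Bsymbolic this is
// resolved at link time and must match what the application sees.
MYLIB_EXPORT int *mylib_counter_address()
{
    return &mylib_counter;
}
```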


[Bug target/65886] [5/6 Regression] Copy reloc in PIE incompatible with DSO created by -Wl,-Bsymbolic

2015-04-25 Thread thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886

--- Comment #3 from Thiago Macieira  ---
Thanks H.J.!

Can I ask that -fsymbolic be the default? Otherwise, code with -fPIE MUST add
-fsymbolic in GCC 5+, but can't add it prior because the option didn't exist.
Please leave that for a release or two so that we can adapt buildsystems.

