Re: Merging gdc (GNU D Compiler) into gcc

2011-10-05 Thread David Brown

On 04/10/2011 23:47, Andrew Pinski wrote:

On Tue, Oct 4, 2011 at 2:40 PM, David Brown  wrote:

"naked" functions are often useful in embedded systems, and are therefore
useful (and implemented) on many gcc targets.  It would make sense to have
the attribute available universally in gcc, if that doesn't involve a lot of
extra work, even though it is of little use on "big" systems (Linux,
Windows, etc.).


Is it really useful if you can have a small top-level inline-asm wrapper?



The "naked" attribute is used precisely when you don't want any wrapper 
or other code that you don't absolutely need.


An example of use is when you want to write the prologue and epilogue 
manually - perhaps the compiler's standard interrupt function prologue 
and epilogue are not quite good enough for a special case, so you write 
your own inline assembly.  You don't want the compiler to generate 
additional register saves, frame manipulation, etc.
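
As a sketch, it might look something like this (target-specific; this 
assumes an ARM Cortex-M style target, and the handler and helper names 
are purely illustrative):

```c
/* Hypothetical hand-written handler: "naked" suppresses the
 * compiler-generated prologue/epilogue, so only the asm below
 * is emitted for this function. */
extern void real_handler(void);   /* ordinary C function, assumed elsewhere */

void my_irq_handler(void) __attribute__((naked));

void my_irq_handler(void)
{
    __asm__ volatile (
        "push {r4-r7, lr}   \n\t"   /* save exactly what we need */
        "bl   real_handler  \n\t"   /* do the real work in plain C */
        "pop  {r4-r7, pc}   \n\t"   /* return manually - no epilogue */
    );
}
```

The compiler emits nothing but the inline assembly, which is the whole 
point of the attribute.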


Some toolchains are configured to have a series of "init" sections at 
startup (technically, that's a matter of the default linker scripts and 
libraries rather than the compiler).  You can get code to run at 
specific times during startup by placing the instructions directly 
within these sections - but it must be "raw" instructions, not function 
definitions (and definitely no "return" statement).  The stack may not 
be defined at this stage - perhaps you are using such code to initialise 
the memory system or do other low-level setup.  For an example, see this 
link:





I've also used "naked" functions to provide a place to store specific 
assembly code - I prefer to use inline assembly in C rather than a 
separate assembly module.


RTOSes often use "naked" in connection with tasks and task switching. 
After the RTOS has gone to all the effort of saving the registers and 
task context, it needs to jump into the next task without the overhead 
of additional register saves.  On small RISC microcontrollers, the 
register bank can be a significant proportion of the size of the 
chip's RAM - you don't want to waste the space or the time saving 
things unnecessarily.



Using the "naked" attribute means a little more of such code can be 
written in C, and a little less needs to be written in assembly.  That's 
a good thing.  And making "naked" consistent across more gcc ports means 
a little more of such code can be cross-target portable, which is also a 
good thing.  You are never going to eliminate all target-specific parts 
of such code, nor can you eliminate all assembly - but "naked" is a step 
in that direction.



While I'm on the subject, it would be /very/ nice if the gcc port 
maintainers agreed on function (and data) attributes.  Clearly there 
will be some variation between target ports, but surely code would be 
easier to write, port and maintain if there were more consistency.  From 
a user's viewpoint, it is hard to understand why some targets use 
"long_call" attributes and others use "longcall" to do the same thing, 
or why there are a dozen different ways to declare an interrupt function 
on different architectures.  If these could be unified, it would mean 
less effort for the port maintainers, an easier time for users, and 
less documentation.  It might also make it easier for useful attributes 
(like "naked" or "critical") to spread to other target ports.


Best regards,

David





Re: Incidents in ARM-NEON-Intrinsics

2011-10-05 Thread Julian Brown
On Wed, 05 Oct 2011 10:37:22 +0900
shiot...@rd.ten.fujitsu.com (塩谷晶彦) wrote:

> Hi, Maintainer,
> 
> I found some incidents in
> http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html#ARM-NEON-Intrinsics
> 
> Please check the following:
> 
> |6.54.3.8 Comparison (less-than-or-equal-to)
> |
> |  uint32x2_t vcle_u32 (uint32x2_t, uint32x2_t) 
> |  Form of expected instruction(s): vcge.u32 d0, d0, d0 
>   (snip)
> |  uint32x4_t vcleq_f32 (float32x4_t, float32x4_t) 
> |  Form of expected instruction(s): vcge.f32 q0, q0, q0 
> 
> The "vcge"s above may need to be "vcle" or "vcgt".

This is deliberate I think: the register/register forms of vcle, vclt,
vacle etc. are pseudo-instructions, and assemble to vcge, vcgt, etc.
with the operands reversed (a <= b === b >= a).
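
At the intrinsics level this is invisible; a sketch (assumes compiling 
for an ARM target with NEON enabled, function name illustrative):

```c
#include <arm_neon.h>

/* vcle has no register/register hardware encoding; the assembler
 * resolves the pseudo-instruction to vcge with operands swapped,
 * since a <= b is the same comparison as b >= a. */
uint32x2_t less_or_equal(uint32x2_t a, uint32x2_t b)
{
    return vcle_u32(a, b);   /* expected output: vcge.u32 with b, a */
}
```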

Are you seeing incorrect code being generated?

Julian


Re: Merging gdc (GNU D Compiler) into gcc

2011-10-05 Thread Andi Kleen
David Brown  writes:
>
> Some toolchains are configured to have a series of "init" sections at
> startup (technically, that's a matter of the default linker scripts
> and libraries rather than the compiler).  You can get code to run at
> specific times during startup by placing the instructions directly
> within these sections - but it must be "raw" instructions, not
> function definitions (and definitely no "return" statement).

Note that this only works with modern gcc with -fno-toplevel-reorder,
which disables some optimizations, and it is currently broken in several
ways with LTO (in the process of being fixed).

Normally you don't get any defined order in these sections; both
unit-at-a-time compilation and LTO reorder functions.  The linker may
also do so in some circumstances.

So it's somewhat dangerous to rely on.

-Andi


Re: Merging gdc (GNU D Compiler) into gcc

2011-10-05 Thread David Brown

On 05/10/2011 12:00, Andi Kleen wrote:

David Brown  writes:


Some toolchains are configured to have a series of "init" sections at
startup (technically, that's a matter of the default linker scripts
and libraries rather than the compiler).  You can get code to run at
specific times during startup by placing the instructions directly
within these sections - but it must be "raw" instructions, not
function definitions (and definitely no "return" statement).


Note that this only works with modern gcc with -fno-toplevel-reorder,
which disables some optimizations, and is currently broken in several
ways with LTO (in progress of being fixed)



You should not need to worry about re-ordering code within a module 
unless you have several pieces of code that have to be put into the same 
"initX" section in a specific order.  And if that's the case, then you 
can just put all that code within the one "naked" function.


Of course, you have to put the code in the right section.  An example 
(from the avr-libc documentation) is:


void my_init_portb (void) __attribute__ ((naked))
    __attribute__ ((section (".init3")));

void my_init_portb (void)
{
    PORTB = 0xff;
    DDRB = 0xff;
}


LTO might well remove that function unless you also use the "used" 
attribute - although hopefully the KEEP directive in the linker script 
will do the same thing.
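
A sketch of that variant, still in the avr-libc context (the "used" 
attribute marks the function as referenced even though nothing in the 
program calls it, which should keep LTO from discarding it):

```c
/* As before, but "used" prevents the compiler (and LTO) from
 * treating the apparently-uncalled function as dead code. */
void my_init_portb (void)
    __attribute__ ((naked, used, section (".init3")));

void my_init_portb (void)
{
    PORTB = 0xff;
    DDRB = 0xff;
}
```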


But I can't see why the -fno-toplevel-reorder flag would be needed here.


Normally you don't get any defined order in these sections, both
unit-at-a-time and LTO do reorder. The linker may also in some
circumstances.



You need a linker script that puts these sections in the right place, of 
course.



So it's somewhat dangerous to rely on.

-Andi





Re: C++11 atomic library notes

2011-10-05 Thread Andrew MacLeod

On 10/05/2011 12:14 AM, Jeffrey Yasskin wrote:


If, as the document proposes, "16 byte volatile will have to call the
external routines, but 16 byte non-volatiles would be lock-free.", and
the external routines use locked accesses for 16-byte volatile
atomics, then this makes the concurrent accesses to shared_var not
thread-safe. To be thread-safe, we'd have to call the external
routines for every 16-byte atomic, not just the volatile ones, and
those routines would have to use locked accesses uniformly rather than
distinguishing between volatile and non-volatile accesses. Not good.



This would seem to support the view that an object of a given size must 
be handled consistently, and that volatility is not a basis for 
segregating behaviour.

Which is good, because that's the result I wanted but was concerned about :-)

Even worse, on LL/SC architectures, every lock-free RMW operation
potentially involves multiple loads, so this interpretation of
volatility would prohibit lock-free access to all objects.

I see two ways out:
1) Say that accessing a non-volatile atomic through a volatile
reference or pointer causes undefined behavior. The standard doesn't
say that, and the casts are implicit, so this is icky.
2) Say that volatile atomic accesses may be implemented with more than
one instruction-level access.

(2) is something like how volatile reads of 128-bit structs involve
multiple mov instructions that execute in an arbitrary order. It's
also unlikely to cause problems in existing programs because nobody's
using volatile atomics yet, and they'll only start using them in ways
that work with what compilers implement.


To clarify, you are suggesting that we say atomic accesses to volatile 
objects may involve more than a single load?


Can we also state that a 'harmless' store may also happen? (i.e., a 0 
stored over an existing 0, or some other arbitrary value.)  Otherwise I 
don't know how to get a 128-bit atomic load on x86-64 :-P  which then 
means no inlined lock-free atomics on 16-byte values.

It's unpleasant, but...  other suggestions?

Andrew


Re: C++11 atomic library notes

2011-10-05 Thread Jeffrey Yasskin
On Wed, Oct 5, 2011 at 5:49 AM, Andrew MacLeod  wrote:
> On 10/05/2011 12:14 AM, Jeffrey Yasskin wrote:
>> I see two ways out:
>> 1) Say that accessing a non-volatile atomic through a volatile
>> reference or pointer causes undefined behavior. The standard doesn't
>> say that, and the casts are implicit, so this is icky.
>> 2) Say that volatile atomic accesses may be implemented with more than
>> one instruction-level access.
>>
>> (2) is something like how volatile reads of 128-bit structs involve
>> multiple mov instructions that execute in an arbitrary order. It's
>> also unlikely to cause problems in existing programs because nobody's
>> using volatile atomics yet, and they'll only start using them in ways
>> that work with what compilers implement.
>
> To clarify, you are suggesting that we say atomic accesses to volatile
> objects may involve more than a single load?
>
> Can we also state that a 'harmless' store may also happen? (ie, a 0 to an
> existing 0, or some other arbitrary value)   Otherwise I don't know how to
> get a 128 bit atomic load on x86-64 :-P  which then means no inlined
> lock-free atomics on 16 byte values.
> Its unpleasant, but...  other suggestions?

Yes, that's what I'm suggesting. The rule for 'volatile' from the
language is just that "Accesses to volatile objects are evaluated
strictly according to the rules of the abstract machine." If the
instruction-level implementation for a 16-byte atomic load is
cmpxchg16b, then that's just how the abstract machine is implemented,
and the rule says you have to do that consistently for volatile
objects rather than sometimes optimizing it away. That's my argument
anyway. If there's another standard you're following beyond "kernel
people tend to ask for it," the situation may be trickier.

Jeffrey

P.S. On x86, cmpxchg's description says, "To simplify the interface to
the processor’s bus, the destination operand receives a write cycle
without regard to the result of the comparison. The destination
operand is written back if the comparison fails; otherwise, the source
operand is written into the destination. (The processor never produces
a locked read without also producing a locked write.)", so 128-bit
atomic loads will always write the original value back.


Re: C++11 atomic library notes

2011-10-05 Thread Andrew MacLeod

On 10/05/2011 10:44 AM, Jeffrey Yasskin wrote:


Yes, that's what I'm suggesting. The rule for 'volatile' from the
language is just that "Accesses to volatile objects are evaluated
strictly according to the rules of the abstract machine." If the
instruction-level implementation for a 16-byte atomic load is
cmpxchg16b, then that's just how the abstract machine is implemented,
and the rule says you have to do that consistently for volatile
objects rather than sometimes optimizing it away. That's my argument
anyway. If there's another standard you're following beyond "kernel
people tend to ask for it," the situation may be trickier.


perfect, I like it.
Andrew



avx2 incorrect representations of shifts

2011-10-05 Thread Richard Henderson
These patterns:

> (define_insn "avx2_lshlqv4di3"
>   [(set (match_operand:V4DI 0 "register_operand" "=x")
> (ashift:V4DI (match_operand:V4DI 1 "register_operand" "x")
>  (match_operand:SI 2 "const_0_to_255_mul_8_operand" 
> "n")))]
>   "TARGET_AVX2"
> {
>   operands[2] = GEN_INT (INTVAL (operands[2]) / 8);
>   return "vpslldq\t{%2, %1, %0|%0, %1, %2}";
> }
...
> (define_insn "avx2_lshrqv4di3"
>   [(set (match_operand:V4DI 0 "register_operand" "=x")
> (lshiftrt:V4DI
>  (match_operand:V4DI 1 "register_operand" "x")
>  (match_operand:SI 2 "const_0_to_255_mul_8_operand" "n")))]
>   "TARGET_AVX2"
> {
>   operands[2] = GEN_INT (INTVAL (operands[2]) / 8);
>   return "vpsrldq\t{%2, %1, %0|%0, %1, %2}";

are incorrect.  This is a 128-bit lane shift, i.e. V2TImode.

Uros, do you have an opinion on whether we should add V2TImode to
the set of modes, which would mean adding it to the move patterns?
Although given the scarce number of operations that we'd be able
to perform on V2TImode, perhaps it is better to simply use an unspec.


r~


Re: avx2 incorrect representations of shifts

2011-10-05 Thread Uros Bizjak
On Wed, Oct 5, 2011 at 8:57 PM, Richard Henderson  wrote:
> These patterns:
>
>> (define_insn "avx2_lshlqv4di3"
>>   [(set (match_operand:V4DI 0 "register_operand" "=x")
>>         (ashift:V4DI (match_operand:V4DI 1 "register_operand" "x")
>>                      (match_operand:SI 2 "const_0_to_255_mul_8_operand" 
>> "n")))]
>>   "TARGET_AVX2"
>> {
>>   operands[2] = GEN_INT (INTVAL (operands[2]) / 8);
>>   return "vpslldq\t{%2, %1, %0|%0, %1, %2}";
>> }
> ...
>> (define_insn "avx2_lshrqv4di3"
>>   [(set (match_operand:V4DI 0 "register_operand" "=x")
>>         (lshiftrt:V4DI
>>          (match_operand:V4DI 1 "register_operand" "x")
>>          (match_operand:SI 2 "const_0_to_255_mul_8_operand" "n")))]
>>   "TARGET_AVX2"
>> {
>>   operands[2] = GEN_INT (INTVAL (operands[2]) / 8);
>>   return "vpsrldq\t{%2, %1, %0|%0, %1, %2}";
>
> are incorrect.  This is a 128-bit lane shift, i.e. V2TImode.
>
> Uros, do you have an opinion on whether we should add V2TImode to
> the set of modes, which means adding that to the move patterns.
> Although given the scarse number of operations that we'd be able
> to perform on V2TImode, perhaps it is better to simply use an unspec.

We already have V2TImode, but hidden in VIMAX_AVX2 mode iterator.
Based on that, I would suggest that we model correct insn
functionality and try to avoid unspec. On the related note, there is
no move insn for V2TImode, so V2TI should be added to V16 mode
iterator and a couple of other places (please grep for V1TImode, used
for SSE full-register shift insns only).

Uros.


Option to make unsigned->signed conversion always well-defined?

2011-10-05 Thread Ulf Magnusson
Hi,

I've been experimenting with different methods for emulating the
signed overflow of an 8-bit CPU. The method I've found that seems to
generate the most efficient code on both ARM and x86 is

bool overflow(unsigned int a, unsigned int b) {
    const unsigned int sum = (int8_t)a + (int8_t)b;
    return (int8_t)sum != sum;
}

(The real function would probably be 'inline', of course. Regs are
stored in overlong variables, hence 'unsigned int'.)

Looking at the spec, it unfortunately seems the behavior of this
function is undefined, as it relies on signed int addition wrapping,
and that (int8_t)sum truncates bits. Is there some way to make this
guaranteed safe with GCC without resorting to inline asm? Locally
enabling -fwrapv takes care of the addition, but that still leaves the
conversion.

/Ulf


Re: Option to make unsigned->signed conversion always well-defined?

2011-10-05 Thread Ulf Magnusson
On Wed, Oct 5, 2011 at 10:11 PM, Ulf Magnusson  wrote:
> Hi,
>
> I've been experimenting with different methods for emulating the
> signed overflow of an 8-bit CPU. The method I've found that seems to
> generate the most efficient code on both ARM and x86 is
>
> bool overflow(unsigned int a, unsigned int b) {
>    const unsigned int sum = (int8_t)a + (int8_t)b;
>    return (int8_t)sum != sum;
> }
>
> (The real function would probably be 'inline', of course. Regs are
> stored in overlong variables, hence 'unsigned int'.)
>
> Looking at the spec, it unfortunately seems the behavior of this
> function is undefined, as it relies on signed int addition wrapping,
> and that (int8_t)sum truncates bits. Is there some way to make this
> guaranteed safe with GCC without resorting to inline asm? Locally
> enabling -fwrapv takes care of the addition, but that still leaves the
> conversion.
>
> /Ulf
>

Is *((int8_t*)&sum) safe (assuming little endian)? Unfortunately it
seems to generate worse code. On x86 it generates the following (GCC
4.5.2):

0050 <_Z9overflow4jj>:
  50:   83 ec 10                sub    $0x10,%esp
  53:   0f be 54 24 18          movsbl 0x18(%esp),%edx
  58:   0f be 44 24 14          movsbl 0x14(%esp),%eax
  5d:   8d 04 02                lea    (%edx,%eax,1),%eax
  60:   0f be d0                movsbl %al,%edx
  63:   39 d0                   cmp    %edx,%eax
  65:   0f 95 c0                setne  %al
  68:   83 c4 10                add    $0x10,%esp
  6b:   c3                      ret

With the straight (int8_t) cast you get

  50:   0f be 54 24 08          movsbl 0x8(%esp),%edx
  55:   0f be 44 24 04          movsbl 0x4(%esp),%eax
  5a:   8d 04 02                lea    (%edx,%eax,1),%eax
  5d:   0f be d0                movsbl %al,%edx
  60:   39 c2                   cmp    %eax,%edx
  62:   0f 95 c0                setne  %al
  65:   c3                      ret

What's with the extra add/sub of ESP?

/Ulf


Re: Merging gdc (GNU D Compiler) into gcc

2011-10-05 Thread Walter Bright



On 10/4/2011 12:08 AM, Iain Buclaw wrote:

I have received news from Walter Bright that the license of the D
frontend has been assigned to the FSF. As the current maintainer of
GDC, I would like to get this moved forward, starting with getting the
ball rolling. What would need to be done? And what are the processes
required? (ie: passing the project through to technical review.)

The current home of GDC is here: https://bitbucket.org/goshawk/gdc




I did this for the express purpose of enabling GDC to be merged into gcc. Being 
part of gcc will be a great boon to the D community and to the gcc user 
community at large. Iain is in charge of the GDC project, has been doing a 
fantastic job, and has my full support and the support of the D community. Let's 
make this happen!


Re: Option to make unsigned->signed conversion always well-defined?

2011-10-05 Thread Pedro Pedruzzi
On 05-10-2011 17:11, Ulf Magnusson wrote:
> Hi,
> 
> I've been experimenting with different methods for emulating the
> signed overflow of an 8-bit CPU.

You would like to check whether an 8-bit signed addition will overflow
or not, given the two operands. Is that correct?

As you used the word `emulating', I am assuming that your function will
not be run on the mentioned CPU.

Does this 8-bit CPU use two's complement representation?

> The method I've found that seems to
> generate the most efficient code on both ARM and x86 is
> 
> bool overflow(unsigned int a, unsigned int b) {
> const unsigned int sum = (int8_t)a + (int8_t)b;
> return (int8_t)sum != sum;
> }
> 
> (The real function would probably be 'inline', of course. Regs are
> stored in overlong variables, hence 'unsigned int'.)
> 
> Looking at the spec, it unfortunately seems the behavior of this
> function is undefined, as it relies on signed int addition wrapping,
> and that (int8_t)sum truncates bits. Is there some way to make this
> guaranteed safe with GCC without resorting to inline asm? Locally
> enabling -fwrap takes care of the addition, but that still leaves the
> conversion.

I believe the cast from unsigned int to int8_t is implementation-defined
for values that can't be represented in int8_t (e.g. 0xff) - not quite
undefined behaviour, strictly speaking, but still not portable.

I tried:

bool overflow(unsigned int a, unsigned int b) {
    const unsigned int sum = a + b;
    return ((a & 0x80) == (b & 0x80)) && ((a & 0x80) != (sum & 0x80));
}

But it is not as efficient as yours.

-- 
Pedro Pedruzzi