Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:22 PM, Andy Lutomirski wrote:
> Hi all-
> 
> I'm working on a massive set of cleanups to Linux's syscall handling.
> We currently have a nasty optimization in which we don't save rbx,
> rbp, r12, r13, r14, and r15 on x86_64 before calling C functions.
> This works, but it makes the code a huge mess.  I'd rather save all
> regs in asm and then call C code.
> 
> Unfortunately, this will add five cycles (on SNB) to one of the
> hottest paths in the kernel.  To counteract it, I have a gcc feature
> request that might not be all that crazy.  When writing C functions
> intended to be called from asm, what if we could do:
> 
> __attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14",
> "r15"))) void func(void);
> 
> This will save enough pushes and pops that it could easily give us our
> five cycles back and then some.  It's also easy to be compatible with
> old GCC versions -- we could just omit the attribute, since preserving
> a register is always safe.
> 
> Thoughts?  Is this totally crazy?  Is it easy to implement?
> 
> (I'm not necessarily suggesting that we do this for the syscall bodies
> themselves.  I want to do it for the entry and exit helpers, so we'd
> still lose the five cycles in the full fast-path case, but we'd do
> better in the slower paths, and the slower paths are becoming
> increasingly important in real workloads.)
> 

Some gcc targets have done this in the past.  There are command-line
options to do that, but using attributes you have to handle cross-ABI
compilation.

However, I don't see this being done in the upstream gcc.

Keep in mind the runway that we'll need, though.

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
> I'd say the most natural API for this would be to allow
> f{fixed,call-{used,saved}}-REG in target attribute.

Either that or

__attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))

... just to be shorter.  Either way, I would consider this to be
desirable -- I have myself used this to good effect in a past life
(*cough* Transmeta *cough*) -- but not a high priority feature.

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:48 PM, Andy Lutomirski wrote:
> On Tue, Jun 30, 2015 at 2:41 PM, H. Peter Anvin  wrote:
>> On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
>>> I'd say the most natural API for this would be to allow
>>> f{fixed,call-{used,saved}}-REG in target attribute.
>>
>> Either that or
>>
>> __attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))
>>
>> ... just to be shorter.  Either way, I would consider this to be
>> desirable -- I have myself used this to good effect in a past life
>> (*cough* Transmeta *cough*) -- but not a high priority feature.
> 
> I think I mean the per-function equivalent of -fcall-used-reg, so
> hpa's "used" suggestion would do the trick.
> 
> I guess that clobbering the frame pointer is a non-starter, but five
> out of six isn't so bad.  It would be nice to error out instead of
> producing "disastrous results", though, if another bad reg is chosen.
> (Presumably the PIC register on PIC builds would be an example of
> that.)
> 

Clobbering the frame pointer is perfectly fine, as is the PIC register.
 However, gcc might need to handle them as "fixed" rather than "clobbered".

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-06-30 Thread H. Peter Anvin
On 06/30/2015 02:55 PM, Andy Lutomirski wrote:
> On Tue, Jun 30, 2015 at 2:52 PM, H. Peter Anvin  wrote:
>> On 06/30/2015 02:48 PM, Andy Lutomirski wrote:
>>> On Tue, Jun 30, 2015 at 2:41 PM, H. Peter Anvin  wrote:
>>>> On 06/30/2015 02:37 PM, Jakub Jelinek wrote:
>>>>> I'd say the most natural API for this would be to allow
>>>>> f{fixed,call-{used,saved}}-REG in target attribute.
>>>>
>>>> Either that or
>>>>
>>>> __attribute__((fixed(rbp,rcx),used(rax,rbx),saved(r11)))
>>>>
>>>> ... just to be shorter.  Either way, I would consider this to be
>>>> desirable -- I have myself used this to good effect in a past life
>>>> (*cough* Transmeta *cough*) -- but not a high priority feature.
>>>
>>> I think I mean the per-function equivalent of -fcall-used-reg, so
>>> hpa's "used" suggestion would do the trick.
>>>
>>> I guess that clobbering the frame pointer is a non-starter, but five
>>> out of six isn't so bad.  It would be nice to error out instead of
>>> producing "disastrous results", though, if another bad reg is chosen.
>>> (Presumably the PIC register on PIC builds would be an example of
>>> that.)
>>>
>>
>> Clobbering the frame pointer is perfectly fine, as is the PIC register.
>>  However, gcc might need to handle them as "fixed" rather than "clobbered".
> 
> Hmm.  True, I guess, although I wouldn't necessarily expect gcc to be
> able to generate code to call a function like that.
> 

No, but you need to be able to call other functions, or you just push
the issue down one level.

-hpa




Re: gcc feature request / RFC: extra clobbered regs

2015-07-01 Thread H. Peter Anvin
On 07/01/2015 10:43 AM, Jakub Jelinek wrote:
> On Wed, Jul 01, 2015 at 01:35:16PM -0400, Vladimir Makarov wrote:
>> Actually it raises a question for me.  If we describe that a function
>> clobbers more than the calling convention says, then use it as a value
>> (assigning it to a variable or passing it as an argument), lose track of
>> it, and then call it, how can the RA know what the call actually clobbers?
>> So for a function with these attributes we should prohibit using it as a
>> value, or make the attributes part of the function type, or at least say
>> it is unsafe.  So now I see this as a *bigger problem* with this
>> extension.  Although I guess it already exists, as we have descriptions
>> of different ABIs as an extension.
> 
> Unfortunately target attribute is function decl attribute rather than
> function type.  And having more attributes affect switchable targets will be
> non-fun.
> 

How on Earth does that work with existing switchable ABIs?  Keep in mind
that we already support multiple ABIs...

-hpa




Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 01:41, David Chisnall wrote:
> On 21 Sep 2015, at 21:45, H.J. Lu via cfe-dev  wrote:
>>
>> The main purpose of x86 interrupt attribute is to allow programmers
>> to write x86 interrupt/exception handlers in C WITHOUT assembly
>> stubs to avoid extra branch from assembly stubs to C functions.  I
>> want to keep the number of new intrinsics to a minimum without sacrificing
>> handler performance. I leave faking error code in interrupt handler to
>> the programmer.
> 
> The assembly stubs have to come from somewhere.  You either put them
> in an assembly file (most people doing embedded x86 stuff steal the
> ones from NetBSD), or you put them in the compiler where they can be
> inlined.  In terms of user interface, there’s not much difference in
> complexity.  Having written this kind of code in the past, I can
> honestly say that using the assembly stubs was the least difficult
> part of getting them right.  In terms of compiler complexity, there’s
> a big difference: in one case the compiler contains nothing, in the
> other it contains something special for a single use case.  In terms
> of performance, the compiler version has the potential to be faster,
> but if we’re going to pay for the complexity then I think that we’d
> need to see some strong evidence that someone else is getting a
> noticeable benefit.
> 

It is worth noting that most architectures have this support for a reason.

-hpa



Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 04:44, David Chisnall wrote:
> On 22 Sep 2015, at 12:39, H.J. Lu via cfe-dev  wrote:
>>
>> The center piece of my proposal is not to change how parameters
>> are passed in the compiler.  As for user experience, the feedback on
>> my proposal from our users is very positive.
> 
> Implementing the intrinsics for getting the current interrupt
> requires a lot of support code for it to actually be useful.  For it
> to be useful, you are requiring all of the C code to be run with
> interrupts disabled (and even that doesn’t work if you get a NMI in
> the middle).  Most implementations use a small amount of assembly to
> capture the interrupt cause and the register state on entry to the
> handler, then reenable interrupts while the C code runs.  This means
> that any interrupts (e.g. page faults, illegal instruction traps,
> whatever) that happen while the C code is running do not mask the
> values.  Accessing these values from *existing* C code is simply a
> matter of loading a field from a structure.
> 
> I’m really unconvinced that something with such a narrow
> use case (and one that encourages writing bad code) belongs in the
> compiler.
> 

You seem to not understand how x86 works, nor have noted how this is
nearly universally supported by various architectures; x86 is the
exception here.

x86 stores its interrupt state on the stack, not in a register which can
be clobbered.  Also, a lot of your assertions about "most
implementations" only apply to full-scale operating systems.

-hpa




Re: [cfe-dev] RFC: Support x86 interrupt and exception handlers

2015-09-22 Thread H. Peter Anvin
On 09/22/15 04:52, David Chisnall wrote:
> On 22 Sep 2015, at 12:47, H.J. Lu  wrote:
>>
>> since __builtin_exception_error () is the same as
>> __builtin_return_address (0) and __builtin_interrupt_data () is
>> address of __builtin_exception_error () + size of register.
> 
> Except that they’re *not*.  __builtin_return_address(0) is guaranteed to be 
> the same for the duration of the function.  __builtin_exception_error() needs 
> to either:
> 
> 1) Fetch the values early with interrupts disabled, store them in a
> well-known location, and load them from this place when the intrinsic
> is called, or
> 
> 2) Force any function that calls the intrinsic (and wants a
> meaningful result) to run with interrupts disabled, which is
> something that the compiler can’t verify without knowing the full
> chain of code from the interrupt handler to the current point (and
> therefore prone to error).
> 
> It is trivial to write a little bit of inline assembly that reads
> these values from the CPU and expose that for C code.  There is a
> good reason why no one does this.
> 

This is why it makes no sense for the intrinsics to be callable from
anywhere except inside the interrupt handler.  It is really nothing
other than a way to pass arguments -- whether or not it is simpler for
the compilers to implement than supporting a different function
signature is beyond my scope of expertise.
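To make the "it is really just a way to pass arguments" point concrete, here is a minimal C sketch of the 64-bit x86 exception frame as the CPU leaves it on the stack for a vector that pushes an error code. The struct and function names are hypothetical (not a GCC or kernel API), and the layout applies to 64-bit mode only; 32-bit mode uses 4-byte slots and omits SS/ESP unless the privilege level changes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C view of a 64-bit x86 exception frame. The CPU pushes
 * SS, RSP, RFLAGS, CS and RIP in that order, then the error code, so the
 * error code ends up at the lowest address: the slot where a normal call
 * would have left its return address.
 */
struct hypothetical_x86_64_exception_frame {
    uint64_t error_code; /* only pushed for some vectors, e.g. #GP and #PF */
    uint64_t rip;        /* interrupted instruction pointer */
    uint64_t cs;         /* code segment selector, in a 64-bit slot */
    uint64_t rflags;     /* interrupted flags */
    uint64_t rsp;        /* interrupted stack pointer */
    uint64_t ss;         /* stack segment selector, in a 64-bit slot */
};

/* Offset of the hardware frame proper, measured from the error code slot. */
static size_t interrupt_data_offset(void)
{
    return offsetof(struct hypothetical_x86_64_exception_frame, rip);
}
```

The error code sits where a return address would, and the rest of the frame starts one register width above it, which is the relationship H.J. describes between the two proposed intrinsics.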

-hpa



Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 12:33 PM, Richard Henderson wrote:
> 
> (0) The C level output variable should be an integral type, from bool on up.
> 
> The flags are a scarce resource, easily clobbered.  We cannot allow user code
> to keep data in the flags.  While x86 does have lahf/sahf, they don't exactly
> perform well.  And other targets like arm don't even have that bad option.
> 
> Therefore, the language level semantics are that the output is a boolean store
> into the variable with a condition specified by a magic constraint.
> 
> That said, just like the compiler should be able to optimize
> 
> void bar(int y)
> {
>   int x = (y <= 0);
>   if (x) foo();
> }
> 
> such that we only use a single compare against y, the expectation is that
> within a similarly constrained context the compiler will not require two tests
> for these boolean outputs.
> 
> Therefore:
> 
> (1) Each target defines a set of constraint strings,
> 
>E.g. for x86, wherein we're almost out of constraint letters,
> 
>  ja   aux carry flag
>  jc   carry flag
>  jo   overflow flag
>  jp   parity flag
>  js   sign flag
>  jz   zero flag
> 

I would argue that for x86 what you actually want is to model the
*conditions* that are available on the flags, not the flags themselves.
 There are 16 such conditions, 8 if we discard the inversions.

It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
instructions; it is only ever consumed by the DAA/DAS instructions, which
makes it pointless to try to model it in a compiler any more than, say, IF.

> (2) A new target hook post-processes the asm_insn, looking for the
> new constraint strings.  The hook expands the condition prescribed
> by the string, adjusting the asm_insn as required.
> 
>   E.g.
> 
> bool x, y, z;
> asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

Other than that, this is exactly what would be wonderful to see.

-hpa



Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 01:14 PM, H. Peter Anvin wrote:
>>
>> Therefore:
>>
>> (1) Each target defines a set of constraint strings,
>>
>>E.g. for x86, wherein we're almost out of constraint letters,
>>
>>  ja   aux carry flag
>>  jc   carry flag
>>  jo   overflow flag
>>  jp   parity flag
>>  js   sign flag
>>  jz   zero flag
>>
> 
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.
>  There are 16 such conditions, 8 if we discard the inversions.
> 
> It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
> instructions; it is only ever consumed by the DAA/DAS instructions, which
> makes it pointless to try to model it in a compiler any more than, say, IF.
> 

OK, let me qualify that.  This is only necessary if it is impractical
for gcc to optimize boolean combinations of flags.  If such
optimizations are available then it doesn't matter and is probably
needlessly complex.  For example:

#include <stdbool.h>

char foo(void)
{
    bool zf, sf, of;

    asm("xyzzy" : "=jz" (zf), "=js" (sf), "=jo" (of));

    return zf || (sf != of);
}

... should compile to ...

xyzzy
setng %al
ret
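For comparison, GCC 6 eventually added flag output operands for x86, spelled `=@cc<cond>` rather than the `=jz`-style strings proposed here. The sketch below assumes GCC 6+ (or a Clang that defines `__GCC_ASM_FLAG_OUTPUTS__`) on x86, with a plain C fallback elsewhere, and computes a signed "a <= b" both from three individual flags and from the `le` condition directly. The function names are illustrative.

```c
#include <stdbool.h>

/* Signed a <= b from individual flags: ZF || (SF != OF). */
static bool le_from_flags(int a, int b)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool zf, sf, of;
    /* AT&T operand order: "cmpl %4, %3" sets the flags from %3 - %4, i.e. a - b. */
    __asm__("cmpl %4, %3"
            : "=@ccz"(zf), "=@ccs"(sf), "=@cco"(of)
            : "r"(a), "r"(b));
    return zf || (sf != of);
#else
    return a <= b; /* fallback when flag outputs are unavailable */
#endif
}

/* The same test, asking for the "le" condition instead of raw flags. */
static bool le_from_condition(int a, int b)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool le;
    /* "cmpl %2, %1" sets the flags from a - b. */
    __asm__("cmpl %2, %1" : "=@ccle"(le) : "r"(a), "r"(b));
    return le;
#else
    return a <= b;
#endif
}
```

Whether an optimizing compiler folds the first form into a single `setle` (the same instruction as `setng`) depends on its boolean optimizations; the second form states the condition directly and does not rely on them.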

-hpa




Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 01:35 PM, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin  wrote:
>>
>> I would argue that for x86 what you actually want is to model the
>> *conditions* that are available on the flags, not the flags themselves.
> 
> Yes. Otherwise it would be a nightmare to try to describe simple
> conditions like "le", which is a rather complicated combination of three
> of the actual flag bits:
> 
> ((SF ^^ OF) || ZF) = 1
> 
> which would just be ridiculously painful for (a) the user to describe
> and (b) for the compiler to recognize once described.
> 
> Now, I do admit that most of the cases where you'd use inline asm with
> condition codes would probably fall into just simple "test ZF or CF".
> But I could certainly imagine other cases.
> 

Yes, although once again I'm more than happy to let gcc do the boolean
optimizations if it already has logic to do so (which it might have/want
for its own reasons.)

-hpa




Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread H. Peter Anvin
On 05/04/2015 01:57 PM, Richard Henderson wrote:
> 
> Sure.
> 
> I'd be more inclined to support these compound conditionals directly, rather
> than try to get the compiler to recognize them after the fact.
> 
> Indeed, I believe we have a near complete set of them in the x86 backend
> already.  It'd just be a matter of selecting the spellings for the 
> constraints.
> 

Whichever works for you.

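Here is the full set of conditions, with their mnemonics and a bitmask.
Bit n of the bitmask is set when the condition holds for flag state n,
where n is built from the flags in the order MSB to LSB (OF,SF,ZF,PF,CF);
this is probably the sanest way to model these for the purpose of boolean
optimization.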

Opcode  Mnemonics   Condition           Bitmask
0       o           OF                  0xffff0000
1       no          !OF                 0x0000ffff
2       b/c/nae     CF                  0xaaaaaaaa
3       ae/nb/nc    !CF                 0x55555555
4       e/z         ZF                  0xf0f0f0f0
5       ne/nz       !ZF                 0x0f0f0f0f
6       be/na       CF || ZF            0xfafafafa
7       a/nbe       !CF && !ZF          0x05050505
8       s           SF                  0xff00ff00
9       ns          !SF                 0x00ff00ff
A       p/pe        PF                  0xcccccccc
B       np/po       !PF                 0x33333333
C       l/nge       SF != OF            0x00ffff00
D       ge/nl       SF == OF            0xff0000ff
E       le/ng       ZF || (SF != OF)    0xf0fffff0
F       g/nle       !ZF && (SF == OF)   0x0f00000f
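The bitmask column can be checked mechanically. A minimal sketch, assuming the bit ordering described above (flag state n has OF in bit 4 down to CF in bit 0): evaluate each condition over all 32 flag states and pack the results into a 32-bit truth table. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions inside the 5-bit flag state, MSB to LSB: OF, SF, ZF, PF, CF. */
enum { CF_BIT = 0, PF_BIT = 1, ZF_BIT = 2, SF_BIT = 3, OF_BIT = 4 };

struct flags { bool of, sf, zf, pf, cf; };

typedef bool (*condition)(struct flags f);

static bool cond_o(struct flags f)  { return f.of; }
static bool cond_c(struct flags f)  { return f.cf; }
static bool cond_z(struct flags f)  { return f.zf; }
static bool cond_p(struct flags f)  { return f.pf; }
static bool cond_l(struct flags f)  { return f.sf != f.of; }
static bool cond_le(struct flags f) { return f.zf || (f.sf != f.of); }

/* Evaluate a condition for every flag state n and set bit n when it holds. */
static uint32_t condition_mask(condition cond)
{
    uint32_t mask = 0;
    for (unsigned n = 0; n < 32; n++) {
        struct flags f = {
            .of = (n >> OF_BIT) & 1u,
            .sf = (n >> SF_BIT) & 1u,
            .zf = (n >> ZF_BIT) & 1u,
            .pf = (n >> PF_BIT) & 1u,
            .cf = (n >> CF_BIT) & 1u,
        };
        if (cond(f))
            mask |= UINT32_C(1) << n;
    }
    return mask;
}
```

Because each mask is a truth table, boolean combinations of conditions become bitwise operations on masks: `le` is `z | l`, and each inverted condition is the bitwise NOT of its partner.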

-hpa



Is it safe to use _Bool as asm statement outputs on x86?

2015-06-02 Thread H. Peter Anvin
For the x86 backend explicitly, is doing something like:

_Bool x;

asm("blah ; setc %0" : "=qm" (x));

... guaranteed to be safe for older versions of gcc?

-hpa


Re: Is it safe to use _Bool as asm statement outputs on x86?

2015-06-02 Thread H. Peter Anvin
On 06/02/2015 06:02 PM, Richard Henderson wrote:
> On 06/02/2015 04:46 PM, H. Peter Anvin wrote:
>> For the x86 backend explicitly, is doing something like:
>>
>> _Bool x;
>>
>> asm("blah ; setc %0" : "=qm" (x));
>>
>> ... guaranteed to be safe for older versions of gcc?
> 
> I believe so, for the restricted set of conditions I expect you're asking.
> In particular:
> 
>  (1) Linux has always defined _Bool as a byte (indeed, afaik only Darwin
>  has ever done otherwise).
> 
>  (2) You must really produce 0/1 from the asm; the compiler doesn't re-do
>  the canonicalization afterward, and afaik we do rely on that in the
>  optimizers.  But certainly that's true for any version of GCC.
> 

That is all good as far as I am concerned.
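A small sketch of the pattern under discussion, assuming GCC-style inline asm on x86 (with a plain C fallback elsewhere): `bt` copies the selected bit into CF, and `setc` then stores exactly 0 or 1 into the `_Bool`, which satisfies Richard's second condition. The function name is illustrative. On i386 the `q` constraint limits the register choice to the ones with addressable low bytes (%al, %bl, %cl, %dl).

```c
#include <stdbool.h>

/* Test bit `bit` of `word`, returning CF as a canonical 0/1 _Bool. */
static bool test_bit_setc(unsigned long word, unsigned long bit)
{
    _Bool result;
#if defined(__x86_64__) || defined(__i386__)
    /* AT&T order: "bt offset, base" copies bit `offset` of `base` into CF. */
    __asm__("bt %2, %1\n\t"
            "setc %0"
            : "=qm"(result)
            : "r"(word), "r"(bit)
            : "cc");
#else
    result = (word >> bit) & 1u; /* non-x86 fallback */
#endif
    return result;
}
```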

-hpa




Re: Is it safe to use _Bool as asm statement outputs on x86?

2015-06-02 Thread H. Peter Anvin
On 06/02/2015 11:23 PM, H.J. Lu wrote:
> Trampling code for nested functions puts code on stack.

This is an issue for the CS ≠ DS case, as opposed to _Bool, I assume.
In other words, we are okay if and only if we can run with an NX stack?

-hpa



Re: X32 psABI status

2011-02-12 Thread H. Peter Anvin
On 02/12/2011 01:10 PM, Florian Weimer wrote:
> Why is the ia32 compatibility kernel interface used?

Because there is no way in hell we're designing in a second
compatibility ABI in the kernel (and it has to be a compatibility ABI,
because of the pointer size difference.)

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:10 PM, H.J. Lu wrote:
>>>
>>> 1. Kernel interface with syscall is close to be finalized.
>>

I don't think calling it "finalized" is accurate... it is more
accurately described as "prototyped".

>> Really? I haven't seen this being posted for review yet ;-)
>>
>> The basic concept looks entirely reasonable to me, but I'm
>> curious what drove the decision to start out with the x86_64
>> system calls instead of the generic ones.
>>
>> Since tile was merged, we now have support for compat syscalls
>> in the generic syscall ABI. I would have assumed that it
>> was possible to just use those if you decide to do a new
>> ABI in the first place.
>> 
>> The other option that would have appeared natural to me is
>> to just use the existing 32 bit compat ABI with the few
>> necessary changes done based on the personality.

The actual idea is to use the i386 compat ABI for memory layout, but
with a 64-bit register convention.  That means that system calls that
don't make references to memory structures can simply use the 64-bit
system calls, otherwise we're planning to reuse the i386 compat system
calls, but invoke them via the syscall instruction (which requires a new
system call table) and to pass 64-bit arguments in single registers.

-hpa




Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:28 PM, H.J. Lu wrote:
> 
> That is currently implemented on the hjl/x32 branch.
> 
> I also added
> 
> __NR_sigaction
> __NR_sigpending
> __NR_sigprocmask
> __NR_sigsuspend
> 
> to help the Bionic C library.
> 

That seems a little redundant... even on the i386 front we want people
to use the rt_sig* system calls.  As a porting aid I can see it, but we
should avoid deprecated system calls in the final version.

-hpa




Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 01:16 PM, H. Peter Anvin wrote:
> 
> The actual idea is to use the i386 compat ABI for memory layout, but
> with a 64-bit register convention.  That means that system calls that
> don't make references to memory structures can simply use the 64-bit
> system calls, otherwise we're planning to reuse the i386 compat system
> calls, but invoke them via the syscall instruction (which requires a new
> system call table) and to pass 64-bit arguments in single registers.
> 

Oh, and as to why not copy the i386 system call list straight off... we
don't really want to add a new ABI with crap like sys_socketcall.

-hpa




Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin
On 02/13/2011 02:28 PM, Arnd Bergmann wrote:
> On Sunday 13 February 2011, H. Peter Anvin wrote:
>> The actual idea is to use the i386 compat ABI for memory layout, but
>> with a 64-bit register convention.  That means that system calls that
>> don't make references to memory structures can simply use the 64-bit
>> system calls, otherwise we're planning to reuse the i386 compat system
>> calls, but invoke them via the syscall instruction (which requires a new
>> system call table) and to pass 64-bit arguments in single registers.
> 
> As far as I know, any task can already call both the 32 and 64 bit syscall
> entry points on x86. Is there anything you can't do just as well by
> using a combination of the two methods, without introducing a third one?

We prototyped using the int $0x80 system call entry point.  However,
there are two disadvantages:

a. the int $0x80 instruction is much slower than syscall.  An actual
   i386 process can use the syscall instruction which is disambiguated
   by the CPU based on mode, but an x32 process is in the same CPU mode
   as a normal 64-bit process.
b. 64-bit arguments have to be split between two registers for the
   i386 entry points, requiring user-space stubs.
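Point (b) in miniature: on the i386 entry points every argument register is 32 bits wide, so a 64-bit argument has to be split by a user-space stub and rejoined by the kernel. This sketch is hypothetical; the names and the low/high ordering are illustrative, not the actual i386 convention for any particular system call.

```c
#include <stdint.h>

/* Two 32-bit argument registers carrying one 64-bit value. */
struct reg_pair {
    uint32_t lo;
    uint32_t hi;
};

/* User-space side: split a 64-bit argument across two registers. */
static struct reg_pair split_u64(uint64_t value)
{
    struct reg_pair r = { (uint32_t)value, (uint32_t)(value >> 32) };
    return r;
}

/* Kernel side: reassemble the original 64-bit value. */
static uint64_t join_u64(struct reg_pair r)
{
    return ((uint64_t)r.hi << 32) | r.lo;
}
```

With the syscall instruction and a 64-bit register convention, the same value simply travels in one register and neither helper is needed.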

All in all, the cost of an extra system call table is quite modest.  The
cost of an entire different ABI layer (supporting a new memory layout)
would be enormous, a.k.a. "not worth it", which is why the memory layout
of kernel objects needs to be compatible with i386.

-hpa




Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin

On 02/13/2011 01:33 PM, Alan Cox wrote:
> Who actually needs this new extra API - what's the justification for
> everyone having more crud dumping their kernels, more syscall paths
> (which are one of the most security critical areas) and the like.
>
> What are the benchmark numbers to justify this versus just using the
> existing kernel interfaces ?

That's what the prototype is meant to show.

-hpa


Re: X32 psABI status

2011-02-13 Thread H. Peter Anvin

On 02/13/2011 03:39 PM, Alan Cox wrote:
>> a. the int $0x80 instruction is much slower than syscall.  An actual
>>    i386 process can use the syscall instruction which is disambiguated
>>    by the CPU based on mode, but an x32 process is in the same CPU mode
>>    as a normal 64-bit process.
>
> So set a flag, whoopee

That's what we're doing, functionally.

>> b. 64-bit arguments have to be split between two registers for the
>>    i386 entry points, requiring user-space stubs.
>
> Diddums. Given you've yet to explain why everyone desperately needs this
> extra interface why do we care ?
>
>> All in all, the cost of an extra system call table is quite modest.
>
> And the cost of not doing it is a gloriously wonderful zero. You've still
> not explained the justification or what large number of apps are going to
> use it.
>
> It's a simple question - why do we care, why do we want the overhead and
> the hassle, what do users get in return ?

The target applications are an embedded (closed or mostly closed)
environment, and the question is whether the performance gain is worth it.
It is an open question at this stage and we'll see what the numbers look
like and, if it turns out to be worthwhile, what exactly the final
implementation will look like.


-hpa


Re: x32 psABI draft version 0.2

2011-02-16 Thread H. Peter Anvin
On 02/16/2011 11:22 AM, H.J. Lu wrote:
> Hi,
> 
> I updated  x32 psABI draft to version 0.2 to change x32 library path
> from lib32 to libx32 since lib32 is used for ia32 libraries on Debian,
> Ubuntu and other derivative distributions. The new x32 psABI is
> available from:
> 
> https://sites.google.com/site/x32abi/home
> 

I'm wondering if we should define a section header flag (sh_flags)
and/or an ELF header flag (e_flags) for x32 for the people unhappy about
keying it to the ELF class...

-hpa



Re: x32 psABI draft version 0.2

2011-02-17 Thread H. Peter Anvin
On 02/17/2011 10:06 AM, Jakub Jelinek wrote:
> On Thu, Feb 17, 2011 at 04:44:53PM +0100, Jan Hubicka wrote:
>>>> According to Mozilla folks however REL+RELA scheme used by EABI leads
>>>> to significantly smaller libxul.so size
>>>>
>>>> According to http://glandium.org/blog/?p=1177 the difference is about 4-5MB
>>>> (out of approximately 20-30MB shared lib)
>>>
>>> This is orthogonal to x32 psABI.
>>
>> Understood.  I am just pointing out that x86-64 Mozilla suffers from startup
>> problems (extra 5MB of disk read needed) compared to both x86 and ARM EABI
>> because x86-64 ABI is RELA only. If x86-64 ABI was REL+RELA like EABI is, we
>> would not have this problem here.
> 
> libxul.so has < 20 relocs, so 5MB is total size of .rela section in
> 64-bit ELF, you don't magically save those 5MB by using REL.  You save
> just 1.5MB.  And for x32 we'd be talking about 2.5MB for RELA vs. 1.6MB for
> REL.  There might be better ways how to get the numbers down.
> 

The size is, of course, half of that for the x32 ABI in the first place.

-hpa



Re: x32 psABI draft version 0.2

2011-02-17 Thread H. Peter Anvin
On 02/17/2011 02:49 PM, Jan Hubicka wrote:
>> On Thu, Feb 17, 2011 at 04:44:53PM +0100, Jan Hubicka wrote:
>>>>> According to Mozilla folks however REL+RELA scheme used by EABI leads
>>>>> to significantly smaller libxul.so size
>>>>>
>>>>> According to http://glandium.org/blog/?p=1177 the difference is about 4-5MB
>>>>> (out of approximately 20-30MB shared lib)
>>>>
>>>> This is orthogonal to x32 psABI.
>>>
>>> Understood.  I am just pointing out that x86-64 Mozilla suffers from startup
>>> problems (extra 5MB of disk read needed) compared to both x86 and ARM EABI
>>> because x86-64 ABI is RELA only. If x86-64 ABI was REL+RELA like EABI is, we
>>> would not have this problem here.
>>
>> libxul.so has < 20 relocs, so 5MB is total size of .rela section in
>> 64-bit ELF, you don't magically save those 5MB by using REL.  You save
>> just 1.5MB.  And for x32 we'd be talking about 2.5MB for RELA vs. 1.6MB for
> 
> The blog claims
> Architecture  libxul.so size  relocations size  %
> x86           21,869,684      1,884,864         8.61%
> x86-64        29,629,040      5,751,984         19.41%
> 
> The REL encoding also grows twice for 64bit target?
> 

REL would be twice the size for a 64-bit target (which x32 is not, from
an ELF point of view).  Keep in mind that REL cannot do error handling
very well, especially not on a 64-bit platform.

Elf32_Rel:   8 bytes
Elf32_Rela: 12 bytes
Elf64_Rel:  16 bytes
Elf64_Rela: 24 bytes

So going from 1,884,864 to 5,751,984 bytes indicates only a (very) small
increase in relocation count; for exactly the same count, the numbers would be:

Elf32_Rel:  1,884,864 bytes
Elf32_Rela: 2,827,296 bytes
Elf64_Rel:  3,769,728 bytes
Elf64_Rela: 5,654,592 bytes
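The arithmetic above can be reproduced from the entry sizes in `<elf.h>`. A minimal sketch, assuming a Linux/glibc system that provides that header and assuming the blog's x86 figure consists entirely of `Elf32_Rel` entries; the helper names are illustrative.

```c
#include <elf.h>
#include <stddef.h>

/* Bytes occupied by `count` relocation entries of `entry_size` bytes each. */
static unsigned long reloc_table_bytes(unsigned long count, size_t entry_size)
{
    return count * (unsigned long)entry_size;
}

/* Number of entries implied by a table size and an entry size. */
static unsigned long reloc_count(unsigned long table_bytes, size_t entry_size)
{
    return table_bytes / (unsigned long)entry_size;
}
```

Under those assumptions the x86 table holds 235,608 entries and the x86-64 table 239,666, roughly 1.7% more, which is the "(very) small increase" noted above.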

-hpa


Re: X32 project status update

2011-05-21 Thread H. Peter Anvin
On 05/21/2011 09:27 AM, H.J. Lu wrote:
> On Sat, May 21, 2011 at 8:34 AM, H.J. Lu  wrote:
>> On Sat, May 21, 2011 at 8:27 AM, Arnd Bergmann  wrote:
>>> On Saturday 21 May 2011 17:01:33 H.J. Lu wrote:
>>>> This is the x32 project status update:
>>>>
>>>> https://sites.google.com/site/x32abi/
>>>>
>>>
>>> I've had another look at the kernel patch. It basically
>>> looks all good, but the system call table appears to
>>> diverge from the x86_64 list for no (documented) reason,
>>> in the calls above 302. Is that intentional?
>>>
>>> I can see why you might want to keep the numbers identical,
>>> but if they are already different, why not use the generic
>>> system call table from asm-generic/unistd.h for the new
>>> ABI?
>>
>> We can sort it out when we start merging x32 kernel changes.
>>
> 
> Peter, is that possible to use the single syscall table for
> both x86-64 and x32 system calls? Out of 300+ system
> calls, only 84 are different for x86-64 and x32.  That
> is additional 8*84 == 672 bytes in syscall table.
> 

Sort of... remember we talked about merging system calls at the tail
end?  The problem with that is that some system calls (like read()!)
actually are different system calls in very subtle situations, due to
abuse in some subsystems of the is_compat() construct.  I think that may
mean we have to have an unambiguous flag after all...

Now, perhaps we can use a high bit for that and mask it before dispatch,
then we don't need the additional table.  A bit of a hack, but it should
work.
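A minimal sketch of the high-bit idea, with hypothetical names throughout: one table serves both ABIs, the dispatcher strips a marker bit from the system call number, and the handler receives the caller's ABI explicitly instead of inferring it. (Linux later did mark x32 system calls with `__X32_SYSCALL_BIT`, 0x40000000; the code below is only an illustration, not the kernel's implementation.)

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker bit carried in the system call number. */
#define HYPOTHETICAL_X32_BIT 0x40000000u

typedef long (*syscall_fn)(bool x32_caller);

/* Stand-in handlers that just report which ABI they were called for. */
static long sys_demo_read(bool x32_caller)  { return x32_caller ? 1000 : 0; }
static long sys_demo_write(bool x32_caller) { return x32_caller ? 1001 : 1; }

/* One table shared by both ABIs. */
static const syscall_fn demo_table[] = { sys_demo_read, sys_demo_write };

static long dispatch(unsigned int nr)
{
    bool x32_caller = (nr & HYPOTHETICAL_X32_BIT) != 0;
    unsigned int index = nr & ~HYPOTHETICAL_X32_BIT; /* mask before lookup */

    if (index >= sizeof(demo_table) / sizeof(demo_table[0]))
        return -ENOSYS;
    return demo_table[index](x32_caller);
}
```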

-hpa


Re: Storing 16bit values in upper part of 32bit registers

2009-11-05 Thread H. Peter Anvin
On 10/15/2009 08:56 AM, Richard Henderson wrote:
> On 10/15/2009 07:41 AM, Markus L wrote:
>> However the IS is designed so that it is beneficial to store 16bit
>> values in the high part of the registers (rNh) and also the calling
>> conventions that we want to follow require 16bit values to be passed and
>> returned in rNh.
>>
>> What would be the "proper way" make the compiler use the upper parts
>> of these registers for the 16bit operands?
> 
> This feature is going to be difficult, but not impossible, and unless 
> your ISA has some really odd features I won't vouch for the code quality.
> 
> You say you want to canonically represent HImode in the high-part of the 
> register.  Additionally, you'll have to represent QImode in the 
> high-part (if not further in the high byte).
> 
> You'll need to follow the mips port and define TRULY_NOOP_TRUNCATION and 
> the associated truncMN2 patterns.
> 
> If you do all this, you won't have to do anything with FUNCTION_VALUE 
> etc at all.
> 

Sorry for a *way* *late* reply to this, but wouldn't it also work to
model the register file as a set of 16-bit registers (since that's what
you really have -- individually addressable 16-bit registers) and
exclude SImode values from register pairs which are not aligned.  Then
one can simply prefer the high 16-bit registers to the low 16-bit
registers in the register priority sequence.

I'm assuming there is something wrong with this, but I'm kind of curious
as to what it would be.
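The register model proposed above can be sketched as two small predicates. Everything here is hypothetical: the numbering (16-bit register 2N is the high half of rN, 2N+1 the low half), the mode enum, and the function names. In a GCC backend of this era the same information would live in `HARD_REGNO_MODE_OK` and `REG_ALLOC_ORDER`.

```c
#include <stdbool.h>

/* Hypothetical machine: 8 32-bit registers, each split into two halves. */
#define NUM_32BIT_REGS 8
#define NUM_16BIT_REGS (2 * NUM_32BIT_REGS)

enum hypothetical_mode { MODE_HI16, MODE_SI32 };

/* A 16-bit value may use either half; a 32-bit value needs an aligned pair. */
static bool regno_mode_ok(int regno, enum hypothetical_mode mode)
{
    if (regno < 0 || regno >= NUM_16BIT_REGS)
        return false;
    return mode == MODE_HI16 || regno % 2 == 0;
}

/* k-th register in allocation order: every high half before any low half. */
static int alloc_order_at(int k)
{
    return k < NUM_32BIT_REGS ? 2 * k                          /* rNh */
                              : 2 * (k - NUM_32BIT_REGS) + 1;  /* rNl */
}
```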

-hpa


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 07:37 AM, Thomas Gleixner wrote:
> 
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
> 
>   push   %edi
>   lea    0x8(%esp),%edi
>   and    $0xfffffff0,%esp
>   pushl  -0x4(%edi)
>   push   %ebp
>   mov    %esp,%ebp
>   ...
>   call   mcount
> 

The real question is why we're aligning the stack in the kernel.  It is
probably not what we want -- we don't use SSE for anything but a handful
of special cases in the kernel, and we don't want the overhead.

-hpa




Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 07:44 AM, Andrew Haley wrote:
> 
> We're aligning the stack properly, as per the ABI requirements.  Can't
> you just fix the tracer?
> 

"Per the ABI requirements?"  We're talking 32 bits, here.

-hpa




Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 08:02 AM, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote:
>> Thomas Gleixner wrote:
> 
>> We're aligning the stack properly, as per the ABI requirements.  Can't
>> you just fix the tracer?
> 
> And how do we do that? The hooks that are in place have no idea of what
> happened before they were called?
> 

Furthermore, it is nonsense -- ABI stack alignment on *32 bits* is 4
bytes, not 16.

-hpa




Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 10:33 AM, Steven Rostedt wrote:
> 
> It has to align the entire stack? Why not just the variable within the
> stack?
> 

Because if the stack pointer isn't aligned, it won't know where it can
stuff the variable.  It has to pad *somewhere*, and since you may have
more than one such variable, the most efficient way -- and by far least
complex -- is for the compiler to align the stack when it sets up the
stack frame.
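As an illustration of the arithmetic involved, assuming a 16-byte target alignment and made-up stack addresses: with only 4-byte alignment guaranteed on entry, the padding needed is unknown at compile time, so the prologue masks the stack pointer at run time (the `and` on %esp in the quoted code). After that, every over-aligned local can sit at a fixed offset. The helper names are hypothetical.

```c
#include <stdint.h>

/* Round a (simulated) 32-bit stack pointer down to a power-of-two
 * boundary.  With align = 16 this clears the low four bits, which is
 * what the quoted prologue's "and" on %esp does. */
static uint32_t align_sp_down(uint32_t sp, uint32_t align)
{
    return sp & ~(align - 1);
}

/* Once the frame base is aligned, an over-aligned local can be given a
 * fixed frame offset rounded up to its alignment. */
static uint32_t align_offset_up(uint32_t off, uint32_t align)
{
    return (off + align - 1) & ~(align - 1);
}
```

Aligning once per frame means all such variables share one run-time adjustment instead of each needing its own padding computation.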

    -hpa




Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 11:28 AM, Steven Rostedt wrote:
> 
> Hehe, scratch register on i686 ;-)
> 
> i686 has no extra regs. It just has:
> 
> %eax, %ebx, %ecx, %edx - as the general purpose regs
> %esp - stack
> %ebp - frame pointer
> %edi, %esi - counter regs
> 
> That's just 8 regs, and half of those are special.
> 

For a modern ABI it is better described as:

%eax, %edx, %ecx  - argument/return/scratch registers
%ebx, %esi, %edi  - saved registers
%esp              - stack pointer
%ebp              - frame pointer (saved)

> Perhaps we could create another profiler? Instead of calling mcount,
> call a new function: __fentry__ or something. Have it activated with
> another switch. This could make the performance of the function tracer
> even better without all these exceptions.
> 
>   :
>   call __fentry__
>   [...]
> 

Calling the profiler immediately at the entry point is clearly the more
sane option.  It means the ABI is well-defined, stable, and independent
of what the actual function contents are.  It means that ABI isn't the
normal C ABI (the __fentry__ function would have to preserve all
registers), but that's fine...
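A sketch of the register bookkeeping behind "__fentry__ would have to preserve all registers", using the i386 roles listed above. The assumption (ours, not from the thread) is that the stub hands off to an ordinary C tracer: that C function already preserves %ebx, %esi, %edi and %ebp, so the stub itself only has to save the scratch/argument registers, which may still hold live arguments at function entry. EFLAGS and non-integer state are not modeled, and the enum and macro names are hypothetical.

```c
#include <stdint.h>

/* i386 general-purpose registers as bits in a mask. */
enum {
    R_EAX = 1 << 0, R_ECX = 1 << 1, R_EDX = 1 << 2, R_EBX = 1 << 3,
    R_ESP = 1 << 4, R_EBP = 1 << 5, R_ESI = 1 << 6, R_EDI = 1 << 7,
};

#define ALL_GPRS      0xffu
/* Preserved by any function following the C ABI (per the list above). */
#define CALLEE_SAVED  (R_EBX | R_ESI | R_EDI | R_EBP)

/* Registers a __fentry__-style stub must save itself before calling an
 * ordinary C tracer: whatever that C function may clobber.  %esp is
 * excluded because the call/return discipline keeps it balanced. */
static uint32_t fentry_must_save(void)
{
    return ALL_GPRS & ~(uint32_t)(CALLEE_SAVED | R_ESP);
}
```

The result is %eax, %ecx and %edx, i.e. exactly the registers a normal C caller would not expect to survive a call.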

-hpa




Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On i386, if we call __fentry__ immediately on entry the return address will be 
in 4(%esp), so I fail to see how you could not reliably have the return 
address.  Other arches would have different constraints, of course.
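A toy model of the i386 stack at the moment __fentry__ starts running, assuming the call is the very first instruction of the traced function. The two pushes correspond to the parent's `call func` and the function's `call __fentry__`; the struct and helper names are hypothetical, and any addresses used with them are made up.

```c
#include <stdint.h>

#define STACK_SLOTS 64

/* Toy model of a downward-growing i386 stack of 32-bit slots. */
struct stack {
    uint32_t mem[STACK_SLOTS];
    unsigned sp;                /* index of the top slot, like %esp */
};

static void push(struct stack *s, uint32_t v)
{
    s->mem[--s->sp] = v;
}

/* Value at "off(%esp)"; off must be a multiple of 4. */
static uint32_t at(const struct stack *s, unsigned off)
{
    return s->mem[s->sp + off / 4];
}

/* The parent executes "call func", and func's first instruction is
 * "call __fentry__".  Each call pushes one return address. */
static void simulate_entry(struct stack *s, uint32_t ret_to_parent,
                           uint32_t ret_into_func)
{
    s->sp = STACK_SLOTS;
    push(s, ret_to_parent);     /* pushed by the parent's call */
    push(s, ret_into_func);     /* pushed by "call __fentry__" */
}
```

Because nothing else has touched the stack yet, 0(%esp) identifies the traced function and 4(%esp) is the real return address to its parent, whatever the function's prologue would do later.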

"Frederic Weisbecker"  wrote:

>On Thu, Nov 19, 2009 at 03:05:41PM -0500, Steven Rostedt wrote:
>> On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote:
>> > On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote:
>> 
>> > >  :
>> > >  call __fentry__
>> > >  [...]
>> > > 
>> > >  
>> > > -- Steve
>> > 
>> > 
>> > I would really like this. So that we can forget about other possible
>> > further surprises due to sophisticated function prologues being before
>> > the mcount call.
>> > 
>> > And I guess that would fix it in every archs.
>> 
>> Well, other archs use a register to store the return address. But it
>> would also be easy to do (pseudo arch assembly):
>> 
>>  :
>>  mov lr, (%sp)
>>  add 8, %sp
>>  blr __fentry__
>>  sub 8, %sp
>>  mov (%sp), lr
>> 
>> 
>> That way the lr would have the current function, and the parent would
>> still be at 8(%sp)
>> 
>
>
>Yeah right, we need at least such very tiny prologue for
>archs that store return addresses in a reg.
>
>   
>> > 
>> > That said, Linus had a good point about the fact there might be other uses
>> > of mcount even trickier than what the function graph tracer does,
>> > outside the kernel, and those may depend on the strict ABI assumption
>> > that 4(ebp) is always the _real_ return address, and that through all
>> > the previous stack call. This is even a concern that extrapolates the
>> > single mcount case.
>> 
>> As I am proposing a new call. This means that mcount stay as is for
>> legacy reasons. Yes I know there exists the -finstrument-functions but
>> that adds way too much bloat to the code. One single call to the
>> profiler is all I want.
>
>
>Sure, the purpose is not to change the existing -mcount thing.
>What I meant is that we could have -mcount and -real-ra-before-fp
>at the same time to guarantee fp + 4 is really what we want while
>using -mcount.
>
>The __fentry__ idea is more neat, but the guarantee of a real pointer
>to the return address is still something that lacks.
>
>
>> > 
>> > So I wonder that actually the real problem is the lack of something that
>> > could provide this guarantee. We may need a -real-ra-before-fp (yeah
>> > I suck in naming).
>> 
>> Don't worry, so do the C compiler folks, I mean, come on "mcount"?
>
>
>I guess it has been first created for the single purpose of counting
>specific functions but then it has been used for wider, unpredicted uses :)
>

--
Sent from my mobile phone. Please excuse any lack of formatting.

Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
Hence a new unconstrained option...

"Jeff Law"  wrote:

>On 11/19/09 12:50, H. Peter Anvin wrote:
>>
>> Calling the profiler immediately at the entry point is clearly the more
>> sane option.  It means the ABI is well-defined, stable, and independent
>> of what the actual function contents are.  It means that ABI isn't the
>> normal C ABI (the __fentry__ function would have to preserve all
>> registers), but that's fine...
>>
>Note there are targets (even some old x86 variants) that required the 
>profiling calls to occur after the prologue.  Unfortunately, nobody 
>documented *why* that  was the case.   Sigh.
>
>Jeff


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread H. Peter Anvin
On 11/19/2009 04:59 PM, Linus Torvalds wrote:
> 
> [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT 
>   totally unrelated numbers? Very confusing. ]
> 

Yes, there is another thread to clean up that particular mess; it is
already in -tip:

http://git.kernel.org/tip/350f8f5631922c7848ec4b530c111cb8c2ff7caa

-hpa


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-20 Thread H. Peter Anvin
On 11/20/2009 09:00 AM, Steven Rostedt wrote:
> Ingo, Thomas and Linus,
> 
> I know Thomas did a patch to force the -mtune=generic, but just in case
> gcc decides to do something crazy again, this patch will catch it.
> 
> Should we try to get this in now?
> 

Sounds like a very good idea to me.
    
    -hpa




Re: [PATCH] gcc mcount-nofp was Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-20 Thread H. Peter Anvin
On 11/20/2009 04:34 AM, Steven Rostedt wrote:
>>
>> If there's interest I can polish it up and submit formally.
> 
> I would definitely be interested, and I would also be willing to test
> it.
> 

I don't think there is any question that interception at the
architectural entry point would be the right thing to do.  As such, it
would be great to get this patch productized.

-hpa


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-20 Thread H. Peter Anvin
On 11/20/2009 11:46 AM, Steven Rostedt wrote:
> 
> Yes a gcc test suite will help new instances of gcc. But we need to
> worry about the instances of gcc that people have on their desktops now.
> This test case will catch the discrepancy between gcc and the function
> graph tracer. I'm not 100% convinced that just adding -mtune=generic will
> help in all cases. If we miss another instance, then the function graph
> tracer may crash someone's kernel.
> 

Furthermore, for future gcc instances what we really want is the early
interception support anyway.

-hpa


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-24 Thread H. Peter Anvin
On 11/24/2009 07:46 AM, Andrew Haley wrote:
>>
>> Yes, a lot.  The difference is that -maccumulate-outgoing-args allocates
>> space for arguments of the callee with most arguments in the prologue, using
>> subtraction from sp, then to pass arguments uses movl XXX, 4(%esp) etc.
>> and the stack pointer doesn't usually change within the function (except for
>> alloca/VLAs).
>> With -mno-accumulate-outgoing-args args are pushed using push instructions
>> and stack pointer is constantly changing.
> 
> Alright.  So, it is possible in theory for gcc to generate code that
> only uses -maccumulate-outgoing-args when it needs to realign SP.
> And, therefore, we could have a nice option for the kernel: one with
> (mostly) good code density and never generates the bizarre code
> sequence in the prologue.
> 

If we're changing gcc anyway, then let's add the option of intercepting
the function at the point where the machine state is well-defined by
ABI, which is before the function stack frame is set up.

-maccumulate-outgoing-args sounds like it would be painful on x86 (not
using its cheap push/pop instructions), but I guess since it's only when
tracing it's less of an issue.

-hpa




Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-24 Thread H. Peter Anvin
On 11/24/2009 09:12 AM, Andrew Haley wrote:
>>
>> If we're changing gcc anyway, then let's add the option of intercepting
>> the function at the point where the machine state is well-defined by
>> ABI, which is before the function stack frame is set up.
> 
> Hmm.  On the x86 I suppose we could just inject a naked call instruction,
> but not all arches allow us to call anything before we've saved the return
> address.  Or are you talking x86 only?
> 

For x86, we should use a naked call.

For architectures where that is not possible, we should use a minimal
sequence such that the ABI state at the invocation point is 100% derivable.

On MIPS, for example, we could use a sequence such as:

move at, ra
jal __fentry__

It would be up to __fentry__ to save the value in at and to restore it
back into ra before resuming, meaning that __fentry__ has a nonstandard
calling convention.
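A C model of the MIPS entry sequence above and the nonstandard convention it implies for __fentry__. Only pc, ra and at are modeled; jal is taken to set ra to the address after its delay slot (pc + 8), and execution of the delay-slot instruction itself is not modeled. The function names are hypothetical.

```c
#include <stdint.h>

/* Toy MIPS register state: only what the entry sequence touches. */
struct mips_regs {
    uint32_t pc, ra, at;
};

/* "move at, ra" at pc, then "jal target" at pc + 4.  On MIPS, jal sets
 * ra to the address after its delay slot, i.e. (pc + 4) + 8. */
static void entry_sequence(struct mips_regs *r, uint32_t target)
{
    r->at = r->ra;              /* move at, ra: save parent's return */
    r->pc += 4;
    r->ra = r->pc + 8;          /* jal: ra points back into the body */
    r->pc = target;
}

/* Exit path of a __fentry__ using this convention: resume the traced
 * function and hand back the caller's original ra (on real hardware,
 * "jr ra" with "move ra, at" in its delay slot). */
static void fentry_return(struct mips_regs *r)
{
    r->pc = r->ra;              /* jr ra: back into the function body */
    r->ra = r->at;              /* restore the parent's return address */
}
```

Between those two points __fentry__ is free to use the stack, but it must keep the value of at intact, which is what makes its calling convention nonstandard.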

-hpa


Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-25 Thread H. Peter Anvin
On 11/24/2009 09:30 AM, Steven Rostedt wrote:
> 
> For other archs, Linus showed some examples:
> 
> http://lkml.org/lkml/2009/11/19/349
> 

Yes; the key here is that the ABI-defined entry state is readily
mappable onto the state on entry to the __fentry__ function.

        -hpa



Re: [PATCH][GIT PULL][v2.6.32] tracing/x86: Add check to detect GCC messing with mcount prologue

2009-11-25 Thread H. Peter Anvin
On 11/25/2009 08:44 AM, Jakub Jelinek wrote:
> 
> If you compile kernels 90%+ people out there run with -p on i?86/x86_64,
> then certainly coming up with a new gcc switch and new profiling ABI is
> desirable.  -p on i?86/x86_64 e.g. forces -fno-omit-frame-pointer, which
> makes code on these register starved arches significantly worse.
> Making GCC output profiling call before prologue instead of after prologue
> is a 4 liner in generic code and a few lines in target specific code.
> The important thing is that we shouldn't have 100 different profiling ABIs,
> so it is desirable to agree on something that will be generally useful not
> just for the kernel, but perhaps for other purposes.
> 

There is really just one that makes sense, which is providing the
ABI-defined entry state, which means intercepting at the point of entry.

Anything else is/was a mistake.

-hpa




Re: Bug in x86-64 psABI or in gcc?

2009-12-07 Thread H. Peter Anvin
On 12/07/2009 10:33 AM, H.J. Lu wrote:
> Hi,
> 
> 
> x86-64 psABI says _Bool has 1 byte and aligned at 1 byte. It also says:
> 
> ---
> When a value of type _Bool is passed in a register or on the stack,
> the upper 63 bits of the eightbyte shall be zero.
> ---
> 
> However, gcc treats _Bool as char:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42324
> 
> Given that gcc never zeros the upper 63 bits in register nor
> on stack, should we update x86-64 psABI to reflect what gcc
> really does?
> 

Keep in mind it's not just gcc but at least also icc, the Solaris
compiler, and the Qlogic compiler... possibly others.

-hpa




Re: Bug in x86-64 psABI or in gcc?

2009-12-07 Thread H. Peter Anvin
On 12/07/2009 10:33 AM, H.J. Lu wrote:
> Hi,
> 
> 
> x86-64 psABI says _Bool has 1 byte and aligned at 1 byte. It also says:
> 
> ---
> When a value of type _Bool is passed in a register or on the stack,
> the upper 63 bits of the eightbyte shall be zero.
> ---
> 
> However, gcc treats _Bool as char:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42324
> 
> Given that gcc never zeros the upper 63 bits in register nor
> on stack, should we update x86-64 psABI to reflect what gcc
> really does?
> 

Another thing to check is to see if there are failure scenarios where
gcc 3.4.6 expects a fully zeroed register (it produces them, I don't
know if it expects to consume them that way, too.)

-hpa




Re: Bug in x86-64 psABI or in gcc?

2009-12-08 Thread H. Peter Anvin
On 12/08/2009 10:24 AM, H.J. Lu wrote:
> 
> We should just drop
> 
> ---
> When a value of type _Bool is passed in a register or on the stack,
> the upper 63 bits of the eightbyte shall be zero.
> ---
> 
> from psABI. Since _Bool has one byte in size with values of 0 and 1.
> Compilers have to clear upper 7 bits in one byte.
> 

What about the Solaris compiler?  It's probably the only other
significant user of the x86-64 ABI (the Qlogic and LLVM compilers I
presume will follow gcc.)

-hpa



Re: Bug in x86-64 psABI or in gcc?

2009-12-09 Thread H. Peter Anvin
On 12/09/2009 06:44 AM, H.J. Lu wrote:
> 
> Aren't bits in the _Bool byte of "bar" specified by the psABI or the C
> language standard already?
> 

The psABI, yes.  They are obviously not defined by the C language standard.

    -hpa




Re: Bug in x86-64 psABI or in gcc?

2009-12-09 Thread H. Peter Anvin
On 12/09/2009 06:56 AM, Michael Matz wrote:
>>
>> Aren't bits in the _Bool byte of "bar" specified by the psABI
> 
> Right now they are specified in the psABI, you suggested to remove that 
> specification.
> 

The intent of H.J.'s proposal is to require bits <7:1> == 0 in all cases
(and higher bits as don't cares, the same way a char is passed), as
opposed to the current text which requires <63:1> == 0 when passed as
registers or on the stack (and <7:1> == 0 when stored in a memory
object.)  Furthermore, the current psABI text is inconsistent for
arguments and return values; this is a bug in the wordsmithing of the
text rather than intentional, if I remember the original discussions
correctly.
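To make the difference between the two rules concrete, here is a sketch that checks a 64-bit argument register holding a _Bool against the current psABI text and against the proposal as summarized above. The function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Current psABI text: in a register or on the stack, bits <63:1> of
 * the eightbyte must be zero. */
static bool valid_under_current_text(uint64_t reg)
{
    return (reg >> 1) == 0;
}

/* Proposal: only bits <7:1> must be zero; bits <63:8> are don't-cares,
 * exactly as for a char. */
static bool valid_under_proposal(uint64_t reg)
{
    return ((reg & 0xff) >> 1) == 0;
}

/* A callee following the proposal reads only the low byte. */
static bool callee_read_bool(uint64_t reg)
{
    return (uint8_t)reg != 0;
}
```

A register such as 0xdeadbeef00000001 is a valid `true` under the proposal but violates the current text, which matches what gcc actually emits when it never clears the upper bits.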

-hpa




Re: "ld -r" on mixed IR/non-IR objects (

2010-12-06 Thread H. Peter Anvin
On 12/06/2010 02:30 PM, H.J. Lu wrote:
> Hi,
> 
> "ld -r" doesn't work with mixed IR/non-IR objects:
> 
> http://www.sourceware.org/bugzilla/show_bug.cgi?id=12291
> 
> Some compilers support it. Should it be supported?
> 

As we discussed in person, I think it would be user friendly to support
it, otherwise you'll break any build which uses ld -r and includes
assembly objects.

-hpa


Re: "ld -r" on mixed IR/non-IR objects (

2010-12-07 Thread H. Peter Anvin
On 12/07/2010 04:20 PM, Andi Kleen wrote:
> 
> The only problem left is mixing of lto and non lto objects. this right
> now is not handled. IMHO still the best way to handle it is to use
> slim lto and then simply separate link the "left overs" after deleting
> the LTO objects. This can be actually done with objcopy (with some
> limitations), doesn't even need linker support.
> 

Quite possibly a better way to deal with that is to provide a mechanism
for encapsulating arbitrary binary code objects inside the LTO IR.

-hpa


Re: "ld -r" on mixed IR/non-IR objects (

2010-12-07 Thread H. Peter Anvin
On 12/07/2010 03:58 PM, Dave Korn wrote:
> On 07/12/2010 23:15, Cary Coutant wrote:
> 
>>>   ○ Object-only section:
>>>   § Section name won't be generated by any tools, something like
>>> ".objectonly\004".
>>>   § Contains non-IR object file.
>>>   § Input is discarded after link.
>>
>> Please -- use a special section type, not a magic name.
> 
>   We're still gonna have to use a magic name on non-ELF platforms.
> 

Yes, but it probably should still be a special section type on ELF.

-hpa


Re: "ld -r" on mixed IR/non-IR objects (

2010-12-08 Thread H. Peter Anvin
On 12/08/2010 01:19 AM, Andi Kleen wrote:
>>
>> Quite possibly a better way to deal with that is to provide a mechanism
>> for encapsulating arbitrary binary code objects inside the LTO IR.
> 
> Then you would need to teach your assembler and everything
> else that may generate ELF objects to generate this magic object. But why
> not just ELF directly? that is what it is after all.
> 

No.  You just need to teach the linker to generate it when you're doing
a ld -r on mixed objects.

-hpa




Re: "ld -r" on mixed IR/non-IR objects (

2010-12-08 Thread H. Peter Anvin
On 12/08/2010 01:19 AM, Andi Kleen wrote:
> 
> To be honest I don't really see the point of all this complexity you
> guys are proposing just to save fat LTO. Fat LTO is always a bad idea
> because it's slow and  does lots of redundant work. If LTO is to become
> a more wide spread mode it has to go simply because of the poor
> performance.
> 

As someone who encountered slim LTO on Unix 17 years ago (on MIPS) I can
promise you that unless fat LTO is supported, there will never be a
successful transition.  The amount of work to deal with the make
environment every time simply made it not worth it.

    -hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 10:59 AM, H.J. Lu wrote:
> 
>> (If you could arrange for the syscall ABI always to be the same as the
>> existing 64-bit ABI, rather than needing to handle three different syscall
>> ABIs in the kernel, that might be one solution, but it could have its own
>> complexities in ensuring that none of the types whose layout forms part of
>> the kernel/userspace interface have layout differing between n32 and the
>> existing ABI; without any action, structures would tend to get layout
>> similar to that of the existing 32-bit ABI, though quite possibly not the
>> same depending on alignment peculiarities - I'm guessing that the new ABI
>> will use natural alignment - while long long arguments would tend to be
>> passed in a single register, resulting in the complicated hybrid syscall
>> ABI present on MIPS.  If you do have an all-new syscall ABI rather than
>> sharing the existing 64-bit one, I imagine it would need to follow the
>> cut-down set of syscalls for new ports, so involving the issue of how to
>> build glibc for that set of syscalls discussed three months ago in the
>> Tilera context.)
>>
> 
> You are right.  Adding ILP32 support to the Linux kernel may be tricky.
> We did some experiment to use IA32 syscall interface for ILP32:
> 

The current plan is to simply use the 32-bit kernel ABI more or less
unmodified, although probably with a different entry point using syscall
rather than int 0x80 for performance.  In order for the ABI to map 1:1,
there needs to be a few concessions:

a) 64-bit arguments will need to be split in user space.
b) The Linux kernel  exported __u64 type will need to be declared
   __attribute__((aligned(4))).  This will only affect a handful of
   structures in practice since implicit padding is frowned upon.

(a) could also be fixed by a different syscall dispatch table, it's not
the hard part of this.  We definitely want to avoid adding a different
memory ABI; that's the part that hurts.
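A minimal sketch of the __u64 change in (b), assuming GCC or a compatible compiler. `u64_align4` is a hypothetical stand-in for the kernel's __u64; when the `aligned` attribute is applied to a typedef, GCC allows it to lower the natural 8-byte alignment, which yields the i386 struct layout even when compiling for x86-64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __u64 with 4-byte alignment.
 * On a typedef, GCC's aligned attribute may lower the alignment. */
typedef uint64_t u64_align4 __attribute__((aligned(4)));

/* Natural x86-64 layout: value at offset 8, sizeof == 16. */
struct natural_layout {
    uint32_t flags;
    uint64_t value;
};

/* i386-compatible layout: value at offset 4, sizeof == 12. */
struct i386_layout {
    uint32_t   flags;
    u64_align4 value;
};
```

Only structures that mix a 64-bit field with implicit padding change, which is why the effect is expected to be limited to a handful of them.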

-hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:53 AM, Richard Guenther wrote:
> 
> Would be nice if LFS would be mandatory on the new ABI, thus
> off_t being 64bits.
> 

Yes, although that's a higher-order thing.

    -hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:34 AM, David Daney wrote:
> 
> My suggestion:  Since people already spend a great deal of effort
> maintaining the existing i386 compatible Linux syscall infrastructure,
> make your new 32-bit x86-64 Linux syscall ABI identical to the existing
> i386 syscall ABI.  This means that the psABI must use the same size and
> alignment rules for in-memory structures as the i386 does.
> 

No, it doesn't.  It just means it needs to do so *for the types used by
the kernel*.  The kernel uses types like __u64, which would indeed have
to be declared aligned(4).

    -hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
I believe it covers all cases *relevant for this particular situation* (unlike, 
say, MIPS) and that any deviation is a bug which can and should be fixed.

"David Daney"  wrote:

>On 12/30/2010 12:12 PM, H. Peter Anvin wrote:
>> On 12/30/2010 11:34 AM, David Daney wrote:
>>>
>>> My suggestion:  Since people already spend a great deal of effort
>>> maintaining the existing i386 compatible Linux syscall infrastructure,
>>> make your new 32-bit x86-64 Linux syscall ABI identical to the existing
>>> i386 syscall ABI.  This means that the psABI must use the same size and
>>> alignment rules for in-memory structures as the i386 does.
>>>
>>
>> No, it doesn't.  It just means it needs to do so *for the types used by
>> the kernel*.  The kernel uses types like __u64, which would indeed have
>> to be declared aligned(4).
>>
>
>Some legacy interfaces don't use fixed width types.  There almost 
>certainly are some ioctls that don't use your fancy __u64.
>
>Then there are things like ppoll() that take a pointer to:
>
>struct timespec {
>        long    tv_sec;         /* seconds */
>        long    tv_nsec;        /* nanoseconds */
>};
>
>There are no fields in there that are controlled by __u64 either. 
>Admittedly this case might not differ between the two 32-bit ABIs, but 
>it shows that __u64/__u32 are not universally used in the Linux syscall
>ABIs.
>
>If you are happy with potential memory layout differences between the 
>two 32-bit ABIs, then don't specify that they are the same.  But don't 
>claim that use of __u64/__u32 covers all cases.
>
>David Daney



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
We do have a slightly more extensive patch already implemented.

"Robert Millan"  wrote:

>Hi folks,
>
>I had this unsubmitted patch in my local filesystem.  It makes Linux
>detect ELF32 AMD64 binaries and sets a flag to restrict them to
>32-bit address space.
>
>It's not rocket science but can save you some work in case you
>haven't implemented this already.
>
>Best regards
>
>-- 
>Robert Millan



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 12:39 PM, David Daney wrote:
> 
> Really I don't care one way or the other.  The necessity of syscall
> wrappers is actually probably beneficial to me.  It will create a
> greater future employment demand for people with the necessary skills to
> write them.
> 

Or perhaps automatic generation will actually get implemented.  I wrote
an automatic syscall wrapper generator for klibc; one of the best design
decisions I made for that project.
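A toy illustration of generated syscall wrappers, assuming Linux and glibc's syscall(2). This is not klibc's actual mechanism; it only shows the idea that each wrapper comes from one line of a table instead of being written by hand. The `my_` prefix and the two-entry list are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* One table line per syscall (an "X-macro" list). */
#define SYSCALL_LIST(X) \
    X(getpid)           \
    X(getppid)

/* Expands each table entry into a zero-argument wrapper function. */
#define DEFINE_WRAPPER0(name) \
    static long my_##name(void) { return syscall(SYS_##name); }

SYSCALL_LIST(DEFINE_WRAPPER0)
```

Adding a syscall then means adding one table line; argument marshalling for a new ABI (such as splitting 64-bit arguments) would live in the generator, not in hundreds of hand-written wrappers.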

    -hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 11:57 AM, Jakub Jelinek wrote:
>>
>> Would be nice if LFS would be mandatory on the new ABI, thus
>> off_t being 64bits.
> 
> And avoid ambiguous cases that x86-64 ABI has, e.g. whether
> caller or callee is responsible for sign/zero extension of arguments, to
> avoid the need to sign/zero extend twice, etc.
> 

Ehwhat?  x86-64 is completely unambiguous on that point; the i386 one is
not.

-hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 02:18 PM, Robert Millan wrote:
> 2010/12/30 H.J. Lu :
>> I also have a patch for gcc 4.4 which works on simple codes.
>>
>> H.J.
>> On Thu, Dec 30, 2010 at 1:31 PM, H. Peter Anvin  wrote:
>>> We do have a slightly more extensive patch already implemented.
> 
> Could you make those patches available somewhere?  It'd be
> interesting to play with them.
> 
> Btw, I recommend against 8-byte longs.  In the tests I did in
> 2009, I recall glibc source was extremely unhappy due to
> sizeof(long)==sizeof(void *) assumptions.
> 

Yes, it's ILP32.

-hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 02:21 PM, Robert Millan wrote:
> 2010/12/30 Richard Guenther :
>> Would be nice if LFS would be mandatory on the new ABI, thus
>> off_t being 64bits.
> 
> Please do also consider time_t.
> 

Changing the kernel-facing time_t might completely wreck the reuse of
the i386 kernel ABI; I'm not sure.

    -hpa




Re: RFC: Add 32bit x86-64 support to binutils

2010-12-30 Thread H. Peter Anvin
On 12/30/2010 01:08 PM, Robert Millan wrote:
> Hi folks,
> 
> I had this unsubmitted patch in my local filesystem.  It makes Linux
> detect ELF32 AMD64 binaries and sets a flag to restrict them to
> 32-bit address space.
> 
> It's not rocket science but can save you some work in case you
> haven't implemented this already.
> 

I have pushed my old kernel patches to a git tree at:

git://git.kernel.org//pub/scm/linux/kernel/git/hpa/linux-2.6-ilp32.git

They are currently based on 2.6.31 since that was the released version
when I first did this work; they are not intended to be mergeable but
rather as a prototype.

Note that we have no intention of supporting this ABI for the kernel
itself.  The kernel will be a normal x86-64 kernel.

-hpa



Re: RFC: Add 32bit x86-64 support to binutils

2010-12-31 Thread H. Peter Anvin
On 12/31/2010 02:03 AM, Jakub Jelinek wrote:
> On Thu, Dec 30, 2010 at 01:42:05PM -0800, H. Peter Anvin wrote:
>> On 12/30/2010 11:57 AM, Jakub Jelinek wrote:
>>>>
>>>> Would be nice if LFS would be mandatory on the new ABI, thus
>>>> off_t being 64bits.
>>>
>>> And avoid ambiguous cases that x86-64 ABI has, e.g. whether
>>> caller or callee is responsible for sign/zero extension of arguments, to
>>> avoid the need to sign/zero extend twice, etc.
>>>
>>
>> Ehwhat?  x86-64 is completely unambiguous on that point; the i386 one is
>> not.
> 
> It is not, sadly, see http://gcc.gnu.org/PR46942
> From what I can see the psABI doesn't talk about it, GCC usually sign/zero
> extends on both sides (exception is 32-bit arguments into 64-bit isn't
> apparently sign/zero extended on the caller side when doing tail calls),
> from what I gathered LLVM expects the caller to sign/zero extend (which is
> incompatible with GCC tail calls then), not sure about ICC, and kernel
> probably expects for security reasons that the callee sign/zero extends.
> 

This is weird... we had long discussions about this when the psABI was
originally written, and the decision was that any bits outside the
fundamental type were undefined -- callee extends (caller in the case of
a return value.)  Yet somehow that (and several other discussions) seem
to either never have made it into the document or otherwise have
disappeared somewhere in the process.
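A sketch of what "callee extends" means for a 32-bit argument passed in a 64-bit register, assuming the upper half may contain garbage (as in the GCC tail-call case from PR46942). Converting an out-of-range uint32_t to int32_t is implementation-defined in C; GCC defines it as wrapping modulo 2^32, which the signed helper relies on. The names are illustrative.

```c
#include <stdint.h>

/* A 32-bit argument arrives in a 64-bit register whose upper half is
 * unspecified, so the callee extends it itself. */
static int64_t callee_sext32(uint64_t reg)
{
    /* uint32_t -> int32_t wraps modulo 2^32 under GCC (impl.-defined). */
    return (int64_t)(int32_t)(uint32_t)reg;
}

static uint64_t callee_zext32(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}
```

A callee that instead trusted the caller to have extended the value would use `reg` directly, and would get a wrong 64-bit value whenever a caller (or a tail call) left garbage in the upper half.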

There seem to have been problems with closing the loop on a number of
things, and in some cases the compiler writers have gone off and
implemented something completely different from the written document,
yet failed to get the documentation updated to match reality (it took
many years until the definition of _Bool matched what the compilers
actually implemented.)

-hpa




Re: RFC: Add 32bit x86-64 support to binutils

2011-01-04 Thread H. Peter Anvin
On 01/04/2011 09:56 AM, H.J. Lu wrote:
>>
>> I think it is a gross misconception to tie the ABI to the ELF class of
>> an object. Specifying the ABI should imo be done via e_flags or
>> one of the unused bytes of e_ident, and in all reality the ELF class
>> should *only* affect the file layout (and 64-bit should never have
>> forbidden to use 32-bit ELF containers; similarly 64-bit ELF objects
>> may have uses for 32-bit architectures/ABIs, e.g. when debug
>> information exceeds the 4G boundary).
> 
> I agree with you in principle. But I think it should be done via
> a new attribute section, similar to ARM.
> 

Oh god, please, no.

I have to say I'm highly skeptical of Jan's statement in the first
place.  Crossing 32- and 64-bit ELF like that sounds like a kernel
security hole waiting to happen.

-hpa



Re: RFC: Add 32bit x86-64 support to binutils

2011-01-05 Thread H. Peter Anvin
On 01/04/2011 11:46 PM, Jan Beulich wrote:
>>>
>>> Oh god, please, no.
>>>
>>> I have to say I'm highly questioning to Jan's statement in the first
>>> place.  Crossing 32- and 64-bit ELF like that sounds like a kernel
>>> security hole waiting to happen.
> 
> A particular OS/kernel has the freedom to not implement support for
> other than the default format. But having the ABI disallow it
> altogether certainly isn't the right choice. And yes, we had been
> allowing cross-bitness ELF in an experimental (long canceled) OS
> of ours.
> 
>> Yeah, and there are other targets where the elf class determines ABI
>> too (e.g. EM_S390 is used for both 31-bit and 64-bit binaries and
>> the ELF class determines which).
> 
> So the usual thing is going to happen - someone made a mistake (I'm
> convinced the ELF class was never meant to affect anything but the
> file format), and this gets taken as an excuse to let the mistake
> spread.
> 

I don't think it's all that unreasonable to say the ELF class affects
the ABI.  After all, there are lots of things about the ABI that are
related to the ELF class -- the format of the GOT and PLT, for one thing.

-hpa
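For concreteness, this is how the class byte plays out on x86: i386 objects are ELFCLASS32 with EM_386, x86-64 objects are ELFCLASS64 with EM_X86_64, and x32 objects pair ELFCLASS32 with EM_X86_64, so the class is what separates x32 from x86-64. A minimal sketch, assuming glibc's <elf.h> and a host whose byte order matches the object (a real tool would honour EI_DATA); the function name x86_elf_abi is hypothetical:

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: classify an x86 ELF object by (EI_CLASS, e_machine).
   e_machine sits at the same offset (18) in Elf32_Ehdr and Elf64_Ehdr, so
   only the first few header bytes are needed. */
const char *x86_elf_abi(const unsigned char *hdr, size_t len)
{
    Elf32_Half machine;

    if (len < sizeof(Elf32_Ehdr) || memcmp(hdr, ELFMAG, SELFMAG) != 0)
        return "not ELF";
    /* Host byte order assumed; a real tool would check hdr[EI_DATA]. */
    memcpy(&machine, hdr + offsetof(Elf32_Ehdr, e_machine), sizeof machine);

    switch (hdr[EI_CLASS]) {
    case ELFCLASS32:
        if (machine == EM_386)
            return "i386 (ILP32)";
        if (machine == EM_X86_64)
            return "x32 (ILP32 on x86-64)";
        break;
    case ELFCLASS64:
        if (machine == EM_X86_64)
            return "x86-64 (LP64)";
        break;
    }
    return "other";
}
```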


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:55 AM, Steven Rostedt wrote:
> 
> Almost a full year ago, Mathieu suggested something like:
> 
> if (unlikely(x)) __attribute__((section(".unlikely"))) {
> ...
> } else __attribute__((section(".likely"))) {
> ...
> }
> 
> https://lkml.org/lkml/2012/8/9/658
> 
> Which got me thinking. How hard would it be to set a block in its own
> section. Like what Mathieu suggested, but it doesn't have to be
> ".unlikely".
> 
> if (x) __attribute__((section(".foo"))) {
>   /* do something */
> }
> 

One concern I have is how this kind of code would work when embedded
inside a function which already has a section attribute.  This could
easily cause really weird bugs when someone "optimizes" an inline or
macro and breaks a single call site...

-hpa
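Absent such an extension, the usual approximation is to outline the rare block into its own function and give that function a section attribute. It moves whole functions only, so it does not address the concern above about blocks inside functions that already carry a section attribute. A minimal sketch, assuming GCC or Clang on an ELF target; the section name .text.demo_unlikely and both function names are hypothetical:

```c
/* The "block" is outlined into its own function so that it can be given
   its own section; noinline keeps it from being folded back into the
   caller, and cold tells the compiler the call is rarely taken. */
static __attribute__((noinline, cold, section(".text.demo_unlikely")))
int slow_path(int x)
{
    /* do something rare */
    return x * 2 + 1;
}

int fast_function(int x)
{
    if (__builtin_expect(x != 0, 0))
        return slow_path(x);   /* body lives in .text.demo_unlikely */
    return 0;
}
```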




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 10:55 AM, Steven Rostedt wrote:
> 
> Well, as tracepoints are being added quite a bit in Linux, my concern is
> with the inlined functions that they bring. With jump labels they are
> disabled in a very unlikely way (the static_key_false() is a nop to skip
> the code, and is dynamically enabled to a jump).
> 

Have you considered using traps for tracepoints?  A trapping instruction
can be as small as a single byte.  The downside, of course, is that it
is extremely suppressed -- the trap is always expensive -- and you then
have to do a lookup to find the target based on the originating IP.

-hpa
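The IP lookup can be as simple as a table sorted by address and searched with bsearch(), much like an exception table. A sketch under stated assumptions: struct trap_site and find_trap_site() are hypothetical, and on x86 an int3 trap reports the address after the one-byte instruction, so a caller would search for the reported IP minus 1:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table entry: address of a 1-byte trap instruction and the
   handler implementing the tracepoint placed there. */
struct trap_site {
    uintptr_t ip;
    void (*handler)(void *regs);
};

static int cmp_site(const void *key, const void *elem)
{
    uintptr_t ip = *(const uintptr_t *)key;
    const struct trap_site *s = elem;

    return (ip > s->ip) - (ip < s->ip);
}

/* Find the handler for a trapping IP.  The table must be sorted by ip;
   the lookup is O(log n) on top of the cost of the trap itself. */
const struct trap_site *find_trap_site(const struct trap_site *tbl, size_t n,
                                       uintptr_t ip)
{
    return bsearch(&ip, tbl, n, sizeof *tbl, cmp_site);
}
```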




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:23 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 11:17 -0700, H. Peter Anvin wrote:
>> On 08/05/2013 10:55 AM, Steven Rostedt wrote:
>>>
>>> Well, as tracepoints are being added quite a bit in Linux, my concern is
>>> with the inlined functions that they bring. With jump labels they are
>>> disabled in a very unlikely way (the static_key_false() is a nop to skip
>>> the code, and is dynamically enabled to a jump).
>>>
>>
>> Have you considered using traps for tracepoints?  A trapping instruction
>> can be as small as a single byte.  The downside, of course, is that it
>> is extremely suppressed -- the trap is always expensive -- and you then
>> have to do a lookup to find the target based on the originating IP.
> 
> No, never considered it, nor would I. Those that use tracepoints, do use
> them extensively, and adding traps like this would probably cause
> heisenbugs and make tracepoints useless.
> 
> Not to mention, how would we add a tracepoint to a trap handler?
> 

Traps nest, that's why there is a stack.  (OK, so you don't want to take
the same trap inside the trap handler, but that code should be very
limited.)  The trap instruction just becomes very short, but rather
slow, call-return.

However, when you consider the cost you have to consider that the
tracepoint is doing other work, so it may very well amortize out.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:20 AM, Linus Torvalds wrote:
> 
> Of course, it would be good to optimize static_key_false() itself -
> right now those static key jumps are always five bytes, and while they
> get nopped out, it would still be nice if there was some way to have
> just a two-byte nop (turning into a short branch) *if* we can reach
> another jump that way. For small functions that would be lovely. Oh
> well.
> 

That would definitely require gcc support.  It would be useful, but
probably requires a lot of machinery.

-hpa
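The size choice is the same relaxation an assembler performs for a plain "jmp target": use EB rel8 when the displacement, measured from the end of the instruction, fits in a signed byte, otherwise E9 rel32. A minimal sketch of that rule; encode_jmp() is a hypothetical helper, and it assumes the unsigned-to-signed conversion of the displacement wraps as two's complement, as it does with GCC and Clang:

```c
#include <stddef.h>
#include <stdint.h>

/* Emit "jmp target" at address ip, choosing the short or near form.
   Returns the number of bytes written (2 or 5), or 0 if the target is
   beyond rel32 range. */
size_t encode_jmp(uint8_t buf[5], uint64_t ip, uint64_t target)
{
    int64_t disp8 = (int64_t)(target - (ip + 2));   /* relative to end of EB xx */
    int64_t disp32 = (int64_t)(target - (ip + 5));  /* relative to end of E9 xxxxxxxx */

    if (disp8 >= INT8_MIN && disp8 <= INT8_MAX) {
        buf[0] = 0xEB;                              /* jmp rel8 */
        buf[1] = (uint8_t)disp8;
        return 2;
    }
    if (disp32 >= INT32_MIN && disp32 <= INT32_MAX) {
        uint32_t d = (uint32_t)disp32;
        buf[0] = 0xE9;                              /* jmp rel32 */
        buf[1] = (uint8_t)d;                        /* little-endian immediate */
        buf[2] = (uint8_t)(d >> 8);
        buf[3] = (uint8_t)(d >> 16);
        buf[4] = (uint8_t)(d >> 24);
        return 5;
    }
    return 0;
}
```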




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:34 AM, Linus Torvalds wrote:
> On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
>  wrote:
>>
>> Ugh. I can see the attraction of your section thing for that case, I
>> just get the feeling that we should be able to do better somehow.
> 
> Hmm.. Quite frankly, Steven, for your use case I think you actually
> want the C goto *labels* associated with a section. Which sounds like
> it might be a cleaner syntax than making it about the basic block
> anyway.
> 

A label wouldn't have an endpoint, though...

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 11:49 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 11:29 -0700, H. Peter Anvin wrote:
> 
>> Traps nest, that's why there is a stack.  (OK, so you don't want to take
>> the same trap inside the trap handler, but that code should be very
>> limited.)  The trap instruction just becomes very short, but rather
>> slow, call-return.
>>
>> However, when you consider the cost you have to consider that the
>> tracepoint is doing other work, so it may very well amortize out.
> 
> Also, how would you pass the parameters? Every tracepoint has its own
> parameters to pass to it. How would a trap know where to get "prev"
> and "next"?
> 

How do you do that now?

You have to do an IP lookup to find out what you are doing.

(Note: I wonder how much the parameter generation costs the tracepoints.)

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 02:28 PM, Mathieu Desnoyers wrote:
> * Linus Torvalds (torva...@linux-foundation.org) wrote:
>> On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
>>  wrote:
>>>
>>> I remember that choosing between 2 and 5 bytes nop in the asm goto was
>>> tricky: it had something to do with the fact that gcc doesn't know the
>>> exact size of each instruction until further down within compilation
>>
>> Oh, you can't do it in the compiler, no. But you don't need to. The
>> assembler will pick the right version if you just do "jmp target".
> 
> Yep.
> 
> Another thing that bothers me with Steven's approach is that decoding
> jumps generated by the compiler seems fragile IMHO.
> 
> x86 decoding proposed by https://lkml.org/lkml/2012/3/8/464 :
> 
> +static int make_nop_x86(void *map, size_t const offset)
> +{
> + unsigned char *op;
> + unsigned char *nop;
> + int size;
> +
> + /* Determine which type of jmp this is 2 byte or 5. */
> + op = map + offset;
> + switch (*op) {
> + case 0xeb: /* 2 byte */
> + size = 2;
> + nop = ideal_nop2_x86;
> + break;
> + case 0xe9: /* 5 byte */
> + size = 5;
> + nop = ideal_nop;
> + break;
> + default:
> + die(NULL, "Bad jump label section (bad op %x)\n", *op);
> + __builtin_unreachable();
> + }
> 
> My thought is that the code above does not cover all jump encodings that
> can be generated by past, current and future x86 assemblers.
> 

For unconditional jmp that should be pretty safe barring any fundamental
changes to the instruction set, in which case we can enable it as
needed, but for extra robustness it probably should skip prefix bytes.

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread H. Peter Anvin
On 08/05/2013 09:14 PM, Mathieu Desnoyers wrote:
>>
>> For unconditional jmp that should be pretty safe barring any fundamental
>> changes to the instruction set, in which case we can enable it as
>> needed, but for extra robustness it probably should skip prefix bytes.
> 
> On x86-32, some prefixes are actually meaningful. AFAIK, the 0x66 prefix
> is used for:
> 
> E9 cw   jmp rel16   relative jump, only in 32-bit
> 
> Other prefixes can probably be safely skipped.
> 

Yes.  Some of them are used as hints or for MPX.

> Another question is whether anything prevents the assembler from
> generating a jump near (absolute indirect), or far jump. The code above
> seems to assume that we have either a short or near relative jump.

Absolutely something prevents!  It would be a very serious error for the
assembler to generate such instructions.

-hpa
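A more tolerant variant of the make_nop_x86() decoder quoted earlier, following the suggestion to skip prefix bytes while refusing 0x66, which turns E9 into a rel16 jump on x86-32. This is a sketch only: jmp_length() is hypothetical, and the set of prefixes it skips (F2 as used by MPX's BND prefix, and 2E/3E) is an assumption, not an exhaustive list:

```c
#include <stddef.h>
#include <stdint.h>

/* Return the total length of an unconditional jmp at p, including any
   skippable prefixes, or -1 if the bytes are not a jmp this code accepts. */
int jmp_length(const uint8_t *p, size_t avail)
{
    size_t i = 0;
    size_t len;

    /* Skip prefixes assumed not to change the meaning of jmp. */
    while (i < avail && (p[i] == 0xF2 || p[i] == 0x2E || p[i] == 0x3E))
        i++;
    if (i >= avail)
        return -1;

    switch (p[i]) {
    case 0xEB:              /* jmp rel8 */
        len = i + 2;
        break;
    case 0xE9:              /* jmp rel32 (rel16 only with a 0x66 prefix) */
        len = i + 5;
        break;
    default:                /* includes 0x66, which changes the encoding */
        return -1;
    }
    return len <= avail ? (int)len : -1;
}
```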






Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:15 AM, Steven Rostedt wrote:
> On Mon, 2013-08-05 at 14:43 -0700, H. Peter Anvin wrote:
> 
>> For unconditional jmp that should be pretty safe barring any fundamental
>> changes to the instruction set, in which case we can enable it as
>> needed, but for extra robustness it probably should skip prefix bytes.
> 
> Would the assembler add prefix bytes to:
> 
>   jmp 1f
> 

No, but if we ever end up doing MPX in the kernel, for example, we would
have to put an MPX prefix on the jmp.

-hpa




Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread H. Peter Anvin
On 08/06/2013 09:26 AM, Steven Rostedt wrote:
>>
>> No, but if we ever end up doing MPX in the kernel, for example, we would
>> have to put an MPX prefix on the jmp.
> 
> Well then we just have to update the rest of the jump label code :-)
> 

For MPX in the kernel, this would be a small part of the work...!

-hpa



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 02:17 AM, Peter Zijlstra wrote:
> 
> I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
> if-forest functions like perf_prepare_sample() and perf_output_sample().
> 
> They are of the form:
> 
> void func(obj, args..)
> {
>   unsigned long f = ...;
> 
>   if (f & F1)
>   do_f1();
> 
>   if (f & F2)
>   do_f2();
> 
>   ...
> 
>   if (f & FN)
>   do_fn();
> }
> 

Am I reading this right that f can be a combination of any of these?

> Where f is constant for the entire lifetime of the particular object.
> 
> So I was thinking of having these functions use static_key/asm-goto;
> then write the proper static key values unsafe so as to avoid all
> trickery (as these functions would never actually be used) and copy the
> end result into object private memory. The object will then use indirect
> calls into these functions.

I'm really not following what you are proposing here, especially not
"copy the end result into object private memory."

With asm goto you end up with at minimum a jump or NOP for each of these
function entries, whereas an actual JIT can elide that as well.

On the majority of architectures, including x86, you cannot simply copy
a piece of code elsewhere and have it still work.  You end up doing a
bunch of the work that a JIT would do anyway, and would end up with
considerably higher complexity and worse results than a true JIT.  You
also say "the object will then use indirect calls into these
functions"... you mean the JIT or pseudo-JIT generated functions, or the
calls inside them?

> I suppose the question is, do people strenuously object to creativity
> like that and or is there something GCC can do to make this
> easier/better still?

I think it would be much easier to just write a minimal JIT for this,
even though it is per architecture.  However, I would really like to
understand what the value is.

-hpa
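For comparison, the plain-C baseline that a JIT or patched-branch scheme would have to beat: since f is fixed for the object's lifetime, the if-forest can be resolved once into a list of enabled steps, trading N flag tests per call for one indirect call per enabled step. Everything here (struct obj, obj_init(), obj_run(), the do_fN bodies) is hypothetical and not taken from perf:

```c
#include <stddef.h>

#define NSTEPS 4

typedef void (*step_fn)(int *acc);

/* Stand-ins for do_f1() .. do_fn(). */
static void do_f1(int *acc) { *acc += 1; }
static void do_f2(int *acc) { *acc += 10; }
static void do_f3(int *acc) { *acc += 100; }
static void do_f4(int *acc) { *acc += 1000; }

static const step_fn all_steps[NSTEPS] = { do_f1, do_f2, do_f3, do_f4 };

struct obj {
    size_t nsteps;
    step_fn steps[NSTEPS];    /* only the steps whose flag bit is set */
};

/* Resolve the flag tests once, when the object is created. */
void obj_init(struct obj *o, unsigned long f)
{
    o->nsteps = 0;
    for (size_t i = 0; i < NSTEPS; i++)
        if (f & (1UL << i))
            o->steps[o->nsteps++] = all_steps[i];
}

/* Per-call work: no flag tests, just the enabled steps in order. */
int obj_run(const struct obj *o)
{
    int acc = 0;

    for (size_t i = 0; i < o->nsteps; i++)
        o->steps[i](&acc);
    return acc;
}
```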



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-12 Thread H. Peter Anvin
On 08/12/2013 09:09 AM, Peter Zijlstra wrote:
>>
>> On the majority of architectures, including x86, you cannot simply copy
>> a piece of code elsewhere and have it still work.
> 
> I thought we used -fPIC which would allow just that.
> 

Doubly wrong.  The kernel is not compiled with -fPIC, nor does -fPIC
allow this kind of movement for code that contains intramodule
references (that is *all* references in the kernel).  Since we really
doesn't want to burden the kernel with a GOT and a PLT, that is life.

>> You end up doing a
>> bunch of the work that a JIT would do anyway, and would end up with
>> considerably higher complexity and worse results than a true JIT.  
> 
> Well, less complexity but worse result, yes. We'd only poke the specific
> static_branch sites with either NOPs or the (relative) jump target for
> each of these branches. Then copy the result.

Once again, you can't "copy the result".  You end up with a full
disassembler.

-hpa
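The reason a copied function does not simply work is that x86 call/jmp rel32 operands are relative to the next instruction, so each one must be re-biased by (old_ip - new_ip) after the copy, and finding them all is what requires a disassembler. A sketch of the fixup for a single 5-byte E8/E9 instruction; fixup_rel32() is hypothetical, and the sign extension assumes the two's complement conversion GCC and Clang provide:

```c
#include <stdint.h>

/* Re-bias the rel32 operand of an E8 (call) or E9 (jmp) instruction whose
   bytes were copied from old_ip to new_ip, so it reaches the same target.
   Returns 0 on success, -1 if the new displacement does not fit in 32 bits. */
int fixup_rel32(uint8_t insn[5], uint64_t old_ip, uint64_t new_ip)
{
    uint32_t raw = (uint32_t)insn[1] | (uint32_t)insn[2] << 8 |
                   (uint32_t)insn[3] << 16 | (uint32_t)insn[4] << 24;
    int64_t disp = (int32_t)raw;                    /* sign-extend the immediate */
    uint64_t target = old_ip + 5 + (uint64_t)disp;  /* absolute destination */
    int64_t new_disp = (int64_t)(target - (new_ip + 5));

    if (new_disp < INT32_MIN || new_disp > INT32_MAX)
        return -1;
    raw = (uint32_t)new_disp;
    insn[1] = (uint8_t)raw;
    insn[2] = (uint8_t)(raw >> 8);
    insn[3] = (uint8_t)(raw >> 16);
    insn[4] = (uint8_t)(raw >> 24);
    return 0;
}
```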



Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-13 Thread H. Peter Anvin
> On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
>> Since we really doesn't want to...

Ow.  Can't believe I wrote that.

-hpa



Re: Add __ILP32 and __ILP32__ for X32 programming model

2012-04-13 Thread H. Peter Anvin
On 04/13/2012 09:18 AM, H.J. Lu wrote:
> Hi,
> 
> We need a reliable way to tell if we are compiling for x32 through
> pre-defined preprocessor symbol.  __LP64/__LP64__ aren't
> specified by x86-64 psABI, although they have been added to
> GCC 3.3.  They can't be counted on to detect x32 since not all x86-64
> compilers define them.   I updated x32 psABI:
> 
> https://sites.google.com/site/x32abi/documents
> 
> to define __ILP32 and __ILP32__ for X32 programming model.  I
> will submit a patch for GCC trunk and 4.7 branch.
> 

Can we add __LP64__ to the psABI too?

-hpa
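With __ILP32__ defined for x32 and __LP64__ for LP64 targets, the three x86 Linux data models can be told apart at compile time. A minimal sketch, assuming a GCC-compatible compiler that predefines these macros; x86_data_model() is a hypothetical name:

```c
/* Classify the current compilation target by its predefined macros. */
const char *x86_data_model(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";          /* x86-64 instruction set, 32-bit long and pointers */
#elif defined(__x86_64__) && defined(__LP64__)
    return "x86-64";       /* 64-bit long and pointers */
#elif defined(__i386__)
    return "i386";
#else
    return "not x86";
#endif
}
```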




Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-05-14 Thread H. Peter Anvin
On 05/14/2012 10:31 AM, H.J. Lu wrote:
> Hi,
> 
> Support for the x32 psABI:
> 
> http://sites.google.com/site/x32abi/
> 
> is added in Linux kernel 3.4-rc1.  X32 uses the ILP32 model for x86-64
> instruction set with size of long and pointers == 4 bytes.  X32 is
> already supported in GCC 4.7.0 and binutils 2.22.  I am now working
> to integrate x32 support into GLIBC 2.16 and GDB 7.5.  Here is a
> patch to extend x86-64 psABI for x32.  Any comments?
> 

As a minor nitpick, I have always used x32 with a lower case x.  The
capital X32 looks odd to me.

-hpa



Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-06-26 Thread H. Peter Anvin
On 06/26/2012 12:47 PM, H.J. Lu wrote:
>>
>> May I ask why the decision was made to use ILP32 instead of L64P32?   The
>> latter would seem to avoid lots of porting problems in particular.  And if
>> porting difficulties are the major complaint about x32, is it really too
>> late to switch?  Thanks - mdb
> 
> x32 is designed to replace ia32 where long is 32-bit, not x86-64.
> 

It's worth noting that there are *no* Linux platforms that are not ILP32
or LP64, so adding a third memory model is likely to cause even more
problems...

-hpa



Re: [x86-64 psABI] RFC: Extend x86-64 psABI to support x32

2012-06-28 Thread H. Peter Anvin

On 06/28/2012 02:03 PM, Mark Butler wrote:
> On Tuesday, June 26, 2012 1:53:01 PM UTC-6, H. Peter Anvin wrote:
>
>> It's worth noting that there are *no* Linux platforms that are not
>> ILP32 or LP64, so adding a third memory model is likely to cause even
>> more problems...
>
> Care to comment on what sort of things would be likely to cause a large
> number of problems porting to an L64P32 model?  I understand that L32P64
> (as in Windows 64 bit) causes lots of problems, because there is a lot
> of code that assumes that a pointer can be converted to a long and back.
> That would not be a problem with L64P32 however, because there
> pointers would be smaller than longs rather than larger.

Every time you introduce a new model you will have problems, but in
Linux it is a strong assumption that sizeof(long) == sizeof(void *).


-hpa
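A small example of the assumption in question: code that round-trips a pointer through unsigned long is lossless only when sizeof(long) == sizeof(void *), which holds on every ILP32 and LP64 platform. A sketch that makes the assumption explicit with a C11 static assertion; the helper names are hypothetical:

```c
#include <stdint.h>

/* Fails to compile on, e.g., LLP64 targets where long is narrower than a
   pointer, instead of silently truncating at run time. */
_Static_assert(sizeof(long) == sizeof(void *),
               "this code assumes ILP32 or LP64 (long and pointers the same size)");

/* Pass a pointer through an unsigned long "cookie" and back. */
static unsigned long ptr_to_ulong(const void *p)
{
    return (unsigned long)(uintptr_t)p;
}

static void *ulong_to_ptr(unsigned long v)
{
    return (void *)(uintptr_t)v;
}
```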







Re: Deprecate i386 for GCC 4.8?

2013-01-01 Thread H. Peter Anvin

On 12/12/2012 01:07 PM, David Brown wrote:
> I believe it has been a very long time since any manufacturers made a
> pure 386 chip.

I believe embedded 386 production ceased in 2007.

-hpa




Re: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Aurelien Jarno wrote:
> Hi all,
>
> Since version 4.3, gcc changed its behaviour concerning the x86/x86-64
> ABI and the direction flag, that is it now assumes that the direction
> flag is cleared at the entry of a function and it doesn't clear once
> more if needed.
>
> This causes some problems with the Linux kernel which does not clear
> the direction flag when entering a signal handler. The small code below
> (for x86-64) demonstrates that.
>
> If the signal handler is using code that needs the direction flag cleared
> (for example bzero() or memset()), the code is incorrectly executed.
>
> I guess this has to be fixed on the kernel side, but also gcc-4.3 could
> revert back to the old behaviour, that is clearing the direction flag
> when entering a routine that touches it until most people are running a
> fixed kernel.

Linux should definitely follow the ABI.  This is a bug, and a pretty
serious one.


-hpa
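Until kernels clear DF on signal delivery, a handler can protect itself by executing cld before anything that might use string instructions; everything it calls then also sees DF = 0. A minimal user-space sketch, assuming POSIX sigaction() and an x86 target for the asm (it is compiled out elsewhere); handler(), scratch and install_and_raise() are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t handler_ran;
static char scratch[64];

static void handler(int sig)
{
    (void)sig;
#if defined(__x86_64__) || defined(__i386__)
    /* Clear DF before any memset()/memcpy() that gcc >= 4.3 may expand
       into string instructions assuming DF == 0.  The memory clobber keeps
       the compiler from moving memory accesses above this point. */
    __asm__ volatile ("cld" ::: "memory");
#endif
    memset(scratch, 0xAA, sizeof scratch);
    handler_ran = 1;
}

int install_and_raise(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    return raise(SIGUSR1);   /* handler runs before raise() returns */
}
```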


Re: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Michael Matz wrote:
> Hi,
>
> On Wed, 5 Mar 2008, Aurelien Jarno wrote:
>
>>> So I think gcc at least needs an *option* to revert to the old behavior,
>>> and there's a good argument to make it the default for now, at least for
>>> x86/x86-64 on Linux.
>> And for other kernels. I tested OpenBSD 4.1, FreeBSD 6.3, NetBSD 4.0,
>> they have the same behaviour as Linux, that is they don't clear DF
>> before calling the signal handler.
>
> Sigh.  We could perhaps insert a cld for all functions which can be
> recognized as possible signal handlers and call other unknown or string
> functions.  But it's probably even faster to emit cld in front of the
> inline copies of mem functions again :-(

Well, there is a (slight) difference: you know that a called function
will not clobber your DF state; it's only the entry condition which is
imprecise.

The best would be if this could be controlled by a flag, which we can
flip once kernel fixes have been around for long enough.


-hpa
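The "emit cld in front of the inline copies" option looks roughly like this: each string instruction is preceded by its own cld instead of trusting DF on function entry. A sketch, not GCC's actual output, assuming GCC-style inline asm on x86 with a memset() fallback elsewhere; fill_bytes() is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Inline byte fill that does not rely on DF being clear on entry. */
static void *fill_bytes(void *dst, int c, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    void *d = dst;
    /* cld guarantees rep stosb walks forwards from dst. */
    __asm__ volatile ("cld\n\trep stosb"
                      : "+D" (d), "+c" (n)
                      : "a" (c)
                      : "memory", "cc");
#else
    memset(dst, c, n);
#endif
    return dst;
}
```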


Re: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Jan Hubicka wrote:
>> Yes, if there are four kernels that get it "wrong", that effectively means
>> that the ABI document doesn't describe reality and gcc has to adjust.
>
> Kernels almost never follow ABI used by applications to last detail.
> Linux kernel is disabling red zone and use kernel code model, yet the
> ABI is not going to be adjusted for that.
>
> This is reasonably easy to fix on kernel side in signal handling, or by
> removing std usage completely (I believe it is not performance win, but
> some benchmarking would be needed to double check)

That's not the issue.  The issue is that the kernel leaks the DF from
the code that took a signal to the signal handler.


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Richard Guenther wrote:
> We didn't yet run into this issue and build openSUSE with 4.3 since more than
> three months.

Well, how often do you take a trap inside an overlapping memmove()?

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Chris Lattner wrote:
> On Mar 5, 2008, at 1:34 PM, H. Peter Anvin wrote:
>
>> Richard Guenther wrote:
>>> We didn't yet run into this issue and build openSUSE with 4.3 since
>>> more than three months.
>>
>> Well, how often do you take a trap inside an overlapping memmove()?
>
> How hard is it to change the kernel signal entry path from "pushf" to
> "pushf;cld"?  Problem solved, no?

Not quite, but fixing it in the kernel is easy.

Still breaks for running on all old kernels.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Michael Matz wrote:
> Many bugs are a big issue to people who actually hit them, and we had (and
> probably still have) far nastier corner case miscompilations here and
> there and nevertheless released.  It never was the end of the world :)

This is the sort of stuff that security holes are made from.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Chris Lattner wrote:
>>>> Richard Guenther wrote:
>>>>> We didn't yet run into this issue and build openSUSE with 4.3 since
>>>>> more than three months.
>>>>
>>>> Well, how often do you take a trap inside an overlapping memmove()?
>>>
>>> How hard is it to change the kernel signal entry path from "pushf" to
>>> "pushf;cld"?  Problem solved, no?
>>
>> The problem is with old kernels, which by definition stay unfixed.
>
> My impression was that the problem occurs in GCC compiled code in the
> kernel itself, not in user space:

That's wrong.

The issue is that the kernel is entered (due to a trap, interrupt or
whatever) and the state is saved.  The kernel decides to revector
userspace to a signal handler.  The kernel modifies the userspace state
to do so, but doesn't set DF=0.

Upon return to userspace, the modified state kicks in.  Thus the signal
handler is entered with DF from userspace at trap time, not DF=0.

So it's an asynchronous state leak from one piece of userspace to another.

-hpa
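The kernel-side fix amounts to clearing DF (bit 10 of EFLAGS/RFLAGS) in the user register state the handler will start with; the interrupted code's own flags are still saved in the signal frame and restored by sigreturn. A hypothetical model of that one step, not the actual kernel code; struct saved_user_regs and setup_signal_entry() are stand-ins:

```c
#include <stdint.h>

#define EFLAGS_DF (UINT64_C(1) << 10)   /* direction flag */

/* Stand-in for the saved user-mode register state (not the kernel's pt_regs). */
struct saved_user_regs {
    uint64_t ip;
    uint64_t sp;
    uint64_t flags;
};

/* Revector userspace to the handler.  The original ip/sp/flags are assumed
   to have been copied into the signal frame already. */
void setup_signal_entry(struct saved_user_regs *regs,
                        uint64_t handler, uint64_t frame_sp)
{
    regs->ip = handler;
    regs->sp = frame_sp;
    regs->flags &= ~EFLAGS_DF;   /* the missing step: handler starts with DF = 0 */
}
```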


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-05 Thread H. Peter Anvin

Chris Lattner wrote:
>> Upon return to userspace, the modified state kicks in.  Thus the
>> signal handler is entered with DF from userspace at trap time, not DF=0.
>>
>> So it's an asynchronous state leak from one piece of userspace to
>> another.
>
> Fine, it can happen either way.  In either case, the distro vendor
> should fix the signal handler in the kernels they distribute.  If
> you don't do that, you are still leaking information from one piece of
> user space code to another, you're just papering over it in a horrible
> way :)
>
> GCC defines the direction flag to be clear before inline asm. Enforcing
> the semantics you propose would require issuing a cld before every
> inline asm, not just before every string operation.

It's a kernel bug, and it needs to be fixed.  The discussion is about
what to do in the meantime.

(And yes, you're absolutely right: between global subroutine entry and
the first asm or string operation, you'd have to emit cld.)


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Jakub Jelinek wrote:

On Thu, Mar 06, 2008 at 09:44:05AM +0100, Andi Kleen wrote:

"H. Peter Anvin" <[EMAIL PROTECTED]> writes:


Richard Guenther wrote:

We didn't yet run into this issue and build openSUSE with 4.3 since
more than
three month.


Well, how often do you take a trap inside an overlapping memmove()?

That was the state with older gcc, but with newer gcc it does not necessarily
reset the flag before the next function call.


If so, that's a much worse bug.


so e.g. if you have

memmove(...)
for (... very long loop  ) {
/* no function calls */
/* signals happen */
}

the signal could see the direction flag


memmove is supposed to (and does) do a cld insn after it finishes the
backward copying.


You can still take a signal inside memmove() itself, of course.

-hpa
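The window in question: an overlapping backward copy sets DF with std, runs rep movsb, and clears DF again only afterwards, so a signal delivered during the copy observes DF = 1. A sketch assuming GCC-style inline asm on x86 (a plain loop elsewhere); copy_backward() is hypothetical:

```c
#include <stddef.h>

/* Copy n bytes from src to dst, highest address first, so that it is safe
   when dst overlaps src at a higher address. */
static void copy_backward(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (n == 0)
        return;
#if defined(__x86_64__) || defined(__i386__)
    unsigned char *d = dst + n - 1;
    const unsigned char *s = src + n - 1;
    __asm__ volatile ("std\n\t"
                      "rep movsb\n\t"   /* DF = 1 for this whole window */
                      "cld"
                      : "+D" (d), "+S" (s), "+c" (n)
                      :
                      : "memory", "cc");
#else
    while (n--)
        dst[n] = src[n];
#endif
}
```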


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:
> I agree with it. There is no right or wrong here. Let's start from
> scratch and figure out what is the best way to handle this, assuming
> we are defining a new psABI.

No, I believe the right way to approach this is by applying the good
old-fashioned principle from Ask Mr. Protocol:

    Be liberal in what you receive, conservative in what you send

In other words:

a. Fix the kernel.  Already in progress.
b. Do *not* make gcc assume DF is clean for now.  Adding a
   switch would be a useful thing, since if nothing else it
   would benefit embedded environments.  We might assume
   DF is clean on 64 bits, since it appears it is rarely used
   anyway, and 64 bits is more important in the long run.
c. Once fixed kernels have been out long enough, we can
   flip the default of the switch, one platform at a time if
   need be (e.g. there may never be another SCO OpenServer.)

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:
> On Thu, Mar 6, 2008 at 8:23 AM, Jakub Jelinek <[EMAIL PROTECTED]> wrote:
>> On Thu, Mar 06, 2008 at 07:50:12AM -0800, H. Peter Anvin wrote:
>>> H.J. Lu wrote:
>>>> I agree with it. There is no right or wrong here. Let's start from
>>>> scratch and figure out what is the best way to handle this, assuming
>>>> we are defining a new psABI.
>>
>> BTW, just tested icc and icc doesn't generate cld either (so it matches the
>> new gcc behavior).
>>
>>   char buf1[32], buf2[32];
>>   void bar (void);
>>   void foo (void)
>>   {
>>     __builtin_memset (buf1, 0, 32);
>>     bar ();
>>     __builtin_memset (buf2, 0, 32);
>>   }
>
> Icc follows the psABI. If we are saying icc/gcc 4.3 need a fix, we'd
> better define a new psABI first.

Not a fix, an (optional) workaround for a system bug.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:
> H.J. Lu wrote:
>> So that is the bug in the Linux kernel. Since fixing kernel is much
>> easier than providing a workaround in compilers, I think kernel should
>> be fixed and no need for icc/gcc fix.
>
> Fixing a bug in the Linux kernel is not "much easier". You are taking
> a purely engineering viewpoint, but life is not like that. There are
> lots of copies of Linux kernels around and in use. The issue is not
> fixing the kernel per se, it is propagating that change to all
> Linux kernels in use -- THAT'S another matter entirely, and is
> far far more difficult than making sure that a kernel fix is
> qualified and widely propagated.

Not really, it's just a matter of time.  Typical distro cycles are on
the order of 3 years.


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

H.J. Lu wrote:
>> Not a fix, an (optional) workaround for a system bug.
>
> So that is the bug in the Linux kernel. Since fixing kernel is much easier
> than providing a workaround in compilers, I think kernel should be fixed
> and no need for icc/gcc fix.

The problem is, you're going to have to be able to produce binaries
compatible with old kernels for a *long* time to come.  Are you
honestly saying you'll tell those people "use gcc 4.2 or earlier"?  If
so, I think most distros will have to freeze gcc for the next several years.


-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:
> Sounds good, but has almost nothing to do with the real world. I
> remember back in Realia COBOL days, we had to carefully copy IBM
> bugs in the IBM mainframe COBOL compiler. Doing things right and
> fixing the bug would have been the right thing to do, but no one
> would have used Realia COBOL :-)
>
> Another story, the sad story of the intel chip (I think it was
> the 80188) where Intel made use of Int 5, which was documented
> as reserved. Unfortunately, Microsoft/IBM had used this for
> print screen or some such. Intel was absolutely right that
> their documentation was clear and it was wrong to have used
> these interrupts .. but the result was a warehouse of unused
> chips.

IBM used it for print screen (and other calls), because Microsoft
cassette BASIC used all the non-reserved INT instructions as byte codes
(they cut it down to *only* half the interrupt vectors in the disk version.)

We're still stuck with the consequences of that hack.

-hpa


Re: RELEASE BLOCKER: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Robert Dewar wrote:
>> Not really, it's just a matter of time.  Typical distro cycles are on
>> the order of 3 years.
>>
>> -hpa
>
> again, in the real world, there are MANY projects that are nothing
> like this interactive when it comes to moving to new versions of
> operating systems.

This is true, but beyond a certain point projects generally accept that
they have to monitor their toolchain dependencies.


-hpa


Re: Linux doesn't follow x86/x86-64 ABI wrt direction flag

2008-03-06 Thread H. Peter Anvin

Richard Guenther wrote:
> A patched GCC IMHO makes only sense if it is always-on, yet another option
> won't help in corner cases.  And corner cases is exactly what people seem
> to care about.  For this reason that we have this single release, 4.3.0, that
> behaves "bad" is already a problem.

The option will help embedded vendors who can guarantee that it's not a
problem.


-hpa