On Thu, Jun 05, 2008 at 07:31:12AM -0700, H.J. Lu wrote:
> 1. Extend the register save area to put upper 128bit at the end.
>   Pros:
>     Aligned access.
>     Save stack space if 256bit registers are used.
>   Cons:
>     Split access; requires even more splits beyond 256bit.
> 
> 2. Extend the register save area to put full 256bit YMMs at the end.
> The first DWORD after the register save area has the offset of
> the extended array for YMM registers. The next DWORD has the
> element size of the extended array. Unaligned access will be used.
>   Pros:
>     No split access.
>     Easily extendable beyond 256bit.
>     Limited unaligned access penalty if stack is aligned at 32byte.
>   Cons:
>     May require storing both the lower 128bit and the full 256bit
>     register contents. We could avoid saving the lower 128bit if the
>     correct type is required when accessing the variable argument
>     list, similar to int vs. double.
>     Wastes 272 bytes on the stack when 256bit registers are used.
>     Unaligned loads and stores.
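
For concreteness, option 2 might look something like the struct below.
The struct and its field names are mine, purely an illustration, not
anything from the psABI draft:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of option 2: the classic x86_64 register save
   area, followed by two DWORDs that locate and size the extended
   array holding the full-width YMM contents. */
struct reg_save_area_ext {
    uint64_t gp_regs[6];       /* rdi, rsi, rdx, rcx, r8, r9 */
    uint8_t  xmm_regs[8][16];  /* xmm0-xmm7, 128 bits each */
    uint32_t ext_offset;       /* offset of the extended YMM array */
    uint32_t ext_elem_size;    /* element size, 32 for 256bit YMMs */
    /* the extended array of full YMM registers lives at ext_offset */
};
```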

Or:

3. Pass unnamed __m256 arguments both in YMM registers and on the
stack or just on the stack.  How often do you think people pass
vectors to varargs functions?  I think I haven't seen that yet except
in gcc testcases.  The x86_64 float varargs setup prologue is already
quite slow; do we want to make it even slower for something so rarely
used?  Although we have the tree-stdarg optimization pass, which can
optimize the varargs prologue setup code in some cases, it can't help
for printf and friends: they just do va_start, pass the va_list to
another function and do va_end, so the prologue must account for every
possibility.  Named __m256 arguments would still be passed in YMM
registers only...
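
The pattern I mean is the classic forwarding wrapper (sketch, the
function name is mine):

```c
#include <stdarg.h>
#include <stdio.h>

/* va_start, hand the va_list to another function, va_end.  Because
   the va_list escapes into vsnprintf, the compiler cannot prove which
   argument registers are dead, so the full varargs prologue (saving
   every candidate register) has to be kept. */
void log_msg(char *buf, int n, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(buf, (size_t) n, fmt, ap);
    va_end(ap);
}
```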

        Jakub
