On Thu, Jun 05, 2008 at 07:31:12AM -0700, H.J. Lu wrote:
> 1. Extend the register save area to put the upper 128bit at the end.
>    Pros:
>      Aligned access.
>      Saves stack space if 256bit registers are used.
>    Cons:
>      Split access. Requires more split access beyond 256bit.
>
> 2. Extend the register save area to put the full 256bit YMMs at the end.
>    The first DWORD after the register save area has the offset of
>    the extended array for YMM registers. The next DWORD has the
>    element size of the extended array. Unaligned access will be used.
>    Pros:
>      No split access.
>      Easily extendable beyond 256bit.
>      Limited unaligned access penalty if the stack is aligned at 32byte.
>    Cons:
>      May require storing both the lower 128bit and the full 256bit
>      register contents. We may avoid saving the lower 128bit if the
>      correct type is required when accessing the variable argument
>      list, similar to int vs. double.
>      Wastes 272 bytes on the stack when 256bit registers are used.
>      Unaligned load and store.
Or:

3. Pass unnamed __m256 arguments both in YMM registers and on the stack,
   or just on the stack.

How often do you think people pass vectors to varargs functions? I don't
think I've seen it yet except in gcc testcases. The x86_64 float varargs
setup prologue is already quite slow; do we want to make it even slower
for something very rarely used? We do have the tree-stdarg optimization
pass, which is able to optimize the varargs prologue setup code in some
cases, but for printf etc. it can't help: printf just does va_start,
passes the va_list to another function and does va_end, so it must count
with any possibility.

Named __m256 arguments would still be passed in YMM registers only...

	Jakub