On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar <mi...@kernel.org> wrote: > > * Thomas Gleixner <t...@linutronix.de> wrote: > >> > Useful also for code that needs AVX-like registers to do things like CRCs. >> >> x86/crypto/ has a lot of AVX optimized code. > > Yeah, that's true, but the crypto code is processing fundamentally bigger > blocks > of data, which amortizes the cost of using kernel_fpu_begin()/_end(). > > kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full > FPU > save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily > copy a > thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter > for > something small, like a single 256-bit or 512-bit word access. > > But there's actually a new thing in modern kernels: we got rid of (most of) > lazy > save/restore FPU code, our new x86 FPU model is very "direct" with no FPU > faults > taken normally. > > So assuming the target driver will only load on modern FPUs I *think* it > should > actually be possible to do something like (pseudocode): > > vmovdqa %ymm0, 40(%rsp) > vmovdqa %ymm1, 80(%rsp) > > ... > # use ymm0 and ymm1 > ... > > vmovdqa 80(%rsp), %ymm1 > vmovdqa 40(%rsp), %ymm0 >
I think this kinda sorts works, but only kinda sorta: - I'm a bit worried about newer CPUs where, say, a 256-bit vector operation will implicitly clear the high 768 bits of the regs. (IIRC that's how it works for the new VEX stuff.) - This code will cause XINUSE to be set, which is user-visible and may have performance and correctness effects. I think the kernel may already have some but where it occasionally sets XINUSE on its own, and this caused problems for rr in the past. This might not be a total showstopper, but it's odd. I'd rather see us finally finish the work that Rik started to rework this differently. I'd like kernel_fpu_begin() to look like: if (test_thread_flag(TIF_NEED_FPU_RESTORE)) { return; // we're already okay. maybe we need to check in_interrupt() or something, though? } else { XSAVES/XSAVEOPT/XSAVE; set_thread_flag(TIF_NEED_FPU_RESTORE): } and kernel_fpu_end() does nothing at all. We take the full performance hit for a *single* kernel_fpu_begin() on an otherwise short syscall or interrupt, but there's no additional cost for more of them or for long-enough-running things that we schedule in the middle. As I remember, the main hangup was that this interacts a bit oddly with PKRU, but that's manageable. --Andy