On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Thomas Gleixner <t...@linutronix.de> wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger 
> blocks
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().
>
> kernel_fpu_begin()/_end() is a pretty heavy operation because it does a full 
> FPU
> save/restore via the XSAVE[S] and XRSTOR[S] instructions, which can easily 
> copy a
> thousand bytes around! So kernel_fpu_begin()/_end() is probably a non-starter 
> for
> something small, like a single 256-bit or 512-bit word access.
>
> But there's actually a new thing in modern kernels: we got rid of (most of) 
> lazy
> save/restore FPU code, our new x86 FPU model is very "direct" with no FPU 
> faults
> taken normally.
>
> So assuming the target driver will only load on modern FPUs I *think* it 
> should
> actually be possible to do something like (pseudocode):
>
>         vmovdqa %ymm0, 40(%rsp)
>         vmovdqa %ymm1, 80(%rsp)
>
>         ...
>         # use ymm0 and ymm1
>         ...
>
>         vmovdqa 80(%rsp), %ymm1
>         vmovdqa 40(%rsp), %ymm0
>

I think this kinda sorts works, but only kinda sorta:

 - I'm a bit worried about newer CPUs where, say, a 256-bit vector
operation will implicitly clear the high 768 bits of the regs.  (IIRC
that's how it works for the new VEX stuff.)

 - This code will cause XINUSE to be set, which is user-visible and
may have performance and correctness effects.  I think the kernel may
already have some but where it occasionally sets XINUSE on its own,
and this caused problems for rr in the past.  This might not be a
total showstopper, but it's odd.

I'd rather see us finally finish the work that Rik started to rework
this differently.  I'd like kernel_fpu_begin() to look like:

if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
  return; // we're already okay.  maybe we need to check
in_interrupt() or something, though?
} else {
  XSAVES/XSAVEOPT/XSAVE;
  set_thread_flag(TIF_NEED_FPU_RESTORE):
}

and kernel_fpu_end() does nothing at all.

We take the full performance hit for a *single* kernel_fpu_begin() on
an otherwise short syscall or interrupt, but there's no additional
cost for more of them or for long-enough-running things that we
schedule in the middle.

As I remember, the main hangup was that this interacts a bit oddly
with PKRU, but that's manageable.

--Andy

Reply via email to