On 12/08/2021 19:20, Jason Andryuk wrote:
> Hi,
>
> I was reviewing xen/common/wait.c:__prepare_to_wait() and I think I've
> identified an incompatibility with shadow stacks like Intel CET-SS.
>
> The inline asm does:
>
>         "call 1f;"
>         "1: addq $2f-1b,(%%rsp);"
>         "sub %%esp,%%ecx;"
>         "cmp %3,%%ecx;"
>         "ja 3f;"
>         "mov %%rsp,%%rsi;"
>
>         /* check_wakeup_from_wait() longjmp()'s to this point. */
>         "2: rep movsb;"
>         "mov %%rsp,%%rsi;"
>         "3: pop %%rax;"
>
> `call 1f` gets the address of the code, but the address is popped off
> without ret.  This will leave the shadow stack out-of-sync which will
> trigger the protection.  Is my analysis correct?

There is a shadow stack bug here, but it isn't this.

A CALL with 0 displacement explicitly doesn't push a return address onto
the shadow stack, because CALL; POP is a common technique used in 32bit
PIC logic.  Similarly, 0-displacement calls don't enter the RSB/RAS
branch predictor (which is why the safety of retpoline depends on using
non-zero displacements).

However, the fact we copy the regular stack sideways and totally skip
the shadow stack is a problem.  We'll survive the setjmp(), but the
longjmp() will explode because of trying to use the old context's shstk
with the new context's regular stack.

There is no credible way to make the waitqueue infrastructure compatible
with shadow stacks.

Furthermore, the infrastructure is an eyesore and we've discussed how to
go about deleting it several times in the past - it is only needed
because of a shortcut taken in the monitor/sharing/paging ring handling
with dom0.

In the short term, we should have logic to perform a boot-time disable
of monitor/sharing/paging if CET is wanted, which works in exactly the
same way as opt_pv32.  This will prevent Xen from exploding on a CET
system when you first try to introspect a VM with more than 7(?) vcpus,
and will let the user choose between "working introspection" and
"working CET".

As a tangent, if we know that waitqueues aren't in use, then it is safe
to turn off HVM RSB stuffing, which is enough of a perf win to actively
chase.  I already have some work in progress for disabling PV RSB
stuffing, and the raw perf numbers are at
https://paste.debian.net/hidden/0d59b024/ for anyone interested.


As for monitor/sharing/paging, their dependence on the waitqueue
infrastructure is easy to remove, now that we've got the
acquire_resource infrastructure.

The problem is that, given a 4k ring shared between all vcpus, Xen may
need to wait for space on the ring in the middle of a logical action,
necessitating scheduling the vcpu out while retaining the interim state
which isn't at a clean return-to-guest boundary.  The problem is
exacerbated by the fact that monitor in particular allows for sync or
async notifications, meaning that one single vcpu could consume all room
in the ring in very short order.

The fix is also easy.  For the sync interface, switch to a slot-per-vcpu
model exactly like the IOREQ interface.  This simplifies Xen and the
userspace agent, and improves performance because you don't need to
marshal data in and out of the ring.  It also offers an opportunity to
switch to evtchn-per-vcpu rather than having to scan all vcpus on every
interrupt.

For the async ring, we can either delete that functionality (it is
unclear whether it is actually used at all) or we can implement a
multi-page ring similar to the vmtrace buffer (actually easier, because
there is no requirement to be physically contiguous).

As long as the ring is at least big enough for one entry per vcpu, we
can rearrange the logic to never need to sleep before putting an entry
into the ring.  Specifically, pause the current vCPU after writing the
entry if the remaining space is smaller than $N * d->max_vcpus, after
which we can return to the scheduler like all other operations which
pause a vCPU, and the waitqueue logic loses its only caller.


If anyone has time to address any parts of this, I'll happily provide as
much guidance as I can.

~Andrew