On Fri, Jan 10, 2025 at 08:37:46AM +0100, Sven Schnelle wrote:
> "Dmitry V. Levin" writes:
>
> > Similar to syscall_set_arguments() that complements
> > syscall_get_arguments(), introduce syscall_set_nr()
> > that complements syscall_get_nr().
> >
> > syscall_set_nr() is going to be needed along
From: Junaid Shahid
Two functions, asi_map() and asi_map_gfp(), are added to allow mapping
memory into ASI page tables. The mapping will be identical to the one
for the same virtual address in the unrestricted page tables. This is
necessary to allow switching between the page tables at any arbitrary point.
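As a rough illustration of the semantics described in this snippet (the restricted mapping mirrors whatever the unrestricted tables already contain), here is a userspace toy model. The struct, the array-based "page tables", and all toy_* names are hypothetical simplifications; the real asi_map()/asi_map_gfp() walk actual x86 page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

struct toy_mm {
	int unrestricted[NPAGES];	/* pfn per page, -1 = not mapped */
	int restricted[NPAGES];
};

static void toy_mm_init(struct toy_mm *mm)
{
	for (size_t i = 0; i < NPAGES; i++) {
		mm->unrestricted[i] = (int)i + 100;
		mm->restricted[i] = -1;
	}
}

/* Mirror the unrestricted mapping for [start, start + npages) into the
 * restricted table; fail if the unrestricted side has no mapping. */
static bool toy_asi_map(struct toy_mm *mm, size_t start, size_t npages)
{
	for (size_t i = start; i < start + npages; i++) {
		if (i >= NPAGES || mm->unrestricted[i] < 0)
			return false;
		mm->restricted[i] = mm->unrestricted[i];
	}
	return true;
}

/* Drop the restricted-side mapping only; the unrestricted one stays. */
static void toy_asi_unmap(struct toy_mm *mm, size_t start, size_t npages)
{
	for (size_t i = start; i < start + npages && i < NPAGES; i++)
		mm->restricted[i] = -1;
}
```

Because both tables agree wherever the restricted side is populated, switching between them is transparent to any code that only touches mapped-in-both addresses.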
__PAGEFLAG_FALSE is a non-atomic equivalent of PAGEFLAG_FALSE.
Checkpatch-args: --ignore=COMPLEX_MACRO
Signed-off-by: Brendan Jackman
---
include/linux/page-flags.h | 7 +++
1 file changed, 7 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cc839e436
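For readers unfamiliar with the page-flags macro families: the "FALSE" variants generate stubs so callers compile even when a flag is configured out, and the double-underscore family is the non-atomic one. A hypothetical userspace sketch of that stub pattern (toy_* names are illustrative, not the kernel's real macros):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page { unsigned long flags; };

/* FALSE stubs: the test always reports the flag clear, and the
 * non-atomic set/clear helpers are no-ops, mirroring how
 * __PAGEFLAG_FALSE would stub out __SetPageFoo()/__ClearPageFoo()
 * for a compiled-out flag. */
#define TOY_PAGEFLAG_FALSE(uname)					\
static inline bool toy_Page##uname(const struct toy_page *page)		\
{ (void)page; return false; }						\
static inline void toy_SetPage##uname(struct toy_page *page)		\
{ (void)page; }								\
static inline void toy_ClearPage##uname(struct toy_page *page)		\
{ (void)page; }

TOY_PAGEFLAG_FALSE(Sensitive)
```

The multi-line macro body is also why the patch needs the COMPLEX_MACRO checkpatch suppression noted above.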
From: Ofir Weisse
On a page fault, do asi_exit(), then check whether the address is now
accessible. We do this by refactoring spurious_kernel_fault() into two
parts:
1. Verify that the error code value is something that could arise from a
lazy TLB update.
2. Walk the page table and verify
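The two-part split described above can be modeled in userspace like this. The error-code bits, the one-level "page table", and all toy_* names are hypothetical; the real code walks the full x86 page-table hierarchy.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FAULT_WRITE	0x2
#define TOY_FAULT_USER	0x4
#define TOY_FAULT_RSVD	0x8

/* Part 1: could this error code arise from a lazy TLB update at all?
 * Reserved-bit and user-mode faults never can. */
static bool toy_fault_maybe_spurious(unsigned long error_code)
{
	return !(error_code & (TOY_FAULT_USER | TOY_FAULT_RSVD));
}

struct toy_pte { bool present; bool writable; };

/* Part 2: "walk" the toy table and verify the access really is allowed,
 * in which case the fault was a stale-TLB spurious fault. */
static bool toy_fault_is_spurious(unsigned long error_code,
				  const struct toy_pte *pte)
{
	if (!toy_fault_maybe_spurious(error_code))
		return false;
	if (!pte->present)
		return false;
	if ((error_code & TOY_FAULT_WRITE) && !pte->writable)
		return false;
	return true;
}
```

Splitting the checks this way lets the ASI fault path reuse part 1 cheaply before deciding whether a post-asi_exit() retry could possibly succeed.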
This is just the simplest possible page_alloc patch I could come up with to
demonstrate ASI working in a "denylist" mode: we map the direct map into
the restricted address space, except pages allocated with GFP_USER.
Pages must be asi_unmap()'d before they can be re-allocated. This
requires a TLB flush.
Basically we need to map the kernel code and all its static variables.
Per-CPU variables need to be treated specially as described in the
comments. The cpu_entry_area is similar - this needs to be
nonsensitive so that the CPU can access the GDT etc when handling
a page fault.
Under 5-level paging,
We add new VM flags for sensitive and global-nonsensitive, parallel to
the corresponding GFP flags.
__get_vm_area_node and friends will default to creating
global-nonsensitive VM areas, and vmap then calls asi_map as necessary.
__vmalloc_node_range has additional logic to check and set defaults f
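A minimal sketch of the "parallel flags" defaulting described above. The bit values and toy_* helper names are made up for illustration; only the shape of the logic (GFP flag in, matching VM flag out, then a vmap-time decision to asi_map) follows the snippet.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; not the kernel's real GFP/VM flags. */
#define TOY_GFP_GLOBAL_NONSENSITIVE	0x1u
#define TOY_VM_GLOBAL_NONSENSITIVE	0x100u

/* __vmalloc_node_range-style defaulting: if the caller's GFP flags ask
 * for global-nonsensitive memory, set the parallel VM flag on the area;
 * otherwise the area stays sensitive (mapped only in the unrestricted
 * tables). */
static unsigned int toy_vm_flags_for(unsigned int gfp_flags)
{
	unsigned int vm_flags = 0;

	if (gfp_flags & TOY_GFP_GLOBAL_NONSENSITIVE)
		vm_flags |= TOY_VM_GLOBAL_NONSENSITIVE;
	return vm_flags;
}

/* vmap-style decision: only nonsensitive areas get asi_map()'d. */
static bool toy_needs_asi_map(unsigned int vm_flags)
{
	return (vm_flags & TOY_VM_GLOBAL_NONSENSITIVE) != 0;
}
```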
This is a hack, no need to review it carefully. asi_unmap() doesn't
currently work unless it corresponds exactly to an asi_map() of the
exact same region.
This is mostly harmless (it's only a functional problem if you wanna
touch those pages from the ASI critical section) but it's messy. For
now,
Add support for potentially switching address spaces from within
interrupts/exceptions/NMIs etc. An interrupt does not automatically
switch to the unrestricted address space. It can switch if needed to
access some memory not available in the restricted address space, using
the normal asi_exit call.
From: Junaid Shahid
This adds custom allocation and free functions for ASI page tables.
The alloc functions support allocating memory using different GFP
reclaim flags, in order to be able to support non-sensitive allocations
from both standard and atomic contexts. They also install the page
tab
Some existing utility functions would need to be called from a noinstr
context in the later patches. So mark these as either noinstr or
__always_inline.
An earlier version of this by Junaid had a macro that was intended to
tell the compiler "either inline this function, or call it in the
noinstr s
nmi_uaccess_okay() emits a warning if current CR3 != mm->pgd.
Limit the warning to the case where ASI is not active.
Co-developed-by: Junaid Shahid
Signed-off-by: Junaid Shahid
Co-developed-by: Yosry Ahmed
Signed-off-by: Yosry Ahmed
Signed-off-by: Brendan Jackman
---
arch/x86/mm/tlb.c | 26 ++
From: Yosry Ahmed
Each restricted address space is assigned a separate PCID. Since
currently only one ASI instance per-class exists for a given process,
the PCID is just derived from the class index.
This commit only sets the appropriate PCID when switching CR3, but does
not actually use the NOFLUSH bit.
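Since there is one ASI instance per class per process, the derivation described above can be sketched as a trivial mapping from class index into a PCID range reserved above the PCIDs used for ordinary address spaces. The constant and helper name are assumptions for illustration, not the patch's actual values.

```c
#include <assert.h>

/* PCIDs below this are assumed to be used by normal mms (illustrative). */
#define TOY_NR_DYNAMIC_PCIDS	6

/* One restricted address space per class, so the PCID is just an offset
 * of the class index into the reserved range. */
static unsigned int toy_asi_pcid(unsigned int class_index)
{
	return TOY_NR_DYNAMIC_PCIDS + class_index;
}
```

Keeping the restricted PCIDs disjoint from the normal ones is what makes it safe, later, to skip TLB flushes on CR3 switches between the two spaces.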
Add a boot time parameter to control the newly added X86_FEATURE_ASI.
"asi=on" or "asi=off" can be used in the kernel command line to enable
or disable ASI at boot time. If not specified, ASI enablement depends
on CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON, which is off by default.
asi_check_boottim
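The boot-parameter semantics described above reduce to a small decision function. This is a hedged userspace model, not the patch's parser: the helper name is hypothetical, and the real code would hook early_param() and set/clear X86_FEATURE_ASI rather than return a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "asi=on"/"asi=off" override the default; the default mirrors
 * CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON (off unless selected). */
static bool toy_asi_enabled(const char *arg, bool config_default_on)
{
	if (arg == NULL)
		return config_default_on;	/* asi= not given */
	if (strcmp(arg, "on") == 0)
		return true;
	if (strcmp(arg, "off") == 0)
		return false;
	return config_default_on;		/* unrecognized value */
}
```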
Introduce core API for Address Space Isolation (ASI). Kernel address
space isolation provides the ability to run some kernel
code with a restricted kernel address space.
There can be multiple classes of such restricted kernel address spaces
(e.g. KPTI, KVM-PTI etc.). Each ASI class is identified
Currently a nop config. Keeping as a separate commit for easy review of
the boring bits. Later commits will use and enable this new config.
This config is only added for non-UML x86_64 as other architectures do
not yet have pending implementations. It also has somewhat artificial
dependencies on !
ASI is a technique to mitigate a broad class of CPU vulnerabilities
by unmapping sensitive data from the kernel address space. If no data
is mapped that needs protecting, this class of exploits cannot leak
that data and so the kernel can skip expensive mitigation actions.
For a more detailed overvi
From: Yosry Ahmed
During suspend-like operations (suspend, hibernate, kexec w/
preserve_context), the processor state (including CR3) is usually saved
and restored later.
In the kexec case, this only happens when KEXEC_PRESERVE_CONTEXT is
used to jump back to the original kernel. In relocate_ker
This is the absolute minimum change for TLB flushing to be correct under
ASI. There are two arguably orthogonal changes in here but they feel
small enough for a single commit.
.:: CR3 stabilization
As noted in the comment ASI can destabilize CR3, but we can stabilize it
again by calling asi_exit,
Because asi_exit()s can be triggered by NMIs, CR3 is unstable when in
the ASI restricted address space. (Exception: code in the ASI critical
section can treat it as stable, because if an asi_exit() occurs during
an exception it will be undone before the critical section resumes).
Code that accesse
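The stability rule in the parenthetical above can be modeled in userspace: an asi_exit() forced by an interrupt is undone on the interrupt return path whenever a critical section is still in progress, so the interrupted code never observes the CR3 change. The two booleans and toy_* names are hypothetical simplifications of the real per-CPU state.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_asi_cpu {
	bool restricted;	/* running in the restricted address space? */
	bool in_critical;	/* inside an ASI critical section? */
};

/* An NMI or other exception may force an exit at any time. */
static void toy_asi_exit(struct toy_asi_cpu *c)
{
	c->restricted = false;
}

/* Interrupt return: re-enter the restricted space if a critical section
 * is still in progress, undoing any asi_exit() the handler did. */
static void toy_intr_return(struct toy_asi_cpu *c)
{
	if (c->in_critical)
		c->restricted = true;
}
```

Outside the critical section no such undo happens, which is exactly why CR3 must be treated as unstable there.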
Now userspace gets a restricted address space too. The critical section
begins on exit to userspace and ends when it makes a system call.
Other entries from userspace just interrupt the critical section via
asi_intr_enter().
The reason why system calls have to actually asi_relax() (i.e. fully
term
ASI will need to use this L1D flushing logic so put it in a library
where it can be used independently of KVM.
Since we're creating this library, it starts to look messy if we don't
also use it in the double-opt-in (both kernel cmdline and prctl)
mm-switching flush logic which is there for mitigat
At this point the minimum requirements are in place for the kernel to
operate correctly with ASI enabled.
Signed-off-by: Brendan Jackman
---
arch/x86/mm/asi.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index f10f6614b26148e5ba42
In preparation for sandboxing bare-metal processes, teach ASI to map
userspace addresses into the restricted address space.
Add a new policy helper to determine based on the class whether to do
this. If the helper returns true, mirror userspace mappings into the ASI
pagetables.
Later, it will be
An ASI restricted address space is added for KVM. This protects the
userspace from attack by the guest, and the guest from attack by other
processes. It doesn't attempt to prevent the guest from attack by the
current process.
This change incorporates an extra asi_exit at the end of vcpu_run. We
ex
An ASI-restricted CR3 is unstable as interrupts can cause ASI-exits.
Although we already unconditionally ASI-exit during context-switch, and
before returning from the VM-run path, it's still possible to reach
switch_mm_irqs_off() in a restricted context, because KVM code updates
static keys, which
Here is where ASI actually starts becoming a real exploit mitigation:
On CPUs with L1TF, flush L1D when the ASI data taints say so.
On all CPUs, do some general branch predictor clearing
whenever the control taints say so.
This policy is very much just a starting point for discussion.
Primarily it's a
Now ASI has support for sandboxing userspace. Although userspace now has
much more mapped than it would under KPTI, in theory none of that data
is important to protect.
Note that one particular impact of this is it makes locally defeating
KASLR easier. I don't think this is a great loss given
From: Reiji Watanabe
Currently, all dynamic percpu memory is implicitly (and
unintentionally) treated as sensitive memory.
Unconditionally map pages for dynamically allocated percpu
memory as global nonsensitive memory, other than pages that
are allocated for pcpu_{first,reserved}_chunk during e
From: Junaid Shahid
When ASI is active, __get_current_cr3_fast() adjusts the returned CR3
value accordingly to reflect the actual ASI CR3.
Signed-off-by: Junaid Shahid
Signed-off-by: Brendan Jackman
---
arch/x86/mm/tlb.c | 37 +++--
1 file changed, 31 insertion
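The adjustment described in this snippet amounts to: report the restricted pgd when an ASI instance is active, the normal one otherwise. A userspace model with hypothetical toy_* types (the real function reads per-CPU state and builds a CR3 value with PCID bits):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_asi { uint64_t restricted_pgd; };

struct toy_cpu {
	uint64_t unrestricted_pgd;
	struct toy_asi *active_asi;	/* NULL when no ASI is active */
};

/* Model of __get_current_cr3_fast(): the "fast" CR3 read must reflect
 * the restricted page tables whenever ASI is active, or callers that
 * cache CR3 would resume on the wrong tables. */
static uint64_t toy_get_current_cr3_fast(const struct toy_cpu *cpu)
{
	if (cpu->active_asi)
		return cpu->active_asi->restricted_pgd;
	return cpu->unrestricted_pgd;
}
```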
From: Junaid Shahid
A pseudo-PGD is added to store global non-sensitive ASI mappings.
Actual ASI PGDs copy entries from this pseudo-PGD during asi_init().
Memory can be mapped as globally non-sensitive by calling asi_map()
with ASI_GLOBAL_NONSENSITIVE.
Page tables allocated for global non-sensi
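The copy-at-init scheme above can be sketched with a toy PGD: one template table holds the global non-sensitive entries, and each real ASI PGD clones the populated slots when it is initialized. Sizes and toy_* names are illustrative only.

```c
#include <assert.h>
#include <string.h>

#define TOY_PGD_ENTRIES 8

struct toy_pgd { unsigned long entry[TOY_PGD_ENTRIES]; };

/* Template ("pseudo-PGD") holding ASI_GLOBAL_NONSENSITIVE mappings;
 * 0 means no mapping in this slot. */
static struct toy_pgd toy_global_nonsensitive_pgd;

/* asi_init()-style setup: copy every populated template entry into the
 * new ASI PGD, so all instances share the global non-sensitive view. */
static void toy_asi_init(struct toy_pgd *asi_pgd)
{
	for (int i = 0; i < TOY_PGD_ENTRIES; i++)
		if (toy_global_nonsensitive_pgd.entry[i])
			asi_pgd->entry[i] = toy_global_nonsensitive_pgd.entry[i];
}
```

Because top-level entries are shared by copy, later additions to the lower-level global tables become visible to every ASI PGD without touching them again.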