Re: [PATCH 3/6] syscall.h: introduce syscall_set_nr()

2025-01-10 Thread Dmitry V. Levin
On Fri, Jan 10, 2025 at 08:37:46AM +0100, Sven Schnelle wrote:
> "Dmitry V. Levin" writes:
>
> > Similar to syscall_set_arguments() that complements
> > syscall_get_arguments(), introduce syscall_set_nr()
> > that complements syscall_get_nr().
> >
> > syscall_set_nr() is going to be needed along
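
The getter/setter pairing described in the quoted text can be illustrated with a toy model in plain C. This is only a sketch: `struct toy_regs` and the `toy_` functions are hypothetical stand-ins, not the kernel's `struct pt_regs` or the per-architecture helpers the patch touches.

```c
/* Hypothetical register file; real architectures use struct pt_regs. */
struct toy_regs {
	long syscall_nr;
};

/* Read the system call number recorded for the stopped task. */
long toy_syscall_get_nr(const struct toy_regs *regs)
{
	return regs->syscall_nr;
}

/* Complement of the getter: overwrite the recorded system call number. */
void toy_syscall_set_nr(struct toy_regs *regs, long nr)
{
	regs->syscall_nr = nr;
}
```

A value written with the setter is what the getter returns afterwards, which is the sense in which one complements the other.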

[PATCH RFC v2 11/29] mm: asi: Functions to map/unmap a memory range into ASI page tables

2025-01-10 Thread Brendan Jackman
From: Junaid Shahid Two functions, asi_map() and asi_map_gfp(), are added to allow mapping memory into ASI page tables. The mapping will be identical to the one for the same virtual address in the unrestricted page tables. This is necessary to allow switching between the page tables at any arbitr

[PATCH RFC v2 13/29] mm: Add __PAGEFLAG_FALSE

2025-01-10 Thread Brendan Jackman
__PAGEFLAG_FALSE is a non-atomic equivalent of PAGEFLAG_FALSE. Checkpatch-args: --ignore=COMPLEX_MACRO Signed-off-by: Brendan Jackman --- include/linux/page-flags.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index cc839e436

[PATCH RFC v2 10/29] mm: asi: asi_exit() on PF, skip handling if address is accessible

2025-01-10 Thread Brendan Jackman
From: Ofir Weisse On a page fault, do asi_exit(), then check whether the address is accessible after the exit. We do this by refactoring spurious_kernel_fault() into two parts: 1. Verify that the error code value is something that could arise from a lazy TLB update. 2. Walk the page table and ve

[PATCH RFC v2 14/29] mm: asi: Map non-user buddy allocations as nonsensitive

2025-01-10 Thread Brendan Jackman
This is just the simplest possible page_alloc patch I could come up with to demonstrate ASI working in a "denylist" mode: we map the direct map into the restricted address space, except pages allocated with GFP_USER. Pages must be asi_unmap()'d before they can be re-allocated. This requires a TLB flus
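
The "denylist" policy in this snippet amounts to a single predicate on the allocation flags. A minimal sketch, assuming a made-up flag bit: `TOY_GFP_USER` and `toy_alloc_is_nonsensitive()` are hypothetical and do not reflect the kernel's GFP flag values or the patch's actual code.

```c
#include <stdbool.h>

#define TOY_GFP_USER 0x1u	/* hypothetical flag bit, not the kernel's value */

/*
 * Denylist mode: everything is mapped into the restricted address space
 * (treated as nonsensitive) unless the allocation was marked as user memory.
 */
bool toy_alloc_is_nonsensitive(unsigned int gfp_flags)
{
	return (gfp_flags & TOY_GFP_USER) == 0;
}
```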

[PATCH RFC v2 16/29] mm: asi: Map kernel text and static data as nonsensitive

2025-01-10 Thread Brendan Jackman
Basically we need to map the kernel code and all its static variables. Per-CPU variables need to be treated specially as described in the comments. The cpu_entry_area is similar - this needs to be nonsensitive so that the CPU can access the GDT etc when handling a page fault. Under 5-level paging,

[PATCH RFC v2 17/29] mm: asi: Map vmalloc/vmap data as nonsensitive

2025-01-10 Thread Brendan Jackman
We add new VM flags for sensitive and global-nonsensitive, parallel to the corresponding GFP flags. __get_vm_area_node and friends will default to creating global-nonsensitive VM areas, and vmap then calls asi_map as necessary. __vmalloc_node_range has additional logic to check and set defaults f

[PATCH TEMP WORKAROUND RFC v2 15/29] mm: asi: Workaround missing partial-unmap support

2025-01-10 Thread Brendan Jackman
This is a hack, no need to review it carefully. asi_unmap() doesn't currently work unless it corresponds exactly to an asi_map() of the exact same region. This is mostly harmless (it's only a functional problem if you wanna touch those pages from the ASI critical section) but it's messy. For now,

[PATCH RFC v2 05/29] mm: asi: ASI support in interrupts/exceptions

2025-01-10 Thread Brendan Jackman
Add support for potentially switching address spaces from within interrupts/exceptions/NMIs etc. An interrupt does not automatically switch to the unrestricted address space. It can switch if needed to access some memory not available in the restricted address space, using the normal asi_exit call.

[PATCH RFC v2 09/29] mm: asi: ASI page table allocation functions

2025-01-10 Thread Brendan Jackman
From: Junaid Shahid This adds custom allocation and free functions for ASI page tables. The alloc functions support allocating memory using different GFP reclaim flags, in order to be able to support non-sensitive allocations from both standard and atomic contexts. They also install the page tab

[PATCH RFC v2 01/29] mm: asi: Make some utility functions noinstr compatible

2025-01-10 Thread Brendan Jackman
Some existing utility functions would need to be called from a noinstr context in the later patches. So mark these as either noinstr or __always_inline. An earlier version of this by Junaid had a macro that was intended to tell the compiler "either inline this function, or call it in the noinstr s

[PATCH RFC v2 08/29] mm: asi: Avoid warning from NMI userspace accesses in ASI context

2025-01-10 Thread Brendan Jackman
nmi_uaccess_okay() emits a warning if current CR3 != mm->pgd. Limit the warning to only when ASI is not active. Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid Co-developed-by: Yosry Ahmed Signed-off-by: Yosry Ahmed Signed-off-by: Brendan Jackman --- arch/x86/mm/tlb.c | 26 ++
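
The condition described here can be modelled in userspace C as follows. This is a sketch of the stated rule only, not the code in arch/x86/mm/tlb.c; `nmi_uaccess_should_warn()` and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: returns true when the mismatch warning should fire. */
bool nmi_uaccess_should_warn(uintptr_t cr3_pgd, uintptr_t mm_pgd,
			     bool asi_active)
{
	/* Under ASI, CR3 may legitimately point at a restricted page table. */
	if (asi_active)
		return false;
	/* Otherwise keep the existing check: CR3 must match mm->pgd. */
	return cr3_pgd != mm_pgd;
}
```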

[PATCH RFC v2 06/29] mm: asi: Use separate PCIDs for restricted address spaces

2025-01-10 Thread Brendan Jackman
From: Yosry Ahmed Each restricted address space is assigned a separate PCID. Since currently only one ASI instance per-class exists for a given process, the PCID is just derived from the class index. This commit only sets the appropriate PCID when switching CR3, but does not actually use the NOF

[PATCH RFC v2 04/29] mm: asi: Add infrastructure for boot-time enablement

2025-01-10 Thread Brendan Jackman
Add a boot time parameter to control the newly added X86_FEATURE_ASI. "asi=on" or "asi=off" can be used in the kernel command line to enable or disable ASI at boot time. If not specified, ASI enablement depends on CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON, which is off by default. asi_check_boottim
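
The enablement rule in this snippet can be sketched in plain userspace C. This is not the code from the patch: `asi_enabled_from_cmdline()` is a hypothetical helper, and its `default_on` argument stands in for CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON. How the real code treats unrecognised values is not stated in the snippet, so falling back to the default is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel's implementation.
 * value is the text after "asi=" on the command line, or NULL if the
 * parameter was not given. default_on models the build-time default.
 */
bool asi_enabled_from_cmdline(const char *value, bool default_on)
{
	if (value == NULL)
		return default_on;	/* parameter absent: use the default */
	if (strcmp(value, "on") == 0)
		return true;
	if (strcmp(value, "off") == 0)
		return false;
	return default_on;		/* assumption: unknown values are ignored */
}
```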

[PATCH RFC v2 03/29] mm: asi: Introduce ASI core API

2025-01-10 Thread Brendan Jackman
Introduce core API for Address Space Isolation (ASI). Kernel address space isolation provides the ability to run some kernel code with a restricted kernel address space. There can be multiple classes of such restricted kernel address spaces (e.g. KPTI, KVM-PTI etc.). Each ASI class is identified

[PATCH RFC v2 02/29] x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION

2025-01-10 Thread Brendan Jackman
Currently a nop config. Keeping as a separate commit for easy review of the boring bits. Later commits will use and enable this new config. This config is only added for non-UML x86_64 as other architectures do not yet have pending implementations. It also has somewhat artificial dependencies on !

[PATCH RFC v2 00/29] Address Space Isolation (ASI)

2025-01-10 Thread Brendan Jackman
ASI is a technique to mitigate a broad class of CPU vulnerabilities by unmapping sensitive data from the kernel address space. If no data is mapped that needs protecting, this class of exploits cannot leak that data and so the kernel can skip expensive mitigation actions. For a more detailed overvi

[PATCH RFC v2 23/29] mm: asi: exit ASI before suspend-like operations

2025-01-10 Thread Brendan Jackman
From: Yosry Ahmed During suspend-like operations (suspend, hibernate, kexec w/ preserve_context), the processor state (including CR3) is usually saved and restored later. In the kexec case, this only happens when KEXEC_PRESERVE_CONTEXT is used to jump back to the original kernel. In relocate_ker

[PATCH RFC v2 20/29] mm: asi: Make TLB flushing correct under ASI

2025-01-10 Thread Brendan Jackman
This is the absolute minimum change for TLB flushing to be correct under ASI. There are two arguably orthogonal changes in here but they feel small enough for a single commit. .:: CR3 stabilization As noted in the comment ASI can destabilize CR3, but we can stabilize it again by calling asi_exit,

[PATCH RFC v2 22/29] mm: asi: exit ASI before accessing CR3 from C code where appropriate

2025-01-10 Thread Brendan Jackman
Because asi_exit()s can be triggered by NMIs, CR3 is unstable when in the ASI restricted address space. (Exception: code in the ASI critical section can treat it as stable, because if an asi_exit() occurs during an exception it will be undone before the critical section resumes). Code that accesse

[PATCH RFC v2 25/29] mm: asi: Restricted execution for bare-metal processes

2025-01-10 Thread Brendan Jackman
Now userspace gets a restricted address space too. The critical section begins on exit to userspace and ends when it makes a system call. Other entries from userspace just interrupt the critical section via asi_intr_enter(). The reason why system calls have to actually asi_relax() (i.e. fully term

[PATCH RFC v2 26/29] x86: Create library for flushing L1D for L1TF

2025-01-10 Thread Brendan Jackman
ASI will need to use this L1D flushing logic so put it in a library where it can be used independently of KVM. Since we're creating this library, it starts to look messy if we don't also use it in the double-opt-in (both kernel cmdline and prctl) mm-switching flush logic which is there for mitigat

[PATCH RFC v2 29/29] mm: asi: Stop ignoring asi=on cmdline flag

2025-01-10 Thread Brendan Jackman
At this point the minimum requirements are in place for the kernel to operate correctly with ASI enabled. Signed-off-by: Brendan Jackman --- arch/x86/mm/asi.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index f10f6614b26148e5ba42

[PATCH RFC v2 24/29] mm: asi: Add infrastructure for mapping userspace addresses

2025-01-10 Thread Brendan Jackman
In preparation for sandboxing bare-metal processes, teach ASI to map userspace addresses into the restricted address space. Add a new policy helper to determine based on the class whether to do this. If the helper returns true, mirror userspace mappings into the ASI pagetables. Later, it will be

[PATCH RFC v2 21/29] KVM: x86: asi: Restricted address space for VM execution

2025-01-10 Thread Brendan Jackman
An ASI restricted address space is added for KVM. This protects the userspace from attack by the guest, and the guest from attack by other processes. It doesn't attempt to prevent the guest from attack by the current process. This change incorporates an extra asi_exit at the end of vcpu_run. We ex

[PATCH RFC v2 19/29] mm: asi: Stabilize CR3 in switch_mm_irqs_off()

2025-01-10 Thread Brendan Jackman
An ASI-restricted CR3 is unstable as interrupts can cause ASI-exits. Although we already unconditionally ASI-exit during context-switch, and before returning from the VM-run path, it's still possible to reach switch_mm_irqs_off() in a restricted context, because KVM code updates static keys, which

[PATCH RFC v2 27/29] mm: asi: Add some mitigations on address space transitions

2025-01-10 Thread Brendan Jackman
Here ASI actually starts becoming a real exploit mitigation. On CPUs with L1TF, flush L1D when the ASI data taints say so. On all CPUs, do some general branch predictor clearing whenever the control taints say so. This policy is very much just a starting point for discussion. Primarily it's a

[PATCH RFC v2 28/29] x86/pti: Disable PTI when ASI is on

2025-01-10 Thread Brendan Jackman
Now that ASI supports sandboxing userspace, userspace has much more mapped than it would under KPTI, but in theory none of that data is important to protect. Note that one particular impact of this is it makes locally defeating KASLR easier. I don't think this is a great loss given

[PATCH RFC v2 18/29] mm: asi: Map dynamic percpu memory as nonsensitive

2025-01-10 Thread Brendan Jackman
From: Reiji Watanabe Currently, all dynamic percpu memory is implicitly (and unintentionally) treated as sensitive memory. Unconditionally map pages for dynamically allocated percpu memory as global nonsensitive memory, other than pages that are allocated for pcpu_{first,reserved}_chunk during e

[PATCH RFC v2 07/29] mm: asi: Make __get_current_cr3_fast() ASI-aware

2025-01-10 Thread Brendan Jackman
From: Junaid Shahid When ASI is active, __get_current_cr3_fast() adjusts the returned CR3 value accordingly to reflect the actual ASI CR3. Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/x86/mm/tlb.c | 37 +++-- 1 file changed, 31 insertion

[PATCH RFC v2 12/29] mm: asi: Add basic infrastructure for global non-sensitive mappings

2025-01-10 Thread Brendan Jackman
From: Junaid Shahid A pseudo-PGD is added to store global non-sensitive ASI mappings. Actual ASI PGDs copy entries from this pseudo-PGD during asi_init(). Memory can be mapped as globally non-sensitive by calling asi_map() with ASI_GLOBAL_NONSENSITIVE. Page tables allocated for global non-sensi