Some of you have probably noticed the "xonly" activity happening on a bunch of architectures: first arm64, then riscv64, then hppa, with ongoing efforts on octeon and sparc64 (sun4u only), and more of this to come in the future.
Like past work decades ago (and I suppose continuing ever since) on W^X, and the increasing use of .rodata, the idea here is to have code (text segments) not be readable. Or in a more generic sense, if you mprotect a region with only PROT_EXEC, it is not readable. This has a number of nice characteristics. It makes BROP techniques not work as well (when accompanied by the effects of many other mitigations), it makes complex gadgets containing side effects harder to use (if the registers involved in the side effect contain pointers to code), etc etc.

But most of us have amd64 machines. Thrilling news: it turns out we can do this on fairly modern 64-bit x86 machines from Intel and AMD, if operating in LONG mode, aka amd64. The cpu needs to have a feature called PKU. The way this works is not 100% perfect, but it is a serious enough hindrance. A PKU memory key is instantiated for all memory which is PROT_EXEC-only, and that key is told to block memory reads. Instruction reads are still permitted.

Now some of you may know how PKU works, so you will say that userland can change the behaviour and make the memory readable. That is true. Until a system call happens. Then we force it back to blocking reads. Or a memory fault, we force it back. Or an interrupt, even a clock interrupt. We force it back. Generally if an attacker is trying to read code it is because they don't have a lot of (turing-complete, or some subset of) flexibility and want more information. Imagine they are able to generate the "wrpkru" sequence to disable it, and then do something else? My guess is if they can do two things in a row, then they already have power, and won't be doing this. So this is a protection method against a lower-level attack construction. The concept is this: if you can bypass this to gain a tiny foothold, you would not have bothered, because you have more power and would be reaching higher.

As I mentioned, some other architectures have crossed over already, but not without little bits of pain. Especially in the ports tree, where unprepared code is more common. Mostly this consists of assembly language files that have put read-only data into the .text region, instead of into .rodata (some people still haven't read the memos from around 1997).

So I'd like to recruit some help from those of you capable of building your own kernels. Can you apply the following kernel diff, and try the applications you are used to? A list of applications that fail in some way would be handy. Be sure to ktrace -di them, check that there is a SIGSEGV near the end, and include a snippet of that information. If you don't know how to do this, please don't ask for help. Just let others do it.

The kernel diff isn't perfect yet, because it is less than 24 hours since this started working. But it appears good enough for testing, while we work out the wrinkles.
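To make the "force it back" behaviour concrete, here is a minimal userland sketch. It is not part of the diffs below, only illustrative: it assumes a PKU-capable amd64 machine running the patched kernel, and the rdpkru()/wrpkru() wrappers and the getpid() stand-in for "any syscall" are mine. The 0xfffffffc value is the same one the xonly helper in locore.S writes (access-disable on every key except key 0).

#include <sys/types.h>
#include <sys/mman.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint32_t
rdpkru(void)
{
	uint32_t eax, edx;

	/* rdpkru: %ecx must be zero; result in %eax, %edx is cleared */
	__asm volatile("rdpkru" : "=a"(eax), "=d"(edx) : "c"(0));
	return eax;
}

static void
wrpkru(uint32_t pkru)
{
	/* wrpkru: value in %eax, %ecx and %edx must be zero */
	__asm volatile("wrpkru" : : "a"(pkru), "c"(0), "d"(0) : "memory");
}

int
main(void)
{
	uint32_t before, during, after;
	unsigned char byte;
	char *p;

	p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	p[0] = 0xc3;				/* a single "ret" instruction */
	mprotect(p, getpagesize(), PROT_EXEC);	/* now execute-only (key 1) */

	before = rdpkru();	/* restrictive value, set on syscall return */
	wrpkru(0);		/* userland opens all keys for reading */
	byte = p[0];		/* works, unless an interrupt got there first */
	during = rdpkru();
	getpid();		/* any syscall forces PKRU back */
	after = rdpkru();

	printf("before %08x, during %08x (read %02x), after %08x\n",
	    before, during, byte, after);
	/* another read of p[0] here would deliver SIGSEGV (key violation) */
	return 0;
}

With the diff applied, "before" and "after" should both come back as fffffffc, while "during" is whatever userland wrote; the attacker gets one window, and it closes at the next kernel entry.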
After the kernel diff there are two additional diffs, which can be applied independently:

1) you can recompile gnu/usr.bin/clang to get a new linker that defaults to
   --execute-only libraries and binaries, and completely cross over
2) the shared library linker ld.so also needs a tweak

And after that, a silly test program, which generates this before:

                                     userland   kernel
ld.so                  0x59bd652920 readable   readable
mmap xz                0x5a08e6d000 unreadable unreadable
mmap x                 0x5a33152000 readable   readable
mmap nrx               0x597c8af000 readable   readable
mmap nwx               0x5988309000 readable   readable
mmap xnwx              0x59e6118000 readable   readable
main                   0x5773dfe390 readable   readable
libc                   0x5a2ec49b00 readable   readable

and after:

                                     userland   kernel
ld.so                 0x9476e6f36a0 unreadable unreadable
mmap xz               0x947e5bb3000 unreadable unreadable
mmap x                0x947bc6ba000 unreadable unreadable
mmap nrx              0x947d07fa000 unreadable unreadable
mmap nwx              0x947bc631000 unreadable unreadable
mmap xnwx             0x947a2aa9000 unreadable unreadable
main                  0x9455cc2e490 unreadable unreadable
libc                  0x9476b5f8a60 unreadable unreadable

Index: sys/arch/amd64/amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.163
diff -u -p -u -r1.163 cpu.c
--- sys/arch/amd64/amd64/cpu.c	29 Nov 2022 21:41:39 -0000	1.163
+++ sys/arch/amd64/amd64/cpu.c	14 Jan 2023 05:46:34 -0000
@@ -735,6 +735,8 @@ cpu_init(struct cpu_info *ci)
 		cr4 |= CR4_SMAP;
 	if (ci->ci_feature_sefflags_ecx & SEFF0ECX_UMIP)
 		cr4 |= CR4_UMIP;
+	if (ci->ci_feature_sefflags_ecx & SEFF0ECX_PKU)
+		cr4 |= CR4_PKE;
 	if ((cpu_ecxfeature & CPUIDECX_XSAVE) && cpuid_level >= 0xd)
 		cr4 |= CR4_OSXSAVE;
 	if (pmap_use_pcid)
Index: sys/arch/amd64/amd64/locore.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
retrieving revision 1.131
diff -u -p -u -r1.131 locore.S
--- sys/arch/amd64/amd64/locore.S	1 Dec 2022 00:26:15 -0000	1.131
+++ sys/arch/amd64/amd64/locore.S	14 Jan 2023 14:49:54 -0000
@@ -597,6 +597,7 @@ IDTVEC_NOALIGN(syscall)
 	jz	.Lsyscall_restore_fsbase

 .Lsyscall_restore_registers:
+	call	xonly
 	RET_STACK_REFILL_WITH_RCX

 	movq	TF_R8(%rsp),%r8
@@ -773,6 +774,7 @@ intr_user_exit_post_ast:
 	jz	.Lintr_restore_fsbase

 .Lintr_restore_registers:
+	call	xonly
 	RET_STACK_REFILL_WITH_RCX

 	movq	TF_R8(%rsp),%r8
@@ -940,6 +942,8 @@ NENTRY(intr_fast_exit)
 	testq	$PSL_I,%rdx
 	jnz	.Lintr_exit_not_blocked
 #endif /* DIAGNOSTIC */
+
+	call	xonly
 	movq	TF_RDI(%rsp),%rdi
 	movq	TF_RSI(%rsp),%rsi
 	movq	TF_R8(%rsp),%r8
@@ -1104,6 +1108,20 @@ ENTRY(pagezero)
 	ret
 	lfence
 END(pagezero)
+
+/* void xonly(void) */
+ENTRY(xonly)
+	movl	pmap_pke,%eax		/* have pke support? */
+	cmpl	$0,%eax
+	je	1f
+	movl	$0,%ecx		/* force pke back on for xonly restriction */
+	movl	$0,%edx
+	movl	$0xfffffffc,%eax
+	wrpkru
+1:
+	ret
+	lfence
+END(xonly)

 /* int rdmsr_safe(u_int msr, uint64_t *data) */
 ENTRY(rdmsr_safe)
Index: sys/arch/amd64/amd64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
retrieving revision 1.156
diff -u -p -u -r1.156 pmap.c
--- sys/arch/amd64/amd64/pmap.c	29 Nov 2022 21:41:39 -0000	1.156
+++ sys/arch/amd64/amd64/pmap.c	14 Jan 2023 05:47:49 -0000
@@ -256,6 +256,7 @@ static u_int cr3_pcid_temp;
 /* these two are accessed from locore.o */
 paddr_t cr3_reuse_pcid;
 paddr_t cr3_pcid_proc_intel;
+int pmap_pke;

 /*
  * other data structures
@@ -656,13 +657,26 @@ pmap_bootstrap(paddr_t first_avail, padd
 	virtual_avail = kva_start;		/* first free KVA */

 	/*
+	 * If PKU is available, initialize PROT_EXEC entry correctly,
+	 * and enable the feature before it gets used
+	 */
+	if (cpuid_level >= 0x7) {
+		uint32_t ecx, dummy;
+		CPUID_LEAF(0x7, 0, dummy, dummy, ecx, dummy);
+		if (ecx & SEFF0ECX_PKU) {
+			lcr4(rcr4() | CR4_PKE);
+			pmap_pke = 1;
+		}
+	}
+
+	/*
 	 * set up protection_codes: we need to be able to convert from
 	 * a MI protection code (some combo of VM_PROT...) to something
 	 * we can jam into a i386 PTE.
 	 */
 	protection_codes[PROT_NONE] = pg_nx;			/* --- */
-	protection_codes[PROT_EXEC] = PG_RO;			/* --x */
+	protection_codes[PROT_EXEC] = pmap_pke ? PG_XO : PG_RO;	/* --x */
 	protection_codes[PROT_READ] = PG_RO | pg_nx;		/* -r- */
 	protection_codes[PROT_READ | PROT_EXEC] = PG_RO;	/* -rx */
 	protection_codes[PROT_WRITE] = PG_RW | pg_nx;		/* w-- */
@@ -2105,7 +2119,7 @@ pmap_clear_attrs(struct vm_page *pg, uns
 void
 pmap_write_protect(struct pmap *pmap, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
 {
-	pt_entry_t nx, *spte, *epte;
+	pt_entry_t nx = 0, *spte, *epte, xo = 0;
 	vaddr_t blockend;
 	int shootall = 0, shootself;
 	vaddr_t va;
@@ -2118,9 +2132,10 @@ pmap_write_protect(struct pmap *pmap, va
 	sva &= PG_FRAME;
 	eva &= PG_FRAME;

-	nx = 0;
 	if (!(prot & PROT_EXEC))
 		nx = pg_nx;
+	else if (!(prot & PROT_READ))
+		xo = pmap_pke ? PG_XO : PG_RO;

 	if ((eva - sva > 32 * PAGE_SIZE) && sva < VM_MIN_KERNEL_ADDRESS)
 		shootall = 1;
@@ -2159,7 +2174,7 @@ pmap_write_protect(struct pmap *pmap, va
 			if (!pmap_valid_entry(*spte))
 				continue;
 			pmap_pte_clearbits(spte, PG_RW);
-			pmap_pte_setbits(spte, nx);
+			pmap_pte_setbits(spte, nx | xo);
 		}
 	}
Index: sys/arch/amd64/amd64/trap.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/trap.c,v
retrieving revision 1.93
diff -u -p -u -r1.93 trap.c
--- sys/arch/amd64/amd64/trap.c	7 Nov 2022 01:41:57 -0000	1.93
+++ sys/arch/amd64/amd64/trap.c	14 Jan 2023 15:03:49 -0000
@@ -176,7 +176,12 @@ upageflttrap(struct trapframe *frame, ui
 	vaddr_t va = trunc_page((vaddr_t)cr2);
 	vm_prot_t access_type = pgex2access(frame->tf_err);
 	union sigval sv;
-	int signal, sicode, error;
+	int signal, sicode, error = EACCES;
+
+	if (frame->tf_err == PGEX_PK) {
+		sigabort(p);
+		return 1;
+	}

 	error = uvm_fault(&p->p_vmspace->vm_map, va, 0, access_type);
 	if (error == 0) {
@@ -271,7 +276,9 @@ kpageflttrap(struct trapframe *frame, ui
 	if (va >= VM_MIN_KERNEL_ADDRESS)
 		map = kernel_map;

-	if (curcpu()->ci_inatomic == 0 || map == kernel_map) {
+	if (pcb->pcb_onfault != NULL && frame->tf_err == PGEX_PK)
+		error = EFAULT;
+	else if (curcpu()->ci_inatomic == 0 || map == kernel_map) {
 		onfault = pcb->pcb_onfault;
 		pcb->pcb_onfault = NULL;
 		error = uvm_fault(map, va, 0, access_type);
Index: sys/arch/amd64/amd64/vector.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vector.S,v
retrieving revision 1.87
diff -u -p -u -r1.87 vector.S
--- sys/arch/amd64/amd64/vector.S	1 Dec 2022 00:26:15 -0000	1.87
+++ sys/arch/amd64/amd64/vector.S	14 Jan 2023 14:48:49 -0000
@@ -149,6 +149,8 @@ INTRENTRY_LABEL(calltrap_specstk):
 	movq	%r12,%rax
 	movq	%r13,%rdx
 	wrmsr
+
+	call	xonly
 	popq	%rdi
 	popq	%rsi
 	popq	%rdx
Index: sys/arch/amd64/include/pte.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/pte.h,v
retrieving revision 1.15
diff -u -p -u -r1.15 pte.h
--- sys/arch/amd64/include/pte.h	14 Jan 2023 03:37:13 -0000	1.15
+++ sys/arch/amd64/include/pte.h	14 Jan 2023 16:44:11 -0000
@@ -122,6 +122,8 @@ typedef u_int64_t pt_entry_t;		/* PTE */
 #define	PG_AVAIL2	0x0000000000000400UL
 #define	PG_AVAIL3	0x0000000000000800UL
 #define	PG_PATLG	0x0000000000001000UL	/* PAT on large pages */
+#define	PG_PKMASK	0x7800000000000000UL	/* Protection Key Mask */
+#define	PG_XO		0x0800000000000000UL	/* ^^^ key1 = execute-only */
 #define	PG_NX		0x8000000000000000UL	/* non-executable */
 #define	PG_FRAME	0x000ffffffffff000UL
Index: gnu/llvm/lld/ELF/Driver.cpp
===================================================================
RCS file: /cvs/src/gnu/llvm/lld/ELF/Driver.cpp,v
retrieving revision 1.7
diff -u -p -u -r1.7 Driver.cpp
--- gnu/llvm/lld/ELF/Driver.cpp	14 Jan 2023 16:15:43 -0000	1.7
+++ gnu/llvm/lld/ELF/Driver.cpp	14 Jan 2023 16:56:07 -0000
@@ -1482,6 +1482,7 @@ static void setConfigs(opt::InputArgList
   switch (m) {
   case EM_AARCH64:
   case EM_RISCV:
+  case EM_X86_64:
     config->executeOnly = true;
     break;
   }
Index: gnu/llvm/lld/docs/ld.lld.1
===================================================================
RCS file: /cvs/src/gnu/llvm/lld/docs/ld.lld.1,v
retrieving revision 1.5
diff -u -p -u -r1.5 ld.lld.1
--- gnu/llvm/lld/docs/ld.lld.1	14 Jan 2023 16:20:32 -0000	1.5
+++ gnu/llvm/lld/docs/ld.lld.1	14 Jan 2023 17:01:02 -0000
@@ -212,7 +212,7 @@ followed by the name of the missing libr
 followed by the name of the undefined symbol.
 .It Fl -execute-only
 Mark executable sections unreadable.
-This option is currently only supported on x86-64, AArch64 (default),
+This option is currently only supported on x86-64 (default), AArch64 (default),
 MIPS64, RISC-V (default), and SPARC64.
 .It Fl -exclude-libs Ns = Ns Ar value
 Exclude static libraries from automatic export.
Index: libexec/ld.so/amd64/ld.script
===================================================================
RCS file: /cvs/src/libexec/ld.so/amd64/ld.script,v
retrieving revision 1.2
diff -u -p -u -r1.2 ld.script
--- libexec/ld.so/amd64/ld.script	7 Nov 2022 20:41:38 -0000	1.2
+++ libexec/ld.so/amd64/ld.script	14 Jan 2023 01:21:54 -0000
@@ -1,7 +1,7 @@
 PHDRS {
 	rodata	PT_LOAD FILEHDR PHDRS FLAGS (4);
-	text	PT_LOAD;
+	text	PT_LOAD FLAGS (1);
 	btext	PT_LOAD FLAGS (0x08000005);
 	data	PT_LOAD;
 	random	PT_OPENBSD_RANDOMIZE;

----

#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <setjmp.h>
#include <dlfcn.h>
#include <string.h>
#include <err.h>

int main(int argc, char *argv[]);
void *setup_ldso(void);
void *setup_mmap_xz(void);
void *setup_mmap_x(void);
void *setup_mmap_nrx(void);
void *setup_mmap_nwx(void);
void *setup_mmap_xnwx(void);

struct readable {
	char *name;
	void *(*setup)(void);
	int isfn;
	void *addr;
	int uu, ku;
	int skip;
} readables[] = {
	{ "ld.so",	setup_ldso, 1,		NULL, 0, 0 },
	{ "mmap xz",	setup_mmap_xz, 0,	NULL, 0, 0 },
	{ "mmap x",	setup_mmap_x, 0,	NULL, 0, 0 },
	{ "mmap nrx",	setup_mmap_nrx, 0,	NULL, 0, 0 },
//#if defined(__sparc64__) || defined(__alpha)
//	/* XX sparc64 crashes on write to W. pmap should treat W-only as RW? */
//#else
	{ "mmap nwx",	setup_mmap_nwx, 0,	NULL, 0, 0 },
	{ "mmap xnwx",	setup_mmap_xnwx, 0,	NULL, 0, 0 },
//#endif
	{ "main",	NULL, 1,		&main, 0, 0 },
	{ "libc",	NULL, 1,		&open, 0, 0 },
};

jmp_buf fail;

void
sigsegv(int _unused)
{
	longjmp(fail, 1);
}

void *
setup_mmap_xz(void)
{
	/* no data written; tests read-fault of an unbacked exec-only page */
	return mmap(NULL, getpagesize(), PROT_EXEC,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
}
void *
setup_mmap_x(void)
{
	char *addr;

	addr = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	explicit_bzero(addr, getpagesize());
	mprotect(addr, getpagesize(), PROT_EXEC);
	return addr;
}

void *
setup_mmap_nrx(void)
{
	char *addr;

	addr = mmap(NULL, getpagesize(), PROT_NONE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	mprotect(addr, getpagesize(), PROT_READ | PROT_WRITE);
	explicit_bzero(addr, getpagesize());
	mprotect(addr, getpagesize(), PROT_EXEC);
	return addr;
}

void *
setup_mmap_nwx(void)
{
	char *addr;

	addr = mmap(NULL, getpagesize(), PROT_NONE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	mprotect(addr, getpagesize(), PROT_WRITE);
	explicit_bzero(addr, getpagesize());
	mprotect(addr, getpagesize(), PROT_EXEC);
	return addr;
}

void *
setup_mmap_xnwx(void)
{
	char *addr;

	addr = mmap(NULL, getpagesize(), PROT_EXEC,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	mprotect(addr, getpagesize(), PROT_NONE);
	mprotect(addr, getpagesize(), PROT_WRITE);
	explicit_bzero(addr, getpagesize());
	mprotect(addr, getpagesize(), PROT_EXEC);
	return addr;
}

void *
setup_ldso(void)
{
	void *handle, *dlopenp;

	handle = dlopen("ld.so", RTLD_NOW);
	if (handle == NULL)
		errx(1, "dlopen");
	dlopenp = dlsym(handle, "dlopen");
	return dlopenp;
}

void
setup_table()
{
	int i;

	for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {
		if (setjmp(fail) == 0) {
			if (readables[i].setup)
				readables[i].addr = readables[i].setup();
		} else
			readables[i].skip = 1;
#ifdef __hppa__
		/* hppa ptable headers point at the instructions */
		if (readables[i].isfn)
			readables[i].addr = (void *)*(u_int *)
			    ((u_int)readables[i].addr & ~3);
#endif
	}
}

int
main(int argc, char *argv[])
{
	int p[2], i;

	signal(SIGSEGV, sigsegv);
	signal(SIGBUS, sigsegv);

	setup_table();

	for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {
		struct readable *r = &readables[i];
		char c;

		if (r->skip)
			continue;
		pipe(p);
		fcntl(p[0], F_SETFL, O_NONBLOCK);

		if (write(p[1], r->addr, 1) == 1 &&
		    read(p[0], &c, 1) == 1)
			r->ku = 1;

		if (setjmp(fail) == 0) {
			volatile int x = *(int *)(r->addr);
			r->uu = 1;
		}
		close(p[0]);
		close(p[1]);
	}

	printf("%-16s %18s userland kernel\n", "", "");
	for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {
		struct readable *r = &readables[i];

		if (r->skip)
			printf("%-16s %18p %-10s %-10s\n", r->name, r->addr,
			    "skipped", "skipped");
		else
			printf("%-16s %18p %-10s %-10s\n", r->name, r->addr,
			    r->uu ? "readable" : "unreadable",
			    r->ku ? "readable" : "unreadable");
	}
}
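For reading the table: the "kernel" column is whether a write(2) of one byte from each address into a pipe succeeds, i.e. whether the kernel's copyin can read the page, and the "userland" column is whether a direct dereference succeeds under the SIGSEGV/SIGBUS handler. With all three diffs applied, every row should say unreadable, like the second table above.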