Some of you have probably noticed "xonly" activity happening on a
bunch of architectures.  First arm64, then riscv64, then hppa,
with ongoing efforts on octeon and sparc64 (sun4u only), and more
to come in the future.

Like past work on W^X decades ago (and continuing since), and the
increasing use of .rodata, the idea here is to have code (text segments)
not be readable.  Or in a more generic sense: if you mprotect a region
with only PROT_EXEC, it is not readable.

This has a number of nice characteristics.  It makes BROP techniques
work less well (when accompanied by the effects of many other mitigations),
it makes complex gadgets containing side effects harder to use (if the
registers involved in the side effect contain pointers to code), etc etc.

But most of us have amd64 machines.  Thrilling news:

It turns out we can do this on fairly modern 64-bit x86 machines
from Intel and AMD, if operating in LONG mode, aka amd64.  The cpu
needs to have a feature called PKU.  The way this works is not 100%
perfect, but it is a serious enough hindrance.  A PKU memory key is
instantiated for all memory which is PROT_EXEC-only, and that key is
told to block memory reads.  Instruction fetches are still permitted.
Now some of you may know how PKU works, so you will say that userland
can change the behaviour and make the memory readable.  That is true.
Until a system call happens.  Then we force it back to blocking read.
Or a memory fault: we force it back.  Or an interrupt, even a clock
interrupt: we force it back.  Generally if an attacker is trying to
read code it is because they don't have a lot of (Turing-complete, or
some subset of) flexibility and want more information.  Imagine they
are able to generate the "wrpkru" sequence to disable it, and then do
something else?  My guess is if they can do two things in a row, then
they already have power, and won't be doing this.  So this is a
protection method against a lower-level attack construction.  The
concept is this: if you can bypass this to gain a tiny foothold, you
would not have bothered, because you have more power and would be
reaching higher.

As I mentioned, some other architectures have crossed over already,
but not without little bits of pain.  Especially in the ports tree,
where unprepared code is more common.  Mostly this consists of assembly
language files that have put read-only data into the .text region,
instead of into .rodata (some people still haven't read the memos
from around 1997).

So I'd like to recruit some help from those of you capable of building
your own kernels.  Can you apply the following kernel diff, and try
the applications you are used to?  A list of applications that fail in
some way would be handy.  Be sure to run them under ktrace -di, check
that there is a SIGSEGV near the end, and include a snippet of that
information.

If you don't know how to do this, please don't ask for help.  Just let
others do it.

The kernel diff isn't perfect yet, because it is less than 24 hours
since this started working.  But it appears good enough for testing,
while we work out the wrinkles.

After the kernel diff are two additional diffs, which can be applied
independently:

1) you can recompile gnu/usr.bin/clang to get a new linker that
defaults to --execute-only libraries and binaries, and completely
cross over

2) the shared library linker ld.so also needs a tweak


And after that, a silly test program, which generates this before:

                                      userland   kernel
ld.so                   0x59bd652920  readable   readable  
mmap xz                 0x5a08e6d000  unreadable unreadable
mmap x                  0x5a33152000  readable   readable  
mmap nrx                0x597c8af000  readable   readable  
mmap nwx                0x5988309000  readable   readable  
mmap xnwx               0x59e6118000  readable   readable  
main                    0x5773dfe390  readable   readable  
libc                    0x5a2ec49b00  readable   readable  

and after:

                                      userland   kernel
ld.so                  0x9476e6f36a0  unreadable unreadable
mmap xz                0x947e5bb3000  unreadable unreadable
mmap x                 0x947bc6ba000  unreadable unreadable
mmap nrx               0x947d07fa000  unreadable unreadable
mmap nwx               0x947bc631000  unreadable unreadable
mmap xnwx              0x947a2aa9000  unreadable unreadable
main                   0x9455cc2e490  unreadable unreadable
libc                   0x9476b5f8a60  unreadable unreadable


Index: sys/arch/amd64/amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.163
diff -u -p -u -r1.163 cpu.c
--- sys/arch/amd64/amd64/cpu.c  29 Nov 2022 21:41:39 -0000      1.163
+++ sys/arch/amd64/amd64/cpu.c  14 Jan 2023 05:46:34 -0000
@@ -735,6 +735,8 @@ cpu_init(struct cpu_info *ci)
                cr4 |= CR4_SMAP;
        if (ci->ci_feature_sefflags_ecx & SEFF0ECX_UMIP)
                cr4 |= CR4_UMIP;
+       if (ci->ci_feature_sefflags_ecx & SEFF0ECX_PKU)
+               cr4 |= CR4_PKE;
        if ((cpu_ecxfeature & CPUIDECX_XSAVE) && cpuid_level >= 0xd)
                cr4 |= CR4_OSXSAVE;
        if (pmap_use_pcid)
Index: sys/arch/amd64/amd64/locore.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
retrieving revision 1.131
diff -u -p -u -r1.131 locore.S
--- sys/arch/amd64/amd64/locore.S       1 Dec 2022 00:26:15 -0000       1.131
+++ sys/arch/amd64/amd64/locore.S       14 Jan 2023 14:49:54 -0000
@@ -597,6 +597,7 @@ IDTVEC_NOALIGN(syscall)
        jz      .Lsyscall_restore_fsbase
 
 .Lsyscall_restore_registers:
+       call    xonly
        RET_STACK_REFILL_WITH_RCX
 
        movq    TF_R8(%rsp),%r8
@@ -773,6 +774,7 @@ intr_user_exit_post_ast:
        jz      .Lintr_restore_fsbase
 
 .Lintr_restore_registers:
+       call    xonly
        RET_STACK_REFILL_WITH_RCX
 
        movq    TF_R8(%rsp),%r8
@@ -940,6 +942,8 @@ NENTRY(intr_fast_exit)
        testq   $PSL_I,%rdx
        jnz     .Lintr_exit_not_blocked
 #endif /* DIAGNOSTIC */
+
+       call    xonly
        movq    TF_RDI(%rsp),%rdi
        movq    TF_RSI(%rsp),%rsi
        movq    TF_R8(%rsp),%r8
@@ -1104,6 +1108,20 @@ ENTRY(pagezero)
        ret
        lfence
 END(pagezero)
+
+/* void xonly(void) */
+ENTRY(xonly)
+       movl    pmap_pke,%eax   /* have pke support? */
+       cmpl    $0,%eax
+       je      1f
+       movl    $0,%ecx         /* force pke back on for xonly restriction */
+       movl    $0,%edx
+       movl    $0xfffffffc,%eax
+       wrpkru
+1:
+       ret
+       lfence
+END(xonly)
 
 /* int rdmsr_safe(u_int msr, uint64_t *data) */
 ENTRY(rdmsr_safe)
Index: sys/arch/amd64/amd64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
retrieving revision 1.156
diff -u -p -u -r1.156 pmap.c
--- sys/arch/amd64/amd64/pmap.c 29 Nov 2022 21:41:39 -0000      1.156
+++ sys/arch/amd64/amd64/pmap.c 14 Jan 2023 05:47:49 -0000
@@ -256,6 +256,7 @@ static u_int cr3_pcid_temp;
 /* these two are accessed from locore.o */
 paddr_t cr3_reuse_pcid;
 paddr_t cr3_pcid_proc_intel;
+int pmap_pke;
 
 /*
  * other data structures
@@ -656,13 +657,26 @@ pmap_bootstrap(paddr_t first_avail, padd
        virtual_avail = kva_start;              /* first free KVA */
 
        /*
+        * If PKU is available, initialize PROT_EXEC entry correctly,
+        * and enable the feature before it gets used
+        */
+       if (cpuid_level >= 0x7) {
+               uint32_t ecx, dummy;
+               CPUID_LEAF(0x7, 0, dummy, dummy, ecx, dummy);
+               if (ecx & SEFF0ECX_PKU) {
+                       lcr4(rcr4() | CR4_PKE);
+                       pmap_pke = 1;
+               }
+       }
+
+       /*
         * set up protection_codes: we need to be able to convert from
         * a MI protection code (some combo of VM_PROT...) to something
         * we can jam into a i386 PTE.
         */
 
        protection_codes[PROT_NONE] = pg_nx;                    /* --- */
-       protection_codes[PROT_EXEC] = PG_RO;                    /* --x */
+       protection_codes[PROT_EXEC] = pmap_pke ? PG_XO : PG_RO; /* --x */
        protection_codes[PROT_READ] = PG_RO | pg_nx;            /* -r- */
        protection_codes[PROT_READ | PROT_EXEC] = PG_RO;        /* -rx */
        protection_codes[PROT_WRITE] = PG_RW | pg_nx;           /* w-- */
@@ -2105,7 +2119,7 @@ pmap_clear_attrs(struct vm_page *pg, uns
 void
 pmap_write_protect(struct pmap *pmap, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
 {
-       pt_entry_t nx, *spte, *epte;
+       pt_entry_t nx = 0, *spte, *epte, xo = 0;
        vaddr_t blockend;
        int shootall = 0, shootself;
        vaddr_t va;
@@ -2118,9 +2132,10 @@ pmap_write_protect(struct pmap *pmap, va
        sva &= PG_FRAME;
        eva &= PG_FRAME;
 
-       nx = 0;
        if (!(prot & PROT_EXEC))
                nx = pg_nx;
+       else if (!(prot & PROT_READ))
+               xo = pmap_pke ? PG_XO : PG_RO;
 
        if ((eva - sva > 32 * PAGE_SIZE) && sva < VM_MIN_KERNEL_ADDRESS)
                shootall = 1;
@@ -2159,7 +2174,7 @@ pmap_write_protect(struct pmap *pmap, va
                        if (!pmap_valid_entry(*spte))
                                continue;
                        pmap_pte_clearbits(spte, PG_RW);
-                       pmap_pte_setbits(spte, nx);
+                       pmap_pte_setbits(spte, nx | xo);
                }
        }
 
Index: sys/arch/amd64/amd64/trap.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/trap.c,v
retrieving revision 1.93
diff -u -p -u -r1.93 trap.c
--- sys/arch/amd64/amd64/trap.c 7 Nov 2022 01:41:57 -0000       1.93
+++ sys/arch/amd64/amd64/trap.c 14 Jan 2023 15:03:49 -0000
@@ -176,7 +176,12 @@ upageflttrap(struct trapframe *frame, ui
        vaddr_t va = trunc_page((vaddr_t)cr2);
        vm_prot_t access_type = pgex2access(frame->tf_err);
        union sigval sv;
-       int signal, sicode, error;
+       int signal, sicode, error = EACCES;
+
+       if (frame->tf_err == PGEX_PK) {
+               sigabort(p);
+               return 1;
+       }
 
        error = uvm_fault(&p->p_vmspace->vm_map, va, 0, access_type);
        if (error == 0) {
@@ -271,7 +276,9 @@ kpageflttrap(struct trapframe *frame, ui
        if (va >= VM_MIN_KERNEL_ADDRESS)
                map = kernel_map;
 
-       if (curcpu()->ci_inatomic == 0 || map == kernel_map) {
+       if (pcb->pcb_onfault != NULL && frame->tf_err == PGEX_PK)
+               error = EFAULT;
+       else if (curcpu()->ci_inatomic == 0 || map == kernel_map) {
                onfault = pcb->pcb_onfault;
                pcb->pcb_onfault = NULL;
                error = uvm_fault(map, va, 0, access_type);
Index: sys/arch/amd64/amd64/vector.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vector.S,v
retrieving revision 1.87
diff -u -p -u -r1.87 vector.S
--- sys/arch/amd64/amd64/vector.S       1 Dec 2022 00:26:15 -0000       1.87
+++ sys/arch/amd64/amd64/vector.S       14 Jan 2023 14:48:49 -0000
@@ -149,6 +149,8 @@ INTRENTRY_LABEL(calltrap_specstk):
        movq    %r12,%rax
        movq    %r13,%rdx
        wrmsr
+
+       call    xonly
        popq    %rdi
        popq    %rsi
        popq    %rdx
Index: sys/arch/amd64/include/pte.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/pte.h,v
retrieving revision 1.15
diff -u -p -u -r1.15 pte.h
--- sys/arch/amd64/include/pte.h        14 Jan 2023 03:37:13 -0000      1.15
+++ sys/arch/amd64/include/pte.h        14 Jan 2023 16:44:11 -0000
@@ -122,6 +122,8 @@ typedef u_int64_t pt_entry_t;               /* PTE */
 #define        PG_AVAIL2       0x0000000000000400UL
 #define        PG_AVAIL3       0x0000000000000800UL
 #define        PG_PATLG        0x0000000000001000UL    /* PAT on large pages */
+#define        PG_PKMASK       0x7800000000000000UL    /* Protection Key Mask */
+#define        PG_XO           0x0800000000000000UL    /* ^^^ key1 = execute-only */
 #define        PG_NX           0x8000000000000000UL    /* non-executable */
 #define        PG_FRAME        0x000ffffffffff000UL
 
Index: gnu/llvm/lld/ELF/Driver.cpp
===================================================================
RCS file: /cvs/src/gnu/llvm/lld/ELF/Driver.cpp,v
retrieving revision 1.7
diff -u -p -u -r1.7 Driver.cpp
--- gnu/llvm/lld/ELF/Driver.cpp 14 Jan 2023 16:15:43 -0000      1.7
+++ gnu/llvm/lld/ELF/Driver.cpp 14 Jan 2023 16:56:07 -0000
@@ -1482,6 +1482,7 @@ static void setConfigs(opt::InputArgList
   switch (m) {
   case EM_AARCH64:
   case EM_RISCV:
+  case EM_X86_64:
     config->executeOnly = true;
     break;
   }
Index: gnu/llvm/lld/docs/ld.lld.1
===================================================================
RCS file: /cvs/src/gnu/llvm/lld/docs/ld.lld.1,v
retrieving revision 1.5
diff -u -p -u -r1.5 ld.lld.1
--- gnu/llvm/lld/docs/ld.lld.1  14 Jan 2023 16:20:32 -0000      1.5
+++ gnu/llvm/lld/docs/ld.lld.1  14 Jan 2023 17:01:02 -0000
@@ -212,7 +212,7 @@ followed by the name of the missing libr
 followed by the name of the undefined symbol.
 .It Fl -execute-only
 Mark executable sections unreadable.
-This option is currently only supported on x86-64, AArch64 (default),
+This option is currently only supported on x86-64 (default), AArch64 (default),
 MIPS64, RISC-V (default), and SPARC64.
 .It Fl -exclude-libs Ns = Ns Ar value
 Exclude static libraries from automatic export.
Index: libexec/ld.so/amd64/ld.script
===================================================================
RCS file: /cvs/src/libexec/ld.so/amd64/ld.script,v
retrieving revision 1.2
diff -u -p -u -r1.2 ld.script
--- libexec/ld.so/amd64/ld.script       7 Nov 2022 20:41:38 -0000       1.2
+++ libexec/ld.so/amd64/ld.script       14 Jan 2023 01:21:54 -0000
@@ -1,7 +1,7 @@
 PHDRS
 {
        rodata  PT_LOAD FILEHDR PHDRS FLAGS (4);
-       text    PT_LOAD;
+       text    PT_LOAD FLAGS (1);
        btext   PT_LOAD FLAGS (0x08000005);
        data    PT_LOAD;
        random  PT_OPENBSD_RANDOMIZE;

----
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <signal.h>
#include <setjmp.h>
#include <dlfcn.h>
#include <string.h>
#include <err.h>

int     main(int argc, char *argv[]);

void *setup_ldso(void);
void *setup_mmap_xz(void);
void *setup_mmap_x(void);
void *setup_mmap_nrx(void);
void *setup_mmap_nwx(void);
void *setup_mmap_xnwx(void);

struct readable {
        char *name;
        void *(*setup)(void);
        int isfn;
        void *addr;
        int uu, ku;
        int skip;
} readables[] = {
        { "ld.so",      setup_ldso, 1,          NULL,           0, 0 },
        { "mmap xz",    setup_mmap_xz, 0,       NULL,           0, 0 },
        { "mmap x",     setup_mmap_x, 0,        NULL,           0, 0 },
        { "mmap nrx",   setup_mmap_nrx, 0,      NULL,           0, 0 },
//#if defined(__sparc64__) || defined(__alpha)
//      /* XX sparc64 crashes on write to W.  pmap should treat W-only as RW? */
//#else
        { "mmap nwx",   setup_mmap_nwx, 0,      NULL,           0, 0 },
        { "mmap xnwx",  setup_mmap_xnwx, 0,     NULL,           0, 0 },
//#endif
        { "main",       NULL, 1,                &main,          0, 0 },
        { "libc",       NULL, 1,                &open,          0, 0 },
};

jmp_buf fail;

void
sigsegv(int _unused)
{
        longjmp(fail, 1);
}

void *
setup_mmap_xz(void)
{
        return mmap(NULL, getpagesize(), PROT_EXEC,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        /* no data written. tests read-fault of an unbacked exec-only page */
}

void *
setup_mmap_x(void)
{
        char *addr;

        addr = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        explicit_bzero(addr, getpagesize());
        mprotect(addr, getpagesize(), PROT_EXEC);
        return addr;
}

void *
setup_mmap_nrx(void)
{
        char *addr;

        addr = mmap(NULL, getpagesize(), PROT_NONE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        mprotect(addr, getpagesize(), PROT_READ | PROT_WRITE);
        explicit_bzero(addr, getpagesize());
        mprotect(addr, getpagesize(), PROT_EXEC);
        return addr;
}

void *
setup_mmap_nwx(void)
{
        char *addr;

        addr = mmap(NULL, getpagesize(), PROT_NONE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        mprotect(addr, getpagesize(), PROT_WRITE);
        explicit_bzero(addr, getpagesize());
        mprotect(addr, getpagesize(), PROT_EXEC);
        return addr;
}

void *
setup_mmap_xnwx(void)
{
        char *addr;

        addr = mmap(NULL, getpagesize(), PROT_EXEC,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        mprotect(addr, getpagesize(), PROT_NONE);
        mprotect(addr, getpagesize(), PROT_WRITE);
        explicit_bzero(addr, getpagesize());
        mprotect(addr, getpagesize(), PROT_EXEC);
        return addr;
}

void *
setup_ldso(void)
{
        void *handle, *dlopenp;

        handle = dlopen("ld.so", RTLD_NOW);
        if (handle == NULL)
                errx(1, "dlopen");
        dlopenp = dlsym(handle, "dlopen");
        return dlopenp;
}

void
setup_table()
{
        int i;

        for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {

                if (setjmp(fail) == 0) {
                        if (readables[i].setup)
                                readables[i].addr = readables[i].setup();
                } else
                        readables[i].skip = 1;
#ifdef __hppa__
                /* hppa ptable headers point at the instructions */
                if (readables[i].isfn)
                        readables[i].addr = (void *)*(u_int *)
                            ((u_int)readables[i].addr & ~3);
#endif
        }
}

int
main(int argc, char *argv[])
{
        int p[2], i;

        signal(SIGSEGV, sigsegv);
        signal(SIGBUS, sigsegv);

        setup_table();

        for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {
                struct readable *r = &readables[i];
                char c;

                if (r->skip)
                        continue;
                pipe(p);
                fcntl(p[0], F_SETFL, O_NONBLOCK);

                if (write(p[1], r->addr, 1) == 1 && read(p[0], &c, 1) == 1)
                        r->ku = 1;

                if (setjmp(fail) == 0) {
                        volatile int x = *(int *)(r->addr);
                        r->uu = 1;
                }

                close(p[0]);
                close(p[1]);
        }

        printf("%-16s  %18s  userland   kernel\n", "", "");
        for (i = 0; i < sizeof(readables)/sizeof(readables[0]); i++) {
                struct readable *r = &readables[i];

                if (r->skip)
                        printf("%-16s  %18p  %-10s %-10s\n", r->name, r->addr,
                            "skipped", "skipped");
                else
                        printf("%-16s  %18p  %-10s %-10s\n", r->name, r->addr,
                            r->uu ? "readable" : "unreadable",
                            r->ku ? "readable" : "unreadable");
        }
}
