mm: add API for marking only part of a MMIO page read only

Marek Marczykowski-Górecki Tue, 11 Jun 2024 04:39:31 -0700

On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only few registers on a page needs to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either), although
> > in the latter case the spec forbids placing other registers on the same
> > page. Current API allows only marking whole pages pages read-only,
> > which sometimes may cover other registers that guest may need to
> > write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables (they are in EPT,
> > but not granular enough), do this via emulation (or simply page fault
> > handler for PV) that handles writes that are supposed to be allowed.
> > The new subpage_mmio_ro_add() takes a start physical address and the
> > region size in bytes. Both start address and the size need to be 8-byte
> > aligned, as a practical simplification (allows using smaller bitmask,
> > and a smaller granularity isn't really necessary right now).
> > It will internally add relevant pages to mmio_ro_ranges, but if either
> > start or end address is not page-aligned, it additionally adds that page
> > to a list for sub-page R/O handling. The list holds a bitmask which
> > qwords are supposed to be read-only and an address where page is mapped
> > for write emulation - this mapping is done only on the first access. A
> > plain list is used instead of more efficient structure, because there
> > isn't supposed to be many pages needing this precise r/o control.
> > 
> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash() for
> > non hardware domains.  To reach mmio_ro_emulated_write(), change how
> > write violations for p2m_mmio_direct are handled - specifically, check
> > if they relate to such partially protected page via
> > subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
> > them too. This decodes what guest is trying write and finally calls
> > mmio_ro_emulated_write(). The EPT write violation is detected as
> > npfec.write_access and npfec.present both being true (similar to other
> > places), which may cover some other (future?) cases - if that happens,
> > emulator might get involved unnecessarily, but since it's limited to
> > pages marked with subpage_mmio_ro_add() only, the impact is minimal.
> > Both of those paths need an MFN to which guest tried to write (to check
> > which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> > 
> > This may give a bit more access to the instruction emulator to HVM
> > guests (the change in hvm_hap_nested_page_fault()), but only for pages
> > explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
> > passed through a device partially used by Xen.
> > As of the next patch, it applies only configuration explicitly
> > documented as not security supported.
> > 
> > The subpage_mmio_ro_add() function cannot be called with overlapping
> > ranges, and on pages already added to mmio_ro_ranges separately.
> > Successful calls would result in correct handling, but error paths may
> > result in incorrect state (like pages removed from mmio_ro_ranges too
> > early). Debug build has asserts for relevant cases.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <[email protected]>
> > ---
> > Shadow mode is not tested, but I don't expect it to work differently than
> > HAP in areas related to this patch.
> > 
> > Changes in v4:
> > - rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
> > - guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
> >   there
> > - rename ro_qwords to ro_elems
> > - use unsigned arguments for subpage_mmio_ro_remove_page()
> > - use volatile for __iomem
> > - do not set mmio_ro_ctxt.mfn for mmcfg case
> > - comment where fields of mmio_ro_ctxt are used
> > - use bool for result of __test_and_set_bit
> > - do not open-code mfn_to_maddr()
> > - remove leftover RCU
> > - mention hvm_hap_nested_page_fault() explicitly in the commit message
> > Changes in v3:
> > - use unsigned int for loop iterators
> > - use __set_bit/__clear_bit when under spinlock
> > - avoid ioremap() under spinlock
> > - do not cast away const
> > - handle unaligned parameters in release build
> > - comment fixes
> > - remove RCU - the add functions are __init and actual usage is only
> >   much later after domains are running
> > - add checks overlapping ranges in debug build and document the
> >   limitations
> > - change subpage_mmio_ro_add() so the error path doesn't potentially
> >   remove pages from mmio_ro_ranges
> > - move printing message to avoid one goto in
> >   subpage_mmio_write_emulate()
> > Changes in v2:
> > - Simplify subpage_mmio_ro_add() parameters
> > - add to mmio_ro_ranges from within subpage_mmio_ro_add()
> > - use ioremap() instead of caller-provided fixmap
> > - use 8-bytes granularity (largest supported single write) and a bitmap
> >   instead of a rangeset
> > - clarify commit message
> > - change how it's plugged in for HVM domain, to not change the behavior for
> >   read-only parts (keep it hitting domain_crash(), instead of ignoring
> >   write)
> > - remove unused subpage_mmio_ro_remove()
> > ---
> >  xen/arch/x86/hvm/emulate.c      |   2 +-
> >  xen/arch/x86/hvm/hvm.c          |   4 +-
> >  xen/arch/x86/include/asm/mm.h   |  25 +++-
> >  xen/arch/x86/mm.c               | 273 +++++++++++++++++++++++++++++++++-
> >  xen/arch/x86/pv/ro-page-fault.c |   6 +-
> >  5 files changed, 305 insertions(+), 5 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> > index ab1bc516839a..e98513afc69b 100644
> > --- a/xen/arch/x86/hvm/emulate.c
> > +++ b/xen/arch/x86/hvm/emulate.c
> > @@ -2735,7 +2735,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned 
> > long gla)
> >          .write      = mmio_ro_emulated_write,
> >          .validate   = hvmemul_validate,
> >      };
> > -    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla };
> > +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = 
> > _mfn(mfn) };
> >      struct hvm_emulate_ctxt ctxt;
> >      const struct x86_emulate_ops *ops;
> >      unsigned int seg, bdf;
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 9594e0a5c530..73bbfe2bdc99 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -2001,8 +2001,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned 
> > long gla,
> >          goto out_put_gfn;
> >      }
> >  
> > -    if ( (p2mt == p2m_mmio_direct) && is_hardware_domain(currd) &&
> > -         npfec.write_access && npfec.present &&
> > +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present 
> > &&
> > +         (is_hardware_domain(currd) || subpage_mmio_write_accept(mfn, 
> > gla)) &&
> >           (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> >      {
> >          rc = 1;
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index 98b66edaca5e..d04cf2c4165e 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,34 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via 
> > another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to MMIO_RO_SUBPAGE_GRAN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already 
> > added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Since there is currently no subpage_mmio_ro_remove(), relevant device 
> > should
> > + * not be hot-unplugged.
> > + *
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success
> > + */
> > +#define MMIO_RO_SUBPAGE_GRAN 8
> > +int subpage_mmio_ro_add(paddr_t start, size_t size);
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
> > +#endif
> > +
> >  struct mmio_ro_emulate_ctxt {
> >          unsigned long cr2;
> > +        /* Used only for mmcfg case */
> >          unsigned int seg, bdf;
> > +        /* Used only for non-mmcfg case */
> > +        mfn_t mfn;
> >  };
> >  
> >  int cf_check mmio_ro_emulated_write(
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index d968bbbc7315..dab7cc018c3f 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -150,6 +150,17 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +    struct list_head list;
> > +    mfn_t mfn;
> > +    void __iomem *mapped;
> > +    DECLARE_BITMAP(ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN);
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> > +static DEFINE_SPINLOCK(subpage_ro_lock);
> > +
> >  static uint32_t base_disallow_mask;
> >  /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. 
> > */
> >  #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & 
> > ~_PAGE_GLOBAL)
> > @@ -4910,6 +4921,265 @@ long arch_memory_op(unsigned long cmd, 
> > XEN_GUEST_HANDLE_PARAM(void) arg)
> >      return rc;
> >  }
> >  
> > +/*
> > + * Mark part of the page as R/O.
> > + * Returns:
> > + * - 0 on success - first range in the page
> > + * - 1 on success - subsequent range in the page
> > + * - <0 on error
> > + *
> > + * This needs subpage_ro_lock already taken.
> > + */
> > +static int __init subpage_mmio_ro_add_page(
> > +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> 
> Nit: parameters here seem to be indented differently than below.
> 
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    unsigned int i;
> > +
> > +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> 
> AFAICT you could put the search logic into a separate function and use
> it here, plus in subpage_mmio_ro_remove_page(),
> subpage_mmio_write_emulate() and subpage_mmio_write_accept() possibly.


Good idea.

> > +    if ( !entry )
> > +    {
> > +        /* iter == NULL marks it was a newly allocated entry */
> > +        iter = NULL;
> > +        entry = xzalloc(struct subpage_ro_range);
> > +        if ( !entry )
> > +            return -ENOMEM;
> > +        entry->mfn = mfn;
> > +    }
> > +
> > +    for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > +    {
> > +        bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > +                                        entry->ro_elems);
> > +        ASSERT(!oldbit);
> > +    }
> > +
> > +    if ( !iter )
> > +        list_add(&entry->list, &subpage_ro_ranges);
> > +
> > +    return iter ? 1 : 0;
> > +}
> > +
> > +/* This needs subpage_ro_lock already taken */
> > +static void __init subpage_mmio_ro_remove_page(
> > +    mfn_t mfn,
> > +    unsigned int offset_s,
> > +    unsigned int offset_e)
> > +{
> > +    struct subpage_ro_range *entry = NULL, *iter;
> > +    unsigned int i;
> > +
> > +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(iter->mfn, mfn) )
> > +        {
> > +            entry = iter;
> > +            break;
> > +        }
> > +    }
> > +    if ( !entry )
> > +        return;
> > +
> > +    for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > +        __clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > +
> > +    if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN) )
> > +        return;
> > +
> > +    list_del(&entry->list);
> > +    if ( entry->mapped )
> > +        iounmap(entry->mapped);
> > +    xfree(entry);
> > +}
> > +
> > +int __init subpage_mmio_ro_add(
> > +    paddr_t start,
> > +    size_t size)
> > +{
> > +    mfn_t mfn_start = maddr_to_mfn(start);
> > +    paddr_t end = start + size - 1;
> > +    mfn_t mfn_end = maddr_to_mfn(end);
> > +    unsigned int offset_end = 0;
> > +    int rc;
> > +    bool subpage_start, subpage_end;
> > +
> > +    ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +    ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +    if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +        size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > +
> > +    if ( !size )
> > +        return 0;
> > +
> > +    if ( mfn_eq(mfn_start, mfn_end) )
> > +    {
> > +        /* Both starting and ending parts handled at once */
> > +        subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != 
> > PAGE_SIZE - 1;
> > +        subpage_end = false;
> 
> Given the intended usage of this, don't we want to limit to only a
> single page?  So that PFN_DOWN(start + size) == PFN_DOWN/(start), as
> that would simplify the logic here?

I have considered that, but I haven't found anything in the spec
mandating the XHCI DbC registers to not cross page boundary. Currently
(on a system I test this on) they don't cross page boundary, but I don't
want to assume extra constrains - to avoid issues like before (when
on the older system I tested the DbC registers didn't shared page with
other registers, but then they shared the page on a newer hardware).

> Mostly asking because I think for the usage of XHCI the registers that
> need to be marked RO are all inside the same page, and hence would
> like to avoid introducing logic to handle multipage ranges if that's
> not tested at all.
> 
> > +    }
> > +    else
> > +    {
> > +        subpage_start = PAGE_OFFSET(start);
> > +        subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +    }
> > +
> > +    spin_lock(&subpage_ro_lock);
> 
> Do you really need the lock if modifications can only happen during
> init?  Xen initialization is single threaded, so you can likely avoid
> the lock during boot.

With adding (and removing) firmly tied to init (via __ro_after_init), I
think I'm okay with dropping the spinlock here. Yet, it's still needed
for mapping the page.

> > +
> > +    if ( subpage_start )
> > +    {
> > +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> > +                     PAGE_OFFSET(end) :
> > +                     (PAGE_SIZE - 1);
> > +        rc = subpage_mmio_ro_add_page(mfn_start,
> > +                                      PAGE_OFFSET(start),
> > +                                      offset_end);
> > +        if ( rc < 0 )
> > +            goto err_unlock;
> > +        /* Check if not marking R/W part of a page intended to be fully 
> > R/O */
> > +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > +                                                  mfn_x(mfn_start)));
> 
> I think it would be better if this check was done ahead, and an error
> was returned.  I see no point in delaying the check until the region
> has already been registered.

I need return value from subpage_mmio_ro_add_page() for this check,
because currently it's okay to mark further regions read-only (at which
point the page is already on mmio_ro_ranges). Theoretically I could
probably limit the scope of this API even further - to just one R/O
region per page, but even in the XHCI driver, I can imagine needing
marking more regions (which might share a page, depending on hardware
layout) in some future version that could gain some more features.

> > +    }
> > +
> > +    if ( subpage_end )
> > +    {
> > +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> > +        if ( rc < 0 )
> > +            goto err_unlock_remove;
> > +        /* Check if not marking R/W part of a page intended to be fully 
> > R/O */
> > +        ASSERT(rc || !rangeset_contains_singleton(mmio_ro_ranges,
> > +                                                  mfn_x(mfn_end)));
> > +    }
> > +
> > +    spin_unlock(&subpage_ro_lock);
> > +
> > +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), 
> > mfn_x(mfn_end));
> > +    if ( rc )
> > +        goto err_remove;
> > +
> > +    return 0;
> > +
> > + err_remove:
> > +    spin_lock(&subpage_ro_lock);
> > +    if ( subpage_end )
> > +        subpage_mmio_ro_remove_page(mfn_end, 0, PAGE_OFFSET(end));
> > + err_unlock_remove:
> > +    if ( subpage_start )
> > +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), 
> > offset_end);
> > + err_unlock:
> > +    spin_unlock(&subpage_ro_lock);
> > +    return rc;
> > +}
> > +
> > +static void __iomem *subpage_mmio_map_page(
> > +    struct subpage_ro_range *entry)
> > +{
> > +    void __iomem *mapped_page;
> > +
> > +    if ( entry->mapped )
> > +        return entry->mapped;
> > +
> > +    mapped_page = ioremap(mfn_to_maddr(entry->mfn), PAGE_SIZE);
> > +
> > +    spin_lock(&subpage_ro_lock);
> > +    /* Re-check under the lock */
> > +    if ( entry->mapped )
> > +    {
> > +        spin_unlock(&subpage_ro_lock);
> > +        if ( mapped_page )
> > +            iounmap(mapped_page);
> > +        return entry->mapped;
> > +    }
> > +
> > +    entry->mapped = mapped_page;
> > +    spin_unlock(&subpage_ro_lock);
> > +    return entry->mapped;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > +    mfn_t mfn,
> > +    unsigned int offset,
> > +    const void *data,
> > +    unsigned int len)
> > +{
> > +    struct subpage_ro_range *entry;
> > +    volatile void __iomem *addr;
> > +
> > +    list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +    {
> > +        if ( mfn_eq(entry->mfn, mfn) )
> > +        {
> > +            if ( test_bit(offset / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems) )
> > +            {
> > + write_ignored:
> > +                gprintk(XENLOG_WARNING,
> > +                        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len 
> > %u\n",
> > +                        mfn_x(mfn), offset, len);
> > +                return;
> > +            }
> > +
> > +            addr = subpage_mmio_map_page(entry);
> 
> Given the very limited usage of this subpage RO infrastructure, I
> would be tempted to just map the mfn when the page is registered, in
> order to simplify the logic here.  The only use-case we have is XHCI,
> and further usage of this are likely to be limited to similar hardware
> that's shared between Xen and the hardware domain.

In an earlier similar series (which was about 1 or 2 pages in practice
per device) Jan requested doing lazy mapping, so I did it similar in
this series too.

> > +            if ( !addr )
> > +            {
> > +                gprintk(XENLOG_ERR,
> > +                        "Failed to map page for MMIO write at 
> > 0x%"PRI_mfn"%03x\n",
> > +                        mfn_x(mfn), offset);
> > +                return;
> > +            }
> > +
> > +            switch ( len )
> > +            {
> > +            case 1:
> > +                writeb(*(const uint8_t*)data, addr);
> > +                break;
> > +            case 2:
> > +                writew(*(const uint16_t*)data, addr);
> > +                break;
> > +            case 4:
> > +                writel(*(const uint32_t*)data, addr);
> > +                break;
> > +            case 8:
> > +                writeq(*(const uint64_t*)data, addr);
> > +                break;
> > +            default:
> > +                /* mmio_ro_emulated_write() already validated the size */
> > +                ASSERT_UNREACHABLE();
> > +                goto write_ignored;
> > +            }
> > +            return;
> > +        }
> > +    }
> > +    /* Do not print message for pages without any writable parts. */
> > +}
> > +
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> > +{
> > +    unsigned int offset = PAGE_OFFSET(gla);
> > +    const struct subpage_ro_range *entry;
> > +
> > +    list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +        if ( mfn_eq(entry->mfn, mfn) &&
> > +             !test_bit(offset / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems) )
> > +        {
> > +            /*
> > +             * We don't know the write size at this point yet, so it could 
> > be
> > +             * an unaligned write, but accept it here anyway and deal with 
> > it
> > +             * later.
> > +             */
> > +            return true;
> 
> For accesses that fall into the RO region, I think you need to accept
> them here and just terminate them?  I see no point in propagating
> them further in hvm_hap_nested_page_fault().

If write hits an R/O region on a page with some writable regions the
handling should be the same as it would be just on the mmio_ro_ranges.
This is what the patch does.
There may be an opportunity to simplify mmio_ro_ranges handling
somewhere, but I don't think it belongs to this patch.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

signature.asc
Description: PGP signature

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

Reply via email to