On 11/10/25 3:53 PM, Jan Beulich wrote:
On 20.10.2025 17:57, Oleksii Kurochko wrote:
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
  #include <xen/rwlock.h>
  #include <xen/types.h>
+#include <asm/page.h>
  #include <asm/page-bits.h>
extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
#define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
  #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL  gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ *   or Sv57x4, G-stage address translation is a variation on the usual
+ *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ *   Sv57, respectively. In each case, the size of the incoming address is
+ *   widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.

Agreed, it isn't really needed; let's drop it.


+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.

I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
  GFN_MASK(0) = 0x1ff (9-bit index width for level 0)
  GFN_MASK(1) = 0x1ff (9-bit index width for level 1)
  GFN_MASK(2) = 0x7ff (11-bit index width for level 2)

Or do you mean that GFN_MASK(lvl) should return something like this:
  GFN_MASK_(0) = 0x1FF000       (0x1ff << 0xc)
  GFN_MASK_(1) = 0x3FE00000     (GFN_MASK_(0) << 9)
  GFN_MASK_(2) = 0x1FFC0000000  (GFN_MASK_(1) << 9, plus the extra 2 bits)
And then here ...

That's not really something you can apply to a GFN, unlike the name
suggests.

That is why the address should be shifted first, something like what is
done in calc_offset():
  (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);

...
 (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
With this option, more shifting would be needed.

Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or
maybe even just P2M_INDEX_MASK()?


+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)

Yes, it is expected to be used to shift gfn.

The similar as with above would it be better to rename P2M_LEVEL_SHIFT to
P2M_GFN_LEVEL_SHIFT()?


--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
  #include <xen/rwlock.h>
  #include <xen/sched.h>
  #include <xen/sections.h>
+#include <xen/xvmalloc.h>
#include <asm/csr.h>
  #include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
  #include <asm/vmid.h>
unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).

The question then is whether we really will (or want to) have cases where
the g-stage mode differs per domain / per P2M.


+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your input are guest-physical addresses,
if I'm not mistaken.

You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t
supposed to be used only with machine physical addresses?



+{
+    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+    /*
+     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+     * page table spans 4 consecutive 4KB pages.
+     * We want to return an index within one of these 4 pages.
+     * The specific page to use is determined by `p2m_get_root_pointer()`.
+     *
+     * Example: if `offset == 512`:
+     *  - A single 4KB page holds 512 entries.
+     *  - Therefore, entry 512 corresponds to index 0 of the second page.
+     *
+     * At all other levels, only one page is allocated, and `offset` is
+     * always in the range 0 to 511, since the VPN is 9 bits long.
+     */
+    return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply

     return offset & ((1U << PAGETABLE_ORDER) - 1);

?

It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases
the GFN bit length is equal to PAGETABLE_ORDER. However, at the root level
the GFN bit length is 2 bits larger. So something like the following is
needed:
  offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within a single 16KB page, but in the case of
the P2M root we actually have four consecutive 4KB pages, so the intention
was to return an offset inside one of those four 4KB pages.

While writing the above, I started thinking whether calc_offset() could be
implemented much more simply. Since the root page table consists of four
consecutive pages, it seems acceptable to have the offset in the range
[0, 2^11) instead of doing all the extra manipulation to determine which
of the four pages is used and the offset within that specific page:

  static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
  {
      return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
  }

  static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
  {
      return __map_domain_page(p2m->root);
  }

It probably still makes sense for p2m_get_root_pointer() to check that the
root GFN index is below 2^11.

Am I missing something?


+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+        var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.

I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().

But how can I know which offset I will actually need to use?
If we take the following loop as an example:
  for ( level = P2M_ROOT_LEVEL; level > target; level-- )
  {
      /*
       * Don't try to allocate intermediate page tables if the mapping
       * is about to be removed.
       */
      rc = p2m_next_level(p2m, !removing_mapping,
                          level, &table, offsets[level]);
      ...
  }
It walks from P2M_ROOT_LEVEL down to target, where target is determined at
runtime.

If you mean that, for example, when the G-stage mode is Sv39, there is no
need to allocate an array with 4 entries (or 5 entries if we consider
Sv57, so P2M_MAX_ROOT_LEVEL should be updated), because Sv39 only uses 3
page table levels, then yes, in theory it could be smaller. But I don't
think it is a real issue if the offsets[] array on the stack has a few
extra unused entries.

If preferred, I could allocate the array dynamically based on
gstage_root_level. Would that be better?


@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
      return rc;
  }
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.

+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    /*
+     * The P2M root page table is extended by 2 bits, making its size 16KB
+     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+     * only allocates 4KB pages).
+     *
+     * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.

Should be gfn.


+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.

clean_cache should be better:
  p2m_clean_pte(pte_t *p, bool clean_cache)

I suppose it would be nice to rename clean_pte to clean_cache everywhere.

Thanks.

~ Oleksii
