On Fri, Feb 25, 2022 at 2:30 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> > On 2/24/22 17:23, Michael S. Tsirkin wrote:
> > > On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> > >> On 2/23/22 23:35, Joao Martins wrote:
> > >>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> > >>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> > >>>>> + uint64_t pci_hole64_size)
> > >>>>> +{
> > >>>>> + X86MachineState *x86ms = X86_MACHINE(pcms);
> > >>>>> + uint32_t eax, vendor[3];
> > >>>>> +
> > >>>>> + host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > >>>>> + if (!IS_AMD_VENDOR(vendor)) {
> > >>>>> + return;
> > >>>>> + }
> > >>>>
> > >>>> Wait a sec, should this actually be tying things to the host CPU ID?
> > >>>> It's really about what we present to the guest though,
> > >>>> isn't it?
> > >>>
> > >>> Using cpuid was the easier catch-all, without going into
> > >>> Linux UAPI specifics. But it doesn't have to be tied to the host
> > >>> CPU; this is only needed on systems with an IOMMU present.
> > >>>
> > >>>> Also, can't we tie this to whether the AMD IOMMU is present?
> > >>>>
> > >>> I think so, I can add that. Something like an amd_iommu_exists()
> > >>> helper in util/vfio-helpers.c which checks whether there are any
> > >>> sysfs child entries starting with ivhd in /sys/class/iommu/. Given
> > >>> that this HT region has been hardcoded in the IOMMU reserved
> > >>> regions since >= 4.11 (up to latest), I don't think it's even
> > >>> worth checking that the range exists in:
> > >>>
> > >>> /sys/kernel/iommu_groups/0/reserved_regions
> > >>>
> > >>> (Also that sysfs ABI is >= 4.11 only)
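> > >>>
> > >>> For reference, on an AMD host with the IOMMU enabled that file
> > >>> would typically list the HT range as reserved, something along
> > >>> these lines (illustrative output, not from any specific machine):
> > >>>
> > >>> $ cat /sys/kernel/iommu_groups/0/reserved_regions
> > >>> 0x00000000fee00000 0x00000000feefffff msi
> > >>> 0x000000fd00000000 0x000000ffffffffff reserved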
> > >>
> > >> Here's what I have staged in local tree, to address your comment.
> > >>
> > >> Naturally the first chunk is what's affected by this patch; the rest
> > >> is a predecessor patch to introduce qemu_amd_iommu_is_present(). It
> > >> seems to pass all the tests and whatnot.
> > >>
> > >> I am not entirely sure this is the right place to put such a helper;
> > >> open to suggestions. Wrt the naming of the helper, I tried to follow
> > >> the rest of the file's style.
> > >>
> > >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > >> index a9be5d33a291..2ea4430d5dcc 100644
> > >> --- a/hw/i386/pc.c
> > >> +++ b/hw/i386/pc.c
> > >> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> > >>                                            uint64_t pci_hole64_size)
> > >>  {
> > >>      X86MachineState *x86ms = X86_MACHINE(pcms);
> > >> -    uint32_t eax, vendor[3];
> > >>
> > >> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > >> -    if (!IS_AMD_VENDOR(vendor)) {
> > >> +    if (!qemu_amd_iommu_is_present()) {
> > >>          return;
> > >>      }
> > >>
> > >> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > >> index 7bcce3bceb0f..eb4ea071ecec 100644
> > >> --- a/include/qemu/osdep.h
> > >> +++ b/include/qemu/osdep.h
> > >> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> > >>   */
> > >>  size_t qemu_get_host_physmem(void);
> > >>
> > >> +/**
> > >> + * qemu_amd_iommu_is_present:
> > >> + *
> > >> + * Operating system agnostic way of querying if an AMD IOMMU
> > >> + * is present.
> > >> + *
> > >> + */
> > >> +bool qemu_amd_iommu_is_present(void);
> > >> +
> > >>  /*
> > >>   * Toggle write/execute on the pages marked MAP_JIT
> > >>   * for the current thread.
> > >> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > >> index f2be7321c59f..54cef21217c4 100644
> > >> --- a/util/oslib-posix.c
> > >> +++ b/util/oslib-posix.c
> > >> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> > >>  #endif
> > >>      return 0;
> > >>  }
> > >> +
> > >> +bool qemu_amd_iommu_is_present(void)
> > >> +{
> > >> +    bool found = false;
> > >> +#ifdef CONFIG_LINUX
> > >> +    struct dirent *entry;
> > >> +    char *path;
> > >> +    DIR *dir;
> > >> +
> > >> +    path = g_strdup_printf("/sys/class/iommu");
> > >> +    dir = opendir(path);
> > >> +    if (!dir) {
> > >> +        g_free(path);
> > >> +        return found;
> > >> +    }
> > >> +
> > >> +    do {
> > >> +        entry = readdir(dir);
> > >> +        if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> > >> +            found = true;
> > >> +            break;
> > >> +        }
> > >> +    } while (entry);
> > >> +
> > >> +    g_free(path);
> > >> +    closedir(dir);
> > >> +#endif
> > >> +    return found;
> > >> +}
> > >
> > > why are we checking whether an AMD IOMMU is present
> > > on the host?
> > > Isn't what we care about whether it's
> > > present in the VM? What we are reserving, after all,
> > > is a range of GPAs, not actual host PAs ...
> > >
> > Right.
> >
> > But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> > so we need to avoid mapping that portion of the address space at all
> > and use some other portion of the GPA space instead. This has to do
> > with the host IOMMU, which is where the DMA mapping of guest PAs takes
> > place. So, if you do not have a host IOMMU, you can't service guest
> > DMA/PCIe via VFIO through the host IOMMU, and therefore you don't have
> > this problem.
> >
> > Whether one uses a guest IOMMU or not does not affect the result; that
> > would be more a case of mimicking real hardware than of fixing the
> > issue this series addresses.
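> >
> > To make the failure mode concrete: VFIO maps guest PAs as IOVAs into
> > the host IOMMU, roughly like the minimal (non-QEMU) sketch below, and
> > a map that overlaps a host-reserved range such as the HT hole just
> > below 1Tb gets rejected by the kernel. The fd name and the addresses
> > in the comments are only illustrative.
> >
> > #include <stdint.h>
> > #include <sys/ioctl.h>
> > #include <linux/vfio.h>
> >
> > static int map_guest_ram(int container_fd, void *hva, uint64_t gpa,
> >                          uint64_t size)
> > {
> >     struct vfio_iommu_type1_dma_map map = {
> >         .argsz = sizeof(map),
> >         .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
> >         .vaddr = (uintptr_t)hva, /* host virtual address backing guest RAM */
> >         .iova  = gpa,            /* guest physical address used as IOVA */
> >         .size  = size,
> >     };
> >
> >     /* Fails (e.g. -EINVAL) if [gpa, gpa + size) overlaps a reserved
> >      * region such as 0xfd00000000-0xffffffffff on AMD hosts. */
> >     return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
> > }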
>
>
> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> And ideally move the logic reporting reserved ranges there too.
> However, I think vdpa has the same issue too.
> CC Jason for an opinion.
vDPA is even more complicated than VFIO as it allows the device to
reserve specific IOVA ranges:
static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
{
        struct vdpa_iova_range *range = &v->range;
        struct vdpa_device *vdpa = v->vdpa;
        const struct vdpa_config_ops *ops = vdpa->config;

        if (ops->get_iova_range) {
                *range = ops->get_iova_range(vdpa);
        ...
}
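
Today userspace learns about that range through the vhost-vdpa fd,
roughly as in the sketch below (error handling omitted; "vdpa_fd" is
just an illustrative name for an open /dev/vhost-vdpa-N descriptor):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static void query_iova_range(int vdpa_fd)
{
    struct vhost_vdpa_iova_range range;

    /* Only IOVAs within [range.first, range.last] are usable for maps. */
    if (ioctl(vdpa_fd, VHOST_VDPA_GET_IOVA_RANGE, &range) == 0) {
        printf("usable IOVA range: 0x%llx-0x%llx\n",
               (unsigned long long)range.first,
               (unsigned long long)range.last);
    }
}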
I wonder if we need to expose those via the netlink protocol.
Thanks
> Also, some concerns
> - I think this changes the memory layout for working VMs not using VFIO.
> Better preserve the old layout for old machine types?
>
> - You mention the phys bits issue very briefly, and it's pretty
> concerning. Do we maybe want to also disable the workaround if phys
> bits is too small (see the sketch below)? Also, we'd need to test a
> bunch of old guests to see what happens. Which guests were tested?
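>
> A rough sketch of the phys-bits check I have in mind (names are
> illustrative, not an actual patch):
>
>     uint64_t ht_hole_end = 1ULL << 40;  /* 1Tb, end of the AMD HT hole */
>     uint64_t maxphysaddr = (1ULL << X86_CPU(first_cpu)->phys_bits) - 1;
>     uint64_t maxusedaddr = ht_hole_end + x86ms->above_4g_mem_size +
>                            pci_hole64_size - 1;
>
>     if (maxusedaddr > maxphysaddr) {
>         /* The relocated RAM/PCI hole would not be addressable by the
>          * guest; keep the legacy layout (or refuse to start). */
>         return;
>     }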
>
> --
> MST
>