On 10.05.2022 16:30, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() updates perhaps
>> several PTEs within the table, since these are changes to "avail" bits
>> only I do not think that cache flushing would be needed afterwards. Such
>> cache flushing (of entire pages, unless adding yet more logic to be more
>> selective) would be quite noticeable performance-wise (very prominent
>> during Dom0 boot).
>>
>> Signed-off-by: Jan Beulich <[email protected]>
>> ---
>> v4: Re-base over changes earlier in the series.
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) was pretty close after iommu_enable_translation(). No errors,
>> no watchdog would kick in, just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It took a lot of experimenting until I
>> figured out the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
>> bits in this range (unconditionally, and with the workaround here in
>> place), which yielded the same behavior.
> 
> Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 on
> some(?) IOMMUs are not usable?
> 
> Would be good if we could get a more formal response I think.

A more formal response would be nice, but given the age of the affected
hardware I don't expect anything more will be done there by Intel.

>> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>>  
>>              write_atomic(&pte->val, new_pte.val);
>>              iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +            pt_update_contig_markers(&parent->val,
>> +                                     address_level_offset(addr, level),
> 
> I think (unless previous patches in the series have changed this)
> there already is an 'offset' local variable that you could use.

The variable is clobbered by "IOMMU/x86: prefill newly allocate page
tables".

>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>  
>>      old = *pte;
>>      dma_clear_pte(*pte);
>> +    iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +    while ( pt_update_contig_markers(&page->val,
>> +                                     address_level_offset(addr, level),
>> +                                     level, PTE_kind_null) &&
>> +            ++level < min_pt_levels )
>> +    {
>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>> +
>> +        unmap_vtd_domain_page(page);
>> +
>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>> +                                          false);
>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>> +
>> +        page = map_vtd_domain_page(pg_maddr);
>> +        pte = &page[address_level_offset(addr, level)];
>> +        dma_clear_pte(*pte);
>> +        iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +        *flush_flags |= IOMMU_FLUSHF_all;
>> +        iommu_queue_free_pgtable(hd, pg);
>> +    }
> 
> I think I'm setting myself up for trouble, but do we need to sync the
> cache for the lower level entries if higher level ones are to be
> changed?
> 
> IOW, would it be fine to just flush the highest level modified PTE?
> As the lower level ones won't be reachable anyway.

I definitely want to err on the safe side here. If later we can
prove that some cache flush is unneeded, I'd be happy to see it
go away.
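
To make the shape of the collapse loop in the hunk above easier to
follow in isolation, here is a small self-contained toy model in C.
Everything below is purely illustrative: the names, the table geometry,
and the freeing scheme are made up for the example and are not Xen code;
the real loop additionally syncs the cache and accumulates flush flags
at every level, which the toy omits.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 8              /* tiny tables, just for the example */
#define LEVELS  3
#define PTE_PRESENT 0x1u

struct toy_table {
    uint64_t pte[ENTRIES];     /* bit 0 == present */
};

static bool table_empty(const struct toy_table *t)
{
    for ( unsigned int i = 0; i < ENTRIES; i++ )
        if ( t->pte[i] & PTE_PRESENT )
            return false;
    return true;
}

/*
 * Clear the leaf entry at slots[0], then walk upward: whenever a table
 * has become entirely non-present, clear the entry referencing it at the
 * next higher level and free the table, stopping at the first level that
 * still has present entries.
 */
static void clear_and_collapse(struct toy_table *tables[LEVELS], /* leaf first */
                               const unsigned int slots[LEVELS])
{
    tables[0]->pte[slots[0]] = 0;

    for ( unsigned int lvl = 0;
          lvl + 1 < LEVELS && table_empty(tables[lvl]); lvl++ )
    {
        tables[lvl + 1]->pte[slots[lvl + 1]] = 0; /* unhook the empty table */
        free(tables[lvl]);                        /* ... then free it */
        printf("collapsed level %u\n", lvl + 1);
    }
}

int main(void)
{
    struct toy_table *leaf = calloc(1, sizeof(*leaf));
    struct toy_table *mid  = calloc(1, sizeof(*mid));
    struct toy_table *root = calloc(1, sizeof(*root));
    unsigned int slots[LEVELS] = { 3, 5, 1 };

    /* Build a single 3-level path holding exactly one present leaf entry. */
    root->pte[slots[2]] = (uintptr_t)mid  | PTE_PRESENT;
    mid->pte[slots[1]]  = (uintptr_t)leaf | PTE_PRESENT;
    leaf->pte[slots[0]] = 0x1000 | PTE_PRESENT;

    struct toy_table *tables[LEVELS] = { leaf, mid, root };

    clear_and_collapse(tables, slots); /* frees "leaf" and "mid" */
    free(root);
    return 0;
}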

>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>>      }
>>  
>>      *pte = new;
>> -
>>      iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +
>> +    /*
>> +     * While the (ab)use of PTE_kind_table here allows to save some work in
>> +     * the function, the main motivation for it is that it avoids a so far
>> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
>> +     * based laptop.
>> +     */
>> +    pt_update_contig_markers(&page->val,
>> +                             address_level_offset(dfn_to_daddr(dfn), level),
>> +                             level,
>> +                             (hd->platform_ops->page_sizes &
>> +                              (1UL << level_to_offset_bits(level + 1))
>> +                              ? PTE_kind_leaf : PTE_kind_table));
> 
> So this works because on what we believe to be affected models the
> only supported page sizes are 4K?

Yes.
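
To spell out why 4k-only support leads to PTE_kind_table being used
here: a minimal sketch, assuming for illustration that level 1 is the 4k
leaf level and that level_to_offset_bits(l) expands to 12 + 9 * (l - 1);
both are assumptions made for the example, not taken from the tree.

#include <stdio.h>

/* Assumed expansion, for illustration only. */
#define level_to_offset_bits(l) (12 + 9 * ((l) - 1))

static const char *kind_for(unsigned long page_sizes, unsigned int level)
{
    /* A superpage one level up is only usable if its size is supported. */
    return (page_sizes & (1UL << level_to_offset_bits(level + 1)))
           ? "PTE_kind_leaf" : "PTE_kind_table";
}

int main(void)
{
    printf("%s\n", kind_for(1UL << 12, 1));                 /* 4k only -> PTE_kind_table */
    printf("%s\n", kind_for((1UL << 12) | (1UL << 21), 1)); /* +2M -> PTE_kind_leaf */
    return 0;
}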

> Do we want to do the same with AMD if we don't allow 512G super pages?

Why? They don't have a similar flaw.

Jan
