On Wed, Oct 22, 2025 at 11:21:15AM +0200, Jan Beulich wrote:
> On 21.10.2025 15:49, Roger Pau Monné wrote:
> > On Tue, Oct 21, 2025 at 08:42:13AM +0200, Jan Beulich wrote:
> >> On 20.10.2025 18:22, Roger Pau Monné wrote:
> >>> On Mon, Oct 20, 2025 at 01:18:34PM +0200, Jan Beulich wrote:
> >>>> Using dynamically allocated / maintained vectors has several downsides:
> >>>> - possible nesting of IRQs due to the effects of IRQ migration,
> >>>> - reduction of vectors available for devices,
> >>>> - IRQs not moving as intended if there's shortage of vectors,
> >>>> - higher runtime overhead.
> >>>>
> >>>> As the vector also doesn't need to be of any priority (first and foremost
> >>>> it really shouldn't be of higher or same priority as the timer IRQ, as
> >>>> that raises TIMER_SOFTIRQ anyway), avoid any "ordinary" vectors altogether
> >>>> and use a vector from the 0x10...0x1f exception vector space. Exception vs
> >>>> interrupt can easily be distinguished by checking for the presence of an
> >>>> error code.
> >>>>
> >>>> With a fixed vector, less updating is now necessary in
> >>>> set_channel_irq_affinity(); in particular channels don't need transient
> >>>> masking anymore, as the necessary update is now atomic. To fully leverage
> >>>> this, however, we want to stop using hpet_msi_set_affinity() there. With
> >>>> the transient masking dropped, we're no longer at risk of missing events.
> >>>>
> >>>> In principle a change to setup_vector_irq() would be necessary, but only
> >>>> if we used low-prio vectors as direct-APIC ones. Since the change would be
> >>>> at best benign here, it is being omitted.
> >>>>
> >>>> Fixes: 996576b965cc ("xen: allow up to 16383 cpus")
> >>>> Signed-off-by: Jan Beulich <[email protected]>
> >>>> Release-Acked-by: Oleksii Kurochko <[email protected]>
> >>>> ---
> >>>> This is an alternative proposal to
> >>>> https://lists.xen.org/archives/html/xen-devel/2014-03/msg00399.html.
> >>>>
> >>>> Should we keep hpet_msi_set_affinity() at all? We'd better not have the
> >>>> generic IRQ subsystem play with our IRQs' affinities ... (If so, this
> >>>> likely would want to be a separate patch, though.)
> >>>
> >>> I think that needs to become a no-op, with possibly an ASSERT?  Is it
> >>> possible for dom0 to try to balance this IRQ?  I would think not.
> >>
> >> I'd consider it an error if that was possible. But then the same goes for
> >> other Xen-internal IRQs, like the IOMMU ones. They all implement a
> >> .set_affinity hook ...
> > 
> > We need such hook for fixup_irqs() at least, so that interrupts can be
> > evacuated when an AP goes offline.
> 
> Hmm, yes. Just not here.
> 
> >>>> @@ -476,19 +486,50 @@ static struct hpet_event_channel *hpet_g
> >>>>  static void set_channel_irq_affinity(struct hpet_event_channel *ch)
> >>>>  {
> >>>>      struct irq_desc *desc = irq_to_desc(ch->msi.irq);
> >>>> +    struct msi_msg msg = ch->msi.msg;
> >>>>  
> >>>>      ASSERT(!local_irq_is_enabled());
> >>>>      spin_lock(&desc->lock);
> >>>> -    hpet_msi_mask(desc);
> >>>> -    hpet_msi_set_affinity(desc, cpumask_of(ch->cpu));
> >>>> -    hpet_msi_unmask(desc);
> >>>> +
> >>>> +    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
> >>>> +
> >>>> +    /*
> >>>> +     * Open-coding a reduced form of hpet_msi_set_affinity() here.  With the
> >>>> +     * actual update below (either of the IRTE or of [just] message address;
> >>>> +     * with interrupt remapping message address/data don't change) now being
> >>>> +     * atomic, we can avoid masking the IRQ around the update.  As a result
> >>>> +     * we're no longer at risk of missing IRQs (provided hpet_broadcast_enter()
> >>>> +     * keeps setting the new deadline only afterwards).
> >>>> +     */
> >>>> +    cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
> >>>> +
> >>>>      spin_unlock(&desc->lock);
> >>>>  
> >>>> -    spin_unlock(&ch->lock);
> >>>> +    msg.dest32 = cpu_physical_id(ch->cpu);
> >>>> +    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
> >>>> +    msg.address_lo |= MSI_ADDR_DEST_ID(msg.dest32);
> >>>> +    if ( msg.dest32 != ch->msi.msg.dest32 )
> >>>> +    {
> >>>> +        ch->msi.msg = msg;
> >>>> +
> >>>> +        if ( iommu_intremap != iommu_intremap_off )
> >>>> +        {
> >>>> +            int rc = iommu_update_ire_from_msi(&ch->msi, &msg);
> >>>>  
> >>>> -    /* We may have missed an interrupt due to the temporary masking. */
> >>>> -    if ( ch->event_handler && ch->next_event < NOW() )
> >>>> -        ch->event_handler(ch);
> >>>> +            ASSERT(rc <= 0);
> >>>> +            if ( rc > 0 )
> >>>> +            {
> >>>> +                ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
> >>>> +                ASSERT(msg.address_lo ==
> >>>> +                       hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
> >>>> +            }
> >>>
> >>> The sequence of asserts seems wrong here: the asserts inside the rc > 0
> >>> check will never trigger, because there's an ASSERT(rc <= 0)
> >>> ahead of them?
> >>
> >> Hmm. My way of thinking was that if we get back 1 (which we shouldn't),
> >> we ought to check (and presumably fail on) data or address having changed.
> > 
> > Right, but the ASSERT(rc <= 0) will prevent reaching any of the
> > followup ASSERTs if rc == 1?
> 
> Which is no problem, as we'd be dead already anyway if the first assertion
> triggered. Nevertheless I've switched the if() to >= 0 (which then pointed
> out a necessary change in AMD IOMMU code).

Right, so an adjusted if condition plus an ASSERT_UNREACHABLE() at
the end of the if code block?
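
IOW, just to make sure I'm reading you correctly, something along the
lines of the below (only a sketch; keeping the ASSERT_UNREACHABLE()
guarded by rc > 0 is my assumption):

    if ( rc >= 0 )
    {
        ASSERT(msg.data == hpet_read32(HPET_Tn_ROUTE(ch->idx)));
        ASSERT(msg.address_lo ==
               hpet_read32(HPET_Tn_ROUTE(ch->idx) + 4));
        if ( rc > 0 )
            ASSERT_UNREACHABLE();
    }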

> >  IOW, we possibly want:
> > 
> >             if ( rc > 0 )
> >             {
> >                 dprintk(XENLOG_ERR,
> >                     "Unexpected HPET MSI setup returned: data: %#x address: %#lx expected data %#x address %#lx\n",
> >                         msg.data, msg.address,
> >                         ch->msi.msg.data, ch->msi.msg.address);
> >                 ASSERT_UNREACHABLE();
> >                 hpet_msi_mask(desc);
> >                 hpet_write32(msg.data, HPET_Tn_ROUTE(ch->idx));
> >                 hpet_write32(msg.address_lo, HPET_Tn_ROUTE(ch->idx) + 4);
> >                 hpet_msi_unmask(desc);
> >             }
> >             ASSERT(!rc);
> 
> To be honest, for my taste this goes too far as to what follows an
> ASSERT_UNREACHABLE().

I can understand that.  It's the best way I've come up with for attempting
to recover from a possible error in the release case, but I don't
particularly like it either.
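
A lighter alternative (sketch only; the message text and treating
ch->idx as unsigned are assumptions of mine) would be to just log and
assert, without touching the hardware:

    if ( rc > 0 )
    {
        printk(XENLOG_ERR "HPET: unexpected MSI update result for channel %u\n",
               ch->idx);
        ASSERT_UNREACHABLE();
    }

though that of course gives up on keeping the HPET registers consistent
in release builds.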

> > I'm unsure about attempting to propagate the returned values on release
> > builds, I guess it's slightly better than possibly using an outdated
> > RTE entry?  Albeit this should never happen.
> 
> Yes to the last remark; I don't actually see what you would want to do
> with the propagated return value.

OK, I can see this not being clear.  By propagate here I mean
propagate to the hardware registers, not to the function caller.

> > Also, should the desc->arch.cpu_mask update only be done after the MSI
> > fields have been correctly updated, so that in case of failure of
> > iommu_update_ire_from_msi() we could return early from the function
> > and avoid changing desc->arch.cpu_mask?
> 
> Hmm, yes, I could do that, but then also in hpet_msi_set_affinity().
> However, as this needs doing under the IRQ descriptor lock, I'd have to
> either extend the locked region here (again), or re-acquire the lock
> later. Neither looks very attractive to me.

Hm, I guess given the point in the release cycle we can leave it as-is.  It
would be nice, but this change is big enough as it is.
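
Just for the record, the shape I was thinking of would indeed have
meant re-acquiring the lock, something like (sketch only, error
handling elided):

    spin_lock(&desc->lock);
    per_cpu(vector_irq, ch->cpu)[HPET_BROADCAST_VECTOR] = ch->msi.irq;
    spin_unlock(&desc->lock);

    /* ... IRTE / message update, bailing out early on failure ... */

    spin_lock(&desc->lock);
    cpumask_copy(desc->arch.cpu_mask, cpumask_of(ch->cpu));
    spin_unlock(&desc->lock);

so I can see why neither extending the locked region nor re-acquiring
the lock looks attractive.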

Thanks, Roger.
