Hi, the regression tracker here. What's the status of this issue? Was the problem fixed? It seems nothing happened for more than 10 days -- or did the discussion move somewhere else? Ciao, Thorsten
On 20.09.2017 02:30, Chuck Ebbert wrote:
> On Tue, 19 Sep 2017 16:51:06 +0100
> Marc Zyngier <[email protected]> wrote:
>
>> On 19/09/17 16:40, Yanko Kaneti wrote:
>>> On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
>>>> On 19/09/17 16:12, Yanko Kaneti wrote:
>>>>> Hello,
>>>>>
>>>>> Fedora rawhide config here.
>>>>> AMD FX-8370E
>>>>>
>>>>> Bisected a problem to:
>>>>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts
>>>>> actually using it)
>>>>>
>>>>> It seems to be causing stalls, short-lived or long-lived lockups
>>>>> very shortly after boot. Everything becomes jerky.
>>>>>
>>>>> The only visible indication in the log is something like:
>>>>> ....
>>>>> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
>>>>> [   59.802134] clocksource:                       'hpet' wd_now: 3326e7aa wd_last: 329956f8 mask: ffffffff
>>>>> [   59.802137] clocksource:                       'tsc' cs_now: 423662bc6f cs_last: 41dfc91650 mask: ffffffffffffffff
>>>>> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
>>>>> [   59.802158] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>>>>> [   59.802161] sched_clock: Marking unstable (59802142067, 15510)<-(59920871789, -118714277)
>>>>> [   60.015604] clocksource: Switched to clocksource hpet
>>>>> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 209.660 msecs
>>>>> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
>>>>> ....
>>>>>
>>>>> Just reverting that commit on top of Linus' mainline cures all the
>>>>> symptoms.
>>>>
>>>> Interesting. Do you still get HPET interrupts?
>>>
>>> Sorry, I might need some basic help here (i.e. where do I count
>>> them...)
>>
>> /proc/interrupts should display them.
>>
>>> After the watchdog switches the clocksource to hpet the system is
>>> still somewhat alive, so I'll guess some clock is still
>>> ticking....
>>
>> Probably, but I suspect they're not hitting the right CPU, hence the
>> lockups.
>>
>> Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
>> the net for a few days.
>>
>> Thomas, any insight?
>
> Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
> can be correct:
>
>	struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
>	unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
>
>	if (!cpu_mask)
>		return -EINVAL;
>	*apicid = (unsigned int)cpu_mask;
>	cpumask_bits(effmsk)[0] = cpu_mask;
>
> Before that patch, this function wrote to the effective mask
> unconditionally. After, it only writes to effective_mask if it is
> already non-zero.
>
> http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
> http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
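For anyone following the thread without the x86 APIC code in their head, below is a rough userspace-only sketch of the computation in the snippet Chuck quotes. It is an illustration under assumptions, not the kernel code: only APIC_ALL_CPUS (0xFF, one destination bit per CPU in flat logical mode) and the overall shape are taken from the quoted snippet; the struct cpumask / struct irq_data plumbing is replaced by plain 64-bit words, and flat_mask_to_apicid() is a made-up stand-in name.

	/*
	 * Userspace model of the quoted snippet, for illustration only --
	 * NOT the kernel implementation.  A plain 64-bit word stands in
	 * for cpumask_bits(mask)[0] and for the effective-affinity mask.
	 */
	#include <errno.h>
	#include <stdint.h>
	#include <stdio.h>

	#define APIC_ALL_CPUS 0xFFu  /* flat logical mode addresses CPUs 0-7 only */

	static int flat_mask_to_apicid(uint64_t affinity_word,
				       uint64_t *effective_word,
				       unsigned int *apicid)
	{
		uint64_t cpu_mask = affinity_word & APIC_ALL_CPUS;

		if (!cpu_mask)
			return -EINVAL;           /* no addressable CPU in the mask */

		*apicid = (unsigned int)cpu_mask; /* logical destination bitmap */
		*effective_word = cpu_mask;       /* effective affinity := low 8 bits */
		return 0;
	}

	int main(void)
	{
		uint64_t eff;
		unsigned int apicid;

		/* affinity allows CPUs 0 and 3 -> logical destination 0x09 */
		if (!flat_mask_to_apicid(0x09, &eff, &apicid))
			printf("apicid=%#x effective=%#llx\n",
			       apicid, (unsigned long long)eff);

		/* affinity only names CPU 12 -> flat mode cannot target it */
		if (flat_mask_to_apicid(1ULL << 12, &eff, &apicid) == -EINVAL)
			printf("CPUs above 7 are not addressable in flat mode\n");

		return 0;
	}

The part that matters for the regression is the last assignment (the write to the effective-affinity mask), which is what Chuck's before/after comparison is about and what Marc suspects above: the HPET interrupts may still fire but no longer hit the CPU the kernel thinks they do.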

