Update on fight with rge on my laptop:

I've tried to set acpi and rge options in "eeprom" (bootenv.rc),
and indeed bound the NIC to a particular CPU core. I've also
tried an alternate driver (gani), and disabled HW checksums.
None of these helped much.

The intermittent hangs still appear with both the stock rge and
gani-2.6.9 drivers. Almost any activity on the NIC (about 180
to 220 Kb worth of downloads with wget or ssh) makes the driver
process or issue(?) between 75k and 110k interrupts per second,
eating 25%-35% of a CPU core and often locking up the mouse/kbd.
The lockups don't happen 100% of the time, and there is not
always even a noticeable lag. X11 screen updates still come
through, so I can see this storm of intrs in "vmstat" and
"intrstat". Sometimes the storms dissipate on their own (a
watchdog timer in the driver?), and often they disappear when I
unplug the network, wait several seconds (>5) and plug it back
in. Apparently, this requires any TCP sessions to be restarted
(ssh, rsync, wget, etc.)

Now I have not yet seen the networking disappear completely until
reboot, but its usability is still unpredictable and usually bad.

Ideas welcome, e.g. how can I trace what's happening in the intr
storm? I'd guess some infinite loop, small enough to run very
quickly. Maybe it is hardware related, since it happens with two
drivers (rge and gani; I didn't check how close their code is)...
I wonder if ultimately this condition can be detected and aborted
early, i.e. within a second and causing no networking loss to upper
layers.
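
As a starting point for detecting the condition from userland, the
per-device rates that "intrstat" prints could be checked against a
threshold. The following is only a minimal sketch in Python, not a
fix. It assumes intrstat was run with a 1-second interval, so each
count approximates interrupts per second, and that each data row
looks like "device#N | count %tim count %tim ...". The sample text
and the 50000 threshold are made-up illustrations (chosen to resemble
the rates described above), not captured output, and the function
names are hypothetical.

```python
import re

# Assumed shape of one intrstat data row, e.g.:
#        rge#0 |         0  0.0     92000 31.5
# Header and separator lines contain no "name#instance" and are skipped.
ROW = re.compile(r"^\s*(\S+#\d+)\s*\|(.*)$")


def parse_intrstat(text):
    """Return {device: [(count, pct_time), ...]} with one pair per CPU."""
    rates = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, or blank line
        fields = m.group(2).split()
        # Fields alternate: count, %tim, count, %tim, ...
        pairs = [(int(fields[i]), float(fields[i + 1]))
                 for i in range(0, len(fields) - 1, 2)]
        rates[m.group(1)] = pairs
    return rates


def find_storms(text, threshold=50000):
    """List (device, cpu_index, count, pct_time) at or above threshold."""
    storms = []
    for dev, pairs in parse_intrstat(text).items():
        for cpu, (count, pct) in enumerate(pairs):
            if count >= threshold:
                storms.append((dev, cpu, count, pct))
    return storms


# Illustrative sample only; the numbers are invented for this sketch.
SAMPLE = """\
      device |      cpu0 %tim      cpu1 %tim
-------------+------------------------------
      ehci#0 |        12  0.0         0  0.0
       rge#0 |         0  0.0     92000 31.5
"""

if __name__ == "__main__":
    for dev, cpu, count, pct in find_storms(SAMPLE):
        print(f"possible storm: {dev} on cpu{cpu}: {count} intr/s, {pct}% time")
```

Feeding it live output (for example by piping "intrstat 1" into a
small wrapper) would only tell when a storm starts; what to do about
it, such as unplumbing and replumbing the interface, is a separate
question.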

2012-09-28 10:48, Jim Klimov wrote:
Thanks Marion for the pointers, I also figured it looks like
an interrupt problem (back in MS-DOS times, "kicking" a computer
that hung in a game by moving the mouse could unhang it ;) )

However, now that I've checked, I don't see rge sharing an
IRQ vector with anything. Its interrupt is of the MSI type and is
currently bound to CPU1 (the only other MSI interrupt belongs to
a pcieb and is bound to CPU0); I wonder if the CPU binding matters
for this bug, and if it can be controlled to test.

The only shared interrupts I see are two ehci driver instances
on one IRQ and three ohci instances on another.

Also, this is not an NVidia box but an AMD/ATI one (with the
integrated APU = CPU+GPU).

I've had one boot after my email where I again worked for hours
and used the net intensively without problems; then I had
boots where the net hung upon IO (i.e. I could start an "scp"
session from another machine and authenticate with a password,
but the actual file copy hung it), and currently the link is
down right from bootup...

Thanks for more ideas,
//Jim

2012-09-27 6:50, Marion Hakanson wrote:
[email protected] said:
. . .
    Ultimately, after about one hour of such intermittent work with no
actual usage on my behalf, the LAN interface went down and did not come
back up until a full reboot (I did not try fastboot though). I have no
idea if this will be reproducible :)
. . .

I wonder if this is similar to something I've seen, which I think was
eventually categorized as an interrupt-sharing problem.  On my systems,
the graphics locked up, along with USB mouse, and on one of them an
internal disk interface also had timeouts during the "freeze".  All
the affected devices were sharing the same IRQ.

You can see what OI thinks your laptop is doing, interrupt-wise, via:
    echo "::interrupts -d" | mdb -k

I'm not sure how one can fix it.  I was able to disable enough USB ports
in the BIOS on one machine to alleviate the IRQ-sharing, and the problem
stopped happening there.  Here's the (closed) bug report:
    https://www.illumos.org/issues/1625

Regards,

Marion



_______________________________________________
OpenIndiana-discuss mailing list
[email protected]
http://openindiana.org/mailman/listinfo/openindiana-discuss
