[ Davem - see the final conclusion: this might not be a driver bug as much 
  as a netconsole problem, where netconsole might perhaps continue sending 
  on a device that really can't take it any more? ]

On Tue, 13 Jun 2006, Stephen Hemminger wrote:
>
> There were a several problems buried in suspend/resume. The real
> failure was caused by the idle timer not being stopped/restarted.
> But several other races, and cleanups were needed.
> 
> Since I don't have a machine that will suspend successfully with
> that hardware, I can't test it.

With this, I get a page-fault in sky2_tx_complete+0x91 (with traceback to 
sky2_poll, net_rx_action, do_softirq, do_IRQ, skb_release_data, kfree_skb, 
sky2_rx_clean, sky2_down, sky2_suspend, pci_device_suspend, all the way 
down to suspend_device()).

So an IRQ happened while the sky2 driver was doing sky2_rx_clean, which is 
just _after_ it did "sky2_tx_clean()", and then the TX side was unhappy 
for some reason.

Again, the driver has actually tried to disable its _own_ irq, but that 
doesn't much help. Also, with write posting, even its own irq might have 
gotten delayed (ie if you really want to synchronize irq's, you need to 
read from the device, and then also wait a bit to see that the irq isn't 
being posted in the _other_ direction), but in the presence of shared 
irq's, it just doesn't do anything at all.
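
To make the write-posting point concrete, here is a toy userspace model, not sky2 code. Every name in it (toy_dev, toy_write_mask, toy_read_mask, toy_can_raise_irq) is made up for illustration, and it assumes a bridge that holds exactly one posted write until the next read from the device. It only shows why clearing an interrupt mask is not enough until a read has forced the write through.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a device behind a bridge that may buffer (post) writes. */
struct toy_dev {
    uint32_t irq_mask;      /* the register as the device itself sees it */
    uint32_t posted_value;  /* a write still sitting in the bridge */
    bool     posted;        /* true while that write is in flight */
};

/* A posted write returns to the CPU before it reaches the device. */
static void toy_write_mask(struct toy_dev *d, uint32_t v)
{
    d->posted_value = v;
    d->posted = true;
}

/* A read cannot complete until earlier posted writes have landed. */
static uint32_t toy_read_mask(struct toy_dev *d)
{
    if (d->posted) {
        d->irq_mask = d->posted_value;
        d->posted = false;
    }
    return d->irq_mask;
}

/* The device decides based only on what it has actually received. */
static bool toy_can_raise_irq(const struct toy_dev *d)
{
    return d->irq_mask != 0;
}
```

In this model, writing 0 to the mask leaves toy_can_raise_irq() true until toy_read_mask() has been called once. The model says nothing about an interrupt that is already on its way to the CPU, which is the second half of the problem described above.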

I can't seem to get a bigger VGA console on the Mac mini, so I'm unable to 
see the exact register values.

Btw, this probably happens with my patch too, and is likely 
timing-related.

Oh, and to make matters worse, I also enabled netconsole (in order to see 
what goes wrong), which is probably what brought on the horrid timing 
issue (ie packets going out _just_ at the right time saying "shutting 
down sky2").

Btw, that "sky2_tx_complete+0x91" seems to be 

        loop:
                inc    %edx
                mov    0x9c(%ecx),%eax
        **      movzwl 0x4(%eax),%eax   **
                cmp    %eax,%edx
                jb loop

which in turn is:

                for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                        struct tx_ring_info *fre;
                        fre = sky2->tx_ring + RING_NEXT(put + i, TX_RING_SIZE);
                        pci_unmap_page(pdev, pci_unmap_addr(fre, mapaddr),
                                       skb_shinfo(skb)->frags[i].size,
                                       PCI_DMA_TODEVICE);
                }

since pci_unmap_page() is a no-op here ;)

So it looks like it's the "skb_shinfo(skb)->nr_frags" access that oopses.
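
A simplified model of why those two loads would map to that expression. The struct layouts below are stand-ins, not the kernel's definitions: the assumption is that the shared info begins with a 4-byte reference count followed by a 16-bit nr_frags (which would fit the 0x4 offset and the zero-extending 16-bit load), and that skb_shinfo() is derived from a pointer stored in the skb. The names shinfo_model, skb_model, model_shinfo and model_nr_frags are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the shared-info block behind the skb data. */
struct shinfo_model {
    int32_t  dataref;   /* 4-byte reference count (assumed first field) */
    uint16_t nr_frags;  /* 16-bit fragment count, hence a movzwl load */
    uint16_t pad;       /* remaining fields omitted in this model */
};

/* Simplified stand-in for the one sk_buff field that matters here. */
struct skb_model {
    unsigned char *end; /* shared info is assumed to start here */
};

/* Model of skb_shinfo(): reinterpret the end pointer as shared info. */
static struct shinfo_model *model_shinfo(const struct skb_model *skb)
{
    return (struct shinfo_model *)skb->end;
}

/* Same shape as the loop condition: two dependent loads. */
static unsigned int model_nr_frags(const struct skb_model *skb)
{
    return model_shinfo(skb)->nr_frags;
}
```

Under these assumptions, offsetof(struct shinfo_model, nr_frags) is 4. The first load fetches a pointer out of the skb and the second dereferences it, so a garbage skb gives a garbage pointer and the fault lands on the second instruction.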

Which probably means that

        skb = re->skb;

just got garbage (remember: the pci_unmap_single() directly after it is 
_also_ a no-op, so it wouldn't oops there).

I dunno the details. I'd have _expected_ tx_cons to be equal to tx_prod 
here (since we just did a sky2_tx_clean() before), and the loop to not 
have been entered at all, but I wonder if maybe it's the netconsole that 
doesn't honor "netif_stop_queue()"?
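
The expectation above, and the way it could be violated, can be sketched with a toy ring. This is only an illustration of the hypothesis, not the driver's code, and it does not show that netconsole actually behaves this way. All names (toy_ring, toy_xmit, toy_xmit_unchecked, toy_clean, toy_complete) are hypothetical, and toy_complete() assumes "done" is a valid ring index.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u  /* power of two, so masking wraps the index */
#define TOY_NEXT(i) (((i) + 1u) & (TOY_RING_SIZE - 1u))

struct toy_ring {
    const void *slot[TOY_RING_SIZE]; /* stands in for re->skb */
    unsigned int prod;               /* next slot the sender fills */
    unsigned int cons;               /* next slot completion reclaims */
    bool stopped;                    /* stands in for a stopped queue */
};

/* Normal path: refuses to queue once the queue has been stopped. */
static bool toy_xmit(struct toy_ring *r, const void *pkt)
{
    if (r->stopped || TOY_NEXT(r->prod) == r->cons)
        return false;
    r->slot[r->prod] = pkt;
    r->prod = TOY_NEXT(r->prod);
    return true;
}

/* Hypothetical side path that never looks at the stopped flag. */
static bool toy_xmit_unchecked(struct toy_ring *r, const void *pkt)
{
    if (TOY_NEXT(r->prod) == r->cons)
        return false;
    r->slot[r->prod] = pkt;
    r->prod = TOY_NEXT(r->prod);
    return true;
}

/* Shutdown: stop the queue and drop everything still on the ring. */
static void toy_clean(struct toy_ring *r)
{
    r->stopped = true;
    while (r->cons != r->prod) {
        r->slot[r->cons] = NULL;
        r->cons = TOY_NEXT(r->cons);
    }
}

/* Completion: walk from cons up to done; returns slots visited. */
static unsigned int toy_complete(struct toy_ring *r, unsigned int done)
{
    unsigned int visited = 0;

    while (r->cons != done) {
        r->slot[r->cons] = NULL;
        r->cons = TOY_NEXT(r->cons);
        visited++;
    }
    return visited;
}
```

Right after toy_clean(), toy_complete(&r, r.prod) visits nothing, which is the "loop not entered at all" case. If anything advances prod through the unchecked path after the clean, the same call visits a slot again while teardown is still in progress.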

Dunno.

                        Linus