[Kernel-packages] [Bug 488898] Re: [hardy; e1000] ethtool operation to bonding slave device may hang system; corrupt data

Bug Watch Updater Fri, 27 Oct 2017 04:26:33 -0700

Launchpad has imported 18 comments from the remote bug at
https://bugzilla.redhat.com/show_bug.cgi?id=473258.


If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2008-11-27T12:51:46+00:00 Flavio wrote:

Description of Problem:
 When the ethtool command invoked for getting the information of the
 slave device under the bonding device, the system hung occurred.

 We think that the system hung occurred at the obtaining process of
 the semaphore in the e1000 driver.

 This problem is the same as IT#221541/BZ#445951 that occurs by RHEL5.1.

Version-Release number of selected component:

 Red Hat Enterprise Linux Version Number: RHEL4
 Release Number: 4.7
 Architecture: x86
 Kernel Version: 2.6.9-78.ELsmp
 Related Package Version: kernel-2.6.9-78.ELsmp
 Related Middleware / Application: /sbin/ethtool

Drivers or hardware or architecture dependency:
 depend on following LAN controller and driver.
  - Intel 80003ES2LAN Gigabit Ethernet Controller
  - e1000 driver
  - bonding driver

How reproducible:
 Sometimes

Step to Reproduce:
  - create a bonding device using Intel 80003ES2LAN Gigabit Ethernet Cnotroller
      [device configuration]
       bond0 ---+--- eth0 (80003ES2LAN)
                |
                +--- eth1 (80003ES2LAN)

      1) Add the following to /etc/modprobe.conf
           alias bond0 bonding
           options bond0 mode=1 miimon=1 primary=eth0 updelay=1000
      2) Set the following to /etc/sysconfig/network-scripts/ifcfg-bond0
           DEVICE=bond0
           BOOTPROTO=none
           BROADCAST=192.168.0.255
           IPADDR=192.168.0.5
           NETMASK=255.255.255.0
           NETWORK=192.168.0.0
           ONBOOT=yes
           USERCTL=no
      3) Set the following to /etc/sysconfig/network-scripts/ifcfg-ethX
           DEVICE=ethX
           BOOTPROTO=none
           HWADDR=XX:XX:XX:XX:XX:XX
           MASTER=bond0
           SLAVE=yes
           USERCTL=no
           ONBOOT=yes

  - execute a ethtool command to slave device.
      # /sbin/ethtool eth0
    This may happen easily under the multiple invocation of command.

Actual Results:
 The system hung up.

Expected Results:
 The system does not hung up.

Hardware configuration:

 Model: PRIMERGY RX600S4
 CPU Info: Intel Xeon Processor 2.93GHz x 1
 Memory Info: 2[GB]
 Hardware Component Information:
   M/B: Intel 7300
   LAN: Intel 80003ES2LAN Gigabit Ethernet Controller (onbord)


Call Trace:
 [<c021203c>] __handle_sysrq+0x62/0xd9
 [<c020cd4c>] kbd_event+0x83/0xb0
 [<c02737e9>] input_event+0x331/0x351
 [<c0271bd8>] hidinput_hid_event+0x1d0/0x208
 [<c0224142>] get_device+0xe/0x14
 [<c026e1a5>] hid_process_event+0x28/0x52
 [<c026e47b>] hid_input_field+0x2ac/0x2f9
 [<c026e541>] hid_input_report+0x79/0x98
 [<c026e5f6>] hid_irq_in+0x96/0xf2
 [<c026577e>] usb_hcd_giveback_urb+0x14/0x3e
 [<f88aeceb>] uhci_finish_urb+0x27/0x32 [uhci_hcd]
 [<f88aed28>] uhci_finish_completion+0x32/0x38 [uhci_hcd]
 [<f88aef00>] uhci_irq+0x19b/0x240 [uhci_hcd]
 [<c02657ce>] usb_hcd_irq+0x26/0x4b
 [<c01074d2>] handle_IRQ_event+0x25/0x4f
 [<c0107a32>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02e13b4>] common_interrupt+0x18/0x20
 [<c0112e1e>] delay_pmtmr+0xd/0x13
 [<c01c6ca1>] __delay+0x9/0xa
 [<f896e267>] e1000_get_software_semaphore+0x6c/0x7d [e1000]
 [<f896e184>] e1000_get_hw_eeprom_semaphore+0x17/0x66 [e1000]
 [<f896a091>] e1000_swfw_sync_acquire+0x42/0xdb [e1000]
 [<f896a4a3>] e1000_write_kmrn_reg+0x2d/0x62 [e1000]
 [<f8969305>] e1000_configure_kmrn_for_10_100+0x1e/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<f88f8679>] bond_update_speed_duplex+0x38/0xea [bonding1]
 [<c011e838>] __wake_up_common+0x36/0x51
 [<f88fa31e>] bond_mii_monitor+0x366/0x3f7 [bonding1]
 [<f88f9fb8>] bond_mii_monitor+0x0/0x3f7 [bonding1]
 [<c012a939>] run_timer_softirq+0x123/0x145
 [<c0126d84>] __do_softirq+0x4c/0xb1
 [<c01081a3>] do_softirq+0x4f/0x56
 =======================
 =======================
 [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02e1436>] apic_timer_interrupt+0x1a/0x20
 [<f896a09d>] e1000_swfw_sync_acquire+0x4e/0xdb [e1000]
 [<f896a198>] e1000_read_phy_reg+0x2e/0xc4 [e1000]
 [<f8969330>] e1000_configure_kmrn_for_10_100+0x49/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<c028de3e>] ethtool_get_settings+0x2f/0x85
 [<c028f438>] dev_ethtool+0xc6/0x27a
 [<c028d318>] dev_ioctl+0x2c7/0x4ab
 [<c02c0ae7>] udp_ioctl+0x0/0xc5
 [<c02c6c8e>] inet_ioctl+0xa0/0xa5
 [<c0283bcd>] sock_ioctl+0x28c/0x2b4
 [<c016cc5e>] sys_ioctl+0x227/0x269
 [<c02e09db>] syscall_call+0x7/0xb

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/0

------------------------------------------------------------------------
On 2008-11-27T14:43:34+00:00 Flavio wrote:

The last trace shows the CPU still looping:
e1000_swfw_sync_acquire+0x4e/0xdb
...
3414     while (timeout) {
3415             if (e1000_get_hw_eeprom_semaphore(hw))
3416                 return -E1000_ERR_SWFW_SYNC;
3417
3418             swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);
** CPU was around here ** 
3419             if (!(swfw_sync & (fwmask | swmask))) {
3420                 break;
3421             }
3422
3423             /* firmware currently using resource (fwmask) */
3424             /* or other software thread currently using resource (swmask
        ) */
3425             e1000_put_hw_eeprom_semaphore(hw);
3426             mdelay(5);
3427             timeout--;
3428     }

then it was interrupted and do_softirq() was called and also went thru 
e1000_swfw_sync_acquire(). There it tries to get software_semaphore which
depends on a HW register E1000_SWSM_SMBI. I'd say this should deadlock
but it is looping with a timeout, so after some time it should fail and 
moves on normally.

The stack trace with do_softirq() code path shows the CPU is looping, 
waiting on mdelay().

The delay on e1000_swfw_sync_acquire() can be 1sec and on  
e1000_get_hw_eeprom_semaphore() it is hw->eeprom.word_size + 1
This board doesn't have eeprom.word_size ready as other models, it reads 
from eeprom and I have no idea yet how big it can be.

An easy way to fix this would be adding local_bh_{dis,ena}able() at
e1000_swfw_sync_acquire() to avoid do_softirq() to get in the middle,
but looking at Linus tree or from sf.net this board is supported only
by e1000e and e1000_get_settings() didn't call anything else, so this 
bug shouldn't happen there.

next questions to solve:
1) Is this really hung? Or after some time it moves on?

2) What is the size of this eeprom? Perhaps the timeout is just too big.

3) Is there any plan for e1000/e1000e driver on RHEL4u7? I think if
we move this board to e1000e and update this driver, it should fix too.

Flavio

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/1

------------------------------------------------------------------------
On 2008-12-15T16:40:51+00:00 Issue wrote:

Hello,

This is the output from console log:

-> e1000_swfw_sync_acquire: timeout=200
  -> e1000_get_software_semaphore: timeout=2049
  <- e1000_get_software_semaphore: timeout=2049
  -> e1000_get_hw_eeprom_semaphore: timeout=2049
  <- e1000_get_hw_eeprom_semaphore: timeout=2049
<- e1000_swfw_sync_acquire: timeout=200

and it is repeating this part, so it's not a deadlock neither a HW lock
problem. The stack trace also looks good to me.

Well, I'm thinking if bond_mii_monitor() is too fast then, so I went
check

your bonding options and I found these lines below:

  $ grep bond rx600s4-372001/etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=1 miimon=1 primary=eth2 updelay=1000
  install bond1 /sbin/modprobe --ignore-install bonding -o bonding1 mode=1

  miimon=1 primary=eth3 updelay=2000

See the miimon=1 (default is 100), perhaps it is too fast.

Can you get 5 shots of sysrq+t followed by sysrq+w while reproducing it?
i.e:
1. <sysrq+t>
2. <sysrq+t>
3. wait 1 sec
4. repeat this 5x.
there is no need to use debug kernel.
 
Can you change that to miimon to 100 and see if it helps?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner 
 issue 231780

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/2

------------------------------------------------------------------------
On 2008-12-22T13:40:10+00:00 Issue wrote:

Hi,

The traces shows CPU looping but it didn't match with previous log
provided.
Can you confirm if the data provided with the test kernel
kernel-2.6.9-78.10.EL.IT231780.1.src.rpm was gathered while the problem 
was happening?

I'm referring to this update:
---
I booted by this kernel, and gathered the console log.
I append it.
Could you check the contents?
---

If not, can you get it again while the problem is happening?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner 
 issue 231780

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/3

------------------------------------------------------------------------
On 2008-12-29T17:30:42+00:00 Flavio wrote:

Created attachment 327926
output of debug kernel 2.6.9-78.10.EL.IT231780.1

Hi,

The problem here seems to be the big loop with delay on
e1000_get_software_semaphore().

e1000_get_software_semaphore(struct e1000_hw *hw)
{
    int32_t timeout = hw->eeprom.word_size + 1;

    while (timeout) {
         ...
    
        mdelay(1);
        timeout--;
    }
}

The timeout value there is 2049, which means almost 2 seconds inside of
that while() {}. The bonding timer expires every 100ms, (in customer 
case was 1ms), then it calls bond_mii_monitor() and ends up there again
for more 2 seconds leaving ethtool stuck.

This is the ethtool stack interrupted:
 [<c02e13b4>] common_interrupt+0x18/0x20
 [<c0112e1e>] delay_pmtmr+0xd/0x13
 [<c01c6ca1>] __delay+0x9/0xa
 [<f896e267>] e1000_get_software_semaphore+0x6c/0x7d [e1000]
 [<f896e184>] e1000_get_hw_eeprom_semaphore+0x17/0x66 [e1000]
 [<f896a091>] e1000_swfw_sync_acquire+0x42/0xdb [e1000]
 [<f896a4a3>] e1000_write_kmrn_reg+0x2d/0x62 [e1000]
 [<f8969305>] e1000_configure_kmrn_for_10_100+0x1e/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<f88f8679>] bond_update_speed_duplex+0x38/0xea [bonding1]
 [<c011e838>] __wake_up_common+0x36/0x51
 [<f88fa31e>] bond_mii_monitor+0x366/0x3f7 [bonding1]
 [<f88f9fb8>] bond_mii_monitor+0x0/0x3f7 [bonding1]
 [<c012a939>] run_timer_softirq+0x123/0x145
 [<c0126d84>] __do_softirq+0x4c/0xb1
 [<c01081a3>] do_softirq+0x4f/0x56
 =======================
 [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02e1436>] apic_timer_interrupt+0x1a/0x20
 [<f896a09d>] e1000_swfw_sync_acquire+0x4e/0xdb [e1000]
 [<f896a198>] e1000_read_phy_reg+0x2e/0xc4 [e1000]
 [<f8969330>] e1000_configure_kmrn_for_10_100+0x49/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<c028de3e>] ethtool_get_settings+0x2f/0x85
 [<c028f438>] dev_ethtool+0xc6/0x27a
 [<c028d318>] dev_ioctl+0x2c7/0x4ab
 [<c02c0ae7>] udp_ioctl+0x0/0xc5
 [<c02c6c8e>] inet_ioctl+0xa0/0xa5
 [<c0283bcd>] sock_ioctl+0x28c/0x2b4
 [<c016cc5e>] sys_ioctl+0x227/0x269
 [<c02e09db>] syscall_call+0x7/0xb

Well, I can't say if this big delay is really needed or not.
Thoughts?


(attaching the console output of debug kernel showing it looping)
Flavio

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/4

------------------------------------------------------------------------
On 2009-01-13T18:15:30+00:00 Flavio wrote:

Hello,

Here is another call trace leading to the same problem:

- This problem occurs when SW interrupt for MII monitor occurs while
 acquiring HW semaphore in SNMP state acquisition processing.
 At that time, the MII monitor cannot acquire the semaphore, because 
 the SNMP has already aquired the semaphore, and it must wait for 16 
 seconds for timeout. Therefore the system stops for 16 seconds.

 -----
 sys_ioctl
  ->e1000_get_speed_and_duplex
    ->e1000_configure_kmrn_for_1000
      ->e1000_write_kmrn_reg
        ->e1000_swfw_sync_acquire
          ->e1000_get_hw_eeprom_semaphore
            * SW interrupt by MII monitor occurs while dealing with this 
function.
            ->bond_mii_monitor
              ->e1000_get_speed_and_duplex
                ->e1000_configure_kmrn_for_1000
                  ->e1000_write_kmrn_reg
                    ->e1000_swfw_sync_acquire
                      ->e1000_get_hw_eeprom_semaphore
                        ->e1000_get_software_semaphore
                        * It is made to wait until timeout(16 seconds) in this 
function.


Flavio

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/5

------------------------------------------------------------------------
On 2009-02-09T20:55:01+00:00 Andy wrote:

Flavio's analysis is absolutely correct.  It's interesting that
concurrent calls to check link status have such long timeouts.  Whether
the mii monitoring time-out is set to be too small or not (I'd say 1ms
is too small), there is a chance for deadlock with anything set, so we
should try and come up with a fix.

As I look at what is done here

I'm not convinced that these bits:

3203 
3204     if ((hw->mac_type == e1000_80003es2lan) &&
3205         (hw->media_type == e1000_media_type_copper)) {
3206         if (*speed == SPEED_1000)
3207             ret_val = e1000_configure_kmrn_for_1000(hw);
3208         else
3209             ret_val = e1000_configure_kmrn_for_10_100(hw, *duplex);
3210         if (ret_val)
3211             return ret_val;
3212     }
3213 

need to be run on every call to e1000_get_speed_and_duplex().  I can see
how they are valuable when the link first comes up since they make sure
the interpacket gap and it appears that they disable some 'false carrier
detection' bit (I haven't looked at the datasheet to determine what the
false-carrier stuff actually does).

As I see it there are a couple of options:

1.  Break out the code above so that it's in a separate function and
only called when needed.  This would mean we would drop it from
e1000_get_speed_and_duplex and have to place some calls do the new
function after the calls to e1000_get_speed_and_duplex.

2.  Add a new conditional argument to e1000_get_speed_and_duplex so the
offending code is not called when we deem it not necessary.  This would
require the same level of research to determine where this call is
needed and where it is not.

3.  Do not modify the function signature and make sure the code above
doesn't get called when e1000_get_speed_and_duplex is called in irq
context.  This is a bit hacky, but would certainly work around the issue
since the call would not me made by miimon (since timers run in softirq
context).  This would allow it to be called in the watchdog though since
the watchdog runs as a workqueue in our code so it's a normal, non-irq
thread.

4. Other.

Any way you look at it, I don't see how calling
e1000_configure_kmrn_for_1000 or e1000_configure_kmrn_for_10_100 is
necessary when simply checking link status.  I'm hoping to get someone
at Intel to verify this.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/6

------------------------------------------------------------------------
On 2009-02-09T21:20:09+00:00 Andy wrote:

Places where we should not call 'questionable code' from comment #9
since it will be deadlock or delay prone are:

e1000_get_settings
e1000_check_for_link
e1000_config_fc_after_link_up
e1000_config_dsp_after_link_change

Places where we could call 'questionable code' from comment #9 since it
will not be as deadlock prone:

e1000_watchdog_task

Since there is only one case where we should really call this I am going
to suggest creating a new function to deal with this and only calling it
there.

Upon inspecting the e1000e driver it appears the same basic code it only
called is called in functions equivalent to e1000_watchdog_task and
e1000_config_fc_after_link_up, so I will do some looking around and try
to determine if this is necessary in both places in e1000 or just in
e1000_watchdog_task.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/7

------------------------------------------------------------------------
On 2009-02-09T22:30:17+00:00 Andy wrote:

Jesse, you can read the whole bug if you want, but if you start at
comment #9 you will get the information you need.  I pinged Jeff K about
this too, but he appears to be in a meeting and does not appear to have
a bugzilla account his Intel address (at least not the one I know!).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/8

------------------------------------------------------------------------
On 2009-02-10T00:21:18+00:00 Jesse wrote:

I agree with your assertion that it is not important to make those deep
call paths from the ethtool code.

I believe they fixed these issues in newer bonding by never calling
ethtool (IOCTL) calls from interrupt context (softirq) which seems like
it was always a bad idea once it is explained that all ethtool calls
were designed with the assumption that they were being called from
*user* context, not interrupt context.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/9

------------------------------------------------------------------------
On 2009-02-10T00:47:34+00:00 Andy wrote:

Jesse, that was one of the benefits of the move from timers to
workqueues with bonding.  Those changes are in RHEL5, but I don't think
we'll ever see them in RHEL4.

I'll whip up a patch and post it here soon.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/10

------------------------------------------------------------------------
On 2009-02-10T01:04:41+00:00 Andy wrote:

Created attachment 331385
e1000-ethtool-fix.patch

First pass at a patch to resolve this.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/11

------------------------------------------------------------------------
On 2009-02-10T20:09:31+00:00 Issue wrote:

Hi,

Engineering has a patch moving the offending function to another context.
Could you give a try and report back your results?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner 
 issue 249481

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/12

------------------------------------------------------------------------
On 2009-02-12T12:34:51+00:00 Andy wrote:

My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel4

Please test them and report back your results.  Without immediate
feedback there is a good chance this or any other fix for this driver
will not be included in the upcoming update.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/13

------------------------------------------------------------------------
On 2009-02-14T12:51:42+00:00 Issue wrote:

>From FJ:
---
I checked the same test on kernel-smp-2.6.9-81.EL.gtest.59 again.

And, it worked just fine.
Thank you for your fix.
---

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by mosh...@redhat.com 
 issue 231780

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/14

------------------------------------------------------------------------
On 2009-02-20T19:43:34+00:00 RHEL wrote:

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/15

------------------------------------------------------------------------
On 2009-02-26T16:47:20+00:00 Vivek wrote:

Committed in 82.EL . RPMS are available at
http://people.redhat.com/vgoyal/rhel4/

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/16

------------------------------------------------------------------------
On 2009-05-18T19:05:00+00:00 errata-xmlrpc wrote:

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/17


** Changed in: linux (Fedora)
   Importance: Unknown => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/488898

Title:
  [hardy; e1000] ethtool operation to bonding slave device may hang
  system;  corrupt data

Status in Ubuntu China Edition:
  Invalid
Status in linux package in Ubuntu:
  Incomplete
Status in linux package in Fedora:
  Fix Released

Bug description:
  There is a problem using bonding with 80003ES2LAN network cards (e1000
  module) on Hardy.  The system can hang and data corruption has been
  observed.

  Please see attachments: kernel_logs.txt

  This is from a Sun Fire X2250

  Red Hat confirmed bug and patch:

  https://bugzilla.redhat.com/show_bug.cgi?id=473258

  The problem can be reproduced by executing, several times, the
  command:

  # ethetool eth0

To manage notifications about this bug go to:
https://bugs.launchpad.net/cnubuntu/+bug/488898/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 488898] Re: [hardy; e1000] ethtool operation to bonding slave device may hang system; corrupt data

Reply via email to