Launchpad has imported 18 comments from the remote bug at https://bugzilla.redhat.com/show_bug.cgi?id=473258.
If you reply to an imported comment from within Launchpad, your comment will be sent to the remote bug automatically. Read more about Launchpad's inter-bugtracker facilities at https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2008-11-27T12:51:46+00:00 Flavio wrote:

Description of Problem:
When the ethtool command is invoked to get information about a slave device under a bonding device, the system hangs. We think the hang occurs while obtaining the semaphore in the e1000 driver.

This problem is the same as IT#221541/BZ#445951, which occurs on RHEL5.1.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: RHEL4
Release Number: 4.7
Architecture: x86
Kernel Version: 2.6.9-78.ELsmp
Related Package Version: kernel-2.6.9-78.ELsmp
Related Middleware / Application: /sbin/ethtool

Drivers or hardware or architecture dependency:
Depends on the following LAN controller and drivers.
- Intel 80003ES2LAN Gigabit Ethernet Controller
- e1000 driver
- bonding driver

How reproducible:
Sometimes

Step to Reproduce:
- Create a bonding device using the Intel 80003ES2LAN Gigabit Ethernet Controller

  [device configuration]
  bond0 ---+--- eth0 (80003ES2LAN)
           |
           +--- eth1 (80003ES2LAN)

  1) Add the following to /etc/modprobe.conf

     alias bond0 bonding
     options bond0 mode=1 miimon=1 primary=eth0 updelay=1000

  2) Set the following in /etc/sysconfig/network-scripts/ifcfg-bond0

     DEVICE=bond0
     BOOTPROTO=none
     BROADCAST=192.168.0.255
     IPADDR=192.168.0.5
     NETMASK=255.255.255.0
     NETWORK=192.168.0.0
     ONBOOT=yes
     USERCTL=no

  3) Set the following in /etc/sysconfig/network-scripts/ifcfg-ethX

     DEVICE=ethX
     BOOTPROTO=none
     HWADDR=XX:XX:XX:XX:XX:XX
     MASTER=bond0
     SLAVE=yes
     USERCTL=no
     ONBOOT=yes

- Execute an ethtool command on the slave device.

  # /sbin/ethtool eth0

  This may happen more easily when the command is invoked multiple times.

Actual Results:
The system hung up.
Expected Results:
The system does not hang.

Hardware configuration:
Model: PRIMERGY RX600S4
CPU Info: Intel Xeon Processor 2.93GHz x 1
Memory Info: 2[GB]

Hardware Component Information:
M/B: Intel 7300
LAN: Intel 80003ES2LAN Gigabit Ethernet Controller (onboard)

Call Trace:
 [<c021203c>] __handle_sysrq+0x62/0xd9
 [<c020cd4c>] kbd_event+0x83/0xb0
 [<c02737e9>] input_event+0x331/0x351
 [<c0271bd8>] hidinput_hid_event+0x1d0/0x208
 [<c0224142>] get_device+0xe/0x14
 [<c026e1a5>] hid_process_event+0x28/0x52
 [<c026e47b>] hid_input_field+0x2ac/0x2f9
 [<c026e541>] hid_input_report+0x79/0x98
 [<c026e5f6>] hid_irq_in+0x96/0xf2
 [<c026577e>] usb_hcd_giveback_urb+0x14/0x3e
 [<f88aeceb>] uhci_finish_urb+0x27/0x32 [uhci_hcd]
 [<f88aed28>] uhci_finish_completion+0x32/0x38 [uhci_hcd]
 [<f88aef00>] uhci_irq+0x19b/0x240 [uhci_hcd]
 [<c02657ce>] usb_hcd_irq+0x26/0x4b
 [<c01074d2>] handle_IRQ_event+0x25/0x4f
 [<c0107a32>] do_IRQ+0x11c/0x1ae
 =======================
 [<c02e13b4>] common_interrupt+0x18/0x20
 [<c0112e1e>] delay_pmtmr+0xd/0x13
 [<c01c6ca1>] __delay+0x9/0xa
 [<f896e267>] e1000_get_software_semaphore+0x6c/0x7d [e1000]
 [<f896e184>] e1000_get_hw_eeprom_semaphore+0x17/0x66 [e1000]
 [<f896a091>] e1000_swfw_sync_acquire+0x42/0xdb [e1000]
 [<f896a4a3>] e1000_write_kmrn_reg+0x2d/0x62 [e1000]
 [<f8969305>] e1000_configure_kmrn_for_10_100+0x1e/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<f88f8679>] bond_update_speed_duplex+0x38/0xea [bonding1]
 [<c011e838>] __wake_up_common+0x36/0x51
 [<f88fa31e>] bond_mii_monitor+0x366/0x3f7 [bonding1]
 [<f88f9fb8>] bond_mii_monitor+0x0/0x3f7 [bonding1]
 [<c012a939>] run_timer_softirq+0x123/0x145
 [<c0126d84>] __do_softirq+0x4c/0xb1
 [<c01081a3>] do_softirq+0x4f/0x56
 =======================
 =======================
 [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02e1436>] apic_timer_interrupt+0x1a/0x20
 [<f896a09d>] e1000_swfw_sync_acquire+0x4e/0xdb [e1000]
 [<f896a198>] e1000_read_phy_reg+0x2e/0xc4 [e1000]
 [<f8969330>] e1000_configure_kmrn_for_10_100+0x49/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<c028de3e>] ethtool_get_settings+0x2f/0x85
 [<c028f438>] dev_ethtool+0xc6/0x27a
 [<c028d318>] dev_ioctl+0x2c7/0x4ab
 [<c02c0ae7>] udp_ioctl+0x0/0xc5
 [<c02c6c8e>] inet_ioctl+0xa0/0xa5
 [<c0283bcd>] sock_ioctl+0x28c/0x2b4
 [<c016cc5e>] sys_ioctl+0x227/0x269
 [<c02e09db>] syscall_call+0x7/0xb

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/0

------------------------------------------------------------------------
On 2008-11-27T14:43:34+00:00 Flavio wrote:

The last trace shows the CPU still looping:
e1000_swfw_sync_acquire+0x4e/0xdb
...
    while (timeout) {
            if (e1000_get_hw_eeprom_semaphore(hw))
                return -E1000_ERR_SWFW_SYNC;

            swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);   /* ** CPU was around here ** */
            if (!(swfw_sync & (fwmask | swmask))) {
                break;
            }

            /* firmware currently using resource (fwmask) */
            /* or other software thread currently using resource (swmask) */
            e1000_put_hw_eeprom_semaphore(hw);
            mdelay(5);
            timeout--;
    }

Then it was interrupted, do_softirq() was called, and that path also went through e1000_swfw_sync_acquire(). There it tries to get the software semaphore, which depends on the HW register E1000_SWSM_SMBI. I'd say this should deadlock, but it is looping with a timeout, so after some time it should fail and move on normally.

The stack trace with the do_softirq() code path shows the CPU is looping, waiting on mdelay().

The delay in e1000_swfw_sync_acquire() can be 1sec, and in e1000_get_hw_eeprom_semaphore() it is
hw->eeprom.word_size + 1

This board doesn't have eeprom.word_size ready as other models do; it reads it from the eeprom, and I have no idea yet how big it can be.

An easy way to fix this would be adding local_bh_{dis,ena}able() in e1000_swfw_sync_acquire() to keep do_softirq() from getting in the middle, but looking at Linus' tree or the one from sf.net, this board is supported only by e1000e, and there e1000_get_settings() doesn't call anything else, so this bug shouldn't happen there.

Next questions to solve:
1) Is this really hung? Or does it move on after some time?
2) What is the size of this eeprom? Perhaps the timeout is just too big.
3) Is there any plan for the e1000/e1000e driver on RHEL4u7? I think if we move this board to e1000e and update this driver, it should be fixed too.

Flavio

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/1

------------------------------------------------------------------------
On 2008-12-15T16:40:51+00:00 Issue wrote:

Hello,

This is the output from the console log:
-> e1000_swfw_sync_acquire: timeout=200
-> e1000_get_software_semaphore: timeout=2049
<- e1000_get_software_semaphore: timeout=2049
-> e1000_get_hw_eeprom_semaphore: timeout=2049
<- e1000_get_hw_eeprom_semaphore: timeout=2049
<- e1000_swfw_sync_acquire: timeout=200

and it keeps repeating this part, so it's neither a deadlock nor a HW lock problem. The stack trace also looks good to me.

Well, I'm wondering whether bond_mii_monitor() is running too often, so I went to check your bonding options and found the lines below:

$ grep bond rx600s4-372001/etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 miimon=1 primary=eth2 updelay=1000
install bond1 /sbin/modprobe --ignore-install bonding -o bonding1 mode=1 miimon=1 primary=eth3 updelay=2000

See the miimon=1 (default is 100); perhaps it is too fast.

Can you get 5 shots of sysrq+t followed by sysrq+w while reproducing it? i.e.:
1. <sysrq+t>
2. <sysrq+t>
3. wait 1 sec
4. repeat this 5x.

There is no need to use a debug kernel.

Can you change miimon to 100 and see if it helps?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner
 issue 231780

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/2

------------------------------------------------------------------------
On 2008-12-22T13:40:10+00:00 Issue wrote:

Hi,

The traces show the CPU looping, but they don't match the previous log provided.

Can you confirm whether the data provided with the test kernel kernel-2.6.9-78.10.EL.IT231780.1.src.rpm was gathered while the problem was happening?

I'm referring to this update:
---
I booted by this kernel, and gathered the console log. I append it.
Could you check the contents?
---

If not, can you get it again while the problem is happening?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner
 issue 231780

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/3

------------------------------------------------------------------------
On 2008-12-29T17:30:42+00:00 Flavio wrote:

Created attachment 327926
output of debug kernel 2.6.9-78.10.EL.IT231780.1

Hi,

The problem here seems to be the big loop with a delay in e1000_get_software_semaphore().

e1000_get_software_semaphore(struct e1000_hw *hw)
{
    int32_t timeout = hw->eeprom.word_size + 1;

    while (timeout) {
        ...
        mdelay(1);
        timeout--;
    }
}

The timeout value there is 2049, which means about 2 seconds inside of that while() {}. The bonding timer expires every 100ms (in the customer's case it was 1ms), then it calls bond_mii_monitor() and ends up there again for 2 more seconds, leaving ethtool stuck.
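
The arithmetic above can be sketched in user-space C. This is only an illustrative model, not driver code: the helper names sw_semaphore_wait_ms and timer_expiries_during are hypothetical, and the word size of 2048 is merely inferred from the logged timeout of 2049 (word_size + 1).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case busy-wait of a loop shaped like the one quoted above:
 * (word_size + 1) iterations, each delaying delay_ms milliseconds.
 * Hypothetical helper for illustration; not part of the e1000 driver.
 */
int64_t sw_semaphore_wait_ms(int32_t eeprom_word_size, int32_t delay_ms)
{
    assert(eeprom_word_size >= 0 && delay_ms >= 0);
    return ((int64_t)eeprom_word_size + 1) * delay_ms;
}

/*
 * Number of whole timer periods of miimon_ms milliseconds that elapse
 * while the CPU is busy for busy_ms milliseconds.
 */
int64_t timer_expiries_during(int64_t busy_ms, int32_t miimon_ms)
{
    assert(busy_ms >= 0 && miimon_ms > 0);
    return busy_ms / miimon_ms;
}
```

With the logged values, sw_semaphore_wait_ms(2048, 1) gives 2049 ms. During that window the monitor interval elapses 2049 times at miimon=1, against 20 times at miimon=100, which is consistent with the suggestion to raise miimon while not removing the underlying delay.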

This is the ethtool stack interrupted:
 [<c02e13b4>] common_interrupt+0x18/0x20
 [<c0112e1e>] delay_pmtmr+0xd/0x13
 [<c01c6ca1>] __delay+0x9/0xa
 [<f896e267>] e1000_get_software_semaphore+0x6c/0x7d [e1000]
 [<f896e184>] e1000_get_hw_eeprom_semaphore+0x17/0x66 [e1000]
 [<f896a091>] e1000_swfw_sync_acquire+0x42/0xdb [e1000]
 [<f896a4a3>] e1000_write_kmrn_reg+0x2d/0x62 [e1000]
 [<f8969305>] e1000_configure_kmrn_for_10_100+0x1e/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<f88f8679>] bond_update_speed_duplex+0x38/0xea [bonding1]
 [<c011e838>] __wake_up_common+0x36/0x51
 [<f88fa31e>] bond_mii_monitor+0x366/0x3f7 [bonding1]
 [<f88f9fb8>] bond_mii_monitor+0x0/0x3f7 [bonding1]
 [<c012a939>] run_timer_softirq+0x123/0x145
 [<c0126d84>] __do_softirq+0x4c/0xb1
 [<c01081a3>] do_softirq+0x4f/0x56
 =======================
 [<c01174c8>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02e1436>] apic_timer_interrupt+0x1a/0x20
 [<f896a09d>] e1000_swfw_sync_acquire+0x4e/0xdb [e1000]
 [<f896a198>] e1000_read_phy_reg+0x2e/0xc4 [e1000]
 [<f8969330>] e1000_configure_kmrn_for_10_100+0x49/0x78 [e1000]
 [<f8969eae>] e1000_get_speed_and_duplex+0xd4/0xf8 [e1000]
 [<f896eab0>] e1000_get_settings+0x9c/0xdb [e1000]
 [<c028de3e>] ethtool_get_settings+0x2f/0x85
 [<c028f438>] dev_ethtool+0xc6/0x27a
 [<c028d318>] dev_ioctl+0x2c7/0x4ab
 [<c02c0ae7>] udp_ioctl+0x0/0xc5
 [<c02c6c8e>] inet_ioctl+0xa0/0xa5
 [<c0283bcd>] sock_ioctl+0x28c/0x2b4
 [<c016cc5e>] sys_ioctl+0x227/0x269
 [<c02e09db>] syscall_call+0x7/0xb

Well, I can't say if this big delay is really needed or not. Thoughts?
(attaching the console output of debug kernel showing it looping)

Flavio

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/4

------------------------------------------------------------------------
On 2009-01-13T18:15:30+00:00 Flavio wrote:

Hello,

Here is another call trace leading to the same problem:

- This problem occurs when the SW interrupt for the MII monitor fires while the HW semaphore is being acquired during SNMP state acquisition processing. At that time, the MII monitor cannot acquire the semaphore, because the SNMP path has already acquired it, and it must wait 16 seconds for the timeout. Therefore the system stops for 16 seconds.

-----
sys_ioctl
->e1000_get_speed_and_duplex
->e1000_configure_kmrn_for_1000
->e1000_write_kmrn_reg
->e1000_swfw_sync_acquire
->e1000_get_hw_eeprom_semaphore
  * SW interrupt by MII monitor occurs while dealing with this function.
->bond_mii_monitor
->e1000_get_speed_and_duplex
->e1000_configure_kmrn_for_1000
->e1000_write_kmrn_reg
->e1000_swfw_sync_acquire
->e1000_get_hw_eeprom_semaphore
->e1000_get_software_semaphore
  * It is made to wait until timeout (16 seconds) in this function.

Flavio

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/5

------------------------------------------------------------------------
On 2009-02-09T20:55:01+00:00 Andy wrote:

Flavio's analysis is absolutely correct. It's interesting that concurrent calls to check link status have such long timeouts. Whether the mii monitoring time-out is set too small or not (I'd say 1ms is too small), there is a chance for deadlock with any setting, so we should try and come up with a fix.
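
Why the interrupting path always runs to its timeout can be shown with a small user-space model. The names try_acquire and poll_acquire are hypothetical and this is not driver code. It assumes, as in the call chain above, that the holder was interrupted on the same CPU and therefore cannot release the semaphore while the poller runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Try once to take a flag-style semaphore; returns true on success. */
bool try_acquire(bool *held)
{
    assert(held != NULL);
    if (*held)
        return false;
    *held = true;
    return true;
}

/*
 * Poll for the semaphore up to max_tries times. Returns the number of
 * failed polls before success, or -1 if every poll failed (timeout).
 * In the driver, each failed poll would also cost a delay.
 */
int poll_acquire(bool *held, int max_tries)
{
    int failed;

    assert(max_tries >= 0);
    for (failed = 0; failed < max_tries; failed++) {
        if (try_acquire(held))
            return failed;
    }
    return -1;
}
```

If the flag is already set when poll_acquire starts and nothing else can run to clear it, every poll fails and the call returns -1 only after max_tries attempts. With a free flag the first poll succeeds and the call returns 0.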

As I look at what is done here I'm not convinced that these bits:

        if ((hw->mac_type == e1000_80003es2lan) &&
            (hw->media_type == e1000_media_type_copper)) {
                if (*speed == SPEED_1000)
                        ret_val = e1000_configure_kmrn_for_1000(hw);
                else
                        ret_val = e1000_configure_kmrn_for_10_100(hw, *duplex);
                if (ret_val)
                        return ret_val;
        }

need to be run on every call to e1000_get_speed_and_duplex(). I can see how they are valuable when the link first comes up, since they make sure the interpacket gap is set, and it appears that they disable some 'false carrier detection' bit (I haven't looked at the datasheet to determine what the false-carrier stuff actually does).

As I see it there are a couple of options:

1. Break out the code above so that it's in a separate function and only called when needed. This would mean we would drop it from e1000_get_speed_and_duplex and have to place some calls to the new function after the calls to e1000_get_speed_and_duplex.

2. Add a new conditional argument to e1000_get_speed_and_duplex so the offending code is not called when we deem it not necessary. This would require the same level of research to determine where this call is needed and where it is not.

3. Do not modify the function signature and make sure the code above doesn't get called when e1000_get_speed_and_duplex is called in irq context. This is a bit hacky, but would certainly work around the issue, since the call would not be made by miimon (timers run in softirq context). This would still allow it to be called in the watchdog, though, since the watchdog runs as a workqueue in our code, so it's a normal, non-irq thread.

4. Other.

Any way you look at it, I don't see how calling e1000_configure_kmrn_for_1000 or e1000_configure_kmrn_for_10_100 is necessary when simply checking link status. I'm hoping to get someone at Intel to verify this.
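
Option 3 can be illustrated with a small user-space model. Everything in it is hypothetical: the in_irq_context parameter stands in for whatever context check the kernel code would use, configure_kmrn is a stub that only counts calls, and get_speed_and_duplex is a simplified stand-in for the driver function. It is not the patch attached to this bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the (stubbed) slow configuration path runs. */
int kmrn_config_calls = 0;

/* Stub standing in for e1000_configure_kmrn_for_1000()/_for_10_100(). */
int configure_kmrn(void)
{
    kmrn_config_calls++;
    return 0;
}

/*
 * Model of option 3: report the speed as usual, but skip the
 * semaphore-taking configuration step when called from irq context.
 * in_irq_context is a hypothetical parameter used only in this model.
 */
int get_speed_and_duplex(bool in_irq_context, int link_speed, int *speed)
{
    assert(speed != NULL);
    *speed = link_speed;
    if (in_irq_context)
        return 0;            /* e.g. miimon timer: leave KMRN alone */
    return configure_kmrn(); /* e.g. the watchdog workqueue */
}
```

In this model a call with in_irq_context set still returns the speed but leaves kmrn_config_calls unchanged, while a call from normal context increments it once.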

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/6

------------------------------------------------------------------------
On 2009-02-09T21:20:09+00:00 Andy wrote:

Places where we should not call the 'questionable code' from comment #9, since it will be deadlock or delay prone, are:

e1000_get_settings
e1000_check_for_link
e1000_config_fc_after_link_up
e1000_config_dsp_after_link_change

Places where we could call the 'questionable code' from comment #9, since it will not be as deadlock prone:

e1000_watchdog_task

Since there is only one case where we should really call this, I am going to suggest creating a new function to deal with this and only calling it there.

Upon inspecting the e1000e driver, it appears the same basic code is only called in functions equivalent to e1000_watchdog_task and e1000_config_fc_after_link_up, so I will do some looking around and try to determine if this is necessary in both places in e1000 or just in e1000_watchdog_task.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/7

------------------------------------------------------------------------
On 2009-02-09T22:30:17+00:00 Andy wrote:

Jesse, you can read the whole bug if you want, but if you start at comment #9 you will get the information you need. I pinged Jeff K about this too, but he appears to be in a meeting and does not appear to have a bugzilla account under his Intel address (at least not the one I know!).

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/8

------------------------------------------------------------------------
On 2009-02-10T00:21:18+00:00 Jesse wrote:

I agree with your assertion that it is not important to make those deep call paths from the ethtool code.

I believe they fixed these issues in newer bonding by never calling ethtool (IOCTL) calls from interrupt context (softirq), which seems like it was always a bad idea once it is explained that all ethtool calls were designed with the assumption that they were being called from *user* context, not interrupt context.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/9

------------------------------------------------------------------------
On 2009-02-10T00:47:34+00:00 Andy wrote:

Jesse, that was one of the benefits of the move from timers to workqueues with bonding. Those changes are in RHEL5, but I don't think we'll ever see them in RHEL4.

I'll whip up a patch and post it here soon.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/10

------------------------------------------------------------------------
On 2009-02-10T01:04:41+00:00 Andy wrote:

Created attachment 331385
e1000-ethtool-fix.patch

First pass at a patch to resolve this.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/11

------------------------------------------------------------------------
On 2009-02-10T20:09:31+00:00 Issue wrote:

Hi,

Engineering has a patch moving the offending function to another context. Could you give it a try and report back your results?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner
 issue 249481

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/12

------------------------------------------------------------------------
On 2009-02-12T12:34:51+00:00 Andy wrote:

My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel4

Please test them and report back your results. Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/13

------------------------------------------------------------------------
On 2009-02-14T12:51:42+00:00 Issue wrote:

From FJ:
---
I checked the same test on kernel-smp-2.6.9-81.EL.gtest.59 again.
And, it worked just fine.

Thank you for your fix.
---

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by mosh...@redhat.com
 issue 231780

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/14

------------------------------------------------------------------------
On 2009-02-20T19:43:34+00:00 RHEL wrote:

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/15

------------------------------------------------------------------------
On 2009-02-26T16:47:20+00:00 Vivek wrote:

Committed in 82.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/16

------------------------------------------------------------------------
On 2009-05-18T19:05:00+00:00 errata-xmlrpc wrote:

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

Reply at: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/488898/comments/17


** Changed in: linux (Fedora)
   Importance: Unknown => High

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/488898

Title:
  [hardy; e1000] ethtool operation to bonding slave device may hang system; corrupt data

Status in Ubuntu China Edition:
  Invalid
Status in linux package in Ubuntu:
  Incomplete
Status in linux package in Fedora:
  Fix Released

Bug description:
  There is a problem using bonding with 80003ES2LAN network cards (e1000 module) on Hardy. The system can hang and data corruption has been observed.

  Please see attachments: kernel_logs.txt
  This is from a Sun Fire X2250

  Red Hat confirmed bug and patch:
  https://bugzilla.redhat.com/show_bug.cgi?id=473258

  The problem can be reproduced by executing, several times, the command:
  # ethtool eth0

To manage notifications about this bug go to:
https://bugs.launchpad.net/cnubuntu/+bug/488898/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp