This is not a bug in makedumpfile. Closing. Please reopen against linux if still valid.

** Changed in: makedumpfile (Ubuntu)
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to makedumpfile in Ubuntu.
https://bugs.launchpad.net/bugs/1339199

Title:
  mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE

Status in makedumpfile package in Ubuntu:
  Invalid

Bug description:
  mpi getmem.64 user job crashes ubuntu 14.04 ppc64le LE

  ^^^^^^^^
  SUMMARY:
  ^^^^^^^^
  mpi user job launch getmem.64 user job crashes ubuntu 14.04

  We still have the node c656f2n05 in debug with console opened and can
  give access when needed.

  ^^^^^^^^^^^^^^
  CONFIGURATION:
  ^^^^^^^^^^^^^^
  Job launched across cluster of 4 p8 22L systems ( hostnames on the ibm 9.* net )

  c656f2n03 is c656f2n03.pok.stglabs.ibm.com is 9.114.39.143
  c656f2n04 is c656f2n04.pok.stglabs.ibm.com is 9.114.39.144
  c656f2n05 is c656f2n05.pok.stglabs.ibm.com is 9.114.39.145  <--- node crashed
  c656f2n06 is c656f2n06.pok.stglabs.ibm.com is 9.114.39.146

  ^^^^^^
  BUILD:
  ^^^^^^
  Ubuntu 14.04 LTS c656f2n05 hvc0

  Linux c656f2n05 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:46:02 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

  ^^^^^^^^^
  SCENARIO:
  ^^^^^^^^^
  Hi, all:

  P8 server c656f2n05 crashed again. I had launched the openshmem
  regression with the efix of D198270 before the server crashed.

  Here are the steps that I used to launch the test:
  ---------------------
  1. ssh c656f2n03 with root/davega.
  2. su - qixiaol
  3. cd /u/qixiaol/fvt/openshmem/get
  4. [c656f2n03][/u/qixiaol/fvt/openshmem/get]> env | grep MP
     MP_ADAPTER_USE=shared
     MP_EUIDEVICE=sn_all
     MP_EUILIB=us
     MP_EUILIBPATH=/u/qixiaol/fvt/efix/RVIN_LE/
     MP_HOSTFILE=/u/qixiaol/fvt/openshmem/host.list
     MP_PROCS=8
     MP_RESD=poe
  5.
     [c656f2n03][/u/qixiaol/fvt/openshmem/get]> ./run.all

  And the test log is at /u/qixiaol/fvt/openshmem/get/RESULTS:
  -------------------------------
  [c656f2n03][/u/qixiaol/fvt/openshmem/get/RESULTS]>

  The binary file of getmem.64 is at c656f2n03:/u/qixiaol/fvt/openshmem/get/bin/ppc64le/getmem.64
  And its source code is at c656f2n03:/u/qixiaol/fvt/openshmem/get/src/getmem.c

  Best Wishes
  Xiao Lu Qi

  ^^^^^^^^
  SYMPTOM:
  ^^^^^^^^
  Ubuntu 14.04 LTS c656f2n05 hvc0

  c656f2n05 login: [418442.082487]  rport-1:0-3: blocked FC remote port time out: removing rport
  [418442.082492]  rport-5:0-6: blocked FC remote port time out: removing rport
  [418442.082494]  rport-2:0-3: blocked FC remote port time out: removing rport
  [466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
      pc: c00000000022fa34: .kfree+0x124/0x220
  [466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
      lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
      sp: c000001f39a13590
     msr: 9000000000029033
    current = 0xc000001f58828bf0
    paca    = 0xc00000000fe45780   softe: 0   irq_happened: 0x01
      pid   = 98622, comm = getmem.64
  [466884.685546] kekernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  [466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
      pc: c00000000022fa34: .kfree+0x124/0x220
      lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
      sp: c000001f39a6b590
     msr: 9000000000029033
    current = 0xc000001f5886f840
    paca    = 0xc00000000fe43b80   softe: 0   irq_happened: 0x01
      pid   = 98621, comm = getmem.64
  kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  enter ? for help
  19:mon> t   print backtrace
  19:mon> t
  [c000001f39a13620] c0000000007fb6a8 .skb_free_head+0x78/0xb0
  [c000001f39a136a0] c0000000007fb904 .__kfree_skb+0x24/0x40
  [c000001f39a13720] c00000000080396c .skb_free_datagram_locked+0xbc/0x140
  [c000001f39a137b0] c0000000008992c0 .udp_recvmsg+0x1b0/0x530
  [c000001f39a13890] c0000000008a6e70 .inet_recvmsg+0xa0/0x100
  [c000001f39a13940] c0000000007ee7d8 .sock_recvmsg+0x108/0x160
  [c000001f39a13ab0] c0000000007ef700 .___sys_recvmsg+0x150/0x320
  [c000001f39a13c90] c0000000007f2858 .__sys_recvmsg+0x58/0xc0
  [c000001f39a13d70] c0000000007f30ec .SyS_socketcall+0x38c/0x3f0
  [c000001f39a13e30] c00000000000a158 syscall_exit+0x0/0x98
  --- Exception: c01 (System Call) at 00003ffdbf7c10bc
  SP (3fffc777abc0) is in userspace
  19:mon>

  ^^^^^^
  DEBUG:
  ^^^^^^
  Hi Xiao Lu,

  A user space job should not crash the node. If the process uses too
  much memory, then the process itself should get ENOMEM and the node
  should not crash. Could you please check the system error log to see
  what the real issue was that caused the node to crash?

  From the information Dave provided, it looks like we accidentally
  triggered a bug in the kernel. I believe we need someone from LTC to
  check it.

  [466884.681785] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  cpu 0x19: Vector: 700 (Program Check) at [c000001f39a13310]
      pc: c00000000022fa34: .kfree+0x124/0x220
  [466884.683019] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
      lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
      sp: c000001f39a13590
     msr: 9000000000029033
    current = 0xc000001f58828bf0
    paca    = 0xc00000000fe45780   softe: 0   irq_happened: 0x01
      pid   = 98622, comm = getmem.64
  [466884.685546] kekernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  [466884.687415] kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  cpu 0x11: Vector: 700 (Program Check) at [c000001f39a6b310]
      pc: c00000000022fa34: .kfree+0x124/0x220
      lr: c0000000007fb6a8: .skb_free_head+0x78/0xb0
      sp: c000001f39a6b590
     msr: 9000000000029033
    current = 0xc000001f5886f840
    paca    = 0xc00000000fe43b80   softe: 0   irq_happened: 0x01
      pid   = 98621, comm = getmem.64
  kernel BUG at /build/buildd/linux-3.13.0/mm/slub.c:3365!
  enter ? for help
  19:mon>

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1339199/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp