Hello,

I'm running a process tree inside a network and pid namespace and am trying to 
checkpoint it using CRIU (over the RPC API), restore it on another node, 
checkpoint it again, and restore the process tree on the original node. 
Unfortunately, the last operation fails if I attempt it within roughly 5 
minutes of the first checkpoint operation.

The problem seems to be connected to an attempt to restore the ip address of 
the loopback device inside the network namespace. Here is the relevant part of 
the log:

(00.032802) 1: Skip veth0/use_optimistic, coincides with default
(00.032805) 1: Skip veth0/use_tempaddr, coincides with default
(00.032835) 1: Try to restore a link 9:1:lo
(00.032838) 1: Restoring link lo type 1
(15.034082) 1: Running ip addr restore
RTNETLINK answers: File exists
RTNETLINK answers: File exists
(15.037634) 1: Running ip route restore
RTNETLINK answers: File exists
(15.040505) 1: Running ip route restore
RTNETLINK answers: File exists
(15.043492) 1: Running ip rule flush
(15.046370) 1: Running ip rule delete table local
(15.048892) 1: Running ip rule restore
(15.051935) 1: Running iptables-restore -w for iptables-restore -w
(15.076275) 1: Running ip6tables-restore -w for ip6tables-restore -w
(15.104769) 1: Warn (criu/libnetlink.c:55): ERROR -16 reported by netlink
(15.123717) 1: Error (criu/util.c:1563): Can't wait or bad status: errno=0, 
status=65280
(15.123912) Error (criu/cr-restore.c:2300): Restoring FAILED.

Netlink returns -16 (-EBUSY) when CRIU tries to send the "ifaddr-%u.img" file 
to ip addr restore.

Further, I figured out that -EBUSY is returned by ctnetlink_change_status when 
it compares the status against the constant IPS_ASSURED. At this point 
d == 4 == IPS_ASSURED, and status == 10 == IPS_CONFIRMED | IPS_SEEN_REPLY.

Here is the backtrace:

#0  0xffffffffc0d0a4f3 in ctnetlink_change_status (ct=0xffff880428e84000, 
cda=<optimized out>)
    at net/netfilter/nf_conntrack_netlink.c:1522
#1  0xffffffffc0d0f15c in ctnetlink_change_conntrack (cda=<optimized out>, 
ct=<optimized out>)
    at net/netfilter/nf_conntrack_netlink.c:1811
#2  ctnetlink_new_conntrack (net=0xffff8804282d9880, ctnl=<optimized out>, 
skb=<optimized out>, nlh=0x7fffffff,
    cda=0xffffc90002393a40, extack=<optimized out>) at 
net/netfilter/nf_conntrack_netlink.c:2092
#3  0xffffffffc0bfa4ed in nfnetlink_rcv_msg (skb=0xffff880426d2e600, 
nlh=0xffff880428f2c800,
    extack=0xffffc90002393b88) at net/netfilter/nfnetlink.c:228
#4  0xffffffff8163f052 in netlink_rcv_skb (skb=0xffff880428e84000, cb=0xa 
<irq_stack_union+10>)
    at net/netlink/af_netlink.c:2455
#5  0xffffffffc0bfadbf in nfnetlink_rcv (skb=0xffff880428e84000) at 
net/netfilter/nfnetlink.c:555
#6  0xffffffff8163e88f in netlink_unicast_kernel (ssk=<optimized out>, 
skb=<optimized out>, sk=<optimized out>)
    at net/netlink/af_netlink.c:1317
#7  netlink_unicast (ssk=0xffff880427e0f000, skb=0xffff880426d2e600, portid=0, 
nonblock=<optimized out>)
    at net/netlink/af_netlink.c:1343
#8  0xffffffff8163eb3b in netlink_sendmsg (sock=<optimized out>, msg=0xa 
<irq_stack_union+10>, len=<optimized out>)
    at net/netlink/af_netlink.c:1908
#9  0xffffffff815d3b7e in sock_sendmsg_nosec (msg=<optimized out>, 
sock=<optimized out>) at ./include/linux/uio.h:202
#10 sock_sendmsg (sock=0xffff880428a9c840, msg=0xffffc90002393ea0) at 
net/socket.c:652
#11 0xffffffff815d4115 in ___sys_sendmsg (sock=0xffff880428a9c840, 
msg=<optimized out>, msg_sys=0xffffc90002393ea0,
    flags=<optimized out>, used_address=0x0 <irq_stack_union>, 
allowed_msghdr_flags=<optimized out>)
    at net/socket.c:2126
#12 0xffffffff815d551c in __sys_sendmsg (fd=<optimized out>, msg=0xa 
<irq_stack_union+10>, flags=4,
    forbid_cmsg_compat=<optimized out>) at net/socket.c:2164
#13 0xffffffff815d557f in __do_sys_sendmsg (flags=<optimized out>, 
msg=<optimized out>, fd=<optimized out>)
    at net/socket.c:2173
#14 __se_sys_sendmsg (flags=<optimized out>, msg=<optimized out>, fd=<optimized 
out>) at net/socket.c:2171
#15 __x64_sys_sendmsg (regs=<optimized out>) at net/socket.c:2171
#16 0xffffffff810041d8 in do_syscall_64 (nr=<optimized out>, regs=0xa 
<irq_stack_union+10>)
    at arch/x86/entry/common.c:299
#17 0xffffffff81800088 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:238
#18 0x0000000000000000 in ?? ()

Originally, I posted this issue on the CRIU GitHub issue tracker 
(https://github.com/checkpoint-restore/criu/issues/581), but I was later 
advised to post it here on the netdev mailing list as well.

-- 
Regards,
Maksym Planeta
