On (01/17/18 19:23), Willem de Bruijn wrote:
> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
  :
> > +	skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
>
> TCP zerocopy avoids this issue by allocating the notification skb when the
> zerocopy packet is created, which would be rds_message_copy_from_user.
right, I could allocate the skb when we set up the zcopy data structures.

> This does not add an allocation, if using the same trick of stashing
> the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
> alloc_skb is probably more expensive than that kzalloc. If nothing else,
> because of more zeroing.

I would have liked to reuse skb->cb, but had to fall back to alloc_skb
because of the attempt to support a flexible number (and size) of cookie
notifications. If we choose the "max number of cookies" option discussed
in the previous thread, I think I should be able to do this with the
existing 48-byte cb, packing 8 32-bit cookies after the sock_extended_err;
maybe that's sufficient (rough layout sketch further down).

> > +	serr = SKB_EXT_ERR(skb);
> > +	serr->ee.ee_errno = 0;
> > +	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> > +	serr->ee.ee_data = ncookies;
>
> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
> even if the semantics can be derived from the packet family (for now).

sounds good, I'll take care of this (and other review comments) for the
next rev (see the PS at the end for what I'm assuming you mean by the new
flag).

> Even better would be if we can avoid the cookies completely. I understand
> the issue with concurrent send threads racing on obtaining a zckey value. If
> the sender could learn which zckey was chosen for a call, would that suffice?

I'm not sure I understand the proposal: you want sendmsg to return the
cookie ("zckey") for you? How? Even if we mangled sendmsg() to return
something other than the existing POSIX semantics, how would the
application be asynchronously notified that the send has completed, on a
per-message/per-datagram basis?

> I suspect that even with concurrent senders, notifications arrive
> largely in order, in which case we could just maintain the existing
> semantics and even reuse that implementation.

not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G
ixgbe connection. When you have multiple threads writing to a socket,
you cannot know what the "order of send" was, unless you bottleneck all
the threads to go through a single point of send. rds-stress is an
example of this sort of usage (fairly typical in our applications).

[1] http://public-yum.oracle.com/repo/OracleLinux/OL6/ofed_UEK/x86_64//getPackageSource/rds-tools-2.0.7-1.12.el6.src.rpm

Consider 2 RDS sockets fd1 and fd2, each one sending to the same pair of
IP addresses: if fd1 gets bound to tcp sock1 and fd2 to tcp sock2, it's
very possible that the send completion is not in the same order as the
order of send. The application cannot know which socket gets bound to
which TCP connection (or even whether/how many tcp connections are
involved: mprds strives to make this transparent to the application).

The same problem exists for pure-datagram sockets like PF_PACKET, UDP
etc.: the application may send buf1 (datagram1) and buf2 (datagram2),
and buf2 may make it to the destination before buf1.

The "notifications arrive largely in order" observation may be mostly
true of a single-stream TCP connection, but not of the datagram models
(or even threaded tcp apps like iperf?).

> > +	tail = skb_peek_tail(q);
> > +	if (!tail ||
> > +	    SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> > +		__skb_queue_tail(q, skb);
> > +		skb = NULL;
> > +	}
>
> This drops the packet if the branch is not taken. In the TCP case this
> condition means that we can try to coalesce packets, but that is not
> the case here.
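(Coming back to the cb-packing option above: this is roughly the layout
I have in mind. Just a sketch to show that the sizes work out; the struct
and field names below are made up, this is not actual patch code.)

	#define RDS_MAX_ZCOOKIES	8

	struct rds_zcopy_cb {
		struct sock_extended_err zc_err;			/* 16 bytes */
		u32			 zc_cookies[RDS_MAX_ZCOOKIES];	/* 32 bytes */
	};

	/* 16 + 32 == 48 == sizeof(skb->cb); a
	 * BUILD_BUG_ON(sizeof(struct rds_zcopy_cb) > FIELD_SIZEOF(struct sk_buff, cb))
	 * somewhere in the setup path would keep that honest.
	 */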
good point about the skb being dropped when the branch is not taken; I'll
check into this and fix (same applies for the comments to patches 4/6 and
6/6).

> >  	} rdma;
> >  	struct rm_data_op {
> >  		unsigned int		op_active:1;
> > -		unsigned int		op_notify:1;
> > +		unsigned int		op_notify:1,
> > +					op_zcopy:1,
  :
> > +		struct rds_znotifier	*op_mmp_znotifier;
>
> not necessary if op_mmp_znotifier is NULL unless set in
> rds_message_copy_from_user

To make sure I don't misunderstand: are you suggesting that we don't need
op_zcopy, but can just check for the null-ness of op_mmp_znotifier (yes,
true, I agree)? Or something else?

--Sowmini
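PS: for the SO_EE_CODE comment further up, I'm assuming you mean
something along these lines (the flag name here is just a placeholder,
not a proposal):

	serr->ee.ee_errno = 0;
	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
	serr->ee.ee_code = SO_EE_CODE_ZCOOKIE;	/* new SO_EE_CODE flag */
	serr->ee.ee_data = ncookies;

i.e. so that a recvmsg(MSG_ERRQUEUE) caller can tell the cookie-carrying
notifications apart from the tcp-style byte-range ones without having to
look at the address family. Let me know if you had something else in mind.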