On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasow...@redhat.com> wrote:
>
> On 2018年08月29日 03:57, Willem de Bruijn wrote:
> > On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasow...@redhat.com> wrote:
> >>
> >> On 2018年07月25日 08:17, Jon Olson wrote:
> >>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <m...@redhat.com> wrote:
> >>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <m...@redhat.com> wrote:
> >>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>>>> From the above linked patch, I understand that there are yet
> >>>>>>> other special cases in production, such as a hard cap on #tx queues to
> >>>>>>> 32 regardless of the number of vcpus.
> >>>>>> I don't think upstream kernels have this limit - we can
> >>>>>> now use vmalloc for a higher number of queues.
> >>>>> Yes, that patch* mentioned it as a google compute engine imposed
> >>>>> limit. It is exactly such cloud provider imposed rules that I'm
> >>>>> concerned about working around in upstream drivers.
> >>>>>
> >>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >>>> Yea. Why does GCE do it btw?
> >>> There are a few reasons for the limit, some historical, some current.
> >>>
> >>> Historically we did this because of a kernel limit on the number of
> >>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> >>> the limit upstream at the time we did it was actually eight. We had
> >>> increased the limit from eight to 32 internally, and it appears the
> >>> upstream limit has subsequently increased to 256. We no longer
> >>> use TAP for networking, so that constraint no longer applies for us,
> >>> but when looking at removing/raising the limit we discovered no
> >>> workloads that clearly benefited from lifting it, and it also placed
> >>> more pressure on our virtual networking stack, particularly on the Tx
> >>> side. We left it as-is.
> >>>
> >>> In terms of current reasons there are really two. One is memory usage.
> >>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> >>> that the guest will have an Rx queue for every Tx queue. We run our
> >>> individual virtqueues fairly deep (4096 entries) to give guests a wide
> >>> time window for re-posting Rx buffers and avoiding starvation on
> >>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> >>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> >>> be up to 512MB of memory posted for network buffers. Scaling this to
> >>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160), keeping
> >>> all of the Rx rings full would (in the large average Rx packet size
> >>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> >>> of RAM available, but I don't believe we've observed a situation where
> >>> they would have benefited from having 2.5 gigs of buffers posted for
> >>> incoming network traffic :)
> >> We can work to have async txq and rxq instead of pairs if there's a
> >> strong requirement.
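
For reference, the ring arithmetic above works out as follows. This is a
purely illustrative back-of-envelope program, assuming the 4096-entry rings
and 4096-byte mergeable buffers Jon describes:

#include <stdio.h>

int main(void)
{
        /* one rx vring: entries * bytes per posted mergeable buffer */
        unsigned long long per_ring = 4096ULL * 4096;

        printf("per rx ring: %llu MiB\n", per_ring >> 20);           /* 16 MiB */
        printf("32 queues:   %llu MiB\n", (32 * per_ring) >> 20);    /* 512 MiB */
        printf("160 queues:  %llu MiB\n", (160 * per_ring) >> 20);   /* 2560 MiB, ~2.5 GB */
        return 0;
}

That reproduces the 16MB-per-ring, 512MB (32 queues) and ~2.5GB (160 queues)
figures quoted above.
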
> >>> The second reason is interrupt related -- as I mentioned above, we
> >>> have found no workloads that clearly benefit from so many queues, but
> >>> we have found workloads that degrade. In particular, workloads that do
> >>> a lot of small packet processing but which aren't extremely latency
> >>> sensitive can achieve higher PPS by taking fewer interrupts across
> >>> fewer VCPUs due to better batching (this also incurs higher latency,
> >>> but at the limit the "busy" cores end up suppressing most interrupts
> >>> and spending most of their cycles farming out work). Memcache is a
> >>> good example here, particularly if the latency targets for request
> >>> completion are in the ~milliseconds range (rather than the
> >>> microseconds we typically strive for with TCP_RR-style workloads).
> >>>
> >>> All of that said, we haven't been forthcoming with data (and
> >>> unfortunately I don't have it handy in a useful form, otherwise I'd
> >>> simply post it here), so I understand the hesitation to simply run
> >>> with napi_tx across the board. As Willem said, this patch seemed like
> >>> the least disruptive way to allow us to continue down the road of
> >>> "universal" NAPI Tx and to hopefully get data across enough workloads
> >>> (with VMs small, large, and absurdly large :) to present a compelling
> >>> argument in one direction or another. As far as I know there aren't
> >>> currently any NAPI related ethtool commands (based on a quick perusal
> >>> of ethtool.h).
> >> As I suggested before, maybe we can (ab)use tx-frames-irq.
> > I forgot to respond to this originally, but I agree.
> >
> > How about something like the snippet below? It would be simpler to
> > reason about if we only allowed switching while the device is down, but
> > napi does not strictly require that.
> >
> > +static int virtnet_set_coalesce(struct net_device *dev,
> > +                               struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_SCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +       int napi_weight = 0;
> > +       bool running;
> > +       int i;
> > +
> > +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> > +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> > +               napi_weight = NAPI_POLL_WEIGHT;
> > +       }
> > +
> > +       /* disallow changes to fields not explicitly tested above */
> > +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
> > +               return -EINVAL;
> > +
> > +       if (napi_weight ^ vi->sq[0].napi.weight) {
> > +               running = netif_running(vi->dev);
> > +
> > +               for (i = 0; i < vi->max_queue_pairs; i++) {
> > +                       vi->sq[i].napi.weight = napi_weight;
> > +
> > +                       if (!running)
> > +                               continue;
> > +
> > +                       if (napi_weight)
> > +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> > +                                                      &vi->sq[i].napi);
> > +                       else
> > +                               napi_disable(&vi->sq[i].napi);
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int virtnet_get_coalesce(struct net_device *dev,
> > +                               struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_GCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +
> > +       memcpy(ec, &ec_default, sizeof(ec_default));
> > +
> > +       if (vi->sq[0].napi.weight)
> > +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> > +
> > +       return 0;
> > +}
>
> Looks good. Just one nit: maybe it's better to simply check against zero?
I wanted to avoid making napi and interrupt moderation mutually
exclusive. If the virtio-net driver ever gets true moderation support,
it should be able to work alongside napi. But I can make no-napi be 0
and napi be 1. That is future-proof, in the sense that napi is enabled
if there is any interrupt moderation.
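
For completeness, here is roughly how a userspace caller would drive the
proposed knob through the standard coalesce ioctl (the equivalent of
something like `ethtool -C $dev tx-frames 65537`). This is only an
illustrative sketch against the snippet above: "eth0" is a placeholder
device, the bit-16 mask mirrors tx_coalesce_napi_mask, it assumes the
bit-16 encoding rather than the 0/1 weight variant, and it needs
CAP_NET_ADMIN.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
                return 1;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);    /* placeholder device name */
        ifr.ifr_data = (void *)&ec;

        /* read the current coalesce settings first, as ethtool would */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("ETHTOOL_GCOALESCE");
                return 1;
        }

        /* request napi tx by setting bit 16, leaving tx-frames itself at 1 */
        ec.cmd = ETHTOOL_SCOALESCE;
        ec.tx_max_coalesced_frames |= (1u << 16);

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("ETHTOOL_SCOALESCE");
                return 1;
        }

        close(fd);
        return 0;
}

Clearing the bit and writing the settings back the same way would disable
napi tx again.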