Re: [Rd] parallel PSOCK connection latency is greater on Linux?

Jeff Mon, 09 Nov 2020 16:49:46 -0800

I do enjoy free lunch solutions if they exist.

That said, I think the abstraction proposed by Simon is reasonable. 
Whether it should be applied to TCP_NODELAY or TCP_QUICKACK is 
unfortunately beyond my Linux/networking knowledge.
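
For illustration only, here is a minimal C sketch of how a tri-state setting like the proposed delay=NA/TRUE/FALSE might map onto a boolean TCP socket option. The names `tri`, `apply_tcp_flag` and `read_tcp_flag` are hypothetical and do not come from R's sources. The sketch assumes a POSIX system and says nothing about how R_SockConnect() is actually structured.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical tri-state, mirroring an NA / TRUE / FALSE argument. */
enum tri { TRI_NA = -1, TRI_FALSE = 0, TRI_TRUE = 1 };

/* Set a boolean TCP-level option unless the caller asked to leave it alone.
 * Returns 0 on success or when the option is left untouched,
 * and -1 if setsockopt() fails. */
int apply_tcp_flag(int fd, int optname, enum tri value)
{
    if (value == TRI_NA)
        return 0;                       /* keep the platform default */
    int on = (value == TRI_TRUE);
    return setsockopt(fd, IPPROTO_TCP, optname, &on, sizeof on);
}

/* Read a boolean TCP-level option back: 1 if set, 0 if clear, -1 on error. */
int read_tcp_flag(int fd, int optname)
{
    int on = 0;
    socklen_t len = sizeof on;
    if (getsockopt(fd, IPPROTO_TCP, optname, &on, &len) != 0)
        return -1;
    return on != 0;
}
```

A caller would use something like `apply_tcp_flag(fd, TCP_NODELAY, TRI_TRUE)` right after creating the socket. The same helper could be pointed at TCP_QUICKACK, but that option is Linux-specific, and the Linux tcp(7) man page describes it as not permanent, so it would probably have to be set again after reads rather than once at connect time.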


Jeff Keller

On Wed, Nov 4, 2020 at 11:41, Iñaki Ucar <iu...@fedoraproject.org> wrote:
> Please, check a tcpdump session on localhost while running the 
> following script:
> 
> library(parallel)
> library(tictoc)
> cl <- makeCluster(1)
> Sys.sleep(1)
> 
> for (i in 1:10) {
>   tic()
>   x <- clusterEvalQ(cl, iris)
>   toc()
> }
> 
> The initialization phase comprises 7 packets. Then, the 1-second sleep
> will help you see where the evaluation starts. Each clusterEvalQ
> generates 6 packets:
> 
> 1. main -> worker PSH, ACK 1026 bytes
> 2. worker -> main ACK 66 bytes
> 3. worker -> main PSH, ACK 3758 bytes
> 4. main -> worker ACK 66 bytes
> 5. worker -> main PSH, ACK 2484 bytes
> 6. main -> worker ACK 66 bytes
> 
> The first two are the command and its ACK, the following are the data
> back and their ACKs. In the first 4-5 iterations, I see no delay at
> all. Then, in the following iterations, a 40 ms delay starts to happen
> between packets 3 and 4, that is: the main process delays the ACK to
> the first packet of the incoming result.
> 
> So I'd say Nagle is hardly to blame for this. It would be interesting
> to see how many packets are generated with TCP_NODELAY on. If there
> are still 6 packets, then we are fine. If we suddenly see a gazillion
> packets, then TCP_NODELAY does more harm than good. On the other hand,
> TCP_QUICKACK would surely solve the issue without any drawback. As
> Nagle himself put it once, "set TCP_QUICKACK. If you find a case where
> that makes things worse, let me know."
> 
> Iñaki
> 
> On Wed, 4 Nov 2020 at 04:34, Simon Urbanek <simon.urba...@r-project.org> wrote:
>> 
>>  I'm not sure the user would know ;). This is a very system-specific 
>> issue just because the Linux network stack behaves so differently 
>> from other OSes (for purely historical reasons). That makes it hard 
>> to abstract as a "feature" for the R sockets that are supposed to be 
>> platform-independent. At least TCP_NODELAY is actually part of POSIX 
>> so it is on better footing, and disabling delayed ACK is practically 
>> only useful to work around the other side having Nagle on, so I 
>> would expect it to be rarely used.
>> 
>>  This is essentially an RFC since we don't have a mechanism for socket 
>> options (well, almost, there is timeout and blocking already...) and 
>> I don't think we want to expose low-level details so perhaps one 
>> idea would be to add something like delay=NA to socketConnection() 
>> in order to not touch (NA), enable (TRUE) or disable (FALSE) 
>> TCP_NODELAY. I wonder if there is any other way we could infer the 
>> intention of the user to try to choose the right approach...
>> 
>>  Cheers,
>>  Simon
>> 
>> 
>>  > On Nov 3, 2020, at 02:28, Jeff <j...@vtkellers.com> wrote:
>>  >
>>  > Could TCP_NODELAY and TCP_QUICKACK be exposed to the R user so
>>  > that they might determine what is best for their potentially
>>  > latency- or throughput-sensitive application?
>>  >
>>  > Best,
>>  > Jeff
>>  >
>>  > On Mon, Nov 2, 2020 at 14:05, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>>  >> On Mon, 2 Nov 2020 at 02:22, Simon Urbanek <simon.urba...@r-project.org> wrote:
>>  >>> It looks like R sockets on Linux could do with TCP_NODELAY --
>>  >>> without (status quo):
>>  >>
>>  >> How many network packets are generated with and without it? If there
>>  >> are many small writes and thus setting TCP_NODELAY causes many small
>>  >> packets to be sent, it might make more sense to set TCP_QUICKACK
>>  >> instead.
>>  >>
>>  >> Iñaki
>>  >>
>>  >>> Unit: microseconds
>>  >>>                    expr      min       lq     mean  median       uq      max neval
>>  >>>  clusterEvalQ(cl, iris) 1449.997 43991.99 43975.21 43997.1 44001.91 48027.83  1000
>>  >>>
>>  >>> exactly the same machine + R but with TCP_NODELAY enabled in
>>  >>> R_SockConnect():
>>  >>>
>>  >>> Unit: microseconds
>>  >>>                    expr     min     lq     mean  median      uq      max neval
>>  >>>  clusterEvalQ(cl, iris) 156.125 166.41 180.8806 170.247 174.298 5322.234  1000
>>  >>>
>>  >>> Cheers,
>>  >>> Simon
>>  >>> > On 2/11/2020, at 3:39 AM, Jeff <j...@vtkellers.com> wrote:
>>  >>> >
>>  >>> > I'm exploring latency overhead of parallel PSOCK workers and
>>  >>> > noticed that serializing/unserializing data back to the main R
>>  >>> > session is significantly slower on Linux than it is on
>>  >>> > Windows/MacOS with similar hardware. Is there a reason for this
>>  >>> > difference and is there a way to avoid the apparent additional
>>  >>> > Linux overhead?
>>  >>> >
>>  >>> > I attempted to isolate the behavior with a test that simply
>>  >>> > returns an existing object from the worker back to the main R
>>  >>> > session.
>>  >>> >
>>  >>> > library(parallel)
>>  >>> > library(microbenchmark)
>>  >>> > gcinfo(TRUE)
>>  >>> > cl <- makeCluster(1)
>>  >>> > (x <- microbenchmark(clusterEvalQ(cl, iris), times = 1000, unit = "us"))
>>  >>> > plot(x$time, ylab = "microseconds")
>>  >>> > head(x$time, n = 10)
>>  >>> >
>>  >>> > On Windows/MacOS, the test runs in 300-500 microseconds
>>  >>> > depending on hardware. A few of the 1000 runs are an order of
>>  >>> > magnitude slower but this can probably be attributed to garbage
>>  >>> > collection on the worker.
>>  >>> >
>>  >>> > On Linux, the first 5 or so executions run at comparable
>>  >>> > speeds but all subsequent executions are two orders of magnitude
>>  >>> > slower (~40 milliseconds).
>>  >>> >
>>  >>> > I see this behavior across various platforms and hardware
>>  >>> > combinations:
>>  >>> >
>>  >>> > Ubuntu 18.04 (Intel Xeon Platinum 8259CL)
>>  >>> > Linux Mint 19.3 (AMD Ryzen 7 1800X)
>>  >>> > Linux Mint 20 (AMD Ryzen 7 3700X)
>>  >>> > Windows 10 (AMD Ryzen 7 4800H)
>>  >>> > MacOS 10.15.7 (Intel Core i7-8850H)
>>  >>> >
>>  >>> > ______________________________________________
>>  >>> > R-devel@r-project.org mailing list
>>  >>> > <https://stat.ethz.ch/mailman/listinfo/r-devel>
>>  >>> >
>>  >> --
>>  >> Iñaki Úcar
>>  >
>>  >
>> 
> 
> 
> --
> Iñaki Úcar



