Final followup to close the loop on this.

After debugging rsync, ssh, and finally Cygwin, the problem turned out to be a 
D-Link router doing (a bad job of) QoS processing.

rsync, ssh, and Cygwin each appear to have operated exactly correctly, 
including pipe(), select(), stdin/stdout, and Windows socket handling.

Thanks,
Devin


-----Original Message-----
Sent: Tuesday, July 16, 2013 12:04 AM
Subject: RE: ssh.exe on cygwin: Write error

Dear Cygwin list;

So I've made some progress on the problem with ssh I started out trying to 
solve... unfortunately, it's got me in select.cc in Cygwin.

Basically, the ssh.exe program operates as follows:

ssh sets up a connection, and starts client_loop;

client_loop monitors (in the debugging case) a single channel. It checks to see 
if input is to be read (from stdin in this case), and checks if there's data to 
write from an output buffer and also if select() says the outbound connection 
is writable. In the case of debugging, the network connection from ssh.exe to 
the server is on fd 3.

If there's data to read, it reads it into a buffer.

If there's data to send in the output buffer AND select() says that fd 3 is 
writable, then it calls packet_write_poll, which then calls roaming_write, 
which does a write() on the fd.  If there's a failure to write(), then 
packet_write_poll sees what the error is. EAGAIN, EINTR, and EWOULDBLOCK (same 
as EAGAIN on Cygwin) are non-fatal. Any other error is fatal.
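
The error handling described above can be sketched roughly as follows. This is 
only an illustration in Python, not the OpenSSH source: write_poll and 
NON_FATAL are hypothetical names, and the real packet_write_poll does more 
than this.

```python
import errno

# Errors the description above treats as non-fatal: the caller simply
# goes back to select() and tries again later.
NON_FATAL = {errno.EAGAIN, errno.EINTR, errno.EWOULDBLOCK}


def write_poll(sock, outbuf):
    """Attempt one send on a non-blocking socket.

    Hypothetical helper loosely modelled on the behaviour described for
    packet_write_poll. Returns the number of bytes the kernel accepted,
    or 0 if the write should be retried later.
    """
    try:
        return sock.send(outbuf)
    except OSError as e:
        if e.errno in NON_FATAL:
            return 0  # nothing written; wait for select() to report writable
        raise  # any other errno (for example ECONNRESET) is fatal
```

Under this sketch an EAGAIN is harmless on its own; the fatal path is only 
reached when a later write() fails with something like ECONNRESET.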


In debugging, what happens is that client_loop is processing away just 
fine. As it happens, it's reading more data on stdin than it is writing to the 
network. It is happily writing data on the outbound socket, using write() as 
called by roaming_write as called by packet_write_poll. At some point, 
something "bad" occurs.

1. select() says that fd 3 (outbound connection) is writable to the 
network.

2. write() is attempted, but fails with error 11 (EAGAIN).

3. Many (probably 50-100) calls to select() say that the socket is not 
writable, and a packet trace on the server side confirms that the flow of 
packets has completely stopped. I can see that peek_socket() in select.cc is 
returning 'peek_socket: read_ready: 0, write_ready: 0, except_ready: 0' in the 
strace.

4. After some time (30 seconds) select() reports fd 3 as both readable and 
writable. ssh tries to read from fd 3, but gets an error 104 (ECONNRESET). It 
subsequently tries to write on the socket, and also gets an error 104 
(ECONNRESET).

5. Since the write() failed, the error is returned to roaming_write, which 
returns it to packet_write_poll. This prints the fatal error "Write failed: 
connection reset by peer".

6. Interestingly, the server side has not issued a TCP RST. In fact, from 
the server's perspective, it just looks like the TCP connection stalled 
(this happens right at the error 11). The server side isn't shut down till some 
time later.

7. Definitely, the connection does get 'backed up' so to speak - i.e. I'm 
pushing more data than the internet connection can handle without blocking to 
process data, and I would expect select() and/or write() to fail waiting for 
the network to clear some buffers. That said, it's almost like the socket dies 
or needs to reset or something after the error 11 (EAGAIN).

8. I don't see any signals or timeouts happening. Also, I've retested with 
Cygwin 1.7.21 with no additional success.
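
For comparison, the ordinary backpressure described in item 7 can be 
reproduced locally. The following is a minimal sketch in Python using a local 
socket pair rather than a real ssh connection; fill_until_blocked, 
is_writable, and demo are hypothetical names. It assumes the usual behaviour 
of a non-blocking stream socket: once the send buffer is full, send() fails 
with EAGAIN and select() stops reporting the socket as writable until the 
peer drains some data.

```python
import select
import socket


def fill_until_blocked(sock, chunk=b"x" * 4096):
    """Send on a non-blocking socket until the kernel buffer is full."""
    sent = 0
    while True:
        try:
            sent += sock.send(chunk)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK
            return sent


def is_writable(sock, timeout=0.0):
    """Ask select() whether the socket is currently writable."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def demo():
    a, b = socket.socketpair()
    a.setblocking(False)
    try:
        sent = fill_until_blocked(a)      # push until EAGAIN
        blocked = not is_writable(a)      # select() should now say "not writable"
        received = 0
        while received < sent:            # peer drains everything that was sent
            readable, _, _ = select.select([b], [], [], 5.0)
            if not readable:
                break                     # give up rather than hang
            received += len(b.recv(65536))
        recovered = is_writable(a, 5.0)   # writable again once drained
        return sent, blocked, received, recovered
    finally:
        a.close()
        b.close()
```

In this healthy case the socket becomes writable again as soon as the reader 
catches up, which is the recovery that never happens in steps 3 and 4 above.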


I'm going to keep looking, but any thoughts with the new information?

Thanks,
Devin




--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple