I am getting some really hard to understand failures.

I have three managed machines, running Ubuntu 14.04.03 --- which normally 
has Linux kernel 3.13.  I run Ansible 1.9.4 on my Mac (OS 10.10.5) to 
manage them.  The managed machines are VMs in a cloud service, on a 
datacenter subnet, and I use an OpenVPN tunnel between them and my Mac.  I 
have a playbook, and it runs.  All is well.  BTW: I have no ansible.cfg 
anywhere (I set the SL_INVENTORY envar to point at the appropriate dynamic 
inventory program), and none of my SSH config files say anything about 
TCPKeepAlive or ServerAlive<whatever>.

Then on the managed machines I run (as root) `apt-get install 
linux-generic-lts-wily`, which upgrades the kernel to 4.2.  After that my 
Ansible is flaky.  When I run my playbook, usually one or two of the three 
managed machines fails the gather-facts; the remainder fail in the first 
few tasks.  Which fails where is variable, so I think this is due to some 
intermittent problem.

Thinking it might be a connectivity flakiness, I created `ansible.cfg` that 
specifies a timeout of 100, and set `TCPKeepAlive no` and 
`ServerAliveCountMax 100` in my SSH client config (leaving 
ServerAliveInterval unspecified everywhere, which I think disables that 
function).  These did not solve the problem.  The playbook runs to full 
failure in under 30 seconds.

The failure message shows a truncated reply from the remote machine and 
then some SSH debugging that ends with a complaint about a broken pipe.  
Here is an example from the gather-facts step (I passed `-vvvv` to 
`ansible-playbook`):

GATHERING FACTS 
*************************************************************** 
<10.140.155.166> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.166> REMOTE_MODULE setup
<10.140.155.161> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE setup
<10.140.155.161> REMOTE_MODULE setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.166 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573'
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274'
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.161 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077'
<10.140.155.166> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpFwQwuI TO setup
<10.140.155.161> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpcY2it_ TO setup
<10.140.155.162> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpuXklyy TO setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.166 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=spxlnvdjayufeyagwywqhtdtptisrnej] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.161 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=dprrxybimhltrqziynaikzahzsgdjmlc] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-dprrxybimhltrqziynaikzahzsgdjmlc; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=bnojgeocfbeoxbgakzuluppvtxaviuyr] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-bnojgeocfbeoxbgakzuluppvtxaviuyr; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
failed: [10.140.155.166] => {"failed": true, "parsed": false}
BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej
{"verbose_override": true, "changed": false, "ansible_facts": {*<lots of 
gather-facts output here, but it is not complete/>*OpenSSH_6.2p2, OSSLShim 
0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9654
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.166 closed.

>From the same run, here is a failure at a task for the one machine that 
succeeded to gather-facts in this case:

TASK: [bootstrap | add any dependent repository keys from url] 
**************** 
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE apt_key 
url=http://10.140.132.215:8082/Release.pgp
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'mkdir -p 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && chmod a+rx 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && echo 
$HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812'
<10.140.155.162> PUT 
/var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpxHKer5 TO apt_key
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o 
ControlPersist=60s -o 
ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o 
KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 
10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, 
key=jsmmpfgpouhdlwmcvgmwvbhidafftccn] password: " -u root /bin/sh -c 
'"'"'echo BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn; LANG=en_US.UTF-8 
LC_CTYPE=en_US.UTF-8 /usr/bin/python apt_key'"'"''
failed: [10.140.155.162] => (item={'repo': u'deb 
http://10.140.132.215:8082/ docker main', 'key_url': 
u'http://10.140.132.215:8082/Release.pgp'}) => {"failed": true, "item": 
{"key_url": "http://10.140.132.215:8082/Release.pgp";, "repo": "deb 
http://10.140.132.215:8082/ docker main"}, "parsed": false}
BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn
OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9658
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.162 closed.


In desperation I captured all packets on the VPN network interface on my 
Mac during that run, then looked at it with Wireshark.  I see one 
long-lived connection for each of the three managed machines, plus one 
short-lived connection for 10.140.155.162.

I see no end to the long-lived connections.  Just to make sure it was not 
due to a missing buffer flush in the packet capture, I did a `ping` of one 
managed machine after `ansible-playbook` finished; I see the `ping` in the 
captured packets.  So how can Ansible suffer from truncated replies when 
the connections stay open?  How can the connections stay open after 
`ansible-playbook` exits?

The short-lived connection is even stranger.  It comes late in the trace.  
The client closes it soon after the completion of the three-packet 
handshake that opens the connection.  The client sends no TCP payload at 
all, not even the "client hello" of TLS.  The server sends a "server 
hello", which prompts the client to respond with a RST.  What the heck is 
going on with that?

Thanks for any clues,
Mike

-- 
You received this message because you are subscribed to the Google Groups 
"Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ansible-project/2420a36a-a5ec-4dbc-b65a-5185793c4c0a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to