Re: [asterisk-dev] [Code Review] 3402: http: response body frequently missing when running in virtual machine

Scott Griepentrog Thu, 27 Mar 2014 16:24:11 -0700


> On March 27, 2014, 3:30 p.m., Mark Michelson wrote:
> > My only problem with this patch is that I really want to know for sure that 
> > there is something low level and buggy happening that requires such a 
> > change before approving it. So I have a few questions:
> > 
> > 1) I notice you're using both fflush() and shutdown(). If you remove the 
> > fflush() does the bug still occur? What about if you add the fflush() but 
> > keep the close() in instead of using shutdown()? The reason I ask is that 
> > shutdown() and close() are both supposed to send a TCP FIN and flush out 
> > the write buffer on its own, so in theory the fflush() call should not be 
> > necessary. However, if removing fflush() causes the bug to return, then 
> > that leads to my next question:
> > 
> > 2) Did this bug manifest itself on many HTTP client implementations or was 
> > it only seen on one? When calling close() or shutdown(), if there is data 
> > in the socket's write buffer, this data may be sent in the TCP FIN that 
> > Asterisk sends. It may be that whatever HTTP client is being used ignores 
> > data in TCP FIN and that may be where the real bug lies. It would be 
> > interesting to use TCPdump or some tool to monitor the HTTP traffic to see 
> > if there is any correlation between the way the HTTP traffic is sent and 
> > when the bug manifests itself.
> > 
> > Overall, I'm just not comfortable saying that this fixes a race condition 
> > in the kernel since I don't know what due diligence has been done to prove 
> > that. On the other hand, I also can't really fault the changes here if 
> > they're actually helping improve the situation for an HTTP client. I guess 
> > what I'm saying is that the comments and the reasons for the code changes 
> > are bothering me more than the actual code changes themselves :)

1) The use of fflush is strictly a CYA to insure that the last data from fwrite 
got pushed by clib to write() prior to the low level shutdown call, and does 
not resolve the problem on it's own.  Nor does removing it cause the failure 
case to return.

2) The bug manifests itself only under very specific conditions: 

  a) POST not GET 
  b) Client is not on same ip/tcp stack/host as Asterisk
  c) request coming from json Shred user-agent (unable to duplicate failure 
using python requests - can't get exact matching request headers)

I used to believe that a virtual machine was a requirement, but it also fails 
between two physical machines - so long as the client and host have to actually 
transmit IP packets between them it can fail (even across WiFi and VPN it 
fails).

Under these circumstances, bug is about 90% failure rate (every now and then 
the correct body message does arrive).  Switching the fwrite() call with the 
body to a write() does not resolve issue.  Using write() to send an extra byte 
before the body results in that character arriving and not the body.  Strangely 
enough however, writing an extra byte after the body still results in neither 
arriving (may have to do with length of data written).

Exact test results with original body delivery code and alternate close 
sequences:

Original code that fails:
        fclose(ser->f);

Works:
        shutdown(ser->fd,2);
        fclose(ser->f);

Fails:
        close(ser->fd);
        fclose(ser->f);

Fails:
        fflush(ser->f);
        fclose(ser->f);

Fails:
        fflush(ser->f);
        close(ser->fd);

Technically, my comments mention a race condition existing without specifying 
where.  Honestly, I'm not entirely certain that it's in the kernel, although 
it's the most likely culprit at this point.

- Scott

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3402/#review11401
-----------------------------------------------------------

On March 27, 2014, 1:52 p.m., Scott Griepentrog wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/3402/
> -----------------------------------------------------------
> 
> (Updated March 27, 2014, 1:52 p.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Bugs: ASTERISK-23548
>     https://issues.asterisk.org/jira/browse/ASTERISK-23548
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> When running asterisk in a virtual machine, responses to ARI requests would 
> frequently be missing.  A race condition related to closing the socket 
> immediately after writing data onto it is resolved in this patch by insuring 
> the output stream is flushed, and then informing TCP of the shutdown prior to 
> the close.
> 
> 
> Diffs
> -----
> 
>   /branches/12/main/tcptls.c 411242 
>   /branches/12/main/manager.c 411242 
>   /branches/12/main/http.c 411242 
> 
> Diff: https://reviewboard.asterisk.org/r/3402/diff/
> 
> 
> Testing
> -------
> 
> Tested on Ubuntu server 12.04 with Sam's json api test from issue.
> 
> 
> Thanks,
> 
> Scott Griepentrog
> 
>

-- 
_____________________________________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --

asterisk-dev mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-dev

Re: [asterisk-dev] [Code Review] 3402: http: response body frequently missing when running in virtual machine

Reply via email to