Hi,

I get DNS lookup timeouts many times a day when running Go HTTP clients on 
Kubernetes, where UDP packets are sometimes dropped due to a race condition 
in the Linux kernel. I'd like to propose a small standard library change to 
help debug these scenarios and get some feedback on the idea and 
implementation here before posting it as a Github issue.

For any http.Client use with a timeout set on the transport, when the 
timeout is exceeded and an error is returned it can be challenging to 
discover the cause of the timeout from the error message. Is it DNS related 
or TCP related or something else? Consider the following simulation of a 
DNS lookup that does not return within the timeout.

package main
  
import (
    "context"
    "net"
    "net/http"
    "time"
)

var timeout = 50 * time.Millisecond
var host = "http://example.com/";

// delayedDialer introduces a delay when performing DNS lookups.
func delayedDialer(ctx context.Context, network, address string) (net.Conn, 
error) {
    time.Sleep(2 * timeout)
    d := net.Dialer{}
    return d.DialContext(ctx, network, address)
}

func main() {
    r := &net.Resolver{
        PreferGo: true,
        Dial:     delayedDialer,
    }
    tr := &http.Transport{
        DialContext: (&net.Dialer{
            Timeout:  timeout,
            Resolver: r,
        }).DialContext,
    }
    client := &http.Client{Transport: tr}
    _, err := client.Get(host)
    if err != nil {
        panic(err)
    }
}

In the event of a DNS lookup timeout, the output is:

panic: Get "http://example.com/": dial tcp: i/o timeout

In the event of a TCP connection timeout (comment out the time.Sleep() 
call) the output is:

panic: Get "http://example.com/": dial tcp 93.184.216.34:80: i/o timeout

The missing IP address in the DNS lookup timeout output is the only clue 
that the root cause is a DNS problem, rather than a TCP problem. If users 
do not expect to see an IP address in that error message, they won't know 
it is missing, and their debugging and testing effort may be wasted 
investigating possible TCP-related causes, instead of investigating DNS 
(which normally runs over UDP).

The error message could communicate the root cause to users more 
clearly. After a DNS lookup timeout, the error returned by client.Get() 
looks like this in Go 1.14.3:

&url.Error{
  Op:  "Get",
  URL: "http://example.com/";,
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &poll.TimeoutError{},
  },
}

Note: The &poll.TimeoutError{} will become &net.timeoutError{} in future 
releases after commit d422f54 
<https://github.com/golang/go/commit/d422f54619b5b6e6301eaa3e9f22cfa7b65063c8>
.

The resolver in net/lookup.go 
<https://github.com/golang/go/blob/master/src/net/lookup.go#L316> is the 
source of the *poll.TimeoutError. Instead it could return a *net.DNSError 
error when a lookup times out. Doing so would provide additional error 
context for all use cases (not only http.Client) and may save debugging 
time. For example client.Get() could return:

&url.Error{
  Op:  "Get",
  URL: "http://example.com/";,
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &net.DNSError{
      Err:         "i/o timeout",
      Name:        "example.com",
      Server:      "",
      IsTimeout:   true,
      IsTemporary: false,
      IsNotFound:  false,
    },
  },
}

This would preserve the existing "i/o timeout" string and the ability to 
call err.Timeout(), while adding "lookup example.com: " to the Error() 
output and the capability to explicitly check for a *net.DNSError. The 
output from my simulator code would also be more explicit:

panic: Get "http://example.com/": dial tcp: lookup example.com: i/o timeout

The following change to implement this works in my usage and passes the 
tests in all.bash.

$ git diff
diff --git a/src/net/lookup.go b/src/net/lookup.go
index 5f7119872a..b1a22dd4e6 100644
--- a/src/net/lookup.go
+++ b/src/net/lookup.go
@@ -314,6 +314,9 @@ func (r *Resolver) lookupIPAddr(ctx context.Context, 
network, host string) ([]IP
                        }()
                }
                err := mapErr(ctx.Err())
+               if t, ok := err.(timeout); ok && t.Timeout() {
+                       err = &DNSError{Err: err.Error(), Name: host, 
IsTimeout: true}
+               }
                if trace != nil && trace.DNSDone != nil {
                        trace.DNSDone(nil, false, err)
                }

I can submit a Github issue and pull request if anyone thinks that this is 
a good idea and implementation.

Thanks.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/26d8e190-107e-4dd1-9de0-cab02fe5699a%40googlegroups.com.

Reply via email to