On Fri, 5 Dec 2014, Nikolaus Rath wrote:
> Shannon Dealy <de...@deatech.com> writes:
> [snip]
>> Given the scenario you were trying to fix with the change, perhaps a
>> better approach would be to fail if initial resolution fails, but if
>> the initial resolution succeeds, then the end point can reasonably be
>> assumed to exist and future failures should keep retrying, at least
>> for a substantial (possibly configurable) period of time. This
>> provides the immediate feedback for manual or scripting related
>> interactions, but once the file system is mounted focuses on
>> maintaining/recovering the connection.
> Yes, that would be the best solution. It's just ugly to code: either you
> have to pass a "do_not_fail" parameter all the way from the main function
> down to the low-level network routines, or you have to do the first lookup
> at a higher level (which then needs to know details about all the storage
> backends).
I would suggest that it should pass a timeout value rather than a
"do_not_fail" flag. This would give the application-level code the
greatest flexibility, allowing for both immediate and short-term failure
settings as well as an effective "never fail" by just passing a
ridiculously large value.
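
Something like the following is what I have in mind. This is just an
untested sketch with made-up names (resolve_with_deadline and its
parameters are not part of s3ql), but it shows how a single deadline
value threaded down to the resolution code could replace a "do_not_fail"
flag:

import socket
import time

def resolve_with_deadline(hostname, deadline, retry_interval=30):
    """Retry name resolution until *deadline* seconds have passed,
    then re-raise the last error."""
    start = time.monotonic()
    while True:
        try:
            return socket.getaddrinfo(hostname, 443)
        except socket.gaierror:
            if time.monotonic() - start >= deadline:
                raise
            time.sleep(retry_interval)

# At mount time a small deadline gives immediate feedback for manual or
# scripted use:
#     resolve_with_deadline(host, deadline=10)
# Once the file system is mounted, a ridiculously large deadline (days)
# effectively means "never give up":
#     resolve_with_deadline(host, deadline=14 * 86400)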
>> Personally, I would love it if I could simply keep the file system
>> mounted at all times, even when there is no network link, so that when
>> there is a connection I can simply start using it, and when the
>> connection goes away (even for a day or two), everything blocks and
>> simply picks up where it left off when the connection returns.
> Well, that should actually work already. When there is no network
> connection, s3ql retries for a long time. The problem only arises if
> there is partial connectivity.
I tried it on the older version of the code I was using and it never
recovered (not sure why). I don't know about the current version.
I had another filesystem crash last night, and while I can't say exactly
what sequence of events the file system saw, it was a complete network
failure that crashed it (not just DNS). This would imply that the test is
perhaps too sensitive right at the boundary of a network failure (perhaps
some packets get through, but most don't) and needs to retry over a longer
period of time before deciding whether the failure is network or DNS.
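
To illustrate the idea (a rough, untested sketch, not dugong's actual
code; classify_lookup_failure and dns_alive are names I made up): rather
than classifying the failure from a single probe, repeat the probe over a
longer window and only conclude that DNS is working if every attempt
agrees:

import time

def classify_lookup_failure(dns_alive, attempts=5, interval=60):
    """Decide how to treat a failed host lookup.  *dns_alive* is a
    callable returning True when a DNS probe succeeds.  Only conclude
    that DNS is working (and hence that the original lookup failure is
    likely permanent) if the probe succeeds on every attempt; any
    failure during the window is treated as a temporary network
    problem, so the caller keeps retrying."""
    for i in range(attempts):
        if not dns_alive():
            return 'temporary-network-failure'
        if i + 1 < attempts:
            time.sleep(interval)
    return 'dns-working'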
Upon further reflection:
I looked over your dugong code and have given some thought to what little
I know of the local network topology here, and my guess is that your test
for live DNS will always decide that DNS is up at my location whenever the
network connection fails. The ISP appears to be in another country, and
their local network in this building feeds about 500 rooms. It is likely
they are using a local caching DNS server (for the building, city or
country, it doesn't really matter which) which answers from its cache
anything it already knows about and forwards all other requests up to the
ISP's main servers. If that is the case, then any time the network gets
cut off between the local caching DNS server and the primary DNS servers,
the test will give the wrong answer: google.com will always resolve (since
everyone uses it, it will be in the local cache), but www.iana.org and
C.root-servers.org will not, since most internet users never have reason
to look up the latter two domains directly, so any recursive lookups of
them triggered by a local query will happen, and be cached, only at the
primary server.
Based on this, I would suggest that a more robust test would be to declare
a DNS failure if any of the three lookups fail (after the host lookup
previously failed). After all, a failure on any of these lookups would at
least suggest a serious problem with the internet which is a reasonably
likely cause of the initial host lookup failure.
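
A minimal sketch of the test I am suggesting (untested, using the three
hostnames mentioned above; this is not the actual dugong implementation):

import socket

PROBE_HOSTS = ('www.google.com', 'www.iana.org', 'C.root-servers.org')

def looks_like_network_failure():
    """Called after the real host lookup has already failed.  If any of
    the probe hosts fails to resolve, declare a DNS/network failure so
    the caller keeps retrying, instead of concluding that DNS is healthy
    and the original hostname is bad."""
    for host in PROBE_HOSTS:
        try:
            socket.getaddrinfo(host, 80)
        except socket.gaierror:
            return True
    return False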
FWIW.
Shannon C. Dealy | DeaTech Research Inc.
de...@deatech.com | - Custom Software Development -
USA Phone: +1 800-467-5820 | - Natural Building Instruction -
numbers : +1 541-929-4089 | www.deatech.com