On Fri, 5 Dec 2014, Nikolaus Rath wrote:
> Shannon Dealy <de...@deatech.com> writes:
> [snip]
>> Given the scenario you were trying to fix with the change, perhaps a
>> better approach would be to fail if initial resolution fails, but if
>> the initial resolution succeeds, then the end point can reasonably be
>> assumed to exist and future failures should keep retrying, at least
>> for a substantial (possibly configurable) period of time. This
>> provides the immediate feedback for manual or scripting related
>> interactions, but once the file system is mounted focuses on
>> maintaining/recovering the connection.
> Yes, that would be the best solution. It's just ugly to code: either you
> have to pass a "do_not_fail" parameter all the way from the main function
> down to the low-level network routines, or you have to do the first lookup
> at a higher level (which then needs to know details about all the storage
> backends).
I would suggest that it should pass a timeout value rather than a
"do_not_fail" flag. This would give the application-level code the
greatest flexibility, allowing for both immediate and short-term failure
settings as well as an effective "never fail" by just passing a
ridiculously large value.
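
Something like the following is what I have in mind. This is just an
untested sketch with made-up names (resolve_with_deadline and its
parameters are not part of s3ql), but it shows how a single deadline
value threaded down to the resolution code could replace a "do_not_fail"
flag:

import socket
import time

def resolve_with_deadline(hostname, deadline, retry_interval=30):
    """Retry name resolution until *deadline* seconds have passed,
    then re-raise the last error."""
    start = time.monotonic()
    while True:
        try:
            return socket.getaddrinfo(hostname, 443)
        except socket.gaierror:
            if time.monotonic() - start >= deadline:
                raise
            time.sleep(retry_interval)

# At mount time a small deadline gives immediate feedback for manual or
# scripted use:
#     resolve_with_deadline(host, deadline=10)
# Once the file system is mounted, a ridiculously large deadline (days)
# effectively means "never give up":
#     resolve_with_deadline(host, deadline=14 * 86400)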
>> Personally, I would love it if I could simply keep the file system
>> mounted at all times, even when there is no network link, so that when
>> there is a connection I can simply start using it, and when the
>> connection goes away (even for a day or two), everything blocks and
>> simply picks up where it left off when the connection returns.
> Well, that should actually work already. When there is no network
> connection, s3ql retries for a long time. The problem only arises if
> there is partial connectivity.
I tried it on the older version of the code I was using and it never
recovered (not sure why). I don't know about the current version.
I had another filesystem crash last night, and while I can't say exactly
what sequence of events the file system saw, it was a complete network
failure that crashed it (not just DNS). This would imply that the test is
perhaps too sensitive right at the boundary of a network failure (perhaps
some packets get through, but most don't) and needs to retry over a longer
period of time before deciding whether the failure is network or DNS.
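
To illustrate the idea (a rough, untested sketch, not dugong's actual
code; classify_lookup_failure and dns_alive are names I made up): rather
than classifying the failure from a single probe, repeat the probe over a
longer window and only conclude that DNS is working if every attempt
agrees:

import time

def classify_lookup_failure(dns_alive, attempts=5, interval=60):
    """Decide how to treat a failed host lookup.  *dns_alive* is a
    callable returning True when a DNS probe succeeds.  Only conclude
    that DNS is working (and hence that the original lookup failure is
    likely permanent) if the probe succeeds on every attempt; any
    failure during the window is treated as a temporary network
    problem, so the caller keeps retrying."""
    for i in range(attempts):
        if not dns_alive():
            return 'temporary-network-failure'
        if i + 1 < attempts:
            time.sleep(interval)
    return 'dns-working'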
Upon further reflection:
I looked over your dugong code and have given some thought to what little
I know of the local network topology here, and my guess is that your test
for live DNS will always decide that DNS is up at my location whenever the
network connection fails. The ISP appears to be in another country, and
their local network in this building feeds about 500 rooms. It is likely
they are using a local caching DNS server (for the building, city or
country, it doesn't really matter which) which answers from its cache
anything it already knows about and forwards all other requests up to the
ISP's main servers. If that is the case, then any time the network gets
cut off between the local caching DNS server and the primary DNS servers,
the test will give the wrong answer: google.com will always resolve (since
everyone uses it, it will be in the local cache), but www.iana.org and
C.root-servers.org will not, since most internet users never have reason
to look up the latter two domains directly, so any recursive lookups of
them triggered by a local query will happen, and be cached, only at the
primary server.
Based on this, I would suggest that a more robust test would be to declare
a DNS failure if any of the three lookups fail (after the host lookup
previously failed). After all, a failure on any of these lookups would at
least suggest a serious problem with the internet which is a reasonably
likely cause of the initial host lookup failure.
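
A minimal sketch of the test I am suggesting (untested, using the three
hostnames mentioned above; this is not the actual dugong implementation):

import socket

PROBE_HOSTS = ('www.google.com', 'www.iana.org', 'C.root-servers.org')

def looks_like_network_failure():
    """Called after the real host lookup has already failed.  If any of
    the probe hosts fails to resolve, declare a DNS/network failure so
    the caller keeps retrying, instead of concluding that DNS is healthy
    and the original hostname is bad."""
    for host in PROBE_HOSTS:
        try:
            socket.getaddrinfo(host, 80)
        except socket.gaierror:
            return True
    return False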
FWIW.
Shannon C. Dealy | DeaTech Research Inc.
de...@deatech.com | - Custom Software Development -
USA Phone: +1 800-467-5820 | - Natural Building Instruction -
numbers : +1 541-929-4089 | www.deatech.com