Hi Lukáš,

On 06/18/2018 11:50 PM, Cleber Rosa wrote:
> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
[...]
>> Hashed url dir
>> --------------
>>
>> I can see multiple options. Cleber proposed in
>> https://github.com/avocado-framework/avocado/pull/2652 to create, in
>> such a case, a dir based on "hash(url)" and store all assets of a given
>> url there. It seems fairly simple to develop and maintain, but the
>> cache might become hard to upkeep and there is a non-zero (though
>> nearly zero) possibility of clashes.
>>
>
> Definitely my goals were to keep it simple, and to fix the one issue at
> hand. My motivation is based on the idea that Avocado should be able
> to help test writers with common tasks in tests, but it's hardly going
> to contain in itself full-blown, complete solutions to generic
> problems such as downloading and caching files.
>
> Now, it'd be nice to describe the possible clashes (besides, of course,
> possible hash collisions). I can think of one:
>
> * Anonymous with url "http://foo/bar.zip"
> * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
>
> The clash exists because sha1("http://foo/bar.zip") is
> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
>
>> Another problem would be concurrent access, as we might start
>> downloading a file with the same name as the url dir, plus all kinds
>> of different clashes, and we'll only find out all the issues once
>> people start using this extensively.
>
> I'm not sure I follow. Right now the names of the *downloaded* files
> are temporary names. Then a lock for the destination name is acquired,
> verification is done, and if successful, the file is copied into the
> destination name.
>
>>
>> Result:
>>
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip # Fetched from https
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>> 6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip # Fetched from http
>>
>> + simple to develop
>> + simple to maintain
>
> Agreed.
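For clarity, a quick sketch of what the "hash(url) dir" scheme could look like (the function name and layout are only illustrative, not the actual code from PR 2652):

```python
import hashlib
import os


def asset_cache_path(cache_dir, url):
    """Return cache_dir/<sha1(url)>/<basename of url>."""
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
    basename = os.path.basename(url)
    return os.path.join(cache_dir, url_hash, basename)


# The same file served over http vs https lands in two different dirs,
# which is exactly the duplication shown in the "Result" listing above.
https_path = asset_cache_path("/cache", "https://www.example.org/foo.zip")
http_path = asset_cache_path("/cache", "http://www.example.org/foo.zip")
```

The clash discussed above follows directly from the dir name being a plain sha1 hexdigest: nothing distinguishes a hashed url from a url that already ends in a 40-char hex string.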
>
>> - possible clashes
>
> Agreed, ideally there should be none.
>
>> - hard to browse manually
>
> True. As a note, I'd trade the ease of browsing the cache for a simpler
> overall implementation, because the "browsing" could then be done with
> simple scripts reusing the same implementation (like a contrib script
> importing from "avocado.utils.asset").
As an end-user, I agree with Cleber that there is no real interest in
browsing the cache manually.

>
>> - API changes might lead to unusable files (users would have to remove
>> files manually)
>
> This too, but I really think this is minor. We may want to include
> cache stability along the same lines as API stability (on LTS
> releases), but I don't think we have made promises about it yet.
>
>>
>>
>> sqlite
>> ------
>>
>> Another approach would be to create an sqlite database in every
>> cache-dir. For anonymous assets nothing would change, but for specific
>> assets we'd create a new tmpdir per given asset and store the mapping
>> in the database.
>>
>> Result:
>>
>> .avocado_cache.sqlite
>> foo-1X3s/foo.zip
>> bar-3s2a/bar.zip
>> foo-t3d2/foo.zip
>>
>> where ".avocado_cache.sqlite" would contain:
>>
>> https://www.example.org/foo.zip foo-1X3s/foo.zip
>> https://www.example.org/bar.zip bar-3s2a/bar.zip
>> http://www.example.org/foo.zip foo-t3d2/foo.zip
>>
>> Obviously, by creating a db we could improve many things. A first
>> example would be to store an expiration date and, based on last
>> access recorded in the db, run cache-dir upkeep, removing outdated
>> assets.
>>
>> Another improvement would be to store the downloaded asset's hash and
>> re-download & update the hash when the file was modified, even when
>> the user didn't provide a hash.
>>
>> + easy to browse manually
>> + should be simple to expand the features (upkeep, security, ...)
>> + should simplify locks, as we can atomically move the downloaded
>> file & update the db. Even crashes should lead to predictable behavior
>> - slightly more complicated to develop

This looks a bit overkill.

>> - "db" file would have to be protected
>>
>>
>
> I'm a bit skeptical about the "improvements" here, not because they are
> not valid features, but because we don't need them at this point. I
> don't think future possibilities should shape too much of the immediate
> architecture decisions.
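Just to make the sqlite idea concrete, a rough sketch of the mapping table with the stdlib sqlite3 module (the schema and the db name are illustrative only, nothing here is an actual avocado implementation):

```python
import sqlite3


def open_cache_db(path):
    """Open (or create) the per-cache-dir asset mapping database."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS assets (
                      url TEXT PRIMARY KEY,
                      local_path TEXT NOT NULL,
                      expires REAL)""")
    return db


def record_asset(db, url, local_path, expires=None):
    # INSERT OR REPLACE keeps a single row per url, so re-downloading
    # an asset updates its mapping in one transaction.
    with db:
        db.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?)",
                   (url, local_path, expires))


def lookup_asset(db, url):
    row = db.execute("SELECT local_path FROM assets WHERE url = ?",
                     (url,)).fetchone()
    return row[0] if row else None


db = open_cache_db(":memory:")  # a real cache would use .avocado_cache.sqlite
record_asset(db, "https://www.example.org/foo.zip", "foo-1X3s/foo.zip")
record_asset(db, "http://www.example.org/foo.zip", "foo-t3d2/foo.zip")
```

The extra columns (expiration, stored hash) would hang off the same row, which is where the upkeep/re-download features mentioned above would plug in.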
>
>> Other solutions
>> ---------------
>>
>> There are many other solutions, like using `$basename-$url_hash` as
>> the name, or using `astring.string_to_safe_path` instead of url_hash,
>> and so on. We are open to suggestions.

[...]

suggestion:

    unified_url = unify_scheme(url)
    url_hash = SHA1(unified_url)
    content = fetch(url)
    if success:
        os.chdir(cache_dir)
        content_hash = SHA1(content)
        if not os.path.exists(content_hash):
            open(content_hash, "wb").write(content)
        os.symlink(content_hash, url_hash)

So various urls can point to the same cached file.
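To show the suggestion actually holds together, here is a runnable version: files are stored under their content hash, and a sha1(unified_url) symlink points at them, so several urls share one cached file. `unify_scheme()` is a stand-in for whatever url normalization is wanted (here it just treats http and https the same), and all names are my own:

```python
import hashlib
import os


def unify_scheme(url):
    # Placeholder normalization: collapse https onto http.
    return url.replace("https://", "http://", 1)


def cache_content(cache_dir, url, content):
    """Store content under its own hash; link sha1(unified url) to it."""
    url_hash = hashlib.sha1(unify_scheme(url).encode("utf-8")).hexdigest()
    content_hash = hashlib.sha1(content).hexdigest()

    content_path = os.path.join(cache_dir, content_hash)
    if not os.path.exists(content_path):
        with open(content_path, "wb") as asset:
            asset.write(content)

    link_path = os.path.join(cache_dir, url_hash)
    if not os.path.exists(link_path):
        # Relative symlink, so the cache dir can be moved as a whole.
        os.symlink(content_hash, link_path)
    return link_path
```

One property worth noting: identical payloads fetched from different mirrors deduplicate automatically, and the http/https duplication from the "Result" listing earlier in the thread disappears, at the cost of losing the original basename in the on-disk layout.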
