On 06/19/2018 04:50 AM, Cleber Rosa wrote:
>
>
> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
>> Hello guys,
>>
>> As Cleber pointed out at the release meeting, we are struggling with our
>> asset fetcher. It was designed with one goal, to cache arbitrary files by
>> name, and it fails when different assets use the same name, as it simply
>> assumes they are the same file.
>>
>> Let's have a look at this example:
>>
>> fetch("https://www.example.org/foo.zip")
>> fetch("https://www.example.org/bar.zip")
>> fetch("http://www.example.org/foo.zip")
>>
>> Currently the third fetch simply reuses the "foo.zip" downloaded from
>> "https", even though it could be a completely different file (or one
>> downloaded from a completely different url). This is both good and bad.
>> It's good when you're downloading **any** "ltp.tar.bz2", or **any**
>> "netperf.zip", but if you are downloading "vmlinuz", which is always
>> called "vmlinuz" but comes from a different subdirectory, it might lead
>> to big problems.
>>
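
For reference, the name-based lookup essentially boils down to this (a
minimal sketch of the behavior, not the actual `avocado.utils.asset`
code):

    import os
    from urllib.parse import urlparse

    def cache_key(url):
        # The cache is keyed on the basename alone, so scheme, host
        # and directory are all ignored when looking up a cached file.
        return os.path.basename(urlparse(url).path)

    cache_key("https://www.example.org/foo.zip")  # -> "foo.zip"
    cache_key("http://www.example.org/foo.zip")   # -> "foo.zip", clash
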
>
> Right, all your observations so far are correct.
>
>> From this I can see two modes of assets, anonymous and specific. Instead
>> of trying to detect this based on combinations of hashes and methods, I'd
>> suggest being explicit and either adding it as an extra argument, or even
>> creating new classes `AnonymousAsset` and `SpecificAsset`, where
>> `AnonymousAsset` would be the current implementation and we'd still need
>> to decide on the `SpecificAsset` implementation. Let's discuss some
>> approaches and use the following assets in the examples:
>
> Given that `fetch_asset()` is the one interface test writers are
> encouraged to use, having different classes is really an implementation
> detail at this point. For this discussion, I do see value in naming the
> different modes, though.
>
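
To have something concrete to point at, one hypothetical way to make
the mode explicit at the `fetch_asset()` level could be the following;
the argument name, values and helpers are illustrative only, nothing
agreed on:

    def fetch_asset(name_or_url, asset_hash=None, mode="anonymous"):
        # Hypothetical sketch: "anonymous" keeps today's name-based
        # caching, "specific" ties the cached file to the exact url
        # it was fetched from.
        if mode == "anonymous":
            return fetch_by_name(name_or_url, asset_hash)    # hypothetical helper
        if mode == "specific":
            return fetch_by_origin(name_or_url, asset_hash)  # hypothetical helper
        raise ValueError("unknown mode: %s" % mode)
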
>>
>>
>> Current implementation
>> ----------------------
>>
>> The current implementation is anonymous-only, so the third fetch simply
>> returns the "foo.zip" fetched by the first one.
>>
>> Result:
>>
>> foo.zip
>> bar.zip # This one is fetched from "https"
>>
>>
>> + simplest
>> - leads to clashes
>>
>>
>
> Needless to say, this is what we're moving away from.
>
>> Hashed url dir
>> --------------
>>
>> I can see multiple options. Cleber proposed in
>> https://github.com/avocado-framework/avocado/pull/2652 to create, in
>> such a case, a dir based on "hash(url)" and store all assets of a given
>> url there. It seems fairly simple to develop and maintain, but the cache
>> might become hard to upkeep, and there is a non-zero (though nearly
>> zero) possibility of clashes.
>>
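
If it helps the discussion, the layout above essentially amounts to
something like this; I'm assuming the hash covers the url without the
trailing file name, which is the reading that matches the "Result"
listing below (the actual PR may differ):

    import hashlib
    import os

    def hashed_cache_path(cache_dir, url):
        # One sub-directory per download location: hash the url minus
        # the file name, so assets fetched from the same place share a
        # directory, while each asset keeps its original basename.
        base, name = url.rsplit("/", 1)
        url_hash = hashlib.sha1(base.encode("utf-8")).hexdigest()
        return os.path.join(cache_dir, url_hash, name)
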
>
> Definitely my goals were to keep it simple, and to fix the one issue at
> hand. My motivation is based on the idea that Avocado should be able
> to help test writers with common tasks in tests, but it's hardly going
> to contain in itself full-blown and complete solutions to generic
> problems such as downloading and caching files.
>
> Now, it'd be nice to describe the possible clashes (besides, of course,
> possible hash collisions). I can think of one:
>
> * Anonymous with url "http://foo/bar.zip"
> * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
>
> The clash exists because sha1("http://foo/bar.zip") is
> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
>

Yep, this is one possible clash. Another is for two different urls to
produce the same hash. As I mentioned, it's not frequent, but it can
happen. Note we can decrease the probability by prefixing something,
e.g. "url_$sha".

>> Another problem would be concurrent access, as we might start
>> downloading a file with the same name as a url dir, and all kinds of
>> different clashes, and we'll only find out all the issues when people
>> start extensively using this.
>
> I'm not sure I follow. Right now the names of the *downloaded* files
> are temporary names. Then, a lock for the destination name is
> acquired, verification is done, and if successful, the file is copied
> into the destination name.
>
>>
>> Result:
>>
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip # Fetched from https
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>> 6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip # Fetched from http
>>
>> + simple to develop
>> + simple to maintain
>
> Agreed.
>
>> - possible clashes
>
> Agreed, ideally there should be none.
>
>> - hard to browse manually
>
> True. As a note, I'd trade the ease of browsing the cache for a
> simpler overall implementation, because the "browsing" could then be
> done with simple scripts reusing the same implementation (like a
> contrib script importing from "avocado.utils.asset").
>

Not really: you can't reconstruct the url from its hash, so all you'd
know is that you have these specifically-downloaded files from
somewhere, how old they are, and when they were last used. Better than
nothing, but far from perfect.

>> - API changes might lead to unusable files (users would have to
>> remove files manually)
>
> This too, but I really think this is minor. We may want to include
> cache stability along the same lines as the API stability (on LTS
> releases), but I don't think we've made promises about it yet.
>

Sure, I know users can simply remove everything once in a while...

>>
>>
>> sqlite
>> ------
>>
>> Another approach would be to create an sqlite database in every
>> cache-dir. For anonymous assets nothing would change, but for
>> specific assets we'd create a new tmpdir per given asset and store
>> the mapping in the database.
>>
>> Result:
>>
>> .avocado_cache.sqlite
>> foo-1X3s/foo.zip
>> bar-3s2a/bar.zip
>> foo-t3d2/foo.zip
>>
>> where ".avocado_cache.sqlite" would contain:
>>
>> https://www.example.org/foo.zip foo-1X3s/foo.zip
>> https://www.example.org/bar.zip bar-3s2a/bar.zip
>> http://www.example.org/foo.zip foo-t3d2/foo.zip
>>
>> Obviously, by creating a db we could improve many things. A first
>> example would be to store an expiration date and, based on the last
>> access recorded in the db, run cache-dir upkeep, removing outdated
>> assets.
>>
>> Another improvement would be to store the downloaded asset's hash
>> and re-download & update the hash when the file was modified, even
>> when the user didn't provide a hash.
>>
>> + easy to browse manually
>> + should be simple to expand the features (upkeep, security, ...)
>> + should simplify locks, as we can atomically move the downloaded
>> file & update the db. Even crashes should lead to predictable
>> behavior
>> - slightly more complicated to develop
>> - "db" file would have to be protected
>>
>
> I'm a bit skeptical about the "improvements" here, not because they
> are not valid features, but because we don't need them at this point.
> I don't think future possibilities should shape too much of the
> immediate architecture decisions.
>

Well, if you remember the first implementation, I said using a DB would
be useful in the future. The future is here: we can either patch it and
somehow support a slightly insecure version, or finally do it properly.

>> Other solutions
>> ---------------
>>
>> There are many other solutions, like using `$basename-$url_hash` as
>> the name, or using `astring.string_to_safe_path` instead of url_hash,
>> and so on. We are open to suggestions.
>>
>
> I think the one big problem we have identified is the clash between
> "anonymous" and "specific" assets. My proposal to fix that would be
> to define the following structure:
>
> base_writable_cache_dir
> ├── by_name    <- "specific" verified files
> ├── by_origin  <- "anonymous" verified files
> └── download   <- temporary base dir for downloads
>

This was actually something I kept for future discussion as a possible
improvement (even with a DB). The second improvement was to use
`astring.string_to_safe_path` along with the hash, which might improve
the url reconstruction.

>> Questions
>> =========
>>
>> There are basically two questions:
>>
>> 1. Do we want to explicitly set the mode (anonymous/specific), in
>> which way, and what to call the modes?
>
> As an API designer, I don't think that difference is worth being
> apparent to users. As a user, I'd expect to either pass the name of
> the file, or one specific URL, with or without a verification hash.
>

Well, consider downloading `netperf-1.6.0.tar.bz2` from 2 different
locations. Having explicitly said that it's an anonymous download,
it'll happily use the other location, no matter whether the files
match. Actually this is an extra point for the DB-based solution,
because even specific downloads would be able to specify multiple
locations (using a list of urls instead of a single url).

>> 2. Which implementation do we want to use (are there existing
>> solutions we can simply use?)
>>
>
> I've done some research, and I didn't find ready-to-use code we could
> incorporate. At this point, I'd choose the simplest solution we can
> write that solves the issue at hand.
>

I'm fine with that, although I'd prefer at least creating the
`specific` cache: we don't necessarily have to split all 3 dirs;
simply creating a `specific_downloads` dir and putting all
$URL_HASH/$ASSET entries in there should do (given we don't want a
bullet-proof solution). One possible improvement might be to at least
store the actual url of the fetched file into
`$URL_HASH/.specific_download_url`, which could then be used by a
contrib script to reconstruct the original assets.

Anyway, all of these are just suggestions and I could live without
them. The point of this email was to find out which way we want to go
and whether there is anyone with good arguments for/against each
solution.
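
PS: to make the sqlite variant a bit more tangible, the whole mapping
could start as small as this (a sketch only, not a proposed final
schema):

    import sqlite3

    def open_cache_db(path):
        # Minimal url -> relative-path mapping; columns for expiration,
        # last access, asset hash etc. could be added later if we
        # decide we want the upkeep features.
        db = sqlite3.connect(path)
        db.execute("CREATE TABLE IF NOT EXISTS assets "
                   "(url TEXT PRIMARY KEY, path TEXT NOT NULL)")
        return db

    db = open_cache_db(".avocado_cache.sqlite")
    db.execute("INSERT OR REPLACE INTO assets VALUES (?, ?)",
               ("https://www.example.org/foo.zip", "foo-1X3s/foo.zip"))
    db.commit()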
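
PPS: and the minimal `specific_downloads` variant with the
`.specific_download_url` metadata file could look roughly like this
(the dir layout and the helper name are just my suggestions from
above, nothing settled):

    import hashlib
    import os

    def specific_download_path(cache_dir, url, asset_name):
        # Store the asset under specific_downloads/$URL_HASH/$ASSET and
        # keep the original url next to it, so a contrib script can
        # later reconstruct where each cached file came from.
        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
        asset_dir = os.path.join(cache_dir, "specific_downloads", url_hash)
        os.makedirs(asset_dir, exist_ok=True)
        url_record = os.path.join(asset_dir, ".specific_download_url")
        with open(url_record, "w") as record:
            record.write("%s %s\n" % (url, asset_name))
        return os.path.join(asset_dir, asset_name)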