On 06/19/2018 04:50 AM, Cleber Rosa wrote:
>
>
> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
>> Hello guys,
>>
>> As Cleber pointed out at the release meeting, we are struggling with our
>> asset fetcher. It was designed with one goal, to cache arbitrary files by
>> name, and it fails when different assets use the same name, as it simply
>> assumes they are the same file.
>>
>> Let's have a look at this example:
>>
>> fetch("https://www.example.org/foo.zip")
>> fetch("https://www.example.org/bar.zip")
>> fetch("http://www.example.org/foo.zip")
>>
>> Currently the third fetch simply reuses the "foo.zip" downloaded from
>> "https", even though it could be a completely different file (or one
>> downloaded from a completely different url). This is both good and bad.
>> It's good when you're downloading **any** "ltp.tar.bz2", or **any**
>> "netperf.zip", but if you are downloading "vmlinuz", which is always
>> called "vmlinuz" but comes from a different subdirectory, it might lead
>> to big problems.
>>
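
For reference, the name-based lookup essentially boils down to this (a
minimal sketch of the behavior, not the actual `avocado.utils.asset`
code):

    import os
    from urllib.parse import urlparse

    def cache_key(url):
        # The cache is keyed on the basename alone, so scheme, host
        # and directory are all ignored when looking up a cached file.
        return os.path.basename(urlparse(url).path)

    cache_key("https://www.example.org/foo.zip")  # -> "foo.zip"
    cache_key("http://www.example.org/foo.zip")   # -> "foo.zip", clash
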
>
> Right, all your observations so far are correct.
>
>> From this I can see two modes of assets, anonymous and specific. Instead
>> of trying to detect this based on combinations of hashes and methods, I'd
>> suggest being explicit and either adding it as an extra argument, or even
>> creating new classes `AnonymousAsset` and `SpecificAsset`, where
>> `AnonymousAsset` would be the current implementation and we'd still need
>> to decide on the `SpecificAsset` implementation. Let's discuss some
>> approaches and use the following assets in the examples:
>
> Given that `fetch_asset()` is the one interface test writers are
> encouraged to use, having different classes is really an implementation
> detail at this point. For this discussion, I do see value in naming the
> different modes, though.
>
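
To have something concrete to point at, one hypothetical way to make
the mode explicit at the `fetch_asset()` level could be the following;
the argument name, values and helpers are illustrative only, nothing
agreed on:

    def fetch_asset(name_or_url, asset_hash=None, mode="anonymous"):
        # Hypothetical sketch: "anonymous" keeps today's name-based
        # caching, "specific" ties the cached file to the exact url
        # it was fetched from.
        if mode == "anonymous":
            return fetch_by_name(name_or_url, asset_hash)    # hypothetical helper
        if mode == "specific":
            return fetch_by_origin(name_or_url, asset_hash)  # hypothetical helper
        raise ValueError("unknown mode: %s" % mode)
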
>>
>>
>> Current implementation
>> ----------------------
>>
>> The current implementation is anonymous-only, so the third fetch simply
>> returns the "foo.zip" fetched by the first one.
>>
>> Result:
>>
>> foo.zip
>> bar.zip # This one is fetched from "https"
>>
>>
>> + simplest
>> - leads to clashes
>>
>>
>
> Needless to say, this is what we're moving away from.
>
>> Hashed url dir
>> --------------
>>
>> I can see multiple options. Cleber proposed in
>> https://github.com/avocado-framework/avocado/pull/2652 to create, in
>> such a case, a dir based on "hash(url)" and store all assets of a given
>> url there. It seems fairly simple to develop and maintain, but the cache
>> might become hard to upkeep, and there is a non-zero (though nearly
>> zero) possibility of clashes.
>>
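
If it helps the discussion, the layout above essentially amounts to
something like this; I'm assuming the hash covers the url without the
trailing file name, which is the reading that matches the "Result"
listing below (the actual PR may differ):

    import hashlib
    import os

    def hashed_cache_path(cache_dir, url):
        # One sub-directory per download location: hash the url minus
        # the file name, so assets fetched from the same place share a
        # directory, while each asset keeps its original basename.
        base, name = url.rsplit("/", 1)
        url_hash = hashlib.sha1(base.encode("utf-8")).hexdigest()
        return os.path.join(cache_dir, url_hash, name)
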
>
> Definitely my goals were to keep it simple, and to fix the one issue at
> hand. My motivation is based on the idea that Avocado should be able
> to help test writers with common tasks in tests, but it's hardly going
> to contain in itself full-blown and complete solutions to generic
> problems such as downloading and caching files.
>
> Now, it'd be nice to describe the possible clashes (besides, of course,
> possible hash collisions). I can think of one:
>
> * Anonymous with url "http://foo/bar.zip"
> * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
>
> The clash exists because sha1("http://foo/bar.zip") is
> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
>

Yep, this is one possible clash. Another is for two different urls to
produce the same hash. As I mentioned, it's not frequent, but it can
happen. Note we can decrease the probability by prefixing something,
e.g. "url_$sha".

>> Another problem would be concurrent access, as we might start
>> downloading a file with the same name as a url dir, and all kinds of
>> different clashes, and we'll only find out all the issues when people
>> start extensively using this.
>
> I'm not sure I follow. Right now the names of the *downloaded* files
> are temporary names. Then, a lock for the destination name is
> acquired, verification is done, and if successful, the file is copied
> into the destination name.
>
>>
>> Result:
>>
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip # Fetched from https
>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>> 6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip # Fetched from http
>>
>> + simple to develop
>> + simple to maintain
>
> Agreed.
>
>> - possible clashes
>
> Agreed, ideally there should be none.
>
>> - hard to browse manually
>
> True. As a note, I'd trade the ease of browsing the cache for a
> simpler overall implementation, because the "browsing" could then be
> done with simple scripts reusing the same implementation (like a
> contrib script importing from "avocado.utils.asset").
>

Not really: you can't reconstruct the url from its hash, so all you'd
know is that you have these specifically-downloaded files from
somewhere, how old they are, and when they were last used. Better than
nothing, but far from perfect.

>> - API changes might lead to unusable files (users would have to
>> remove files manually)
>
> This too, but I really think this is minor. We may want to include
> cache stability along the same lines as the API stability (on LTS
> releases), but I don't think we've made promises about it yet.
>

Sure, I know users can simply remove everything once in a while...

>>
>>
>> sqlite
>> ------
>>
>> Another approach would be to create an sqlite database in every
>> cache-dir. For anonymous assets nothing would change, but for
>> specific assets we'd create a new tmpdir per given asset and store
>> the mapping in the database.
>>
>> Result:
>>
>> .avocado_cache.sqlite
>> foo-1X3s/foo.zip
>> bar-3s2a/bar.zip
>> foo-t3d2/foo.zip
>>
>> where ".avocado_cache.sqlite" would contain:
>>
>> https://www.example.org/foo.zip foo-1X3s/foo.zip
>> https://www.example.org/bar.zip bar-3s2a/bar.zip
>> http://www.example.org/foo.zip foo-t3d2/foo.zip
>>
>> Obviously, by creating a db we could improve many things. A first
>> example would be to store an expiration date and, based on the last
>> access recorded in the db, run cache-dir upkeep, removing outdated
>> assets.
>>
>> Another improvement would be to store the downloaded asset's hash
>> and re-download & update the hash when the file was modified, even
>> when the user didn't provide a hash.
>>
>> + easy to browse manually
>> + should be simple to expand the features (upkeep, security, ...)
>> + should simplify locks, as we can atomically move the downloaded
>> file & update the db. Even crashes should lead to predictable
>> behavior
>> - slightly more complicated to develop
>> - "db" file would have to be protected
>>
>
> I'm a bit skeptical about the "improvements" here, not because they
> are not valid features, but because we don't need them at this point.
> I don't think future possibilities should shape too much of the
> immediate architecture decisions.
>

Well, if you remember the first implementation, I said using a DB would
be useful in the future. The future is here: we can either patch it and
somehow support a slightly insecure version, or finally do it properly.

>> Other solutions
>> ---------------
>>
>> There are many other solutions, like using `$basename-$url_hash` as
>> the name, or using `astring.string_to_safe_path` instead of url_hash,
>> and so on. We are open to suggestions.
>>
>
> I think the one big problem we have identified is the clash between
> "anonymous" and "specific" assets. My proposal to fix that would be
> to define the following structure:
>
> base_writable_cache_dir
> ├── by_name    <- "specific" verified files
> ├── by_origin  <- "anonymous" verified files
> └── download   <- temporary base dir for downloads
>

This was actually something I kept for future discussion as a possible
improvement (even with a DB). The second improvement was to use
`astring.string_to_safe_path` along with the hash, which might improve
the url reconstruction.

>> Questions
>> =========
>>
>> There are basically two questions:
>>
>> 1. Do we want to explicitly set the mode (anonymous/specific), in
>> which way, and what to call the modes?
>
> As an API designer, I don't think that difference is worth being
> apparent to users. As a user, I'd expect to either pass the name of
> the file, or one specific URL, with or without a verification hash.
>

Well, consider downloading `netperf-1.6.0.tar.bz2` from 2 different
locations. Having explicitly said that it's an anonymous download,
it'll happily use the other location, no matter whether the files
match. Actually this is an extra point for the DB-based solution,
because even specific downloads would be able to specify multiple
locations (using a list of urls instead of a single url).

>> 2. Which implementation do we want to use (are there existing
>> solutions we can simply use?)
>>
>
> I've done some research, and I didn't find ready-to-use code we could
> incorporate. At this point, I'd choose the simplest solution we can
> write that solves the issue at hand.
>

I'm fine with that, although I'd prefer at least creating the
`specific` cache: we don't necessarily have to split all 3 dirs;
simply creating a `specific_downloads` dir and putting all
$URL_HASH/$ASSET entries in there should do (given we don't want a
bullet-proof solution). One possible improvement might be to at least
store the actual url of the fetched file into
`$URL_HASH/.specific_download_url`, which could then be used by a
contrib script to reconstruct the original assets.

Anyway, all of these are just suggestions and I could live without
them. The point of this email was to find out which way we want to go
and whether there is anyone with good arguments for/against each
solution.
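
PS: to make the sqlite variant a bit more tangible, the whole mapping
could start as small as this (a sketch only, not a proposed final
schema):

    import sqlite3

    def open_cache_db(path):
        # Minimal url -> relative-path mapping; columns for expiration,
        # last access, asset hash etc. could be added later if we
        # decide we want the upkeep features.
        db = sqlite3.connect(path)
        db.execute("CREATE TABLE IF NOT EXISTS assets "
                   "(url TEXT PRIMARY KEY, path TEXT NOT NULL)")
        return db

    db = open_cache_db(".avocado_cache.sqlite")
    db.execute("INSERT OR REPLACE INTO assets VALUES (?, ?)",
               ("https://www.example.org/foo.zip", "foo-1X3s/foo.zip"))
    db.commit()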
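
PPS: and the minimal `specific_downloads` variant with the
`.specific_download_url` metadata file could look roughly like this
(the dir layout and the helper name are just my suggestions from
above, nothing settled):

    import hashlib
    import os

    def specific_download_path(cache_dir, url, asset_name):
        # Store the asset under specific_downloads/$URL_HASH/$ASSET and
        # keep the original url next to it, so a contrib script can
        # later reconstruct where each cached file came from.
        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
        asset_dir = os.path.join(cache_dir, "specific_downloads", url_hash)
        os.makedirs(asset_dir, exist_ok=True)
        url_record = os.path.join(asset_dir, ".specific_download_url")
        with open(url_record, "w") as record:
            record.write("%s %s\n" % (url, asset_name))
        return os.path.join(asset_dir, asset_name)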