Hi Lukáš,

On 06/18/2018 11:50 PM, Cleber Rosa wrote:
> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
[...]
>> Hashed url dir
>> --------------
>>
>> I can see multiple options. Cleber proposed in
>> https://github.com/avocado-framework/avocado/pull/2652 to create in such
>> a case a dir based on "hash(url)" and store all assets of the given url
>> there. It seems to be fairly simple to develop and maintain, but the cache
>> might become hard to upkeep and there is a non-zero (though nearly zero)
>> possibility of clashes.
>>
> 
> Definitely my goals were to keep it simple, and to fix the one issue at
> hand.  My motivations are based on the idea that Avocado should be able
> to help test writers with common tasks in tests, but it's hardly going
> to contain in itself full-blown and complete solutions to generic
> problems, downloading and caching files.
> 
> Now, it'd be nice to describe the possible clashes (besides, of course,
> possible hash collisions).  I can think of one:
> 
>  * Anonymous with url "http://foo/bar.zip"
>  * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
> 
> The clash exists because sha1("http://foo/bar.zip") is
> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
> 
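The hash(url) directory scheme above can be sketched in a few lines of
Python (a rough illustration, not the actual PR implementation; the
`cache_path` helper is hypothetical):

```python
import hashlib


def cache_path(url):
    # Hypothetical helper sketching the hash(url) directory scheme:
    # each URL gets a directory named after its SHA1 digest.
    url_hash = hashlib.sha1(url.encode()).hexdigest()
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return "%s/%s" % (url_hash, basename)


# Different schemes hash differently, so the same file fetched over
# http and over https lands in two separate directories:
print(cache_path("https://www.example.org/foo.zip"))
print(cache_path("http://www.example.org/foo.zip"))
```

Note how the clash scenario arises: nothing stops a "specific" asset's
basename from being the hex digest of some other asset's url.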
>> Another problem would be concurrent access, as we might start downloading a
>> file with the same name as the url dir, and all kinds of different clashes;
>> we'll only find out all the issues when people start extensively using this.
> 
> I'm not sure I follow.  Right now the *downloaded* files get temporary
> names.  Then a lock for the destination name is acquired, verification is
> done, and, if successful, the file is copied into the destination name.
> 
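The temporary-name/verify/copy flow Cleber describes could look roughly
like this (a sketch under my own naming; `fetch_to_cache` and its
arguments are hypothetical, and a real implementation would also hold
the lock on the destination name):

```python
import hashlib
import os
import tempfile


def fetch_to_cache(data, expected_sha1, destination):
    # Write to a temporary name first, verify, then move into place.
    dirname = os.path.dirname(destination) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as out:
        out.write(data)  # stands in for the actual download
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        os.unlink(tmp)
        raise ValueError("hash verification failed")
    # Atomic within one filesystem, so readers never see a partial file.
    os.rename(tmp, destination)
```

Creating the temporary file in the same directory as the destination is
what keeps the final rename atomic.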
>>
>> Result:
>>
>>     2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip    # Fetched from https
>>     2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>>     6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip    # Fetched from http
>>
>> + simple to develop
>> + simple to maintain
> 
> Agreed.
> 
>> - possible clashes
> 
> Agreed, ideally there should be none.
> 
>> - hard to browse manually
> 
> True.  As a note, I'd trade the ease to browse the cache for a simpler
> overall implementation, because the "browsing" could then be done with
> simple scripts reusing the same implementation (like a contrib script
> importing from "avocado.utils.asset").

As an end-user I agree with Cleber that there is no interest in browsing
the cache.

> 
>> - API changes might lead to unusable files (users would have to remove files 
>> manually)
> 
> This too, but I really think this is minor.  We may want to include cache
> stability along the same lines as the API stability (on LTS releases), but
> I don't think we've made promises about it yet.
> 
>>
>>
>> sqlite
>> ------
>>
>> Another approach would be to create an sqlite database in every cache-dir.
>> For anonymous assets nothing would change, but for specific assets we'd
>> create a new tmpdir per given asset and store the mapping in the database.
>>
>> Result:
>>
>>     .avocado_cache.sqlite
>>     foo-1X3s/foo.zip
>>     bar-3s2a/bar.zip
>>     foo-t3d2/foo.zip
>>
>> where ".avocado_cache.sqlite" would contain:
>>
>>     https://www.example.org/foo.zip  foo-1X3s/foo.zip
>>     https://www.example.org/bar.zip  bar-3s2a/bar.zip
>>     http://www.example.org/foo.zip   foo-t3d2/foo.zip
>>
>> Obviously, by creating a db we could improve many things. A first example
>> would be to store an expiration date; based on the last access recorded in
>> the db we could run cache-dir upkeep, removing outdated assets.
>>
>> Another improvement would be to store the downloaded asset's hash and
>> re-download it & update the hash when the file was modified, even when the
>> user didn't provide a hash.
>>
>> + easy to browse manually
>> + should be simple to expand the features (upkeep, security, ...)
>> + should simplify locks as we can atomically move the downloaded file &
>> update the db. Even crashes should lead to predictable behavior
>> - slightly more complicated to develop

This looks a bit like overkill to me.

>> - "db" file would have to be protected
>>
>>
> 
> I'm a bit skeptical about the "improvements" here, not because they are
> not valid features, but because we don't need them at this point.  I
> don't think future possibilities should shape too much of the immediate
> architecture decisions.
> 
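For what it's worth, the ".avocado_cache.sqlite" mapping with the
expiration-based upkeep would only take a few lines with the stdlib
sqlite3 module (the schema and column names here are my own assumption,
just to make the idea concrete):

```python
import sqlite3
import time

# Hypothetical schema: map each URL to its on-disk path and record
# when the entry expires.
conn = sqlite3.connect(":memory:")  # a real cache would live in cache-dir
conn.execute("CREATE TABLE assets (url TEXT PRIMARY KEY,"
             " path TEXT, expires REAL)")
now = time.time()
conn.execute("INSERT INTO assets VALUES (?, ?, ?)",
             ("https://www.example.org/foo.zip", "foo-1X3s/foo.zip",
              now + 86400))  # valid for one more day
conn.execute("INSERT INTO assets VALUES (?, ?, ?)",
             ("https://www.example.org/bar.zip", "bar-3s2a/bar.zip",
              now - 1))      # already expired
# Upkeep pass: list and then drop the outdated assets.
expired = [path for (path,) in
           conn.execute("SELECT path FROM assets WHERE expires < ?", (now,))]
conn.execute("DELETE FROM assets WHERE expires < ?", (now,))
conn.commit()
```

The `expired` list is what an upkeep job would then unlink from disk.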
>> Other solutions
>> ---------------
>>
>> There are many other solutions, like using `$basename-$url_hash` as the
>> name, or using `astring.string_to_safe_path` instead of url_hash, and so
>> on. We are open to suggestions.
[...]
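To make those two alternatives concrete, here is a quick sketch of both
naming schemes (the regexp in `safe_path_name` is my stand-in, not the
actual `avocado.utils.astring.string_to_safe_path` implementation):

```python
import hashlib
import re


def basename_hash_name(url):
    # Sketch of the `$basename-$url_hash` naming alternative.
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return "%s-%s" % (basename, hashlib.sha1(url.encode()).hexdigest()[:7])


def safe_path_name(url):
    # Rough stand-in for string_to_safe_path-style naming: replace
    # filesystem-unfriendly characters with underscores.
    return re.sub(r"[^A-Za-z0-9._-]", "_", url)


print(basename_hash_name("https://www.example.org/foo.zip"))
print(safe_path_name("https://www.example.org/foo.zip"))
```

The first keeps names short and browsable at the cost of a truncated
hash; the second keeps the full url visible but produces long names.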

suggestion:

unified_url = unify_scheme(url)    # e.g. treat http and https the same
url_hash = SHA1(unified_url)
content = fetch(url)
if success:
    os.chdir(cache_dir)
    content_hash = SHA1(content)
    if not os.path.exists(content_hash):
        open(content_hash, "wb").write(content)   # store by content hash
    if not os.path.exists(url_hash):
        os.symlink(content_hash, url_hash)        # map url to content

So various urls can point to the same cached file.
