Hello guys,

As Cleber pointed out at the release meeting, we are struggling with our asset 
fetcher. It was designed with one goal, to cache arbitrary files by name, and it 
fails when different assets use the same name because it simply assumes they are 
the same thing.

Let's have a look at this example:

    fetch("https://www.example.org/foo.zip")
    fetch("https://www.example.org/bar.zip")
    fetch("http://www.example.org/foo.zip")

Currently the third fetch simply uses the "foo.zip" downloaded from "https" 
even though it could be a completely different file (or downloaded from a 
completely different url). This is both good and bad. It's good when you're 
downloading **any** "ltp.tar.bz2", or **any** "netperf.zip", but if you are 
downloading "vmlinuz", which is always called "vmlinuz" but comes from a 
different subdirectory, it might lead to big problems.
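
To make the clash concrete, here is a minimal sketch of a cache keyed only by file name. It is not the actual fetcher code; `cache_key_by_name` and the `cache` dict are hypothetical names used only for illustration.

```python
import posixpath
from urllib.parse import urlparse


def cache_key_by_name(url):
    """Hypothetical helper: a name-only cache keys on the URL's basename."""
    return posixpath.basename(urlparse(url).path)


urls = [
    "https://www.example.org/foo.zip",
    "https://www.example.org/bar.zip",
    "http://www.example.org/foo.zip",
]

# Maps cache key -> URL that actually populated the cache entry.
cache = {}
for url in urls:
    # setdefault keeps the first URL seen for a name, mimicking a cache hit
    # on every later fetch that happens to share that name.
    cache.setdefault(cache_key_by_name(url), url)
```

After the loop the cache holds only two entries, and the "foo.zip" entry still points at the "https" url even though the last fetch asked for "http".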

From this I can see two modes of assets, anonymous and specific. Instead of 
trying to detect the mode based on combinations of hashes and methods, I'd 
suggest being explicit and either adding it as an extra argument, or even 
creating new classes `AnonymousAsset` and `SpecificAsset`, where 
`AnonymousAsset` would be the current implementation and we still need to 
decide on the `SpecificAsset` implementation. Let's discuss some approaches, 
using the three fetches above as the example assets.


Current implementation
----------------------

The current implementation is anonymous: the last fetch simply returns the 
"foo.zip" fetched by the first fetch.

Result:

    foo.zip    # Fetched from "https", reused for the "http" fetch
    bar.zip


+ simplest
- leads to clashes


Hashed url dir
--------------

I can see multiple options. Cleber proposed in 
https://github.com/avocado-framework/avocado/pull/2652 to create, in such a 
case, a dir based on "hash(url)" and store all assets of the given url there. 
It seems fairly simple to develop and maintain, but the cache might become 
hard to upkeep, and there is a non-zero (though close to zero) possibility of 
clashes.

Another problem would be concurrent access: we might start downloading a file 
with the same name as a url dir, run into all kinds of other clashes, and only 
find out about all the issues when people start using this extensively.

Result:

    2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip    # Fetched from https
    2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
    6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip    # Fetched from http

+ simple to develop
+ simple to maintain
- possible clashes
- hard to browse manually
- API changes might lead to unusable files (users would have to remove files 
manually)
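
A minimal sketch of how such a path could be derived, not the code from the pull request. Two things are assumptions: SHA-1 (the 40 hex characters in the example match its digest length), and hashing the url without the file name, so that "foo.zip" and "bar.zip" from the same location share a dir. `hashed_cache_path` is a hypothetical name.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def hashed_cache_path(url):
    """Return '<sha1 of the url without its basename>/<basename>'.

    Assumption: SHA-1 over scheme, host and directory part of the url.
    """
    parsed = urlparse(url)
    basename = posixpath.basename(parsed.path)
    # Everything except the last path component identifies the "url dir".
    url_dir = "%s://%s%s" % (parsed.scheme, parsed.netloc,
                             posixpath.dirname(parsed.path))
    digest = hashlib.sha1(url_dir.encode("utf-8")).hexdigest()
    return posixpath.join(digest, basename)


https_foo = hashed_cache_path("https://www.example.org/foo.zip")
https_bar = hashed_cache_path("https://www.example.org/bar.zip")
http_foo = hashed_cache_path("http://www.example.org/foo.zip")
```

With this scheme the two "https" assets land in the same directory, while the "http" one gets a directory of its own.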


sqlite
------

Another approach would be to create an sqlite database in every cache-dir. For 
anonymous assets nothing would change, but for specific assets we'd create a 
new tmpdir per given asset and store the mapping in the database.

Result:

    .avocado_cache.sqlite
    foo-1X3s/foo.zip
    bar-3s2a/bar.zip
    foo-t3d2/foo.zip

where ".avocado_cache.sqlite" would contain:

    https://www.example.org/foo.zip  foo-1X3s/foo.zip
    https://www.example.org/bar.zip  bar-3s2a/bar.zip
    http://www.example.org/foo.zip   foo-t3d2/foo.zip

Obviously, creating a db would let us improve many things. A first example 
would be to store an expiration date and, based on the last access to the db, 
run cache-dir upkeep that removes outdated assets.

Another improvement would be to store the hash of the downloaded asset, and to 
re-download the file and update the hash when the file was modified, even when 
the user didn't provide a hash.

+ easy to browse manually
+ should be simple to expand the features (upkeep, security, ...)
+ should simplify locks as we can atomically move the downloaded file and 
update the db. Even crashes should lead to predictable behavior
- slightly more complicated to develop
- "db" file would have to be protected


Other solutions
---------------

There are many other solutions like using `$basename-$url_hash` as the name or 
using `astring.string_to_safe_path` instead of url_hash and so on. We are open 
to suggestions.
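
For comparison, a sketch of the flat `$basename-$url_hash` variant. The hash function (SHA-1 of the whole url) and the truncation to 8 characters are assumptions, and `flat_cache_name` is a hypothetical name; a shorter digest is easier to browse but raises the chance of clashes.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def flat_cache_name(url, digest_chars=8):
    """Return '<basename>-<truncated sha1 of the full url>'."""
    basename = posixpath.basename(urlparse(url).path)
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:digest_chars]
    return "%s-%s" % (basename, url_hash)


https_name = flat_cache_name("https://www.example.org/foo.zip")
http_name = flat_cache_name("http://www.example.org/foo.zip")
```

Both names start with "foo.zip-", so the cache-dir stays browsable, but they differ in the hash suffix.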


Questions
=========

There are basically two questions:

1. Do we want to set the mode (anonymous/specific) explicitly and, if so, in 
which way, and what should we call the modes?
2. Which implementation do we want to use (are there existing solutions we can 
simply use)?
