On 26/04/2023 21:33, Albretch Mueller wrote:
  a) the crazy long name
  b) its base64 representation
  c) §b's sha256sum representation which is the one used for the file
name and the log of the download.

I see no point in the base64 step, since the SHA-256 hash may be calculated for the original URI directly. However, an important step, URI normalization, is missing:
- often http: and https: are alternatives
- the domain name may contain Unicode characters or be represented as pure ASCII punycode
- an #anchor (sometimes an empty #) at the end of a URI usually does not change the served content. However, some web applications abuse it to provide content that depends on the anchor, or a web page may hide parts of its content using CSS depending on the anchor. So stripping it may cause trouble.
- session or user activity tracking query ("search") parameters must be stripped for archival purposes
- some parts of a URI may be percent-encoded while remaining equivalent to the "canonical" URI
- a web page may suggest a "canonical" URL, but sometimes it is a misleading hint.

So URI comparison is not a trivial task.
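
To make the points above concrete, here is a rough Python sketch of such a normalization pass. It is only an illustration under stated assumptions: the function name normalize_uri, the TRACKING_PARAMS list, and the choice to treat http: and https: as the same resource are all mine, not something from the original script. It ignores userinfo and page-suggested "canonical" URLs, and fragment stripping is optional because, as noted, it is not always safe.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical, deliberately short list; a real one would be longer.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "sessionid"}

_UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz0123456789-._~")
_DEFAULT_PORTS = {"http": 80, "https": 443}


def _fix_escape(match):
    # Decode escapes of unreserved characters (%7E -> ~); upper-case the
    # hex digits of every other escape (%2f -> %2F) without decoding it.
    ch = chr(int(match.group(1), 16))
    return ch if ch in _UNRESERVED else "%" + match.group(1).upper()


def _normalize_escapes(text):
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, text)


def normalize_uri(uri, strip_fragment=True, unify_scheme=True):
    parts = urlsplit(uri)
    scheme = parts.scheme          # already lower-cased by urlsplit
    host = parts.hostname or ""    # already lower-cased; userinfo is dropped
    try:
        # Unicode host names become ASCII punycode labels.
        host = host.encode("idna").decode("ascii")
    except UnicodeError:
        pass                       # leave hosts the codec rejects untouched
    port = parts.port
    if port is not None and port != _DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    if unify_scheme and scheme == "http":
        scheme = "https"           # assumption: both serve the same content
    path = _normalize_escapes(parts.path) or "/"
    # Drop tracking parameters but keep the remaining ones in order.
    kept = [p for p in parts.query.split("&")
            if p and p.split("=", 1)[0].lower() not in TRACKING_PARAMS]
    query = _normalize_escapes("&".join(kept))
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((scheme, host, path, query, fragment))


print(normalize_uri("HTTP://Example.COM:80/a%7Eb/%2f?utm_source=x&id=1#"))
```

Even with this, two URIs that normalize differently may still serve the same page, so the result is better treated as a best-effort key than as proof of identity.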

Another point is that the same page may be saved multiple times, so a URI hash alone is not enough for a unique key.
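
One possible way to get a unique key is to combine the hash with the time of the download. This is just a sketch: the function name snapshot_key and the key layout (digest, dash, UTC timestamp) are assumptions for illustration. It also shows that the hash can be taken from the UTF-8 bytes of the URI directly, with no base64 step in between.

```python
import hashlib
from datetime import datetime, timezone


def snapshot_key(uri, fetched_at=None):
    """Return '<sha256 of URI>-<UTC timestamp>' for one saved copy."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    # Hash the URI bytes directly; no intermediate encoding is needed.
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{digest}-{stamp}"


first = snapshot_key("https://example.com/page",
                     datetime(2023, 4, 26, 21, 33, tzinfo=timezone.utc))
second = snapshot_key("https://example.com/page",
                      datetime(2023, 4, 27, 8, 0, tzinfo=timezone.utc))
# Same URI, different download times: same prefix, different keys.
print(first[:64] == second[:64], first != second)
```

Second-level resolution is an arbitrary choice here; two downloads of the same URI within one second would still collide.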

On 26/04/2023 21:48, Nicolas George wrote:
OTOH, HTTP does have a place to state the type of the file, and the
extension in URLs is not reliable: if you want to do it properly, you
must set your local file extension based on the Content-Type response
header.

And you will quickly face servers that send an incorrect Content-Type, or that intentionally send application/octet-stream with the nosniff header (X-Content-Type-Options: nosniff) to force the browser to save the file instead of opening it, e.g. in the built-in PDF reader.
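
A defensive approach might look like the Python sketch below: check the first bytes of the body against a few well-known signatures, and fall back to the Content-Type header only when nothing matches and the header is not the generic application/octet-stream. The name pick_extension, the short MAGIC table and the ".bin" default are assumptions for illustration; a real tool would need a much larger signature list (or something like file(1)).

```python
import mimetypes

# A few well-known file signatures; far from complete.
MAGIC = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF8", ".gif"),
    (b"PK\x03\x04", ".zip"),
]


def pick_extension(content_type, body_start, default=".bin"):
    """Guess a local file extension from the first bytes and the header."""
    # 1. The content itself is harder to get wrong than the header.
    for signature, ext in MAGIC:
        if body_start.startswith(signature):
            return ext
    # 2. Otherwise use the header, ignoring parameters such as charset.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime and mime != "application/octet-stream":
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext
    # 3. Nothing usable: fall back to a neutral extension.
    return default


print(pick_extension("application/octet-stream", b"%PDF-1.7\n"))
print(pick_extension("text/html; charset=utf-8", b"%PDF-1.4\n"))
```

Both calls should yield ".pdf": the first despite the generic type, the second despite a wrong header.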
